Title: | Robust Graph-Based Two-Sample Test |
---|---|
Description: | Useful tools for determining whether two samples are from the same distribution. Utilizes a robust method to address the problematic structure of the similarity graph constructed from high-dimensional data. The method is provided in Yichuan Bai and Lynna Chu (2023) <arXiv:2307.12325>. |
Authors: | Yichuan Bai [aut, cre], Lynna Chu [aut] |
Maintainer: | Yichuan Bai <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1 |
Built: | 2025-02-14 05:12:11 UTC |
Source: | https://github.com/cran/rgTest |
This example contains a dataset, the labels of the observations in the dataset, the distance matrix of the dataset using L2 distance, and the edge matrix generated by 5-MST.
example0
An object of class list of length 4.
data
pooled dataset of two samples drawn from two different t-distributions.
label
label of the observations. 'sample 1' denotes the observations in sample 1. 'sample 2' denotes the observations in sample 2.
distance
the distance matrix of the pooled dataset using L2 distance.
edge
edge matrix generated by 5-MST.
This function returns the distance matrix using L2 distance.
getdis(y)
y |
dataset of the pooled data |
A distance matrix based on the L2 distance.
data(example0)
data = as.matrix(example0$data) # pooled dataset
getdis(data)
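For readers who want to see what an L2 distance matrix looks like without loading the package, the following is a minimal base-R sketch. It uses `stats::dist()` on a small made-up matrix `y` (hypothetical data, not `example0`) and cross-checks the result against the definition. It is not the implementation of `getdis()`, whose internals may differ.

```r
# Hypothetical stand-in for a pooled dataset: 4 observations in 2 dimensions.
y <- rbind(c(0, 0), c(3, 4), c(6, 8), c(0, 1))

# Pairwise Euclidean (L2) distances using base R's stats::dist().
l2_dist <- as.matrix(dist(y, method = "euclidean"))

# The same quantity computed directly from the definition, as a cross-check.
n <- nrow(y)
manual <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    manual[i, j] <- sqrt(sum((y[i, ] - y[j, ])^2))
  }
}

# Entry [i, j] is the distance between observation i and observation j.
l2_dist[1, 2]  # distance between (0, 0) and (3, 4)
```

The result is a symmetric n-by-n matrix with zeros on the diagonal, which is the shape the `dis` argument of `rg.test` is described as expecting.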
Performs the robust graph-based two-sample test.
rg.test(data.X, data.Y, dis = NULL, E = NULL, n1, n2, k = 5, weigh.fun, perm.num = 0, test.type = list("ori", "gen", "wei", "max"), progress_bar = FALSE)
data.X |
a numeric matrix for observations in sample 1. |
data.Y |
a numeric matrix for observations in sample 2. |
dis |
a distance matrix of the pooled dataset of sample 1 and sample 2. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2 in the pooled dataset. |
E |
an edge matrix representing a similarity graph. Each row represents an edge and records the indices of two ends of an edge in two columns. The indices of observations in sample 1 are from 1 to n1 and indices of observations in sample 2 are from 1+n1 to n1+n2. |
n1 |
number of observations in sample 1. |
n2 |
number of observations in sample 2. |
k |
parameter in K-MST, with default 5. |
weigh.fun |
weight function that returns the weight of each edge as a function of the node degrees of its two ends. |
perm.num |
number of permutations used to calculate the p-value (default = 0, as in the usage above). Use 0 to obtain only the approximate p-value based on asymptotic theory. |
test.type |
type of graph-based test. This must be a list containing elements chosen from "ori", "gen", "wei", and "max", with default 'list("ori", "gen", "wei", "max")'. "ori" refers to the robust original edge-count test, "gen" refers to the robust generalized edge-count test, "wei" refers to the robust weighted edge-count test and "max" refers to the robust max-type edge-count test. |
progress_bar |
a logical evaluating to TRUE or FALSE indicating whether a progress bar of the permutation should be printed. |
The input should be one of the following:
datasets of the two samples;
the distance matrix of the pooled dataset;
the edge matrix generated from a similarity graph.
Typical usages are:
rg.test(data.X, data.Y, n1, n2, weigh.fun, ...)
rg.test(dis, n1, n2, weigh.fun, ...)
rg.test(E, n1, n2, weigh.fun, ...)
If the data matrices or the distance matrix are used, the similarity graph is generated using K-MST.
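The edge-matrix format described for `E` (one row per edge, two columns of observation indices, with sample 1 occupying indices 1 to n1) can be explored with a small base-R sketch. The toy graph, the helper name `inv_max_degree`, and the way weights are summed by edge type are all assumptions made for illustration; they are not taken from the package source and do not reproduce its test statistics.

```r
# Hypothetical toy graph on n1 + n2 = 5 observations:
# indices 1..3 belong to sample 1, indices 4..5 to sample 2.
n1 <- 3
n2 <- 2
E <- rbind(c(1, 2), c(2, 3), c(2, 4), c(4, 5), c(3, 5))

# Node degree = number of edges touching each observation.
deg <- tabulate(c(E[, 1], E[, 2]), nbins = n1 + n2)

# Illustrative weight as a function of the two end-point degrees
# (inverse of the larger degree; an assumption, not the package's code).
inv_max_degree <- function(a, b) 1 / pmax(a, b)
w <- inv_max_degree(deg[E[, 1]], deg[E[, 2]])

# Classify each edge by the samples of its two ends.
in_s1 <- E <= n1                      # TRUE where an end point is in sample 1
within1 <- in_s1[, 1] & in_s1[, 2]    # both ends in sample 1
within2 <- !in_s1[, 1] & !in_s1[, 2]  # both ends in sample 2
between <- !(within1 | within2)       # one end in each sample

# Unweighted and weighted edge counts by type.
counts <- c(within1 = sum(within1), within2 = sum(within2),
            between = sum(between))
weighted <- c(within1 = sum(w[within1]), within2 = sum(w[within2]),
              between = sum(w[between]))
```

In this toy graph observation 2 has the highest degree (3), so every edge touching it receives the smallest weight (1/3), which illustrates how a degree-based weight damps the influence of hub-like nodes.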
A list containing the following components:
asy.ori.statistic |
the asymptotic test statistic using robust original graph-based test. |
asy.ori.pval |
the asymptotic p-value using robust original graph-based test. |
asy.gen.statistic |
the asymptotic test statistic using robust generalized graph-based test. |
asy.gen.pval |
the asymptotic p-value using robust generalized graph-based test. |
asy.wei.statistic |
the asymptotic test statistic using robust weighted graph-based test. |
asy.wei.pval |
the asymptotic p-value using robust weighted graph-based test. |
asy.max.statistic |
the asymptotic test statistic using robust max-type graph-based test. |
asy.max.pval |
the asymptotic p-value using robust max-type graph-based test. |
perm.ori.pval |
the p-value based on permutation using robust original graph-based test. |
perm.gen.pval |
the p-value based on permutation using robust generalized graph-based test. |
perm.wei.pval |
the p-value based on permutation using robust weighted graph-based test. |
perm.max.pval |
the p-value based on permutation using robust max-type graph-based test. |
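To illustrate the general idea behind a permutation p-value, as opposed to an asymptotic one, here is a minimal base-R sketch. It uses the classical unweighted between-sample edge count as the statistic and treats small values as evidence against the null; the toy graph, the helper `between_count`, and the variable names are hypothetical. The robust statistics computed by `rg.test` are different and are not reproduced here.

```r
set.seed(1)  # make the permutations reproducible

# Hypothetical toy graph: observations 1..3 in sample 1, 4..5 in sample 2.
n1 <- 3
n2 <- 2
E <- rbind(c(1, 2), c(2, 3), c(2, 4), c(4, 5), c(3, 5))
labels <- rep(c(1, 2), times = c(n1, n2))

# Statistic: number of edges whose two ends carry different sample labels.
between_count <- function(labels, E) {
  sum(labels[E[, 1]] != labels[E[, 2]])
}

observed <- between_count(labels, E)

# Shuffle the sample labels, keep the graph fixed, and recompute the statistic.
perm_num <- 200
perm_stats <- replicate(perm_num, between_count(sample(labels), E))

# Fraction of permutations at least as extreme (as few between-sample edges)
# as the observed labelling.
perm_pval <- mean(perm_stats <= observed)
```

The graph is held fixed and only the labels are shuffled, so the cost grows linearly with the number of permutations; this is why an asymptotic p-value is a convenient alternative when `perm.num = 0`.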
## Simulated from Student's t-distribution.
## Observations for the two samples are from different distributions.
data(example0)
data = as.matrix(example0$data)  # pooled dataset
label = example0$label           # label of observations
s1 = data[label == 'sample 1', ] # sample 1
s2 = data[label == 'sample 2', ] # sample 2
num1 = nrow(s1)                  # number of observations in sample 1
num2 = nrow(s2)                  # number of observations in sample 2

## Graph-based two sample test using data as input
rg.test(data.X = s1, data.Y = s2, n1 = num1, n2 = num2, k = 5, weigh.fun = weiMax, perm.num = 0)

## Graph-based two sample test using distance matrix as input
dist = example0$distance
rg.test(dis = dist, n1 = num1, n2 = num2, k = 5, weigh.fun = weiMax, perm.num = 0)

## Graph-based two sample test using edge matrix of the similarity graph as input
E = example0$edge
rg.test(E = E, n1 = num1, n2 = num2, weigh.fun = weiMax, perm.num = 0)
This weight function returns the inverse of the arithmetic average of the node degrees of an edge.
weiArith(a, b)
a |
node degree of one end of an edge |
b |
node degree of the other end of the edge |
The edge weight, computed as the inverse of the arithmetic average of the node degrees of the edge's two ends.
# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiArith(6, 5)
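As a worked check of the description above, the inverse of the arithmetic average can be written in one line of base R. The function name `inv_arith_mean` is hypothetical and is based only on the textual description; it is not the package's source for `weiArith`.

```r
# Inverse of the arithmetic average of two node degrees
# (a sketch of the described formula, not the package's code).
inv_arith_mean <- function(a, b) {
  1 / ((a + b) / 2)
}

# Degrees 6 and 5 have arithmetic average 5.5, so the weight is 1 / 5.5 = 2/11.
w_arith <- inv_arith_mean(6, 5)
```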
This weight function returns the inverse of the geometric average of the node degrees of an edge.
weiGeo(a, b)
a |
node degree of one end of an edge |
b |
node degree of the other end of the edge |
The edge weight, computed as the inverse of the geometric average of the node degrees of the edge's two ends.
# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiGeo(6, 5)
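The geometric-average weight can be sketched the same way, assuming the geometric average of two degrees is the square root of their product. The name `inv_geo_mean` is hypothetical and the code is an illustration of the description, not the package's source for `weiGeo`.

```r
# Inverse of the geometric average of two node degrees
# (assumes geometric average = sqrt(a * b); not the package's code).
inv_geo_mean <- function(a, b) {
  1 / sqrt(a * b)
}

# Degrees 6 and 5 have geometric average sqrt(30), so the weight is 1 / sqrt(30).
w_geo <- inv_geo_mean(6, 5)
```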
This weight function returns the inverse of the max node degree of an edge.
weiMax(a, b)
a |
node degree of one end of an edge |
b |
node degree of the other end of the edge |
The edge weight, computed as the inverse of the maximum of the node degrees of the edge's two ends.
# For an edge where one end has a node degree of 5
# another end has a node degree of 6
weiMax(6, 5)
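A minimal sketch of the inverse-maximum weight, written with `pmax()` so that it also works element-wise on vectors of degrees, is given below. The name `inv_max_degree` and the degree vectors are hypothetical; this illustrates the description and is not the package's source for `weiMax`.

```r
# Inverse of the larger of two node degrees, vectorised with pmax()
# (a sketch of the described formula, not the package's code).
inv_max_degree <- function(a, b) {
  1 / pmax(a, b)
}

# Single edge with degrees 6 and 5: the larger degree is 6, so the weight is 1/6.
w_single <- inv_max_degree(6, 5)

# Several edges at once: hypothetical degrees of the two ends of three edges.
deg_a <- c(1, 3, 20)
deg_b <- c(2, 3, 4)
w_edges <- inv_max_degree(deg_a, deg_b)  # 1/2, 1/3, 1/20
```

Because the maximum of two positive numbers is never smaller than their arithmetic or geometric average, this weight is never larger than the corresponding arithmetic or geometric weights, so it downweights edges attached to high-degree nodes the most strongly of the three.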