Congyuan Yang, Carey E. Priebe, Youngser Park, David J. Marchette
https://arxiv.org/abs/1904.02926
Our problem of interest is to cluster vertices of a graph by identifying underlying community structure. Among various vertex clustering approaches, spectral clustering is one of the most popular methods because it is easy to implement while often outperforming more traditional clustering algorithms. However, there are two inherent model selection problems in spectral clustering, namely estimating both the embedding dimension and number of clusters. This paper attempts to address the issue by establishing a novel model selection framework specifically for vertex clustering on graphs under a stochastic block model. The first contribution is a probabilistic model which approximates the distribution of the extended spectral embedding of a graph. The model is constructed based on a theoretical result of asymptotic normality of the informative part of the embedding, and on a simulation result providing a conjecture for the limiting behavior of the redundant part of the embedding. The second contribution is a simultaneous model selection framework. In contrast with the traditional approaches, our model selection procedure estimates embedding dimension and number of clusters simultaneously. Based on our conjectured distributional model, a theorem on the consistency of the estimates of model parameters is presented, providing support for the validity of our method. Algorithms for our simultaneous model selection for vertex clustering are proposed, demonstrating superior performance in simulation experiments. We illustrate our method via application to a collection of brain graphs.
All the experiments are written in R
, and here is the instruction for getting the results in the paper.
This is exactly the same setting as Figures 8 and 10 in the paper (with p = 0.09) except that the graph is smaller, e.g., n = 100 as opposed to n = 500 in the paper. To run this simple demo, please run these lines in R
prompt:
if (!require(RCurl)) install.packages("RCurl")
library(RCurl)
script <- getURL("https://raw.githubusercontent.com/youngser/dhatkhat/master/R/demo.R", ssl.verifypeer = FALSE)
eval(parse(text = script))
and the result (each plot on its own window) should look like this.
(This may take around 10 minutes on a typical laptop.)
To run the synthetic data simulation in the paper, please follow these steps:
- (optional) Run
main_simulation.R
600 times with different parameter pair(i, mc)
, wherei
ranges from 1 to 6 andmc
ranges from 1 to 100.batch_simulation.sh
is a script for running these jobs in our cluster. To run this code on a standalone machine, please do this on a command line (on the same directory where all the.R
files reside) for each(i, mc)
pair, for example
$ R CMD BATCH --no-save --no-restore '--args 1 1' main_simulation.R main_simulation.out &
which takes several minutes on our local linux server. Each run produces a .RData
file as a partial result (600 files total). The precomputed .RData
files are loaded inside loadData.R
.
-
Run
simulation1.R
to get Figures 1, 2, 3, 5, 6, 7, 9 in the paper. The results are shown in here. -
Run
simulation2.R
to get Figures 4, 8, 10, 11 in the paper. The results are shown in here.
To run the connectome data experiment in the paper, please follow these steps:
- (optional) Run
main_DS01216.R
114 times with different parameterfileIndex
, which ranges from 1 to 114.batch_realdata.sh
is a script for running these jobs in our cluster. To run this code on a standalone machine, please do this on a command line (on the same directory where all the.R
files reside) for eachi
-th run, for example
$ R CMD BATCH --no-save --no-restore '--args 1' main_DS01216.R main_DS01216.out &
which takes several hours on our local linux server. Each run will produces a .RData
file as a partial result (114 files total). The precomputed .RData
files are loaded inside loadData.R
.
- Run
plot_DS01216.R
to get Figures 12, 13 and the results of Table 1 in the paper. The results are shown in here.