ClusTorus is a package for clustering multivariate angular data, especially protein structure data. It provides clustering algorithms built on the conformal prediction framework, which lets them deal with outliers. The package offers several methods for fitting these algorithms, some of which are introduced below. It also provides simple tools for handling angular data, such as angular subtraction and angular distance computation. This section gives a brief tour of how to use ClusTorus.

Data Loading and Handling

ClusTorus provides two toy datasets, which are used in Jung et al. (2021), “Clustering on the Torus by Conformal Prediction”, Annals of Applied Statistics. The dataset used here is sampled from a mixture of \(K = 3\) clusters: the first cluster, of size \(n_1 = 100\), is sampled from a spherical normal distribution; the second, of size \(n_2 = 350\), from the uniform distribution on a large “L”-shaped region; and the third, of size \(n_3 = 50\), from the uniform distribution on the entire \(\mathbb{T}^2\).

library(ClusTorus)
library(tidyverse)

data <- toydata2
head(data)
#>        phi       psi label
#> 1 2.730154 0.2819140     1
#> 2 3.004941 0.7895926     1
#> 3 2.733773 0.9507966     1
#> 4 2.976103 0.6336779     1
#> 5 2.655409 0.6599409     1
#> 6 3.196350 0.6545109     1

data %>% ggplot(aes(x = phi, y = psi, color = label)) + geom_point() +
  scale_x_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
  scale_y_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
  ggtitle('Data set 2 with true labels')
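
To make the mixture description concrete, the following sketch simulates a dataset with the same three-part structure in base R. It is not how toydata2 was generated: the cluster center, the spread, and the two rectangles forming the “L” are assumptions chosen for illustration, and sim_mixture is a hypothetical name.

```r
# Hypothetical helper; all numeric settings below are illustrative assumptions
sim_mixture <- function(n1 = 100, n2 = 350, n3 = 50) {
  # Cluster 1: spherical normal around an assumed center, wrapped to [0, 2*pi)
  c1 <- cbind(rnorm(n1, mean = 3, sd = 0.15),
              rnorm(n1, mean = 0.7, sd = 0.15)) %% (2 * pi)

  # Cluster 2: uniform on an "L" built from two non-overlapping rectangles,
  # each given as (xmin, xmax, ymin, ymax). A rectangle is picked with
  # probability proportional to its area, so the density is uniform on the union.
  rects <- rbind(c(1, 2, 1, 6),   # vertical bar
                 c(2, 6, 5, 6))   # horizontal bar
  area <- (rects[, 2] - rects[, 1]) * (rects[, 4] - rects[, 3])
  pick <- sample(nrow(rects), n2, replace = TRUE, prob = area)
  c2 <- cbind(runif(n2, rects[pick, 1], rects[pick, 2]),
              runif(n2, rects[pick, 3], rects[pick, 4]))

  # Cluster 3: uniform noise on the whole torus
  c3 <- cbind(runif(n3, 0, 2 * pi), runif(n3, 0, 2 * pi))

  out <- data.frame(rbind(c1, c2, c3))
  names(out) <- c("phi", "psi")
  out$label <- factor(rep(1:3, times = c(n1, n2, n3)))
  out
}

set.seed(1)
sim <- sim_mixture()
table(sim$label)
```

The resulting data frame has the same columns as the toy dataset (phi, psi, label), so the plotting code above can be reused on it.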

ClusTorus provides the function on.torus, which maps data in radians onto \([0, 2\pi)^d\), where \(d\) is the number of dimensions. Here the dataset already lies on \([0, 2\pi)^d\), so we don’t need to call on.torus.
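
For intuition, the same kind of wrapping, together with angular subtraction and angular distance, can be written in a few lines of base R. The helpers below are hypothetical stand-ins written for this sketch; they are not the ClusTorus functions and may differ from them in conventions such as the range of the signed difference.

```r
# Hypothetical helpers written in base R; they are not ClusTorus functions

# Wrap angles in radians onto [0, 2*pi)
wrap_angle <- function(x) x %% (2 * pi)

# Signed angular difference x - y, mapped to [-pi, pi)
ang_diff <- function(x, y) ((x - y + pi) %% (2 * pi)) - pi

# Angular distance: length of the shorter arc between x and y, in [0, pi]
ang_distance <- function(x, y) abs(ang_diff(x, y))

wrap_angle(c(-pi / 2, 5 * pi / 2))   # 3*pi/2 and pi/2
ang_distance(0.1, 2 * pi - 0.1)      # 0.2, the short way around the circle
```

The second call shows why plain subtraction is not enough for angles: the two points are numerically far apart but lie close together on the circle.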

Clustering with Various Options

Now we are ready to apply clustering algorithms to the data. ClusTorus provides various options for clustering and constructing prediction sets, but here we cover only the “kmeans - general” case. Two hyperparameters must be chosen: the number of modes or ellipsoids \(J\) and the significance level \(\alpha\). Before choosing them, we first run the model fitting function icp.torus.score over a range of values of \(J\).
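
As background on the role of \(\alpha\), here is a minimal sketch of inductive conformal prediction on a circle, in base R. It is a generic illustration and not the ClusTorus implementation: the simulated data, the single-center conformity score, and all names are assumptions. The data are split into a fitting part and a calibration part, a conformity score is built from the fitting part, and the prediction set keeps every point whose score is at least the \(\lfloor (n_2 + 1)\alpha \rfloor\)-th smallest calibration score, where \(n_2\) is the calibration sample size.

```r
set.seed(1)

# Hypothetical 1-D angular sample: a tight cluster near pi plus uniform noise
theta <- c(rnorm(180, mean = pi, sd = 0.3), runif(20, 0, 2 * pi)) %% (2 * pi)

# Split into a fitting half and a calibration half
idx <- sample(length(theta), length(theta) / 2)
fit_part <- theta[idx]
cal_part <- theta[-idx]

# Angular distance: length of the shorter arc between a and b
ang_distance <- function(a, b) {
  d <- abs(a - b) %% (2 * pi)
  pmin(d, 2 * pi - d)
}

# Toy conformity score: minus the distance to the circular mean of the
# fitting half (larger score = more typical point)
center <- atan2(mean(sin(fit_part)), mean(cos(fit_part))) %% (2 * pi)
score <- function(x) -ang_distance(x, center)

# Threshold: the floor((n2 + 1) * alpha)-th smallest calibration score
alpha <- 0.1
cal_scores <- sort(score(cal_part))
k <- floor((length(cal_part) + 1) * alpha)
threshold <- cal_scores[k]

# Membership in the level-alpha prediction set
in_set <- function(x) score(x) >= threshold

mean(in_set(cal_part))   # share of calibration points inside the set
```

A smaller \(\alpha\) lowers the threshold and enlarges the prediction set, so more of the noise points fall inside it; a larger \(\alpha\) does the opposite. This is why the number of clusters obtained later depends on \(\alpha\).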

set.seed(2021)

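# Candidate values for the number of ellipsoids J
Jvec <- 5:30
l <- list()

# Fit one icp.torus object per candidate J.
# The list is indexed by J itself, so l[[1]] to l[[4]] remain NULL.
for (j in Jvec){
  l[[j]] <- icp.torus.score(as.matrix(data[, 1:2]),
                            method = "kmeans",
                            kmeansfitmethod = "general",
                            init = "hierarchical",
                            param = list(J = j),
                            verbose = FALSE)
}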

The list l contains the icp.torus objects, one for each value of the hyperparameter \(J\); each object holds the fitted parameters needed to generate clusters for that \(J\). Specifying the significance level \(\alpha\) then yields the clusters. But how should we choose the hyperparameters to obtain good clusters or conformal prediction sets? One approach takes the hyperparameters that give the cluster/prediction set of minimum volume (area) at a given significance level. Another chooses the number of mixture components \(J\) with an information criterion such as AIC or BIC.

These approaches are implemented in the function hyperparam.torus, whose main arguments are data, icp.torus.objects, and option. The argument data is the data to be clustered. icp.torus.objects is a list whose elements are icp.torus objects, such as the list l generated above. The argument option, the most important of the three, sets the hyperparameter selection criterion. If option = "elbow", hyperparam.torus selects \(J\) and \(\alpha\) with the volume-based criterion mentioned above. If option = "AIC" or option = "BIC", it selects \(J\) with the designated information criterion and then selects the most stable \(\alpha\), in the sense of the number of generated clusters. If option = "risk", it chooses the \(J\) that minimizes the sum of the conformity scores and likewise chooses the stable \(\alpha\).

We use option = "risk" here. The following code shows the criterion results (\(J\) versus the evaluated criterion), the \(\alpha\) results (\(\alpha\) versus the number of clusters), and the chosen \(J\) and \(\alpha\).

output <- hyperparam.torus(data[, 1:2], icp.torus.objects = l, option = "risk")
output$IC.results
#>     J criterion
#> 1   5  4333.011
#> 2   6  4408.142
#> 3   7  4533.556
#> 4   8  4485.194
#> 5   9  4405.042
#> 6  10  4426.742
#> 7  11  4414.529
#> 8  12  4478.793
#> 9  13  4644.154
#> 10 14  4605.019
#> 11 15  4590.788
#> 12 16  4517.246
#> 13 17  4620.975
#> 14 18  4707.848
#> 15 19  4842.194
#> 16 20  4591.106
#> 17 21  4790.956
#> 18 22  5025.929
#> 19 23  4624.688
#> 20 24  4932.752
#> 21 25  4814.853
#> 22 26  4777.052
#> 23 27  4731.694
#> 24 28  4912.512
#> 25 29  4706.583
#> 26 30  5019.500
output$alpha.results
#>    alpha ncluster
#> 1  0.002        1
#> 2  0.004        1
#> 3  0.006        1
#> 4  0.008        1
#> 5  0.010        1
#> 6  0.012        1
#> 7  0.014        1
#> 8  0.016        1
#> 9  0.018        1
#> 10 0.020        2
#> 11 0.022        2
#> 12 0.024        2
#> 13 0.026        2
#> 14 0.028        3
#> 15 0.030        3
#> 16 0.032        4
#> 17 0.034        4
#> 18 0.036        4
#> 19 0.038        4
#> 20 0.040        4
#> 21 0.042        4
#> 22 0.044        4
#> 23 0.046        4
#> 24 0.048        4
#> 25 0.050        4
#> 26 0.052        4
#> 27 0.054        4
#> 28 0.056        4
#> 29 0.058        2
#> 30 0.060        2
#> 31 0.062        2
#> 32 0.064        2
#> 33 0.066        2
#> 34 0.068        2
#> 35 0.070        2
#> 36 0.072        2
#> 37 0.074        2
#> 38 0.076        2
#> 39 0.078        2
#> 40 0.080        2
#> 41 0.082        2
#> 42 0.084        2
#> 43 0.086        2
#> 44 0.088        2
#> 45 0.090        2
#> 46 0.092        2
#> 47 0.094        2
#> 48 0.096        2
#> 49 0.098        2
#> 50 0.100        2
#> 51 0.102        2
#> 52 0.104        2
#> 53 0.106        2
#> 54 0.108        2
#> 55 0.110        2
#> 56 0.112        3
#> 57 0.114        3
#> 58 0.116        3
#> 59 0.118        3
#> 60 0.120        3
#> 61 0.122        3
#> 62 0.124        3
#> 63 0.126        3
#> 64 0.128        3
#> 65 0.130        3
#> 66 0.132        3
#> 67 0.134        3
#> 68 0.136        3
#> 69 0.138        3
#> 70 0.140        3
#> 71 0.142        3
#> 72 0.144        3
#> 73 0.146        3
#> 74 0.148        3
#> 75 0.150        3
output$optim$hyperparam
#>     J alpha 
#> 5.000 0.084
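
The selected values can be cross-checked against the printed tables with a short base R sketch. It assumes that "risk" picks the \(J\) with the smallest criterion, and it reads “most stable \(\alpha\)” as the midpoint of the longest run of consecutive \(\alpha\) values that give the same number of clusters. The second reading is an assumption about the rule, not something taken from the package source. The vectors are retyped from the output above.

```r
# Criterion values for J = 5, ..., 30, retyped from output$IC.results
J <- 5:30
criterion <- c(4333.011, 4408.142, 4533.556, 4485.194, 4405.042, 4426.742,
               4414.529, 4478.793, 4644.154, 4605.019, 4590.788, 4517.246,
               4620.975, 4707.848, 4842.194, 4591.106, 4790.956, 5025.929,
               4624.688, 4932.752, 4814.853, 4777.052, 4731.694, 4912.512,
               4706.583, 5019.500)
J_hat <- J[which.min(criterion)]

# ncluster for alpha = 0.002, 0.004, ..., 0.150, retyped as run lengths
alpha <- seq_len(75) * 0.002
ncluster <- rep(c(1, 2, 3, 4, 2, 3), times = c(9, 4, 2, 13, 27, 20))

# Locate the longest run of a constant number of clusters
runs <- rle(ncluster)
longest <- which.max(runs$lengths)
last <- cumsum(runs$lengths)[longest]
first <- last - runs$lengths[longest] + 1

# Assumed rule: take the midpoint of that run
alpha_hat <- mean(alpha[c(first, last)])

c(J = J_hat, alpha = round(alpha_hat, 3))
```

On these tables the sketch returns \(J = 5\) and \(\alpha = 0.084\), the midpoint of the run from 0.058 to 0.110 with two clusters, which agrees with output$optim$hyperparam.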

hyperparam.torus also returns the selected model: since icp.torus.score does the model fitting, hyperparam.torus simply picks the fitted object for the chosen \(J\). This object is stored in output$optim$icp.torus, and we use it to generate the resulting clusters.

icp.torus.kmeans <- output$optim$icp.torus
alphahat <- output$optim$hyperparam[2]

Generating Clusters and Visualization of Clustering Results

With cluster.assign.torus, we can generate the clusters for the mixture and kmeans options; it assigns a cluster label to each data point. Moreover, for the kmeans case, we can directly inspect how the clusters are formed, as below:

c_kmeans <- cluster.assign.torus(data[, 1:2], icp.torus.kmeans, level = alphahat)
c_kmeans$kmeans$plot
#> [[1]]