In unsupervised machine learning the data do not include a measure of success, so a confusion matrix cannot be used for evaluation.
For example, in clustering, if a separate set of class labels exists (sometimes called supervised clustering), the cluster assigned to each observation can of course be compared with those labels. Metrics such as Homogeneity, Completeness, Adjusted Rand Index, Adjusted Mutual Information, and V-Measure work this way. Because they require the true labels of the data set, one approach is to run the clustering algorithm on a classification data set, where true labels are available, and evaluate the results against them, as in the sketch below.
Otherwise, in purely unsupervised clustering, techniques such as the residual sum of squares, purity, the silhouette measure, the
Calinski-Harabasz measure, class-based precision and recall, normalized mutual
information, variation of information, and graph-sensitive indices should be used.
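A minimal sketch in R (assuming the mclust package is installed, and using the built-in iris data purely as a stand-in for a classification data set with known labels):
library(mclust)
#Cluster the numeric columns of a labeled data set
km <- stats::kmeans(iris[, 1:4], centers = 3, nstart = 25)
#Compare the cluster assignments against the known class labels with the Adjusted Rand Index
mclust::adjustedRandIndex(km$cluster, iris$Species)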
SILHOUETTE
The silhouette method provides a measure of how similar each data point is to its assigned cluster
compared with the other clusters. It is computed by calculating the silhouette value for each
data point and then averaging the results across the entire data set.
To compute the silhouette value for a single data point, you need the average
distance between that data point and each of the clusters. The average distance between a data point
and all other data points within a cluster is calculated with the following equation:
AverageDistance(i, C) = (1 / nC) × Σ distance(i, j), summed over the nC points j in cluster C
(when C is the point's own cluster, the point itself is excluded and nC is the cluster size minus one).
The silhouette value for a single data point is calculated with the following equation:
Silhouette(i) = (AverageOut(i) − AverageIn(i)) / max(AverageIn(i), AverageOut(i))
In this equation, AverageOut is the minimum of the average distances between the data point and
the data within each of the other clusters, and AverageIn is the average distance between the data point
and the other data within its own cluster. The silhouette measure is the average of all the
silhouette values computed for each data point.
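As a minimal sketch (using the iris data and a k-means fit purely for illustration), the per-point silhouette values and their average can be obtained with cluster::silhouette():
library(cluster)
km <- stats::kmeans(iris[, 1:4], centers = 3, nstart = 25)
#silhouette() needs the cluster assignments and the pairwise distance matrix
sil <- cluster::silhouette(km$cluster, dist(iris[, 1:4]))
#The silhouette measure is the mean of the per-point silhouette widths
mean(sil[, "sil_width"])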
The silhouette measure is restricted to the range of [-1, 1], with -1 meaning that no data
are well suited to their assigned clusters, and 1 meaning that all data are well suited to their
assigned clusters.
The silhouette measure is useful because it considers distance to other clusters in addition
to the distance within the cluster. Because of this comparison, the silhouette measure is
suitable for comparing clustering results that contain different numbers of clusters. If there
are too many or too few clusters, the silhouette measure will be closer to zero than if an
appropriate number of clusters is chosen.
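For example, a minimal sketch (assuming an illustrative numeric data frame df) that computes the average silhouette for several candidate values of k and keeps the best one:
library(cluster)
df <- iris[, 1:4]   #illustrative numeric data frame
d <- dist(df)
avg_sil <- sapply(2:6, function(k) {
  km <- stats::kmeans(df, centers = k, nstart = 25)
  mean(cluster::silhouette(km$cluster, d)[, "sil_width"])
})
#The k with the largest average silhouette is the suggested number of clusters
(2:6)[which.max(avg_sil)]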
Like SSE, the silhouette measure applies best to centroid-based clustering methods. This is
because rule-based methods or hierarchical clustering methods don’t seek to minimize
distances in the same way that centroid-based cluster methods do. Because of this,
methods that are not centroid-based are not appropriately described by the silhouette
measure. An additional area of concern is that the silhouette measure takes much longer to
compute than SSE: it requires the pairwise distances between data points, so the amount of
work grows with the square of the number of data points.
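By contrast, SSE needs no distance matrix at all; as a quick illustration (reusing a k-means fit on an arbitrary numeric data frame df):
Kmeans_fit <- stats::kmeans(df, centers = 3, nstart = 25)
#Total within-cluster sum of squares (SSE) is returned directly by kmeans()
Kmeans_fit$tot.withinss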
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3409-2019.pdf
Videos
Topic | Access |
---|---|
Hierarchical Clustering, Stanford University (15 min) | https://www.youtube.com/watch?v=rg2cjfMsCk4 |
Hierarchical Clustering in R (44 min) | https://www.youtube.com/watch?v=9U4h6pZw6f8 |
The k Means Algorithm, Stanford University (13 min) | https://www.youtube.com/watch?v=RD0nNK51Fp8 |
K Means Clustering: Pros and Cons of K Means Clustering (24 min) | https://www.youtube.com/watch?v=YIGtalP1mv0 |
Clustering: K-means and Hierarchical (customer segmentation) (17 min, optional) | https://www.youtube.com/watch?v=QXOkPvFM6NU |
Articles
Topic | Access |
---|---|
Hierarchical Clustering in R | https://www.datacamp.com/community/tutorials/hierarchical-clustering-R |
Hierarchical Cluster Analysis | https://uc-r.github.io/hc_clustering |
K-Means Clustering in R | https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/ |
K-Means Clustering in R Tutorial | https://www.datacamp.com/community/tutorials/k-means-clustering-r |
K-means Cluster Analysis | https://uc-r.github.io/kmeans_clustering |
fviz_nbclust() documentation (factoextra) | https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_nbclust.html |
=-=-=-=-=-=-=-=-=-=-=-
library(factoextra)
library(cluster)
#Find the optimum number of clusters
#Partitioning methods, such as k-means clustering, require the user to specify the number of clusters to be generated.
#fviz_nbclust(): determines and visualizes the optimal number of clusters using different methods:
#within-cluster sums of squares, average silhouette and gap statistic.
#One of the following calls is enough to suggest the optimum k.
#For visualization, it helps if the data frame (df) has only two columns, because the clusters can then be shown directly in a two-dimensional space.
factoextra::fviz_nbclust(x = df, FUNcluster = kmeans, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = hcut, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = cluster::pam, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = cluster::clara, method = "silhouette")
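The other methods mentioned in the comments above can be requested in the same way; a short sketch, assuming the same data frame df:
#Elbow method: total within-cluster sums of squares for each candidate k
factoextra::fviz_nbclust(x = df, FUNcluster = kmeans, method = "wss")
#Gap statistic (nboot bootstrap samples; can be slow on larger data sets)
factoextra::fviz_nbclust(x = df, FUNcluster = kmeans, method = "gap_stat", nboot = 50)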
########################################################
#Create a distance matrix (used below by the hierarchical clustering functions)
DistanceMatrix <- factoextra::get_dist(df, method = "euclidean")
####################################
#Create a hierarchical cluster object
#method: one of
#"ward.D", "ward.D2", "single", "complete",
#"average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
#stats::hclust performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered.
Hclustermodel <- stats::hclust(DistanceMatrix, method = "ward.D")
#Cut the tree to k = 2
#stats::cutree cuts a tree into groups of data; this step is necessary for hierarchical cluster objects
#because it is at this step that a cluster is assigned to each point
CutModel <- stats::cutree(Hclustermodel, k = 2)
#CutModel contains the cluster each point belongs to
CutModel
#Visualize the cut clusters
#factoextra::fviz_cluster can't handle an object of class hclust directly, so pass a list with the data and the cluster assignments
factoextra::fviz_cluster(list(data = df, cluster = CutModel))
It is often simpler to use hcut(), which does both steps at once
#Computes hierarchical clustering and cuts the tree
hcmodel <- factoextra::hcut(HCdfscaled, k = 10, hc_method = "complete")
#Visualize the clusters
factoextra::fviz_cluster(hcmodel, data = HCdfscaled, geom = "point", ellipse.type = "convex", show.clust.cent = TRUE)
#For hierarchical clustering objects, typically produced by agnes() or diana()
hc <- cluster::agnes(df, method = "ward")
cluster::pltree(hc, cex = 0.6, hang = -1, main = "Dendrogram of hierarchical cluster produced by agnes (Agglomerative Nesting)")
#####################################
#K-means clustering
#The k-means model contains the mean (centroid) of each cluster in the n-dimensional space, and each point is assigned to a cluster.
#Perform k-means clustering on a data matrix.
#nstart = 25 runs several random starts and keeps the best solution
Kmeans_result <- stats::kmeans(df, centers = 3, nstart = 25)
#Visualize the clusters
#object = an object of class "partition" created by the functions pam(),
#clara() or fanny() in the cluster package;
#"kmeans" [in the stats package];
#"dbscan" [in the fpc package];
#"Mclust" [in mclust]; "hkmeans",
#"eclust" [in factoextra].
#Possible values also include any list with data and cluster components (e.g. object = list(data = mydata, cluster = myclust)).
factoextra::fviz_cluster(object=Kmeans_result, data = df)
If the data frame has more than two dimensions (variables), fviz_cluster
performs principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the largest share of the variance.
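To inspect that projection explicitly, a minimal sketch (assuming the same numeric data frame df) using stats::prcomp():
pca <- stats::prcomp(df, scale. = TRUE)
#Proportion of variance explained by the first two principal components;
#these roughly correspond to the percentages shown on the fviz_cluster axes
#(they may differ slightly depending on whether the data are scaled)
summary(pca)$importance["Proportion of Variance", 1:2]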
https://stackoverflow.com/questions/65844978/caret-classification-thresholds