#### In unsupervised machine learning, the data does not include a measure of success, so a confusion matrix cannot be used as an evaluation method.

For example, in clustering, if separate class-label data exists (supervised clustering), then the cluster assigned to each observation can of course be compared with those labels. Metrics of this kind include homogeneity, completeness, Adjusted Rand Index, Adjusted Mutual Information, and V-measure. Computing them requires the true labels of the data set, so one approach is to run the clustering algorithm on a classification data set, where true labels are available, and evaluate the results against them.
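As a sketch of how such an external metric works, the Adjusted Rand Index can be computed directly from the contingency table of true versus predicted labels. The function below is illustrative, not from the original notes; packages such as mclust provide a ready-made `adjustedRandIndex()`.

```r
# Adjusted Rand Index from true and predicted cluster labels.
# Minimal base-R sketch; mclust::adjustedRandIndex() is the standard implementation.
adjusted_rand_index <- function(true_labels, pred_labels) {
  tab <- table(true_labels, pred_labels)   # contingency table of the two partitions
  n <- sum(tab)
  sum_comb  <- sum(choose(tab, 2))         # pairs grouped together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(n, 2)  # chance-level agreement
  max_index <- (sum_rows + sum_cols) / 2
  (sum_comb - expected) / (max_index - expected)
}

# A perfect clustering scores 1 even when the label names differ.
truth <- c(1, 1, 1, 2, 2, 2)
pred  <- c(2, 2, 2, 1, 1, 1)   # same grouping, swapped label names
adjusted_rand_index(truth, pred)   # 1
```

Because the index is adjusted for chance, a random assignment scores near 0, which makes it more informative than raw pair-counting agreement.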

Otherwise, in unsupervised clustering, techniques such as the residual sum of squares, purity, the silhouette measure, the Calinski-Harabasz measure, class-based precision and recall, normalized mutual information, variation of information, and graph-sensitive indices should be used.

#### SILHOUETTE

The silhouette method provides a measure of how similar each data point is to its assigned cluster compared to the other clusters. It is computed by calculating the silhouette value for each data point and then averaging the results across the entire data set.

To compute the silhouette value for a single data point, you need the average distance between that data point and each cluster. The average distance between a data point i and the data points of a cluster C is calculated with the following equation:

AverageDistance(i, C) = (1 / n_C) * sum over j in C, j != i, of d(i, j)

where d(i, j) is the distance between points i and j, and n_C is the number of points of C other than i itself. The silhouette value for a single data point is then calculated with the following equation:

Silhouette(i) = (AverageOut(i) - AverageIn(i)) / max(AverageIn(i), AverageOut(i))

In this equation, AverageOut is the minimum average distance between the data point and the data within the other clusters, and AverageIn is the average distance between the data point and the other data within its own cluster. The silhouette measure is the average of the silhouette values computed for all data points.
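The two equations above can be turned into a short base-R function. This is a sketch of the definition for illustration; `cluster::silhouette()` is the standard implementation.

```r
# Silhouette value of every data point, following the definition:
# AverageIn  = mean distance to the other points in the point's own cluster
# AverageOut = minimum, over the other clusters, of the mean distance to that cluster
silhouette_values <- function(X, clusters) {
  d <- as.matrix(dist(X))   # pairwise Euclidean distances
  sapply(seq_len(nrow(d)), function(i) {
    own <- setdiff(which(clusters == clusters[i]), i)
    if (length(own) == 0) return(0)   # convention: singleton cluster scores 0
    avg_in  <- mean(d[i, own])
    avg_out <- min(sapply(setdiff(unique(clusters), clusters[i]),
                          function(k) mean(d[i, clusters == k])))
    (avg_out - avg_in) / max(avg_in, avg_out)
  })
}

# Two well-separated groups give silhouette values close to 1.
X <- rbind(matrix(c(0, 0, 0, 1, 1, 0), ncol = 2, byrow = TRUE),
           matrix(c(10, 10, 10, 11, 11, 10), ncol = 2, byrow = TRUE))
cl <- c(1, 1, 1, 2, 2, 2)
mean(silhouette_values(X, cl))   # close to 1
```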

The silhouette measure is restricted to the range [-1, 1], with -1 meaning that no data points are well suited to their assigned clusters, and 1 meaning that all data points are well suited to their assigned clusters.

The silhouette measure is useful because it considers the distance to other clusters in addition to the distance within the cluster. Because of this comparison, the silhouette measure is suitable for comparing clustering results that contain different numbers of clusters. If there are too many or too few clusters, the silhouette measure will be closer to zero than if an appropriate number of clusters is chosen.
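This property can be used directly to pick the number of clusters: cluster the data for several values of k and keep the k with the highest average silhouette width. The sketch below uses `stats::kmeans` with `cluster::silhouette` on hypothetical toy data; the `fviz_nbclust()` call shown later in these notes automates the same procedure.

```r
library(cluster)   # recommended package shipped with R; provides silhouette()

set.seed(42)
# Toy data: three well-separated Gaussian blobs in 2-D (20 points each).
X <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 6),  ncol = 2),
           matrix(rnorm(40, mean = 12), ncol = 2))
d <- dist(X)

# Average silhouette width for k = 2..6.
avg_sil <- sapply(2:6, function(k) {
  km <- stats::kmeans(X, centers = k, nstart = 10)
  mean(cluster::silhouette(km$cluster, d)[, 3])
})
names(avg_sil) <- 2:6
avg_sil
which.max(avg_sil)   # expected to select k = 3 for this data
```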

Like SSE, the silhouette measure applies best to centroid-based clustering methods. Rule-based and hierarchical clustering methods do not seek to minimize distances in the same way that centroid-based methods do, so methods that are not centroid-based are not appropriately described by the silhouette measure. An additional concern is that the silhouette measure takes much longer to compute than SSE: the number of pairwise distances required grows with the square of the number of data points.

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3409-2019.pdf

#### Videos

Topic | Access
---|---
Hierarchical Clustering, Stanford University (15 min) | https://www.youtube.com/watch?v=rg2cjfMsCk4
Hierarchical Clustering in R (44 min) | https://www.youtube.com/watch?v=9U4h6pZw6f8
The k-Means Algorithm, Stanford University (13 min) | https://www.youtube.com/watch?v=RD0nNK51Fp8
K-Means Clustering: Pros and Cons of K-Means Clustering (24 min) | https://www.youtube.com/watch?v=YIGtalP1mv0
Clustering: K-means and Hierarchical (customer segmentation) (17 min, optional) | https://www.youtube.com/watch?v=QXOkPvFM6NU

#### Articles

Topic | Access
---|---
Hierarchical Clustering in R | https://www.datacamp.com/community/tutorials/hierarchical-clustering-R
Hierarchical Cluster Analysis | https://uc-r.github.io/hc_clustering
K-Means Clustering in R | https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/
K-Means Clustering in R Tutorial | https://www.datacamp.com/community/tutorials/k-means-clustering-r
K-means Cluster Analysis | https://uc-r.github.io/kmeans_clustering
fviz_nbclust() reference (factoextra) | https://search.r-project.org/CRAN/refmans/factoextra/html/fviz_nbclust.html

---

```r
library(factoextra)
library(cluster)

# Find the optimum number of clusters.
# Partitioning methods, such as k-means clustering, require the user to specify
# the number of clusters to be generated.
# fviz_nbclust() determines and visualizes the optimal number of clusters using
# different methods: within-cluster sums of squares, average silhouette, and
# gap statistic.
# Any one of the following calls is enough to suggest the optimum k.
# For visualization, the data frame (df) is better kept to two columns so the
# clusters can later be visualized in two-dimensional space.

factoextra::fviz_nbclust(x = df, FUNcluster = kmeans, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = hcut, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = cluster::pam, method = "silhouette")
factoextra::fviz_nbclust(x = df, FUNcluster = cluster::clara, method = "silhouette")
```

```r
########################################################
# To visualize the clusters, create a distance matrix.
DistanceMatrix <- factoextra::get_dist(df, method = "euclidean")

########################################################
# Create a hierarchical cluster object.
# method: one of "ward.D", "ward.D2", "single", "complete",
# "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
# stats::hclust performs a hierarchical cluster analysis using a set of
# dissimilarities for the n objects being clustered.
Hclustermodel <- stats::hclust(DistanceMatrix, method = "ward.D")

# Cut the tree to k = 2.
# stats::cutree cuts a tree into groups of data; this is necessary for
# hierarchical cluster objects, and it is at this step that a cluster is
# assigned to each point.
CutModel <- stats::cutree(Hclustermodel, k = 2)
# CutModel contains the cluster each point belongs to.
CutModel

# Visualize the cut clusters.
plot.new()
# factoextra::fviz_cluster cannot handle an object of class hclust directly,
# so we pass a list with the data and cluster components.
factoextra::fviz_cluster(list(data = df, cluster = CutModel))
```

#### But it is better to use hcut

```r
# hcut computes hierarchical clustering and cuts the tree in one step.
hcmodel <- factoextra::hcut(HCdfscaled, k = 10, hc_method = "complete")

# Visualize the clusters.
factoextra::fviz_cluster(hcmodel, data = subset, geom = "point",
                         ellipse.type = "convex", show.clust.cent = TRUE)

# For hierarchical clustering, typically resulting from agnes() or diana():
hc <- cluster::agnes(df, method = "ward")
cluster::pltree(hc, cex = 0.6, hang = 1,
                main = "Dendrogram of hierarchical cluster made by agnes (Agglomerative Nesting)")
```

```r
#####################################################
# K-means clustering.
# The k-means model contains the mean of each cluster in an n-dimensional
# space, and each point is assigned to a cluster.
# Perform k-means clustering on a data matrix.
Kmeans_result <- stats::kmeans(df, centers = 3)

# Visualize the clusters.
# object: an object of class "partition" created by pam(), clara() or fanny()
# in the cluster package; "kmeans" [in the stats package]; "dbscan" [in the
# fpc package]; "Mclust" [in mclust]; "hkmeans" or "eclust" [in factoextra].
# Possible values also include any list object with data and cluster
# components (e.g. object = list(data = mydata, cluster = myclust)).
factoextra::fviz_cluster(object = Kmeans_result, data = df)
```

If there are more than two dimensions (variables) in the data frame, `fviz_cluster` will perform principal component analysis (PCA) and plot the data points according to the first two principal components, which explain the majority of the variance.
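The same projection can be reproduced manually with `stats::prcomp`. The sketch below uses the built-in iris measurements as example data (not part of the original notes) to show the view `fviz_cluster` would plot for four-dimensional input.

```r
# Project 4-dimensional data onto its first two principal components,
# the same 2-D view fviz_cluster plots for data with more than two variables.
X  <- scale(iris[, 1:4])                 # four numeric variables, standardized
km <- stats::kmeans(X, centers = 3, nstart = 10)

pca    <- stats::prcomp(X)               # data already centered and scaled above
scores <- pca$x[, 1:2]                   # coordinates on PC1 and PC2

# Share of variance explained by the first two components.
var_explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
var_explained                            # about 0.96 for iris

plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "k-means clusters on the first two principal components")
```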

https://uc-r.github.io/hc_clustering

https://uc-r.github.io/kmeans_clustering
