Clustering is a methodology belonging to the class of unsupervised machine learning. It allows finding regularities in data when a group or class identifier (label) is absent, using the structure of the data itself as the tool for discovering them. Because of this powerful feature, clustering is often used as part of a data analysis workflow, prior to classification or other analysis steps, to find natural regularities or groups that may exist in the data.
This provides insightful information about the data's internal organisation: possible groups, their number and distribution, and other internal regularities that help to better understand the data content. To make clustering more tangible, consider grouping customers by estimated income. It is natural to assume some threshold values of 1 kEUR per month, 10 kEUR per month, etc. However:
Most probably, customers' behaviour also depends on factors like occupation, age, total household income, and others. While the need to consider these additional factors is obvious, the grouping itself is not: it is unclear how exactly the different factors interact to determine which group a given customer belongs to. That is where clustering shows its strength – revealing the natural internal structure of the data (the customers in this example).
In this context, a cluster refers to a collection of data points aggregated together because of certain similarities 1). Within this chapter, two different approaches to clustering are discussed:
* K-Means – a centroid-based (partitioning) method;
* DBSCAN – a density-based method.
In both cases, a distance measure estimates the distance between points (objects) and the density of points around a given point. Therefore, all factors used should generally be numerical, assuming a Euclidean space.
The first method discussed here is one of the most commonly used – K-Means. K-means clustering splits the initial set of points (objects) into groups using a distance measure – the distance from a given point to the group's centre, the centroid, which represents the group's prototype. The result of the clustering is N points grouped into K clusters, where each point is assigned a cluster index indicating that the point is closer to this cluster's centroid than to the centroid of any other cluster. The distance measure employs the Euclidean distance, which requires scaled or normalised data to avoid the dominance of a single dimension over others. The algorithm steps are schematically represented in the following figure:
In the figure:
Steps 4-6 are repeated until the cluster centre positions no longer change or the changes become insignificant. The distance is measured using the Euclidean distance:
$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$, where:
* $p$, $q$ – the two points between which the distance is measured;
* $p_i$, $q_i$ – the i-th coordinates of points $p$ and $q$;
* $n$ – the number of dimensions (factors).
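As a quick illustration, the formula can be computed directly with NumPy (the point coordinates below are arbitrary example values):
<code python>
# Euclidean distance between two points, following the formula above
import numpy as np

p = np.array([1.0, 2.0, 3.0])   # arbitrary example coordinates
q = np.array([4.0, 6.0, 3.0])

d = np.sqrt(np.sum((p - q) ** 2))   # square root of the sum of squared differences
print(d)                            # 5.0 for these values (np.linalg.norm(p - q) is equivalent)
</code>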
Example of initial data and assigned cluster marks with cluster centres after running the K-means algorithm:
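To illustrate the procedure, below is a minimal sketch of K-means with scikit-learn on synthetic data; the data generator, the scaling step, and the value K = 3 are assumptions made for the demonstration:
<code python>
# Minimal K-means sketch using scikit-learn (illustrative parameters)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic 2-D data with three "natural" groups (assumed for the demo)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scale the features so no single dimension dominates the Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

# K must be set by the user; here K = 3 is assumed
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)   # cluster index assigned to each point
centroids = kmeans.cluster_centers_     # coordinates of the K centroids
print(labels[:10])
print(centroids)
</code>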
Unfortunately, the K-Means algorithm does not possess an automatic mechanism to select the number of clusters K, i.e., the user must set it. Example of setting different numbers of cluster centres:
In K-means clustering, a practical method – the Elbow method – is used to select a particular number of clusters. The Elbow method is based on finding the point at which adding more clusters does not significantly improve the model's performance. As explained, K-means clustering minimises the sum of squared errors (SSE) – the squared distances between each point and its corresponding cluster centroid. Since the optimal number of clusters (NC) is not known initially, it is wise to increase the NC iteratively. The SSE decreases as the number of clusters increases because the distances to the cluster centres also decrease. However, there is a point where the improvement in SSE diminishes significantly. This point is referred to as the "elbow" 2).
Steps of the method:
* Run K-means for an increasing number of clusters (e.g., K = 1, 2, …, 10).
* For each K, compute the SSE – the sum of squared distances between the points and their cluster centroids.
* Plot the SSE against the number of clusters.
* Select the K at the "elbow" – the point after which adding clusters yields only marginal SSE improvement.
Since the method requires running the K-means algorithm repeatedly, which might be resource-demanding, a subset of the data might be used first to determine the NC, and then K-means can be run on the whole dataset.
Limitations:
* The "elbow" is not always clearly expressed, so the choice of the number of clusters may remain ambiguous and up to the user's judgement (see the figure below).
The figure above demonstrates both more and less obvious "elbows": a user could select a number of clusters equal to 3 or 4.
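A sketch of the elbow procedure with scikit-learn, where the fitted model's inertia_ attribute holds the SSE; the synthetic data and the tested range K = 1…10 are assumptions:
<code python>
# Elbow method sketch: compute the SSE (inertia) for a range of cluster counts
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

sse = []
ks = range(1, 11)                  # candidate numbers of clusters (assumed range)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(model.inertia_)     # SSE for this number of clusters

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("SSE (inertia)")
plt.show()                         # look for the "elbow" in the curve
</code>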
The Silhouette Score is a metric used to evaluate the quality of a clustering result. It measures how similar an object (point) is to its own cluster (cohesion) compared to other clusters (separation). The score ranges from −1 to +1, where higher values indicate better-defined clusters 3).
The Silhouette score considers two main factors for each data point:
* cohesion – how close the point is to the other points of its own cluster;
* separation – how far the point is from the points of the nearest neighbouring cluster.
The silhouette score for a point i is then calculated as:
$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, where:
* $a(i)$ – the average distance from point $i$ to all other points in its own cluster (cohesion);
* $b(i)$ – the lowest average distance from point $i$ to the points of any other cluster (separation).
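For a per-point view matching the formula, scikit-learn's silhouette_samples returns s(i) for every point (a sketch; the data and the choice of 3 clusters are assumed):
<code python>
# Per-point silhouette values s(i), corresponding to the formula above
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

s = silhouette_samples(X, labels)   # one s(i) value per point, in [-1, 1]
print(s.min(), s.mean(), s.max())
</code>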
Steps of the method:
* Run the clustering algorithm (e.g., K-means) for a range of candidate numbers of clusters.
* For each result, compute the silhouette score s(i) of every point and average it over the whole dataset.
* Select the number of clusters that yields the highest mean score.
Limitations:
* The score requires pairwise distances between points, so it becomes computationally expensive for large datasets.
An example is provided in the following figure:
The user should look for the highest score, which in this case is for the 3-cluster option.
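A sketch of the selection procedure using scikit-learn's silhouette_score (the mean of s(i) over all points); the synthetic data and the tested range of K are assumptions:
<code python>
# Silhouette-based selection sketch: pick the K with the highest mean score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 11):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean s(i) over all points

best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
</code>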
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) employs density measures to mark points in high-density regions and points in low-density regions – the noise. Because of this natural behaviour, the algorithm is particularly useful in signal processing and similar application domains 4).
One of the essential concepts is the neighbourhood of a point p – the set of points reachable within the user-defined distance eps (epsilon):
$N_{eps}(p) = \{\, q \in D \mid dist(p, q) \leq eps \,\}$, where:
* $D$ – the set of all data points (the dataset);
* $dist(p, q)$ – the distance between points $p$ and $q$;
* $eps$ – the user-defined neighbourhood radius.
The algorithm treats points differently depending on the density and the distribution of neighbouring points around them – their neighbourhood (a sketch follows the list below):
* Core point – a point with at least minPts points (a user-defined threshold) within its eps-neighbourhood;
* Border point – a point that has fewer than minPts neighbours but lies within the eps-neighbourhood of a core point;
* Noise point – a point that is neither a core nor a border point.
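A minimal DBSCAN sketch with scikit-learn; the dataset and the eps and min_samples values are illustrative assumptions (in scikit-learn, minPts corresponds to min_samples, and points labelled −1 are noise):
<code python>
# DBSCAN sketch: eps and min_samples values are illustrative assumptions
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-convex data with added noise, where density-based clustering works well
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                 # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)
print(f"clusters: {n_clusters}, noise points: {n_noise}")
</code>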