Options in agglomerative clustering

Agglomerative hierarchical clustering is a versatile technique that offers several options for customization. By understanding these options, you can tailor the algorithm to your specific data and requirements.

Linkage Criteria

The choice of linkage criterion determines how the distance between clusters is calculated. Different linkage criteria can lead to different clustering results. Common linkage criteria include:

  • Single-linkage: The distance between two clusters is the minimum distance between any pair of points, one from each cluster. This tends to produce elongated, chain-like clusters, because a single pair of nearby points (including noise points lying between two groups) is enough to merge them.
  • Complete-linkage: The distance between two clusters is the maximum distance between any pair of points, one from each cluster. This tends to produce compact clusters, but it is sensitive to outliers, because a single distant point can dominate the distance between two clusters.
  • Average-linkage: The distance between two clusters is the average distance over all pairs of points, one from each cluster. This is a compromise between single and complete linkage.
  • Centroid-linkage: The distance between two clusters is the distance between their centroids (the mean of the points in each cluster). Only the two centroids are compared rather than every pair of points, but merge distances are not guaranteed to increase from one step to the next, which can produce inversions in the dendrogram.
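The four criteria above can be compared on a toy example. The following is a minimal sketch using only the Python standard library; the function names and the two sample clusters are illustrative assumptions, not part of any library API.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(a, b):
    """Minimum distance over all pairs (one point from each cluster)."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Maximum distance over all pairs (one point from each cluster)."""
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Mean distance over all pairs (one point from each cluster)."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return dist(centroid(a), centroid(b))


# Hypothetical clusters, chosen so that every pairwise distance is 3 or 5.
cluster_a = [(0.0, 0.0), (0.0, 4.0)]
cluster_b = [(3.0, 0.0), (3.0, 4.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 4.0
print(centroid_linkage(cluster_a, cluster_b))  # 3.0
```

The same pair of clusters is 3, 4, or 5 units apart depending on the criterion, which is why the merge order (and the final clustering) can differ between criteria. In practice a library routine would normally be used instead; for example, SciPy's `scipy.cluster.hierarchy.linkage` accepts `method` values such as `'single'`, `'complete'`, `'average'`, and `'centroid'`.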

Distance Metrics

The choice of distance metric determines how the distance between data points is calculated. Common distance metrics include:

  • Euclidean distance: The straight-line distance between two points in Euclidean space.
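  • Manhattan distance: The sum of the absolute differences between corresponding coordinates of two points.
  • Minkowski distance: A generalization of Euclidean and Manhattan distance with an order parameter p. Setting p = 1 gives Manhattan distance and p = 2 gives Euclidean distance.
  • Cosine distance: One minus the cosine similarity, that is, one minus the cosine of the angle between two vectors. It depends on the direction of the vectors but not on their magnitude.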
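To make the relationship between these metrics concrete, here is a short standard-library sketch. The function names and sample vectors are assumptions for illustration; cosine similarity is converted to a distance as one minus the similarity so that it can be used like the other metrics.

```python
from math import sqrt


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def manhattan(x, y):
    """Minkowski distance with p = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Minkowski distance with p = 2."""
    return minkowski(x, y, 2)


def cosine_distance(x, y):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


u = (0.0, 0.0)
v = (3.0, 4.0)

print(manhattan(u, v))     # 7.0
print(euclidean(u, v))     # 5.0
print(minkowski(u, v, 3))  # about 4.498

# Cosine distance ignores magnitude: parallel vectors are at distance 0,
# perpendicular vectors are at distance 1.
print(cosine_distance((1.0, 0.0), (2.0, 0.0)))  # 0.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0
```

Because cosine distance ignores vector length, it suits data where direction matters more than magnitude, while the Minkowski family treats differences in magnitude as real distance.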

Cutoff Criterion

The cutoff criterion determines when the clustering process should stop. Common cutoff criteria include:

  • Number of clusters: Specify the desired number of clusters. The algorithm will stop when there are only that many clusters remaining.
  • Maximum distance: Set a maximum distance threshold. The algorithm will stop when the distance between the two closest clusters exceeds this threshold.
  • Dendrogram height: Build the full tree, then examine the dendrogram and cut it at a height where there is a natural break, such as a large gap between successive merge distances.
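The first two stopping rules can be sketched with a deliberately naive implementation. This is an illustration under stated assumptions, not an efficient or canonical algorithm: it uses single linkage with Euclidean distance, and the function and parameter names (`agglomerate`, `n_clusters`, `max_distance`) are hypothetical.

```python
from itertools import combinations, product
from math import dist


def closest_pair(clusters):
    """Return (distance, i, j) for the two closest clusters (single linkage)."""
    best = None
    for i, j in combinations(range(len(clusters)), 2):
        d = min(dist(p, q) for p, q in product(clusters[i], clusters[j]))
        if best is None or d < best[0]:
            best = (d, i, j)
    return best


def agglomerate(points, n_clusters=None, max_distance=None):
    """Merge the closest clusters until one of the stopping rules applies."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > 1:
        # Rule 1: stop once the requested number of clusters remains.
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        d, i, j = closest_pair(clusters)
        # Rule 2: stop when even the closest pair is farther than the threshold.
        if max_distance is not None and d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical one-dimensional data with three visible groups.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (25.0,)]

print([len(c) for c in agglomerate(points, n_clusters=3)])      # [3, 2, 1]
print([len(c) for c in agglomerate(points, max_distance=5.0)])  # [3, 2, 1]
print([len(c) for c in agglomerate(points, max_distance=9.0)])  # [5, 1]
```

Raising the threshold from 5 to 9 allows the merge at distance 8, so the first two groups join. Library implementations usually build the full tree once and apply the cutoff afterwards; SciPy's `scipy.cluster.hierarchy.fcluster`, for instance, takes a `criterion` of `'maxclust'` or `'distance'`.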

Other Options

Some additional options available in agglomerative hierarchical clustering include:

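  • Weighting: Assign weights to data points or features to emphasize or de-emphasize their importance in the clustering process. This is separate from the linkage method that some libraries call "weighted" (WPGMA), which averages the distances of the two merged sub-clusters without regard to their sizes.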
  • Data preprocessing: Normalize or standardize the data to ensure that features have comparable scales.
  • Outlier detection: Identify and remove outliers before clustering, since some linkage criteria are strongly affected by them.
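As an illustration of the preprocessing option, the sketch below standardizes each feature to zero mean and unit standard deviation (z-scores) using only the standard library. The `standardize` helper and the sample rows are assumptions made for this example.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 to avoid dividing by zero.
        tuple((v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical data: the second feature is 1000 times larger than the first,
# so it would dominate any Euclidean distance computed on the raw values.
rows = [(1.0, 1000.0), (2.0, 2000.0), (3.0, 3000.0)]
scaled = standardize(rows)

for row in scaled:
    print(row)
```

After scaling, both features contribute equally to the distance between rows, so neither dominates the clustering simply because of its units.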

Each of these options affects the resulting clusters, so the choice of linkage criterion, distance metric, and cutoff should reflect the structure of your data and the goal of the analysis.
