A meaningful score
Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data sample into a specific group (cluster).
Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields. The hope of the data scientist is that samples in the same cluster share similar properties, features or behaviour. For instance, one could run clustering on cancer samples, hoping that samples in the same group belong to the same cancer subtype [2].
Once the clustering has run, one has to evaluate the goodness of the output: first of all, to verify that the groups are actually meaningful; moreover, a score can serve as a proxy to compare different models and choose the best one.
There are many different scores out there, each one with its pros and cons.
One of these measures is the so-called V-measure (or Normalised Mutual Information) score. One of its advantages is that it can be factorised into two easy-to-interpret metrics.
This score can be interpreted as an average of two other measures: homogeneity and completeness.
Homogeneity
Homogeneity measures how similar the samples in each cluster are. It is defined using Shannon entropy.
It is given by

$$h = 1 - \frac{H(C|K)}{H(C)}, \qquad H(C|K) = -\sum_{c,k} \frac{n_{ck}}{n} \log \frac{n_{ck}}{n_k}, \qquad H(C) = -\sum_{c} \frac{n_c}{n} \log \frac{n_c}{n}$$

where n is the total number of samples, nₖ the number of samples in cluster k, n𝒸 the number of samples labelled c, and n𝒸ₖ the number of samples labelled c in cluster k (with h = 1 when H(C) = 0).
It is easy to understand why this is the case. If one looks at the term H(C|K), it contains the ratio n𝒸ₖ / nₖ, the number of samples labelled c in cluster k divided by the total number of samples in cluster k.
When all samples in cluster k have the same label c, this ratio equals 1, its logarithm vanishes, so H(C|K) = 0 and the homogeneity equals 1.
Note that the sum runs over both c and k: it doesn't matter which particular cluster contains a given label, only that some cluster does. This is useful when running unsupervised methods, whose output cluster indices have nothing to do with the label names.
Here is an example of a homogeneous clustering. All three clusters contain only one color (feature); in other words, all the samples in the same cluster share their label.
However, this situation is not optimal, since the orange samples are split into different clusters. In this example, the clustering is not complete.
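To make the formula concrete, here is a minimal from-scratch sketch of the homogeneity score (the homogeneity function and the toy labels are illustrative, not from the original post; scikit-learn's homogeneity_score implements the same measure):

from math import log
from collections import Counter

def homogeneity(labels_true, labels_pred):
    # Sketch of h = 1 - H(C|K) / H(C), following the definitions above.
    n = len(labels_true)
    # H(C): Shannon entropy of the true labels
    h_c = -sum((n_c / n) * log(n_c / n)
               for n_c in Counter(labels_true).values())
    # H(C|K): conditional entropy of the labels given the clusters
    n_ck = Counter(zip(labels_true, labels_pred))  # samples labelled c in cluster k
    n_k = Counter(labels_pred)                     # samples in cluster k
    h_ck = -sum((count / n) * log(count / n_k[k])
                for (c, k), count in n_ck.items())
    return 1.0 if h_c == 0 else 1.0 - h_ck / h_c

# Each cluster contains a single label, so h = 1, even though
# label 0 is split across two clusters (as in the figure above).
print(homogeneity([0, 0, 1, 1], [0, 1, 2, 2]))  # 1.0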
Completeness
Completeness, on the other hand, measures how well similar samples are put together by the clustering algorithm.
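It is defined symmetrically to homogeneity, swapping the roles of labels and clusters [1]:

$$c = 1 - \frac{H(K|C)}{H(K)}, \qquad H(K|C) = -\sum_{c,k} \frac{n_{ck}}{n} \log \frac{n_{ck}}{n_c}$$

with c = 1 by convention when H(K) = 0.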
It is easy to understand why this is the case. If one looks at the term H(K|C), it contains the ratio n𝒸ₖ / n𝒸, the number of samples labelled c in cluster k divided by the total number of samples labelled c.
When all samples with label c have been assigned to the same cluster k, this ratio equals 1 and the completeness equals 1.
Here is an example of a complete clustering. Each color is present in only one group; in other words, all points with the same property have been classified together. Note that complete doesn't imply homogeneous: in fact, each cluster here contains multiple colors.
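This asymmetry is easy to check with scikit-learn (the toy labels below are made up for illustration): a single cluster holding every label is complete but not homogeneous.

from sklearn.metrics import completeness_score, homogeneity_score

# Every sample ends up in one big cluster.
labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 0, 0]
print(completeness_score(labels_true, labels_pred))  # 1.0 — each label sits in a single cluster
print(homogeneity_score(labels_true, labels_pred))   # 0.0 — that cluster mixes both labels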
Now that we have understood these two metrics, it is clear that an optimal algorithm would produce a partition which is both homogeneous and complete. In this case, all samples with a particular color are put together in one cluster, and that cluster contains only them. If the algorithm is unsupervised, it should be clear that with such an output it has correctly learnt the features.
Conversely, the worst possible clustering is one which is neither homogeneous nor complete. In this case each cluster contains multiple labels, and each label is assigned to multiple clusters; both measures are then zero.
Normalised Mutual Information
Finally, to obtain a single measure of the goodness of our clustering algorithm, we can take the harmonic mean of homogeneity and completeness, obtaining the V-measure or Normalised Mutual Information (NMI) [1].
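Explicitly (the paper actually defines a weighted variant V_β, which reduces to this harmonic mean for β = 1 [1]):

$$v = 2 \, \frac{h \cdot c}{h + c}$$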
This score lies between 0 and 1 and quantifies the goodness of the clustering partition. It requires that both homogeneity h and completeness c be maximised (NMI is 1 when both h and c are 1); moreover, being a harmonic mean, it drops to zero as soon as either of the two conditions fails completely.
There are many measures of this kind, but thanks to its factorisation into the two measures homogeneity and completeness, the interpretation of this one is clear and intuitive.
So, now you have performed a clustering and want to estimate the NMI. Scikit-learn implements it, so it is really easy to compute.
Here is a simple example. It also makes clear that the score doesn't depend on the label names: as discussed before, it is the same for every permutation of the labels.
>>> from sklearn.metrics.cluster import v_measure_score
>>> v_measure_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> v_measure_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
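For less ideal partitions, the factorisation can be inspected directly: scikit-learn's homogeneity_completeness_v_measure returns all three scores at once (the toy labels below are just for illustration).

from sklearn.metrics import homogeneity_completeness_v_measure

# Label 1 is split across two clusters: homogeneous but not complete.
h, c, v = homogeneity_completeness_v_measure([0, 0, 1, 1], [0, 0, 1, 2])
print(h, c, v)  # roughly 1.0, 0.667, 0.8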
[1] Rosenberg, A.; Hirschberg, J. V-Measure: A conditional entropy-based external cluster evaluation measure. (2007), Proceedings of EMNLP.
[2] Valle, F.; Osella, M.; Caselle, M. A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data. (2020), Cancers.