clustering Module
Clustering algorithms for unsupervised learning tasks.
@author: drusk
-
class pml.unsupervised.clustering.ClusteredDataSet(dataset, cluster_assignments)
A collection of data which has been analysed by a clustering algorithm.
It contains both the original DataSet and the results of the clustering.
It provides methods for analysing these clustering results.
-
__init__(dataset, cluster_assignments)
Creates a new ClusteredDataSet.
- Args:
- dataset: model.DataSet
- A dataset which does not have cluster assignments.
- cluster_assignments: pandas.Series
- A Series with the cluster assignment for each sample in the
dataset.
-
calculate_purity()
Calculates the purity, a measure of the quality of the clustering
results.
Each cluster is assigned to the class which is most frequent within it.
Purity is then the fraction of samples whose label matches the class
assigned to their cluster.
- Returns:
- A number between 0 and 1. Poor clusterings have a purity close to 0
while a perfect clustering has a purity of 1.
- Raises:
- UnlabelledDataSetError if the dataset is not labelled.
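
The purity calculation described above can be sketched in plain Python. This is a standalone illustration, not pml's implementation: the function name `purity` and the use of plain lists for labels and cluster assignments are assumptions made for the example.

```python
from collections import Counter


def purity(labels, clusters):
    """Fraction of samples whose label matches their cluster's majority label.

    Hypothetical helper: `labels` and `clusters` are equal-length sequences.
    """
    # Group the true labels by the cluster each sample was assigned to.
    by_cluster = {}
    for label, cluster in zip(labels, clusters):
        by_cluster.setdefault(cluster, []).append(label)

    # Each cluster contributes the count of its most frequent label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in by_cluster.values()
    )
    return majority_total / len(labels)


labels = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
print(purity(labels, clusters))  # 5 of 6 samples match their cluster's majority
```

Here cluster 0 is assigned class "a" (2 matches) and cluster 1 is assigned class "b" (3 matches out of 4), giving 5/6.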
-
calculate_rand_index()
Calculate the Rand index, a measurement of quality for the clustering
results. It is essentially the percent accuracy of the clustering.
The clustering is viewed as a series of decisions, one for each of the
N*(N-1)/2 pairs of samples in the dataset. A decision is considered
correct if the two samples in the pair have the same label and are in
the same cluster, or have different labels and are in different
clusters. The number of correct decisions divided by the total number
of decisions gives the Rand index, or accuracy.
- Returns:
- The accuracy, a number between 0 and 1. The closer to 1, the better
the clustering.
- Raises:
- UnlabelledDataSetError if the dataset is not labelled.
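
The pairwise decision counting described above might look like the following. It is a minimal sketch using plain lists, not pml's code; the name `rand_index` is hypothetical, and the function assumes at least two samples.

```python
from itertools import combinations


def rand_index(labels, clusters):
    """Fraction of sample pairs on which the labels and the clustering agree."""
    pairs = list(combinations(range(len(labels)), 2))  # N*(N-1)/2 pairs
    correct = 0
    for i, j in pairs:
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        # Correct when both are true (same label, same cluster)
        # or both are false (different labels, different clusters).
        if same_label == same_cluster:
            correct += 1
    return correct / len(pairs)


labels = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
print(rand_index(labels, clusters))  # 10 correct decisions out of 15 pairs
```

For these six samples there are 15 pairs: 4 share both label and cluster, 6 differ in both, and the remaining 5 are incorrect decisions.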
-
get_cluster_assignments()
Retrieves the cluster assignments produced for this dataset by a
clustering algorithm.
- Returns:
- A pandas Series. It has the same index as the original dataset, and
each value is a number identifying the cluster that sample is part of.
-
pml.unsupervised.clustering.create_random_centroids(dataset, k)
Initializes centroids at random positions.
The random value chosen for each feature will always be limited to the
range of values found in the dataset. For example, if a certain feature
has a minimum value of 0 in the dataset, and maximum value of 9, the
random value chosen for that feature will be between 0 and 9.
- Args:
- dataset: DataSet
- The DataSet to create the random centroids for.
- k: int
- The number of centroids to create.
- Returns:
- A list of centroids. Each centroid is a pandas Series with the same
labels as the dataset’s headers.
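
As a rough illustration of range-limited random initialization, the sketch below draws each feature of each centroid uniformly between that feature's minimum and maximum. It does not use pml or pandas: the `random_centroids` name, the `seed` parameter, and the list-of-dicts data layout are all assumptions for the example.

```python
import random


def random_centroids(rows, k, seed=None):
    """Create k centroids with each feature drawn from its observed range.

    Hypothetical helper: `rows` is a list of dicts mapping feature name
    to a numeric value.
    """
    rng = random.Random(seed)
    # Observed (min, max) for every feature in the data.
    bounds = {
        feature: (min(r[feature] for r in rows), max(r[feature] for r in rows))
        for feature in rows[0]
    }
    return [
        {feature: rng.uniform(lo, hi) for feature, (lo, hi) in bounds.items()}
        for _ in range(k)
    ]


rows = [
    {"height": 0.0, "weight": 50.0},
    {"height": 9.0, "weight": 70.0},
    {"height": 4.5, "weight": 65.0},
]
centroids = random_centroids(rows, k=3, seed=0)
for centroid in centroids:
    print(centroid)
```

Every "height" value produced lies between 0 and 9, and every "weight" value between 50 and 70.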
-
pml.unsupervised.clustering.kmeans(dataset, k=2, create_centroids=create_random_centroids)
K-means clustering algorithm.
This algorithm partitions a dataset into k clusters in which each
observation (sample) belongs to the cluster with the nearest mean.
- Args:
- dataset: model.DataSet
- The DataSet to perform the clustering on.
- k: int
- The number of clusters to partition the dataset into.
- create_centroids: function
- The function specifying how to create the initial centroids for the
clusters. Defaults to creating them randomly.
- Returns:
- A ClusteredDataSet which contains the cluster assignments as well as the
original data. In the cluster assignments, each sample index is
assigned a numerical value representing the cluster it is part of.
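
To make the "nearest mean" partitioning concrete, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is not pml's implementation: the name `kmeans_sketch`, the `max_iterations` parameter, the tuple-based points, and passing the initial centroids explicitly (rather than via a `create_centroids` function) are assumptions chosen to keep the example deterministic. It requires Python 3.8+ for `math.dist`.

```python
import math


def kmeans_sketch(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignments, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignments, centroids = kmeans_sketch(points, [(0.0, 0.0), (10.0, 10.0)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The returned list plays the role of the cluster assignments described above: one cluster number per sample, in the same order as the input.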