Bio-inspired Subspace clustering algorithms

According to Banzhaf et al. 2006, bio-inspired optimization algorithms could be improved by incorporating knowledge from molecular and evolutionary biology. A promising source of advances in optimization is one of the important phenomena in evolutionary biology: the dynamic evolution of the genome structure. Several studies showed for instance that an evolvable genome structure allows evolution to modify the effects that evolution operators (e.g., mutations) have on individuals, a phenomenon known as evolution of evolution Hindré et al. 2012. We aim to take advantage of evolution of evolution mechanisms to achieve data mining tasks on dynamic data sets. This has been my Ph.D. research topic, more information about this topic is available in my Ph.D. manuscript and my Ph.D. defense presentation. In this webpage I give only brief information about each one of the algorithms that were developped during this project.

Chameleoclust

A first major step in this project was to design and assess Chameleoclust, a subspace clustering evolutionnary algorithm. The subspace clustering task is recognized as more difficult than standard clustering since it requires to identify not only the clusters but also the various subspaces where the clusters hold. We proposed to tackle this problem with Chameleoclust, a bio-inspired algorithm that incorporates a genome with an evolvable structure to allow for evolution of evolution. The results obtained with this algorithm suggest that evolution of evolution, through an evolvable genome structure, is usefull to solve a difficult problem such as subspace clustering. The reader is refered to [Peignier et al. 2015] or [Peignier el al. 2015] for a detailed description of Chameleoclust. This work was presented at the international ACM conference of Genetic and Evolutionary Computation Conference where it received the best paper award in the category Evolutionary Machine Learning.

Chameleoclust+

ChameleoClust+ is an extension of Chameleoclust. It is a bio-inspired algorithm implementing an evolvable genome structure, including several bio-like features such as a variable genome length, both functional and non-functional elements and mutation operators including chromosomal rearrangements. The main purpose of the design of ChameleoClust+ is to take advantage of the large degree of freedom provided by its evolvable structure to detect various number of clusters in subspaces of various dimensions.

Chameleoclust+ runs as a python module implemented internally in C++. The software can be downloaded here or in the webpage of the EvoEvo Project.

KymeroClust

KymeroClust is an algorithm based on more abstract evolutionary mechanisms, that is more efficient while still permitting to adapt the number of clusters and their subspace dimensionalities, and preserving the overall clustering quality. KymeroClust is based on the same phenotypic representation used in Chameleoclust: each individual has a phenotype constituted of a set of core-points, each of them being defined in its own subspace. For both algorithms a core-point can be perceived as a phenotypic trait and the coordinate of a core-point along a given dimension can be seen as a biological function that contributes to a given phenotypic trait.

KymeroClust is organized in two different steps, an initial one called construction phase that operates during the first generations and a stationary phase that operates once the construction step has been completed until the end of the exe- cution. The organisms are initialized with an empty genome and the corresponding phenotype is simply a unique core-point located at zero along the different dimensions (barycenter of the standardized dataset). During the construction phase, whenever a new individual is produced by replication, one duplication-divergence operation is applied, but no deletions are performed. Consequently the genomes of the individuals are progressively built from scratch along generations by successive duplications-divergences until the genomes reach a maximal genome size that has been specified by the user as a parameter. Once the construction phase ends, the algorithm enters into a stationary phase and remains in this phase until the end of the execution. During this phase the gene dele- tion operator is used in addition to the gene duplication-divergence operator (the gene deletion being applied before the gene duplication-divergence operator).

This algorithm has been used to cluster human motion and generate music samples in order to create the EvoMove system, a musical companion for dansers.

KymeroClust also gave birth to a median-subspace clustering algorithm based on a weighted based strategy, called SubCMedians. This algorithm was presented in the 33rd ACM/SIGAPP Symposium On Applied Computing and won the best-paper award in the Information Systems category. A light Knime implementation of this algorithm can be download here, and a C++ library for Python can be downloaded here

SubMorphoStream

An important variant of the subspace clustering problem is to consider objects arriving as streaming data, where subspace clusters change along the stream. In this case, the clustering algorithm must still determine the groups of similar objects (the clusters) together with the subspaces in which these similarities hold (one subspace per cluster), but must also be able to adapt them to the dynamics of the stream. The main kinds of underlying changes in data streams are cluster drift (clusters move in space), cluster inflation/deflation (cluster sizes changing), cluster appearing/disappearing suddenly, subspace modifications (dimensions being added or removed). They require great adaptation ability, especially when several of these kinds of changes are interleaved in the same stream. Interestingly, living organisms inhabit ever changing environments, and consequently different mechanisms, that allow individuals to cope with adaptation to new environments, have evolved. This ability may be enhanced by the acquisition of foreign genetic material through mechanisms such as bacterial competence. Competent bacteria uptake and incorporate genetic material from their extracellular environment, such as genetic material released by organisms of different species. This mechanism may drive a fast adaptation to a new environment since the uptake of a DNA fragment containing an entire functional group of genes may lead to the acquisition of a completely new phenotypic trait in a single step. In parallel, another phenomenon, known as gene amplification, also facilitates the adaptation to new and changing environments. It consists in duplication and deletion of gene copies within tandem arrays, and through dosage- effect it works as a rudimentary form of regulation, being able to cope with inefficient genes (e.g., recently acquired genes).

The use of evolutionary algorithms to address the subspace clustering problem for data streams seem a promising choice. However, even if different algorithms that extend the major families of subspace clustering algorithms have been developed to address this problem, none of them rely on evolutionary mechanisms. SubMorphoStream, is an evolutionary algorithm that combines a mechanism inspired by genetic material uptake together with gene amplification, to compute and update a subspace clustering over a dynamic data stream. SubMorphoStream belongs to the category of the axis-parallel clustering approaches, and searches for globular-shaped clusters, grouping objects around centers.