Busra Tas Kiper: Contemporary developments and applications of unsupervised machine learning methods
Time: Fri 2024-09-13 at 13.00
Location: Hörsal 2 (Lecture Hall 2), House 2, Albano campus, Stockholm University
Respondent: Busra Tas Kiper, Department of Mathematics, Stockholm University
Opponent: Jonas Wallin (Lund University)
Supervisor: Chun-Biu Li (Stockholm University)
Abstract.
This thesis presents state-of-the-art developments in the field of unsupervised learning, particularly in clustering analysis. Unsupervised learning is a branch of machine learning whose task is to discover hidden patterns and relationships in high-dimensional data without any labels. It is an important step in providing valuable insights for downstream statistical analyses, e.g., the existence of important discrete structures and low-dimensional features, as well as in revealing anomalies. The contributions of this thesis, detailed below, extend the toolbox for pattern recognition and anomaly detection, with potential applications in many scientific areas that work with unstructured and unlabelled data.
Paper I presents the application of unsupervised change point (CP) detection to molecular time series to explain the dynamics of motor proteins. Data-driven, non-parametric detection of CPs enables an objective identification and modeling of stepping patterns in molecular motors. Beyond CP detection, this study provides further tools to analyze molecular motors, such as the reliable extraction of reaction statistics and the construction of a predictive model for the reaction rates. The methods developed and applied in this paper are applicable to time series data from a broad range of scientific fields.
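To illustrate what detecting a change point means in the simplest setting, the sketch below locates a single shift in the mean of a one-dimensional trace by least squares. It is not the non-parametric method used in Paper I; the function name `best_single_change_point` and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def best_single_change_point(x):
    """Return the split index k (1 <= k < n) that minimizes the summed
    within-segment squared error of x[:k] and x[k:] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 2:
        raise ValueError("need at least two samples")
    best_k, best_cost = None, np.inf
    for k in range(1, n):
        left, right = x[:k], x[k:]
        # Cost of modelling each segment by its own constant mean.
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Noise-free toy step: the level jumps from 0 to 5 at index 3.
k = best_single_change_point([0, 0, 0, 5, 5, 5])  # k == 3
```

Traces with several steps are commonly handled by applying such a split recursively to each segment, together with a stopping rule that decides whether a further split is justified.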
Paper II proposes the Graph-based Fuzzy Density Peak Clustering (GF-DPC) method, which comprises comprehensive generalizations of existing density-based clustering methods. The first generalization employs graph-based methods to estimate densities and capture nonlinearities in the data, which enhances the power to detect clusters with arbitrary shapes. Second, a fuzzy extension is formulated to provide a probabilistic framework for assigning data points to clusters. Finally, the identification of cluster centers and of the number of clusters is automated by means of a fuzzy clustering validation index. Compared with other well-known fuzzy clustering methods, the superior performance of GF-DPC in discovering clusters with arbitrary shapes, densities, separations and overlaps is demonstrated using both intuitive examples and real datasets.
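For orientation, the classical density peak idea that density-based methods of this kind build on can be sketched as follows. This is not GF-DPC: the sketch assumes plain Euclidean distances, a Gaussian kernel density with a user-chosen cutoff `d_c`, and crisp rather than fuzzy reasoning. The function name `density_peak_scores` and all parameter values are illustrative assumptions.

```python
import numpy as np

def density_peak_scores(X, d_c):
    """Classical density-peak quantities (illustrative sketch):
    rho[i]   = Gaussian-kernel local density of point i,
    delta[i] = distance from i to the nearest point of higher density
               (largest distance from i if no denser point exists)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Subtract 1 to remove each point's own kernel contribution.
    rho = np.exp(-(D / d_c) ** 2).sum(axis=1) - 1.0
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = D[i, higher].min() if higher.any() else D[i].max()
    return rho, delta

# Two well-separated synthetic blobs of 30 points each (assumed toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(10.0, 0.3, size=(30, 2))])
rho, delta = density_peak_scores(X, d_c=1.0)
# Candidate centers: points where both density and separation are large.
centers = np.argsort(rho * delta)[-2:]
```

In the classical scheme the number of centers is read off by the user from a plot of `delta` against `rho`; here it is fixed to two by hand, which is exactly the manual step that an automated selection criterion is meant to replace.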
Paper III establishes a versatile validation framework for fuzzy clustering, termed the Shape-aware Generalized Silhouette Analysis (SAGSA), based on the silhouette index. In SAGSA, a probabilistic framework is formulated to quantify the degree of cohesion and separation of the detected fuzzy clusters. In addition, graph-based distances are employed in SAGSA to facilitate an accurate validation of nonlinear clustering structures. Most importantly, a 2-dimensional graphical tool, the cohesion-separation (CS) plot, is introduced to enable visual diagnosis of possible problems in the clustering results at the point-wise, cluster-wise and global levels, regardless of the dimensionality of the dataset. Finally, we illustrate the effectiveness of SAGSA in cluster validation, compared with other commonly used methods, on various test examples of clustering challenges. These include clusters with arbitrary shapes, imbalanced sizes, overlaps, hierarchical structures, noise contamination, etc.
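As background for the silhouette index on which SAGSA is based, the sketch below computes the classical silhouette value s(i) = (b(i) - a(i)) / max(a(i), b(i)) for crisp labels, where a(i) is the mean distance from point i to the other members of its own cluster (cohesion) and b(i) is the smallest mean distance to any other cluster (separation). It uses Euclidean distances and hard assignments, so it is the standard index rather than the fuzzy, graph-based generalization of Paper III; the function name is an assumption.

```python
import numpy as np

def silhouette_values(X, labels):
    """Classical point-wise silhouette values for a crisp clustering."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("silhouette needs at least two clusters")
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # usual convention: s(i) = 0 for singleton clusters
        # D[i, i] = 0, so dividing by n_own - 1 averages over the others.
        a = D[i, own].sum() / (n_own - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Four points on a line forming two tight, well-separated pairs.
X = [[0.0], [1.0], [10.0], [11.0]]
s = silhouette_values(X, [0, 0, 1, 1])
```

Values close to 1 indicate points that are well matched to their own cluster and far from the others, values near 0 indicate points on a boundary, and negative values suggest a possible misassignment.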
Keywords:
Clustering analysis, Fuzzy clustering, Graph-based methods, Clustering validation, Time series analysis, Change point detection.