Open Conference Systems, MISEIC 2019

Hierarchical Clustering Algorithm Based on Density Peaks using Kernel Function for Thalassemia Classification
Sri Hartini, Zuherman Rustam

Last modified: 2019-10-08

Abstract


Thalassemia is an inherited blood disorder and one of the most common genetic disorders in the world. It is prevalent in Mediterranean countries, the Middle East, Central Asia, India, Southern China, and the Far East, as well as in countries along the north coast of Africa and in South America. The total annual incidence of symptomatic cases is estimated at 1 in 100,000 people worldwide and 1 in 10,000 people in the European Union. According to the Thalassemia International Federation, only 200,000 patients with thalassemia major are alive and registered as receiving regular treatment around the world. Classifying thalassemia is the first step in treating a patient. An accurate diagnosis is therefore important so that appropriate treatment can improve the patient's life expectancy.

 

Many studies have addressed this classification problem using various machine learning techniques. Paokanta et al., for instance, used Bayesian Networks (BNs), Multinomial Logistic Regression, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), and Naïve Bayes, reaching accuracies of up to 80%. Neural networks and decision trees have also been applied to thalassemia classification using characteristics of red blood cells, reticulocytes, and platelets. Moreover, applying artificial intelligence algorithms to Complete Blood Count (CBC) test data, Yousefian et al. found that red blood cells, hemoglobin, mean corpuscular volume, and hematocrit are important characteristics for classifying thalassemia.

 

In this paper, we propose a new kernel-based method, modified from hierarchical clustering based on density peaks (HCDP). Rong Zhou et al. introduced HCDP to address a shortcoming of other density peaks-based methods, whose parameters are not straightforward to set. Using the concepts of k-nearest neighbors and hierarchical clustering, HCDP consists of three steps: local density calculation, hierarchy representation, and optimal cluster extraction. Our modification is based on the Gaussian radial basis kernel function: every distance computation is replaced by the distance between the two input vectors in the feature space induced by that kernel. The kernel function is expected to improve the method's ability to separate data that are not linearly separable.
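
The distance substitution described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the standard kernel-trick identity ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y), and the function names and the bandwidth parameter `sigma` are hypothetical choices that the abstract does not specify.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian radial basis kernel K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    sigma is an assumed bandwidth parameter, not a value taken from the paper.
    """
    sq_euclid = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_euclid / (2.0 * sigma ** 2))

def kernel_distance(x, y, sigma=1.0):
    """Distance between x and y in the feature space induced by the kernel."""
    sq = (rbf_kernel(x, x, sigma)
          - 2.0 * rbf_kernel(x, y, sigma)
          + rbf_kernel(y, y, sigma))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sq, 0.0))

# Toy vectors for illustration only, not samples from the hospital dataset.
a = (12.5, 38.0)
b = (9.1, 29.5)
d_ab = kernel_distance(a, b, sigma=5.0)
```

Because K(x, x) = 1 for the Gaussian kernel, this distance reduces to sqrt(2 - 2 K(x, y)) and is bounded above by sqrt(2). In a sketch like this, `kernel_distance` would stand in wherever the clustering steps call for a distance between two samples.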

 

We applied the proposed method to a dataset from Harapan Kita Hospital, West Jakarta, Indonesia, consisting of 82 thalassemia and 68 non-thalassemia samples. All instances are numerical and are described by 10 attributes: hemoglobin (g/dL), hematocrit (%), white blood cells (1000/µL), basophils (%), eosinophils (%), rod neutrophils (%), segment neutrophils (%), lymphocytes (%), monocytes (%), and platelet count (1000/µL).

The performance of HCDP with and without the kernel function is then compared by computing the accuracy from the confusion matrix. HCDP alone gives approximately 80.69% accuracy, whereas 85.64% accuracy is obtained when the method is combined with the kernel function. These values are fairly good but leave room for improvement, although both methods run quickly, in less than six seconds.
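
As a reminder of how accuracy follows from a confusion matrix, here is a small sketch. The labels are toy values chosen for illustration, not the study's data or results, and the helper names are hypothetical. It also assumes that cluster assignments have already been mapped to the two class labels, a step the abstract does not describe.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1      # thalassemia predicted, thalassemia observed
        elif p == positive:
            fp += 1      # thalassemia predicted, non-thalassemia observed
        elif t == positive:
            fn += 1      # non-thalassemia predicted, thalassemia observed
        else:
            tn += 1      # non-thalassemia predicted and observed
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return (tp + tn) / (tp + fp + fn + tn)

# Toy labels: 1 = thalassemia, 0 = non-thalassemia (illustrative only).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)  # 3 correct out of 5 -> 0.6
```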

 

For future research, a new model or method could be developed to obtain better performance. Other kernel parameters could also be used for the same purpose. Furthermore, these methods could be applied to larger datasets so that the model can be tested against greater data complexity. Better performance is expected to yield more accurate classification and, in turn, better diagnosis.


Keywords


Classification; Density Peaks; Hierarchical Clustering; Kernel Function; Thalassemia.