Implementation of K-means Clustering on SIPP-KLING Dashboard Applications

This research groups healthy house (rumah_sehat) data into five clusters: Very Unhealthy, Unhealthy, Not Healthy Yet, Healthy, and Very Healthy. Seventeen criteria serve as input parameters for the K-Means calculation, and the research aims to group 8969 houses into these clusters. The clustering results can help decision makers (the government) analyze which parts of a house need more attention and which areas fall below healthy standards. The test results show 3308 Very Healthy, 2496 Healthy, 792 Not Healthy Yet, 1706 Unhealthy, and 667 Very Unhealthy houses. Measured with a confusion matrix, the accuracy of the method is 87.05%, with precision of 95.64% and 75.81% and recall of 83.82% and 92.98% for the healthy and unhealthy classes, respectively. Based on ROC diagnostic value levels, an accuracy of 87.05% is categorized as good clustering.


I. INTRODUCTION
SIPP-KLING is an environmental health mapping profile information system for UPT Puskesmas Limo. The application aims to facilitate service, health, coaching, and assistance to the community in implementing PHBS (Clean and Healthy Life Behavior) in the surrounding environment. The current SIPP-KLING system produces only two final categories, Healthy and Unhealthy, obtained from the final total value. Because there are only two categories, the analysis obtained for each village is not specific. Supervision and evaluation cannot be carried out optimally because the gap within the same category is too wide: houses in the same category have different health problems and require different handling. This lack of analysis also affects the funds given to each region, because the funds channeled will not reach the areas that need them most.
Based on the conditions above, the grouping of the SIPP-KLING dataset needs to be improved to obtain more specific groups. This research uses K-Means Clustering to group 8969 houses into clusters and to extract information and important patterns of interest in the SIPP-KLING application. K-Means is one of the most widely used and well-studied methods in data mining [1]. According to [2], clustering analysis is useful for drawing meaningful information or interesting patterns from data sets and is used in many fields, such as bioinformatics, pattern recognition, image processing, data mining, marketing, and economics, to uncover hidden knowledge.

II. K-MEANS CLUSTERING
Clustering is the process of grouping data objects into disjoint groups called clusters, so that data in the same cluster are similar to each other and different from data in other clusters [3]. The main aim of clustering is to group similar objects together. Classification and clustering are often confused, but in classification objects are assigned to predefined classes, while in clustering the classes are created [1].
The K-Means algorithm uses an iterative process to cluster a database. It takes the desired number of clusters as input and produces that number of final clusters as output: if the algorithm is required to generate K clusters, there will be K initial clusters and K final clusters. The K-Means method randomly selects k patterns as the starting centroids. The number of iterations needed to reach the final cluster centroids, at which the position of each new centroid no longer changes, is affected by the random choice of initial centroids. The distance between each data object and each of the k chosen initial centers is calculated using the Euclidean distance formula to find the closest centroid. Data closest to a given centroid form a cluster [1].
On completion, the K-Means algorithm produces the centroid points, which is the purpose of the algorithm. After the K-Means iterations stop, each object in the dataset is a member of one cluster. Cluster membership is determined by searching, for each object, for the cluster with the closest distance to it. The K-Means algorithm thus groups the data items in a dataset into clusters based on the closest distance [4]. The steps of the K-Means algorithm are as follows [5]:
INPUT: Number of desired clusters K; data objects D = {d1, d2, ..., dn}
Steps:
1. Randomly select K data objects from data set D as the initial centers.
2. Repeat:
   a. Calculate the distance of each data object di (1 ≤ i ≤ n) from all k cluster centers Cj (1 ≤ j ≤ k), and assign di to the nearest cluster.
   b. For each cluster j (1 ≤ j ≤ k), recalculate the cluster center.
   until there is no change in the cluster centers.
OUTPUT: A set of K clusters
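The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in SIPP-KLING: the function names `euclidean` and `k_means`, the `max_iter` and `seed` parameters, and the handling of empty clusters are all choices made here for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(data, k, max_iter=100, seed=0):
    """Plain K-Means: returns (labels, centers).

    Assumptions for this sketch: `data` is a list of equal-length numeric
    sequences, k <= len(data), and an empty cluster keeps its old center.
    """
    rng = random.Random(seed)
    # Step 1: randomly select K data objects as the initial centers.
    centers = [list(p) for p in rng.sample(data, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2a: assign every object to its nearest center.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in data
        ]
        # Step 2b: recalculate each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centers.append(centers[j])
        # Stop when no center has changed.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical two-attribute data with two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

The result depends on the random initial centers, which is why the sketch takes a `seed`; on well-separated data such as the toy points above, the same two groups are recovered regardless of the starting choice.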

III. DESIGN AND REALIZATION
A. Design
Dashboard SIPP-KLING (Environmental Health Mapping Profile Information System) is a web-based system that applies the K-Means clustering method to turn the collected data into information from which knowledge is easy to learn. The actors involved in this application are the admin and the cadres: the admin manages the SIPP-KLING web dashboard, while the cadres collect the data from citizens. The system lets the admin control data processing activities and helps the admin produce reports and make decisions.

B. Implementation of K-Means Algorithm
This stage explains the steps for running the K-Means algorithm manually:
1. The number of clusters is set to five (k = 5). The clusters are based on conceptual observations by experts and are grouped into five categories, namely Healthy, Unhealthy, Not Healthy Yet, Very Healthy, and Very Unhealthy. The data used consist of 8969 SIPP-KLING records with 17 attributes taken from the Dinkes criteria for determining a Healthy House. The criteria were obtained through an interview with dr. Tiur. They are labelled r1 to r17, namely: ceiling, wall, floor, bedroom window, family room window, ventilation, kitchen exhaust fan, lighting, toilet, safe drinking water, sewer system, rubbish management, opening the bedroom window, opening the family room window, keeping the house and yard clean, disposing of baby and toddler feces in the toilet, and disposing of garbage in its place.
2. The initial centroid of each cluster is determined as shown in TABLE 1. The distance matrix is then obtained, namely C1, C2, C3, C4 and C5, as follows: Data distance of cluster 1 is:
If the new centroid differs from the previous centroid, the process continues to the next iteration. If the newly calculated centroid is the same as the previous centroid, the clustering process is complete. In this calculation, the process stops at the 12th iteration, as shown in TABLE 4 and TABLE 5. After the cluster label of each data item is obtained, the average value is found by summing all members of each cluster and dividing by the number of members. Because the centroid no longer changes (it is the same as the previous centroid), the clustering process is complete and the data are grouped by their closest distance to a cluster. Of the 8969 data items, K-Means grouped 3308 into the Very Healthy category, 2496 into Healthy, 792 into Not Healthy Yet, 1706 into Unhealthy, and 667 into Very Unhealthy.
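The centroid update and the stopping rule described above can be illustrated with a short sketch. The helper names `update_centroids` and `has_converged`, the tolerance `tol`, and the toy two-attribute data are assumptions made for this example; the paper's tables hold the actual 17-attribute values.

```python
import math


def update_centroids(data, labels, old_centroids):
    """New centroid of each cluster = sum of its members / number of members."""
    new_centroids = []
    for j, old in enumerate(old_centroids):
        members = [row for row, label in zip(data, labels) if label == j]
        if not members:
            # Assumption: a cluster with no members keeps its old centroid.
            new_centroids.append(list(old))
            continue
        new_centroids.append([sum(col) / len(members) for col in zip(*members)])
    return new_centroids


def has_converged(old_centroids, new_centroids, tol=1e-9):
    """True when every centroid coordinate is unchanged (within tol)."""
    return all(
        math.isclose(a, b, abs_tol=tol)
        for old, new in zip(old_centroids, new_centroids)
        for a, b in zip(old, new)
    )


# Hypothetical data: four rows already labelled with clusters 0 and 1.
data = [[1, 1], [3, 1], [8, 9], [10, 9]]
labels = [0, 0, 1, 1]
old = [[2.0, 1.0], [9.0, 9.0]]

new = update_centroids(data, labels, old)
print(new)                      # [[2.0, 1.0], [9.0, 9.0]]
print(has_converged(old, new))  # True, so the iteration would stop here
```

Comparing with a small tolerance rather than exact equality is a design choice of the sketch; it avoids extra iterations caused only by floating-point rounding.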

C. System Implementation
Figure 1 shows the implementation of the main page, which displays the SIPP-KLING dashboard. On this page the system displays the total amount of data by status category, such as whether a place is healthy or not, feasible or not, and has many or few deviations. The grouped data are the result of the K-Means process. Figure 2 shows the implementation of the analysis page for each village. The analysis is carried out per attribute to see which attributes most influence the village. The highest scale on the graph marks the attribute with the greatest influence and serves as a warning for the village. The page has several buttons: the first sends the status to the database, and the second sends the distance value of the attribute to the database.

D. Testing Data Accuracy
The accuracy of the algorithm is tested using a confusion matrix [6]. The test aims to determine the performance of the K-Means Clustering algorithm in classifying the data into the predetermined clusters.
Assuming that the status of the K-means is very healthy and healthy compared

IV. CONCLUSION
The implementation of K-Means Clustering classified the 8969 data items into five Healthy House categories, namely Very Healthy, Healthy, Not Healthy Yet, Unhealthy, and Very Unhealthy. The data obtained from the K-Means clustering results can help to analyze which parts of a house should be better addressed and which areas have lower levels of health. The K-Means Clustering accuracy test obtained an accuracy of 87%, which falls in the category of good clustering.

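3. Calculate the distance between the data and the centroid. The distance between the data and the centroid is measured using the Euclidean distance formula (eq. 1):
$$d(x, y) = \lVert x - y \rVert = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$
where x is a data object, y is a centroid, and n is the number of attributes.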
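As a small worked example of the Euclidean distance in eq. 1, the sketch below computes the distance from one data object to several centroids and picks the nearest. The function names and the three-attribute values are hypothetical and chosen only to keep the arithmetic easy to follow; the paper's data have 17 attributes and five centroids.

```python
import math


def euclidean_distance(x, y):
    """Eq. 1: square root of the summed squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def nearest_centroid(x, centroids):
    """Return (index of the closest centroid, list of all distances)."""
    distances = [euclidean_distance(x, c) for c in centroids]
    return distances.index(min(distances)), distances


# Hypothetical house with three attributes and three candidate centroids.
house = [3, 4, 0]
centroids = [[0, 0, 0], [3, 4, 1], [6, 8, 0]]

cluster, distances = nearest_centroid(house, centroids)
print(distances)  # [5.0, 1.0, 5.0]
print(cluster)    # 1
```

The list of distances corresponds to one row of the distance matrix (one entry per cluster), and the index of the smallest entry is the cluster to which the house is assigned.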

Figure 1. Main Page
Figure 2. Analysis Page

TABLE 2.
The Result of the 1st Iteration

TABLE 3.
Average Centroid Value in the 1st Iteration

TABLE 4.
Data Grouping in the 12th Iteration

TABLE 5.
Average Centroid Value in the 12th Iteration

TABLE 6.
Value of Confusion Matrix Data Analysis
The test results above show that the accuracy of this method is 87.05%, with precision of 95.64% for the healthy class and 75.81% for the unhealthy class, and recall of 83.82% for the healthy class and 92.98% for the unhealthy class. The accuracy, precision, and recall results for the Healthy House data grouping can be seen in TABLE 7.
a. Precision
Precision is the number of true positives (positive data correctly recognized as positive) divided by the total number of data recognized as positive. The test gives a precision of 95.64% for Healthy House and 75.81% for Unhealthy House.
Precision = (TP / (FP + TP)) x 100% ↔ 4865 / (
F-Measure = 2 / (1/recall + 1/precision) or F(i,j) = (2 * recall(i,j) * precision(i,j)) / (precision(i,j) + recall(i,j))
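The measures above follow directly from the four cells of a two-class confusion matrix. The sketch below is illustrative only: the function name `binary_metrics` is an assumption, and the counts in the usage example are hypothetical toy values, not the values from TABLE 6.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure for the positive class.

    tp: positives recognized as positive    fp: negatives recognized as positive
    fn: positives recognized as negative    tn: negatives recognized as negative
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to check.
result = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in result.items():
    print(f"{name}: {value * 100:.2f}%")
```

To obtain the figures for the other class, the same function can be called with the roles of the two classes swapped, that is, with `tp` and `tn` exchanged and `fp` and `fn` exchanged.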

TABLE 7
Accuracy Result