Implementing K-Nearest Neighbour and Naïve Bayes algorithms to predict diabetes diagnosis in Pima Indian patients

Rosa Caminal
8 min read · May 27, 2020

Data Science project by Rosa Caminal Diaz and Serge Naufal

Source: Unsplash.com

1. Aim

The aim of this study is to evaluate the performance of different classifiers when applied to predicting a diabetes diagnosis in Pima Indian patients. We will be using two classification algorithms, K-Nearest Neighbour (kNN) and Naïve Bayes (NB), to predict whether an individual will test positive for diabetes.

In order to evaluate the different classifiers we will be:

  1. Using the stratified cross-validation method
  2. Investigating the effect of feature selection
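The stratified split in step 1 keeps each fold's class balance close to that of the full dataset. A minimal sketch of the idea, in Python rather than the MATLAB used in the project, with made-up labels that mimic the Pima class proportions:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Split sample indices into k folds that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's indices round-robin so every fold gets its share
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# Toy labels mimicking the Pima class balance (~65% "no", ~35% "yes")
labels = ["no"] * 65 + ["yes"] * 35
folds = stratified_folds(labels, k=10)
for fold in folds:
    positives = sum(labels[i] == "yes" for i in fold)
    print(len(fold), positives)  # each fold holds 3-4 positives
```

Each fold then serves once as the test set while the remaining nine train the classifier, and the ten accuracies are averaged.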

This study is important because it offers a means of accurately diagnosing diabetes from patients’ personal characteristics and test measurements. Improved prediction of the disease could therefore help address the elevated risk of diabetes faced by Native American communities.

2. Data

2.1 Description of the dataset

The dataset used in this project is originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. It was collected to predict whether a patient has diabetes based on certain diagnostic measurements. The dataset was donated by Vincent Sigillito (The Johns Hopkins University) to the UCI Machine Learning Repository.

It contains data on 768 patients, described by 8 attributes plus a class label:

  1. Number of times pregnant
  2. Plasma glucose concentration (2 hours in an oral glucose tolerance test)
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skinfold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)²)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (“yes” or “no”)

2.2 Dataset preparation

The first step in building a successful classifier is preparing the data. We first check the dataset for missing values and, on inspection, find none.

The second step is to normalise the values in order to avoid bias when building the model. Normalisation scales all attributes in the same way, so that every value lies in the range [0, 1]. This step is especially important here because one of our classifiers (kNN) relies on a distance measure: the attributes are recorded in different units (mm, years, kg), so without normalisation the attributes with larger numeric ranges would dominate the distance and bias the predictions of new instances.
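Min-max scaling of one attribute column can be sketched as follows (a Python illustration with made-up ages; the project normalisation was done in MATLAB):

```python
def min_max_normalise(column):
    """Scale a list of numeric values into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant column: map everything to 0
    return [(x - lo) / (hi - lo) for x in column]

ages = [21, 33, 50, 81]  # made-up ages in years
print(min_max_normalise(ages))  # [0.0, 0.2, 0.483..., 1.0]
```

After this transform, an age difference of one year and a blood-pressure difference of one mm Hg contribute comparably to a Euclidean distance.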

2.3 Correlation-based Feature Selection

To select the most relevant attributes for our model, we need to carry out feature selection. To facilitate this task we use the Correlation-based Feature Selection (CFS) function in Weka.

The aim of CFS is to find the subset of features that best predicts the diagnosis, considering the features both individually and in combination. A good subset is one whose features are highly correlated with the class (the diagnosis result) but uncorrelated with each other. CFS also reduces dimensionality, which is often a problem for classifiers.

After carrying out CFS in Weka, the selected attributes are:

  1. Plasma glucose concentration
  2. 2-Hour serum insulin
  3. Body mass index (weight in kg/(height in m)²)
  4. Diabetes pedigree function
  5. Age (years)

3. Results and discussion

We tested each of the classification algorithms with both the dataset without feature selection and the dataset with CFS implemented in order to compare the results.

3.1 Accuracy results

After running several different classifiers in Weka as well as our implementations of kNN and NB, we obtained the following accuracy results (in percentage):

Table 1. Accuracy results of several classifiers tested in Weka
Table 2. Accuracy results of kNN and NB algorithms implemented and tested on MATLAB

We also present the confusion matrices for the above results:

Table 3. Confusion matrices of kNN and NB tested in Weka
Table 4. Confusion matrices of kNN and NB algorithms implemented and tested on MATLAB

3.2 Discussion on the results of Weka’s implementation

We start by discussing the results obtained on Weka (referencing Tables 1&3 above).

From Table 1, we observe that the accuracy percentages for all classifiers without feature selection are very close, if not identical, to those with CFS. Although classifiers such as 1R and 5NN report the same accuracy percentage, the confusion matrices in Table 3 reveal some minor differences.

This lines up with our expectations of the feature selection carried out by Weka, which chooses only the “best” features: those highly correlated with the class but uncorrelated with each other, as described in Section 2.

Figure 1. Bar graph illustrating the percentage accuracy of all methods tested on Weka

Now comparing the classifiers’ performances (with and without CFS), we can clearly see in Figure 1 that the most accurate classifier in both cases is the Support Vector Machine (SVM), reaching 76.69% with CFS, while the least accurate, by a large margin in both cases, is ZeroR at 65.10%.

That result was expected: ZeroR is the simplest classification method, ignoring all predictors and always predicting the majority class, so it serves only as a benchmark against which to compare the other classifiers’ performances.
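To make the benchmark concrete: the Pima dataset contains 500 negative and 268 positive cases, so always predicting “no” yields exactly the 65.10% reported above. A quick Python check:

```python
# ZeroR simply predicts the majority class for every instance, so its
# accuracy equals the majority-class proportion of the dataset.
majority, total = 500, 768  # negative and total cases in the Pima dataset
accuracy = majority / total
print(f"{accuracy:.2%}")  # 65.10%
```

Any classifier that cannot beat this number has learned nothing from the attributes.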

3.3 Discussion on the results of our implementation

Now, we discuss the results obtained on MATLAB from our implementation of the kNN and NB algorithms (referencing Tables 2 & 4 above).

From Table 2, we again observe that the accuracy percentages for all classifiers without feature selection are very close to those with CFS, and also to the corresponding Weka accuracies in Table 1.

Figure 2. Bar graph illustrating the percentage accuracy of our MATLAB implemented methods

Comparing the classifiers’ performances (with and without CFS), we can see in Figure 2 that the best result was obtained using NB at an accuracy of 76.56% (with CFS) and the worst was obtained using 1NN at an accuracy of 68.23% (before the implementation of CFS).
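For reference, the core of the kNN decision rule we implemented can be sketched as follows. This is a Python illustration of the standard algorithm, not the project's MATLAB code, and the training points are made up:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Classify `query` by majority vote among its k nearest training
    points, using Euclidean distance on the normalised attributes."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny made-up example with two normalised features
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = ["no", "no", "yes", "yes"]
print(knn_predict(train_X, train_y, (0.85, 0.85), k=3))  # yes
```

Because every attribute lies in [0, 1] after normalisation, no single feature dominates the distance, which is exactly why the preprocessing in Section 2.2 matters for this classifier.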

3.4 Discussion on the effect of CFS

As mentioned above, classifiers with CFS performed as well as, and often better than, classifiers without feature selection, as seen in Tables 1 & 2. We can therefore say that feature selection was beneficial: it matched or improved the accuracy of every classification method.

Other advantages of feature selection include a shorter training-data scan time, since there are fewer attributes to process, and reduced dimensionality, which helps avoid overfitting.

Figure 3. Graph illustrating the correlation between the attributes plus class of the dataset

CFS selected a subset of the original features, cutting down from the original 8 features to only 5. The removed features were “number of times pregnant”, “diastolic blood pressure” and “triceps skinfold thickness”.

Dropping the first feature, “number of times pregnant”, seems reasonable, as it does not necessarily correlate with a diabetes diagnosis. As for “diastolic blood pressure”, even though it is a predictor of other cardiovascular problems, it is not a primary factor in diagnosing diabetes, so it also makes sense that it was dropped from the model.

We expected “triceps skinfold thickness” to be an indicator of diabetes and therefore to be retained in the model. Indeed, skin thickness in places like the abdomen is usually a telltale sign of obesity, which often leads to type 2 diabetes. Nonetheless, that does not imply that triceps skinfold thickness is highly correlated with diabetes, or uncorrelated with the retained features (as observed in Figure 3).

4. Conclusion

After the implementation of the kNN and Naïve Bayes classification algorithms on MATLAB and the tests performed on Weka with several classifiers, we were able to evaluate those classifiers and compare them with each other in terms of accuracy using the 10-fold stratified cross-validation method.

The normalised Pima Indians dataset was used, with and without feature selection, as input to those classifiers. The results showed that the best-performing classifier (highest accuracy) for predicting a diabetes diagnosis is SVM, closely followed by Naïve Bayes at an average accuracy of 76.60% (with CFS). The worst method was ZeroR at 65.10% accuracy, followed by 1NN (both our implementation and Weka’s) at an average of 68.5%. CFS proved beneficial, matching or increasing the accuracy of the classifiers while reducing the dimensionality of the dataset.

This study was performed on a dataset representing Pima Indian females aged above 21 years old. This could be a limitation of our findings, as a more generalised population (different heritage, no age or gender constraint) might produce different outcomes with regard to the prediction of a diabetes diagnosis.

As such, future work with other datasets and a more generalised population would be recommended. Also, as shown in the results, simple methods perform surprisingly well compared with more complex ones; additional work on Naïve Bayes, for instance trying other probability density functions, might be beneficial. Finally, we could apply a statistical test (e.g. a t-test) to better compare the accuracy differences between classifiers, or report other evaluation measures such as recall and precision.
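To illustrate that last point about Naïve Bayes: the Gaussian variant scores each class as the prior times a product of per-feature densities, and the density function is precisely the component one could swap for another distribution. A Python sketch with made-up per-class statistics (the project implementation was in MATLAB):

```python
import math

def gaussian_pdf(x, mean, std):
    """Gaussian density; other NB variants would swap in a different pdf here."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def nb_predict(class_stats, priors, query):
    """Naive Bayes: pick the class maximising prior * product of
    per-feature densities (features assumed conditionally independent)."""
    best, best_score = None, -1.0
    for cls, stats in class_stats.items():
        score = priors[cls]
        for value, (mean, std) in zip(query, stats):
            score *= gaussian_pdf(value, mean, std)
        if score > best_score:
            best, best_score = cls, score
    return best

# Made-up per-class (mean, std) for two normalised features,
# e.g. plasma glucose and BMI
class_stats = {
    "no":  [(0.40, 0.1), (0.35, 0.1)],
    "yes": [(0.70, 0.1), (0.55, 0.1)],
}
priors = {"no": 500 / 768, "yes": 268 / 768}  # Pima class proportions
print(nb_predict(class_stats, priors, (0.72, 0.50)))  # yes
```

Replacing `gaussian_pdf` with, say, a kernel density estimate per feature is one concrete direction for the future work suggested above.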

5. Reflection

This assignment provided us with a deeper understanding of classification algorithms (after implementing them from scratch in MATLAB) and introduced us to helpful software for AI work, Weka.

The analysis and implementation work achieved here is a great introduction to supervised machine learning and real AI problems that might use larger datasets with all sorts of attributes. We were able to evaluate classifiers by computing their accuracies before and after a reduction of said attributes, forming an insight into the variations that could improve our algorithms.

6. Bibliography

Burrows, N. Rios, et al. “Prevalence of diabetes among Native Americans and Alaska Natives, 1990–1997: an increasing burden.” Diabetes care 23.12 (2000): 1786–1790.

Hall, Mark Andrew. “Correlation-based feature selection for machine learning.” (1999).

Huntley, A. C., and Jr RM Walter. “Quantitative determination of skin thickness in diabetes mellitus: relationship to disease parameters.” Journal of medicine 21.5 (1990): 257–264.

Saringat, Zainuri, et al. “Comparative analysis of classification algorithms for chronic kidney disease diagnosis.” Bulletin of Electrical Engineering and Informatics 8.4 (2019): 1496–1501.

Sigillito, Vincent. Pima Indians Diabetes Database. National Institute of Diabetes and Digestive and Kidney Diseases; donated to the UCI Machine Learning Repository. Applied Physics Laboratory, The Johns Hopkins University, Laurel, MD (1990).


Rosa Caminal

MSci Management Science with Artificial Intelligence student at UCL, currently in my final year completing a master’s concentration in Business Analytics.