Clustering using k-means in insurance customer segmentation

Rosa Caminal
11 min read · May 18, 2020

1. Business Problem

The chosen problem is one that affects a range of businesses trying to personalise and specialise their marketing strategy: how to gain deeper insight into existing customers' activities and predict how new customers will behave.

Customer segmentation is the subdivision of the market (customers) into groups with similar characteristics [1]. The focus of this report will be on customer segmentation in the insurance industry, but the approach could also be applied in a range of other industries.

How does the algorithm work?

This problem can be broken down into the following sub-problems:

  1. Identify the customers
  2. Collect data on their behaviour, likes, preferences…
  3. Divide them into groups:

a) Analyse the similarities and differences

b) Decide on the number of clusters

c) Assign each customer to one of the clusters

In order to tackle this problem, the algorithm works in the following way:

Firstly, it randomly allocates k points to act as centroids (a centroid is the centre of a cluster). Secondly, it assigns each data point to the closest centroid, measured by Euclidean distance, so that every data point becomes part of one of the k clusters.


Once every point belongs to a cluster, the centroids are updated: each centroid is moved to the average of all the points in its cluster. Finally, using the same rule, every data point is reassigned to the closest of the new k centroids. The process is iterated until the cluster assignments no longer change.
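Both steps can be expressed in a few lines of NumPy. The sketch below shows a single assignment-and-update iteration; it is an illustration of the idea rather than the implementation used later in this report, and it assumes X (the data) and C (the current centroids) already exist as arrays:

import numpy as np

def kmeans_iteration(X, C):
    # Assignment step: distance from every point to every centroid,
    # then the index of the closest centroid for each point
    distances = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)
    # Update step: move each centroid to the mean of its assigned points
    # (assumes no cluster ends up empty)
    new_C = np.array([X[labels == i].mean(axis=0) for i in range(len(C))])
    return labels, new_C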

How does it solve the problem? — Hypothetical scenario

A new customer wants to buy life insurance. He comes into the firm and fills in basic demographic information about himself, such as age, gender and employment status; financial data, such as income, retirement plans, homeownership status and vehicle ownership; and behavioural data (questions about the lifestyle of the individual). This is combined with publicly available information to “characterize the demographic status of a client” [2].

The k-means algorithm addresses the problem of how to classify the new customer, i.e. which segment he belongs to. Once he has been classified, the k-means output can also help estimate how much his yearly/monthly quote should be, so the firm can charge him accordingly.
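As an illustration, once the centroids have been learned, assigning a new client is a single nearest-centroid lookup. The sketch below is hypothetical: the centroid values and the quote-per-segment table are invented for the example.

import numpy as np

# Learned centroids from a previous k-means run
# (hypothetical values; columns are age and BMI)
C = np.array([[25.0, 27.0], [45.0, 30.0], [60.0, 33.0]])

# Hypothetical monthly quote for each segment
monthly_quote = {0: 120.0, 1: 180.0, 2: 260.0}

new_client = np.array([38.0, 29.5])  # age and BMI of the new customer

# Assign the client to the segment with the closest centroid
segment = int(np.argmin(np.linalg.norm(C - new_client, axis=1)))
print("Segment:", segment, "- suggested quote:", monthly_quote[segment])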

Functional Requirements

The requirements associated with the algorithm include:

  • Inputting new data points (new clients)
  • Analysing the dataset of existing data points (the client directory)
  • Plotting the data points
  • Clustering them into groups (identifying the market segments)

Non-functional Requirements

  • Should be able to process large quantities of data points (a firm may have millions of clients) in a few minutes
  • Privacy of the data is key (it contains personal information)
  • Quality of results should be high, as mischarging a client can be costly (especially at a large scale)

2. EDA

In order to apply the algorithm to my business problem, I had to search for an appropriate dataset containing personal variables such as age, income, gender…

However, the dataset also had to include some variables related to the type of insurance I was going to use. I decided to go for health insurance, as there was a lack of insurance-sector datasets available online to the general public. I chose the insurance.csv file found on Kaggle [4], as it includes only 7 variables but has 1338 clients.

By clustering the health insurance clients into similar groups, the insurance company can charge each of those groups a different price instead of charging everyone the same rate. This helps boost profits, as those who would use the insurance more (clients prone to health problems) can be charged more.

I tested the code using this data by adapting the existing source code to work with the variables age and BMI (as they were the only two continuous input variables).

We tested the code by importing a different number of entries each time we read the file. For example, to check how the running time of the code scales, I adjusted the number of rows read by changing the call to:

data = pd.read_csv('insurance.csv', nrows=100)

3. Implementing k-means

The code used is an adapted version of:

NK, Mubaris. “K-Means Clustering in Python.” Blog by MUBARIS NK. Accessed January 06, 2019. https://mubaris.com/posts/kmeans-clustering/.

It is divided into 6 main parts, clearly delimited by 6 main functions (which the original code lacked), to create a clear structure.

importing()

This is where the code is initiated. In this part, all of the libraries/functions needed to run the code (NumPy, pandas, copy, matplotlib…) are imported.

A sample of the data (first 5 rows) is also printed to check that the dataset has been imported correctly. All of the code was changed slightly to work with the new dataset on health insurance.
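The article does not reproduce this function, so the following is a minimal sketch consistent with the description above; the exact set of imports and the file name are assumptions (cdist and KMeans are included because the elbow-method function below relies on them):

import numpy as np
import pandas as pd
from copy import deepcopy
from matplotlib import pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def importing():
    global data
    # Read the Kaggle health insurance dataset
    data = pd.read_csv('insurance.csv')
    # Print the first 5 rows to check the dataset was imported correctly
    print(data.head())

importing()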

plotdata()

This is the function where the key data we are using from the insurance CSV is loaded into Python. The two variables we are focusing on are age and BMI. The variable X is created, a two-dimensional array (one row per client, one column per variable) containing both values, using a file iterator to extract lines from the CSV.

This is also where the data is plotted, using the pyplot module from the matplotlib library, which helps us visualise the dataset and check that it was imported correctly.
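This function is not shown in the article either; a minimal sketch of its role (using the already-imported pandas DataFrame rather than a raw file iterator, and matching the names f1, f2 and X used by the later functions) might look like this:

def plotdata():
    global f1, f2, X
    # Extract the two continuous variables used for clustering
    f1 = data['age'].values
    f2 = data['bmi'].values
    # Build X: one row per client, columns [age, bmi]
    X = np.array(list(zip(f1, f2)))
    # Scatter plot to check the data was imported correctly
    plt.scatter(f1, f2, c='#050505', s=7)
    plt.xlabel('age')
    plt.ylabel('bmi')
    plt.title('Age vs BMI of the insurance clients')
    plt.show()

plotdata()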

elobowmethod()

This function is dedicated to the elbow method, which will be further explained in the next section.

This function uses k-means not for the final clustering, but to identify the ideal number of clusters the data should be divided into.

def elobowmethod():
    # k-means determine k: compute the distortion for k = 1..9
    distortions = []
    K = range(1, 10)
    for k in K:
        kmeanModel = KMeans(n_clusters=k).fit(X)
        # Average distance of each point to its closest centroid
        distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                            'euclidean'), axis=1)) / X.shape[0])
    # Plot the elbow
    plt.plot(K, distortions, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Distortion')
    plt.title('The Elbow Method showing the optimal k')
    plt.show()

elobowmethod()

dist()

This function is the Euclidean distance calculator. It computes the distance between two arrays using NumPy's norm function [3]; when no axis is given, this is the Frobenius norm of their difference, which is used as the error measure between the old and new centroids.
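The function is not reproduced in the article; given the calls dist(X[i], C) and dist(C, C_old, None) in the code below, it is presumably the one-liner from the original Mubaris NK post, built on numpy.linalg.norm:

def dist(a, b, ax=1):
    # Euclidean distance along the given axis;
    # with ax=None this is the Frobenius norm of (a - b)
    return np.linalg.norm(a - b, axis=ax)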

rancentroids()

The number of clusters is set by inputting the desired value of k when the code prompts you to do so, whereas in the source code k was a constant. Then k random points are created by randomly allocating an x and a y coordinate to each.

A 2D array is then created containing the initial random centroids (and immediately printed, to check that the random centroids have been created correctly), and a graph is plotted to visualise the new centroids against the data.

In order to improve the source code, axis labels and titles were added to all of the visual representations of the algorithm where needed.

def rancentroids():
    global k
    global C

    # Number of clusters
    k = int(input("How many clusters (k) should the data be divided into: "))

    # X coordinates of random centroids
    C_x = np.random.randint(10, 70, size=k)
    # Y coordinates of random centroids
    C_y = np.random.randint(10, 60, size=k)
    C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
    print("Random centroids:", C)

    # Plotting the data (f1 = age, f2 = bmi) along with the centroids
    plt.scatter(f1, f2, c='#050505', s=7)
    plt.scatter(C_x, C_y, marker='o', s=100, c='r')
    plt.xlabel('age')
    plt.ylabel('bmi')
    plt.title('Random centroids allocated')
    plt.show()

rancentroids()

optimalcentroids()

This is the most important function in the code, as it is the one that groups the client data into clusters. It follows this structure:

  1. Uses the Euclidean distance calculator function to calculate the error (the distance between the new centroids and the old centroids).
  2. Once the error has been calculated, the objective becomes to make it converge to 0, which would mean the optimal centroids have been found. To do this, a “while the error is not 0” loop is initialised.
  3. Another loop goes through the array X (where the age and BMI data is stored) and calculates the distance from each individual point to each of the centroids. The point is then assigned to the centroid/cluster with the minimal distance.
  4. A third loop goes through each cluster and calculates the average of its points in order to find the value of the new centroids.
  5. The error is recalculated to check whether it has reached 0.
  6. A break statement was added to the original code for when the loop counter reaches 50, to avoid looping indefinitely if the error is not decreasing.
  7. The last part of the code prints the final output, with the optimal centroids and colour-coded clusters (see the sketch after the listing below).
def optimalcentroids():
    global points
    global clusters
    global C
    looptimes = 0

    # To store the value of the centroids when they update
    C_old = np.zeros(C.shape)
    # Cluster labels (0, 1, 2, ...)
    clusters = np.zeros(len(X))
    # Error func. - distance between new centroids and old centroids
    error = dist(C, C_old, None)

    # Loop will run till the error becomes zero
    while error != 0:
        # Loop counter
        looptimes = looptimes + 1
        # Assigning each value to its closest cluster
        for i in range(len(X)):
            distances = dist(X[i], C)
            cluster = np.argmin(distances)
            clusters[i] = cluster
        # Storing the old centroid values
        C_old = deepcopy(C)
        # Finding the new centroids by taking the average value
        for i in range(k):
            points = [X[j] for j in range(len(X)) if clusters[j] == i]
            C[i] = np.mean(points, axis=0)
        error = dist(C, C_old, None)
        # Break to avoid looping indefinitely if the error is not decreasing
        if looptimes > 50:
            break
    print("Times it goes through while loop:", looptimes)

optimalcentroids()
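The final plotting step described in point 7 does not appear in the listing above; based on the original Mubaris NK post, it presumably looks something like the following (the colour list and marker styles are assumptions):

# Final output: colour-coded clusters with the optimal centroids
colors = ['r', 'g', 'b', 'y', 'c', 'm']
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    plt.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
plt.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')
plt.xlabel('age')
plt.ylabel('bmi')
plt.title('Clients clustered with the optimal centroids')
plt.show()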

4. Analysing the complexity

The main input to the code is the dataset with client information, so its size depends on how big the pool of customers is. The chosen dataset for this business problem contained 1338 clients. We tested run-time performance with a loop counter, using different sizes of the same dataset; on average, the bigger the dataset, the higher the run time.
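A simple way to reproduce this experiment is to time the clustering for increasing values of nrows. The harness below is a hypothetical sketch: it rebuilds the global arrays the functions rely on, and assumes k has already been set by rancentroids().

import time

for n in [100, 250, 500, 1000, 1338]:
    data = pd.read_csv('insurance.csv', nrows=n)
    # Rebuild the global arrays used by the clustering functions
    f1, f2 = data['age'].values, data['bmi'].values
    X = np.array(list(zip(f1, f2)))
    # Fresh random centroids for each run
    C = np.array(list(zip(np.random.randint(10, 70, size=k),
                          np.random.randint(10, 60, size=k))), dtype=np.float32)
    start = time.time()
    optimalcentroids()
    print(n, "rows:", round(time.time() - start, 2), "seconds")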

Another key input to the algorithm is the number of clusters (k). In order to check what the optimal value of k would be, we tested our dataset with the elbow method, using a version of the code from the following source, adapted for the insurance dataset:

“Kmeans Elbow Method.” Python (2018). Accessed February 02, 2019. https://pythonprogramminglanguage.com/kmeans-elbow-method/.

The results from running that code provided us with a graph indicating that the ideal number of clusters for the dataset is 3 when all 1338 clients are used in the model.

The number of clusters the data is divided into (k) has a huge influence on the run time of the code. By adding a loop counter inside the while loop, we were able to measure how many times the algorithm runs through the loop. We then tested this with different values of k.

We stopped at five because, beyond this point, the algorithm does not always produce an output when you run it (depending on the initial centroids), as it takes too much running time. To solve this problem, a break statement was added for when the loop counter inside the while loop exceeds 50, as it was found that, on some occasions when k was increased, the loop never ended.

Ultimately, we can draw the conclusion that the higher the number of clusters the data is divided into, the more time-consuming the algorithm becomes, as the average number of passes through the while loop increases.

5. Conclusions

Benefits of the algorithm

The main benefits are that it is easy to understand and implement. Furthermore, it does not require much computational power or time to run.

There are many alternative approaches to customer segmentation for insurance companies. Other clustering methods could be used, for example hierarchical clustering; its main drawback is that it requires significantly more computational power, even if it can result in more accurate clusters.

The design principles of the algorithm are overall good, but they could be improved by creating more functions (helping break down the problem) and adding more comments/annotations, making the code easier to understand and to fix in case of any bugs.

This code can be generalised to any business problem with minor changes: importing a different dataset, renaming the variables and adjusting the graphs.

Limitations of the algorithm

The k-means algorithm has a limitation when it comes to clustering for insurance: it can only work with continuous variables, not with categorical ones. This is an issue, as most of the variables on client information are categorical (gender, smoker…), and those could potentially give an insight into how to cluster the clients and therefore how much to charge them.

The code could also be improved by implementing 3+ variables per client, as this would provide a more insightful analysis of the clients and would cluster them taking into account more factors that may be essential to defining the price of the health insurance, such as income.
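As a sketch of this improvement, scikit-learn's KMeans handles any number of features directly. The example below adds children as a third variable (a stand-in, since the Kaggle dataset has no income column) and standardises the features first, a step that matters once the variables are on different scales:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Three numeric variables per client
features = data[['age', 'bmi', 'children']].values
scaled = StandardScaler().fit_transform(features)

# Cluster on the standardised features and inspect the segments
kmeans = KMeans(n_clusters=3).fit(scaled)
data['segment'] = kmeans.labels_
print(data.groupby('segment')[['age', 'bmi', 'children']].mean())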

Limitations in your test data and the implications for your findings

The test data is limited, as it only provides 1338 clients and only a few continuous variables to analyse. Consequently, the findings are not as accurate as they would need to be for deciding customer segments in health insurance.

Bibliography

[1] D. K. Rigby (2017), Customer Segmentation. Management Tools 2017: An Executive's Guide. Bain Brief, Bain & Company.

[2] T. Bücker, (2016) Customer Clustering in the Insurance Sector by Means of Unsupervised Machine Learning. NOVA Information Management School

[3] “numpy.linalg.norm”, NumPy v1.14 Reference Guide (accessed February 06, 2019), https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.norm.html#numpy-linalg-norm.

[4] M. Choi (2018), “Medical Cost Personal Datasets”, Kaggle (accessed February 07, 2019), https://www.kaggle.com/mirichoi0218/insurance/home.

Code Sources:

“Kmeans Elbow Method.” Python. June 21, 2018. Accessed February 02, 2019. https://pythonprogramminglanguage.com/kmeans-elbow-method/.

NK, Mubaris. “K-Means Clustering in Python.” Blog by MUBARIS NK. Accessed January 06, 2019. https://mubaris.com/posts/kmeans-clustering/.

VanderPlas, Jake. “Three-Dimensional Plotting in Matplotlib.” Python Data Science Handbook. Accessed February 05, 2019. https://jakevdp.github.io/PythonDataScienceHandbook/04.12-three-dimensional-plotting.html.
