Exploring ratings of TV shows across different streaming platforms using ‘altair’ and ‘seaborn’

Jun 1, 2020

1. Introduction

In this report, we will explore the ratings of TV shows across different online platforms through an extensive exploratory data analysis (EDA).

Source of data

The source of the dataset tv_shows.csv is Kaggle [2]. The dataset has 5611 entries and 10 features, covering TV shows on 4 different platforms. The data was originally sourced from Reelgood.com and scraped using Beautiful Soup.

Executive Summary

In this report, we will check the observations, variables and values of our data. It will be divided into 4 parts:

Importing the data
Feature Engineering
Exploratory data analysis
Further exploration of ratings

2. Importing the data

The first step is to import all the libraries we will be using in the report:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt

Once all the libraries are loaded, it is time to import the data we will be using:

df = pd.read_csv("tv_shows.csv")
df.head()

Figure 1: Table illustrating the first 4 entries of the original dataset

3. Feature Engineering

The next step is to clean the data, so that we get the maximum information out of it and make sure it is in the correct format for the visualisations in the EDA. We start by converting some of the columns: Age, Rotten Tomatoes and IMDb.

As seen in Figure 1, the age of each of the shows has a ‘+’ at the end. In order for it to be treated as an integer, we need to get rid of the plus symbol and convert it from a string to a number.

As for Rotten Tomatoes, the problem is similar, as the data entries were measured in percentages and all have a ‘%’. In a similar approach, the percentage symbol was removed and the column was converted into a number too.

Lastly, we converted the IMDb ratings from out of 10 to a percentage, so they were easier to compare to the Rotten Tomatoes scores.

# Converting the percentages to numbers
df['Rotten Tomatoes'] = df['Rotten Tomatoes'].str.rstrip('%').astype('float')
# Removing the "+" sign from the age rating
# (regex=False treats "+" as a literal character, not a regex quantifier)
df["Age"] = df["Age"].str.replace("+", "", regex=False)
# Converting it to numeric; anything that is not a number becomes NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Converting IMDb to a percentage
df["IMDb"] = df["IMDb"] * 10
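To make the three conversions easier to follow, here is a minimal sketch that applies them to a tiny made-up frame. The `sample` DataFrame and its values are hypothetical and only mimic the raw format described above; they are not rows from tv_shows.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the raw format: '+' suffix on Age, '%' suffix on Rotten Tomatoes
sample = pd.DataFrame({
    "Age": ["16+", "7+", np.nan],
    "IMDb": [9.5, 8.0, 7.5],
    "Rotten Tomatoes": ["96%", np.nan, "81%"],
})

# Strip the trailing '%' and convert to float; missing values stay NaN
sample["Rotten Tomatoes"] = sample["Rotten Tomatoes"].str.rstrip("%").astype("float")
# Remove the literal '+' and convert to a number; unparseable values become NaN
sample["Age"] = pd.to_numeric(
    sample["Age"].str.replace("+", "", regex=False), errors="coerce"
)
# Rescale IMDb from a 0-10 scale to a 0-100 scale
sample["IMDb"] = sample["IMDb"] * 10

print(sample)
```

After this, all three columns are numeric and the two rating columns share the same 0 to 100 scale.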

The next step was to convert the online platform where the show can be streamed into an extra categorical column for purposes of exploring the differences between them and how it affects the score.

The problem was that some of the TV shows are available on more than one streaming platform. We decided the best way to handle this was to create a variable named ‘Provider’, which indicates on which platform you are able to stream the show. When a given show is available on more than one platform, it is assigned the option ‘Multiple’.

def Provider(row):
    # Return the platform name only when the show is exclusive to it
    if row['Netflix'] == 1 and row['Hulu'] == 0 and row['Prime Video'] == 0 and row['Disney+'] == 0:
        return 'Netflix'
    if row['Netflix'] == 0 and row['Hulu'] == 1 and row['Prime Video'] == 0 and row['Disney+'] == 0:
        return 'Hulu'
    if row['Netflix'] == 0 and row['Hulu'] == 0 and row['Prime Video'] == 1 and row['Disney+'] == 0:
        return 'Prime Video'
    if row['Netflix'] == 0 and row['Hulu'] == 0 and row['Prime Video'] == 0 and row['Disney+'] == 1:
        return 'Disney+'
    return 'Multiple'

df['Provider'] = df.apply(Provider, axis=1)
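The same labelling can also be done without a row-by-row function. The sketch below is an alternative, not the approach used in this report: it assumes the four platform columns only contain 0/1 flags, and the `flags` frame is hypothetical illustration data.

```python
import pandas as pd

platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# Hypothetical rows using the same 0/1 platform flags as the dataset
flags = pd.DataFrame({
    "Netflix":     [1, 0, 1, 0],
    "Hulu":        [0, 1, 1, 0],
    "Prime Video": [0, 0, 0, 0],
    "Disney+":     [0, 0, 0, 1],
})

# idxmax returns the first platform column holding the row maximum (a 1)
single = flags[platforms].idxmax(axis=1)
# Keep that name only where exactly one platform carries the show
exclusive = flags[platforms].sum(axis=1) == 1
flags["Provider"] = single.where(exclusive, "Multiple")

print(flags["Provider"].tolist())
```

Like the function above, this labels any row that is not exclusive to a single platform as ‘Multiple’.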

On top of all that, we also dropped the columns ‘Unnamed: 0’ and ‘type’, as they did not provide any information. ‘Unnamed: 0’ was just an index that was no longer needed and ‘type’ was a column of all 1’s.
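The drop itself is not shown in this report, so the following is only a sketch of how it could look, using a hypothetical two-row frame `toy` with the same column names.

```python
import pandas as pd

# Hypothetical frame with the two uninformative columns
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "Title": ["A", "B"], "type": [1, 1]})

# A column with a single distinct value carries no information
print(toy["type"].nunique())

# errors="ignore" makes the call safe to re-run after the columns are gone
cleaned = toy.drop(columns=["Unnamed: 0", "type"], errors="ignore")
print(cleaned.columns.tolist())
```

Note that the scatterplots later in this report use ‘Unnamed: 0’ as the x-axis, so in practice the drop would have to happen after those plots, or the plots would need to use `df.index` instead.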

After finishing the cleaning, this is how the data looks:

Figure 2: Table illustrating the data once it has been cleaned

4. Exploratory data analysis

Numerical data

Year

As seen in Figure 3, most of the content on the platforms has been created in the last 20 years; to be more exact, over 50% of the TV shows recorded in this dataset were created between 2015 and 2020.

It is also interesting to point out that although most of the content is somewhat recent, there is a representation of older shows with over 150 created before 1980, with the oldest shows ‘Space: The New Frontier’ and ‘Gods & Monsters with Tony Robinson’ created in 1901.
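Figures like these can be computed directly from the Year column. The sketch below uses a hypothetical `years` Series, so the numbers it prints are not the ones quoted above.

```python
import pandas as pd

# Hypothetical creation years, standing in for df["Year"]
years = pd.Series([1901, 1975, 2008, 2015, 2017, 2019, 2020, 2020], name="Year")

# Share of shows created between 2015 and 2020 (both ends inclusive)
share_recent = years.between(2015, 2020).mean()
# Number of shows created before 1980
n_before_1980 = (years < 1980).sum()
# Year of the oldest show
oldest = years.min()

print(share_recent, n_before_1980, oldest)
```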

plt.figure(figsize=(20,10))
x = sns.countplot(x="Year", data=df)
plt.xticks(rotation=90)

Figure 3: Bar chart illustrating the show's date of creation

We then decided to look into the differences between the platforms when it comes to the creation dates of their shows. Interestingly, Prime Video is the platform providing most of the older shows, with 10 out of the 15 shows created before 1950.

However, when it comes to newer shows, the leader is Netflix for the last 3 years, with over 60 shows out of the 110 created in 2020 alone.

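# Assumption: df2 is the random 5000-row sample defined later in this report,
# needed because Altair rejects datasets above 5000 rows by default
df2 = df.sample(5000)
alt.Chart(df2).mark_bar(opacity=0.7).encode(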
    x='Year',
    y=alt.Y('count()', stack=None),
    color="Provider",
).properties(
    width=800,
    height=300
).interactive()

Figure 4: Bar chart illustrating the show’s date of creation by Provider

As for the others, most of Hulu's shows were created between 2010 and 2015; Prime Video, interestingly, peaks in 2017 alone with a whopping 287 shows; whereas Disney+ only has 156 shows in total, with 75% of them created between 2005 and 2020.
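The per-platform counts behind statements like these can be read off a Year by Provider table. This is a sketch on a hypothetical frame `toy`; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical shows with a creation year and a provider label
toy = pd.DataFrame({
    "Year": [2017, 2017, 2020, 2020, 2020, 1948],
    "Provider": ["Prime Video", "Prime Video", "Netflix", "Netflix", "Hulu", "Prime Video"],
})

# Rows are years, columns are providers, cells are show counts
counts = pd.crosstab(toy["Year"], toy["Provider"])
print(counts)

# Year with the most shows for each provider
print(counts.idxmax())
```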

IMDb

plt.figure(figsize=(16, 6))
# Plot IMDb against the row index; hue='Provider' cannot be resolved when only the IMDb column is passed
sns.scatterplot(x=df.index, y=df['IMDb'])

Secondly, we decided to look at the IMDb ratings of the shows. Interestingly, when plotting the IMDb ratings on a scatterplot, as seen in Figure 5, they formed 4 distinct clusters. This is due to how the data was entered in the database, which can be observed in Figure 6.

plt.figure(figsize=(16, 6))
sns.scatterplot(data=df, x='Unnamed: 0', y='IMDb', hue='Provider')
plt.xlabel('Index')

Figure 6: Scattergraph illustrating the IMDb scores

We then had a look at which were the best and worst rated shows. Figure 7 shows the best 30 shows according to their IMDb ratings whereas Figure 8 shows the worst 30.

plt.figure(figsize=(20,10))
sns.barplot(x="IMDb", y="Title" , data= df.sort_values("IMDb",ascending=False).head(30))

Figure 7: Bar chart illustrating the best-rated shows in IMDb

plt.figure(figsize=(20,10))
sns.barplot(x="IMDb", y="Title" , data= df.sort_values("IMDb",ascending=True).head(30))

Rotten Tomatoes

As for Rotten Tomatoes ratings, we can clearly see there are a lot fewer data entries with only 1011/5611 (18%). Again, we can clearly see the 4 clusters, representing how the data was entered.

Interestingly, Rotten Tomatoes has a higher average of ratings with a mean of 77.5 compared to 71.13 of IMDb. It also awards 109 shows with a rating of 100 compared to 0 in IMDb.
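A comparison like this comes down to three small calculations: coverage, mean and the number of perfect scores. The sketch below runs them on a hypothetical frame `toy`, so its output does not reproduce the figures above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on the 0-100 scale, with missing Rotten Tomatoes scores
toy = pd.DataFrame({
    "IMDb": [95.0, 80.0, 72.0, 66.0],
    "Rotten Tomatoes": [100.0, np.nan, 85.0, np.nan],
})
ratings = ["IMDb", "Rotten Tomatoes"]

# Share of shows that have a Rotten Tomatoes score at all
coverage = toy["Rotten Tomatoes"].notna().mean()
# Column means; pandas skips NaN values by default
means = toy[ratings].mean()
# Number of shows rated exactly 100 on each site
perfect = (toy[ratings] == 100).sum()

print(coverage, means, perfect, sep="\n")
```

Because the means are taken over different subsets of shows (only those with a score on each site), they are not a like-for-like comparison.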

plt.figure(figsize=(16, 6))
sns.scatterplot(data=df, x='Unnamed: 0', y='Rotten Tomatoes', hue='Provider')
plt.xlabel('Index')

Figure 9: Scattergraph illustrating the Rotten Tomatoes scores

Again, we then had a look at which were the best and worst rated shows. Figure 10 shows the best 30 shows according to their Rotten Tomatoes ratings whereas Figure 11 shows the worst 30.

plt.figure(figsize=(20,10))
sns.barplot(x="Rotten Tomatoes", y="Title", data= df.sort_values("Rotten Tomatoes",ascending=False).head(30))

Figure 10: Bar chart illustrating the best-rated shows in Rotten Tomatoes

plt.figure(figsize=(20,10))
sns.barplot(x="Rotten Tomatoes", y="Title", data=df.sort_values("Rotten Tomatoes", ascending=True).head(30))

Figure 11: Bar chart illustrating the worst-rated shows in Rotten Tomatoes

Categorical data

Age

The feature Age is stored as a number, although all of its values fall into just 4 different options, which makes it effectively categorical.

Another issue is the large number of NaN values in this feature, with only 2620/5611 (46.7%) of the shows having information on their age restriction. We can assume this means that those shows have no age restrictions and are considered ‘Family-friendly’.
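The share of known values and the share of each category can be checked in two lines. The `age` Series below is hypothetical and simply stands in for `df["Age"]` after cleaning.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned Age column: numbers where known, NaN otherwise
age = pd.Series([16, 16, 18, 7, np.nan, np.nan, np.nan, np.nan], name="Age")

# Share of shows with a known age restriction
known_share = age.notna().mean()
# Share of each value over all shows, counting NaN as its own group
shares = age.value_counts(normalize=True, dropna=False)

print(known_share)
print(shares)
```

Keeping NaN as its own group (`dropna=False`) makes the shares comparable to proportions computed over the whole dataset.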

plt.figure(figsize=(20,10))
ax = sns.countplot(x="Age", data=df) # for Seaborn version 0.7 and more
total = float(len(df))
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format(height/total),
            ha="center")

Figure 12: Bar chart illustrating the numbers of shows in each of the age restrictions

As for those shows that do have age restrictions, the largest group, 18%, has an age restriction of 16+. On the other hand, there are only 4 shows with an age restriction of 13+.

plt.figure(figsize=(15,7))
ax=sns.boxplot(x='Age', y='IMDb', data=df, showmeans=True)

Figure 13: Boxplot illustrating the relationship between the ratings in IMDb and Age restriction

Next, we decided to explore the relationship between age and ratings (we chose IMDb purely because it has a larger number of entries and is therefore more complete).

We can clearly see in Figure 13 how the shows with an age restriction of 13+ have the lowest rating, but that can be biased by the fact that there are only 4 shows in that category. As for the others, they all seem to have similar averages, meaning that the age restriction could be independent of ratings.

Provider

The last feature to explore is the Provider feature we created. When looking at the tv shows of different platforms according to their IMDb ratings, we can see that they all have a similar distribution and mean.

However, it can be seen that the shows provided by multiple platforms have a slightly higher mean (72.89). This seems logical: if a show is highly rated, more people are likely to want to watch it, which generates more demand and gives more platforms an incentive to provide it.

When you ignore the shows in the ‘Multiple’ category, the breakdown of the means is as follows: 71.71 for Prime Video, 71.49 for Netflix, 69.93 for Hulu and 69.11 for Disney+.
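Per-provider means like these are a one-line groupby. The sketch below uses a hypothetical frame `toy`, so its values differ from the ones reported above.

```python
import pandas as pd

# Hypothetical shows with a provider label and an IMDb score on the 0-100 scale
toy = pd.DataFrame({
    "Provider": ["Netflix", "Netflix", "Hulu", "Hulu", "Multiple"],
    "IMDb": [70.0, 80.0, 60.0, 70.0, 90.0],
})

# Mean IMDb score per provider, highest first
provider_means = toy.groupby("Provider")["IMDb"].mean().sort_values(ascending=False)
# The same breakdown without the 'Multiple' category
single_means = provider_means.drop("Multiple")

print(provider_means)
print(single_means)
```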

plt.figure(figsize=(15,7))
ax=sns.boxplot(x='Provider', y='IMDb', data=df, showmeans=True)

Figure 14: Boxplot illustrating the relationship between the ratings in IMDb and Provider

5. Further exploration of ratings

Finally, we decided to come back to the ratings and explore the values across the 2 rating websites further. The first thing was to compare the ratings of Rotten Tomatoes to those in IMDb.

Figure 15: Scattergraph illustrating the relationship between the ratings in IMDb and Rotten Tomatoes

As seen in Figure 15 and mentioned in the previous section, Rotten Tomatoes tends to rate the same show higher than IMDb. This could be due to the nature of the users on each of the platforms: Rotten Tomatoes is a critics-powered site whereas IMDb is completely user driven [2]. Interestingly, only 24 shows have exactly the same rating in Rotten Tomatoes and IMDb.
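Comparing the two sites only makes sense for shows that have both scores. The sketch below restricts a hypothetical frame `toy` to those rows first; the counts it prints are illustrative and not the ones quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on the 0-100 scale; one show has no Rotten Tomatoes score
toy = pd.DataFrame({
    "IMDb": [80.0, 75.0, 90.0, 60.0],
    "Rotten Tomatoes": [80.0, 90.0, np.nan, 55.0],
})

# Keep only the shows rated on both sites
both = toy.dropna(subset=["IMDb", "Rotten Tomatoes"])

# Number of shows with exactly the same rating on both sites
n_equal = (both["IMDb"] == both["Rotten Tomatoes"]).sum()
# Share of shows that Rotten Tomatoes rates higher than IMDb
share_rt_higher = (both["Rotten Tomatoes"] > both["IMDb"]).mean()
# Pearson correlation between the two ratings
corr = both["IMDb"].corr(both["Rotten Tomatoes"])

print(n_equal, share_rt_higher, corr)
```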

Interactive graphs

We then created a few interactive graphs to allow us to explore which of the titles were the ones in the higher ratings of both.

fig = plt.figure(figsize=(16, 6))
df2 = df.iloc[:1000]
alt.Chart(df2).mark_circle(size=60).encode(
    x='Rotten Tomatoes',
    y='IMDb',
    color='Year',
    tooltip=['Title', 'Age','IMDb', 'Rotten Tomatoes', 'Year']
).properties(
    width=800,
    height=300
).interactive()

Figure 16 & 17: Scattergraph illustrating the relationship between the ratings in IMDb and Rotten Tomatoes by Year

fig = plt.figure(figsize=(16, 6))
df2 = df.sample(5000)
alt.Chart(df2).mark_circle(size=30).encode(
    x='Rotten Tomatoes',
    y='IMDb',
    color='Provider',
    tooltip=['Title', 'Age','IMDb', 'Rotten Tomatoes', 'Year']
).properties(
    width=800,
    height=300
).interactive()