Analysing & Predicting High School Student Performance

Rosa Caminal
11 min read · May 17, 2020
Source: Unsplash.

The Problem

The Portuguese student performance dataset (student-mat.csv & student-por.csv) used in this report focuses on high school students and how external and demographic factors affect their end-of-year grades in Mathematics and Portuguese.

The business problem we focus on is what schools can do to ensure fewer of their students fall behind, fail or underperform in their exams. We try to identify which groups of students are more likely to struggle with these subjects, and to predict each student's end-of-year grade from factors such as family, age, where they live and what they do in their free time. Once these groups are identified, schools could use this information to target them with additional academic support, ensuring the best possible results and future prospects.

Source of data

The source of the data is the UCI Machine Learning Repository. The data seems reliable: it contains several hundred records and covers students at two different schools in Portugal. It was donated to the repository on 27 November 2014, having originally been collected for the paper "Using Data Mining to Predict Secondary School Student Performance" in Porto, Portugal (April 2008) [1].
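
For reference, the two files load directly with read.csv; they are semicolon-separated per the UCI documentation, and the data-frame names below match those used in the rest of the report:

#Read the two UCI files (values are separated by ";")
mgrades.df <- read.csv("student-mat.csv", sep = ";")
pgrades.df <- read.csv("student-por.csv", sep = ";")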

Target Attribute

We will analyse the students' results over the whole year in two subjects. We have decided to use "average" as our target attribute: the mean of the G3 (end-of-year) grades in Portuguese and Mathematics.

average = (G3.x + G3.y) / 2

This gives a more representative overview that is not specific to one subject but spans both the sciences and the humanities.

There is evidence to support that extra support in subjects like mathematics makes a significant difference to the grades of the student in just one year [2].

Preparing the Data

Firstly, to calculate "average" we used the final "G3" grades from both the Mathematics and Portuguese datasets in the following way:

average = (G3.x + G3.y) / 2

Secondly, to prepare the data for correlation tables and predictions, we had to adjust some of the values. Fortunately, no values were missing and the data was complete (we checked this by inspecting the files in Excel).
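
The same completeness check can be done programmatically; a minimal sketch:

#Programmatic missing-value check, equivalent to eyeballing the files in Excel
anyNA(mgrades.df)   # TRUE if any value is missing in the Mathematics file
anyNA(pgrades.df)   # ...and in the Portuguese file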

The main problem with the data was that the Portuguese grades and the Mathematics grades were stored in separate CSV files, even though both related to the same group of students. To analyse the data together, we therefore had to combine both files. We did this with the merge function, creating a new data frame, "combined", holding the values for both subjects. This meant we had two values each for "G1", "G2", "G3", "absences" and "failures"; the merge suffixes them .x for Mathematics and .y for Portuguese (e.g. G1.x and G1.y), and we proceeded the same way for all of those variables.

#Merge both datasets into one, matching students on the shared attributes
combined.df <- merge(mgrades.df, pgrades.df,
                     by = c("school", "sex", "age", "address", "famsize", "higher",
                            "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason",
                            "guardian", "traveltime", "studytime", "schoolsup",
                            "famsup", "paid", "activities", "nursery", "internet",
                            "romantic", "famrel", "freetime", "goout", "Dalc",
                            "Walc", "health"))
write.csv(combined.df, file = "combined.csv")
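
With the merged data frame in place, the target attribute is one line (using the .x/.y suffixes the merge produces):

#Target attribute: mean of the two end-of-year grades
combined.df$average <- (combined.df$G3.x + combined.df$G3.y) / 2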

In addition, to analyse correlations and run predictions, we had to recode variables such as "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet" and "romantic" from "yes"/"no" to binary values: "1" for yes and "0" for no.

#Preparing the data - switching "yes" and "no" to "1" and "0"
combined.df$schoolsup <- ifelse(combined.df$schoolsup == 'yes', 1, 0)
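
The same recoding was applied to each of the listed columns; a compact sketch that converts them all in one pass:

#Recode every yes/no column to 1/0 in a single loop
yesno.cols <- c("schoolsup", "famsup", "paid", "activities",
                "nursery", "higher", "internet", "romantic")
combined.df[yesno.cols] <- lapply(combined.df[yesno.cols],
                                  function(x) ifelse(x == "yes", 1, 0))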

Thirdly, we split the combined data into "combined.df.training" and "combined.df.test". We did this by randomly sampling 80% of the row indices for "combined.df.training", leaving the remaining 20% in "combined.df.test" to measure the accuracy of the prediction models.
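
A minimal sketch of that split (the seed is an assumption, chosen only for reproducibility):

#80/20 train/test split on row indices
set.seed(101)
train.rows <- sample(nrow(combined.df), floor(0.8 * nrow(combined.df)))
combined.df.training <- combined.df[train.rows, ]
combined.df.test <- combined.df[-train.rows, ]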

Understanding the data

The Mathematics dataset "mgrades" contains 395 students and the Portuguese dataset "pgrades" 649 (per the UCI documentation), but only 201 students appear in the combined dataset "combined". It includes variables covering basic demographic information (such as age, sex and address), family information (such as family size, or whether the parents live together), lifestyle information and academic information.

The target attribute is relevant to the problem because it represents students' academic results at the end of secondary education, derived from the end-of-year secondary school exams taken in Portugal.

Correlation

Based on the heatmap below, we identified the attributes likely to be relevant to our problem. Positively correlated variables appear in red, negatively correlated ones in blue:

#Correlation matrix over the numeric columns, plotted as a heatmap
average <- combined.df$average
res <- cor(combined.df[, c(14:39)], use = "complete.obs")
col <- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = res, col = col, symm = TRUE)

The most relevant factors include failures, travel time, alcohol consumption, absences and romantic relationships.

We then calculated the strongest positive correlations (ignoring the other exam results) and the strongest negative correlations with our target variable "average".

targetCol <- which(names(combined.df) == "average")
startCol <- which(names(combined.df) == "traveltime")
endCol <- which(names(combined.df) == "G3.y")
#Ascending sort puts the strongest negative correlations with "average" first
sort(cor(combined.df[, c(targetCol, startCol:endCol)], use = "complete.obs")[1, ], decreasing = FALSE)[1:10]
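
The complementary call, sorting in decreasing order, surfaces the strongest positive correlations:

#Descending sort puts the strongest positive correlations first
sort(cor(combined.df[, c(targetCol, startCol:endCol)], use = "complete.obs")[1, ], decreasing = TRUE)[1:10]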

Negative Correlation:

Positive Correlation:

R graphs

We computed aggregation tables in R with five of the most representative variables.

First, to see the differences between female and male students (since "romantic" and "internet" are now 0/1, their means can be read as the share of students answering yes):

aggregate(cbind(goout, romantic, Walc, studytime, internet) ~ sex, data = combined.df, FUN = mean)

And then to see the differences between the two schools:

aggregate(cbind(goout, romantic, Walc, studytime, internet) ~ school, data = combined.df, FUN = mean)

Then we created some bar charts to see the effect of the parents' jobs on a student's average:

library(ggplot2)
ggplot(combined.df, aes(x = Mjob, y = average, fill = guardian)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#4682B4", "#87CEEB", "#6495ED")) +
  theme_minimal() +
  ylim(0, 1000)

This shows that students whose mother works in the services sector or in "other" jobs tend to have a much higher average than, for example, those whose mother works in the health sector or as a teacher.

ggplot(combined.df, aes(x = Fjob, y = average, fill = guardian)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#4682B4", "#87CEEB", "#6495ED")) +
  theme_minimal() +
  ylim(0, 1000)

Likewise, students whose father works in the services sector or in "other" jobs tend to have a much higher average than those whose father works in the health sector or as a teacher.

This is very interesting, as one would expect a parent who teaches to be able to provide extra support and therefore produce higher grades. However, this is not the case.
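
One caveat on these charts: geom_bar(stat = "identity") stacks the raw per-student values, so bar heights are sums of averages (hence the 0-1000 axis) and partly reflect how many students fall in each category. A sketch of a variant that plots group means instead:

#Bar chart of the mean average grade per mother's job, grouped by guardian
ggplot(combined.df, aes(x = Mjob, y = average, fill = guardian)) +
  stat_summary(fun = mean, geom = "bar", position = "dodge") +
  scale_fill_manual(values = c("#4682B4", "#87CEEB", "#6495ED")) +
  theme_minimal()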

Lastly, we plotted an rpart tree. This helped us identify which variables are most significant to the problem and their hierarchy:

library(rpart)
library(rpart.plot)
rt <- rpart(average ~ address + Fedu + studytime + schoolsup + famsup + higher + internet + romantic + goout, data = combined.df.training)
#Plotting the tree
prp(rt, type = 1, extra = 1)
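
The same hierarchy can also be read off numerically from the fitted tree:

#Variable importance scores computed while growing the tree
rt$variable.importance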

As seen above, the most important factors are whether the student wants to pursue higher education, their address, study time and school support. As expected, those who want to pursue higher education have higher average grades.

Tableau Graphs

Firstly, we decided to explore the impact of alcohol consumption on grades, as it was one of the variables with a strong negative correlation:

As seen in the graph above, the relationship between alcohol consumption and grades differs significantly between females and males. Interestingly, in both subjects and in the final average, males show a negative relationship: the more they drink, the worse their grades. The results for females, however, were not as expected: their alcohol consumption correlated positively with grades.

We then explored two other variables with high correlation to the average, study time and travel time, splitting the analysis by the reason students chose their school (as this affects both travel time and motivation to study).

This illustrates that the reason for choosing the specific school affects both travel time and study time. This is especially pronounced when the stated reason is "other": there, an extra hour of studying has the biggest positive impact on grades, and extra travel time has the largest negative impact.

In general, this shows that the higher the study time, the higher the average, across all school-choice reasons. And although one would expect longer travel times to mean lower grades, this holds in all cases but two: when the school was chosen for being closest to home (which makes sense, as it is probably the best available option) and when it was chosen for its reputation (where longer travel times actually coincide with higher grades).

We also wanted to explore how being in a relationship affects the correlation between the average and study time, by gender.

As seen above, for students who are not in relationships, the correlation between the average and study time is the same for males and females.

However, for those in relationships, the correlation between the average and study time is significantly stronger for males than for females (whose correlation remains very similar to that of students not in a relationship).

Generating and Testing Prediction Models

We looked at regression models in order to predict our target variable “average”.

Lasso Regularisation with optimised lambda

Firstly, we decided to carry out a Lasso Regularisation with an optimised lambda:

startCol <- which(names(combined.df) == "traveltime")
endCol <- which(names(combined.df) == "health")
xknown <- as.matrix(combined.df.training[, startCol:endCol])
yknown <- combined.df.training$average
#Basic Lasso Regularisation analysis
library("glmnet")
library("Matrix")
library("foreach")
library("Metrics")  # assumed source of the mae() used below
lm.lasso <- glmnet(xknown, yknown, family = "gaussian")
plot(lm.lasso, xvar = "lambda", label = TRUE)
coef(lm.lasso, s = exp(0))
#Using cross-validation to optimise lambda for the Lasso model
set.seed(101)
lm.lasso.cv <- cv.glmnet(xknown, yknown, nfolds = 5, family = "gaussian")
#Optimised lambda value and its log
lm.lasso.cv$lambda.min
(minLogLambda <- log(lm.lasso.cv$lambda.min))
#Plotting the Lasso paths with the optimised lambda marked
plot(lm.lasso, xvar = "lambda", label = TRUE)
abline(v = log(lm.lasso.cv$lambda.min))
#Coefficients of the regularised linear regression at the optimal lambda
coef(lm.lasso.cv, s = "lambda.min")
#Determining the generalised error for this Lasso model
xtest <- as.matrix(combined.df.test[, startCol:endCol])
lm.lasso.cv.pred <- predict(lm.lasso.cv, newx = xtest, s = "lambda.min")
#Calculate RMSE
grades.lasso.pred <- as.vector(lm.lasso.cv.pred)
rmse <- sqrt(mean((combined.df.test$average - grades.lasso.pred)^2))
#Error (MAE) across a range of log-lambda values
errorvals <- sapply(-6:2, function(loglambda) {
  mae(combined.df.test$average, predict(lm.lasso.cv, newx = xtest, s = exp(loglambda)))
})
plot(-6:2, errorvals, xlab = "log lambda", ylab = "error (mae)", type = "o")
abline(v = log(lm.lasso.cv$lambda.min))

The optimised log-lambda value is -1.90482979908033. This model could be improved by including the non-numerical variables too, since some of them, such as address, turn out to be essential for predicting grades, as the later models show.
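
One way to fold the categorical variables into glmnet would be to one-hot encode them with model.matrix(); a sketch under that assumption (the chosen columns are illustrative):

#One-hot encode some categorical predictors and add them to the design matrix
x.cat <- model.matrix(~ address + Mjob + Fjob + reason + guardian,
                      data = combined.df.training)[, -1]  # drop the intercept column
lm.lasso.cv2 <- cv.glmnet(cbind(xknown, x.cat), yknown, nfolds = 5, family = "gaussian")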

rpart

The second prediction model we tried was an rpart regression tree. This prediction again excluded all fields tied to only one of the subjects, such as G1, G2, failures or absences.

library(rpart)
library(rpart.plot)
rt <- rpart(average ~ address + Fedu + studytime + schoolsup + famsup + higher + internet + romantic + goout, data = combined.df.training)
#Plotting the tree
prp(rt, type = 1, extra = 1)

The RMSE of the rpart model was 3.44455:

#Calculate RMSE
grades.rpart.pred <- predict(rt, newdata = combined.df.test)
rmse <- sqrt(mean((combined.df.test$average - grades.rpart.pred)^2))

Random Forest

Lastly, we tried a random forest, which achieved an even lower RMSE of 3.3236. This model took into account all variables except the grades, absences and failures, because those depend on only one subject.
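
Since "combined.df.boost" is not defined above, here is one plausible construction under that description (the dropped column names follow the merge's .x/.y suffixes):

#Assumed construction of combined.df.boost: training data without per-subject columns
drop.cols <- c("G1.x", "G2.x", "G3.x", "G1.y", "G2.y", "G3.y",
               "absences.x", "absences.y", "failures.x", "failures.y")
combined.df.boost <- combined.df.training[, !(names(combined.df.training) %in% drop.cols)]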

# Random Forest:
library(randomForest)
grades.forest <- randomForest(average ~ ., data = combined.df.boost, importance = TRUE, ntree = 100)
#Calculate RMSE
grades.forest.pred <- predict(grades.forest, newdata = combined.df.test)
rmse <- sqrt(mean((combined.df.test$average - grades.forest.pred)^2))
#Plotting the forest error and comparing predictions with actual values
plot(grades.forest)
plot(combined.df.test$average, pch = 16)
points(grades.forest.pred, col = "blue", pch = 4)

After various iterations, this was the best result we could achieve. The black dots in the graph represent the actual values of our target variable as recorded in the data, while the blue crosses are the model's predictions. As seen on the right, the predictions are not especially accurate.

This is probably because there are not many data points; more data would likely lead to a more accurate result.
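
Because the forest was grown with importance = TRUE, the per-variable importance scores can also be inspected directly:

#Which variables the forest relied on most
importance(grades.forest)
varImpPlot(grades.forest)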

All the models helped us identify the most important variables for predicting students' grades. In conclusion, the model with the lowest RMSE, at 3.32, was the random forest. This is probably because it is the most sophisticated of the prediction models used: it took into account all variables, not only the numerical ones, and it avoids the overfitting seen in models like the rpart tree.

Ensemble Model

Furthermore, we decided to create an ensemble model combining the random forest and rpart predictions. This gave us a significantly lower RMSE of 1.1329:

#ENSEMBLE MODEL (rpart & Random Forest): average the two predictions
ensemble_pred <- (grades.rpart.pred + grades.forest.pred) / 2
rmse <- sqrt(mean((combined.df.test$average - ensemble_pred)^2))

To improve the individual models, we could have chosen a dataset with a wider range of variables and more data points (students), since only 80% of them are used to train the model (201 × 0.8 = 160.8 ≈ 161).

Problem Conclusions and Recommendations

After our analysis, we can conclude that the factors that affect students' grades the most are the following:

  • Age (the older the student, the lower the grades)
  • Number of hours of studying (the more they study, the higher the grades)
  • How much they go out (the more they go out, the lower the grades)
  • Time it takes to travel to school (the longer it takes, the lower the grades)
  • Parents' jobs (see the bar charts in the R graphs section)

Therefore, our recommendation is that schools focus on providing extra academic reinforcement to those students who:

  • Are older than the average, as they tend to perform worse
  • Live far from the school (particularly those who did not choose the school for its reputation or its closeness to home)
  • Have parents working in the health sector (as they are, by a significant margin, the students with the lowest average grade)

Moreover, we would also recommend alerting students that going out with friends and consuming alcohol, on weekdays and weekends alike, has a significant effect on their grades (especially for males).

References:

[1] Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance.

[2] Bhaird, C. M. A., Morgan, T., & O’Shea, A. (2009). The impact of the mathematics support centre on the grades of first-year students at the National University of Ireland Maynooth. Teaching Mathematics and its Applications: An International Journal of the IMA, 28(3), 117–122.
