
Learning Analytics and AI
Predictive Analytics of Student Performance
Date
2024
Team
Individual work
My Role
Learning analyst
Target Users / Audience
Educators
Tools Used
Exploratory
Overview
This project predicts student performance in Math and Portuguese language courses using data from two schools. The predictors include earlier grades, demographic information, and school-related attributes. Four types of predictive model (linear regression, logistic regression, decision tree, and random forest) are built and evaluated to forecast final grades and identify students at risk of failing.
Data Loading and Exploration
Import CSV datasets for Math and Portuguese performance, ensuring the correct delimiter settings.
Examine the summary statistics of both datasets and compare them against the metadata. Identify at least two notable distributions across columns.
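The loading and summary steps above can be sketched in Python. The project itself was done in Exploratory, so this is only an illustration: the semicolon delimiter, the inline sample rows, and the column names other than G1, G2, and G3 are assumptions, not taken from the actual files.

```python
import csv
import io
from collections import Counter
from statistics import mean, median

# Inline stand-in for the Math CSV (made-up rows, assumed ";" delimiter).
# A real run would read the file's text instead of this string.
SAMPLE = """school;sex;age;G1;G2;G3
GP;F;18;5;6;6
GP;M;17;12;13;14
MS;F;16;15;14;15
MS;M;19;8;0;0
"""

def load_rows(text, delimiter=";"):
    """Parse delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

rows = load_rows(SAMPLE)
g3 = [int(r["G3"]) for r in rows]

print(len(rows), "rows; columns:", list(rows[0].keys()))
print("G3 mean / median:", mean(g3), median(g3))
print("students per school:", Counter(r["school"] for r in rows))

# With the wrong delimiter every line collapses into a single column,
# which is the quickest sign that the setting needs changing.
wrong = load_rows(SAMPLE, delimiter=",")
print("columns with ',' delimiter:", len(wrong[0]))
```

Comparing the column count and basic statistics against the metadata is a quick way to confirm the file was parsed as intended before any modelling.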
Predicting Final Grade (Linear Regression)
Build linear regression models to predict the final grade (G3) based on one or more previous grades (G1, G2).
Split data into 70% training and 30% testing.
Evaluate the model’s goodness-of-fit (R²) and visualize the relationship between actual and predicted values using scatter plots.
Compare models with different predictor variables (G1, G2, G1+G2, all variables) to assess which model is most accurate and useful.
Predicting Risk of Failing (Logistic Regression)
Create a new column, “PassFail”, that classifies each student as passing or failing based on their average grade.
Build a logistic regression model to predict “PassFail” using all other variables.
Split data into training and testing sets, and visualize the model’s performance with a confusion matrix.
Identify any issues with the model and propose improvements based on the confusion matrix.
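The labelling and confusion-matrix steps above can be illustrated as follows. The pass mark of 10 on a 0 to 20 scale is an assumption (the cut-off used in the project is not stated here), the student rows are made up, and the predicted labels are hand-written placeholders standing in for a logistic regression model's output.

```python
from collections import Counter

PASS_MARK = 10  # assumption: pass mark on a 0-20 grading scale

def pass_fail(row, pass_mark=PASS_MARK):
    """Label a student by the average of G1, G2 and G3."""
    average = (row["G1"] + row["G2"] + row["G3"]) / 3
    return "Pass" if average >= pass_mark else "Fail"

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

# Made-up students for illustration only.
students = [
    {"G1": 12, "G2": 13, "G3": 14},
    {"G1": 8, "G2": 9, "G3": 7},
    {"G1": 10, "G2": 10, "G3": 10},
    {"G1": 5, "G2": 6, "G3": 0},
    {"G1": 9, "G2": 11, "G3": 9},
]
actual = [pass_fail(s) for s in students]

# Placeholder predictions; a fitted model would supply these.
predicted = ["Pass", "Fail", "Pass", "Fail", "Pass"]

cm = confusion_matrix(actual, predicted)
for a in ("Fail", "Pass"):
    print(a, {p: cm[(a, p)] for p in ("Fail", "Pass")})

# Share of truly failing students that were flagged as failing.
fail_recall = cm[("Fail", "Fail")] / (cm[("Fail", "Fail")] + cm[("Fail", "Pass")])
print("Fail recall:", round(fail_recall, 2))
```

For an at-risk use case, the cell to watch is actual Fail predicted as Pass, since those are the students who would be missed.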
Decision Tree for Risk Prediction
Build a decision tree model using the “PassFail” column and excluding grade variables (G1, G2, G3).
Analyze the decision tree to identify important variables and patterns or rules that predict student outcomes.
Evaluate the model’s utility in identifying students at risk and reflect on whether it should be used for interventions.
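One common way a decision tree decides which variable matters is the reduction in Gini impurity achieved by splitting on it. The sketch below shows that calculation for two candidate variables. Whether the tree built in Exploratory uses this exact criterion is not stated, and the column names past_failure and internet, along with the rows themselves, are hypothetical.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(rows, feature, target="PassFail"):
    """Impurity reduction from splitting rows on a categorical feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini([row[target] for row in rows]) - weighted

# Hypothetical rows; grade columns are deliberately absent, as in the step above.
rows = [
    {"past_failure": "yes", "internet": "yes", "PassFail": "Fail"},
    {"past_failure": "yes", "internet": "yes", "PassFail": "Fail"},
    {"past_failure": "yes", "internet": "no", "PassFail": "Fail"},
    {"past_failure": "yes", "internet": "no", "PassFail": "Pass"},
    {"past_failure": "no", "internet": "no", "PassFail": "Fail"},
    {"past_failure": "no", "internet": "yes", "PassFail": "Pass"},
    {"past_failure": "no", "internet": "yes", "PassFail": "Pass"},
    {"past_failure": "no", "internet": "no", "PassFail": "Pass"},
]

features = ["internet", "past_failure"]
ranking = sorted(features, key=lambda f: gini_gain(rows, f), reverse=True)
for f in ranking:
    print(f, round(gini_gain(rows, f), 3))
```

A variable with a larger gain separates passing from failing students more cleanly, so it tends to appear nearer the root of the tree.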
Random Forest for Risk Prediction
Apply a random forest model using the same dataset and “PassFail” column.
Adjust threshold values and assess the model’s performance through confusion matrices.
Compare the random forest’s effectiveness with other models and determine which threshold is optimal for identifying students at risk.
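The threshold adjustment above can be sketched without the model itself. Here the predicted probabilities of failing are placeholder values (for a random forest they could be the share of trees voting "Fail"), and the two thresholds compared, 0.5 and 0.3, are illustrative choices rather than the ones used in the project.

```python
def confusion_at(threshold, fail_probs, actually_failed):
    """Confusion counts when students with prob >= threshold are flagged at risk."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for prob, failed in zip(fail_probs, actually_failed):
        flagged = prob >= threshold
        if flagged and failed:
            counts["tp"] += 1   # at-risk student correctly flagged
        elif flagged:
            counts["fp"] += 1   # passing student flagged unnecessarily
        elif failed:
            counts["fn"] += 1   # at-risk student missed
        else:
            counts["tn"] += 1   # passing student correctly left alone
    return counts

def recall(counts):
    """Share of truly failing students that were flagged."""
    return counts["tp"] / (counts["tp"] + counts["fn"])

# Placeholder model outputs and outcomes, for illustration only.
fail_probs = [0.9, 0.7, 0.45, 0.35, 0.3, 0.2, 0.1, 0.05]
actually_failed = [True, True, True, False, True, False, False, False]

for threshold in (0.5, 0.3):
    counts = confusion_at(threshold, fail_probs, actually_failed)
    print(threshold, counts, "recall:", recall(counts))
```

Lowering the threshold flags more students, which catches more of those at risk at the cost of more false alarms. Which trade-off is acceptable depends on how costly a missed student is compared with an unnecessary intervention.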
Process
Final Deliverable(s)
All visualizations and analyses are compiled into an Exploratory report, which is published online here:
https://exploratory.io/note/nwb6AhO8eK/Assignment-II-Report-xSM4HNA7BS