This post will demonstrate how to use the “caret” library in R to set up a simple machine learning model.
We’ll use the model we build to predict which drivers will finish on the podium (1st, 2nd or 3rd) at the next race, the Japan Grand Prix.
Enjoy!
1. Gather the F1 Data
Set up the same data frame as in the Regression model, including generating the “Prior Year Win Rate” and “Prior Year Podium Rate” for each driver and each race. We will use these same features in our machine learning model.
With one line of code, we can get an additional feature, driver age at the time of each race.
This is just the difference between the driver’s d.o.b. and the race date.
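A minimal sketch of that calculation, assuming the data frame is called f1 and the date of birth and race date are stored as Date columns named dob and date (those names are assumptions):

```r
# Assumed columns: "date" (race date) and "dob" (driver date of birth), both of class Date
# Subtracting Dates gives a difference in days; dividing by 365.25 converts to years
f1$driver_age <- as.numeric(f1$date - f1$dob) / 365.25
```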
At this point, we can drop everything from the data frame that we won’t be using.
All we need is a table with the variable we want to predict, “podium” (0 = no podium, 1 = podium), as well as our features: “prior year win rate”, “prior year podium rate”, and “driver age”.
We are using “podium” instead of “win” because there are three times as many podium finishes as race wins, which gives the model more positive examples to learn from during training.
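A sketch of that column selection, assuming the same f1 data frame and hypothetical names for the two rate features:

```r
library(dplyr)

# Keep only the outcome and the three features (column names are assumptions)
f1 <- f1 %>%
  select(podium, prior_year_win_rate, prior_year_podium_rate, driver_age)
```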
The output table has 5581 observations (rows) and looks like this:
2. Build the Machine Learning Model
What we want is a classification model, something that can take our features and classify the result into one of two outputs: podium, or no podium.
Typical examples of classification models are algorithms called “K-Nearest Neighbors”, “Support Vector Machines” and “Decision Trees.”
All of these algorithms work by minimizing the errors of the predictions they are making. This is the same concept as a regression (ordinary least squares involves minimizing the sum of squared errors).
We will use a popular classification algorithm called a Random Forest, which is an ensemble of many Decision Trees whose individual votes are combined into a single prediction.
Load the “caret” library, which contains all of the machine learning functions that make doing this in R so easy.
Sometimes caret will prompt you to install another library for the method you’ve chosen, which in our case turned out to be “kernlab”.
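Loading caret looks like this (install it first if you don’t already have it, along with kernlab if caret asks for it):

```r
# install.packages(c("caret", "kernlab"))  # run once if not already installed
library(caret)
```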
Keep in mind our data frame is the four-column table above. Caret wants the outcome variable (“podium”) to be a Factor, not a number, so we change the 1s and 0s to “yes” and “no” for podium finishes.
outcomeName is now our column of yes and no values, and predictorNames holds the other three columns in the table: prior year win rate, prior year podium rate and driver age.
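Here is one way to do both steps, assuming the same column names as above:

```r
# Recode the 1/0 outcome as a factor with "yes"/"no" levels, as caret expects
f1$podium <- factor(ifelse(f1$podium == 1, "yes", "no"), levels = c("yes", "no"))

outcomeName    <- "podium"
predictorNames <- c("prior_year_win_rate", "prior_year_podium_rate", "driver_age")
```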
The next step is to split the data into training and test sets.
80% of the data will be used to train the model, and the remaining 20% will be used to compare predictions against actual podium finish outcomes, to get a measure of accuracy for the model.
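An 80/20 split can be done with caret’s createDataPartition, which keeps the proportion of podium finishes roughly the same in both sets; something like:

```r
set.seed(1234)  # for reproducibility (the seed value is arbitrary)

# Indices for an 80% training sample, stratified on the podium outcome
splitIndex <- createDataPartition(f1[[outcomeName]], p = 0.80, list = FALSE)
trainDF <- f1[splitIndex, ]
testDF  <- f1[-splitIndex, ]
```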
To train the model, you first set some parameters with the trainControl function.
We will use cross-validation, which splits the training data into several folds; the model is trained on all but one fold and checked against the held-out fold, rotating through the folds to get a more reliable estimate of how well it generalizes.
Then it’s just a matter of using the train function, identifying the columns that are features (predictorNames) and the column that is podium finish (outcomeName).
The method ‘rf’ is random forest; caret supports hundreds of other model types through this same argument.
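Putting those two steps together, a sketch using 5-fold cross-validation (the number of folds is an assumption):

```r
# Set up 5-fold cross-validation on the training set
fitControl <- trainControl(method = "cv", number = 5, classProbs = TRUE)

# Train a random forest; caret may prompt you to install the randomForest package
rfModel <- train(trainDF[, predictorNames],
                 trainDF[[outcomeName]],
                 method    = "rf",
                 trControl = fitControl)
```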
Now that the model is trained, it can be used to make predictions.
Podium predictions are made by passing the trained model and the Test Data to the predict function.
These predictions can then be compared against the actual podium finish results in the Test Data.
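Something along these lines, using caret’s confusionMatrix to summarize how the predictions stack up against the actual outcomes:

```r
# Predict podium outcomes for the held-out test set
predictions <- predict(rfModel, newdata = testDF[, predictorNames])

# Compare predictions against the actual results
confusionMatrix(predictions, testDF[[outcomeName]])
```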
Our model was able to accurately predict 81% of podium outcomes in the Test Data.
3. Podium Predictions for the Japan Grand Prix
We’ll use the model to make predictions for who will finish on the podium in Japan this weekend.
First, we need a table that has the input data for the 20 current drivers, including their performance histories and age (the features in the model).
This table is called “Japan”.
Use the predict function again, applying the trained model to the Japan table to make predictions.
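Assuming the Japan table has the same three feature columns (and a driver column for labelling, which is an assumption), the call looks like:

```r
# Predict podium finishes for the upcoming race
japanPredictions <- predict(rfModel, newdata = Japan[, predictorNames])

# Line the predictions up with the driver names ("driver" column is assumed)
data.frame(driver = Japan$driver, predicted_podium = japanPredictions)
```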
The predicted podium finishers are Red Bull’s Sergio Perez (last week’s winner, by the way) and Max Verstappen, along with Ferrari’s Carlos Sainz.
Time to contemplate betting some money on these predictions!