In this post, we will use the Formula 1 World Championship Dataset (available on kaggle.com) to estimate the race win probabilities for each driver in a hypothetical next race.

The model we will use is a “multinomial logistic” regression. Multinomial means we have multiple possible outcomes (20 drivers who can win). Logistic means that a linear combination of the features, weighted by the fitted regression coefficients, is passed through an exponential (logistic) transform to give each driver a win probability.

R code can be downloaded at github.com/f1datadriver

Post questions and comments below!

Step 1: Gather F1 Data into One Dataframe

The only libraries needed are lubridate (for some date/time math), nnet (for the regression) and ggplot2 (for the chart at the end).

Use only races since 2018, because we are targeting the race records of the current set of 20 drivers and 10 constructors (teams).
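As a minimal sketch of this step, the snippet below joins race results to race metadata and keeps 2018 onward. The two small dataframes are toy stand-ins for the dataset's races.csv and results.csv files, and the column names (raceId, year, driverId, constructorId, positionOrder) are assumptions about their layout, so check them against your download.

```r
# Toy stand-ins for read.csv("races.csv") and read.csv("results.csv");
# column names are assumed, not confirmed by this post.
races <- data.frame(
  raceId = 1:4,
  year   = c(2016, 2018, 2019, 2021),
  name   = c("Race W", "Race X", "Race Y", "Race Z")
)
results <- data.frame(
  raceId        = c(1, 1, 2, 2, 3, 3, 4, 4),
  driverId      = c(10, 20, 10, 20, 10, 20, 10, 20),
  constructorId = c(100, 200, 100, 200, 100, 200, 100, 200),
  positionOrder = c(1, 2, 2, 1, 1, 3, 4, 1)
)

# One row per driver per race, with the race year attached
df <- merge(results, races, by = "raceId")

# Keep only races from 2018 onward
df <- df[df$year >= 2018, ]
```

Driver and constructor names can be attached the same way, with one more merge() per lookup table.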

Account for a few team name changes that have occurred since 2018.
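One way to handle the renames is a lookup vector that maps each old name to the team's current name. This is only a sketch: the name strings below are assumptions about how the dataset labels each constructor, and the toy df stands in for the real one.

```r
# Toy rows; the constructor labels are assumed spellings
df <- data.frame(
  year        = c(2018, 2019, 2020, 2021),
  constructor = c("Force India", "Toro Rosso", "Renault", "Aston Martin"),
  stringsAsFactors = FALSE
)

# Old name -> current name (teams that rebranded since 2018)
name_map <- c(
  "Force India"  = "Aston Martin",
  "Racing Point" = "Aston Martin",
  "Toro Rosso"   = "AlphaTauri",
  "Sauber"       = "Alfa Romeo",
  "Renault"      = "Alpine F1 Team"
)

# Rewrite only the rows whose name appears in the map
renamed <- df$constructor %in% names(name_map)
df$constructor[renamed] <- unname(name_map[df$constructor[renamed]])
```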

Drop data for drivers who have raced since 2018, but are not current drivers. There are only a few.
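A simple filter does this. In the sketch below, the driver identifiers are illustrative, and current_drivers is a hypothetical vector that would list all 20 current drivers in practice.

```r
# Toy rows; driver identifiers are illustrative
df <- data.frame(
  driver        = c("hamilton", "max_verstappen", "raikkonen", "hamilton", "kvyat"),
  positionOrder = c(1, 2, 3, 1, 9),
  stringsAsFactors = FALSE
)

# Hypothetical list of current drivers (all 20 in the real script)
current_drivers <- c("hamilton", "max_verstappen")

# Keep only rows that belong to a current driver
df <- df[df$driver %in% current_drivers, ]
```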

Now the dataframe called “df” has the race records for the 20 current F1 drivers for races since 2018.

Step 2: Build Features for Driver and Constructor Performance

For each driver, compute the “win rate” (share of races finished 1st) and “podium rate” (share of races finished 1st, 2nd or 3rd) across the full dataframe, so each driver gets one constant value per feature. Pull these into a separate dataframe called driver_records.
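Here is a small sketch of the calculation using base R. The finishing-position column name (positionOrder) is an assumption, and the data is a toy example with two drivers.

```r
# Toy race records: two drivers, four races each
df <- data.frame(
  driver        = c("A", "A", "A", "A", "B", "B", "B", "B"),
  positionOrder = c(1, 2, 5, 1, 3, 1, 8, 4)
)

# 0/1 indicators for a win and for a podium finish
df$win    <- as.integer(df$positionOrder == 1)
df$podium <- as.integer(df$positionOrder <= 3)

# The mean of a 0/1 indicator is a rate
driver_records <- aggregate(cbind(win, podium) ~ driver, data = df, FUN = mean)
names(driver_records) <- c("driver", "win_rate", "podium_rate")

# Attach the constant rates back onto every race row
df <- merge(df, driver_records, by = "driver")
```

Driver A ends up with a win rate of 0.5 and a podium rate of 0.75. Driver B gets 0.25 and 0.5.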

The feature for constructor performance will be each team's total points from last season (2021). That is just ten numbers, which we enter manually and then merge into the dataframe df.
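The sketch below enters the ten totals by hand and merges them in. The points are the 2021 constructor standings as we recall them, so double-check them against the official table. The constructor name strings, the column name team_points_2021 and the three toy driver rows are assumptions.

```r
# 2021 constructor points, entered manually (verify before relying on them)
team_points <- data.frame(
  constructor = c("Mercedes", "Red Bull", "Ferrari", "McLaren",
                  "Alpine F1 Team", "AlphaTauri", "Aston Martin",
                  "Williams", "Alfa Romeo", "Haas F1 Team"),
  team_points_2021 = c(613.5, 585.5, 323.5, 275, 155, 142, 77, 23, 13, 0),
  stringsAsFactors = FALSE
)

# Toy stand-in for df: one row per race entry
df <- data.frame(
  driver      = c("hamilton", "max_verstappen", "leclerc"),
  constructor = c("Mercedes", "Red Bull", "Ferrari"),
  stringsAsFactors = FALSE
)

# Left join so every race row keeps its place and gains the team feature
df <- merge(df, team_points, by = "constructor", all.x = TRUE)
```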

Make one more dataframe merging driver and constructor performance records.
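As a rough sketch, two merges produce a table with one row per driver. All names and numbers here are placeholders, and the dataframe names driver_teams and records are hypothetical.

```r
# Placeholder inputs (not real drivers or teams)
driver_records <- data.frame(
  driver      = c("A", "B", "C"),
  win_rate    = c(0.30, 0.10, 0.00),
  podium_rate = c(0.60, 0.25, 0.05)
)
driver_teams <- data.frame(
  driver      = c("A", "B", "C"),
  constructor = c("Team X", "Team X", "Team Y")
)
team_points <- data.frame(
  constructor      = c("Team X", "Team Y"),
  team_points_2021 = c(500, 120)
)

# Driver features, then team features, one row per driver
records <- merge(driver_records, driver_teams, by = "driver")
records <- merge(records, team_points, by = "constructor")
records <- records[order(records$driver),
                   c("driver", "constructor", "win_rate",
                     "podium_rate", "team_points_2021")]
```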

Step 3: Regression

With a 0/1 dependent variable (y), the multinomial logistic regression reduces to an ordinary binary logistic regression: the log-odds of the outcome are modeled as a linear function of the features. In this case, 1 is a race win and 0 is a loss.

Note that:

y = win or loss

X = [x1, x2, x3] = [driver win rate, driver podium rate, team points in 2021]

We want to estimate a β coefficient for each x, plus an intercept.

The function “multinom” from the library nnet fits this type of regression in one line. Note that all of the variables are in our dataframe “df”.
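A minimal sketch of the call is below. Since the real df is not reproduced here, it fits the model on simulated data, and the column names win, win_rate, podium_rate and team_points are assumed. The “true” coefficients used to simulate wins are arbitrary.

```r
library(nnet)

# Simulated stand-in for df (arbitrary values, for illustration only)
set.seed(1)
n  <- 400
df <- data.frame(
  win_rate    = runif(n, 0, 0.4),
  podium_rate = runif(n, 0, 0.7),
  team_points = runif(n, 0, 600)
)
eta    <- -3 + 4 * df$win_rate + 1 * df$podium_rate + 0.002 * df$team_points
df$win <- rbinom(n, 1, plogis(eta))

# The one-liner: win (0/1) regressed on the three features
fit <- multinom(win ~ win_rate + podium_rate + team_points,
                data = df, trace = FALSE)

summary(fit)        # prints coefficient values and standard errors
beta <- coef(fit)   # intercept plus one coefficient per feature
```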

The “Values” column holds our regression coefficients, β. There is one for the intercept and one for each independent variable.

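The probability of outcome i (here, driver i winning) in a multinomial logistic regression is:

P(i) = exp(β · X_i) / Σ_k exp(β · X_k)

where X_i is the feature vector for driver i and the sum runs over all drivers k in the race.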

Therefore we can convert our Xs and βs calculated above to a win probability for each driver.

The variable output$prob is the win probability for each driver.
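One way to sketch the conversion is to compute each driver's linear score from the Xs and βs, exponentiate it, and divide by the total across all drivers so the probabilities sum to 1. Normalizing across the grid like this is our assumption about the calculation. The coefficients and driver rows below are placeholders, not fitted results.

```r
# Placeholder coefficients (not the fitted values from the model)
beta <- c("(Intercept)" = -3, win_rate = 4, podium_rate = 1, team_points = 0.002)

# Placeholder feature rows, one per driver
output <- data.frame(
  driver      = c("A", "B", "C"),
  win_rate    = c(0.30, 0.10, 0.00),
  podium_rate = c(0.60, 0.25, 0.05),
  team_points = c(500, 500, 120)
)

# Design matrix with a leading column of 1s for the intercept
X <- cbind(1, as.matrix(output[, c("win_rate", "podium_rate", "team_points")]))

# Linear score per driver, then normalize the exponentials across drivers
eta <- as.vector(X %*% beta)
output$prob <- exp(eta) / sum(exp(eta))
```

Because the intercept is the same for every driver, it cancels in the normalization, so only the feature coefficients affect the ranking.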

To better visualize the result, make a chart:
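A horizontal bar chart sorted by probability works well here. This is a sketch with ggplot2, and the three rows in output are made-up placeholders rather than model results.

```r
library(ggplot2)

# Placeholder probabilities (not model output)
output <- data.frame(
  driver = c("Driver A", "Driver B", "Driver C"),
  prob   = c(0.5, 0.3, 0.2)
)

# Sort bars by probability and flip so driver names are easy to read
p <- ggplot(output, aes(x = reorder(driver, prob), y = prob)) +
  geom_col() +
  coord_flip() +
  labs(x = "Driver", y = "Win probability",
       title = "Estimated win probability, hypothetical next race")

print(p)
```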

The model gives a lot of credit to Lewis Hamilton for his “win rate” and “podium rate” over the full sample of races since 2018. Max Verstappen’s recent success is starting to appear as well.

Upcoming improvements include making the driver and constructor features dynamic, so that each race entry uses the records up to that race rather than a single constant. We will also add features such as driver age and DNF records.
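As a rough idea of what a dynamic feature could look like, the sketch below computes each driver's win rate using only the races before the current one. The column names (round, win, win_rate_to_date) and the toy data are assumptions, and this is one possible approach rather than the planned implementation.

```r
# Toy race-by-race records for two drivers
df <- data.frame(
  driver = c("A", "A", "A", "B", "B", "B"),
  round  = c(1, 2, 3, 1, 2, 3),
  win    = c(1, 0, 1, 0, 1, 0)
)
df <- df[order(df$driver, df$round), ]

# Starts and wins strictly before each race, computed per driver
df$prior_starts <- ave(df$win, df$driver, FUN = function(w) seq_along(w) - 1)
df$prior_wins   <- ave(df$win, df$driver, FUN = function(w) cumsum(w) - w)

# Win rate to date; NA for a driver's first race, when there is no history yet
df$win_rate_to_date <- ifelse(df$prior_starts > 0,
                              df$prior_wins / df$prior_starts, NA)
```

Using only prior races keeps the outcome of a race from leaking into its own feature.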

Hopefully this example made it easy to see how to set up the math and the code.

Please leave questions or comments!
