This is an update to the first post: A Regression Model for Driver Win Probability.

In that post, we treated the lifetime win rates and podium rates as constants for each driver.

Although this made setting up the regression model easier, we got the unlikely result of Lewis Hamilton having the highest win probability in a hypothetical next race.

As an improvement, we will upgrade the win rate and podium rate features to be dynamic year by year.

For example, we will regress all of the 2018 wins against 2017 driver performance, 2019 wins against 2018 performance, and so on to obtain our regression coefficients and then driver win probabilities.

Step 1: Import Data, Create Win & Podium Variables

This is the same first step we took in the first post:
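
As a rough sketch of what this step produces, the snippet below builds the two indicator variables on a tiny made-up table. The table and its column names (driver, date, position) are assumptions for illustration only, not the actual dataset or code from the first post.

```r
# Toy results table. Rows and column names are made up for illustration.
results <- data.frame(
  driver   = c("A", "B", "C", "A", "B", "C"),
  date     = as.Date(c("2021-03-28", "2021-03-28", "2021-03-28",
                       "2021-04-18", "2021-04-18", "2021-04-18")),
  position = c(1, 2, 4, 3, 1, 2)
)

# win = 1 if the driver finished first, 0 otherwise
results$win <- as.integer(results$position == 1)

# podium = 1 if the driver finished in the top three, 0 otherwise
results$podium <- as.integer(results$position <= 3)
```

Coding both outcomes as 0/1 is convenient later on, because the mean of a 0/1 column is simply the rate.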

Step 2: Get Year-by-Year Performance Records for Each Driver

The “for loop” below may look cumbersome, but it is far less tedious and typo-prone than building a table of driver records for each year one by one. Doing it this way saves about 100 lines of repetitive code (picture the body of the “for loop” written out six times instead of once).

First define the years you want the loop to cycle through, then calculate the win rate and podium rate by driver for each year.

Store the results in a “list” of tables called driver_records.

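As a reminder, the “lubridate” library is for date/time math. Note that “as.Date” itself is a base R function; lubridate adds helpers on top of it, such as “year” for pulling the season out of a race date.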
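
A minimal sketch of such a loop is shown here on made-up data, assuming the main table already has 0/1 columns named win and podium. Apart from driver_records, every name here is an assumption. It uses base R only; lubridate's year(results$date) would return the same year as the format() call.

```r
# Toy race results for two drivers over two seasons (made-up data).
results <- data.frame(
  driver = rep(c("A", "B"), times = 4),
  date   = as.Date(c("2020-07-05", "2020-07-05", "2020-07-12", "2020-07-12",
                     "2021-03-28", "2021-03-28", "2021-04-18", "2021-04-18")),
  win    = c(1, 0, 0, 1, 0, 1, 0, 1),
  podium = c(1, 1, 1, 1, 1, 1, 0, 1)
)

# Season of each race, taken from the race date
results$year <- as.integer(format(results$date, "%Y"))

# Years the loop cycles through
years <- c(2020, 2021)

# One table of driver records per year, stored in a list
driver_records <- list()

for (yr in years) {
  season <- results[results$year == yr, ]

  # The mean of a 0/1 indicator is the rate, computed per driver
  rec <- aggregate(cbind(win, podium) ~ driver, data = season, FUN = mean)
  names(rec) <- c("driver", "win_rate", "podium_rate")
  rec$year <- yr

  driver_records[[as.character(yr)]] <- rec
}
```

Each element can then be pulled out by year, for example driver_records[["2021"]].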

Step 3: Add Year-by-Year Driver Performances to the Main Dataframe

With year-by-year performances added, we can create features called “Prior Year Win Rate” and “Prior Year Podium Rate”
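
One way to line up each race with the previous season's record is to shift the record's year forward by one and then join on driver and year. The sketch below does this on made-up data. The column names are assumptions, and the yearly records are assumed to be already stacked into a single table (for example with do.call(rbind, driver_records)).

```r
# Yearly driver records, stacked into one table (made-up numbers).
records <- data.frame(
  driver      = c("A", "B", "A", "B"),
  year        = c(2020, 2020, 2021, 2021),
  win_rate    = c(0.5, 0.5, 0.0, 1.0),
  podium_rate = c(1.0, 1.0, 0.5, 1.0)
)

# Main table: one row per driver per race (made-up, one race per season).
results <- data.frame(
  driver = c("A", "B", "A", "B"),
  year   = c(2021, 2021, 2022, 2022),
  win    = c(0, 1, 0, 1)
)

# Shift each record forward one year so the 2020 record matches 2021 races
prior <- records
prior$year <- prior$year + 1
names(prior)[names(prior) == "win_rate"]    <- "prior_year_win_rate"
names(prior)[names(prior) == "podium_rate"] <- "prior_year_podium_rate"

# Left join: keep every race, attach the prior-year features where they exist
results <- merge(results, prior, by = c("driver", "year"), all.x = TRUE)
```

With all.x = TRUE, a driver with no record in the previous season keeps their rows but gets NA for both features, so those rows need to be handled before fitting.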

Step 4: Regression

For this regression we will drop team records and use only prior year driver performance, i.e. win rate and podium rate. Record the βs (coefficients), including for the intercept.
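
For illustration, here is a sketch of fitting the model and recording the coefficients, assuming a logistic regression fitted with base R's glm. The data is simulated, and the column names and the "true" coefficients used to simulate it are invented for the example.

```r
set.seed(1)
n <- 200

# Simulated prior-year features (made-up ranges, for illustration only)
races <- data.frame(
  prior_year_win_rate    = runif(n, 0, 0.5),
  prior_year_podium_rate = runif(n, 0, 0.9)
)

# Simulate 0/1 wins from invented coefficients so the fit has something to find
true_logit <- -3 + 4 * races$prior_year_win_rate + 1 * races$prior_year_podium_rate
races$win  <- rbinom(n, 1, plogis(true_logit))

# Logistic regression of win on the two prior-year features
fit <- glm(win ~ prior_year_win_rate + prior_year_podium_rate,
           data = races, family = binomial)

# Named vector of coefficients; the first element is the intercept
betas <- coef(fit)
print(betas)
```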

Next, multiply the Xs by the βs to get logits, and convert them to probabilities.
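
The arithmetic can be sketched as follows. The coefficients and the two driver rows are made-up numbers chosen to keep the sums easy to follow; they are not the fitted values from this post.

```r
# Made-up coefficients: intercept, prior-year win rate, prior-year podium rate
betas <- c(-3, 4, 1)

# One row per hypothetical driver; the leading 1 multiplies the intercept
X <- cbind(1,
           c(0.45, 0.05),   # prior-year win rates
           c(0.80, 0.10))   # prior-year podium rates

# Logit = intercept + beta1 * win rate + beta2 * podium rate
logits <- as.vector(X %*% betas)   # -0.4 and -2.7

# Inverse logit turns each logit into a probability between 0 and 1
probs <- 1 / (1 + exp(-logits))
```

Base R's plogis(logits) returns the same probabilities as the last line.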

Refresh the Driver Win Probability chart from the first post:

In this model, 2021 driver performance is the driver of 2022 win probability.

This differs from the last model, where lifetime driver performance was our independent variable.

Max’s 2021 record (45% wins, 82% podiums) is a bit better than Lewis Hamilton’s (36% wins, 77% podiums).

Charles’s 2021 record is not doing justice to the success he and Ferrari have had this year (2021: 0% wins, 5% podiums. 2022: 27% wins, 45% podiums).

Accordingly, this model gives the highest win probability for a 2022 race to Max, though not by much over Lewis Hamilton.

Certainly the model remains overly simplistic, but our upgraded feature for historical driver performance is behaving better and will be useful in what we build in future posts.