Tennis Match Forecasting

Open Thesis PDF

Tennis is a fascinating sport to analyze and model statistically. Not only is it a zero-sum game, but key pieces of information that may significantly influence the outcome are also known before each match starts. For example, we know the level of the match (Challenger event, ATP/WTA 1000, Grand Slam, etc.) and the surface it will be played on (indoor/outdoor hard, clay, grass). Roger Federer would have a higher probability of beating Rafael Nadal on grass, while Nadal would be favored on clay. Paired with time-variant historical information about each player's stats at the given level on the given surface, a forecast need not be a total toss-up.

My Master's thesis builds an Adaptive Least Squares forecasting model with Kalman filtering to generate win probabilities for professional tennis players. To do so, a statistical profile of each player was created, so that each unit of observation (each athlete) contains features describing his/her own stats as well as the stats of that athlete's historical opponents. In effect, this answers "what have I averaged, and what has my upcoming opponent averaged?" Features capturing what each player (and the respective opponent) gave up on average were also included. This produced a feature set four times the size of the baseline set.
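As a rough illustration of how such a player profile could be assembled, here is a minimal Python sketch. It is not the thesis code: the match records, the stat names (`aces`, `first_serve_pct`), and the helpers `profile` and `matchup_features` are all hypothetical. It shows how each baseline stat can expand into four features (the player's average, the average the player gave up, and the same two for the opponent), using only matches played before the forecast date.

```python
from statistics import mean

# Hypothetical records: one row per player per match (toy data, not thesis data).
MATCHES = [
    {"date": "2023-01-10", "player": "A", "opponent": "B", "aces": 10, "first_serve_pct": 0.62},
    {"date": "2023-01-10", "player": "B", "opponent": "A", "aces": 4, "first_serve_pct": 0.58},
    {"date": "2023-02-01", "player": "A", "opponent": "C", "aces": 6, "first_serve_pct": 0.66},
    {"date": "2023-02-01", "player": "C", "opponent": "A", "aces": 8, "first_serve_pct": 0.70},
    {"date": "2023-02-15", "player": "B", "opponent": "C", "aces": 2, "first_serve_pct": 0.60},
    {"date": "2023-02-15", "player": "C", "opponent": "B", "aces": 12, "first_serve_pct": 0.64},
]
BASE_STATS = ["aces", "first_serve_pct"]  # illustrative baseline stats


def profile(player, before, matches=MATCHES):
    """Average what `player` produced ("for") and what opponents produced
    against that player ("against"), using only matches dated before `before`."""
    own = [m for m in matches if m["player"] == player and m["date"] < before]
    allowed = [m for m in matches if m["opponent"] == player and m["date"] < before]
    out = {}
    for stat in BASE_STATS:
        out[f"{stat}_for"] = mean(m[stat] for m in own) if own else None
        out[f"{stat}_against"] = mean(m[stat] for m in allowed) if allowed else None
    return out


def matchup_features(player, opponent, before):
    """Combine both profiles: 4 features per baseline stat."""
    feats = {f"p_{k}": v for k, v in profile(player, before).items()}
    feats.update({f"o_{k}": v for k, v in profile(opponent, before).items()})
    return feats


features = matchup_features("A", "B", before="2023-03-01")
```

With two baseline stats this yields eight features. A fuller version would presumably also filter the history by surface and tournament level, which this sketch leaves out.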

The model ultimately achieved an accuracy a few percentage points higher than picking winners strictly by ranking (which is itself a fairly strong method). A program takes the names of two competing players and a tournament name, together with information specific to that tournament (surface type, level, etc.), and outputs a win probability for each player. Both R and Python were used to complete the project. The entire official document is displayed below.
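To make the comparison against the ranking baseline concrete, the following Python sketch scores two pick rules on the same matches. Everything in it is assumed for illustration: the helper names, the 0.5 decision threshold, and the four toy matches. The numbers are invented and do not reflect the thesis results.

```python
def rank_baseline_pick(rank_a, rank_b):
    """Pick the better-ranked player (a lower number is a better ranking)."""
    return "a" if rank_a < rank_b else "b"


def model_pick(prob_a):
    """Pick player a when the forecast win probability is at least 0.5."""
    return "a" if prob_a >= 0.5 else "b"


def accuracy(picks, winners):
    """Share of matches where the pick equals the actual winner."""
    correct = sum(p == w for p, w in zip(picks, winners))
    return correct / len(winners)


# Toy data (invented): (rank of a, rank of b, forecast P(a wins), actual winner)
toy_matches = [
    (1, 5, 0.70, "a"),
    (10, 3, 0.55, "a"),
    (2, 8, 0.60, "b"),
    (20, 15, 0.40, "b"),
]

winners = [m[3] for m in toy_matches]
baseline_acc = accuracy([rank_baseline_pick(m[0], m[1]) for m in toy_matches], winners)
model_acc = accuracy([model_pick(m[2]) for m in toy_matches], winners)
```

Scoring both rules on an identical set of held-out matches keeps the comparison fair, since any difference in accuracy then comes from the picks rather than from the sample.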