A silver medal solution to the NFL Big Data Bowl Kaggle competition
Warning: The top-1 solution won the competition by a large margin and was clearly far superior to mine. Another participant used an approach similar to mine and finished in 42nd place (I barely got a silver medal). I am writing this post mainly to record my thinking process.
I joined Kaggle many years ago, but had only been casually and quietly monitoring some competitions. I finally decided to commit some time to testing my machine learning skills on a real battlefield.
The goal of this competition is to predict how many yards an NFL player will gain after receiving a handoff, given tracking data on the game, the players, and the environment at the exact moment the player starts to run.
I chose the “NFL Big Data Bowl” competition mainly because:
- It is an interesting topic (I am a soccer fan, and football has some similarities with soccer);
- It is a code competition and the dataset is relatively small (181 MB), so the idea matters more than computing resources;
- The data is tabular, but rather than using xgboost with intensive feature engineering and parameter tuning, I could try an end-to-end approach based on a neural network/deep learning;
- The private leaderboard is calculated on future games, so there is no data leakage. It is also fun to watch the private leaderboard get updated every week according to the games played the previous week.
Solution rationale
Andrew Ng has a rule of thumb that
If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.
When watching soccer and football games, I can easily tell whether a player will carry the ball far forward or only a short distance, based on what I see on the field, e.g. how much space he has in front of him, the direction he is running, how fast he runs, how fast the defenders are running towards him, and how his teammates help him block/distract the defenders. So I was confident that deep learning could solve this problem.
The only difference between what I process when watching games on TV and the data in this competition is that I am watching a video, i.e. a time series of images, while the data record a static moment. Luckily, the data contain some information on the players' motion:
- S: speed in yards/second
- A: acceleration in yards/second^2
- Dis: distance traveled from the prior time point (0.1 seconds ago), in yards
- Dir: angle of player motion (degrees)
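To make these columns concrete, here is a minimal sketch of resolving the speed into X/Y components using Dir. The file name train.csv and the angle convention (degrees measured clockwise from the +Y axis) are my assumptions about the data format.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # the competition's tracking data

# Resolve speed S into X/Y components using the motion angle Dir.
# Assumption: Dir is in degrees, measured clockwise from the +Y axis;
# if the convention differs, swap sin/cos accordingly.
rad = np.deg2rad(df["Dir"])
df["Sx"] = df["S"] * np.sin(rad)
df["Sy"] = df["S"] * np.cos(rad)
```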
Model
The data can be transformed into image-like data, as the figure shows.
- Preprocess the data. Most importantly, flip coordinates so that the offense is always moving towards the right (increasing X direction). (A sketch of this step and the next two follows the list.)
- Using the rusher as the anchor, take a small 20-yard by 20-yard patch of the field “image” showing the players’ positions. Positions are rounded to 0.5 yards, so the patch has 40x40 pixels. It is a sparse matrix with a very small number of 1s. (The top-1 solution used the interactions between players as input tensors, which are dense. This is the major difference between my approach and theirs. That solution also made me think a similar approach could be very effective for analyzing gene expression data: using interactions between genes as input with 1x1 convolutions, as the top-1 solution did, we would not need to pay much attention to how to organize the genes, as we do in CNN or GCN approaches.)
- The motion data for each player were put into separate channels. I selected only S, A and Dis, based on my own cross-validation (CV). These channels are also very sparse. The defense and offense teams were separated, so there are 8 channels in total. The data are similar to a 2D image with RGB channels.
- Use a VGG-like convolutional neural network architecture, in the hope that it can learn the interactions between players (see the model sketch after this list).
- Concatenate some hand-engineered features with the convolutional features, and add a few dense layers after that. (The most useful features were the rusher’s yard gain after 0.5 or 1 seconds, assuming no one blocks him. I was too lazy to do much feature engineering. A lot of top solutions did an enormous amount of feature engineering and fed these features into a Transformer model, e.g. the 2nd- and 3rd-place solutions on the private leaderboard.)
- Use a softmax with 199 categories (99 yards backward, 99 yards forward, or no yard gain) as the output layer. The loss function is categorical cross-entropy. Early stopping is based on the validation CRPS.
- One game has multiple plays, so the CV folds were split by game rather than by individual play, i.e. GroupKFold. The 5-fold CV produced 5 models, and the final prediction was the mean of the 5 models' predictions (see the CV sketch below).
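A minimal sketch of the first three steps: standardizing the play direction and rasterizing one play into a 40x40x8 tensor. The column names PlayDirection, X and Y come from the competition data; everything else (helper names, the dict layout of players) is illustrative, and a real pipeline needs more care with angles and edge cases.

```python
import numpy as np

GRID = 40    # 20 x 20 yards at 0.5-yard resolution
HALF = 10.0  # the patch extends 10 yards from the rusher in each direction

def standardize(df):
    """Mirror plays moving left so the offense always moves towards
    increasing X. The field is 120 x 53.3 yards including end zones.
    Note: motion angles (Dir) must be mirrored accordingly as well."""
    left = df["PlayDirection"] == "left"
    df.loc[left, "X"] = 120.0 - df.loc[left, "X"]
    df.loc[left, "Y"] = 53.3 - df.loc[left, "Y"]
    return df

def rasterize_play(players, rusher_x, rusher_y):
    """players: list of dicts with keys x, y, s, a, dis, is_offense,
    already standardized. Returns the sparse 40x40x8 'image':
    channels 0-3 offense (occupancy, S, A, Dis), 4-7 defense."""
    img = np.zeros((GRID, GRID, 8), dtype=np.float32)
    for p in players:
        # Position relative to the rusher, rounded to 0.5-yard pixels.
        col = int(round((p["x"] - rusher_x + HALF) / 0.5))
        row = int(round((p["y"] - rusher_y + HALF) / 0.5))
        if not (0 <= row < GRID and 0 <= col < GRID):
            continue  # player is outside the 20x20-yard patch
        base = 0 if p["is_offense"] else 4
        img[row, col, base + 0] = 1.0        # occupancy
        img[row, col, base + 1] = p["s"]     # speed
        img[row, col, base + 2] = p["a"]     # acceleration
        img[row, col, base + 3] = p["dis"]   # distance since last frame
    return img
```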
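A sketch of the network and the metric (the VGG-like CNN, the concatenated hand features, and the 199-way softmax). The layer sizes are illustrative, not the ones I actually used; the crps function follows the competition's definition, comparing the predicted cumulative distribution with the step function at the true yardage.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_hand_features):
    """VGG-like CNN on the 40x40x8 patch, concatenated with
    hand-engineered features, ending in a 199-way softmax."""
    img_in = layers.Input(shape=(40, 40, 8))
    x = img_in
    for filters in (32, 64, 128):  # VGG-style: stacked 3x3 convs + pooling
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)

    feat_in = layers.Input(shape=(n_hand_features,))
    x = layers.Concatenate()([x, feat_in])
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(199, activation="softmax")(x)  # yards from -99 to +99

    model = tf.keras.Model([img_in, feat_in], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

def crps(y_true_yards, y_pred_proba):
    """Competition metric: mean squared difference between the predicted
    CDF and the Heaviside step at the true yardage, over the 199 bins."""
    cdf = np.cumsum(y_pred_proba, axis=1)
    n = np.arange(-99, 100)
    heaviside = (n[None, :] >= y_true_yards[:, None]).astype(np.float32)
    return float(np.mean((cdf - heaviside) ** 2))
```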
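And a sketch of the game-level CV, building on build_model above. X_img, X_feat, y and game_ids are assumed to be precomputed arrays; in my actual code, early stopping monitored the validation CRPS rather than the loss.

```python
import tensorflow as tf
from sklearn.model_selection import GroupKFold

def train_cv(X_img, X_feat, y, game_ids, n_splits=5):
    """Split by game (GroupKFold), so plays from one game never appear in
    both the training and validation folds. Returns one model per fold;
    the final prediction is the mean of their softmax outputs."""
    models = []
    for tr, va in GroupKFold(n_splits=n_splits).split(X_img, y, groups=game_ids):
        model = build_model(n_hand_features=X_feat.shape[1])
        early = tf.keras.callbacks.EarlyStopping(
            patience=5, restore_best_weights=True)  # stand-in for CRPS-based stopping
        model.fit([X_img[tr], X_feat[tr]], y[tr],
                  validation_data=([X_img[va], X_feat[va]], y[va]),
                  epochs=50, batch_size=64, callbacks=[early])
        models.append(model)
    return models
```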
Mistakes
- The variation of the CV error was quite large, which suggested there were multiple clusters of data that should be treated differently. I should have dug more into this. It turns out the 2017 data are very different from the 2018 and 2019 data (2017 and 2018 are in the training set, while the public and private leaderboards are based on 2019 data). Many top solutions used only the 2018 data for validation.
- I spent too much time on predicting rare events (large yard gains or losses). A lot of people simply ignored these rare events and clipped the target to, for example, -30 to 50 yards.
- I bet everything on one solution. For both competing and learning purposes, I should have tried as many approaches as possible, e.g. Transformers and xgboost.
Things I tried but didn’t work
(These conclusions were based solely on my own cross-validation, which was likely unreliable because of the first mistake above.)
- Calculate the players' positions and speeds after 0.5 or 1 seconds, and add those as additional channels (see the sketch after this list).
- Considering the quantitative order of the yard gains, I tried ordinal categorical cross-entropy and multiple sigmoids as the loss function.
- Data augmentation: flipping the data on the Y-axis (this actually worked in many other participants' solutions) and randomly drifting the players' positions a little.
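For reference, minimal sketches of two of these experiments. The constant-acceleration extrapolation below is also how the "yard gain after 0.5 or 1 seconds" hand feature in the Model section can be computed; all names are illustrative.

```python
import numpy as np

def extrapolate(x, y, sx, sy, ax, ay, t):
    """Position after t seconds assuming unblocked, constant-acceleration
    motion; used for the extra 'future position' channels."""
    return (x + sx * t + 0.5 * ax * t * t,
            y + sy * t + 0.5 * ay * t * t)

def flip_y(img):
    """Y-axis flip augmentation: with rows indexing Y in the 40x40xC patch,
    reversing the row order mirrors the play across the rusher's Y position."""
    return img[::-1, :, :]
```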