Using Dataiku to Predict the Upcoming College Football Season

CoreyS, Dataiker Alumni

Originally posted by Dataiku on September 1, 2021 by @fsilva

Every single year it feels like Alabama, Clemson, or both compete in the College Football Playoff National Championship. In fact, nine of the last 12 national championship games have featured at least one of those two schools. And when it is not one of them, it’s another perennial power in college football, such as Ohio State, LSU, or Georgia. There is never a surprise team, and a big-time underdog never plays for the title. The last team to play for its first national championship was Oregon in 2010 and, before that, Virginia Tech in 2000.

The vast majority of schools are content if they are just competitive within their own conference. The dream of a national championship is so far-fetched that fans of most schools likely have a second or third favorite team to root for come playoff time. This paradigm presents some problems.

Even before the global health crisis, college football attendance was declining. In my opinion, the fact that the same few schools compete for the national championship every year is a big reason for that decline. Why would a fan of a school with no hope of ever winning the most important trophy bother attending games, when the best they can hope for is going to a bowl game that doesn’t have a mockable name?

Wouldn’t the sport draw more viewers and sell more tickets if we doubled or tripled the number of schools with a realistic shot at winning it all? This year, there is evidence that this notion is true: Iowa State nabbed its highest preseason ranking in school history. They’ve never played for the national championship before, and they have their best chance ever this season. They also just happened to set a school record for most season tickets sold.

In the next section, I will walk through the results of two predictive models I built using Dataiku. These models try to predict the outcome of this upcoming season. One of the key takeaways from these models is that the outcome of every season is basically predetermined before it even starts.

Predicting the 2021 National Champion


Figure 1. Workflow of the Dataiku project used to make predictions for the upcoming college football season. The workflow starts with data ingestion, continues with cleaning and joining, and ends with two predictive models.

Dreams of winning the national championship are dashed for most schools before the season even starts. If you pay attention to high school football recruiting, then you will know that the perennial powers always get the best players. If your school is not consistently in the top 15 or so in high school recruiting rankings, then do not bother blocking your calendar for a potential New Year’s Day bowl game trip.

Using Dataiku’s AutoML capabilities, I tried to predict two events: the likelihood of a school making it to the national championship game and the likelihood of a school winning a national championship. The AutoML capabilities created a default model design and allowed me to easily configure the parameters and algorithms used to train the models. In the end, an artificial neural network was the champion model for both events. The training data consisted only of each school’s recruiting class rankings since 2006 from three different websites: ESPN, 247, and Rivals.
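To make the framing concrete, here is a minimal sketch of the same kind of problem: one row per school and season, three recruiting-rank features, and a yes/no label. It is not the Dataiku AutoML pipeline and it is not a neural network; it swaps in a tiny logistic regression trained by gradient descent, using only the Python standard library. The rows, the rank scaling, and the function names are all assumptions made for illustration, and the values are invented rather than real rankings or results.

```python
import math

# Hypothetical toy rows: (espn_rank, s247_rank, rivals_rank, made_title_game).
# These values are invented for illustration; they are NOT real rankings.
ROWS = [
    (1, 2, 1, 1),
    (3, 1, 2, 1),
    (2, 4, 3, 1),
    (5, 3, 6, 0),
    (12, 15, 10, 0),
    (20, 25, 22, 0),
    (35, 30, 40, 0),
    (50, 55, 48, 0),
]


def features(espn, s247, rivals):
    """Scale ranks to roughly [0, 1] so gradient descent behaves well."""
    return [espn / 100.0, s247 / 100.0, rivals / 100.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, lr=0.5, epochs=2000):
    """Fit logistic regression weights with plain batch gradient descent."""
    w = [0.0, 0.0, 0.0]
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        for espn, s247, rivals, label in rows:
            x = features(espn, s247, rivals)
            # Prediction error for this row (predicted probability - label).
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for i in range(3):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, espn, s247, rivals):
    """Return the modeled probability of the positive label."""
    w, b = model
    x = features(espn, s247, rivals)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


model = train(ROWS)
top_class = predict(model, 1, 1, 1)
mid_class = predict(model, 40, 40, 40)
print(f"top-ranked class: {top_class:.3f}, 40th-ranked class: {mid_class:.3f}")
```

On this toy data, the model assigns a higher probability to a school with top-ranked classes than to one ranked around 40th, which is the same qualitative pattern the article describes.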

Below is a list of the top schools’ chances of making the 2021 College Football Playoff National Championship and a list of the top schools’ chances of winning the championship:


Figure 2. Probability of schools winning the 2021-2022 College Football Playoff National Championship (built in Dataiku)


Figure 3. Probability of schools playing in the 2021-2022 College Football Playoff National Championship (built in Dataiku)

Examining the Models' Results

Both models omit some obvious variables that contribute to a team’s success, such as graduate transfers, players leaving early for the NFL, injuries, and the quality of coaches. Dataiku models are capable of handling hundreds of variables from various datasets. However, the point of the exercise was to show that recruiting data alone can accurately predict the likely winners, because winning can be simple: just have better players than the other team.

The schools that the Dataiku models predict to do well this season are essentially the same schools that the oddsmakers, the coaches, and the national media expect to do well. There are no surprises. Don’t look for hope if you’re a fan of a school that consistently fails to recruit the best players. Iowa State might buck this trend this year. However, I wouldn’t look at them as a sustainable model of success, since there can always be outliers in data. Plus, we’ll have to wait and see if they are able to live up to (and exceed) expectations.

Texas A&M has the highest chance among the schools that have never appeared in a national championship game. According to the model, they have a 7.1% chance to win the national championship. But unless your favorite school is willing to give its head coach $75 million, I wouldn’t look to Texas A&M as a beacon of hope either.

The Dataiku models make it obvious that the best teams are the ones that recruit the best high school players. It can get discouraging to be a fan of a school that doesn’t have a way to consistently recruit top talent. The recruiting paradigm might never change. However, that doesn’t mean the NCAA or the conferences don’t have the power to do something about the lack of competitive parity in college football. In the next section, I outline two possible steps they could take to mitigate the inevitability of the best recruiting schools being the only schools competing for a national championship.

Distributing the Odds

Year after year, the same schools get the best players out of high school. With Dataiku’s visualization capabilities, we can easily see that Alabama owes a lot of its success to its recruiting prowess. Schools like Alabama, Clemson, and Ohio State will continue to have the best odds of winning a national championship, at least in models like the Dataiku models above, as long as they keep bringing in the best players. However, these schools shouldn’t be punished for being able to attract top talent.


Figure 4. Average recruiting ranking from ESPN, 247, and Rivals for each school from 2006-2020 (built in Dataiku)
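For readers who want to reproduce this kind of summary outside Dataiku, the sketch below shows one way to compute an average recruiting ranking per school across sources and years. It assumes every source and every year carries equal weight, which the article does not state. The record layout, the placeholder school names, and the rank values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (school, year, source, class_rank).
# Placeholder schools and invented ranks, for illustration only.
RECORDS = [
    ("School A", 2019, "ESPN", 1),
    ("School A", 2019, "247", 2),
    ("School A", 2019, "Rivals", 3),
    ("School A", 2020, "ESPN", 2),
    ("School A", 2020, "247", 2),
    ("School A", 2020, "Rivals", 2),
    ("School B", 2019, "ESPN", 30),
    ("School B", 2019, "247", 28),
    ("School B", 2019, "Rivals", 32),
    ("School B", 2020, "ESPN", 25),
    ("School B", 2020, "247", 27),
    ("School B", 2020, "Rivals", 26),
]


def average_rank_by_school(records):
    """Average every rank for a school, pooling all sources and years."""
    ranks = defaultdict(list)
    for school, _year, _source, rank in records:
        ranks[school].append(rank)
    return {school: mean(values) for school, values in ranks.items()}


averages = average_rank_by_school(RECORDS)
# Lower average rank means stronger recruiting, so sort ascending.
for school, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{school}: {avg:.1f}")
```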

High school players should be allowed to play at whatever school they want if they’ve worked hard enough and proven worthy of a roster spot. Also, there should not be any restrictions on schools accumulating as much talent as they possibly can. However, changes should be put in place to better distribute the odds of schools making the national championship game.

Here are a couple of possible solutions:

1. Force the top recruiting schools to play each other.

The schools with the best high school recruiting rankings should not be allowed to schedule more than one non-conference game against a school with a recruiting ranking far below their own. Every year, top programs play multiple games against very weak opponents from weaker football conferences. Take the national champions from 2017-2019, for example:

  • In 2017, Alabama played Fresno State, Colorado State, and Mercer.
  • In 2018, Clemson played Furman and Georgia Southern (and plenty of weak ACC schools).
  • In 2019, LSU played Georgia Southern, Northwestern State, and Utah State.

The 2020 national champion (Alabama) did not play anyone out of conference because of the condensed pandemic season. This upcoming season, Alabama, once again the favorite, will play Mercer, Southern Miss, and New Mexico State.
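The proposed scheduling rule can be expressed as a simple check. The sketch below is one possible reading of it: the article says "far below" without giving a number, so the `gap` threshold of 25 ranking places is purely an assumption, as are the function names and the example ranks.

```python
def weak_nonconference_games(team_rank, opponent_ranks, gap=25):
    """Count opponents ranked more than `gap` places below the team.

    `gap` is a hypothetical threshold for "far below"; the article
    does not define one.
    """
    return sum(1 for rank in opponent_ranks if rank - team_rank > gap)


def schedule_allowed(team_rank, opponent_ranks, gap=25, max_weak=1):
    """Allow at most `max_weak` non-conference games against weak opponents."""
    return weak_nonconference_games(team_rank, opponent_ranks, gap) <= max_weak


# Invented recruiting ranks for a top program's non-conference opponents.
soft_schedule = schedule_allowed(1, [60, 90, 110])
hard_schedule = schedule_allowed(1, [10, 15, 90])
print(soft_schedule, hard_schedule)
```

A real rule would also need to decide how to treat opponents that have no recruiting ranking at all, which this sketch ignores.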

Forcing top schools to play harder schedules will increase parity and give schools that rarely reach the playoffs a much better chance to prove they’re the best in the country. There are usually one or two really good football schools outside a Power 5 conference that finish the regular season undefeated, like UCF in 2017 and 2018 or Cincinnati in 2020. These schools deserve a chance to actually compete for a national championship.

2. Expand the playoffs to six teams.

Give the top two ranked teams in the country a bye week, and have the schools ranked 3-6 play for a spot in the semifinals. Increasing the pool of teams for consideration will likely give the champion of each Power 5 conference a chance at a title, as well as an undefeated non-Power 5 team.
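As a rough illustration of the six-team format, the sketch below builds the opening round from a ranked list. The article only says that seeds 3-6 play for semifinal spots; pairing 3 against 6 and 4 against 5 is an assumption borrowed from common bracket conventions, and the function name and placeholder team names are hypothetical.

```python
def six_team_bracket(ranked_teams):
    """Split six teams (best first) into byes and first-round games.

    Assumes the pairing 3 vs. 6 and 4 vs. 5, which the article
    does not specify.
    """
    if len(ranked_teams) != 6:
        raise ValueError("expected exactly six teams, ordered best first")
    s1, s2, s3, s4, s5, s6 = ranked_teams
    return {
        "byes": [s1, s2],
        "first_round": [(s3, s6), (s4, s5)],
    }


bracket = six_team_bracket([f"Seed {n}" for n in range(1, 7)])
print(bracket["byes"])
print(bracket["first_round"])
```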

Dataiku’s neural network models helped make it very clear that recruiting alone is tremendously important in a school’s quest for a national championship. If the recruiting paradigm can’t change, then the governing bodies of college football should change what happens as a consequence of recruiting, for example by forcing top-ranked recruiting schools to play each other more often. I don’t expect solution No. 1 to ever happen, but I do expect No. 2 (or some form of expansion) to happen somewhat soon. Here’s hoping it happens sooner rather than later.

Check Out a Dataiku Demo

Not a Dataiku user yet? Want to try cool predictive analytics experiments like the one in this blog? Watch the 14-minute demo to see the full range of features and see if Dataiku is right for your use case.

