Oncrawl - Leveraging Dataiku for Predictive SEO as a Product Strategy

Team members:
Vincent Terrasi, Product Director
Elodie Mondon, Data Engineer
Damien Garaud, Data Scientist

Country:
France

Organization:
Oncrawl

Description:
Enterprise SEO platform powered by the industry-leading SEO Crawler and Log Analyzer. Combine the power of technical SEO, machine learning and data science for increased revenues from search engines.

Oncrawl offers two product suites to help you open Google’s black box and increase website revenues based on reliable SEO data.

Oncrawl Insights:
Unleash your SEO potential with prescriptive analysis. Unify your search data and improve your site’s traffic, rankings and online revenues:

  • Analyze your website like Google does, no matter how large or complex your website is.
  • Understand the impact of ranking factors on crawl budget and organic traffic.
  • Rely on 600+ indicators, advanced data exploration and actionable dashboards.

And Oncrawl Genius:
Empower your SEO with data science and automation. Use SEO data to build a more profitable business through BI, data science and machine learning:

  • Build custom solutions to business and marketing problems with our API
  • Use ready-made machine learning projects and adaptable models applied to SEO
  • Connect with Business Intelligence solutions for better strategic decision-making


Awards Category:

  • Alan Tuning


Challenge:


Due to the complexity of today’s markets, the growing opacity of search engine ranking algorithms, and the sheer volume of data affecting Search Engine Optimization (SEO), the ability to easily manipulate and analyze data now makes the difference between using it as a marketing tool and leveraging it as an executive-level product strategy.

In SEO, the goal is to rank pages at the top of search engine results. However, search engine ranking algorithms are based on many factors and generally constitute a black box. Our clients wanted to know the ranking factors that are most influential for their website.

This is the goal of predictive SEO. Exhaustive data (indexed pages, links, logs, etc.) is collected to train an ML model to recognize the patterns between ranking factors and actual page rank. It is designed to answer questions frequently encountered in the field: how to predict crawl budget, how to detect anomalies based on trends, how to generate SEO text, and so on.

Integrating technical SEO with a data science platform is the best solution to provide the most efficient and relevant insights to answer these questions.

Within the field, many different use cases are possible:

  • Identify new or unindexed content for real-time indexing requests
  • Generate SEO text
  • Report anomalies based on trends in your crawl results
  • Predict future long tail trends
  • Find ranking factors per URL or group of URLs
  • Monitor your crawl budget by category or subcategory to detect SEO issues
  • Detect the best new products for the next few weeks for featured highlights
  • Monitor your crawl budget for different Google bots to focus on the right technologies
  • And lots more!
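To make one of these use cases concrete, anomaly reporting against a crawl trend can be sketched as a trailing-window z-score test. This is a minimal illustration only; the metric, window size, and threshold below are arbitrary assumptions, not Oncrawl's implementation.

```python
# Flag anomalies in a crawl metric (e.g. daily pages crawled by Googlebot)
# against its recent trend, using a trailing-window z-score.
from statistics import mean, stdev

def find_anomalies(series, window=7, threshold=2.0):
    """Return indices where a value deviates from the trailing mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

crawled_pages = [120, 118, 125, 122, 119, 121, 123, 40, 120, 122]
print(find_anomalies(crawled_pages))  # [7] — the sudden drop to 40 is flagged
```

In production you would tune the window and threshold per site, since crawl volumes are often seasonal.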

Another challenge is access to SEO data and to data analysis skills in the field: few SEO specialists are also skilled in data analysis, and few data analysis platforms can easily interface with the sources of data used in SEO.


Solution:


In API-based solutions that pull data and then manipulate it in Python or R, usage is limited by calculation speed. Dataiku makes data manipulation simpler and more robust compared to traditional API usage and enables faster data integration.

The Oncrawl plugin for Dataiku provides a recipe enabling the easy export of URLs or aggregated data from crawls, as well as log monitoring events. Here's the step-by-step process:

Step 1: Import the data

You can retrieve different projects from Oncrawl and request the latest crawls. You can therefore use both data related to your site and data related to your competitors. This is not possible directly in Oncrawl, where each project corresponds to a specific website.
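Under the hood, this step boils down to a paged export from the Oncrawl REST API. As a rough, hypothetical sketch (the endpoint path, field names, and paging parameters here are assumptions for illustration, not the plugin's actual code):

```python
# Hypothetical sketch of building one page of a URL-level crawl export
# request against the Oncrawl REST API. Endpoint and fields are assumed.
from urllib.parse import urlencode

API_BASE = "https://app.oncrawl.com/api/v2"  # assumed base URL

def build_export_request(crawl_id, fields, offset=0, limit=1000):
    """Build the request URL for one page of a crawl's URL-level data."""
    params = {"fields": ",".join(fields), "offset": offset, "limit": limit}
    return f"{API_BASE}/data/crawl/{crawl_id}/pages?{urlencode(params)}"

print(build_export_request("abc123", ["url", "inrank", "depth"]))
```

In practice the plugin handles authentication and pagination for you; the point is simply that each project and crawl is addressed by an identifier, so several projects can be pulled side by side into one Dataiku flow.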


Step 2: Prepare the data

Then, you need to prepare the data: clean up missing data, rename columns, enrich the data if necessary.
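In Dataiku this is typically done with a visual Prepare recipe, but the same cleanup can be sketched in plain Python. The column names below (`crawl_depth`, `nb_words`) are made up for illustration:

```python
# Sketch of the preparation step: drop rows missing the key field,
# rename columns, and fill missing numeric values with defaults.
def prepare(rows):
    cleaned = []
    for row in rows:
        if not row.get("url"):  # drop rows with no URL
            continue
        cleaned.append({
            "url": row["url"],
            "depth": row.get("crawl_depth", 0) or 0,      # rename + fill
            "word_count": row.get("nb_words", 0) or 0,    # rename + fill
        })
    return cleaned

raw = [
    {"url": "https://example.com/", "crawl_depth": 1, "nb_words": 540},
    {"url": "", "crawl_depth": 2},                      # dropped: no URL
    {"url": "https://example.com/a", "nb_words": None}, # missing values filled
]
print(prepare(raw))
```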

Step 3: Add additional datasets

Beyond the data linked to the crawl, you can add data from other tools: keywords tool, backlink tool, etc.

Step 4: Merge the data

Then, you simply merge the data, i.e. join all the datasets on the URL. The goal is to understand what impacts the SEO of each URL or group of URLs. Once the final dataset is ready, you can create a visual analysis.
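As a minimal sketch of this merge step with pandas (column names invented for illustration), joining crawl, log, and keyword datasets on the URL key:

```python
# Join crawl data, log data, and keyword data on the shared "url" column.
import pandas as pd

crawl = pd.DataFrame({"url": ["/a", "/b"], "depth": [1, 2]})
logs = pd.DataFrame({"url": ["/a", "/b"], "googlebot_hits": [40, 3]})
keywords = pd.DataFrame({"url": ["/a"], "ranking_keywords": [12]})

merged = (crawl
          .merge(logs, on="url", how="left")
          .merge(keywords, on="url", how="left"))
# Pages absent from the keyword dataset get 0 instead of NaN.
merged["ranking_keywords"] = merged["ranking_keywords"].fillna(0)
print(merged)
```

Left joins keep every crawled URL even when a source (here, the keyword tool) has no row for it, which is the usual choice when the crawl is the reference dataset.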


Step 5: AutoML Prediction

You can click on ‘AutoML Prediction’: the interface helps you test the most efficient algorithms and recommends the best one. You must then choose the target variable the model should predict. This is an essential step, as you must determine it according to your needs. You will then see the results of the different algorithms on the same page and can compare their accuracy to select the best fit.

You should now have access to your results! For each of the algorithms used, an AUC score between 0 and 1 is available; the closer it is to 1, the better the results.
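For intuition, AUC can be read as the probability that a randomly chosen positive example (e.g. a page that reached the top 10) is scored higher than a randomly chosen negative one. A tiny pure-Python illustration on made-up model scores:

```python
# AUC as a pairwise-ranking probability: the fraction of positive/negative
# pairs where the positive example gets the higher score (ties count half).
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]              # 1 = page reached the top 10
scores = [0.9, 0.7, 0.4, 0.3, 0.6, 0.8]  # model's predicted probability
print(round(auc(labels, scores), 2))     # 0.78 — closer to 1 is better
```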

You can dive into the details to assess the accuracy and efficiency of the model, while 'Interpretation' will give you detailed explanations about the metrics. You can also look into which keywords boost or penalize each URL, which will help you determine where to focus your efforts, depending on each site and the metrics involved.



Impact:


The use cases we mentioned above are not "new" in data science or machine learning, but they are newly accessible to the SEO community. As SEO specialists often don't have advanced data skills, our work has made it possible for them to work visually with SEO data.

It also enables experienced data scientists and analysts to more easily obtain SEO data, which was, until now, not a typical type of data they had access to.

SEO is an established and growing field with an increasingly important role in business strategy. Improving data analysis and making this kind of data available for other purposes opens the door to more effective and broader-reaching strategies, as well as significant cost savings. These strategies rely on the ability to implement the use cases listed above.

For example, analyzing data related to ranking factors and keywords, combined with crawl data, can help identify URLs with textual content that should be improved. For one customer, rewriting meta descriptions through machine learning and text generation led to savings of 30 man-hours and 24,000 USD/month in SEO "production" costs alone.

This project has made it easier to get the data, implement machine learning with Dataiku, and train a broader audience of practitioners. In terms of productivity, the overall process is twice as efficient: everything done in Dataiku would previously have been developed in R or Python, tested extensively, and would have taken a long time to implement. In a few minutes, Dataiku outputs all the variables to prioritize, a detailed analysis for each variable, and, for each URL, what boosts or penalizes it.

Once the machine learning model is in place, we can add new URLs and know, even before publishing the content, whether it has a chance of reaching the top 10 rankings!

Up next: the value of machine learning for SEO

The next step in the democratization of machine learning for SEO will be to integrate the results of a Dataiku analysis directly into the tools and interfaces SEO users already know. Oncrawl is working on a big project to make these steps even easier, with fewer clicks for the final user. Stay tuned… 🙂
