How to do web scraping using Selenium in a Dataiku notebook

Ramya
Level 2

I have a task to perform web scraping in a Dataiku notebook, and for that purpose, I need to utilize ChromeDriver. However, I'm unsure about the process of installing ChromeDriver and integrating it into a Dataiku notebook. Is there a method to invoke ChromeDriver within a Dataiku notebook?

AlexT
Dataiker

Hi @Ramya ,

Your system administrator would need to install the following on the DSS server:
1) ChromeDriver
wget https://chromedriver.storage.googleapis.com/$(curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE)/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/local/bin/
Note that this download location only carries ChromeDriver up to version 114. For newer Chrome releases, a driver matching the installed Chrome version has to come from the Chrome for Testing downloads instead.

2) Google Chrome
sudo wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
sudo yum localinstall google-chrome-stable_current_x86_64.rpm
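
Once the admin has done the installation, it can be worth checking from the notebook itself that both binaries are visible to the Python kernel. The following is a minimal sketch, assuming the executables are on the kernel's PATH; the candidate names and the locate_binaries helper are my own illustration, not something DSS provides.

```python
import shutil

# Candidate executable names are assumptions; adjust them to whatever
# your administrator actually installed.
CANDIDATES = {
    "chromedriver": ["chromedriver"],
    "chrome": ["google-chrome", "google-chrome-stable", "chromium-browser", "chromium"],
}


def locate_binaries(candidates=CANDIDATES):
    """Return {label: absolute path or None}, searching the current PATH."""
    found = {}
    for label, names in candidates.items():
        paths = (shutil.which(name) for name in names)
        # Keep the first candidate name that resolves to an executable.
        found[label] = next((p for p in paths if p), None)
    return found


for label, path in locate_binaries().items():
    print(f"{label}: {path or 'NOT FOUND on PATH'}")
```

If either entry is reported as not found, the kernel's PATH may differ from the admin's shell, in which case passing the full path to the driver (for example /usr/local/bin/chromedriver) should still work.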

Then add the selenium package to a code environment, select that environment for the notebook, and use something like this:

import dataiku
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Written for the Selenium 4 API (Service, By, options=)
DOWNLOAD_DIR = "/tmp"
output_dataset = dataiku.Dataset("fitness2")

# Run Chrome headless and save downloads to DOWNLOAD_DIR without prompting
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_experimental_option("prefs", {
    "download.default_directory": DOWNLOAD_DIR,
    "download.prompt_for_download": False,
})
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"), options=chrome_options)

try:
    driver.get("https://www.browserstack.com/test-on-the-right-mobile-devices")
    download_csv = driver.find_element(By.CSS_SELECTOR, ".icon-csv")
    got_it = driver.find_element(By.ID, "accept-cookie-notification")
    got_it.click()        # dismiss the cookie banner first
    download_csv.click()  # start the CSV download
    time.sleep(5)         # give the download time to finish
except Exception as e:
    print("Scraping failed:", e)
    raise
finally:
    driver.quit()         # always shut down the browser and chromedriver

# read downloaded file and create dataset
devices_df = pd.read_csv(DOWNLOAD_DIR + "/BrowserStack - List of devices to test on.csv")
output_dataset.write_with_schema(devices_df)
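
A fixed time.sleep(5) can be too short on a slow connection and wasteful on a fast one. As an alternative, here is a minimal sketch of a polling helper. The wait_for_download function is hypothetical (it is not part of Selenium or Dataiku), and it assumes the download is finished once the file exists and its size has stopped changing between two checks.

```python
import os
import time


def wait_for_download(path, timeout=30.0, poll=0.5):
    """Return `path` once the file exists and its size is stable across two
    consecutive polls; raise TimeoutError if that does not happen in time."""
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size == last_size:
                return path  # size unchanged since the previous poll
            last_size = size
        time.sleep(poll)
    raise TimeoutError(f"{path} was not fully downloaded within {timeout} seconds")
```

In the script above, calling wait_for_download("/tmp/BrowserStack - List of devices to test on.csv") in place of time.sleep(5), while the browser is still open, would let the read step start as soon as the file is ready.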

 
