How to do web scraping using Selenium in a Dataiku notebook?

Ramya
Ramya Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 8

I have a task to perform web scraping in a Dataiku notebook, and for that purpose, I need to utilize ChromeDriver. However, I'm unsure about the process of installing ChromeDriver and integrating it into a Dataiku notebook. Is there a method to invoke ChromeDriver within a Dataiku notebook?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
    edited July 17

    Hi @Ramya,

    Your system administrator would need to install the following on the DSS server:

    1) ChromeDriver (its major version must match the installed Chrome version)
    wget https://chromedriver.storage.googleapis.com/$(curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE)/chromedriver_linux64.zip
    unzip chromedriver_linux64.zip
    sudo mv chromedriver /usr/local/bin/

    2) Google Chrome
    sudo wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
    sudo yum localinstall google-chrome-stable_current_x86_64.rpm
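
Once both are installed, a quick check from the notebook can confirm that the notebook's Python process can actually see them. This is only a sketch: the helper name `find_binaries` is hypothetical, and it assumes the executables are called `chromedriver` and `google-chrome` and are on the PATH of the user running DSS, which may differ from your admin's shell.

```python
import shutil

def find_binaries(names=("chromedriver", "google-chrome")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# Print where each binary was found (or that it is missing)
for name, path in find_binaries().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If `chromedriver` is reported as missing even though it sits in /usr/local/bin, that directory is probably not on the notebook's PATH; passing the full path to Selenium, as in the code further down, sidesteps this.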

    Then add selenium to a code environment and run something like the following in the notebook:

    import dataiku
    import pandas as pd
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    output_dataset = dataiku.Dataset("fitness2")

    # Download preferences: save files to /tmp without prompting
    prefs = {"download.default_directory": "/tmp", "download.prompt_for_download": False}
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")
    chrome_options.add_experimental_option("prefs", prefs)

    # Selenium 4 syntax; on Selenium 3, pass the driver path as the first argument instead
    driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"), options=chrome_options)

    try:
        driver.get("https://www.browserstack.com/test-on-the-right-mobile-devices")
        # Dismiss the cookie banner, then click the CSV download icon
        driver.find_element(By.ID, "accept-cookie-notification").click()
        driver.find_element(By.CSS_SELECTOR, ".icon-csv").click()
        time.sleep(5)  # give the download time to finish
    except Exception as e:
        print(f"Scraping failed: {e}")
    finally:
        # Always shut down the browser, even if something above failed
        driver.quit()

    # Read the downloaded file and write it to the output dataset
    devices_df = pd.read_csv("/tmp/BrowserStack - List of devices to test on.csv")

    output_dataset.write_with_schema(devices_df)
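
A fixed five-second sleep can be too short on a slow connection. As a rough alternative, you could poll until the file exists and its size has stopped changing. This is only a sketch: `wait_for_download` is a hypothetical helper, not part of Selenium or Dataiku, and "size unchanged between two polls" is a heuristic rather than a guarantee that the download is complete.

```python
import os
import time

def wait_for_download(path, timeout=30.0, poll_interval=0.5):
    """Return True once `path` exists, is non-empty, and has the same size
    on two consecutive polls; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False
```

In the code above, this could stand in for `time.sleep(5)`, called with the expected CSV path in /tmp, with the `pd.read_csv` step run only when it returns True.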
