How to do web scraping using Selenium in a Dataiku notebook?
I have a task to perform web scraping in a Dataiku notebook, and for that purpose, I need to utilize ChromeDriver. However, I'm unsure about the process of installing ChromeDriver and integrating it into a Dataiku notebook. Is there a method to invoke ChromeDriver within a Dataiku notebook?
Answers
Alexandru (Dataiker)
Hi @Ramya,
You would need your system administrator to install the following:
1) Download and install ChromeDriver
wget https://chromedriver.storage.googleapis.com/$(curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE)/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/local/bin/
2) Download and install Chrome
sudo wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
sudo yum localinstall google-chrome-stable_current_x86_64.rpm
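Once both installs are done, it can help to confirm from the notebook itself that the binaries are visible to the user running DSS. The snippet below is a minimal sketch using only the Python standard library. The helper name check_binary is made up for illustration, and the binary names "chromedriver" and "google-chrome" are assumptions based on the install commands above; adjust them if your system uses different names.

```python
import shutil
import subprocess


def check_binary(name, version_flag="--version"):
    """Return (path, version_text), or (None, None) if `name` is not on PATH.

    Hypothetical helper for illustration; not part of Dataiku or Selenium.
    """
    path = shutil.which(name)
    if path is None:
        return None, None
    try:
        result = subprocess.run(
            [path, version_flag], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # The file exists but could not be run (permissions, missing libraries, ...)
        return path, None
    return path, result.stdout.strip()


# Binary names are assumptions; change them to match your installation.
for binary in ("chromedriver", "google-chrome"):
    path, version = check_binary(binary)
    if path is None:
        print(binary, "was not found on PATH")
    else:
        print(binary, "found at", path, "version:", version)
```

If the two reported versions have different major numbers, Selenium will usually refuse to start the session, so it is worth comparing them before going further.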
Then add selenium to a code env and use something like this:
import time

import dataiku
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

output_dataset = dataiku.Dataset("fitness2")

# Chrome settings: run headless and save downloads to /tmp without prompting
prefs = {"download.default_directory": "/tmp", "download.prompt_for_download": False}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_experimental_option("prefs", prefs)

# Selenium 4 syntax. On Selenium 3, use instead:
# webdriver.Chrome('/usr/local/bin/chromedriver', chrome_options=chrome_options)
driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"), options=chrome_options)
try:
    driver.get("https://www.browserstack.com/test-on-the-right-mobile-devices")
    download_csv = driver.find_element(By.CSS_SELECTOR, ".icon-csv")
    got_it = driver.find_element(By.ID, "accept-cookie-notification")
    got_it.click()       # dismiss the cookie banner first
    download_csv.click()
    time.sleep(5)        # give the download time to finish
except Exception as exc:
    print("Scraping failed:", exc)
finally:
    driver.quit()        # always shut down the browser

# Read the downloaded file and write it to the output dataset
devices_df = pd.read_csv("/tmp/BrowserStack - List of devices to test on.csv")
output_dataset.write_with_schema(devices_df)
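The fixed time.sleep(5) above can return before a slow download has finished, in which case pd.read_csv fails or reads a partial file. One possible alternative is to poll the download directory. This is a minimal sketch, not part of the Selenium or Dataiku APIs: the helper name wait_for_download and its parameters are made up for illustration, and it relies on Chrome writing in-progress downloads with a .crdownload suffix.

```python
import os
import time


def wait_for_download(path, timeout=30.0, poll_interval=0.5):
    """Wait until `path` exists and no Chrome partial file is left next to it.

    Returns True if the file appeared within `timeout` seconds, else False.
    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(path) or "."
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Chrome keeps a *.crdownload file while a download is in progress
        in_progress = any(
            name.endswith(".crdownload") for name in os.listdir(directory)
        )
        if os.path.exists(path) and not in_progress:
            return True
        time.sleep(poll_interval)
    return False
```

In the script, the call wait_for_download("/tmp/BrowserStack - List of devices to test on.csv") could replace time.sleep(5), with the CSV read only if it returns True. Because /tmp is shared, an unrelated leftover .crdownload file would make the helper wait until the timeout, so a dedicated download directory is a safer choice.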