
Using selenium with a python recipe

Solved!
PaulMordant
Level 1

Hi, 

I have a basic script that retrieves a .csv file from a website using selenium. Here is a rough outline of how my code gathers the data:

 

from selenium import webdriver

# Initialize the driver; chrome_options (defined elsewhere) holds the options that enable file downloading
wd = webdriver.Chrome('/home/dataiku/chromedriver', chrome_options=chrome_options)

# get() returns None, so keep using wd itself for the following steps
wd.get("desired_url")

# ... the driver performs the login steps and browses to the .csv download button ...

wd.find_element_by_xpath("//button[@label='Download CSV']").click()

 

I developed this code in a Jupyter notebook inside my Dataiku project. Everything works correctly there and the file is downloaded to the desired directory. Once I switch to a Python recipe, however, the file does not seem to be downloaded, or at least it is not stored in the desired directory, and I can't find it anywhere on my local system.

Does anyone know what could explain why the download succeeds in a notebook but not in a Python recipe? Are there any mandatory additional steps I should take when downloading a file with selenium inside a Python recipe?

Best regards, 

Paul 

3 Replies
AlexT
Dataiker

Hi Paul,

Without the job diagnostics, it's hard to say why this would not work in a recipe.

It could be a different code env, or the recipe may be running in containerized execution while the notebook is not, or there may be a permissions issue.
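One way to narrow down which of these applies is to run the same few lines in both the notebook and the recipe and compare the output. The sketch below uses only the Python standard library; the helper names and the default `/tmp` directory are illustrative assumptions, not part of any Dataiku API.

```python
import getpass
import os
import sys


def current_user():
    """Return the OS user name, or 'unknown' if it cannot be resolved."""
    try:
        return getpass.getuser()
    except Exception:  # e.g. no passwd entry for the uid in some containers
        return "unknown"


def describe_runtime(download_dir="/tmp"):
    """Collect facts that can differ between a notebook and a recipe run."""
    return {
        "python": sys.executable,   # interpreter of the active code env
        "cwd": os.getcwd(),         # relative download paths resolve against this
        "user": current_user(),     # OS user running the process
        "dir_exists": os.path.isdir(download_dir),
        "dir_writable": os.access(download_dir, os.W_OK),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

If the interpreter, user, or directory checks differ between the two runs, that points at the code env, the execution context, or permissions respectively.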

You could try uploading the resulting file to a managed folder instead: save the file in /tmp or in a buffer, then use https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder.upload_stream

This would work in both local and containerized execution. If you still have issues, I would suggest opening a support ticket with the job diagnostics from the recipe run so that we can troubleshoot further.
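A minimal sketch of the managed-folder suggestion: download into a local scratch directory, then stream the file into the folder. `upload_stream(path, f)` is the method from the linked documentation; the helper name `push_to_folder`, the folder name `selenium_downloads` and the local path are hypothetical.

```python
import os


def push_to_folder(folder, local_path):
    """Stream a local file into a managed folder and return its path there.

    `folder` is expected to be a dataiku.Folder, but any object exposing
    upload_stream(path, file_object) works.
    """
    target = "/" + os.path.basename(local_path)
    with open(local_path, "rb") as stream:
        folder.upload_stream(target, stream)
    return target


# Usage inside a DSS recipe or notebook (folder name and path are hypothetical):
#
#   import dataiku
#   folder = dataiku.Folder("selenium_downloads")
#   push_to_folder(folder, "/tmp/export.csv")
```

Passing the folder in as an argument keeps the helper independent of where the code runs, so the same function can be used from a notebook and from a recipe.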

PaulMordant
Level 1
Author

Hi Alex, 

Thank you for your answer.

I believe the issue comes from a different code env because, by default, my notebooks and my recipe do not run on the same one. I tried to create a new code env that is basically a copy of the one Jupyter uses by default (the one that works), so that I can force my recipe and my notebooks to use the same env.

Unfortunately, I can't download the file with this new code env, whether from a notebook or from a recipe. Do you know of any special argument or option I should use when creating a code env to allow the download?

I tried to check what could be different between the 2 environments but I can't find anything.
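One possible way to compare two code environments is to dump the installed package versions in each and diff the results. The sketch below relies on the standard library's `importlib.metadata` (Python 3.8+); the function names are illustrative. It only covers Python packages, so differences in the Chrome or chromedriver binaries, or in the user running the process, would not show up here.

```python
from importlib import metadata


def installed_packages():
    """Return {package name: version} for the current Python environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            packages[name.lower()] = dist.version
    return packages


def diff_packages(env_a, env_b):
    """Return {name: (version in a, version in b)} for every mismatch."""
    names = set(env_a) | set(env_b)
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in sorted(names)
        if env_a.get(name) != env_b.get(name)
    }
```

Running `installed_packages()` once in each environment, saving both results (for example as JSON), and passing them to `diff_packages` lists every package that is missing or has a different version on one side.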

If more information is required to debug this, I'll open a support ticket as you mentioned.

Thank you, 

Paul

AlexT
Dataiker

Hi Paul, I would suggest we go the support route if you can't figure it out. 

It's hard to say what difference between the two code environments, or in your code, could cause it to fail. Dependency-wise, as long as selenium is added to the code env it should work; you also need Chrome installed and chromedriver available at the defined path. I tested on my end in both a recipe and a notebook with the same code env and it worked fine, using the following sample code. Please note that I installed Chrome and chromedriver for this to work.

import time

import dataiku
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

output_dataset = dataiku.Dataset("fitness2")

# Download settings are Chrome preferences, not command-line switches
prefs = {
    "download.default_directory": "/tmp",
    "download.prompt_for_download": False,
}

# Headless mode so it runs on a Linux server; requires the latest Chrome
# and a compatible chromedriver to be installed
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_experimental_option("prefs", prefs)

# Selenium 4 syntax; Selenium 3 takes the driver path as the first argument
driver = webdriver.Chrome(service=Service("/home/dataiku/chromedriver"), options=chrome_options)

try:
    driver.get("https://www.browserstack.com/test-on-the-right-mobile-devices")
    # Dismiss the cookie banner, then click the CSV download icon
    driver.find_element(By.ID, "accept-cookie-notification").click()
    driver.find_element(By.CSS_SELECTOR, ".icon-csv").click()
    time.sleep(5)  # crude wait for the download to finish
finally:
    # Always shut the browser down; any error is raised instead of being hidden
    driver.quit()

# read downloaded file and create dataset
cereal_df = pd.read_csv("/tmp/BrowserStack - List of devices to test on.csv")
output_dataset.write_with_schema(cereal_df)
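
The fixed `time.sleep(5)` in the sample can be too short on a slow connection. A possible alternative, sketched here with the standard library only, polls the download directory until a new file appears and none of the new files still carries Chrome's temporary `.crdownload` suffix. The helper name, its parameters and the timeout values are assumptions for illustration.

```python
import os
import time


def wait_for_download(directory, before, timeout=30.0, poll=0.5):
    """Return the path of a file that appeared in `directory` since `before`.

    `before` is the set of file names present before the download was
    triggered. Chrome writes in-progress downloads with a .crdownload
    suffix, so new files with that suffix count as "not finished yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        new_names = set(os.listdir(directory)) - before
        in_progress = any(n.endswith(".crdownload") for n in new_names)
        finished = sorted(n for n in new_names if not n.endswith(".crdownload"))
        if finished and not in_progress:
            return os.path.join(directory, finished[0])
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No completed download in {directory} after {timeout}s")
        time.sleep(poll)
```

To use it, take `before = set(os.listdir("/tmp"))` just before clicking the download button, then call `wait_for_download("/tmp", before)` in place of the sleep. A dedicated download directory is safer than `/tmp`, since unrelated files created there during the wait would also be picked up.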

 
