Using Selenium with a Python recipe
Hi,
I have a basic script that uses Selenium to retrieve a .csv file from a website. Here is an outline of how my code gathers the data:
# Initialize the driver; chrome_options contains all the options that enable file downloading
wd = webdriver.Chrome('/home/dataiku/chromedriver', chrome_options=chrome_options)
# get() returns None, so keep using wd for the following steps
wd.get("desired_url")
...............................................................
# The driver performs the login steps and browses to the .csv file button
...............................................................
wd.find_element_by_xpath("//button[@label='Download CSV']").click()
I developed this code in a Jupyter notebook inside my Dataiku project. Everything works correctly there, and the file is downloaded to the desired directory. But when I switch to a Python recipe, the file does not seem to be downloaded, or at least it is not stored in the desired directory, and I can't find it on my local system.
Does anyone have a clue what could explain why the download succeeds in a notebook but not in a Python recipe? Are there any mandatory additional steps I should consider when downloading a file using Selenium inside a Python recipe?
Best regards,
Paul
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
Hi Paul, I would suggest we go the support route if you can't figure it out.
It is hard to say what differs between the two code environments and your code that could cause the failure. Dependency-wise, it should work as long as selenium is added to the code env; you also need Chrome installed and ChromeDriver available at the defined path. I tested on my end in both a recipe and a notebook with the same code env, and it worked fine with the following sample code. Please note that I installed Chrome and ChromeDriver for this to work.
import dataiku
from selenium import webdriver
import time
import pandas as pd

# Selenium setup: download preferences
prefs = {"download.default_directory": "/tmp", "prompt_for_download": "false"}
output_dataset = dataiku.Dataset("fitness2")

# Options added to get it to run on a Linux server; installed the latest Chrome and a compatible ChromeDriver
chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--headless")
chromeOptions.add_argument("--download.prompt_for_download=false")
chromeOptions.add_argument("--download.default_directory=/tmp")
chromeOptions.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome('/home/dataiku/chromedriver', chrome_options=chromeOptions)

try:
    driver.get('https://www.browserstack.com/test-on-the-right-mobile-devices')
    downloadcsv = driver.find_element_by_css_selector('.icon-csv')
    gotit = driver.find_element_by_id('accept-cookie-notification')
    gotit.click()
    downloadcsv.click()
    time.sleep(5)
    driver.close()
except:
    print("Invalid URL")

# Read the downloaded file and create the dataset
cereal_df = pd.read_csv("/tmp/BrowserStack - List of devices to test on.csv")
output_dataset.write_with_schema(cereal_df)
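The sample above waits a fixed five seconds with time.sleep(5) before reading the file. A possible alternative, sketched here under assumptions, is to poll the download directory until the file is there. The helper name wait_for_download and its parameters are hypothetical, and it assumes Chrome keeps a temporary file ending in .crdownload in the same directory while a download is still in progress.

```python
import os
import time


def wait_for_download(directory, filename, timeout=30.0, poll_interval=0.5):
    """Poll until `filename` exists in `directory` and no partial download remains.

    Hypothetical helper: assumes in-progress Chrome downloads show up as
    files ending in ".crdownload" in the same directory.
    Returns the full path of the file, or raises TimeoutError.
    """
    target = os.path.join(directory, filename)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        partial = [n for n in os.listdir(directory) if n.endswith(".crdownload")]
        if os.path.isfile(target) and not partial:
            return target
        time.sleep(poll_interval)
    raise TimeoutError(f"{target} did not appear within {timeout} seconds")
```

In the sample, this could replace the fixed sleep with something like wait_for_download("/tmp", "BrowserStack - List of devices to test on.csv"), called before driver.close(). A timeout here also makes a silent download failure visible as an error instead of a missing file later on.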
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
Hi Paul,
Without looking at job diagnostics for your job it's hard to say why this would not work in a recipe.
It could be a different code env, or perhaps the recipe is running in containerized execution while the notebook is not, or it could be a permissions issue.
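One way to narrow these possibilities down is to run the same small diagnostic in both the notebook and the recipe and compare the output. This is only a sketch: the function name describe_runtime is hypothetical, the driver path and /tmp come from the code in this thread, and the binary names "google-chrome" and "chromium" are assumptions that may not match your installation.

```python
import getpass
import os
import shutil
import sys


def describe_runtime(driver_path="/home/dataiku/chromedriver", download_dir="/tmp"):
    """Collect facts that may differ between a notebook run and a recipe run."""
    try:
        user = getpass.getuser()
    except Exception:
        # Some containers have no user entry for the running UID
        user = None
    return {
        "python": sys.executable,  # shows which code env interpreter is used
        "user": user,
        "cwd": os.getcwd(),
        "driver_exists": os.path.isfile(driver_path),
        "driver_executable": os.access(driver_path, os.X_OK),
        # Assumed binary names; adjust to how Chrome was installed
        "chrome_on_path": shutil.which("google-chrome") or shutil.which("chromium"),
        "download_dir_writable": os.access(download_dir, os.W_OK),
    }


for key, value in describe_runtime().items():
    print(f"{key}: {value}")
```

If the interpreter path differs, the two runs use different code envs. If the driver or Chrome is missing in the recipe run only, the recipe is probably executing somewhere else, for example in a container.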
You could try uploading the resulting file to a managed folder instead: save the file in /tmp or in a buffer, then use https://doc.dataiku.com/dss/latest/python-api/managed_folders.html#dataiku.Folder.upload_stream
This would work both in local and containerized execution. If you still have issues I would suggest you reach out via support ticket with the job diagnostics when you are running this in a recipe for us to troubleshoot further.
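A minimal sketch of the managed folder suggestion, assuming the file has already been downloaded to a local path such as /tmp. The helper name upload_to_managed_folder and the folder name "selenium_downloads" are hypothetical; the only Dataiku call relied on is Folder.upload_stream(path, file_object) from the page linked above.

```python
import os


def upload_to_managed_folder(folder, local_path, target_name=None):
    """Stream a local file into a managed folder and return the name used.

    `folder` is expected to behave like dataiku.Folder, that is, to expose
    upload_stream(path, file_object).
    """
    name = target_name or os.path.basename(local_path)
    with open(local_path, "rb") as stream:
        folder.upload_stream(name, stream)
    return name


# Inside DSS (the folder name below is hypothetical):
#   import dataiku
#   folder = dataiku.Folder("selenium_downloads")
#   upload_to_managed_folder(folder, "/tmp/BrowserStack - List of devices to test on.csv")
```

Passing the folder in as an argument keeps the helper independent of where the code runs, so the same function can be used from a notebook and from a recipe.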
-
Hi Alex,
Thank you for your answer.
I believe the issue comes from a different code env because, by default, my notebooks and my recipe do not run on the same one. I tried to create a new code env that is basically a copy of the one Jupyter uses by default (the one that works), so that I can force my recipe and my notebooks to run on the same env.
Unfortunately, I can't download the file with this new code env, whether from a notebook or from a recipe. Do you know if there is any special argument or option I should use when creating a code env that would allow the download?
I tried to check what could be different between the 2 environments but I can't find anything.
In case more information is required to debug this, I'll open a support ticket as you mentioned.
Thank you,
Paul
-
DarinB Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1 Dataiker
How did you install Chrome and the Chrome driver in DSS?