Web Scraping with Dataiku - Watch on Demand

MichaelG
Community Manager

In this session, Matt showed how he automated some web scraping processes. He walked through examples of finding an API to get the data, downloading a web page to extract its content, and simulating browser navigation. He also shared code samples using Python packages such as requests, Beautiful Soup, and Selenium.
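
As a rough illustration of the second approach (downloading a page and extracting its content), here is a minimal sketch. It is not the code from the session: the function names `extract_links` and `fetch_links` and the sample HTML are made up for illustration, and it assumes the `requests` and `beautifulsoup4` packages are installed in the code env.

```python
def extract_links(html):
    """Return (text, href) pairs for every link in an HTML document."""
    # Third-party import (pip install beautifulsoup4), kept local so this
    # module can still be imported where bs4 is not installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


def fetch_links(url):
    """Download a page with requests and extract its links."""
    import requests  # third-party (pip install requests)

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_links(response.text)


# Hypothetical sample page, used here instead of a live site.
SAMPLE_HTML = """
<html><body>
  <a href="/listing/1">First listing</a>
  <a href="/listing/2">Second listing</a>
  <a>No href, skipped</a>
</body></html>
"""
```

Calling `extract_links(SAMPLE_HTML)` parses the sample without touching the network; `fetch_links(url)` does the same for a live page, subject to that site's terms of use.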

Hosted by Matthieu Scordia: Data scientist @ Dataiku for the last 8 years, my goal is to help our customers build advanced data science projects. Now based in Singapore, I cover the whole APAC region, implementing data science projects with Dataiku DSS.

9 Comments
Mattsco
Dataiker

Hi everyone, 

You can download the starter project from here: 
https://dl.dataiku.com/file/niE1MnzR9oV2f4uj/QeLMn75XhSNBLGmJ/WEBSCRAPINGWEBINAR.zip

You need to install two packages in a code env to make it work:

bs4 (Beautiful Soup)
selenium 

Feel free to reach out to me if you have any questions.

 

Nikhil14
Level 1

I downloaded the chromedriver and put it in a managed folder. But when I run the code, it reports that the Chrome binary is missing.

I am using Dataiku on my company's server, which runs Linux. How do I proceed here?

seattle_ds
Level 1

How did you reference the chromedriver? Passing plain f.read(), io.BytesIO(f.read()), or io.StringIO(f.read()) didn't work. Here is my code; I'm using a managed folder to bring chromedriver into Dataiku:

with folder.get_download_stream('chromedriver.exe') as f:
    f_read = f.read()

chromeOptions = webdriver.ChromeOptions()
driver = webdriver.Chrome(f_read, options=chromeOptions)

Thank you!

Mattsco
Dataiker

@seattle_ds check your private messages, I've shared the full notebook with you.
Matt
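
For readers with the same question: `webdriver.Chrome` expects the path of a driver file on local disk, not the file's bytes, so one common approach is to copy the driver out of the managed folder first. The sketch below is an illustration under assumptions, not the notebook mentioned above: the helper names, the folder name `drivers`, and the file name `chromedriver` are hypothetical, and it assumes Selenium 4, where the path is passed through a `Service` object. On a Linux server the driver also has to be the Linux build rather than `chromedriver.exe`.

```python
import os
import shutil
import stat
import tempfile


def save_driver(stream, dest_dir=None, name="chromedriver"):
    """Copy a binary stream to a local file and mark it executable."""
    dest_dir = dest_dir or tempfile.mkdtemp(prefix="webdriver_")
    path = os.path.join(dest_dir, name)
    with open(path, "wb") as out:
        shutil.copyfileobj(stream, out)  # copies in chunks
    # Add execute permission for the owner; without it the driver cannot start.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def make_chrome(driver_path):
    """Start Chrome through the driver at driver_path (Selenium 4 style)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    return webdriver.Chrome(service=Service(executable_path=driver_path),
                            options=options)


# Usage inside Dataiku (hypothetical folder and file names):
#   import dataiku
#   folder = dataiku.Folder("drivers")
#   with folder.get_download_stream("chromedriver") as f:
#       driver = make_chrome(save_driver(f))
```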

luna
Level 1

Hi @Mattsco, I also encountered the same problem as @Nikhil14, where it cannot find the Chrome binary. May I know how to resolve this problem? Thanks!

Daniel_B
Level 1

@Mattsco I'm having the same issue referencing the chromedriver. Could I please take a look at the full notebook? Thanks a lot.

Mattsco
Dataiker

@Daniel_B  @luna 
Hello,
Sorry for the late reply!
Let me share a notebook with you:
https://dl.dataiku.com/file/DjEsAPVvgNQ1rxYt/dlKu2Ev777hfkWy9/propertyguru%20selenium.ipynb
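
On the "Chrome binary missing" error specifically: chromedriver only drives a browser, so Chrome or Chromium itself also has to be installed on the server. Below is a minimal sketch of checking for the browser and pointing Selenium at it. It is not the content of the notebook linked above; the helper names and the list of candidate executable names are assumptions and vary by Linux distribution.

```python
import shutil

# Common executable names for Chrome/Chromium on Linux (an assumption;
# adjust to whatever is installed on your server).
CANDIDATES = ("google-chrome", "google-chrome-stable",
              "chromium", "chromium-browser")


def find_browser(candidates=CANDIDATES, search_path=None):
    """Return the full path of the first candidate found, else None.

    search_path=None means the current PATH is searched.
    """
    for name in candidates:
        found = shutil.which(name, path=search_path)
        if found:
            return found
    return None


def headless_options(binary_path):
    """Build Chrome options suited to a server without a display."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.binary_location = binary_path       # where the browser lives
    options.add_argument("--headless")          # no display on a server
    options.add_argument("--no-sandbox")        # commonly needed as root or in containers
    options.add_argument("--disable-dev-shm-usage")
    return options
```

If `find_browser()` returns `None`, the browser is not visible to the user running the code, and no Selenium setting will fix that; it has to be installed on the server first.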

hmin412
Level 1

@Mattsco 

I cannot see your notebook and code. Could you please share them again?

 

tgb417

@hmin412 

Welcome to the Dataiku community.

Plik, as a service, is designed to store content temporarily, so I would guess the file was deleted long ago.

@MichaelG, would there be a way to relocate the notebook and put it somewhere for download where it will not get auto-deleted?
