How to use ChromeDriver in non-headless mode in Dataiku

Mitra (Dataiku DSS Core Designer, Dataiku DSS ML Practitioner)

I'm using ChromeDriver in headless mode in DSS to parse data from a website.

Is there a way to run it in non-headless mode so that I can watch the parsing activity on screen?

Answers

  • Turribeach (Neuron)

    This is something you would normally develop and test locally on your own machine, then deploy the automation to run somewhere else. Is there a reason why you can't take this approach?

  • Vitaliy (Dataiker), edited July 17

    Hi,

    If by running ChromeDriver headful you mean seeing the browser window in the DSS UI, there is no way to do that. However, if you want to parse the page the way a real browser renders it (as you would see it in the browser's developer tools), you can do that in a notebook with Selenium.

    The prerequisites are to download ChromeDriver and either add it to PATH on the DSS server or specify the path to the driver directly in the code. You may also get an error about a missing "Xvfb" (I got this error in my testing). In that case, you can fix the issue by installing the OS package below:

    sudo yum install xorg-x11-server-Xvfb

    Then create a code env with the following packages (I used Python 3.9 in my test):

    pyvirtualdisplay
    selenium
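
    Before running the scraping code, it may help to confirm that the prerequisites above are visible from the notebook. The following is only a minimal sketch: the helper names `missing_binaries` and `missing_packages` are made up for illustration, and it assumes the executables are named `Xvfb` and `chromedriver`. The `chromedriver` check only matters if you rely on PATH rather than passing the driver path in the code.

```python
import importlib.util
import shutil


def missing_binaries(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_packages(names):
    """Return the top-level Python packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed names: adjust them if your installation differs.
    print("Missing binaries:", missing_binaries(["Xvfb", "chromedriver"]))
    print("Missing packages:", missing_packages(["pyvirtualdisplay", "selenium"]))
```

    An empty list for both means the notebook's code env and the server PATH can see everything the example needs.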

    Then try the code below to grab a specific element from a website:

    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service as ChromeService
    from selenium.webdriver.common.by import By
    
    # Start virtual display using Xvfb
    display = Display(visible=0, size=(800, 600))
    display.start()
    
    # Path to your ChromeDriver executable.
    chromedriver_path = "/usr/bin/chrome/chrome-linux64/chromedriver" # change to your path
    # Set up Chrome service with executable_path.
    chrome_service = ChromeService(executable_path=chromedriver_path)
    
    # Set up Chrome options
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--no-sandbox')
    # chrome_options.add_argument("--headless=new") # enable headless for Chrome >= 109
    
    # Create a Chrome driver instance
    driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
    
    # Example: Navigate to a website
    driver.get("https://www.selenium.dev/")
    
    # Example: Extract data from the website
    # ... Your scraping code here ...
    
    element = driver.find_element(By.CSS_SELECTOR, "body > div.container-fluid.td-default.td-outer > main > section.row.td-box.td-box--gradient.-bg-selenium-green.p-2 > div > div > div > h1")
    print(element.text)
    
    # Close the driver
    driver.quit()
    
    # Stop the virtual display
    display.stop()
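
    One thing to watch for in the code above: if the scraping step raises (for example, `find_element` raises `NoSuchElementException` when the selector does not match), `driver.quit()` and `display.stop()` are never reached, which can leave Chrome and Xvfb processes running on the DSS server. Below is a rough sketch of one way to guarantee cleanup with `contextlib.ExitStack`. The helper `scrape_with_cleanup` and the `Fake*` classes are hypothetical stand-ins so the sketch runs without Selenium installed; they are not part of Selenium or pyvirtualdisplay.

```python
from contextlib import ExitStack


def scrape_with_cleanup(start_display, start_driver, scrape):
    """Run scrape(driver) and always clean up, even if scrape raises.

    start_display: callable returning a started display with a .stop() method
    start_driver:  callable returning a driver with a .quit() method
    """
    with ExitStack() as stack:
        display = start_display()
        stack.callback(display.stop)   # registered first, so it runs last
        driver = start_driver()
        stack.callback(driver.quit)    # registered last, so it runs first
        return scrape(driver)


# Stand-in objects (hypothetical) that only record the calls made on them.
class FakeDisplay:
    def __init__(self, log):
        self.log = log

    def stop(self):
        self.log.append("display.stop")


class FakeDriver:
    def __init__(self, log):
        self.log = log

    def quit(self):
        self.log.append("driver.quit")


def failing_scrape(driver):
    # Simulates a selector that does not match anything on the page.
    raise RuntimeError("selector did not match")


if __name__ == "__main__":
    calls = []
    try:
        scrape_with_cleanup(lambda: FakeDisplay(calls), lambda: FakeDriver(calls), failing_scrape)
    except RuntimeError as exc:
        print("scrape failed:", exc)
    print("cleanup calls:", calls)
```

    With the real objects, `start_display` would create and start the `Display`, `start_driver` would build the `webdriver.Chrome` instance, and `scrape` would hold the `driver.get(...)` and `find_element(...)` calls. A plain `try` / `finally` around the scraping code achieves the same thing.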

    Hope this helps.

    Best.
