Downloading Data from Web Links in TSV Format

Scobbyy2k3
Scobbyy2k3 Partner, Registered Posts: 26 Partner

I have TSV-format data at multiple web links. How do I import the data from all of these links into Dataiku at once?

Thanks


Operating system used: windows


Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker

    Hi @Scobbyy2k3,

    If the web links don't require authentication, you can use an HTTP dataset: +Dataset -> Network -> HTTP.


    You can add multiple URLs to the same dataset; this is suitable if all the TSV files have the same schema.

    If you need authenticated requests, you can use a Python recipe with the requests library to download the files and store them in a managed folder or in one or more datasets.

  • Scobbyy2k3
    Scobbyy2k3 Partner, Registered Posts: 26 Partner

    Hi, I got this error. How do I resolve it?

  • Scobbyy2k3
    Scobbyy2k3 Partner, Registered Posts: 26 Partner

    Can you also help with how to call requests from a Python recipe?

    Thanks

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker

    The HTTP endpoint is returning 401 Unauthorized, so you likely need some form of authorization, such as an Authorization header.
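
    For reference, when an endpoint uses HTTP Basic authentication, the header value is the word Basic followed by the base64 encoding of username:password. The following is a minimal sketch, assuming Basic auth is what your endpoint expects; the basic_auth_header helper and the credentials are hypothetical placeholders:

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header (hypothetical helper)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials; replace with your own
headers = basic_auth_header("user", "pass")
print(headers)
```

    The resulting dictionary can be passed as headers= to requests.get. With requests, passing auth=(username, password) should have the same effect for Basic auth.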

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
    edited July 17

    We also have a plugin you can use if the data is actually served by a REST API endpoint:
    https://www.dataiku.com/product/plugins/api-connect/

    If the TSV files are simply served over an HTTP or HTTPS connection that requires some type of authentication (e.g. basic auth), you can use something like the following in Python and adapt it to loop through different URLs, filenames, etc.

    Use a Python recipe (+Recipe - Python) and choose a folder as the output (no input is required).

    Once your files are in the managed folder, use +Dataset - Internal Files in a folder to create the various datasets based on the filename patterns.

    Here is a basic sample code you can build on:

    # -*- coding: utf-8 -*-
    import dataiku
    import requests

    # Example URL; replace it with the link to your own TSV file
    url = "https://downloads.dataiku.com/public/website-additional-assets/data/CO2_and_Oil.csv"

    # Basic auth header; replace the token with your own base64-encoded credentials
    headers = {
        'Authorization': 'Basic YkQySFB0RE1WQ1BRMXgyeWlncU1BQzRvenB5eHQ3Zmg6'
    }

    print("sending request")
    response = requests.get(url, headers=headers)
    # Fail early on HTTP errors such as 401 Unauthorized
    response.raise_for_status()

    # "MhnBjJl7" is the ID of the output managed folder; use your own
    folder = dataiku.Folder("MhnBjJl7")
    print("uploading data")
    folder.upload_data("filename.tsv", response.text.encode('utf-8'))
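
    To adapt the sample to several links, one option is to loop over the URLs and derive each filename from its URL. The following is a minimal sketch, not a definitive implementation: the filename_from_url and download_all helpers, the example URLs, and the folder ID are hypothetical placeholders, and it assumes the same Authorization header works for every link. The session and folder are passed in as arguments so that the loop can be tried outside DSS.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Use the last path segment of the URL as the filename (hypothetical helper)."""
    return os.path.basename(urlparse(url).path)


def download_all(urls, session, folder, timeout=60):
    """Fetch each URL with the given session and upload it to the folder.

    `session` is expected to behave like requests.Session and `folder`
    like dataiku.Folder; both are passed in so the loop is easy to test.
    """
    uploaded = []
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # stop on errors such as 401 Unauthorized
        name = filename_from_url(url)
        folder.upload_data(name, response.content)
        uploaded.append(name)
    return uploaded


# Usage inside a DSS Python recipe (placeholder URLs, header, and folder ID):
#   import dataiku, requests
#   session = requests.Session()
#   session.headers.update({"Authorization": "Basic <your-base64-credentials>"})
#   urls = ["https://example.com/data/part1.tsv",
#           "https://example.com/data/part2.tsv"]
#   download_all(urls, session, dataiku.Folder("MhnBjJl7"))
```

    Each file keeps its original name in the managed folder, so the filename patterns can then be used when creating datasets from the folder.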

  • Scobbyy2k3
    Scobbyy2k3 Partner, Registered Posts: 26 Partner

    Hi,

    I downloaded data using multiple HTTP links that are supposed to share the same data structure. One of the links is missing a column, which has caused its data to be mapped to the wrong columns. How can I fix this?
