Downloading Data from Web Links in TSV Format

Scobbyy2k3
Level 3
I have TSV data available at multiple web links. How do I import the data from all of these links into Dataiku at once?

 

Thanks


Operating system used: Windows

6 Replies
AlexT
Dataiker

Hi @Scobbyy2k3 ,

If the web links don't require authentication, you can use an HTTP dataset: +Dataset -> Network -> HTTP


You can add multiple URLs to the same dataset. This is suitable if all the TSV files have the same schema.

If you need authenticated requests, you can use a Python recipe with the requests library to download the files and store them in a managed folder or in one or more datasets.

 

 

Scobbyy2k3
Level 3
Author

Hi, I got this error. How do I resolve it?

AlexT
Dataiker

The HTTP endpoint is returning 401 Unauthorized, so you likely need to send some form of authorization with the request, such as an Authorization header.
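
To illustrate what such a header can look like, here is a minimal sketch that assumes the endpoint uses HTTP Basic authentication. The thread does not say which scheme the server expects, so this is only one possibility, and the helper name `basic_auth_header` is hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header (assumes the server uses Basic auth)."""
    # Basic auth is the base64 encoding of "username:password"
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic {}".format(token)}


# Example with placeholder credentials
headers = basic_auth_header("user", "pass")
print(headers)
```

The resulting dictionary can be passed as the `headers` argument of a requests call. If the server expects a bearer token or an API key instead, the header value will differ.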

 

Scobbyy2k3
Level 3
Author

Can you also help with how to call requests in Python using the Python recipe?

 

Thanks

AlexT
Dataiker

We also have a plugin you can use if the data is actually served from a REST API endpoint.
https://www.dataiku.com/product/plugins/api-connect/

If the TSV files are just served over an HTTP or HTTPS connection that requires some type of authentication (e.g. basic auth), you can use something like this in Python and adapt it to loop through different URLs, filenames, etc.

You can use a Python recipe (+Recipe - Python), then choose a folder as the output (no input is required).

Once your files are in the managed folder you wrote to, use +Dataset - Internal Files in a folder to create the various datasets based on the filename patterns.

Here is some basic sample code you can build on:

# -*- coding: utf-8 -*-
import dataiku
import requests

# Sample URL; replace with the link to your own TSV file
url = "https://downloads.dataiku.com/public/website-additional-assets/data/CO2_and_Oil.csv"

# Basic auth header; replace the token with your own base64-encoded credentials
headers = {
    'Authorization': 'Basic YkQySFB0RE1WQ1BRMXgyeWlncU1BQzRvenB5eHQ3Zmg6'
}

print("sending request")
response = requests.get(url, headers=headers)
# Fail early on errors such as 401 Unauthorized instead of saving an error page
response.raise_for_status()

# Managed folder ID; replace with the ID of your recipe's output folder
folder = dataiku.Folder("MhnBjJl7")
print("uploading data")
# Upload the raw response bytes under the chosen file name
folder.upload_data("filename.tsv", response.content)
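
To adapt this to several links, one option is to wrap the download and upload steps in a loop. The sketch below is an assumption-laden illustration, not an official pattern: the helper names (`filename_from_url`, `download_all`, `run_in_recipe`) are hypothetical, and file names are simply taken from the last segment of each URL.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url, index):
    """Use the last path segment of the URL, or a numbered fallback name."""
    name = posixpath.basename(urlparse(url).path)
    return name if name else "file_{}.tsv".format(index)


def download_all(urls, fetch, upload):
    """Fetch every URL and pass (filename, bytes) to the upload callable."""
    saved = []
    for index, url in enumerate(urls):
        data = fetch(url)
        name = filename_from_url(url, index)
        upload(name, data)
        saved.append(name)
    return saved


def run_in_recipe(urls, headers, folder_id):
    """Wire the helpers to requests and a managed folder (works inside DSS only)."""
    import dataiku  # only available inside Dataiku DSS
    import requests

    folder = dataiku.Folder(folder_id)

    def fetch(url):
        response = requests.get(url, headers=headers, timeout=60)
        response.raise_for_status()  # stop on 401 and other HTTP errors
        return response.content

    return download_all(urls, fetch, folder.upload_data)
```

Inside a Python recipe you would call `run_in_recipe` with your own list of URLs, headers, and output folder ID. Note that two URLs ending in the same file name would overwrite each other in the folder, so you may need a different naming rule.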

 

Scobbyy2k3
Level 3
Author

Hi,

I downloaded data using multiple HTTP links with an identical data structure. One of the links has a missing column, which has caused the data from that link to map to the wrong columns. How can I fix this?
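
One possible way to approach this kind of mismatch, sketched here under assumptions rather than as a confirmed fix: if every file has a header row, the files can be merged by column name instead of by position, so a file that lacks a column gets an empty value there rather than shifted data. The function name `merge_tsv_by_header` is hypothetical, and the sketch uses only the Python standard library.

```python
import csv
import io


def merge_tsv_by_header(tsv_texts, fill=""):
    """Merge TSV documents by column name rather than by position.

    Assumes each document starts with a header row.
    """
    columns = []  # union of column names, in first-seen order
    rows = []
    for text in tsv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)
    # Fill in columns that a given file did not have
    merged = [{c: row.get(c, fill) for c in columns} for row in rows]
    return columns, merged


# Example: the second file is missing the "name" column
full = "id\tname\tvalue\n1\tx\t10\n"
partial = "id\tvalue\n2\t20\n"
columns, merged = merge_tsv_by_header([full, partial])
print(columns)
print(merged)
```

In a Python recipe, the texts could come from the files stored in the managed folder, and the merged rows could then be written to an output dataset.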
