Downloading Data from Web Links in TSV Format
I have TSV data at multiple web links. How do I import the data from all of these links into Dataiku at once?
Thanks
Operating system used: Windows
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @Scobbyy2k3,
If the web links don't require authentication, you can use an HTTP dataset: +Dataset -> Network -> HTTP
You can add multiple URLs there. This is suitable if all the TSV files have the same schema.
If you need authenticated requests, you can use a Python recipe with the requests library to download the files and store them in a managed folder or in one or more datasets.
-
-
Can you also help with how to call requests from a Python recipe?
Thanks
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
You likely need some form of authorization, such as an Authorization header: the HTTP endpoint is returning 401 Unauthorized.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
We also have a plugin if the data is actually served by a REST API endpoint:
https://www.dataiku.com/product/plugins/api-connect/
If the TSV files are just on an HTTP or HTTPS connection that requires some type of authentication (e.g. basic auth), you can use something like the code below in Python and adapt it to loop through different URLs, filenames, etc.
You can use a Python recipe (+Recipe - Python), then choose a folder as output (no input required).
Once your files are in the managed folder you write to, use +Dataset - Internal Files in a folder to create the various datasets based on the filename patterns.
Here is a basic sample code you can build on:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import requests

url = "https://downloads.dataiku.com/public/website-additional-assets/data/CO2_and_Oil.csv"
payload = {}
headers = {
    'Authorization': 'Basic YkQySFB0RE1WQ1BRMXgyeWlncU1BQzRvenB5eHQ3Zmg6'
}

print("sending request")
response = requests.request("GET", url, headers=headers, data=payload)
print(type(response.text))

folder = dataiku.Folder("MhnBjJl7")
folder_info = folder.get_info()
print("uploading data")
folder.upload_data("filename.tsv", response.text.encode('utf-8'))
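To loop over several URLs, the sample above could be adapted roughly as follows. This is only a sketch: the helper names (basic_auth_header, filename_from_url, download_to_folder) are made up for illustration, it assumes basic auth, and it assumes each URL path ends in a usable filename. The folder argument is expected to be a dataiku.Folder like the one in the sample.

```python
import base64
import posixpath
from urllib.parse import urlparse


def basic_auth_header(username, password):
    """Build a basic-auth Authorization header (hypothetical helper)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def filename_from_url(url, default="download.tsv"):
    """Use the last path segment of the URL as the filename (hypothetical helper)."""
    name = posixpath.basename(urlparse(url).path)
    return name or default


def download_to_folder(urls, folder, headers=None, timeout=60):
    """Download each URL and upload the raw bytes to the managed folder."""
    import requests  # imported here so the helpers above work without it

    uploaded = []
    for url in urls:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # fail loudly on 401 and other HTTP errors
        name = filename_from_url(url)
        folder.upload_data(name, response.content)
        uploaded.append(name)
    return uploaded


# Inside a DSS Python recipe (the folder ID, URLs and credentials are placeholders):
# import dataiku
# folder = dataiku.Folder("MhnBjJl7")
# urls = ["https://example.com/data/a.tsv", "https://example.com/data/b.tsv"]
# download_to_folder(urls, folder, headers=basic_auth_header("user", "password"))
```

Calling raise_for_status() means a 401 stops the recipe, so an error page is never written into the folder as if it were data.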
-
Hi,
I downloaded data using multiple HTTP links that were supposed to have an identical data structure. One of the links is missing a column, which has caused the data from that link to be mapped to the wrong columns. How can I fix this?
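One possible way to approach this, sketched below, is to read each file separately and combine the rows by column name instead of by position, so a file with a missing column gets an empty value there instead of shifting its data. This assumes every TSV file has a header row. The function names are made up for illustration, and the sketch uses only the Python standard library, not a Dataiku feature.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text with a header row into (rows, column_names)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    return rows, list(reader.fieldnames or [])


def stack_by_column_name(tsv_texts, fill=""):
    """Combine several TSV texts, aligning values by header name."""
    columns = []
    all_rows = []
    for text in tsv_texts:
        rows, names = read_tsv(text)
        # Keep the first-seen order of column names across all files.
        for name in names:
            if name not in columns:
                columns.append(name)
        all_rows.extend(rows)
    # Fill columns that a given file does not have at all.
    aligned = [{c: row.get(c, fill) for c in columns} for row in all_rows]
    return columns, aligned


# Example: the second file has no "name" column.
file_a = "id\tname\tvalue\n1\tx\t10\n"
file_b = "id\tvalue\n2\t20\n"
columns, rows = stack_by_column_name([file_a, file_b])
print(columns)
print(rows)
```

In a Python recipe, the texts could come from the files already stored in the managed folder, and the aligned rows could then be written to an output dataset.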