Set column names in Python recipe
Thomas_K
tl;dr: I need to rename the columns of a huge dataset programmatically, preferably in Python. The names are extracted from a different dataset; I have them as a Python list, with the corresponding data types in another list of the same length.
Long version: The Dataiku GUI only lets me change column names manually. However, I have lots of CSV files containing data without column names, and one CSV file that lists the column names for all the other CSV files. (I can't do anything about that layout, since it's not my data and I only have read access.) This should be a one-liner that doesn't require touching the actual data. What would be the best way to do this? I managed to extract the names (and the corresponding data type per column) as Python lists in a code recipe, but I'm not sure what to do with them. My lists might look like this:
col_names = ["ID", "Names", "Count"]
col_types = ["int", "string", "int"]
Best Answer
-
Hi,
You would generally not do this in a recipe: a recipe is part of the Flow, is rerunnable, and is more or less meant to process data. Do it first in a Python notebook instead. You can then automate it as a "Macro" in a DSS plugin.
This would use the DSS public API (https://doc.dataiku.com/dss/latest/api/public/client-python/index.html).
Something like this ("pseudo-code"):
import dataiku

client = dataiku.api_client()
project = client.get_project("PROJECT_NAME")
dataset = project.get_dataset("dataset_name")
current_schema = dataset.get_schema()
# current_schema is a dict; its "columns" entry is a list of dicts,
# each of which contains a "name" and a "type".

# Build the new columns list from col_names and col_types
# (the two lists of equal length described in the question).
new_cols = []
for name, col_type in zip(col_names, col_types):
    new_cols.append({"name": name, "type": col_type})

# Replace the columns in the schema dict, then save it.
current_schema["columns"] = new_cols
dataset.set_schema(current_schema)
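Since the question mentions many CSV files, the schema update above could be wrapped in a small helper and called once per dataset. What follows is only a minimal sketch: `build_columns` and `apply_columns` are hypothetical names, not part of the DSS API, and the only thing assumed about `dataset` is that it offers `get_schema()` and `set_schema()` as used above. The column-count check is an added safety assumption, so that names are not silently applied to a dataset with a different number of columns.

```python
def build_columns(col_names, col_types):
    """Pair names with types as schema column dicts, validating the lists."""
    if len(col_names) != len(col_types):
        raise ValueError(
            "got %d names but %d types" % (len(col_names), len(col_types))
        )
    if len(set(col_names)) != len(col_names):
        raise ValueError("column names must be unique")
    return [{"name": n, "type": t} for n, t in zip(col_names, col_types)]


def apply_columns(dataset, col_names, col_types):
    """Replace the columns in the schema of `dataset` and save the schema.

    `dataset` is assumed to expose get_schema() / set_schema(), like the
    object returned by project.get_dataset(...) in the answer above.
    """
    new_cols = build_columns(col_names, col_types)
    schema = dataset.get_schema()
    existing = schema.get("columns", [])
    # Assumption: an existing schema with a different column count means the
    # names belong to another file, so refuse rather than misalign columns.
    if existing and len(existing) != len(new_cols):
        raise ValueError(
            "dataset has %d columns, got %d names" % (len(existing), len(new_cols))
        )
    schema["columns"] = new_cols
    dataset.set_schema(schema)
    return schema


# Possible usage in a notebook (dataset names here are placeholders):
# for name in ["dataset_a", "dataset_b"]:
#     apply_columns(project.get_dataset(name), col_names, col_types)
```

Only the schema is modified here; the underlying data files are not read or rewritten.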
Answers
-
Thanks, I'll try that and give an update later on whether it worked.
-
I couldn't quite figure out how to do it as a recipe, so I did it as a script as described by you and that works for now. Thanks.