Set column names in Python recipe

Thomas_K
tl;dr: I need to rename the columns of a huge dataset programmatically, preferably in Python code. The names are extracted from a different dataset; I have them as a Python list, with the corresponding data types in another list of the same length.

Long version: The Dataiku GUI only lets me change the column names manually. However, I have lots of CSV files containing data without column names, and one CSV file that lists the column names for all the other CSV files. (I can't do anything about that layout, since it's not my data and I only have read access.) This should be a one-liner that does not require touching the actual data. What would be the best way to do this? I managed to extract the names (and the corresponding data type per column) as Python lists in a code recipe, but I am not sure what to do with them. My lists might look like this:

col_names = ["ID", "Names", "Count"]

col_types = ["int", "string", "int"]

Best Answer

  • Clément_Stenac
    edited July 2024 Answer ✓

    Hi,

    You would generally not do this in a recipe, which is part of the Flow, is rerunnable, and is meant more or less to process data. Instead, do it first in a Python notebook. You can then automate it as a "Macro" in a DSS plugin.

    This would use the DSS public API (https://doc.dataiku.com/dss/latest/api/public/client-python/index.html)

    Something like this (pseudo-code):


    import dataiku

    # Example values from the question; replace with your own lists
    col_names = ["ID", "Names", "Count"]
    col_types = ["int", "string", "int"]

    client = dataiku.api_client()
    project = client.get_project("PROJECT_NAME")
    dataset = project.get_dataset("dataset_name")

    current_schema = dataset.get_schema()
    # current_schema is now a dict containing "columns", a list of dicts.
    # Each of those dicts contains "name" and "type".

    # Build the new columns list, pairing each name with its type.
    new_cols = []
    for name, col_type in zip(col_names, col_types):
        new_cols.append({"name": name, "type": col_type})

    # Update the schema (it is a dict, so use key access) and save it
    current_schema["columns"] = new_cols
    dataset.set_schema(current_schema)
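
Since the schema is replaced by position, it may be worth checking the two lists before saving anything. The following is a minimal sketch in plain Python, not part of the DSS API: `build_schema` is a hypothetical helper, and `sample_schema` with its `col_0`-style names is a made-up stand-in for what `dataset.get_schema()` returns. It assumes the schema is a dict with a "columns" list, as described above.

```python
def build_schema(current_schema, col_names, col_types):
    """Return a copy of current_schema with its columns replaced by position.

    Hypothetical helper: raises ValueError if the inputs are inconsistent.
    """
    if len(col_names) != len(col_types):
        raise ValueError("col_names and col_types must have the same length")
    if len(set(col_names)) != len(col_names):
        raise ValueError("column names must be unique")

    old_cols = current_schema.get("columns", [])
    # Only compare counts when the dataset already has a schema
    if old_cols and len(old_cols) != len(col_names):
        raise ValueError(
            "expected %d columns, got %d names" % (len(old_cols), len(col_names))
        )

    new_schema = dict(current_schema)  # shallow copy keeps any other top-level keys
    new_schema["columns"] = [
        {"name": name, "type": col_type}
        for name, col_type in zip(col_names, col_types)
    ]
    return new_schema


# Made-up stand-in for dataset.get_schema() on a header-less CSV dataset
sample_schema = {
    "columns": [
        {"name": "col_0", "type": "string"},
        {"name": "col_1", "type": "string"},
        {"name": "col_2", "type": "string"},
    ]
}

new_schema = build_schema(
    sample_schema, ["ID", "Names", "Count"], ["int", "string", "int"]
)
print(new_schema["columns"])
```

In DSS you would pass the result of dataset.get_schema() instead of sample_schema and hand the returned dict to dataset.set_schema(), as in the snippet above. The helper returns a new dict rather than modifying its input, so nothing changes if validation fails.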
