Why does Dataiku consistently choose the wrong data type for Excel files I upload?

Bayardo
Bayardo Registered Posts: 18 ✭✭✭

Should I be using Excel files? No. But that's the world I live in. Dataiku needs to do better with data types. It takes significant, repetitive manual intervention to make sure it keeps applying the right data type to my Excel files, and there is no good reason it has to be this bad. Particularly annoying is the forced Boolean type: there are plenty of times a column has exactly two distinct values without being Boolean, much less values I want rewritten as True or False. Applying the wrong data type would be fine if it didn't affect anything. But it does.

Comments

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,993 Neuron

    Can you please post/attach a sample XLS where you see data types changing when Dataiku imports it?

  • Bayardo
    Bayardo Registered Posts: 18 ✭✭✭

    Sadly, I can't share the file. But the problem remains even when I save the data as a CSV file: all of the number columns are identified as string. The type changes in later steps are the odd part, because I will set the data type to double and run the flow, then go to later steps and find the columns are string or bigint again.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,993 Neuron
    edited September 10

    CSV files have no data types, so it makes sense to load everything as string. This is a very common approach in ETL tools, where you can't guarantee the consistency of the data types. Once you load all records you can then apply a type-casting step in which rows that fail to match a specific data type can be dealt with accordingly. Otherwise, what do you expect a loading program to do if it can't fit a row into a defined schema?
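
    For illustration only, here is a minimal sketch of that load-as-string-then-cast pattern in plain pandas (not Dataiku's engine; the file name "export.csv" and column "amount" are made up):

        import pandas as pd

        # Step 1: load everything as string, since CSV carries no types.
        df = pd.read_csv("export.csv", dtype=str)

        # Step 2: cast explicitly. Rows that fail the cast become NaN
        # instead of silently breaking the load, so they can be
        # inspected or quarantined rather than lost.
        amounts = pd.to_numeric(df["amount"], errors="coerce")
        bad_rows = df[amounts.isna() & df["amount"].notna()]
        df["amount"] = amounts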

    Identify an Excel file that causes you problems and anonymise the data so you can share it. You should only need a few rows to provide evidence of your issue.

  • Bayardo
    Bayardo Registered Posts: 18 ✭✭✭

    What I can tell you is that I used Tableau Prep and did not have any of these issues. So somehow they figured out how to load data from Excel files without problems.

    Update:

    I took the Excel file, changed the format of the number columns to General, and uploaded it. Dataiku still considered all of the columns to be bigint or string.

    I took the Excel file and changed it to CSV, then uploaded it, and Dataiku considered every column to be a string.

    This is the best part. I clicked "Infer Types From Data" and it says this:

    "Infer storage types from data

    This only takes into account a very small sample of data and could lead to invalid data. For safer typing of data, use a prepare recipe."

    Soooooo, in my payment column, because only a couple of hundred people earn something and the rest of the rows are 0, it changes the string to bigint… and the BEST PART: IT DELETES THE PAYMENTS THAT I KNOW ARE IN THE EXCEL/CSV FILE.

    Which is awesome.
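
    To make the failure mode concrete, here is a toy sketch (plain Python/pandas, not Dataiku internals; the numbers are invented) of how inferring a type from a small, mostly-zero sample can invalidate the decimal rows:

        import pandas as pd

        payments = pd.Series(["0"] * 10000 + ["123.45", "67.80"])

        # The sampled rows all parse as integers, so the column
        # gets inferred as bigint.
        sample = payments.head(100)
        print(sample.str.fullmatch(r"-?\d+").all())  # True

        # Applying that integer schema to the full column: the decimal
        # payments fail the cast and come back as missing, i.e. the
        # data is "deleted".
        def cast_bigint(value):
            try:
                return int(value)      # int("123.45") raises ValueError
            except ValueError:
                return None

        casted = payments.map(cast_bigint)
        print(casted.isna().sum())     # 2 -> the real payments are gone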

    Oh, and before all this I also went and changed the type in the schema of the Excel dataset, then ran a prepare step to save it to my S3 bucket… and… it of course did not keep the data type from my schema change. Why? Who knows.

    So we have wrong data types at the beginning, wrong data types even after you change them in the schema, wrong data types that delete data in later steps, and later steps that reapply the wrong data type.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,993 Neuron

    I can't help without a data sample, so I wish you the best with your problem. I am out.

  • Bayardo
    Bayardo Registered Posts: 18 ✭✭✭

    Solved: take the Excel file, select all columns, and change their format to NUMBER without any commas. Upload the file as a dataset; it will come in as string. In the schema, select all of the columns you want to change to double (you can leave the meaning alone). Then save it, run your prepare recipe, and your dataset output will be double/decimal as it should be.
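
    If you end up repeating that schema fix often, something like this sketch may save clicks. It uses Dataiku's public dataikuapi client; the host, API key, project key, dataset name, and column names are all placeholders you would swap for your own:

        import dataikuapi

        # Placeholders: point these at your own instance/project/dataset.
        client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
        dataset = client.get_project("MYPROJECT").get_dataset("my_excel_upload")

        schema = dataset.get_schema()
        numeric_cols = {"payment", "amount"}   # columns you know are numeric
        for col in schema["columns"]:
            if col["name"] in numeric_cols:
                col["type"] = "double"         # storage type only; meaning untouched
        dataset.set_schema(schema)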
