When importing datasets with columns stored as int, DSS seems to want to store them as double.

stainlessteelra Registered Posts: 14 ✭✭✭✭

Using wip_discrete_jobs as an example, we have columns that should be integers but are ingested as numeric.

For example,

Table: wip_discrete_jobs

Column: wip_entity_id

Type: numeric(38,15)

If I query the table

SELECT wip_entity_id, organization_id FROM khw_eng.wip_discrete_jobs WHERE wip_entity_id = 349225

The result is as expected, with a scale of 15.


If I take this table as an input and run a prepare step without adding any transformations, I end up with a different column type.


Schema: rpcadw

Table: wip_discrete_jobs_prepared

Column: wip_entity_id

Type: float8

If I query the table, I get an empty result:

SELECT wip_entity_id, organization_id FROM rpcadw.wip_discrete_jobs_prepared WHERE wip_entity_id = 349225

To return the row, I must round instead:

SELECT wip_entity_id, organization_id FROM rpcadw.wip_discrete_jobs_prepared WHERE round(wip_entity_id) = 349225
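A tolerance-based comparison is another possible workaround that avoids exact equality on the float8 column. This is only a sketch, assuming the IDs are meant to be whole numbers:

-- Sketch: compare within a half-unit window instead of relying on exact float8 equality
SELECT wip_entity_id, organization_id FROM rpcadw.wip_discrete_jobs_prepared WHERE wip_entity_id BETWEEN 349225 - 0.5 AND 349225 + 0.5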

We can see that the conversion to float by Dataiku has not preserved the exact value.

I also tried not normalizing floats/doubles:


But we have the same issue:

SELECT wip_entity_id, organization_id FROM rpcadw.wip_discrete_jobs_prepared2 WHERE round(wip_entity_id) = 349225
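To see how far the stored float8 value drifts from the expected integer, a quick check along these lines can help (a sketch only, reusing the round() lookup above):

-- Sketch: show the difference between the stored float8 value and the expected ID
SELECT wip_entity_id, wip_entity_id - 349225 AS drift FROM rpcadw.wip_discrete_jobs_prepared2 WHERE round(wip_entity_id) = 349225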

Answers

  • Alexandru (Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,400)

    Hi,
    It's probably best if you share the job diagnostics via a support ticket so we can investigate this in more detail.

    The behavior here can vary depending on settings at the instance level; see "Lock strongly typed input":
    https://doc.dataiku.com/dss/latest/schemas/data-preparation.html#schema-in-prepare-recipe

    If the column is changed, e.g. renamed, or any processing is performed on it, DSS may infer its type again and switch to double.
    If you only see the issue after writing, but not in the prepare preview, it may be due to how DSS and Redshift interact: the DSS double type maps to float8 in Redshift.

    You could potentially treat it as a string by setting that type in the prepare recipe, or you can modify the SQL creation statement in the advanced tab of the recipe to set the column to numeric.
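    For example, the generated creation statement could be edited along these lines. This is only a sketch: the column list here is illustrative, and the actual statement DSS generates will include every column of the dataset:

    -- Sketch: force the ID columns back to numeric in the table creation statement
    CREATE TABLE rpcadw.wip_discrete_jobs_prepared (
        wip_entity_id numeric(38,15),
        organization_id numeric(38,15)
        -- ... remaining columns as generated by DSS
    )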

    Thanks
