Data Type

Curious
Curious Registered Posts: 1

Hi everyone,

I’m running into a type inference issue in Dataiku DSS when reading my dataset.

The ID column is explicitly defined as a string/text type in the dataset schema.

If I do not set infer_with_pandas=False (letting Dataiku infer types), pandas reads the ID column as float64 (for example, 42022100 becomes 42022100.0). This also causes leading zeros to be lost, which is unacceptable for my IDs.

According to the Dataiku documentation:
"If you pass infer_with_pandas=False as an option to get_dataframe(), the exact dataset types will be passed to Pandas. Note that if your dataset contains invalid values, the whole get_dataframe call will fail."

When I set infer_with_pandas=False to enforce the predefined schema:

  • The ID column is handled correctly as a string
  • However, another column fails with the error:
    ValueError: Integer column has NA values in column 16

In summary:

  • Without infer_with_pandas=False: ID column is incorrectly inferred as float64, trailing .0 appears and leading zeros are lost
  • With infer_with_pandas=False: schema is enforced, ID column is fine, but integer columns containing NA raise an error

My question: what is the recommended way in Dataiku DSS to handle this situation so that:

  1. The ID column is consistently treated as a string (preserving leading zeros), and
  2. Integer columns with missing values do not cause errors when schema enforcement is enabled?

Any guidance or best practices for handling this trade-off in Dataiku would be greatly appreciated.

Dataiku version used: 14.2.1

Answers

  • ÁngelÁlvarez
    ÁngelÁlvarez Registered Posts: 5 ✭✭

    The Short Solution

    Set infer_with_pandas=False to protect your IDs and use_nullable_integers=True to allow NAs in integer columns:

    # Read recipe inputs
    test_data = dataiku.Dataset("test_data")
    
    # This combination preserves leading zeros and handles NA integers
    test_data_df = test_data.get_dataframe(
        infer_with_pandas=False, 
        use_nullable_integers=True
    )
    

    Why this works:

    • infer_with_pandas=False: Forces Pandas to respect the "String" type for your ID column, preventing it from being mangled into a float.
    • use_nullable_integers=True: Tells Dataiku to use Pandas' Int64 (nullable) type instead of the standard NumPy int64. This allows the column to remain an integer type while safely supporting NaN values.
    image.png
Setup Info
    Tags
      Help me…