Data Type
Hi everyone,
I’m running into a type inference issue in Dataiku DSS when reading my dataset.
The ID column is explicitly defined as a string/text type in the dataset schema.
If I do not set infer_with_pandas=False (letting Dataiku infer types), pandas reads the ID column as float64 (for example, 42022100 becomes 42022100.0). This also causes leading zeros to be lost, which is unacceptable for my IDs.
According to the Dataiku documentation:
"If you pass infer_with_pandas=False as an option to get_dataframe(), the exact dataset types will be passed to Pandas. Note that if your dataset contains invalid values, the whole get_dataframe call will fail."
When I set infer_with_pandas=False to enforce the predefined schema:
- The ID column is handled correctly as a string
- However, another column fails with the error:
ValueError: Integer column has NA values in column 16
In summary:
- Without infer_with_pandas=False: ID column is incorrectly inferred as float64, trailing .0 appears and leading zeros are lost
- With infer_with_pandas=False: schema is enforced, ID column is fine, but integer columns containing NA raise an error
My question: what is the recommended way in Dataiku DSS to handle this situation so that:
- The ID column is consistently treated as a string (preserving leading zeros), and
- Integer columns with missing values do not cause errors when schema enforcement is enabled?
Any guidance or best practices for handling this trade-off in Dataiku would be greatly appreciated.
Dataiku version used: 14.2.1
Answers
-
The Short Solution
Set
infer_with_pandas=Falseto protect your IDs anduse_nullable_integers=Trueto allow NAs in integer columns:# Read recipe inputs test_data = dataiku.Dataset("test_data") # This combination preserves leading zeros and handles NA integers test_data_df = test_data.get_dataframe( infer_with_pandas=False, use_nullable_integers=True )Why this works:
infer_with_pandas=False: Forces Pandas to respect the "String" type for your ID column, preventing it from being mangled into a float.use_nullable_integers=True: Tells Dataiku to use Pandas'Int64(nullable) type instead of the standard NumPyint64. This allows the column to remain an integer type while safely supportingNaNvalues.
