Data Type

Curious · February 5

Hi everyone,

I’m running into a type inference issue in Dataiku DSS when reading my dataset.

The ID column is explicitly defined as a string/text type in the dataset schema.

If I do not set infer_with_pandas=False (leaving type inference to pandas), pandas reads the ID column as float64 (for example, 42022100 becomes 42022100.0). Leading zeros are also lost, which is unacceptable for my IDs.

According to the Dataiku documentation:
"If you pass infer_with_pandas=False as an option to get_dataframe(), the exact dataset types will be passed to Pandas. Note that if your dataset contains invalid values, the whole get_dataframe call will fail."

When I set infer_with_pandas=False to enforce the predefined schema:

The ID column is handled correctly as a string
However, another column fails with the error:
ValueError: Integer column has NA values in column 16

In summary:

Without infer_with_pandas=False: the ID column is incorrectly inferred as float64, so a trailing .0 appears and leading zeros are lost.
With infer_with_pandas=False: the schema is enforced and the ID column is fine, but integer columns containing NA values raise an error.

My question: what is the recommended way in Dataiku DSS to handle this situation so that:

The ID column is consistently treated as a string (preserving leading zeros), and
Integer columns with missing values do not cause errors when schema enforcement is enabled?

Any guidance or best practices for handling this trade-off in Dataiku would be greatly appreciated.

Dataiku version used: 14.2.1

ÁngelÁlvarez · February 5

The Short Solution

Set infer_with_pandas=False to protect your IDs and use_nullable_integers=True to allow NAs in integer columns:

import dataiku

# Read recipe inputs
test_data = dataiku.Dataset("test_data")

# This combination preserves leading zeros and handles NA integers
test_data_df = test_data.get_dataframe(
    infer_with_pandas=False,
    use_nullable_integers=True
)

Why this works:

infer_with_pandas=False: Passes the dataset schema types to pandas instead of letting pandas guess, so your ID column keeps its "String" type and is not mangled into a float.
use_nullable_integers=True: Tells Dataiku to use the pandas Int64 (nullable) type instead of the standard NumPy int64. This allows the column to remain an integer type while safely supporting missing values, which appear as <NA> rather than NaN.
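
If it helps to see the underlying pandas behavior outside DSS, here is a minimal sketch using plain pandas only. The sample data and the column names (id, count) are made up for illustration, and read_csv with an explicit dtype mapping stands in for what the two get_dataframe options achieve; it is not how Dataiku implements them internally.

```python
import io

import pandas as pd

# Hypothetical sample: an ID with a leading zero and an integer column with a gap.
csv_text = "id,count\n00420221,5\n42022100,\n"

# Default inference: pandas guesses each column's type from its contents.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit types: IDs stay text, and the integer column uses the nullable Int64 dtype.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "string", "count": "Int64"},
)

print(inferred.dtypes)  # id is read as a number, count falls back to float64
print(typed.dtypes)     # id is a string column, count is Int64
print(typed)
```

In the inferred frame the leading zero of the first ID is gone and the count column becomes float64 because of the missing value. In the typed frame the IDs are kept verbatim and the missing count is shown as <NA> while the column remains an integer type.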
