I've recently started to use the "Run integration test" scenario step for testing. It's definitely some work to create the test reference datasets, but once set up, it's great to be able to run the test after later code changes to confirm the process still works as expected.
Our flows mostly use SQL script recipes. However, some flows do include Python recipes, and it turns out that these recipes don't currently work with the integration test feature: references to datasets (and, more importantly for our use case, the underlying SQL table names) aren't swapped by the integration testing process.
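For concreteness, here is a hedged sketch of the kind of pattern that fails (the connection and table names are hypothetical):

from dataiku import SQLExecutor2

# The table name is hard-coded inside the SQL text, so integration testing
# has no way to swap it for the test copy of the table
executor = SQLExecutor2(connection="my_sql_connection")
df = executor.query_to_df('SELECT * FROM "MYPROJ_DATASET_NAME"')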
Can the integration testing process be used with Python recipes? If so, how?
Operating system used: Red Hat Linux
I did some research on how to use integration testing with Python recipes. I'm sharing my findings below.
Any input dataset reference that appears inside the call that creates a handle/object for the dataset is swapped by integration testing:
import dataiku

dataset_obj = dataiku.Dataset("DATASET_NAME")
This makes sense as most uses of an input dataset would start with this step.
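As a minimal illustration, any read done through that handle then targets whatever dataset the test swapped in:

# Reads through the handle automatically pick up the swapped (test) dataset
df = dataset_obj.get_dataframe()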
A common use case for us is using Python to create and execute dynamic SQL scripts. In some cases we had hard-coded references to table names in the SQL portions of these Python recipes. Not surprisingly, this doesn't work with integration testing. Here are several ways to obtain table names that work with integration testing.
unresolved_table_name = dataiku.Dataset("DATASET_NAME").get_config().get('params').get('table')
resolved_table_name = dataiku.Dataset("DATASET_NAME").get_location_info().get('info').get('table')
quoted_resolved_table_spec = dataiku.Dataset("DATASET_NAME").get_location_info().get('info').get('quotedResolvedTableName')
The first returns ${projectKey}_DATASET_NAME. The second returns MYPROJ_DATASET_NAME. The third returns "MYPROJ_DATASET_NAME" (or "DATABASE"."SCHEMA"."MYPROJ_DATASET_NAME" if a database and schema are specified for an external/non-managed dataset). Depending on the need, one of these approaches may be the best choice. These methods work with SQL datasets that reference a table, but not with SQL datasets that reference a SQL query.
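To put these together, here is a sketch of the dynamic SQL pattern (the column name and SQLExecutor2 usage are illustrative, not prescriptive); the key point is that the table spec comes from the dataset handle rather than being hard-coded:

import dataiku
from dataiku import SQLExecutor2

input_dataset = dataiku.Dataset("DATASET_NAME")

# Take the table spec from the dataset handle so integration testing can swap it
table_spec = input_dataset.get_location_info().get('info').get('quotedResolvedTableName')

# Build the dynamic SQL from the swapped table spec instead of a literal name
query = "SELECT some_column, COUNT(*) AS row_count FROM {} GROUP BY some_column".format(table_spec)

executor = SQLExecutor2(dataset=input_dataset)
df = executor.query_to_df(query)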
If you want to use a dataset name multiple times or pass it to a function, rather than assigning it to a variable or including it as a literal argument directly, create a dataset object and get the name from that. It seems odd to specify the dataset name in the call that creates the object and then use the object to get the name back, but this is needed to work with integration testing. For example:
dataset_name = dataiku.Dataset("DATASET_NAME").get_config()['name']
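Continuing from that line, a usage sketch (the helper function below is hypothetical):

import dataiku

def count_rows(name):
    # Hypothetical helper that receives a dataset name as an argument
    return len(dataiku.Dataset(name).get_dataframe())

# dataset_name reflects any swap made by integration testing,
# so the helper reads the test dataset during a test run
row_count = count_rows(dataset_name)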
Hope this is helpful to others interested in using integration testing with Python recipes.