Imported MySQL dataset - schema incomplete

Solved!
cuezumo
Level 2

Hi folks,

I'm importing data from a MongoDB connection into Dataiku using MongoDB's BI Connector. So far this has worked fine and I've never had any problems. However, the schema of the resulting MySQL dataset is incomplete: I'm missing one important column.

There are not many settings I can change or experiment with, and in MongoDB itself the schema is complete. Does anybody have a hint on how to solve this?

Thanks!

cuezumo

2 Replies
AlexT
Dataiker

Hi,

The schema is determined by sampling the collection: only the fields present in the first rows returned from the collection are detected. DSS does not scan the whole collection to determine the schema.

 

To get around this behavior you have a few options:

1) Filter on whether the column exists. If this is an important column, perhaps you only want rows that contain it; you can later remove the filter if you actually want all rows. Once the filter is added and you run Schema > Check data, you should be able to pull in the new schema.

You can use the $exists operator (https://www.mongodb.com/docs/manual/reference/operator/query/exists/):

In the example below, key3 would be the column you want to filter for.

[Screenshot: dataset filter settings with an $exists filter on key3]

If you remove the filter later you will see a schema mismatch warning, which you can ignore.
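For reference, a minimal version of such a filter, with key3 standing in for your missing column, would look like `{ "key3": { "$exists": true } }` in MongoDB query syntax. This restricts the sampled rows to documents that actually contain key3, so schema detection picks the column up.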

2) Another workaround is to add the column manually from the dataset's schema interface (using "Add column").

[Screenshot: dataset schema tab with the "Add column" action]

3) You could also do it via code, using the `write_schema()` method on a dataset, for example:

import dataiku

dataset = dataiku.Dataset("dataset_name")  # replace with your dataset name
dataset.write_schema([{"name": "column1", "type": "string"}, {"name": "column2", "type": "string"}, ...])  # replace ... with your other columns

You will see a warning in the Schema tab if you check the consistency, so use that dataset and the modified schema with caution (see https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html#dataiku.Dataset.write_schema).
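If you want to confirm the change took effect, here is a minimal sketch using the dataset's `read_schema()` method ("dataset_name" is the same placeholder as above):

import dataiku

dataset = dataiku.Dataset("dataset_name")  # replace with your dataset name
for column in dataset.read_schema():
    print(column["name"], column["type"])  # the missing column should now be listed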

Let me know if that helps. 

cuezumo
Level 2
Author

Big, big thanks, @AlexT, that worked perfectly!

Thanks for the quick help.

Best, cuezumo
