Imported MySQL dataset - schema incomplete
Hi folks,
I'm importing data from a MongoDB connection into Dataiku using MongoDB's BI Connector. So far this has worked fine and I've never had any problems. However, the schema of the resulting MySQL dataset is incomplete: I'm missing one important column.
There aren't many settings I can change or experiment with, and in MongoDB itself the schema is complete. Does anybody have a hint on how to solve this?
Thanks!
cuezumo
Best Answer
Alexandru (Dataiker)
Hi,
The schema is determined by sampling the collection: only the fields present in the first rows returned from the collection are detected. DSS does not scan the whole collection to determine the schema.
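To illustrate why this happens (a plain-Python sketch, not DSS internals): if a field only appears in documents further down the collection, inferring the schema from the first rows alone will miss it. The collection contents and the `infer_schema` helper below are hypothetical.

```python
# Hypothetical documents: the important field only appears in a later row.
docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": "b"},
    {"_id": 3, "name": "c", "important_col": 42},  # field appears late
]

def infer_schema(documents):
    """Return the sorted union of field names seen across the documents."""
    fields = set()
    for doc in documents:
        fields.update(doc.keys())
    return sorted(fields)

print(infer_schema(docs[:2]))  # sampling only the first rows misses 'important_col'
print(infer_schema(docs))      # a full scan would have found it
```

This is the same effect you are seeing: the column exists in MongoDB, but not in the sample DSS used to build the schema.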
To get around this behavior you have a few options:
1) Filter on whether that column exists; if it is an important column, perhaps you only want rows containing it anyway. You can remove the filter later if you actually want all rows. Once the filter is added and you run Schema > Check data, you should be able to pull in the new schema. You can use the $exists query operator: https://www.mongodb.com/docs/manual/reference/operator/query/exists/
For example, key3 would be the column you want to filter for. If you remove the filter later, you will see a schema mismatch warning, which you can ignore.
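Assuming the missing column is named key3 (substitute your own field name), the query filter would look like this:

```json
{ "key3": { "$exists": true } }
```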
2) Another workaround is to add the missing column manually from the schema interface of the dataset (using "Add column").
3) You could also do it via code using the `write_schema()` method on a dataset, like:

```python
import dataiku

dataset = dataiku.Dataset("dataset_name")  # replace with your dataset name
dataset.write_schema([
    {"type": "string", "name": "column1"},
    {"type": "string", "name": "column2"},
    ...  # replace ... by other columns
])
```
You will have a warning in the Schema tab if you check the consistency, so use that dataset and the modified schema with caution (see https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html#dataiku.Dataset.write_schema).
Let me know if that helps.