Imported MySQL dataset - schema incomplete

cuezumo
cuezumo Registered Posts: 8 ✭✭✭✭

Hi folks,

I'm importing data from MongoDB connection into Dataiku using MongoDB's BI Connector. So far this has worked fine, I've never had any problems. However, the schema of this MySQL dataset is incomplete, I'm missing one important column.

There's not many settings that I can change or experiment with. In the MongoDB the schema is complete. Does anybody have a hint how to solve this?

Thanks!

cuezumo

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker
    edited July 17 Answer ✓

    Hi,

    The schema is determined by testing the collection and only the fields are available in the first rows returned from the collection. DSS does not scan the whole collection to determine the schema.

    To get around this behavior you have a few options:

    1) Filter if that column exists, if this is important column perhaps you only want data containing this column.
    You can later remove this filter if you actually want all rows. Once this is added and you Schema - Check data you should be able to pull in the new schema.

    You can use https://www.mongodb.com/docs/manual/reference/operator/query/exists/:

    Example key3 would the column you want to filter for.

    Screenshot 2022-07-18 at 21.28.55.png

    If you remove the filter later you will see a schema mismatch warning which you can ignore.

    2) Another workaround, is from the schema interface (using "Add column") of the dataset.

    Screenshot 2022-07-18 at 21.37.11.png

    3) You could also do it via code using the `write_schema()` method on a dataset, like:

    dataset = dataiku.Dataset("dataset_name") # replace with your dataset name
    dataset.write_schema([{"type": "string", "name": "column1"}, {"type": "string", "name": "column2"}, ...]) # replace ... by other columns

    You will have a warning in the Schema tab if you check the consistency. So use that dataset and the modified schema with caution (see https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html#dataiku.Dataset.write_schema).

    Let me know if that helps.

Answers

Setup Info
    Tags
      Help me…