MongoDB connector - get embedded document in a document of a collection

Options
clemEs
clemEs Registered Posts: 9 ✭✭✭

Hi all,

Let's consider a Mongo collection accessible by the Mongo connector. I have different keys in my document :

{ key1: value1,

key2:value2,

key3:[{key31:value311, key32:value321 ...},

{key31:value312, key32:value322 ...}]

}

In one of the value I have an array of documents.

With the connector I want to only access this array of documents.

Is it possible to do it only with the connector?

Regards,

Clement

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Answer ✓
    Options

    Hi,

    The filtering alone on the connection doesn't produce the results you are looking for. If I understand correctly you create a new dataset with only the key3 as a folder array and use a prepared recipe to achieve the results you are looking for.

    I've used the same test data you suggested:

    1. My Dataset settings are :

    Screenshot 2021-09-30 at 11.43.37.png

    2) I've updated the schema to keep only key3

    Screenshot 2021-09-30 at 11.43.43.png

    3. This produces a dataset like so :

    Screenshot 2021-09-30 at 11.45.37.png

    4. Now I can fold the array using the processor Fold array values:

    Screenshot 2021-09-30 at 11.46.44.png

    5) Now I can also unnest the object so I have ended up with the following dataset,

    Screenshot 2021-09-30 at 11.47.14.png

    Let me know if you see any issues with the approach to achieve what you are looking for.

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Options

    Hi @clemEs
    ,

    It's not possible to limit a collection or filter for a specific key in the MongoDB connection string. You can only set the database there. The advanced URI syntax does not have filtering by collection or key. https://docs.mongodb.com/manual/reference/connection-string/

    Filtering can be done on the collection chosen using the “Input filtering query” when creating the datasets.

    See: https://knowledge.dataiku.com/latest/kb/setup-admin/dss-and-mongodb/index.html#create-a-mongodb-dataset

    Let me know if that works for you.

  • clemEs
    clemEs Registered Posts: 9 ✭✭✭
    Options

    Hi that's actually the input filtering query that I don't manage to use.

    Even with the documentation, it doesn't help me with embedded document.

    Is it supported by Dataiku?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Options

    Hi,

    Can you share what your filter syntax is? Essentially the syntax would be the same as in mongo find

    https://docs.mongodb.com/manual/reference/method/db.collection.find/

    As per your example if you want to filter only for key3 then you can do something like this.

    my_new_db> db.my_new_col.find({key3:{$exists:true}},{key3:1})[{_id: ObjectId("613b55eedb1a5b9ef163bc87"),key3: [{ key31: 'value311', key32: 'value321' },{ key31: 'value312', key32: 'value322' }]}]

    If you need to more complex queries for your embedded documents you can have a look at : https://docs.mongodb.com/manual/tutorial/query-array-of-documents/

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Options

    Just to clarify, {key3:1} part from my example would be ignored when generating the actual since the query will contina : fields { "_id" : 0} with limit 100

    In order to only use the only key3. You would need to make the following changes :

    1. Select "When reading, only keep columns declared" along with you filtering query in this example I look for only documents where key3 exists.

    Screenshot 2021-09-10 at 15.19.52.png

    2. Remove all other columns you don't want in from the schema and save. This will remove key1..key2

    Let me know if this is the type of filtering you were looking for.

  • clemEs
    clemEs Registered Posts: 9 ✭✭✭
    Options

    Hi,

    unfortunately it is not what I'm aiming for.

    Let me represent one row of the dataset with multiple columns:

    1. { key1: value1,

      key2:value2,

      key3:[{key31:value311, key32:value321 ...},

      {key31:value312, key32:value322 ...}]

      }

    2. I want to get the multiple rows that are in key3 and create this new dataset :

      {key31:value311, key32:value321 ...},

      {key31:value312, key32:value322 ...}

    I'm starting to think that the MongoDB connector proposed by Dataiku cannot do that, contrary to other conenctor in ohter ETL or BI tool.

  • clemEs
    clemEs Registered Posts: 9 ✭✭✭
    Options

    Thank you this. The process is working well but unfortunately le memory alloated to my instance is not matching the computing needs.

Setup Info
    Tags
      Help me…