MongoDB connector - get embedded document in a document of a collection

Solved!
clemEs
Level 2
MongoDB connector - get embedded document in a document of a collection

Hi all,

Let's consider a Mongo collection accessible by the Mongo connector. I have different keys in my document :

{ key1: value1,

    key2:value2,

     key3:[{key31:value311, key32:value321 ...},

                {key31:value312, key32:value322 ...}]

}

In one of the value I have an array of documents.

With the connector I want to only access this array of documents.

Is it possible to do it only with the connector?

Regards,

Clement

0 Kudos
1 Solution
AlexT
Dataiker

Hi, 

The filtering alone on the connection doesn't produce the results you are looking for. If I understand correctly you create a new dataset with only the key3 as a folder array and use a prepared recipe to achieve the results you are looking for. 

I've used the same test data you suggested:

1. My Dataset settings are :

Screenshot 2021-09-30 at 11.43.37.png

2) I've updated the schema to keep only key3

 

Screenshot 2021-09-30 at 11.43.43.png

3. This produces a dataset like so :

Screenshot 2021-09-30 at 11.45.37.png

4. Now I can fold the array using the processor Fold array values:

Screenshot 2021-09-30 at 11.46.44.png

5) Now I can also unnest the object so I have ended up with the following dataset,

Screenshot 2021-09-30 at 11.47.14.png

Let me know if you see any issues with the approach to achieve what you are looking for. 

View solution in original post

0 Kudos
7 Replies
AlexT
Dataiker

Hi @clemEs ,

It's not possible to limit a collection or filter for a specific key in the MongoDB connection string. You can only set the database there. The advanced URI syntax does not have filtering by collection or key.  https://docs.mongodb.com/manual/reference/connection-string/

Filtering can be done on the collection chosen using the “Input filtering query” when creating the datasets.

See: https://knowledge.dataiku.com/latest/kb/setup-admin/dss-and-mongodb/index.html#create-a-mongodb-data... 

Let me know if that works for you. 

 

0 Kudos
clemEs
Level 2
Author

Hi that's actually the input filtering query that I don't manage to use.

Even with the documentation, it doesn't help me with embedded document. 

Is it supported by Dataiku?

0 Kudos
AlexT
Dataiker

Hi,

Can you share what your filter syntax is? Essentially the syntax would be the same as in mongo find

https://docs.mongodb.com/manual/reference/method/db.collection.find/

As per your example if you want to filter only for key3 then you can do something like this.

my_new_db> db.my_new_col.find({key3:{$exists:true}},{key3:1})
[
  {
    _id: ObjectId("613b55eedb1a5b9ef163bc87"),
    key3: [
      { key31: 'value311', key32: 'value321' },
      { key31: 'value312', key32: 'value322' }
    ]
  }
]

 

If you need to more complex queries for your embedded documents you can have a look at : https://docs.mongodb.com/manual/tutorial/query-array-of-documents/ 

0 Kudos
AlexT
Dataiker

Just to clarify, {key3:1} part from my example would be ignored when generating the actual since the query will contina : fields { "_id" : 0} with limit 100

In order to only use the only key3. You would need to make the following changes :

1. Select "When reading, only keep columns declared" along with you filtering query in this example I look for only documents where key3 exists. 

Screenshot 2021-09-10 at 15.19.52.png

2. Remove all other columns you don't want in from the schema and save. This will remove key1..key2

Let me know if this is the type of filtering you were looking for.

 

0 Kudos
clemEs
Level 2
Author

Hi,

unfortunately it is not what I'm aiming for.

Let me represent one row of the dataset with multiple columns:

  1. { key1: value1,

        key2:value2,

         key3:[{key31:value311, key32:value321 ...},

                    {key31:value312, key32:value322 ...}]

    }

  2. I want to get the multiple rows that are in key3 and create this new dataset :

    {key31:value311, key32:value321 ...},

    {key31:value312, key32:value322 ...}

 

I'm starting to think that the MongoDB connector proposed by Dataiku cannot do that, contrary to other conenctor in ohter ETL or BI tool.

0 Kudos
AlexT
Dataiker

Hi, 

The filtering alone on the connection doesn't produce the results you are looking for. If I understand correctly you create a new dataset with only the key3 as a folder array and use a prepared recipe to achieve the results you are looking for. 

I've used the same test data you suggested:

1. My Dataset settings are :

Screenshot 2021-09-30 at 11.43.37.png

2) I've updated the schema to keep only key3

 

Screenshot 2021-09-30 at 11.43.43.png

3. This produces a dataset like so :

Screenshot 2021-09-30 at 11.45.37.png

4. Now I can fold the array using the processor Fold array values:

Screenshot 2021-09-30 at 11.46.44.png

5) Now I can also unnest the object so I have ended up with the following dataset,

Screenshot 2021-09-30 at 11.47.14.png

Let me know if you see any issues with the approach to achieve what you are looking for. 

0 Kudos
clemEs
Level 2
Author

Thank you this. The process is working well but unfortunately le memory alloated to my instance is not matching the computing needs.

0 Kudos