MongoDB connector - get embedded document in a document of a collection
Hi all,
Let's consider a Mongo collection accessible through the Mongo connector. I have different keys in my document:
{ key1: value1,
  key2: value2,
  key3: [{key31: value311, key32: value321, ...},
         {key31: value312, key32: value322, ...}]
}
In one of the values I have an array of documents.
With the connector, I want to access only this array of documents.
Is it possible to do it only with the connector?
Regards,
Clement
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
Hi,
The filtering alone on the connection doesn't produce the results you are looking for. If I understand correctly, you can create a new dataset with only key3 as an array column and then use a Prepare recipe to achieve the results you are looking for.
I've used the same test data you suggested:
1. My dataset settings are as follows:
2. I've updated the schema to keep only key3.
3. This produces a dataset like so:
4. Now I can fold the array using the Fold array values processor:
5. Now I can also unnest the object, ending up with the following dataset:
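For reference, here is a rough equivalent of what steps 1-5 amount to, expressed directly in the mongo shell (the database and collection names my_new_db / my_new_col are just my test names, adjust to yours):

    my_new_db> db.my_new_col.aggregate([
      { $match: { key3: { $exists: true } } },  // same role as the input filtering query
      { $unwind: "$key3" },                     // one document per array element, like Fold array values
      { $replaceRoot: { newRoot: "$key3" } }    // promote the embedded document, like Unnest object
    ])

On the test data above, this yields one row per embedded document:

    [
      { key31: 'value311', key32: 'value321' },
      { key31: 'value312', key32: 'value322' }
    ]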
Let me know if you see any issues with the approach to achieve what you are looking for.
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
Hi @clemEs,
It's not possible to limit a collection or filter for a specific key in the MongoDB connection string; you can only set the database there. The advanced URI syntax does not support filtering by collection or key: https://docs.mongodb.com/manual/reference/connection-string/
Filtering can be done on the chosen collection using the "Input filtering query" when creating the dataset.
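For example, to keep only the documents from your sample that actually contain the key3 array, the filtering query would be a plain find-style filter:

    { key3: { $exists: true } }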
Let me know if that works for you.
-
Hi, it's actually the input filtering query that I can't manage to use.
Even with the documentation, I can't get it to work with embedded documents.
Is it supported by Dataiku?
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
Hi,
Can you share what your filter syntax is? Essentially, the syntax is the same as for Mongo's find:
https://docs.mongodb.com/manual/reference/method/db.collection.find/
As per your example, if you want to filter only for key3, you can do something like this:
my_new_db> db.my_new_col.find({key3: {$exists: true}}, {key3: 1})
[
  {
    _id: ObjectId("613b55eedb1a5b9ef163bc87"),
    key3: [
      { key31: 'value311', key32: 'value321' },
      { key31: 'value312', key32: 'value322' }
    ]
  }
]
If you need more complex queries for your embedded documents, you can have a look at https://docs.mongodb.com/manual/tutorial/query-array-of-documents/
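For instance, $elemMatch matches documents whose key3 array contains at least one embedded document satisfying all the listed conditions (values follow your sample data):

    db.my_new_col.find({ key3: { $elemMatch: { key31: 'value311', key32: 'value321' } } })

Note that $elemMatch requires a single array element to satisfy all the conditions at once, whereas separate dot-notation conditions (e.g. on "key3.key31" and "key3.key32") may be satisfied by different elements.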
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
Just to clarify, the {key3:1} projection from my example would be ignored when generating the actual query, since the generated query will contain: fields { "_id": 0 } with limit 100.
In order to keep only key3, you would need to make the following changes:
1. Select "When reading, only keep columns declared" along with your filtering query; in this example I look only for documents where key3 exists.
2. Remove all the other columns you don't want from the schema and save. This will remove key1 and key2.
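Putting that together, my understanding is that the effective read would be roughly equivalent to the following (a sketch, not the literal query DSS generates):

    db.my_new_col.find(
      { key3: { $exists: true } },  // your input filtering query
      { "_id": 0 }                  // projection DSS adds itself
    ).limit(100)                    // limit used when sampling

with DSS then keeping only the columns declared in the schema, i.e. key3.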
Let me know if this is the type of filtering you were looking for.
-
Hi,
Unfortunately, it is not what I'm aiming for.
Let me represent one row of the dataset with multiple columns:
{ key1: value1,
  key2: value2,
  key3: [{key31: value311, key32: value321, ...},
         {key31: value312, key32: value322, ...}]
}
I want to get the multiple rows that are in key3 and create this new dataset:
{key31: value311, key32: value321, ...}
{key31: value312, key32: value322, ...}
I'm starting to think that the MongoDB connector proposed by Dataiku cannot do that, contrary to other connectors in other ETL or BI tools.
-
Thank you for this. The process is working well, but unfortunately the memory allocated to my instance does not match the computing needs.