Combining data sources from Blob Storage

Tomasz (Dataiku DSS Core Designer, Registered, Posts: 9)

Hello,

I have a pretty simple problem, but somehow I'm not able to solve it.

I'm using Dataiku to do ETL work and I need to pull data from Blob Storage. Usually I just specify a path and that becomes the start of my flow. In this case, though, I need to build the first source dataset from multiple blob "folders" that sit in one container. For example, I have these:

A1
B1
B2
B3
C1

I want to get only the blobs whose path starts with "B". I know I can use a regex pattern in the "Import from Blob Storage" functionality, but unfortunately there are more than 1 million blobs in this container, so the enumeration fails with this error:

ERR_FSPROVIDER_TOO_MANY_FILES: Attempted to enumerate too many files (see the Dataiku DSS 13 documentation)

What would be the cleanest solution to this problem?

Thanks!
Tomasz

Operating system used: Windows

Answers

  • Turribeach (Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Posts: 2,248)

    While it's easy to dump data into cloud storage like that, it's very hard to retrieve or process it once you have millions of files. So you really should move away from this way of storing your data.

    Try pointing a Files in Folder dataset at a managed folder that points to your blob storage. This will let you use B* as the inclusion rule, but I don't know if it will work with so many files.
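
    If the Files in Folder dataset also hits the enumeration limit, another option could be a Python recipe that asks Azure for the prefix directly, so only the "B" blobs are ever listed. This is just a sketch assuming the azure-storage-blob package; the connection string, container name and prefix below are placeholders, not anything from your setup:

        # List only blobs whose name starts with "B", without enumerating the whole container.
        # The name_starts_with filter is applied server-side by the Azure Blob API.
        from azure.storage.blob import BlobServiceClient

        connection_string = "<your-storage-account-connection-string>"  # placeholder
        container_name = "my-container"                                 # placeholder

        service = BlobServiceClient.from_connection_string(connection_string)
        container = service.get_container_client(container_name)

        # Iterate only the blobs under the "B" prefix and collect their names.
        matching = [b.name for b in container.list_blobs(name_starts_with="B")]
        print(f"Found {len(matching)} blobs with prefix 'B'")

    From there you could read or copy just those blobs into a managed folder or dataset, instead of letting DSS try to enumerate the whole container.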
