Errors when syncing partitioned dataset to lower granularity

Peter_van_Klave · ‎11-01-2019

I use a Twitter connector to collect tweets. By default, the Twitter data is partitioned by the hour in Dataiku. In my flow I want to use a lower granularity and partition by day (or week / month / ...), e.g. to calculate the sentiment of all the tweets per day (week / month). This can be done by using a Time Range dependency when syncing the Twitter data to a new (output) dataset. However, sometimes no tweets are detected in a particular hour, and the partition for that hour does not exist in the input. When I try to run the sync recipe in this case, I get an error saying the partition in question is missing; processing stops and the output dataset is not filled. How can I sync when not all of the input partitions that should make up the output partition are available?

Regards,

Peter van Klaveren

Liev · ‎11-01-2019

Hi Peter,

Assuming this is a FS dataset, under its Settings > Advanced, you should see an option to tick for "Missing partitions as empty".

Let me know if this works for you 🙂

Peter_van_Klave · ‎11-01-2019

Hi Liev,
I am afraid this did not work. My output dataset is indeed a FS dataset (and I checked the option you mentioned), but the problem is in the input dataset, which is a (streaming) Twitter dataset. Although this is stored on the FS in managed_folders, this type of dataset does not have the option to treat missing partitions as empty.

Liev · ‎11-02-2019

Hi Peter, indeed the answer was regarding your input dataset. I'm not sure I understand what shape the streaming Twitter dataset takes or what is populating it. Would you mind giving us a little more info?

Peter_van_Klave · ‎11-02-2019

Hi Liev, from what I see, the Twitter dataset must be configured with a connector (in my case 'filesystem_managed') and a path (in my case 'centric_tweets'). The Twitter dataset also has an option to 'Start Streaming', and from the moment that it is activated, folders are created on the given path with a structure like '%Y/%M/%D/%H'. I have folders like:
centric_tweets/2019/11/01/08
centric_tweets/2019/11/01/09
centric_tweets/2019/11/01/10
etc.
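
To see which hours are actually missing for a given day, something like the following could be run against the folder tree. This is only a minimal sketch in plain Python, not a Dataiku API: the function name `missing_hours` is hypothetical, and it assumes the layout shown above (root/year/month/day/hour, each zero-padded) is readable from where the script runs.

```python
from datetime import date
from pathlib import Path


def missing_hours(root, day):
    """Return the hours (0-23) of `day` that have no folder under root/YYYY/MM/DD/HH.

    Hypothetical helper: assumes the zero-padded year/month/day/hour
    folder layout described in the post above.
    """
    day_dir = Path(root) / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    # An hour counts as missing when its two-digit folder does not exist.
    return [hour for hour in range(24) if not (day_dir / f"{hour:02d}").is_dir()]
```

For example, `missing_hours("centric_tweets", date(2019, 11, 1))` would list the hours of 1 November 2019 for which no folder was created; these are the hourly partitions the daily sync would report as missing.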

Labels

Partitioning
