Errors when syncing partitioned dataset to lower granularity

Peter_van_Klave · November 2019

I use a Twitter connector to collect tweets. By default, Dataiku partitions the Twitter data by hour. In my flow I want to use a lower (coarser) granularity and partition by day (or week, month, ...), e.g. to calculate the sentiment of all the tweets per day (week, month). This can be done by using a Time Range when syncing the Twitter data to a new (output) dataset.

However, sometimes no tweets are detected in a particular hour, and the partition for that hour does not exist in the input. When I run the sync recipe in this case, it fails with an error saying the partition in question is missing; processing stops and the output dataset is not filled. How can I sync when not all of the input partitions that should make up the output partition are available?

Regards,

Peter van Klaveren

Liev · November 2019

Hi Peter,

Assuming this is a FS dataset, under its Settings > Advanced, you should see an option to tick for "Missing partitions as empty".

Let me know if this works for you

Peter_van_Klave · November 2019

Hi Liev,
I am afraid this did not work. My output dataset is indeed an FS dataset (and I checked the option you mentioned), but the problem is in the input dataset, which is a (streaming) Twitter dataset. Although this is stored on the FS in managed_folders, this type of dataset does not have the option to treat missing partitions as empty.

Liev · November 2019

Hi Peter, indeed the answer was regarding your input dataset. I'm not sure I understand what shape the streaming Twitter dataset takes or what is populating it. Would you mind giving us a little more info?

Peter_van_Klave · November 2019

Hi Liev, from what I see, the Twitter dataset must be configured with a connection (in my case 'filesystem_managed') and a path (in my case 'centric_tweets'). The Twitter dataset also has an option to 'Start Streaming', and from the moment that it is activated, folders are created on the given path with a structure like '%Y/%M/%D/%H'. I have folders like:
centric_tweets/2019/11/01/08
centric_tweets/2019/11/01/09
centric_tweets/2019/11/01/10
etc.
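
To see which hours are actually absent for a given day, something like the following could be run against that folder tree. This is only a sketch in plain Python, not a Dataiku API: the function names are made up for illustration, and it assumes the hourly folders sit directly under the path as shown above (the real location of 'centric_tweets' on disk depends on how the 'filesystem_managed' connection is set up). Note that Python's strftime writes month and day as %m and %d, unlike the '%Y/%M/%D/%H' pattern above.

```python
from datetime import date, datetime, timedelta
from pathlib import Path
from typing import List


def expected_hour_dirs(root: Path, day: date) -> List[Path]:
    """Return the 24 hourly folders (root/YYYY/MM/DD/HH) that make up one day."""
    start = datetime(day.year, day.month, day.day)
    hours = (start + timedelta(hours=h) for h in range(24))
    # strftime uses %m and %d for month and day of month
    return [root / ts.strftime("%Y/%m/%d/%H") for ts in hours]


def missing_hour_dirs(root: Path, day: date) -> List[Path]:
    """Return the hourly folders of `day` that do not exist under `root`."""
    return [p for p in expected_hour_dirs(root, day) if not p.is_dir()]


if __name__ == "__main__":
    # Assumption: the script is started from the directory containing 'centric_tweets'.
    for folder in missing_hour_dirs(Path("centric_tweets"), date(2019, 11, 1)):
        print("missing hour folder:", folder)
```

This only reports the gaps; whether creating empty folders for those hours would make the sync treat them as existing (empty) partitions is something I have not verified.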
