Errors when syncing partitioned dataset to lower granularity

Peter_van_Klave
Peter_van_Klave Partner, Registered Posts: 10 Partner
I use a Twitter connector to collect tweets. By default, the Twitter data is partitioned by hour in Dataiku. In my flow I want to use a lower granularity and partition by day (or week, month, ...), e.g. to calculate the sentiment of all tweets per day (or week, month). This can be done by using a Time Range when syncing the Twitter data to a new (output) dataset. However, sometimes no tweets are detected in a particular hour, so the partition for that hour does not exist in the input. When I run the sync recipe in that case, it fails with an error saying the partition in question is missing; processing stops and the output dataset is not filled. How can I sync when not all of the input partitions that make up an output partition are available?

Regards,

Peter van Klaveren
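
To illustrate the dependency described in the question: a daily output partition built with a Time Range depends on all 24 hourly input partitions of that day, so a single missing hour is enough to make the build fail. A minimal sketch, assuming hourly partition identifiers of the form YYYY-MM-DD-HH (the function name is hypothetical and not part of any Dataiku API):

```python
from datetime import date, datetime, timedelta


def hourly_partitions_for_day(day: date) -> list[str]:
    """Return the 24 hourly partition identifiers that make up one day.

    Assumption: identifiers look like '2019-11-01-08' (YYYY-MM-DD-HH).
    """
    start = datetime(day.year, day.month, day.day)
    return [
        (start + timedelta(hours=h)).strftime("%Y-%m-%d-%H")
        for h in range(24)
    ]


parts = hourly_partitions_for_day(date(2019, 11, 1))
print(parts[0], parts[-1], len(parts))  # 2019-11-01-00 2019-11-01-23 24
```

If any one of these identifiers has no data on the input side, the daily partition cannot be assembled unless missing partitions are treated as empty.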

Answers

  • Liev
    Liev Dataiker Alumni Posts: 176 ✭✭✭✭✭✭✭✭
    Hi Peter,

    Assuming this is a filesystem (FS) dataset, under its Settings > Advanced you should see a "Missing partitions as empty" option that you can tick.

    Let me know if this works for you :)
  • Peter_van_Klave
    Peter_van_Klave Partner, Registered Posts: 10 Partner
    Hi Liev,
    I am afraid this did not work. My output dataset is indeed a FS dataset (and I ticked the option you mentioned), but the problem is in the input dataset, which is a (streaming) Twitter dataset. Although this is stored on the FS in managed_folders, this type of dataset does not have the option to treat missing partitions as empty.
  • Liev
    Liev Dataiker Alumni Posts: 176 ✭✭✭✭✭✭✭✭
    Hi Peter, indeed the answer was regarding your input dataset. I'm not sure I understand what shape the streaming Twitter dataset takes or what is populating it. Would you mind giving us a little more info?
  • Peter_van_Klave
    Peter_van_Klave Partner, Registered Posts: 10 Partner
    Hi Liev, from what I see, the Twitter dataset must be configured with a connector (in my case 'filesystem_managed') and a path (in my case 'centric_tweets'). The Twitter dataset also has an option to 'Start Streaming', and from the moment it is activated, folders are created on the given path with a structure like '%Y/%M/%D/%H'. I have folders like:
    centric_tweets/2019/11/01/08
    centric_tweets/2019/11/01/09
    centric_tweets/2019/11/01/10
    etc.
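
As a diagnostic aid, the folder layout above can be scanned to see which hours of a given day have no folder and therefore no input partition. This is only a sketch under assumptions: the root path is reachable from where the script runs, the folders are zero-padded year/month/day/hour as in the listing, and `missing_hour_folders` is a hypothetical helper, not a Dataiku function.

```python
from datetime import date
from pathlib import Path


def missing_hour_folders(root: Path, day: date) -> list[str]:
    """List the hours ('00'..'23') that have no folder under root/YYYY/MM/DD."""
    day_dir = root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    return [
        f"{hour:02d}"
        for hour in range(24)
        if not (day_dir / f"{hour:02d}").is_dir()
    ]


# Example call; 'centric_tweets' is the path mentioned in the post.
print(missing_hour_folders(Path("centric_tweets"), date(2019, 11, 1)))
```

The hours it reports are the ones the sync recipe would flag as missing partitions when building that day.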