Errors when syncing partitioned dataset to lower granularity
I use a Twitter connector to collect tweets. By default, the Twitter data is partitioned by the hour in Dataiku. In my data flow I want to use a coarser granularity and partition by day (or week / month / ...), e.g. to calculate the sentiment of all tweets per day (or per week / month). This can be done by using a Time Range when syncing the Twitter data to a new (output) dataset. However, sometimes no tweets are detected in a particular hour, so the partition for that hour does not exist in the input. When I run the sync recipe in this case, it fails with an error saying the partition in question is missing; processing stops and the output dataset is not filled. How can I sync when not all input partitions that should make up the output partition are available?
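To make the problem concrete, here is a minimal sketch (plain Python, not the Dataiku API) of the mapping involved: a daily output partition depends on 24 hourly input partitions, and the sync fails whenever any of those hours is absent. The function names and the `YYYY-MM-DD-HH` partition-ID format are illustrative assumptions, not Dataiku internals:

```python
from datetime import datetime, timedelta

def hourly_ids_for_day(day_id):
    """Enumerate the 24 hourly partition IDs (illustrative format
    YYYY-MM-DD-HH) that a daily partition YYYY-MM-DD depends on."""
    day = datetime.strptime(day_id, "%Y-%m-%d")
    return [(day + timedelta(hours=h)).strftime("%Y-%m-%d-%H") for h in range(24)]

def partitions_to_sync(day_id, existing):
    """Keep only the hourly partitions that actually exist, so that hours
    with no tweets are skipped instead of causing a 'missing partition' error."""
    existing = set(existing)
    return [p for p in hourly_ids_for_day(day_id) if p in existing]
```

For example, if only the 08:00 and 09:00 partitions exist for 2019-11-01, `partitions_to_sync("2019-11-01", ["2019-11-01-08", "2019-11-01-09"])` returns just those two, which is the behavior the question is asking for from the sync recipe.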
Hi Liev, from what I see, the Twitter dataset must be configured with a connector (in my case 'filesystem_managed') and a path (in my case 'centric_tweets'). The Twitter dataset also has a 'Start Streaming' option, and from the moment it is activated, folders are created under the given path with a structure like '%Y/%M/%D/%H'. I have folders like: centric_tweets/2019/11/01/08, centric_tweets/2019/11/01/09, centric_tweets/2019/11/01/10, etc.
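A small sketch of how that folder layout comes about: in the dataset's path pattern, Dataiku's '%M' means month and '%D' means day (unlike strftime, where those are '%m' and '%d'), so one folder is created per streamed hour. This is plain Python mimicking the layout, not code from the connector:

```python
from datetime import datetime, timedelta

def hourly_folders(start, hours):
    """Render the folders produced by the pattern %Y/%M/%D/%H under
    'centric_tweets', using the strftime equivalents %Y/%m/%d/%H."""
    return [(start + timedelta(hours=h)).strftime("centric_tweets/%Y/%m/%d/%H")
            for h in range(hours)]
```

Starting from 2019-11-01 08:00 this reproduces the folders listed above (`.../08`, `.../09`, `.../10`); an hour with no tweets simply gets no folder, which is why the corresponding input partition is missing.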
Hi Peter, indeed the answer was regarding your input dataset. I'm not sure I understand what shape the streaming Twitter dataset takes or what is populating it. Would you mind giving us a little more info?
Hi Liev, I am afraid this did not work. My output dataset is indeed an FS dataset (and I checked the option you mentioned), but the problem is in the input dataset, which is a (streaming) Twitter dataset. And although this is stored on the FS in managed folders, this type of dataset does not have the option to treat missing partitions as empty.