Welcome to the fifth Community Conundrum! This week we have a Data Prep puzzle for you - give it a go!
Situation: A twitter user has a dataset (attached) of new followers by date. The problem is the new followers (of type “natural language”) contains values with all the new followers on a particular date.
Objective: Given the dataset "followers_count", create a single dataset output that splits up these natural followers into their own rows and determine the number of weeks since the new followers were counted.
If you would like to share your results with the community feel free to do so here either in the form of screenshots, a text based description, a project upload, or all three! If you would like to upload your project please refer to Submission guidelines .
What do we know about this date format?
Or is this some sort of odd date range? Or something else?
What is the 4th part of this date column?
I'm going to assume the above.
In what time zone was this date and time recorded.
Without some more information about the data, this can be difficult.
I'm going to assume local time EDT.
However, when I did this I got an exception on a date with Daylight Saving time there was an invalid hour listed.
So I'm going to go back to UTC for my time.
The range of values in nb_followers is 643-15540
if we use space as a separator on new_followers.
The range of new_followers is 2590 - 1000701322063204352
with a strange distribution.
In looking at the twitter API I found nothing specific about this data column.
So I'm wondering if I'm missing something here.
What are you asking for when you say
"determine the number of weeks since the new followers were counted."
The latest date in the date column is:
If I assume that this is local time in my time zone. (EDT).
You can find the exact number of weeks days hours since that date at the URL below.
As of the point in time when I was putting this into
However, I'm really wondering if I'm answering the right question. And If I am answering the right question above if there is a way of using the community edition to calculate these values in DSS.
That said I feel like the answer I'm giving is not answering the question being asked. I did not need to break up data in the column new_followers in order to do the calculation above. I just needed to do a string to date conversion, and then a Max.
If on the other hand, if the question in weeks, how long has it been since each nb_follower <-> new_followers pair was added.
Then maybe that makes more sense.
Yes this is really a separate question. You can use "now" as the time reference, then in the Prepare Recipe, find the time difference between the parsed date and now.
So here is the solution I ended up with.
I used the Split & Fold option.
That was put into the Prepare Recipe this way.
I then ran the prepare script.
Here is what grouping looked like.
And here is an example of the results.
So, here is a possible step to get to what is needed for the "number of weeks since".
In this case, I defined the week as 7 days and simply rounded. Maybe there is a bit better way to the date calculations.
Again I like your answer to the date calculation. Thanks. Here is a brief graphic of the result.
Attached is my flow. See how this opens for you?
I was just thinking, for the number of weeks since new follower, there is a 'Compute difference between dates' processor which is more straightforward in my opinion since you can specify that you want the difference in weeks and 'now' is one of the options for the end time. But your solution works too!