Conundrum 5: Natural Followers

MichaelG
Community Manager
Community Manager
Conundrum 5: Natural Followers

Generic Community Conundrums - header for posts9.png

Welcome to the fifth Community Conundrum! This week we have a Data Prep puzzle for you - give it a go!

Situation: A twitter user has a dataset (attached) of new followers by date. The problem is the new followers (of type โ€œnatural languageโ€) contains values with all the new followers on a particular date. 


Objective: Given the dataset "followers_count", create a single dataset output that splits up these natural followers into their own rows and determine the number of weeks since the new followers were counted.

 

If you would like to share your results with the community feel free to do so here either in the form of screenshots, a text based description,  a project upload, or all three! If you would like to upload your project please refer to Submission guidelines .

I hope I helped! Do you Know that if I was Useful to you or Did something Outstanding you can Show your appreciation by giving me a KUDOS?

Looking for more resources to help you use DSS effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as โ€˜Accepted Solutionโ€™ to help others like you!
11 Replies
tgb417

What do we know about this date format?

2017-10-01-19

is it 

yyyy-MM-dd-HH

Or is this some sort of odd date range?  Or something else?

What is the 4th part of this date column?

I'm going to assume the above.

In what time zone was this date and time recorded.

Without some more information about the data, this can be difficult.

I'm going to assume local time EDT.

However, when I did this I got an exception on a date with Daylight Saving time there was an invalid hour listed.  

So I'm going to go back to UTC for my time.

--Tom
taraku
Dataiker

the date format yyyy-MM-dd-HH is correct! ๐Ÿ˜

tgb417

The range of values in nb_followers is 643-15540

if we use space as a separator on new_followers.

The range of new_followers is 2590 - 1000701322063204352

with a strange distribution.

new_followers distribution.jpg

In looking at the twitter API I found nothing specific about this data column.

So I'm wondering if I'm missing something here.

--Tom
tgb417

@MichaelG 

What are you asking for when you say

"determine the number of weeks since the new followers were counted."

The latest date in the date column is:

2018-06-07T01:00:00.000Z

If I assume that this is local time in my time zone.  (EDT).

You can find the exact number of weeks days hours since that date at the URL below.

https://www.timeanddate.com/countdown/to?msg=Graduation&p0=814&year=2018&month=6&day=7&hour=0&min=0&...

As of the point in time when I was putting this into 

Time Since.jpg

However, I'm really wondering if I'm answering the right question.  And If I am answering the right question above if there is a way of using the community edition to calculate these values in DSS.

-------------------

That said I feel like the answer I'm giving is not answering the question being asked.  I did not need to break up data in the column new_followers in order to do the calculation above.  I just needed to do a string to date conversion, and then a Max.

If on the other hand, if the question in weeks, how long has it been since each nb_follower <-> new_followers pair was added.

Then maybe that makes more sense.

  

--Tom
taraku
Dataiker

Yes this is really a separate question. You can use "now" as the time reference, then in the Prepare Recipe, find the time difference between the parsed date and now.

tgb417

@taraku 

You are correct.  In my version, I did not do the right thing.  I got the date.  Not the number of weeks.

let's see what I can do about that.

--Tom
tgb417

So here is the solution I ended up with.

The Flow.

The Flow.jpg

Prepare Steps.

Prepare Date

Prepare Date2.jpg

Splitting the new_followers.

I used the Split & Fold option.

 

That was put into the Prepare Recipe this way.

split 2.jpg

I then ran the prepare script.

Here is what grouping looked like.

Grouping.jpg

And here is an example of the results.

Restults.jpg

--Tom
taraku
Dataiker

Your "split and fold" is the solution! Great job!

tgb417

@taraku 

So, here is a possible step to get to what is needed for the "number of weeks since".  

In this case, I defined the week as 7 days and simply rounded.  Maybe there is a bit better way to the date calculations.

Week Calculation.jpg

--Tom
tgb417

@anita-clmnt ,

Again I like your answer to the date calculation.  Thanks.  Here is a brief graphic of the result.

Natural Followers.jpg

Attached is my flow.  See how this opens for you?

P.S. Will you be able to join us for the discussion on Wednesday?

--Tom
anita-clmnt
Level 3

@tgb417 ,

I was just thinking, for the number of weeks since new follower, there is a 'Compute difference between dates' processor which is more straightforward in my opinion since you can specify that you want the difference in weeks and 'now' is one of the options for the end time. But your solution works too!

anita-clmnt_0-1588602396376.png

-Anita