This website uses cookies. By clicking OK, you consent to the use of cookies. Read our cookie policy.

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results forÂ

- Community
- Â»
- Discussions
- Â»
- Using Dataiku
- Â»

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Converting integer to hours

Hello,

As said in the title, I have a column of data of the type integer that I would like to convert to hours. The idea is to then be able to use the resampling function of the "Time series preparation" recipe that needs to have a time as input.

Does anyone know how to do this conversion?

Thanks in advance,

Olivier

Solutions shown first - Read whole discussion

5 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Let me take a moment to see if I can be a little bit of help.

Dataiku DateTime Stamps have both Date and Time. I'm not aware of a Time (by itself) data type. And the Time Series Cleanup recipe from the Forecast Plugin 0.4.1 by Dataiku (Alex Combessie) seem to have a Date(time) Stamp is a needed field.

There is also the Time Series Preparation Plugin v 1.1.3 by Du PHAN that seems to also do some time series cleanup work. From your description below, I'm not clear which one you are using.

Regardless, it is my guess that you are going to need to create a string that looks like a date-time and then parse the string to an actual Dataiku Date.

You could do this in a variety of ways. One way would be with a formula recipe step.

However, from your description below I don't know where you will get the following components for your data

- Date
- Minutes
- Seconds
- Time Zone

The Date Parse visual recipe step will fill in some of these missing values.

Given the above, I'd like to invite you to share some more about your dataset, like where we might get a date for each record, the timezone, and other information. It might also be helpful if you can explain or show the plugging you are using.

With that information, I or someone else here may be able to help you further.

--Tom

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi @tgb417 , thanks for taking the time to look into my question.

The time serie preparation I mention is "Time Series Preparation Plugin v 1.0.3 by Du PHAN".

I give more context below regarding what I am trying to do.

I have a variable counting the number of flights of each aircraft of a fleet over time.

This counter ranges from zero to several thousands. I am interested to resample the data of each aircraft on an identical grid of number of flights (for example 100, 200, 300...), so that I can compare the data of different aircrafts for a same number of flights.

One idea I had was to convert this counter into a date to use the resampling function of the time preparation plugin. So the first question is : do I need to pass by a date format for resampling, or are there other interpolation functions that don't need a date as input?

If the answer is no, then I tried to transform my counter into a parsed date as you suggested : the number of flights is counted as years, as shown in the screenshot below.

However, when I use the variable count_sum_parsed in the resampling function of the time serie preparation, I get the following error message : "out of bounds nanosecond timestamp". With a quick research on google, it appears to be due to the fact that the time span allowed in Python can't exceed 584 years :

"Since pandas represents timestamps in nanosecond resolution, the time span that can be represented using a 64-bit integer is limited to approximately 584 years".

The documentation provides information to deal with out-of-bound spans at the adress after (http://pandas-docs.github.io/pandas-docs-travis/user_guide/timeseries.html#timeseries-oob), but as I am using an existing plugin, I do not know how I am supposed to go about modifying the code.

Could you advise me on that?

Have a nice day.

Olivier

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I have not used the plugin you are using and I'm not a time series expert.

But as I'm looking at your visual example. I see that your number is being reflected in the __year__ column in the date. That is why you are likely to break the underlying tool.

This seems a little bit odd to me to be using a sum calculation of anything as the time variable of a time series. Using Hours of I get, years of service maybe I get. But the sum of some count does not on the first impression seem like a very helpful time interval.

What I'm hearing from you is that you want to do some sort of binning of your data. Give the value of a number of features at the end of the first 10 flights for this plane. And for a different plane give me the same set of values after its's first 10 flights. And so On. And then increment by an additional 10 flights or 20 total flights and give me the values for each of the planes at the end of those flights. And the next set of fights and give me the values. Is that the sort of idea you are thinking of? I'm not a statistician, you might want to talk to someone with that kind of background to see how badly you might be breaking the "assumptions" of a time series analysis by using this as your definition of time.

The Windows functions in DSS might be helpful for what you are trying to do. In your data do you have a way of ordering the flights so that you know which are the first 10 flights, the second 10 flights for each plain?

It also sounds like you are talking about airplane flights. As various flights are different lengths, is grouping by flight the best way to look at total plane usage. Would something base on miles be a better measure of plane use? (Although this might be a better approach, this might however be a bit harder to calculate.) Of the folks who I've know who do aviation analysis. They seem to be all about the mile traveled by the plain.

There are my $0.02. Hope I've been of a little bit of help.

--Tom

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hello Tom,

What I said in my reply is that I would gladly use another method of interpolation than the time serie plugin for my counter, which clearly is not a date or a time. However, when I look for interpolation in dataiku documentation, I do not find other options than the resampling of date in the time serie plugin. Do you know of an other interpolation method that would accept as an input a parameter that is not a time? The way the interpolation function works in Excel for example.

Also, I need to interpolate because I do not have a row of data for each increment of my counter, but a row for one month worth of data for a given aircraft.

For example, on line x, I will have data for february 2020, telling me how many flights aircraft number y did this month.

Number of flights is a relevant counter for my study as it represents the number of Take-Off done by the engines.

Have a nice day,

Olivier

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Aah, so it sounds like the natural time based, heart beat of your data is month and year. And you may actually have that data in your dataset.

With a data set like that, I think that I would start by convert the time values in the dataset to something like the first day of the month or last day of the month and get it in a format that Dataiku can use. I might even create a start date and end ate for each row. If you have to claim a higher level of granularity of the data than you already have. You are going to have to start making assumptions about the distribution of the activity over the months time. Then I see you are getting into a situation of interpolation.

if this was mine to solve, Iâ€™d probably look for a R or Python library that does the kind of calculation I would need.

good luck.

--Tom