In-memory dataset


Hello community,

I was thinking that it would be nice for Dataiku to provide an in-memory dataset type, i.e. a dataframe that, once built, can be accessed and updated directly by multiple code recipes and shared across projects. This would be useful in high-performance computation scenarios, where loading large datasets can be a significant overhead.

Are there any plans to implement such a feature, or is there a recommended solution to approach such a scenario?

Best Regards,

Alessandro

4 Comments

I don't represent Dataiku, but I doubt this is the sort of feature they would want to add to the product. The beauty of Dataiku is that it supports most data compute technologies and can easily integrate with them to offload workloads to other products that can process the data far faster than Dataiku could.
One of the most popular in-memory databases is Redis. I think you will find it will not fit your needs, as it's aimed at key/value use cases. In-memory databases tend to be used for sub-second transactions rather than for managing large datasets. I would encourage you to try the latest data warehouse solutions like Snowflake and Google's BigQuery. I think you will be surprised at how much data they can handle and how fast they can do it. Also keep in mind that in-memory databases are not a silver bullet; they introduce their own problems, such as how to persist the data to disk in a consistent way.
However, if you do want to see what performance you can get with in-memory data, you could try SQLite or MySQL, both of which are supported by Dataiku and have options to store data in memory (see the list of in-memory databases on Wikipedia).
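As a quick sketch, this is roughly what an in-memory SQLite table looks like from Python (the table, column names and data below are made up for illustration, and the database only lives for the lifetime of the connection/process):

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database; it exists only for the lifetime of this connection.
conn = sqlite3.connect(":memory:")

# Illustrative stand-in data for a large dataset loaded once at the start.
df = pd.DataFrame({"ticker": ["AAA", "AAA", "BBB"], "price": [101.5, 102.1, 98.2]})
df.to_sql("market_data", conn, index=False)

# Subsequent queries in the same process hit memory rather than disk.
result = pd.read_sql_query(
    "SELECT ticker, AVG(price) AS avg_price FROM market_data GROUP BY ticker", conn
)
print(result)

conn.close()
```

Note that a plain :memory: database is private to the process that created it, so sharing it across multiple recipes would still need something on top of this.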


alextag94
Level 2

Thanks for your response!

I do see your point.

Just to provide more context for my idea: in my use case, I have one common large dataset of financial market data that needs to be accessed by multiple Python recipes running complex analytic functions.

At present, my data resides in a PostgreSQL database; the recipes read the data from Postgres, process it, then write it back. The multitude of read/write operations is quite expensive and harms performance, which is why I thought it would be nice to have the option to load the data into memory once and let it reside there for consumption.
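For reference, each recipe follows roughly this pattern today (the dataset names and the pct_change step below are just placeholders for the real analytics):

```python
import dataiku

# Each recipe currently pays the full read and write cost against PostgreSQL.
# "market_data" and "market_data_processed" are placeholder dataset names.
input_ds = dataiku.Dataset("market_data")
df = input_ds.get_dataframe()              # full read from the PostgreSQL-backed dataset

df["signal"] = df["price"].pct_change()    # stand-in for the heavy analytics step

output_ds = dataiku.Dataset("market_data_processed")
output_ds.write_with_schema(df)            # full write back to PostgreSQL
```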

Unfortunately, SQL is not an option, as it is not expressive enough, which eliminates most data warehousing solutions from the equation.

I looked at kdb+, an in-memory database that is widely used in the financial industry, but unfortunately its price tag is a clear constraint.

Maybe a possible way to go is to use a database that supports custom procedures written in a programming language such as Python or C (I have seen that PostgreSQL offers something similar, but I need to do more research in this area)?
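From what I have seen so far, PostgreSQL's PL/Python extension would let some of the Python logic run inside the database itself. A rough sketch of what that might look like (connection details, table and function names are made up, and this assumes plpython3u is installed on the server):

```python
import psycopg2

# Placeholder connection details, for illustration only.
conn = psycopg2.connect(host="localhost", dbname="marketdata",
                        user="analyst", password="secret")
cur = conn.cursor()

# Define a server-side function whose body is plain Python (PL/Python).
cur.execute("""
CREATE OR REPLACE FUNCTION moving_avg(prices double precision[], window_size int)
RETURNS double precision[] AS $$
    result = []
    for i in range(len(prices)):
        lo = max(0, i - window_size + 1)
        window = prices[lo:i + 1]
        result.append(sum(window) / len(window))
    return result
$$ LANGUAGE plpython3u;
""")
conn.commit()

# The function can then be called from SQL, e.g.:
# SELECT moving_avg(array_agg(price ORDER BY ts), 20) FROM market_data;
```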



Out of interest, what sort of transformations are you doing that can't be done in SQL? Have a read of this article, and take a look at Dask and Vaex (a rough sketch of the Dask route follows after the links):

https://towardsdatascience.com/how-to-work-with-million-row-datasets-like-a-pro-76fb5c381cdd

https://www.dask.org/

https://vaex.io/docs/index.html
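As a rough sketch of what the Dask route looks like (synthetic data below; in your case the frame would come from the PostgreSQL dataset):

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Synthetic stand-in data; in practice this would be loaded from the dataset.
pdf = pd.DataFrame({
    "ticker": np.repeat(["AAA", "BBB"], 500_000),
    "price": np.random.randn(1_000_000).cumsum(),
})

# Partition the frame so computations can run in parallel across cores.
ddf = dd.from_pandas(pdf, npartitions=8)

# Operations are lazy; .compute() triggers the parallel execution once.
mean_price = ddf.groupby("ticker")["price"].mean().compute()
print(mean_price)
```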

 


alextag94
Level 2

It is mostly:

1) Time series operations such as trend filters

2) Numerical simulations

3) Event-driven backtesting of trading signals

Some of it could be translated into SQL, but it would be quite a task and would require recursion and loops, which I have found to be slower than the Python/C implementation.
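To give an idea of point 1, a simple rolling trend filter in pandas looks like this (synthetic prices, purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic daily price series standing in for the market data.
prices = pd.Series(
    np.random.randn(1_000).cumsum() + 100,
    index=pd.date_range("2022-01-01", periods=1_000, freq="D"),
)

# Smooth the series and keep the direction of the smoothed slope as a trend signal.
smoothed = prices.rolling(window=20, min_periods=1).mean()
trend = np.sign(smoothed.diff())   # +1 uptrend, -1 downtrend, NaN on the first day

signal = pd.DataFrame({"price": prices, "smoothed": smoothed, "trend": trend})
print(signal.tail())
```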
