Greenplum interoperability enhancements

Our current main backend database is Greenplum, an MPP database based on Postgres.

Here are a few areas where I think Dataiku could improve its support for this database:

Enhance write speed:

Currently, Dataiku wraps the COPY function of what is presumably a modified psycopg2 driver. This works perfectly fine with Postgres, but becomes a bottleneck in Greenplum, because the data is first copied to the master segment and then has to be pushed by the master to all participating worker segments. This is the second-slowest way to load data, the slowest being individual "INSERT INTO" statements.
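For context, a plain psycopg2 COPY write looks roughly like the sketch below. This is only an illustration of the pattern, not Dataiku's actual internal wrapper; the connection string and table are placeholders:

    import io
    import psycopg2

    # Hypothetical connection and table, shown only to illustrate the COPY-through-master pattern.
    conn = psycopg2.connect("dbname=analytics host=gp-master user=etl")
    buf = io.StringIO("1,alice\n2,bob\n")  # in-memory CSV, similar to what Dataiku generates

    with conn, conn.cursor() as cur:
        # Every row funnels through the master segment before being redistributed to the segments.
        cur.copy_expert("COPY target_table (id, name) FROM STDIN WITH (FORMAT csv)", buf)

All rows pass through the master before redistribution, which is exactly where the bottleneck sits.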

Greenplum itself supports three separate methods for bulk data loading: gpload, gpfdist and PXF. All of these work by creating an external table and using it to load data from files directly into the segments, following the selected distribution key (column). Since Dataiku already uses something like StringIO to generate in-memory CSVs for loading, it could in theory feed the same objects to any of these technologies.

Similar to how some of the supported connections offer "automatic fast-write", this could be made an option for Greenplum once the prerequisites are in place.

Some care is needed when creating and maintaining these external tables so as not to bloat the Greenplum catalog; they should probably be dropped and recreated as necessary.
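As a rough sketch of what a gpfdist-based fast write could look like (assuming a gpfdist daemon is already serving a staging directory where the CSV files have been written; host, port, table and column names are all placeholders):

    import psycopg2

    # Assumption: gpfdist serves the staging directory on etl-host:8081
    # and the CSV files have already been written there.
    EXTERNAL_DDL = """
    DROP EXTERNAL TABLE IF EXISTS ext_target_stage;
    CREATE READABLE EXTERNAL TABLE ext_target_stage (id int, name text)
    LOCATION ('gpfdist://etl-host:8081/target_*.csv')
    FORMAT 'CSV' (HEADER);
    """

    with psycopg2.connect("dbname=analytics host=gp-master user=etl") as conn:
        with conn.cursor() as cur:
            cur.execute(EXTERNAL_DDL)
            # The segments pull the files in parallel; the data path bypasses the master.
            cur.execute("INSERT INTO target_table SELECT * FROM ext_target_stage")
            # Drop the external table again so the catalog does not accumulate stale entries.
            cur.execute("DROP EXTERNAL TABLE ext_target_stage")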

Enhance read speeds in Python:

The approach above can also be used to generate a DataFrame from a dataset. In most recipes, reading data through the master node is completely normal and desirable behavior, especially when the SQL engine is used. However, when the data first needs to be read fully into pandas, the master node becomes the bottleneck and negates all the benefits of an MPP columnar database.

Using any of the technologies mentioned above could significantly speed up DataFrame generation; the larger the dataset, the greater the speed-up.
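A minimal sketch of the reverse direction, again assuming gpfdist is available: a writable external table lets the segments unload in parallel to files that pandas can then read directly. All names and paths are placeholders:

    import pandas as pd
    import psycopg2

    UNLOAD_DDL = """
    DROP EXTERNAL TABLE IF EXISTS ext_target_unload;
    CREATE WRITABLE EXTERNAL TABLE ext_target_unload (id int, name text)
    LOCATION ('gpfdist://etl-host:8081/unload/target.csv')
    FORMAT 'CSV';
    """

    with psycopg2.connect("dbname=analytics host=gp-master user=etl") as conn:
        with conn.cursor() as cur:
            cur.execute(UNLOAD_DDL)
            # Each segment streams its own slice of the data to gpfdist in parallel.
            cur.execute("INSERT INTO ext_target_unload SELECT * FROM target_table")

    # Read the unloaded file straight into pandas instead of pulling rows through the master.
    df = pd.read_csv("/data/staging/unload/target.csv", names=["id", "name"])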

Enhance read and write speeds in Spark-based recipes:

The reasons are the same, but even more pronounced in the case of Spark, since it has to read all the data from the database and then somehow partition it among its executors. Greenplum has its own Spark connector that was made specifically to overcome this limitation. Essentially, it lets Spark workers fetch only the columns they need, already distributed by the database itself, thus taking full advantage of lazy evaluation.
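For illustration, a read through the connector from PySpark would look roughly like this; the format and option names follow the connector's documentation and should be checked against the installed version, and the URL, credentials and table names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gp-read").getOrCreate()

    df = (
        spark.read.format("greenplum")
        .option("url", "jdbc:postgresql://gp-master:5432/analytics")
        .option("user", "etl")
        .option("password", "***")
        .option("dbschema", "public")
        .option("dbtable", "big_fact_table")
        .option("partitionColumn", "id")  # each Spark task reads its own slice from the segments
        .load()
    )

    # Column pruning and filter pushdown mean only the needed data leaves Greenplum.
    df.select("id", "amount").filter("amount > 0").count()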

As it stands right now (also confirmed by the Dataiku architecture team), Spark brings exactly zero benefit to us, running on-prem with an SQL database as a backend. But with this connector, we could finally let data folks work with big datasets efficiently. The same goes for Dataiku's AutoML capabilities: if it could utilize this connector, the benefits for training would be considerable.

Partitioning with Greenplum:

Partitioning in Dataiku is something I still don't quite understand. It seems to be a wrapper around creating multiple database objects, rather than using or maintaining the database's own partitioning capabilities. Am I correct in that? It would be great to see Dataiku manage Greenplum's native partitions directly, so that GPORCA (the query optimizer) can be used efficiently.
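To make the ask concrete, native Greenplum range partitioning looks like the sketch below (the table and columns are made up); ideally a partitioned Dataiku dataset would map onto this kind of DDL rather than onto a set of separate tables:

    import psycopg2

    PARTITIONED_DDL = """
    CREATE TABLE sales (id int, sale_date date, amount numeric)
    DISTRIBUTED BY (id)
    PARTITION BY RANGE (sale_date)
    (START (date '2023-01-01') INCLUSIVE
     END (date '2024-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month'));
    """

    with psycopg2.connect("dbname=analytics host=gp-master user=etl") as conn:
        with conn.cursor() as cur:
            cur.execute(PARTITIONED_DDL)
            # GPORCA can then prune partitions when a recipe filters on sale_date.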

Overall, I consider Greenplum a fitting choice for many companies running on-prem: it has the familiarity of Postgres, but much more power for proper OLAP use cases. I would love to see Dataiku show it more love and support.

 

P.S. Sorry for the long post and for the "tech-speak", but at its core this is a tech problem. The benefits, however, would be huge: faster speeds across pretty much every Dataiku operation we run!