
Direct SQL & Filesystem connections for performance boost

While I appreciate that DSS is written in Java and transmits data from the database or filesystem to recipes via https, in our tests this layer of indirection is:

- twice as slow as a direct connection to the database with a package like psycopg2

- almost 5 times as slow as reading from the filesystem directly

Are there plans to expose Python APIs that connect to the filesystem and database directly, for cases where we need the performance improvement?

5 Comments
Turribeach
Level 6

I fail to see what's missing for you to implement this yourself where needed. If you can use Python to interact with another data storage technology in a faster way, then why not do so already? E.g. what's stopping you from using psycopg2 directly in a Python recipe and loading the data into a Pandas data frame as the output?
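A minimal sketch of that suggestion, assuming a DB-API 2.0 driver. It uses the standard-library sqlite3 module as a stand-in so it runs anywhere; psycopg2 exposes the same cursor interface, but uses `%s` placeholders instead of `?`. The helper name `fetch_rows`, the table `t`, and the dataset name in the closing comments are illustrative only.

```python
import sqlite3  # stand-in driver; psycopg2 follows the same DB-API 2.0 interface


def fetch_rows(conn, query, params=()):
    """Run a query on any DB-API 2.0 connection and return (column_names, rows)."""
    cur = conn.cursor()
    try:
        cur.execute(query, params)
        # cursor.description holds one sequence per column; item 0 is the name
        columns = [d[0] for d in cur.description]
        rows = cur.fetchall()
    finally:
        cur.close()
    return columns, rows


def demo():
    """Build a throwaway in-memory table and read it back through fetch_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
    result = fetch_rows(conn, "SELECT id, name FROM t ORDER BY id")
    conn.close()
    return result


# In a Python recipe the same helper could be fed a psycopg2 connection
# (connection details and dataset name below are placeholders):
#   conn = psycopg2.connect(host=..., dbname=..., user=..., password=...)
#   columns, rows = fetch_rows(conn, "SELECT ...")
#   df = pandas.DataFrame(rows, columns=columns)
#   dataiku.Dataset("my_output").write_with_schema(df)
```

The trade-off is that connection details now live in the recipe rather than in the DSS connection settings.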

I worked on a project where I needed to search millions of XML files for certain strings and then load the matching files. In the beginning I wrote a Python recipe but it was way too slow; Python is just never going to be quick for intensive OS read/write operations. So I moved part of my code to a Shell script recipe using grep, and performance improved 100x.
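For anyone who wants to keep the orchestration in Python while still letting grep do the heavy file scanning, a rough sketch could look like the following. It assumes grep is on the PATH; the function name `find_files_containing` and the sample files are made up for illustration and are not the script described above.

```python
import subprocess
import tempfile
from pathlib import Path


def find_files_containing(root, needle, suffix=".xml"):
    """Return sorted paths under root whose contents include needle, using grep."""
    # -r: recurse, -l: print only names of matching files, -F: fixed-string match
    proc = subprocess.run(
        ["grep", "-rlF", "--", needle, str(root)],
        capture_output=True,
        text=True,
    )
    # grep exits 0 when something matched, 1 when nothing matched, >1 on errors
    if proc.returncode > 1:
        raise RuntimeError(proc.stderr.strip())
    # Keep only files with the wanted extension
    return sorted(p for p in proc.stdout.splitlines() if p.endswith(suffix))


def demo():
    """Create a few sample files and return the names of the XML files that match."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.xml").write_text("<order id='42'/>")
        (root / "b.xml").write_text("<order id='7'/>")
        (root / "notes.txt").write_text("id='42'")
        hits = find_files_containing(root, "id='42'")
        return [Path(h).name for h in hits]
```

Only the matching paths come back into Python, so the slow part (reading every file) stays inside grep.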

@Turribeach ,

I've run into the same kinds of file system access performance issues we are discussing here. As you described, I moved to Shell Scripts and significantly improved system performance. Here is some more information about my experience.

https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Is-there-a-limit-to-a-directory-structure... 

However, this is the product idea section of the website. I agree with @somepunter that it would be very nice if the file system performance of Dataiku DSS were a bit better. The things we are doing with Shell Scripts feel like workarounds. I wish the built-in options were more performant.

Turribeach
Level 6

Given that Dataiku already supports interacting directly with a file system (via shell scripts) or other data technologies using native drivers or Python, I am not sure what else Dataiku could do. Doing the same file system work in Java would probably run into the same performance issues we see in Dataiku. In other words, I don't think Dataiku is the issue, it's Java.

I think this is the beauty of Dataiku: you can use the technology or backend that best suits your needs. It’s also worth noting that in a lot of cases Dataiku already uses the faster data load APIs where applicable. For instance, if you move millions of rows from a GCP bucket into a GCP BigQuery table it's going to be blazing fast, as it uses BigQuery's fast Storage Write API. So I think you will most likely find that where Dataiku can interact directly in a high performance way with data technologies using Java, they already do it. How would you interact with a Linux file system using Java to get “direct performance” levels? Linux file systems don’t really have an API. Yes, you can write C++ and use OS level APIs, but then you will run into lots of issues due to the different Linux OS distributions Dataiku supports.

somepunter
Level 3

As a cheap and cheerful halfway house, perhaps offer the same Dataset('mydata').get_dataframe() API but with an optional argument which would connect directly to the filesystem / DB via Python under the hood instead of going through JEK,

thereby at least hiding the connection or filepath details from the end user, and also allowing existing code to benefit from this performance boost with minimal code change.
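To make the shape of that proposal concrete, here is a toy sketch. Everything in it is hypothetical: `DatasetHandle`, `get_rows`, and the `direct` flag are invented for illustration and are not part of the Dataiku API, and a plain CSV file stands in for the underlying storage.

```python
import csv
import tempfile
from pathlib import Path


class DatasetHandle:
    """Hypothetical dataset handle; illustrates the proposal, not the Dataiku API."""

    def __init__(self, name, path, managed_reader):
        self.name = name
        self.path = Path(path)  # storage location, kept out of the caller's code
        self._managed_reader = managed_reader  # stand-in for the existing read path

    def get_rows(self, direct=False):
        """Read through the managed path by default, or straight from the file."""
        if direct:
            with self.path.open(newline="") as f:
                return list(csv.DictReader(f))
        return self._managed_reader(self.name)


def demo():
    """Read the same data through both paths and record the managed-path calls."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "mydata.csv"
        path.write_text("id,name\n1,a\n2,b\n")
        calls = []

        def managed(name):
            # Pretend this is the slower, indirect read
            calls.append(name)
            return [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}]

        ds = DatasetHandle("mydata", path, managed)
        return ds.get_rows(), ds.get_rows(direct=True), calls
```

Existing callers keep calling `get_rows()` unchanged; only code that opts in with `direct=True` bypasses the managed path.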

Turribeach
Level 6

Well, the issue for workloads with lots of file system operations is that neither Java nor Python is going to be fast enough, so I'm not sure what "connect directly" would mean in this case. The solution is to use low level file tools written in C (like grep) which use low level file system APIs. Most likely something like https://cython.org/ will be needed to get that performance level in Python. But what would be the point of doing that if you can already use any of the existing file system tools via the shell script recipe? In fact you could even write your own super fast C file system tool and use it in Dataiku.

What I am trying to say here is that I understand the issue this idea raises; I have even suffered the performance issues myself. But I think the current workaround of moving your workload to shell scripts or another technology that supports the performance required is a reasonable approach.