While I appreciate DSS is written in Java and transmits data from db / filesystem to recipes via https, from our tests, this layer of indirection is
- twice as slow compared to a direct connection to database with a package like pyscopg2
- almost 5 times as slow compared to reading from filesystem directly
Are there plans to expose python apis that connect to filesystem and DB directly should we require the performance improvement?
I fail to see what's missing for you to implement this yourself where needed. If you can use Python to interact with another data storage technology in a faster way then why not do so already? eg what's stopping you from using pyscopg2 directly in a Python recipe and loading the data into a Pandas data frame as the output?
I worked in a project where I needed to search millions of XML files for certain strings to then load those files. In the beggining I wrote a Python recipe but it was way too slow, Python is just never going to be quick for intensive OS read/write operations. So I moved part of my code to a Shell script recipe and using grep and performance improved 100x.
I've run into the same kinds of file system access performance issues we are discussing here. And as you have described I moved to Shell Scripts and significantly improved system performance. Here is some more information about my experience.
However, this is the product idea section of the website. I agree with @somepunter that it would be very nice if the file system performance of Dataiku DSS were a bit better. The things we are doing with Shell Scripts feel like workarounds. I wish the built-in options were more performant.
Given that Dataiku already supports interacting directly with a file system (via shell scripts) or other data technologies using native drivers or Python I am not sure what else could Dataiku do. Trying to do the same file system work in Java would probably find the same performance issues we see in Dataiku. In other words I don't think it's Dataiku the issue, it's Java.I think this is the beauty of Dataiku, you can use the technology or backend that better suits your needs. It’s also worth noting that in a lot of cases Dataiku already uses the faster data load APIs where applicable. For instance if you move millions of rows from a GCP bucket into GCP BigQuery table it's going to be blazing fast as it uses the BigQuery's fast Storage Write API. So I think you will most likely find that where Dataiku can interact directly in a high performance way with data technologies using Java they already do it. How would interact with a Linux file system using Java to get “direct performance” levels? Linux file systems don’t really have an API. Yes you can write C++ and use OS level APIs but then you will run into lots of issues due to the different Linux OS distributions Dataiku supports.
as a cheap and cheerful halfway house perhaps offer the same Dataset('mydata').get_dataframe() API but with an optional argument which would effectively conmect directly to the filesystem / DB via python underneath the hood instead of JEK.
thereby at least obscuring the connection or filepath details from the end user. also allowing existing code to benefit from this performance boost with minimal code change.
Well the issue for workloads with lots of file system operations is that neither Java nor Python are you going to be fast enough so not sure what "connect directly" will mean in this case. The solution is to use low level file tools written in C (like grep) which use low level file system APIs. Most likely something like https://cython.org/ will be needed to get that performance level in Python. But what will be the point of doing that if you can already use any of the existing file system tools via the shell script recipe? In fact you could even write your own super fast C file system tool and use it in Dataiku.
What I am trying to say here is that I understand the issue this idea raises, I even suffered the performance issues myself. But I think the current work around of moving your work load to shell scripts or another technology that supports the performance required is a reasonable approach.
Thank you for your kind input @Turribeach
firstly, I believe the bottleneck here is not CPU but IO but a faster language will have little impact on. I don't believe there's a need to write any bespoke functions when pandas and polars do a perfectly good job.
The point I'm trying to make is dataiku's dataset api is 100x slower when reading and writing the same data to and from the same disk. I believe this is because of the streaming architecture that sends it to the Java server first.
I could easily bypass that and use pd.read_parquet instead but that would negate the "standard pattern" consistency and some of the ui benefits the UI provides.
I can't see a strong downside to providing the dataset api an optional param means of directly interacting with the IO in question rather than streaming it via the java server. that feels like low hanging fruit for 100x speedup while retaining all the benefits of a consistent api to me.