Slow performance questions

Hello,
I am having some performance issues. I have about 5,000 rows of data in which one column holds a large amount of text. Importing this data from JSON was quick, but everything else I try to do is painfully slow. Since data-science tasks can involve millions of rows, something must be wrong with my setup if the software runs this slowly with only 5,000 rows. It is also using 22 GB of RAM for this dataset, which is a 23 MB JSON file.
For example, I applied a recipe that only renames my 8 columns, and opening the dataset afterwards takes several minutes.
Also, when I load the dataset with Python in a notebook, it takes about 3 minutes just to get the data into a pandas DataFrame. These are the two lines that import the data and take the 3 minutes:
'''
dataset = dataiku.Dataset("dataset")
df = dataset.get_dataframe()
'''
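To see how much of the time and memory goes into a call like the one above, the load can be wrapped in a small measuring helper. This is only a minimal sketch using the Python standard library: `measure` and `fake_load` are hypothetical names introduced for illustration, and `tracemalloc` only reports memory allocated by the Python process running the notebook, not memory used by the DSS backend.

```python
import time
import tracemalloc


def measure(load, *args, **kwargs):
    """Call load(*args, **kwargs); return (result, seconds, peak memory in MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = load(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak / 1e6


if __name__ == "__main__":
    # Hypothetical stand-in loader: 5000 rows with one large text column.
    def fake_load():
        return [{"id": i, "text": "x" * 5000} for i in range(5000)]

    rows, seconds, peak_mb = measure(fake_load)
    print(f"{len(rows)} rows in {seconds:.2f} s, peak {peak_mb:.1f} MB")
```

In a DSS notebook the same helper could wrap the real call, for example `df, seconds, peak_mb = measure(dataiku.Dataset("dataset").get_dataframe)`. If the Python-side peak turns out to be small, the 22 GB is probably being used somewhere other than the pandas load.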
Why does DSS need 22 GB of RAM on my server to work on a 23 MB file? Also, what can I do differently to get things running faster? Are there best practices I am missing?
Operating system used: pop-os
Answers
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,349 Dataiker
Hi @Fahraynk,
I would suggest you open a support ticket for this particular issue.
Please reproduce a slow recipe using this dataset and send us the job diagnostics in the support ticket or over our Live Chat: https://doc.dataiku.com/dss/latest/troubleshooting/obtaining-support.html#live-chat.
If you can provide a few sample lines from your dataset (any sensitive data can be obfuscated), that will also help with the investigation.
https://doc.dataiku.com/dss/latest/troubleshooting/obtaining-support.html
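For the obfuscated sample lines mentioned above, one possible approach is to mask the characters while keeping each value's length and punctuation, so the sample still resembles the original text sizes. This is a rough sketch, not an official Dataiku tool: `mask_text` and `obfuscate_rows` are hypothetical helpers, and the code assumes flat rows (dictionaries whose sensitive values are plain strings).

```python
import json


def mask_text(value):
    """Replace letters with 'x' and digits with '0'; keep length and punctuation."""
    return "".join(
        "x" if ch.isalpha() else "0" if ch.isdigit() else ch
        for ch in value
    )


def obfuscate_rows(rows, limit=3):
    """Return the first `limit` rows with every string value masked."""
    return [
        {key: mask_text(val) if isinstance(val, str) else val
         for key, val in row.items()}
        for row in rows[:limit]
    ]


if __name__ == "__main__":
    # Made-up example row, not taken from the dataset in question.
    sample = [{"id": 1, "body": "Call Ann at 555-0100."}]
    print(json.dumps(obfuscate_rows(sample)))
```

Non-string values such as numeric IDs are left untouched here, so any that are sensitive would need to be handled separately before sharing.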