Submit your inspiring success story or innovative use case to the 2022 Dataiku Frontrunner Awards! ENTER YOUR SUBMISSION

Big Oracle SQL table

pkansal
Level 3
Big Oracle SQL table

I am trying to run a complex query on a SQL dataset. However, it just takes around 40 min to run. What can I do to run it faster?

0 Kudos
9 Replies
Manuel
Dataiker
Dataiker

Hi,

It is hard to help you without detailed information:

  • What is the size of the dataset?
  • What type of recipe are you running?
  • Which connections are used for input and output dataset?
  • Which engine are you using to run the recipe? (check cogs near the Run button)

It would help if you screenshot that part of your flow.

Best regards

0 Kudos
pkansal
Level 3
Author

1. Around 440 million rows

2. It is a SQL recipe that first does a SUM group by, then takes a MAX, and then applies a FILTER on the value of MAX

3. For Input, it is Oracle RDBMS. I am creating a SQL dataset based on a SQL query in settings.

4. I think it is running the query in database itself, right?

0 Kudos
Manuel
Dataiker
Dataiker

Hi,

So, it seems that your input dataset is an SQL query. Did you see the attached warning? Instead, it is recommended that you setup a view that fits your needs in your database.

What is the data connection of the output? Recipes with different input and output connections will likely run on the DSS engine itself.

On the current engine, I suspect because of your setup, the recipe is not running on the database itself, but on the DSS engine. Check the options on the cog near the run button. What does it say its running? Local?

If possible,please take a screenshot of that part of your flow.

I hope this helps.

 

0 Kudos
pkansal
Level 3
Author

CDR_Filtered- SQL dataset from sql query

The python recipe is running in a containerized execution mode.

0 Kudos
Manuel
Dataiker
Dataiker

Hi, 

One suggestion is that you use first Sync your source dataset (the SQL Query) into an S3 dataset, which will then be the input to your Python recipe. This way you won't be running the Oracle query every time you test your Python recipe.

On the performance of the SQL query iteself, I am afraid there is not much you can do. It is a heavy query, so it will take time to run.

I hope this helps.

0 Kudos
pkansal
Level 3
Author

I tried to do it but the number of rows is more than 100 million and it takes a lot of time to do it. It uses DSS engine(RUN on stream).

Is there a faster way to do it?

 

The database is Oracle.

0 Kudos
Manuel
Dataiker
Dataiker

Hi,

I am afraid there are no fast paths between Oracle and S3.

How large is the whole table you define the SQL query dataset on? Instead of the SQL Query dataset, have you considered using the whole table as the input dataset, followed by a recipe that selects the relevant rows, followed by the remaining logic

This would also allow you to use partitioning, processing the data in chunks.

I don't know what your use case is, but it seems worth to engage your Customer Success Manager and other services to have a closer look and help you optimise.

Best regards

 

Best regards

0 Kudos
pkansal
Level 3
Author

Bu the partitioning column is a derived column, so I cant use it to create partitions in SQL dataset.

0 Kudos
Manuel
Dataiker
Dataiker

You should be able to use existing columns for SQL partitioning.

0 Kudos