Build several partitions in one go
Hi,
I want to synchronize an Oracle table of 1 billion rows to another Oracle table. The query runs for a very long time and I end up with the following Oracle error:
[11:06:27] [INFO] [dku.output.sql] - appended 178620000 rows, errors=0
[11:06:27] [INFO] [dku.utils] - Closing oracle.jdbc.driver.T4CConnection@7fc1cb4f
[11:06:27] [INFO] [dku.utils] - Conn oracle.jdbc.driver.T4CConnection@7fc1cb4f is now true
[11:06:27] [INFO] [dku.flow.activity] - Run thread failed for activity extract_items_NP
java.sql.SQLException: ORA-01555: snapshot too old: rollback segment number 5 with name "_SYSSMU5_3914542555$" too small
ORA-02063: preceding line from ORCBI01_SYSTEM
Therefore, I'd like to use partitioning to split the query into multiple smaller queries. I use a discrete partition key with a cardinality of about 200. When I run the job for just one key, everything is OK.
Three questions:
- Is partitioning the right answer to my issue?
- How can I build several partitions in one go?
- How can I run ALL partitions in one go without listing the 200 possible values?
Best Answer
-
Partitioning might indeed be an answer to your case. Beware that it requires learning a bit about it first, and it needs some practice (don't be discouraged if things don't work on the first attempt!).
> synchronize an Oracle table of 1 billion rows to another Oracle table
If both datasets are partitioned, and you set the partition dependency to be "Equal" (which is the default), then DSS will indeed run the recipe partition by partition, as “multiple little queries”.
> How can I build several partitions in one go
Specifying which partitions you want DSS to sync is done on the recipe page, just above the "run" button.
See http://doc.dataiku.com/dss/latest/partitions/identifiers.html for how to specify a (list of) partitions, and more generally http://doc.dataiku.com/dss/latest/partitions/index.html to start learning about partitioning.
> How can I run all partitions in one go without listing the 200 possible values
This isn't directly supported, but there is a workaround for now: add a recipe going from the dataset for which you want to build all partitions to a dummy unpartitioned dataset, and define its partition dependencies as "all available".
Answers
-
OK, I have a clearer understanding now.
However, something is still not totally clear to me.
On the recipe page I choose "Explicit values" as the partition dependency to be able to run the recipe for a specific list of keys, for instance 301420,300600.
But I also have to fill in a value in the Build section of the recipe page. For example, if I put 000000 there, when I run the recipe DSS first deletes all the rows for that partition by executing the following query:
DELETE FROM "EXT_STN_PIE_PIECES" WHERE "LOGIN" = '000000'
This is obviously not what I need. It then loads the data for partitions 301420 and 300600, which is fine.
Can you enlighten me a little more, please?
-
Oops! I finally found the solution (and shame on me, it was explicitly written...)
I need to fill in the Build text box with the syntax:
Any/Any/Any/... # Several values (ExactValue)
Now my third question is still open: how can I run all partitions in one go?
-
I use another solution involving Scenarios:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-scenario.html
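For completeness, a "Custom Python" scenario step can do the whole thing in code: list the partitions of the input dataset, then trigger a build of the output dataset for all of them at once. This is only a minimal sketch, assuming the datasets are called items_source and items_synced (placeholder names) and that Scenario.build_dataset accepts a comma-separated list of partition identifiers:

from dataiku import Dataset
from dataiku.scenario import Scenario

# List every partition of the (partitioned) input dataset
partitions = Dataset("items_source").list_partitions()

# Build the output dataset for all of those partitions in a single job
Scenario().build_dataset("items_synced", partitions=",".join(partitions))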
-
Hello,
Here are my two cents for this request, which we solved as follows: we used a project variable that is updated automatically with the list of all partitions.
1. Create a new project variable, for example: "tech_all_existing_partitions": ""
2. Create a scenario (triggered every hour, for example) with a first "Custom Python" step. The code below lists all the partitions of a "main" dataset (here Clusters_summary), concatenates them into one comma-separated string, and assigns the result to the variable:
import dataiku

# List all partitions of the reference dataset and join them into one comma-separated string
partitions = dataiku.Dataset("Clusters_summary").list_partitions()
partitionStr = ','.join(partitions)

# Store that string in the project variable so it can be referenced as ${tech_all_existing_partitions}
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
variables = project.get_variables()
variables["standard"]["tech_all_existing_partitions"] = partitionStr
project.set_variables(variables)
3. run the scenario manuall to initialize the variable (then it will be uptaded every hour as your schedule)
4. When you want to rebuild a dataset for all partitions, just go to the recipe and, in the run options, use the variable ${tech_all_existing_partitions}, then run it.
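If you prefer to trigger that rebuild from a scenario step instead of the recipe's run options, the same variable can be read back in code. A rough sketch, where items_synced is just a placeholder dataset name and assuming build_dataset accepts the comma-separated partition list:

import dataiku
from dataiku.scenario import Scenario

# Read the project variable maintained by the hourly scenario above
all_partitions = dataiku.get_custom_variables()["tech_all_existing_partitions"]

# Build the output dataset for every listed partition in one go
Scenario().build_dataset("items_synced", partitions=all_partitions)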
That's all!
Hope this helps
Stephane C.