Hi,
I want to synchronize an Oracle table of 1 billion rows to another Oracle table. The query runs for a very long time and eventually fails with the following Oracle error:
[11:06:27] [INFO] [dku.output.sql] - appended 178620000 rows, errors=0
[11:06:27] [INFO] [dku.utils] - Closing oracle.jdbc.driver.T4CConnection@7fc1cb4f
[11:06:27] [INFO] [dku.utils] - Conn oracle.jdbc.driver.T4CConnection@7fc1cb4f is now true
[11:06:27] [INFO] [dku.flow.activity] - Run thread failed for activity extract_items_NP
java.sql.SQLException: ORA-01555: snapshot too old: rollback segment number 5 with name "_SYSSMU5_3914542555$" too small
ORA-02063: preceding line from ORCBI01_SYSTEM
Therefore, I'd like to use partitioning to split the query into multiple smaller queries: a shorter query holds its read-consistent snapshot for less time, so it is far less likely to hit ORA-01555. I use a discrete partition key with a cardinality of about 200. When I run the job for just one key, everything works fine.
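For context, here is roughly what a single-partition read looks like from a Python notebook; the dataset name and partition value below are only examples, not my actual schema:

import dataiku

# Hypothetical dataset name; with a discrete partition dimension,
# reading one partition restricts the generated query to that key value
src = dataiku.Dataset("source_table")
# Outside a recipe, the partitions to read must be set explicitly
src.read_partitions = ["KEY_VALUE_1"]
df = src.get_dataframe()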
Three questions:
I use another solution involving Scenarios:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-scenario.html
Hello,
Here are my 2 cents on how we solved this: we used a project variable that is automatically updated with the list of all partitions.
1. Create a new global variable, for example: "tech_all_existing_partitions": ""
2. Create a scenario (that runs every hour, for example) with a first "Custom Python" step. Put the code below inside it: it lists all the partitions of a "main dataset" (here Clusters_summary), concatenates them into one comma-separated string, and assigns it to the variable.
import dataiku

# List all the partitions of the "main dataset"
partitions = dataiku.Dataset("Clusters_summary").list_partitions()
# Concatenate them into one comma-separated string
partitionStr = ','.join(partitions)

# Assign the string to the project variable
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
variables = project.get_variables()
variables["standard"]["tech_all_existing_partitions"] = partitionStr
project.set_variables(variables)
3. Run the scenario manually to initialize the variable (it will then be updated every hour, as per your schedule).
4. When you want to rebuild a dataset for all partitions, just go to the recipe and, in the run options, use the variable ${tech_all_existing_partitions}, then run it.
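As a variant, if you'd rather drive the rebuild directly from the scenario instead of pasting the variable into the recipe's run options, a "Custom Python" step can build all partitions in one call. This is only a sketch using the Scenario API; the output dataset name is an example:

from dataiku.scenario import Scenario
import dataiku

# List every existing partition of the "main dataset"
partitions = dataiku.Dataset("Clusters_summary").list_partitions()

# Build the output dataset (example name) for all of them at once;
# build_dataset() accepts a comma-separated partition spec
Scenario().build_dataset("my_output_dataset", partitions=",".join(partitions))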
That's all!
Hope this helps
Stephane C.