Hi,
I want to synchronize an Oracle table of 1 billion rows to another Oracle table. The query runs for a very long time and eventually fails with the following Oracle error:
[11:06:27] [INFO] [dku.output.sql] - appended 178620000 rows, errors=0
[11:06:27] [INFO] [dku.utils] - Closing oracle.jdbc.driver.T4CConnection@7fc1cb4f
[11:06:27] [INFO] [dku.utils] - Conn oracle.jdbc.driver.T4CConnection@7fc1cb4f is now true
[11:06:27] [INFO] [dku.flow.activity] - Run thread failed for activity extract_items_NP
java.sql.SQLException: ORA-01555: snapshot too old: rollback segment number 5 with name "_SYSSMU5_3914542555$" too small
ORA-02063: preceding line from ORCBI01_SYSTEM
Therefore, I'd like to use partitioning to split the query into multiple smaller queries. I use a discrete partition key with a cardinality of about 200. When I run the job for just one key, everything is fine.
Three questions:
I use another solution involving Scenarios:
https://knowledge.dataiku.com/latest/mlops-o16n/partitioning/concept-scenario.html
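
For illustration, here is a minimal sketch of what such a scenario can look like, as a "Custom Python" scenario step. The dataset names items_source and items_synced are hypothetical; the loop builds the output one partition at a time, so each build issues a small, short-lived query instead of one huge one:

import dataiku
from dataiku.scenario import Scenario

scenario = Scenario()

# "items_source" and "items_synced" are hypothetical dataset names.
# Building partition by partition keeps each Oracle query small,
# avoiding the long-running read that triggers ORA-01555.
for partition in dataiku.Dataset("items_source").list_partitions():
    scenario.build_dataset("items_synced", partitions=partition)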
Hello,
Here are my two cents on this request, which we solved as follows: we used a project variable that is automatically updated with the list of all partitions.
1. Create a new project variable, for example: "tech_all_existing_partitions": ""
2. Create a scenario (that starts every hour, for example) whose first step is a "Custom Python" step. The code below lists all the partitions of a main dataset (here, Clusters_summary), concatenates them into a single comma-separated string, and assigns it to the variable:
import dataiku

# List all partitions of the reference dataset and join them
# into a single comma-separated string
partitions = dataiku.Dataset("Clusters_summary").list_partitions()
partition_str = ','.join(partitions)

# Store the string in a standard project variable
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
variables = project.get_variables()
variables["standard"]["tech_all_existing_partitions"] = partition_str
project.set_variables(variables)
3. Run the scenario manually once to initialize the variable (it will then be updated every hour, per your schedule).
4. When you want to rebuild a dataset for all partitions, go to the recipe and, in the run options, set the partitions to the variable ${tech_all_existing_partitions}, then run it. The same variable can also drive a scenario build step, as sketched below.
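
For completeness, here is a minimal sketch of that scenario variant. It assumes a hypothetical partitioned output dataset named Clusters_summary_synced, and that the partitions argument of build_dataset accepts a comma-separated partition list:

import dataiku
from dataiku.scenario import Scenario

# Resolve the project variable maintained by the hourly scenario
all_partitions = dataiku.get_custom_variables()["tech_all_existing_partitions"]

# Build the (hypothetical) output dataset for all listed partitions
Scenario().build_dataset("Clusters_summary_synced", partitions=all_partitions)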
That's all!
Hope this helps
Stephane C.