Stacking Excel Datasets
Hi all,
I had the great luck of being assigned the objective of creating a customs import volume forecast for my company in Dataiku as part of my annual target agreement. So now I've started my new Dataiku project. To be honest, I am a complete newbie to this platform.
I work daily with Excel and a little bit with Power BI, but I have never worked with Dataiku. That's also the reason why I guess many stupid questions will follow this post.
Currently my datasets (one per month, 2021 through the running year 2025) are stored on our team SharePoint site. There I can create folders, files, etc., and these datasets are already connected to my Dataiku project. As my next step, I wanted to stack these roughly 50 datasets (each one an .xlsx file with about 350k rows and 12 fixed columns) into one big dataset. The structure of the data is the same in each file.
Unfortunately, Dataiku is not able to stack these datasets into one big dataset. Is it because the datasets are too big? Is there a problem with my permissions in SharePoint or Dataiku? Is my storage in Dataiku too small for this amount of data?
Can somebody help me with this problem? It would be great to hear from someone.
Best Regards
Lucas
Operating system used: Dataiku
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,611 Neuron
Hi, please post a screenshot of your flow design. Also, when you say Dataiku "is not able to stack", what exactly do you mean? Do you get an error? Why can't you do it?
While Dataiku can load data from pretty much any source, you really should have a database storage layer for your intermediate and output results. SharePoint is not a good database solution.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,362 Dataiker
Hi @Lucas1006,
So, stacking "50 datasets (each one has about 350k rows and 12 fixed columns & xlsx. files) should be no problem for DSS.
Note, you don't need to use the stack recipe if your files have the same schema /columns.
If your data is stored in SharePoint, you can create a folder that points to the SharePoint location where all 50 files are stored.
In DSS, create a "Files in folder" dataset on that folder and read all the files as a single dataset.
https://doc.dataiku.com/dss/latest/connecting/files-in-folder.html
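If you ever prefer a code-based route instead, a Python recipe can do the same stacking with pandas. Below is only a minimal sketch, assuming the managed folder is named "Cleansingvolume" and the output dataset "Cleansingvolumen_stacked" (adjust to your real names), and that openpyxl is available in the recipe's code environment:

# Read every .xlsx file from the managed folder and stack them into one output dataset.
import io
import dataiku
import pandas as pd

folder = dataiku.Folder("Cleansingvolume")        # assumed folder name
frames = []
for path in folder.list_paths_in_partition():     # all files in the folder
    if not path.lower().endswith(".xlsx"):
        continue                                   # skip anything that is not Excel
    with folder.get_download_stream(path) as stream:
        frames.append(pd.read_excel(io.BytesIO(stream.read())))

stacked = pd.concat(frames, ignore_index=True)     # one big dataframe
dataiku.Dataset("Cleansingvolumen_stacked").write_with_schema(stacked)

Note that this approach holds all ~50 files in memory at once, so it only makes sense if the instance has enough RAM; the "Files in folder" dataset described above avoids that concern.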
Note: once this is set up, you will see a preview of the first 10k rows only. This is expected, but it doesn't mean DSS doesn't read all files; when you run a recipe, it will read all the files. More about sampling: https://doc.dataiku.com/dss/latest/explore/sampling.html
In the "Files in folder" dataset settings, you can define a pattern to select your current and future files.
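For example, an include rule with a glob pattern like *.xlsx (or a regex on the file names) should automatically pick up new monthly files as they land in the folder.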
If the file names contain a partition key, e.g. a date, you could also partition this dataset:
https://doc.dataiku.com/dss/latest/partitions/index.html
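As an illustration only: if the files were named something like Cleansingvolume_2021-01.xlsx (hypothetical naming), a partitioning pattern along the lines of Cleansingvolume_%Y-%M.xlsx could map each monthly file to a monthly time partition.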
You may want to take a look at these docs to get familiar with Dataiku.
Thanks -
Hi Turribeach, I've attached a screenshot of my flow. As mentioned, I'm just at the beginning of my project, so you will only see my connected database.
What I did was create a Stack recipe and run it on my SharePoint folder where I store my Cleansingvolume datasets. The stack job ran for 38 minutes, and then I received this message in my activity log:
[11:34:09] [INFO] [dku.flow.activity] - Run thread failed for activity compute_Cleansingvolumen_stacked_NP
com.dataiku.common.server.APIError$SerializedErrorException: Job 1 cancelled because SparkContext was shut down
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(AbstractSparkBasedRecipeRunner.java:327)
at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:26)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(AbstractSparkBasedRecipeRunner.java:342)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(AbstractSparkBasedRecipeRunner.java:150)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:118)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:103)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLQueryRecipeRunnerBase.executeJobDef(SparkSQLQueryRecipeRunnerBase.java:37)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLExecutor.run(SparkSQLExecutor.java:45)
at com.dataiku.dip.dataflow.exec.MultiEngineRecipeRunner.run(MultiEngineRecipeRunner.java:213)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:430)
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - activity is finished
[11:34:09] [ERROR] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - Activity failed
com.dataiku.common.server.APIError$SerializedErrorException: Job 1 cancelled because SparkContext was shut down
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(AbstractSparkBasedRecipeRunner.java:327)
at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:26)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(AbstractSparkBasedRecipeRunner.java:342)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(AbstractSparkBasedRecipeRunner.java:150)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:118)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:103)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLQueryRecipeRunnerBase.executeJobDef(SparkSQLQueryRecipeRunnerBase.java:37)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLExecutor.run(SparkSQLExecutor.java:45)
at com.dataiku.dip.dataflow.exec.MultiEngineRecipeRunner.run(MultiEngineRecipeRunner.java:213)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:430)
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - Executing default post-activity lifecycle hook
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - Done post-activity tasks
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - done handling times
[11:34:09] [DEBUG] [dku.flow.jobrunner] running compute_Cleansingvolumen_stacked_NP - runActivity terminated, success=false
[11:34:09] [DEBUG] [dku.flow.jobrunner] running compute_Cleansingvolumen_stacked_NP - Signaling end of activity to backend
[11:34:11] [DEBUG] [dku.flow.jobrunner] running compute_Cleansingvolumen_stacked_NP - Done signaling end of activity to backend
[11:34:11] [INFO] [dku.jobs] - Connects using Shared secret
[11:34:11] [DEBUG] [dku.jobs] - Received command : /pintercom/stop_session
[11:34:12] [DEBUG] [dku.jobs] - Command /pintercom/stop_session processed in 61ms
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,362 Dataiker
Hi,
The actual reason for the failure can be found further up in the logs.
Could you please generate a job diagnostic as described here:
https://doc.dataiku.com/dss/latest/troubleshooting/problems/job-fails.html#getting-a-job-diagnosis
Please open a support ticket with the diagnostics.
Please don't share the diagnostics on the Community.
Thanks -
Hey Alexandru,
thank you very much for sharing this link with me. I just raised a support ticket with my job diagnosis.
Best Regards
Lucas
