Stacking Excel Datasets
Hi all,
I had the great luck of being assigned the objective of creating a customs import volume forecast for my company in Dataiku as part of my annual target agreement. So now I've started my new Dataiku project. To be honest, I am a complete newbie to this platform.
I work daily with Excel and a little bit with Power BI, but I have never worked with Dataiku. That's also the reason why I guess many stupid questions will follow this post.
Currently my datasets (one per month, 2021 through the running year 2025) are stored on our team SharePoint site. There I can create folders, files, etc., and these datasets are already connected to my Dataiku project. As my next step, I wanted to stack these roughly 50 datasets (each one an .xlsx file with about 350k rows and 12 fixed columns) into one big dataset. The structure of the data is the same in each file.
Unfortunately, Dataiku is not able to stack these datasets into one big dataset. Is it because the datasets are too big? Is there a problem with my permissions in SharePoint or Dataiku? Is my storage in Dataiku too small for this amount of data?
Can somebody help me with this problem? It would be great to hear from someone.
Best Regards
Lucas
Operating system used: Dataiku
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,611 Neuron
Hi, please post a screenshot of your flow design. Also, when you say Dataiku "is not able to stack", what exactly do you mean? Do you get an error? Why can't you do it?
While Dataiku can load data from pretty much any source, you really should have a database storage layer for your intermediate and output results. SharePoint is not a good database solution.
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,362 Dataiker
Hi @Lucas1006,
So, stacking "50 datasets (each one has about 350k rows and 12 fixed columns & xlsx. files) should be no problem for DSS.
Note, you don't need to use the stack recipe if your files have the same schema /columns.
If your data is stored in SharePoint, you can create a folder that points to the SharePoint location where all 50 files are stored.
In DSS, create a "Files in folder" dataset on that folder and read all the files as a single dataset.
https://doc.dataiku.com/dss/latest/connecting/files-in-folder.html
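If you ever prefer a code-based route instead, a Python recipe can do the same stacking with pandas. Below is only a minimal sketch, assuming the managed folder is named "Cleansingvolume" and the output dataset "Cleansingvolumen_stacked" (adjust to your real names), and that openpyxl is available in the recipe's code environment:

# Read every .xlsx file from the managed folder and stack them into one output dataset.
import io
import dataiku
import pandas as pd

folder = dataiku.Folder("Cleansingvolume")        # assumed folder name
frames = []
for path in folder.list_paths_in_partition():     # all files in the folder
    if not path.lower().endswith(".xlsx"):
        continue                                   # skip anything that is not Excel
    with folder.get_download_stream(path) as stream:
        frames.append(pd.read_excel(io.BytesIO(stream.read())))

stacked = pd.concat(frames, ignore_index=True)     # one big dataframe
dataiku.Dataset("Cleansingvolumen_stacked").write_with_schema(stacked)

Note that this approach holds all ~50 files in memory at once, so it only makes sense if the instance has enough RAM; the "Files in folder" dataset described above avoids that concern.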
Note: once this is set up, you will see a preview of the first 10k rows only. This is expected, but it doesn't mean DSS doesn't read all files; when you run a recipe, it will read all the files. More about sampling: https://doc.dataiku.com/dss/latest/explore/sampling.html
In the "Files in folder" dataset settings, you can define a pattern to select your current and future files.
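For example, an include rule with a glob pattern like *.xlsx (or a regex on the file names) should automatically pick up new monthly files as they land in the folder.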
If the file names contain a partition key, e.g. a date, you could also partition this dataset:
https://doc.dataiku.com/dss/latest/partitions/index.html
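As an illustration only: if the files were named something like Cleansingvolume_2021-01.xlsx (hypothetical naming), a partitioning pattern along the lines of Cleansingvolume_%Y-%M.xlsx could map each monthly file to a monthly time partition.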
You may want to take a look at these docs to get familiar with Dataiku.
Thanks -
Hi Turribeach, I've attached a screenshot of my flow. As mentioned, I'm just at the beginning of my project, so you will only see my connected database.
What I did was create a Stack recipe and run it on my SharePoint folder where I store my Cleansingvolume datasets. The stack job ran for 38 minutes, and then I received this message in my activity log:
[11:34:09] [INFO] [dku.flow.activity] - Run thread failed for activity compute_Cleansingvolumen_stacked_NP
com.dataiku.common.server.APIError$SerializedErrorException: Job 1 cancelled because SparkContext was shut down
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(AbstractSparkBasedRecipeRunner.java:327)
at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:26)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(AbstractSparkBasedRecipeRunner.java:342)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(AbstractSparkBasedRecipeRunner.java:150)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:118)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:103)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLQueryRecipeRunnerBase.executeJobDef(SparkSQLQueryRecipeRunnerBase.java:37)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLExecutor.run(SparkSQLExecutor.java:45)
at com.dataiku.dip.dataflow.exec.MultiEngineRecipeRunner.run(MultiEngineRecipeRunner.java:213)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:430)
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - activity is finished
[11:34:09] [ERROR] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - Activity failed
com.dataiku.common.server.APIError$SerializedErrorException: Job 1 cancelled because SparkContext was shut down
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner$3.throwFromErrorFileOrLogs(AbstractSparkBasedRecipeRunner.java:327)
at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:26)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runUsingSparkSubmit(AbstractSparkBasedRecipeRunner.java:342)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.doRunSpark(AbstractSparkBasedRecipeRunner.java:150)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:118)
at com.dataiku.dip.dataflow.exec.AbstractSparkBasedRecipeRunner.runSpark(AbstractSparkBasedRecipeRunner.java:103)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLQueryRecipeRunnerBase.executeJobDef(SparkSQLQueryRecipeRunnerBase.java:37)
at com.dataiku.dip.recipes.code.sparksql.SparkSQLExecutor.run(SparkSQLExecutor.java:45)
at com.dataiku.dip.dataflow.exec.MultiEngineRecipeRunner.run(MultiEngineRecipeRunner.java:213)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:430)
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - Executing default post-activity lifecycle hook
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - Done post-activity tasks
[11:34:09] [INFO] [dku.flow.activity] running compute_Cleansingvolumen_stacked_NP - done handling times
[11:34:09] [DEBUG] [dku.flow.jobrunner] running compute_Cleansingvolumen_stacked_NP - runActivity terminated, success=false
[11:34:09] [DEBUG] [dku.flow.jobrunner] running compute_Cleansingvolumen_stacked_NP - Signaling end of activity to backend
[11:34:11] [DEBUG] [dku.flow.jobrunner] running compute_Cleansingvolumen_stacked_NP - Done signaling end of activity to backend
[11:34:11] [INFO] [dku.jobs] - Connects using Shared secret
[11:34:11] [DEBUG] [dku.jobs] - Received command : /pintercom/stop_session
[11:34:12] [DEBUG] [dku.jobs] - Command /pintercom/stop_session processed in 61ms
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,362 Dataiker
Hi,
The actual reason for the failure can be found further up in the logs.
Could you please generate a job diagnostic as described here:
https://doc.dataiku.com/dss/latest/troubleshooting/problems/job-fails.html#getting-a-job-diagnosis
Please open a support ticket with the diagnostics.
Please don't share the diagnostics on the Community.
Thanks -
Hey Alexandru,
thank you very much for sharing this link with me. I just raised a support ticket with my job diagnosis.
Best Regards
Lucas
