
Creating compiled Excel files from input Excel files

Solved!
UN
Level 2

Hello, 

I have VBA code that does the following:

- It opens 41 Excel files. Each of these files has 4 to 5 tabs with information such as incurred claims, paid claims, and enrollment data in different rows.

- It then pulls all of this information into one single Excel sheet and categorizes the information based on the name of the file.

How could I do something like this in Dataiku? Any help would be appreciated.


Operating system used: Version: 12.2.1

4 Replies
Turribeach

For a start, I would move away from Excel as your data source. The data in those Excel files must surely be coming from a proper database, so you should look to source the data from there and load it into a Dataiku dataset. If you still need to use the Excel files, have a look at the Files in Folder dataset.

AlexandreL
Dataiker

Hi,

First of all, you won't be able to execute VBA in DSS, so you'll have to re-implement your code's logic in a DSS flow, starting by importing all the Excel files as datasets.

To help you, we need a bit more information:

  • Do all 41 Excel files have the same structure? (same column names and same tabs)
    • If so, you can merge all the files together by placing them in a managed folder (from the flow, "+ Dataset" button > Managed folder). Then, you can open the folder and create a dataset from the files it contains (Actions menu > Create dataset). DSS will concatenate all the Excel sheets into a single table. When doing so, in the "Format / Preview" settings, you'll be able to tell DSS which sheets to use (likely all of them in your case) and to add the sheet name as an extra column
    • If not, you'll have to create one dataset per Excel file and implement the VBA code's logic using recipes in a flow:
      • The easiest way to mass-import those files is to put them in a managed folder and then use the Excel sheet importer plugin. This will create one dataset per tab per Excel file. Keep in mind that if the plugin isn't installed, you'll need to ask a platform admin to install it
      • You can also drag and drop all those files onto your flow. This will take you to an import prompt, where you'll be able to manually create one dataset per file. As this process is manual, it will take you a bit longer
  • What is your VBA code doing? Is it just concatenating files? Performing lookups? Reshaping the data? Based on this information, we can help you further by identifying which recipes you need
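The merge-all-files behaviour described in the first scenario can be simulated in plain pandas. This is only a sketch, not DSS's actual implementation: the workbook and sheet names are made up, and the nested dict stands in for what `pandas.read_excel(path, sheet_name=None)` returns for each file.

```python
import pandas as pd

# Simulated result of pd.read_excel(path, sheet_name=None) for two workbooks:
# a dict mapping sheet name -> DataFrame (hypothetical file/sheet names).
workbooks = {
    "plan_A.xlsx": {
        "Jan": pd.DataFrame({"claims": [100, 200]}),
        "Feb": pd.DataFrame({"claims": [150, 250]}),
    },
    "plan_B.xlsx": {
        "Jan": pd.DataFrame({"claims": [300, 400]}),
    },
}

# Concatenate every sheet of every file, keeping the file and sheet names
# as extra columns (what the "add sheet name in an extra column" option does).
frames = []
for file_name, sheets in workbooks.items():
    for sheet_name, df in sheets.items():
        frames.append(df.assign(file=file_name, sheet=sheet_name))

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)  # (6, 3): claims, file, sheet
```

Each row of `combined` now carries enough context to be traced back to its source workbook and tab, which is the property the later recipes rely on.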

Finally, @Turribeach's advice is valid: if your Excel files are exports from a data source, you might want to connect directly to that source, as you'll be able to pull the data directly when it changes, and automating the flow will be easier.

UN
Level 2
Author

Thanks for the quick response, Alexandre! Here is some information about the input files and what I'm trying to do:

- Directly pulling from the data source would be hard due to admin restrictions, but I don't mind uploading all 41 files into a folder in my Dataiku project.

- All the input Excel files have the same format and the same sheet names, but different file names.

- The input files may not have the same number of sheets/tabs.

Here are the steps for one input Excel file right now:

- I want to pull one particular row (for example, the 56th row) from a sheet in that input file and paste it as a column in the new file. I want to repeat this process for all the sheets in the input file (except one particular sheet named "total") and paste the new data below the previously pasted data. I want the sheet name and the file name in two separate columns to categorize the data. My output should essentially have only three columns: sheet name, file name, and claim amounts. The data pulled from all the sheets/tabs in that one input file should be stacked on top of each other, categorized by sheet name and file name. I have attached images of the (example) input and output files to help visualize my ask better.

- Now, I want to repeat this process for all 41 input files: go into each file, pull that particular row (it is the same row number in each file, thankfully), pull the sheet name and file name, and paste the data into one single file.
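The per-file logic described above (pull one fixed row from every sheet except "total", tag each value with its sheet and file name, and stack) could be sketched in a Python recipe roughly as below. The row index, sheet names, and values are placeholders, not the real ones; `sheets` stands in for the dict `pandas.read_excel(path, sheet_name=None)` returns.

```python
import pandas as pd

ROW_INDEX = 2            # placeholder for the fixed row to pull (row 56 in the real files)
SKIP_SHEETS = {"total"}  # sheet(s) excluded from the compilation

def compile_workbook(file_name, sheets, row_index=ROW_INDEX):
    """Pull one row from each sheet (except skipped ones) and stack the values.

    `sheets` is a dict of sheet name -> DataFrame, as returned by
    pd.read_excel(path, sheet_name=None).
    """
    records = []
    for sheet_name, df in sheets.items():
        if sheet_name in SKIP_SHEETS:
            continue
        row = df.iloc[row_index]  # the one row we care about
        for amount in row:        # each cell becomes one output line
            records.append(
                {"file": file_name, "sheet": sheet_name, "claim_amount": amount}
            )
    return pd.DataFrame(records)

# Tiny fabricated workbook: 3 sheets, 3 rows each.
sheets = {
    "Jan": pd.DataFrame([[1, 2], [3, 4], [5, 6]]),
    "Feb": pd.DataFrame([[7, 8], [9, 10], [11, 12]]),
    "total": pd.DataFrame([[99, 99], [99, 99], [99, 99]]),
}
out = compile_workbook("plan_A.xlsx", sheets)
```

Repeating this for all 41 files would just be a `pd.concat` over one `compile_workbook(...)` call per file.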

I don't know if there is a recipe I can write, or an existing recipe, that can help do this whole process. Your help is very much appreciated!

AlexandreL
Dataiker

Hi,

Thanks for the examples, that makes things much clearer. I managed to replicate your use case (using the Excel sheet importer; if you created a single dataset from the folder, you can skip the Stack recipe), and the final flow looks like this:

Screenshot 2024-01-12 at 11.48.24 AM.png

 

In more detail, we have:

  • A Stack recipe: it concatenates all the files into one. I kept all columns from all sheets and made sure the "Origin column" setting is ticked, so that I know where each line comes from. The output dataset looks like this:

    Screenshot 2024-01-12 at 11.53.33 AM.png
  • A Filter recipe: to keep only the rows where the column "Unnamed:0" equals "claims", as those are the only ones we're interested in
  • A Prepare recipe: which handles the data reshaping and column selection/renaming:
    • To reshape the data, I used a Fold columns step
      Screenshot 2024-01-12 at 11.57.36 AM.png
    • To extract the sheet name and file name: the logic will depend on your naming structure. In my case I had to split them on the underscore, which can be done with a Split step or with a formula using the "split" operator (the example uses a formula)
      Screenshot 2024-01-12 at 12.01.18 PM.png
    • Finally, you can rename columns and delete unnecessary ones in the same Prepare recipe. The final dataset will look like this:
      Screenshot 2024-01-12 at 12.02.48 PM.png
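For readers who prefer a code recipe, the Stack / Filter / Fold / Split chain above can be approximated in plain pandas. This is a sketch under assumptions: the column names, origin-dataset names, and values are fabricated, and `melt` plays the role of the Fold columns step.

```python
import pandas as pd

# After the Stack recipe: the per-sheet datasets concatenated, with the
# "Origin column" recording which dataset each line came from.
stacked = pd.DataFrame({
    "Unnamed:0": ["claims", "enrollment", "claims"],
    "col1": [100, 10, 300],
    "col2": [150, 12, 350],
    "original_dataset": ["planA_Jan", "planA_Jan", "planB_Feb"],
})

# Filter recipe: keep only the "claims" rows.
claims = stacked[stacked["Unnamed:0"] == "claims"]

# Prepare recipe, step 1 (fold columns): turn the value columns into rows.
folded = claims.melt(
    id_vars=["Unnamed:0", "original_dataset"],
    value_vars=["col1", "col2"],
    var_name="folded_column",
    value_name="claim_amount",
)

# Prepare recipe, step 2 (split on underscore): recover file and sheet names.
folded[["file", "sheet"]] = folded["original_dataset"].str.split("_", expand=True)

# Prepare recipe, step 3: keep and rename only the final columns.
result = folded[["file", "sheet", "claim_amount"]]
```

The same three output columns as in the screenshots (file, sheet, claim amount) come out at the end; only the splitting logic would need adapting to your real file-naming scheme.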