Data Refresh out of a Managed Folder

Solved!
tgb417

I have a setup that looks like this.

[Image: Shows a DSS flow with a Managed Folder to start]

On a design node, I will put an updated MS Excel Spreadsheet into the managed folder shown as Raw_Data, and attempt to rebuild the flow from Clean_Data, expecting that the new data will be discovered and used. 

Unfortunately, this does not seem to work for me: I have to go into the Raw_Data dataset and test it. Sometimes I have to reset the schema because I've changed the columns in the dataset; most of the time the schema is exactly the same. However, every time, I have to go into Raw_Data for the new data to be recognized. It does not seem to matter what type of rebuild I'm doing, Forced or Smart.

I feel like I'm missing something obvious.

Thanks for any insights you can share.

Operating system used: Ubuntu 18.04 LTS (running under WSL2)

--Tom
14 Replies
AlexT
Dataiker

Hi Tom,

Can you clarify whether you see this behavior only when the schema changes?

DSS will not change the schema automatically; once it is set, you have to detect it again, save it, and propagate it across the Flow.

New files with the same schema are detected successfully.

Any time I add new files to the managed folder used by the "Files from Folder" dataset, they are picked up automatically by a recursive build, provided they match the pattern.


If the schema changes and you want to detect this automatically without going into the dataset, I suggest building your Flow with a scenario instead.

1) Detect the new schema by unsetting it and auto-detecting it again using the Python API:

import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Clear the saved schema so DSS no longer considers it "set"
dataset = project.get_dataset("Directory_by_Organization_1")
settings = dataset.get_settings()
settings.get_raw()["schema"] = {"columns": []}
settings.save()

# Re-run auto-detection on the files and save the detected schema
new_settings = dataset.autodetect_settings()
new_settings.save()


2) Propagate the schema across the Flow.


3) Build the dataset recursively as the last step in your scenario. 
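For step 3, a scenario Python step could queue the recursive build through the same public API. A minimal sketch, assuming the RECURSIVE_FORCED_BUILD job type and the Clean_Data dataset name from the Flow shown above:

import dataiku

client = dataiku.api_client()
project = client.get_default_project()

# Queue a recursive forced build of the downstream dataset and wait
# for the job to finish ("Clean_Data" is assumed from the Flow above)
job = project.new_job("RECURSIVE_FORCED_BUILD")
job.with_output("Clean_Data")
job.start_and_wait()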


Let me know if that helps or if I misunderstood anything there.

tgb417
Author

@AlexT ,

Thanks for the detailed response. At the moment I'm not at a computer with access to the project, and it may be a day or so until I get back to this.

However, this lack of refresh was happening both when the schema changed and when it did not change. (However, I will go back and check.)

When you say "new" file, what do you mean? I've been updating the file, deleting the old file from the managed folder, and copying the new file of the same name into the managed folder. Is this the cause of the behavior? Is DSS not looking at the date and time stamp on the file?

I've also not enumerated the files by name as you are showing.

On the computer in question I don't have an enterprise license; this is set up for a small non-profit theater as a proof of concept, before making an ikig.ai grant request. So for now the scenario approach for dealing with this issue is off the table.

Thanks for the feedback.  

--Tom
AlexT
Dataiker

If the file name changes it won't be detected unless you use a pattern. Try using Glob/Regex inclusion rules instead when defining the "Files from Folder" dataset; if the file is deleted and re-uploaded it should then work.

When the schema does change, it needs to be detected again manually from the UI if you can't use a scenario.

Another solution here would be to use a Python recipe that reads the files directly from the managed folder and creates the output datasets with write_with_schema, which writes a new schema each time the recipe is run.

Example of reading an Excel file: https://community.dataiku.com/t5/Using-Dataiku/Read-excel-file-using-Python-Pandas/m-p/23022 (it requires a code env with the package xlrd<1.2.0 or openpyxl).
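For reference, a minimal sketch of such a recipe, assuming a managed folder named Raw_Data, a workbook named data.xlsx, and an output dataset named Clean_Data (the folder and file names are placeholders; adjust to your Flow), with openpyxl available in the code env:

import dataiku
import pandas as pd

# Read the workbook directly from the managed folder
folder = dataiku.Folder("Raw_Data")  # placeholder folder name
with folder.get_download_stream("data.xlsx") as stream:
    # sheet_name selects one worksheet; openpyxl handles .xlsx files
    df = pd.read_excel(stream, sheet_name=0, engine="openpyxl")

# write_with_schema drops and recreates the output schema on every run,
# so schema changes in the source file are picked up automatically
output = dataiku.Dataset("Clean_Data")
output.write_with_schema(df)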


tgb417
Author

@AlexT 

Just getting back to this.

I'm not quite clear what you mean when you say:

Try using Glob/Regex inclusion rules instead when defining the "Files from Folder" dataset; if the file is deleted and re-uploaded it should then work.

I'm not clear on which screen of which type of object in DSS this refers to.

--Tom
AlexT
Dataiker

Hi @tgb417 ,

Sorry for not being clear here. What I meant was the Glob/Regex inclusion rules for files, found under Settings > Files of the "Files from Folder" dataset. There I can define filename* and include anything like filename20200101.csv, etc.


For me this correctly picks up new files, or files I replace with the same schema, and a recursive build reads the new files without any manual actions.

Let me know if you are not seeing it behave this way on your end; you may want to open a support ticket with the job diagnostics.

tgb417
Author

@AlexT ,

On a first attempt this evening this was a #Fail. I changed to the configuration shown below.

Saved.

Uploaded a new file and tried to rebuild the next recipe in the flow, and I ended up getting the old data.

If I go in and Test, the data comes through.

[Screenshot: Load from a Managed Folder]

I'm using a slightly old 10.0.3; I see that there are a few slightly later versions.

Any thoughts?


--Tom
AlexT
Dataiker

Hi @tgb417 

I've tested on 10.0.3 through 10.0.5, so the version shouldn't be an issue here.

I've created a video here: https://www.loom.com/share/4c46707e37de4fe1a3e859fd591dd736?t=189

When I add a file to the managed folder and perform a recursive rebuild, the new data is picked up. As this is not what you are seeing, I would suggest you raise a support ticket with the job diagnostics so we can have a look.

Thanks!


tgb417
Author

@AlexT 

In your example you have a sync recipe where I have a prepare recipe. Is there something "magic" about the sync recipe at this point in the process?

Regarding a support ticket: at the moment I just have to get my current project done, so I may not get one out right away.

--Tom
tgb417
Author

@AlexT ,

I think that when I add a Sync step as you have shown in your video, and have the Glob/Regex inclusion rules set up as you described, things now work as expected.

[Image: Shows a flow adding a Sync step]


More testing to follow. I'm going to mark this one answered; however, understanding why is still an open question.

--Tom
tgb417
Author

@AlexT 

So far so good with the sync recipe in the flow. Still testing, but all signs point to this being the solution.

The sync recipe is still somewhat confusing to me. What is this recipe actually doing? (In the past I've thought of the sync recipe as a very limited version of every other recipe type: it takes input and moves it somewhere else, a recipe with essentially no internal actions.) However, this experience leads me to believe that assessment may not be fully correct. How is the input of a sync recipe treated differently from the input of, say, a prepare recipe, or for that matter any other recipe type?

--Tom
tgb417
Author

@AlexT 

My testing now indicates that the Sync recipe was critical.

I'd still like to understand a bit more about why, without the Sync recipe, I could create the connection but it would not refresh. With the Sync recipe, things are working as I would expect.


--Tom
Ignacio_Toledo

Hi @tgb417,

I'm not sure the sync recipe is the key to your problem. I also thought about it, since I saw you had a prepare recipe and @AlexT's example used a sync recipe.

So, I did my test on my own instance, and I was unable to replicate the behavior you saw, where the data wouldn't get updated unless you opened the first dataset created directly from the XLS file.

Maybe there is actually a bug somewhere? If it helps your case, I can also create a video with the prepare recipe and the workflow behaving as expected.

Cheers!

AlexT
Dataiker

@tgb417 ,

I can confirm what @Ignacio_Toledo said: in general, a prepare recipe with no steps behaves like a sync recipe in most cases.

Sync recipes are powerful when you want to use things like the fast-path from cloud storage to an analytical database.

I've changed my flow and I am seeing both the prepare and sync recipes update the records as soon as I add a file and do a recursive rebuild.

tgb417
Author

@Ignacio_Toledo & @AlexT 

Ignacio, thanks for the offer. Right now I'm in a good enough state, and otherwise very busy, so I don't know if a video is necessary.

So let me come clean a little bit. Another thing that might be going on is that my MS Excel spreadsheet is not that straightforward. For example:

  • The data in each of the worksheets in this workbook is not manually entered into the spreadsheet; it is dynamically pulled from another data source through an ODBC connection using "M" Power Query embedded inside MS Excel. (Long story why I'm gathering the data this way.) At some point I might create a REST API on top of this other data source, but it does not exist today.
  • Second, there are two worksheets in the one workbook file. Dataiku does a nice job reading these two separate worksheets as two separate data sources. However, using this feature might be an edge case for this refresh issue.

I can imagine that either of these complexities is a use case somewhat outside the norm and might be causing the problems I'm seeing, particularly the ODBC Power Query based data. But the Sync recipe does make my challenge go away.

Thank you, my friend, for your time checking this out.

--Tom