Folder deleting uploaded files

rtaylor Registered Posts: 24 ✭✭✭✭✭
edited July 18 in Using Dataiku

I have created a series of folders to store uploaded files (.csv). Each folder feeds a Python recipe that connects to the folder, uses a read_csv loop to read each file, and appends it to a dataframe. The dataframe is then written to an output dataset. This all works fine, but the folder sporadically deletes all of the csv files, usually within 1-2 days of uploading them. Functionally, this prevents me from running the Flow, or scenarios on the Flow or parts thereof, as the first steps of importing the data fail because there is nothing in the folders after the deletion. Has anyone else experienced this file deletion behavior?

read_csv recipe, if it is relevant:


# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Read recipe inputs: the managed folder holding the uploaded csv files
raw_data_handle = dataiku.Folder("2XlQv9z4")

# List every file path in the folder
paths = raw_data_handle.list_paths_in_partition()

# Read each csv from the folder and append it to a single dataframe
raw_data = pd.DataFrame()
for path in paths:
    with raw_data_handle.get_download_stream(path) as f:
        new_data = pd.read_csv(f, header=0)  # parse this file's contents
    raw_data = pd.concat([raw_data, new_data])

# Write recipe outputs
int_20190501 = dataiku.Dataset("int_20190501")
int_20190501.write_with_schema(raw_data)
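
A defensive variant of the start of that loop, sketched here rather than taken from the actual recipe: fail fast with an explicit error when the folder is already empty, instead of failing further down the Flow.

# Sketch of a fail-fast guard: stop with a clear message when the managed
# folder is empty, rather than writing an empty dataset downstream.
import dataiku

raw_data_handle = dataiku.Folder("2XlQv9z4")
paths = raw_data_handle.list_paths_in_partition()
if not paths:
    raise RuntimeError(
        "Folder 2XlQv9z4 is empty -- the uploaded csv files "
        "may have been deleted again."
    )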

Answers

  • Alan_Fusté
    Alan_Fusté Partner, Registered Posts: 43 Partner
    Hi rtaylor,

    Is it possible that this folder is set up by the OS as a temporary directory?

    Another possibility is that you have a scenario on a two-day trigger that calls managed_folder.clear() or something similar (see the sketch below). Could that be it?

    Have you tried changing the folder?
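
    For reference, a minimal sketch of the kind of step I mean, assuming a Python scenario step and the folder id from your recipe; anything like this on a two-day trigger would empty the folder on that schedule:

    import dataiku

    # Hypothetical scenario step: clears every file out of the managed
    # folder each time the scenario's trigger fires.
    managed_folder = dataiku.Folder("2XlQv9z4")
    managed_folder.clear()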
  • rtaylor
    rtaylor Registered Posts: 24 ✭✭✭✭✭
    I do not know the exact storage details and configurations, as our instance of DSS is administered by one team, and the file storage systems by another. We have set up a few test cases to monitor and try to reproduce the behavior.

    I am certain we do not have any scenarios clearing the data (a sketch of how we checked is at the end of this post).

    I am not sure if this could be related, but we are also running into issues (in the same project) where datasets, after being built, will randomly give a root path error and need to be rebuilt. The timing is hard to nail down (I was not regularly checking folders and datasets until this week to see if they still existed after upload/build), but I do know it has occurred at intervals ranging from 30 minutes to weeks (super helpful range, I know!).

    I'm mostly trying to figure out if this is a data system issue, a DSS issue, or some combination of the two.

    A little background: I had a whole series of notebooks in which I was developing Python code. Once all the code worked (reading datasets, doing the stuff, writing output datasets), I turned the notebooks into Python recipes. I discovered the folder deletion behavior when my Flow failed. Since then, the folders have been purged multiple times, and I am also randomly running into the root path issue on other datasets.
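
    For completeness, this is roughly how we audited the scenarios from a notebook. A sketch using the public API client; the exact fields in each scenario entry vary by DSS version:

    import dataiku

    # List every scenario in the current project so each one can be
    # inspected by hand for folder-clearing steps or cleanup macros.
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    for scenario in project.list_scenarios():
        # Each entry is a dict; 'id', 'name' and 'active' are typical keys.
        print(scenario.get("id"), scenario.get("name"), scenario.get("active"))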
  • Alan_Fusté
    Alan_Fusté Partner, Registered Posts: 43 Partner
    Hi, it sounds like a DSS bug... It might be useful to talk with Dataiku's support. They usually answer fast, and maybe they can reproduce and check the error: https://support.dataiku.com/support/home.
    Another option: you might have a scenario running macros (DSS has some "cleaning macros" built in) that aren't working well.
    If they get you an answer, please let me know!
  • rtaylor
    rtaylor Registered Posts: 24 ✭✭✭✭✭
    We are going to let the flow sit until the end of the weekend, as we are currently working on the assumption that some automated process is purging the data (either one of the DSS admin jobs, or a process in our data storage). I will post an update with whatever the results are, and then go from there.
  • Alan_Fusté
    Alan_Fusté Partner, Registered Posts: 43 Partner
    Perfect rtaylor, I'm very curious about this now, so thank you for updating us next week :)
  • cperdigou
    cperdigou Alpha Tester, Dataiker Alumni Posts: 115 ✭✭✭✭✭✭✭
    FYI, you can select the folder in the Flow and use the "Create dataset" action; it will let you create a dataset based on the csv files (see the sketch below).

    Regarding the auto deletion, it clearly shouldn't happen if you did not schedule it. What is the complete path where the files are stored? Could it be a temp folder?
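
    For the downstream code, a sketch of what the read step reduces to once such a dataset exists ("raw_uploads" is a hypothetical name for the dataset created from the folder):

    import dataiku

    # "raw_uploads" stands in for the dataset created from the folder via
    # the "Create dataset" action; DSS lists and parses the csv files itself.
    raw_data = dataiku.Dataset("raw_uploads").get_dataframe()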
  • rtaylor
    rtaylor Registered Posts: 24 ✭✭✭✭✭
    The folder is not a temp folder; the path is "${projectKey}/${odbId}". We have confirmed a few times after a deletion that the folders themselves are removed: normally a created folder, if you look into the underlying file system, appears with an alias in /PROJECTKEY/, and after a deletion that aliased folder is gone from the underlying data system. To date we have not found any logs, in the data system or in DSS, indicating that a folder or the files within were deleted.
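
    In case it helps anyone reproduce this, here is the kind of check we have been running before and after the deletion window. A sketch; the get_path() call assumes the folder is hosted on the local filesystem:

    import dataiku

    folder = dataiku.Folder("2XlQv9z4")

    # What DSS currently knows about the folder (type, connection, ...)
    print(folder.get_info())

    # Files currently visible in the folder
    print(folder.list_paths_in_partition())

    # Absolute path on disk; only valid for folders on the local filesystem
    print(folder.get_path())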
  • rtaylor
    rtaylor Registered Posts: 24 ✭✭✭✭✭
    This is a very late answer, but two interesting things to report on:

    First, the initial project Flow has steadfastly refused to cooperate, with enough odd behaviors that we basically gave up on troubleshooting it.

    Second, I decided to just recreate the whole Flow in a new project, not including the .csv data, as we had decided its value was limited compared to the time and trouble it was causing. Once I created the new project, everything... just worked. No dropped data and no odd root_path errors on datasets already built. The read_csv loop has been reused in other projects as well, and seems to perform correctly.

    I am aware that this is not really a satisfying resolution, but using the strategy of "burn it with fire and rebuild" was apparently enough.