Using Dataiku

mardp Registered Posts: 1 ✭✭✭

Hi all,

I'm working on a Python recipe to automate file validation in Dataiku using managed folders. My goal is to:

  1. Scan a "validation" folder for Excel or CSV files.
  2. Check that they contain the exact column headers that I defined.
  3. Route them to either an "inprogress" or "rejected" folder based on the result.
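For reference, the header check and routing decision in steps 2 and 3 could be sketched roughly as below. This is only an illustration, not the poster's actual code: `EXPECTED_COLUMNS`, `headers_match` and `target_folder` are hypothetical names, and stripping surrounding whitespace before comparing is an assumption (drop the `.strip()` for a strictly literal match).

```python
# Hypothetical list of required headers, in the required order.
EXPECTED_COLUMNS = ["id", "name", "amount"]


def headers_match(actual, expected=EXPECTED_COLUMNS):
    """Return True when the column names equal the expected list, in order.

    Assumption: surrounding whitespace in a header is ignored, because
    Excel exports often carry stray spaces.
    """
    cleaned = [str(col).strip() for col in actual]
    return cleaned == list(expected)


def target_folder(actual):
    """Pick the destination folder name based on the header check."""
    return "inprogress" if headers_match(actual) else "rejected"
```

With a pandas DataFrame `df`, this would be called as `target_folder(df.columns)`. Keeping the check separate from the file reading makes it easier to tell whether a file was rejected for its headers or because it could not be read at all.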

I’m using dataiku.Folder(...).list_paths_in_partition() and get_download_stream() to read files, but even correctly formatted .xlsx files seem to end up in the rejected folder. My code tries to read the files with pandas.read_excel() and falls back to read_csv() if needed.

Despite this, files are consistently rejected with read errors, even though they open fine in Excel.
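One possible cause, offered as a guess since the recipe code is not shown: an .xlsx file is a zip archive, so pandas needs a seekable file object, and the stream returned by get_download_stream() may not support seeking. A minimal sketch of a workaround is to read the whole stream into an in-memory buffer first and to choose the parser by file extension rather than by fallback. `read_table` and `try_read` are hypothetical helper names, and the sketch assumes the files are small enough to fit in memory.

```python
import io

import pandas as pd


def read_table(stream, path):
    """Buffer a possibly non-seekable stream, then parse it by extension."""
    raw = stream.read()        # pull all bytes into memory once
    buffer = io.BytesIO(raw)   # seekable in-memory copy of the file
    suffix = path.lower().rsplit(".", 1)[-1]
    if suffix in ("xlsx", "xls"):
        # Needs an Excel engine (e.g. openpyxl for .xlsx) in the code env.
        return pd.read_excel(buffer)
    if suffix == "csv":
        return pd.read_csv(buffer)
    raise ValueError(f"Unsupported file type: {path}")


def try_read(stream, path):
    """Return (DataFrame, None) on success or (None, error text) on failure."""
    try:
        return read_table(stream, path), None
    except Exception as exc:
        # Keep the real error so the rejection reason can be logged.
        return None, f"{type(exc).__name__}: {exc}"
```

In a recipe this would presumably be used as `with folder.get_download_stream(path) as stream: df, err = try_read(stream, path)`. Logging `err` instead of silently falling back to read_csv() should show whether the failure is a seek error, a missing Excel engine in the code environment, or something else.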

Has anyone successfully implemented this kind of folder-based validation workflow? Are there any known issues with pandas.read_excel() in Dataiku, or is there a better pattern?

Any examples, insights, or debugging tips would be greatly appreciated!

Thanks in advance!

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,467 Neuron

    Please post your read/validation Python code and a sample .xlsx file that fails validation. You can replace the real data with dummy data.
