Hello, I am brand new to Dataiku. I will be working with imaging data in the NIfTI format. I would like to start by making a very simple Dataiku workflow. So far I have my image, let's call it "a.nii.gz", uploaded to the workflow. The first thing I would like to do is read this file into Python and either print out its contents or perhaps plot it (through matplotlib.pyplot). When I link a Python script to the dataset "a.nii.gz", it automatically creates a Python template for me. When I run the template I get an error, which isn't surprising given that I'm not sure what I'm doing. Within the template I get these lines:
# Read recipe inputs
a = dataiku.Dataset("a")
a_df = a.get_dataframe()
However, the data is not a dataframe; it's a NIfTI .nii.gz file. This can be read in with nibabel, via lines like
import nibabel as nib
img = nib.load('a.nii.gz')
I tried commenting out the lines pertaining to the dataframe, but I then get errors that these lines are missing. It seems, understandably, that Dataiku is built around processing dataframe data. However, my data is not immediately in dataframe format, and I would prefer to avoid converting it if at all possible. Is there a way to load data that's essentially of an unknown format, which will later be converted via a Python script? Or am I going about this the wrong way?
Hi @kman88 ,
Thank you for your detailed steps and description of your setup! It's always useful to start with a small test setup.
For the situation you describe, where you are reading in "non-dataframe" data, I would suggest creating a managed folder instead, to house or point to your image files.
Here's an example. From the flow, you can click on + Dataset > Folder to create a new managed folder:
Just for testing, you can simply upload a file like "a.nii.gz":
Then, you can create a Python recipe from your folder. For testing, I would suggest using the "Edit in Notebook" option so that you can run your code in regular notebook format. The flow will require an output, so if you just want to test plotting the image or play around first, you can do this easily in the notebook view.
Here's the Python recipe that I created, and am testing in a notebook:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import nibabel as nib

# Read recipe inputs
folder = dataiku.Folder("nifti")
nifti_info = folder.get_info()

# Go through all the files in the folder
for nifti_file in folder.list_paths_in_partition():
    my_nifti_file = nib.load(nifti_info['path'] + nifti_file)
With the folder input, the template Python code provided for you points to the folder rather than a dataset. You can then iterate through all the files in the folder, as shown here, and read in files of any data type; you aren't expected to read in a dataframe.
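Once a file is loaded with nibabel, plotting a slice with matplotlib is straightforward: the image data is just a numpy array. Here's a minimal sketch of plotting the middle axial slice; the random array is a stand-in for what `img.get_fdata()` would return from your loaded NIfTI image, and the shape is made up for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a Dataiku notebook you can omit this
import matplotlib.pyplot as plt

# Stand-in for: data = nib.load("a.nii.gz").get_fdata()
data = np.random.rand(64, 64, 40)

# Pick and display the middle slice along the third axis
mid = data.shape[2] // 2
plt.imshow(data[:, :, mid], cmap="gray")
plt.title("Axial slice %d" % mid)
plt.savefig("slice.png")
```

In the notebook view you would typically call plt.show() instead of saving to a file.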
In addition, once you start to scale, you can easily change your folder to point to a folder on your filesystem, or to an external file store (e.g. S3), which should make this easy to manage. You can toggle these settings under the folder's "Settings" tab.
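One caveat when the folder is backed by a remote store like S3: the files are no longer on the local filesystem, so building paths won't work. In that case you would read each file through a stream (in Dataiku, `folder.get_download_stream(path)`) and copy it to a temporary file before handing it to nib.load(), since nibabel loads by path. Here's a sketch of that copy pattern, with an in-memory stream standing in for the Dataiku download stream:

```python
import io
import shutil
import tempfile

# Stand-in for: stream = folder.get_download_stream("/a.nii.gz")
payload = b"not a real .nii.gz, just stand-in bytes"
stream = io.BytesIO(payload)

# Copy the stream to a local temp file that nib.load() can open by path
with tempfile.NamedTemporaryFile(suffix=".nii.gz", delete=False) as tmp:
    shutil.copyfileobj(stream, tmp)
    local_path = tmp.name

# Then: img = nib.load(local_path)
print(local_path)
```

This way the same recipe code keeps working whether the managed folder is local or remote.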
Let me know if you have any questions about this process.