Path issues while creating a Keras ImageDataGenerator.flow_from_directory for image classification.
Hello everyone. I am trying to train a simple Cat vs Dog image classification model using Keras in Dataiku DSS, but I am having difficulty constructing the path for flow_from_directory(). Before we get started, here is some context on the training data I am using; its folder structure is shown further below. The "training_dataset" folder is stored on the Dataiku-managed server.
Below is the code snippet with necessary comments.
import dataiku

def get_dataset_objects(source_id, target_id):
    # Read recipe inputs
    training_dataset = dataiku.Folder(source_id)
    print("Training Dataset Info.")
    print(training_dataset.get_info())
    # Write recipe outputs
    models_folder = dataiku.Folder(target_id)
    print("Models Folder Info.")
    print(models_folder.get_info())
    return training_dataset, models_folder

source_id = "FvaDN1ly"  # This is the folder id corresponding to the 'training_dataset' folder
target_id = "EMHPYeer"  # This is where my models should be saved
training_dataset, models_folder = get_dataset_objects(source_id, target_id)
Output:
Training Dataset Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'training_dataset', 'id': 'FvaDN1ly', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/FvaDN1ly'}, 'type': 'S3'}
Models Folder Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'models', 'id': 'EMHPYeer', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/EMHPYeer'}, 'type': 'S3'}
Folder structure of the training data:
training_dataset
├───test
│   ├───Cat
│   └───Dog
├───train
│   ├───Cat
│   └───Dog
└───valid
    ├───Cat
    └───Dog
Code for model creation.
def train_model(train_data_path, valid_data_path, models_path, save_labels_path):
    train_datagen = ImageDataGenerator(rescale=1./255)
    valid_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow_from_directory(
        directory='/train',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary'
    )
    valid_generator = valid_datagen.flow_from_directory(
        directory='/valid',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary'
    )

    # Save class labels
    class_labels = list(train_generator.class_indices.keys())
    np.save(save_labels_path, class_labels)

    model = create_model()
    checkpoint_callback = ModelCheckpoint(
        filepath=os.path.join(models_path, 'model_{epoch:02d}.h5'),
        save_freq='epoch',
        save_best_only=False,
        save_weights_only=False,
        verbose=1
    )
    model.fit(
        train_generator,
        epochs=5,
        validation_data=valid_generator,
        callbacks=[checkpoint_callback]
    )

    # Save the final model
    model.save(os.path.join(models_path, 'final_model.h5'))

if __name__ == "__main__":
    root_path = "/"
    train_data_path = os.path.join(root_path, "training_dataset")
    models_path = os.path.join(root_path, "models")
    save_labels_path = os.path.join(models_path, "class_labels.npy")
    train_model(train_data_path, train_data_path, models_path, save_labels_path)
The problem is that I cannot figure out the right way to construct the path for the 'directory' argument of flow_from_directory(). How can I modify the last code block so that the path is constructed properly and I can train a model with a custom script? Also, in which scenarios can I get flow_from_directory() working when the data resides in managed S3 storage? I have tried the get_path() function, but it does not work because the data is hosted in S3 and not locally. So I do not understand how to set the correct directory so that flow_from_directory() works.
For reference, I have attached the PDF version of the notebook I am using in Dataiku.
Operating system used: Windows
Answers
Alexandru (Dataiker)
Hi,
get_path() is not available for an S3 managed folder because the folder is non-local.
In this case, create a temporary directory with tempfile, download the files into it using get_download_stream(), and then pass that temporary directory to flow_from_directory().
You can find a detailed example of this usage pattern here:
https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#detailed-examples
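As a rough illustration of that pattern (not the code from the linked page), the sketch below copies the contents of a non-local managed folder into a local directory. The helper name download_folder_to_local and its local_root and prefix parameters are hypothetical; the only folder methods it relies on are list_paths_in_partition() and get_download_stream(), and it assumes paths in the managed folder start with "/", as in "/train/Cat/1.jpg".

```python
import os
import shutil


def download_folder_to_local(folder, local_root, prefix="/"):
    """Copy every file whose path starts with `prefix` into `local_root`.

    `folder` is expected to behave like a dataiku.Folder: it must provide
    list_paths_in_partition() and get_download_stream(path).
    """
    copied = []
    for remote_path in folder.list_paths_in_partition():
        if not remote_path.startswith(prefix):
            continue
        # "/train/Cat/1.jpg" -> "<local_root>/train/Cat/1.jpg"
        local_path = os.path.join(local_root, remote_path.lstrip("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with folder.get_download_stream(remote_path) as stream, \
                open(local_path, "wb") as out:
            shutil.copyfileobj(stream, out)
        copied.append(local_path)
    return copied


# Hypothetical usage inside a DSS recipe or notebook (not executed here):
#
#   import dataiku, tempfile
#   training_dataset = dataiku.Folder("FvaDN1ly")
#   with tempfile.TemporaryDirectory() as tmp_dir:
#       download_folder_to_local(training_dataset, tmp_dir)
#       train_dir = os.path.join(tmp_dir, "train")
#       valid_dir = os.path.join(tmp_dir, "valid")
#       # pass train_dir / valid_dir as `directory` to flow_from_directory()
#       # and run model.fit() before leaving this `with` block
```

The generators read images lazily, so training probably needs to finish before the temporary directory is cleaned up. For a large dataset, the prefix argument can limit the download to, for example, "/train".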
More information on managed folder usage here:
https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local
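The same non-local limitation presumably applies to the 'models' output folder from the question, since it is also of type S3. One possible approach, sketched here under that assumption, is to let Keras write checkpoints and the final model to a local temporary directory and then push the files to the managed folder with upload_stream(). The helper name upload_directory and its parameters are hypothetical.

```python
import os


def upload_directory(folder, local_root):
    """Upload every file under `local_root` to a managed folder.

    `folder` is expected to behave like a dataiku.Folder: it must provide
    upload_stream(path, file_object). Relative paths are preserved, so
    "<local_root>/final_model.h5" is uploaded as "/final_model.h5".
    """
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(local_path, local_root)
            remote_path = "/" + rel_path.replace(os.sep, "/")
            with open(local_path, "rb") as f:
                folder.upload_stream(remote_path, f)
            uploaded.append(remote_path)
    return sorted(uploaded)


# Hypothetical usage inside a DSS recipe or notebook (not executed here):
#
#   import dataiku, tempfile
#   models_folder = dataiku.Folder("EMHPYeer")
#   with tempfile.TemporaryDirectory() as local_models:
#       # use local_models as `models_path` for ModelCheckpoint / model.save()
#       upload_directory(models_folder, local_models)
```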
Thanks