Path issues while creating a Keras ImageDataGenerator.flow_from_directory for image classification.
Hello everyone. I am trying to train a simple Cat vs Dog image classification model using Keras in Dataiku DSS. However, I am having difficulty constructing the path for flow_from_directory(). Before we get started, here is the structure of the training data I am using. The "training_dataset" folder is present on the Dataiku-managed server.
training_dataset
├── test
│   ├── Cat
│   └── Dog
├── train
│   ├── Cat
│   └── Dog
└── valid
    ├── Cat
    └── Dog
Below is the code snippet with necessary comments.
def get_dataset_objects(source_id, target_id):
    # Read recipe inputs
    training_dataset = dataiku.Folder(source_id)
    print("Training Dataset Info.")
    print(training_dataset.get_info())

    # Write recipe outputs
    models_folder = dataiku.Folder(target_id)
    print("Models Folder Info.")
    print(models_folder.get_info())

    return training_dataset, models_folder

source_id = "FvaDN1ly"  # This is the folder id corresponding to the 'training_dataset' folder
target_id = "EMHPYeer"  # This is where my models should be saved
training_dataset, models_folder = get_dataset_objects(source_id, target_id)
Output:
Training Dataset Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'training_dataset', 'id': 'FvaDN1ly', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/FvaDN1ly'}, 'type': 'S3'}
Models Folder Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'models', 'id': 'EMHPYeer', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/EMHPYeer'}, 'type': 'S3'}
Code for model creation.
def train_model(train_data_path, valid_data_path, models_path, save_labels_path):
    train_datagen = ImageDataGenerator(rescale=1./255)
    valid_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow_from_directory(
        directory='/train',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary'
    )
    valid_generator = valid_datagen.flow_from_directory(
        directory='/valid',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary'
    )

    # Save class labels
    class_labels = list(train_generator.class_indices.keys())
    np.save(save_labels_path, class_labels)

    model = create_model()

    checkpoint_callback = ModelCheckpoint(
        filepath=os.path.join(models_path, 'model_{epoch:02d}.h5'),
        save_freq='epoch',
        save_best_only=False,
        save_weights_only=False,
        verbose=1
    )

    model.fit(
        train_generator,
        epochs=5,
        validation_data=valid_generator,
        callbacks=[checkpoint_callback]
    )

    # Save the final model
    model.save(os.path.join(models_path, 'final_model.h5'))

if __name__ == "__main__":
    root_path = "/"
    train_data_path = os.path.join(root_path, "training_dataset")
    models_path = os.path.join(root_path, "models")
    save_labels_path = os.path.join(models_path, "class_labels.npy")
    train_model(train_data_path, train_data_path, models_path, save_labels_path)
The problem is that I cannot figure out the right way to construct the path for the 'directory' argument of flow_from_directory(). How can I modify the last code block so that the path is constructed properly and I can train a model using a custom script? Also, in which scenarios can I get flow_from_directory() working even if the data resides in managed S3 storage? I have tried the get_path() function, but it does not work because the data is hosted in S3 and not on the local filesystem. So I do not understand how to set the correct directory so that flow_from_directory() works.
For reference, I have attached the PDF version of the notebook I am using in Dataiku.
Operating system used: Windows
Answers
Alexandru (Dataiker)
Hi,
get_path() is not available for an S3 managed folder because the folder is non-local.
In this case, create a temporary directory with tempfile, copy the files into it with get_download_stream(), and then pass that temporary directory to flow_from_directory().
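As a rough illustration of that pattern, here is a minimal sketch. It assumes the folder object exposes list_paths_in_partition() and get_download_stream(), as dataiku.Folder does; the helper names download_managed_folder and make_generators are hypothetical, and the tensorflow.keras import path is an assumption about the Keras installation in your code environment.

```python
import os
import shutil
import tempfile


def download_managed_folder(folder, local_root):
    """Copy every file of a managed folder into local_root, keeping sub-directories.

    `folder` is expected to behave like dataiku.Folder: it must provide
    list_paths_in_partition() and get_download_stream(path).
    """
    for remote_path in folder.list_paths_in_partition():
        # Managed-folder paths start with "/", e.g. "/train/Cat/1.jpg"
        parts = remote_path.strip("/").split("/")
        local_path = os.path.join(local_root, *parts)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with folder.get_download_stream(remote_path) as stream, \
                open(local_path, "wb") as out:
            shutil.copyfileobj(stream, out)
    return local_root


def make_generators(local_root, target_size=(64, 64), batch_size=32):
    """Build train/valid generators from the local copy (hypothetical helper)."""
    # Imported lazily so the download helper works without TensorFlow installed.
    # Assumption: Keras is available as tensorflow.keras in this environment.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1.0 / 255)
    common = dict(target_size=target_size, batch_size=batch_size,
                  class_mode="binary")
    train = datagen.flow_from_directory(os.path.join(local_root, "train"), **common)
    valid = datagen.flow_from_directory(os.path.join(local_root, "valid"), **common)
    return train, valid


# Usage inside DSS (not runnable outside it):
#   import dataiku
#   local_root = download_managed_folder(dataiku.Folder("FvaDN1ly"), tempfile.mkdtemp())
#   train_generator, valid_generator = make_generators(local_root)
```

The whole folder is copied once before training, so the temporary location needs enough free disk space for the dataset. Remove the directory (for example with shutil.rmtree) when training is finished.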
You can find a detailed example of this usage pattern here:
https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#detailed-examples
More information on managed folder usage here:
https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local
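The same local vs non-local distinction applies to the output side: ModelCheckpoint and model.save() write to a local path, so with an S3 "models" folder the files would first be written to a temporary directory and then uploaded. A minimal sketch, assuming the folder object provides upload_stream(path, file_object) as dataiku.Folder does; the helper name upload_directory is hypothetical.

```python
import os


def upload_directory(folder, local_dir):
    """Upload every file under local_dir into a managed folder, mirroring the layout.

    `folder` is expected to behave like dataiku.Folder and provide
    upload_stream(path, file_object). Returns the list of uploaded folder paths.
    """
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            # Build a "/"-separated path relative to local_dir
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            remote_path = "/" + rel
            with open(local_path, "rb") as f:
                folder.upload_stream(remote_path, f)
            uploaded.append(remote_path)
    return uploaded


# Usage inside DSS (not runnable outside it):
#   import dataiku, tempfile
#   models_dir = tempfile.mkdtemp()
#   ... train with models_path=models_dir ...
#   upload_directory(dataiku.Folder("EMHPYeer"), models_dir)
```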
Thanks