Path issues while creating a Keras ImageDataGenerator.flow_from_directory for image classification.

saugatapaul2020
Level 1

Hello everyone. I am trying to train a simple Cat vs Dog image classification model using Keras in Dataiku DSS. However, I am having difficulty constructing the path for flow_from_directory(). Before we get started, here is the structure of the training data I am using. "training_dataset" is stored on the Dataiku-managed server.

training_dataset
    ├───test
    │   ├───Cat
    │   └───Dog
    ├───train
    │   ├───Cat
    │   └───Dog
    └───valid
        ├───Cat
        └───Dog

Below is the code snippet with necessary comments.

def get_dataset_objects(source_id, target_id):
    # Read recipe inputs
    training_dataset = dataiku.Folder(source_id)
    print("Training Dataset Info.")
    print(training_dataset.get_info())
    
    # Write recipe outputs
    models_folder = dataiku.Folder(target_id)
    print("Models Folder Info.")
    print(models_folder.get_info())
    return training_dataset, models_folder

source_id = "FvaDN1ly" #This is the folder id corresponding to 'training_dataset' folder
target_id = "EMHPYeer" #This is where my models should be saved

training_dataset, models_folder = get_dataset_objects(source_id, target_id)

Output : 

Training Dataset Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'training_dataset', 'id': 'FvaDN1ly', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/FvaDN1ly'}, 'type': 'S3'}
Models Folder Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'models', 'id': 'EMHPYeer', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/EMHPYeer'}, 'type': 'S3'}

Code for model creation.

def train_model(train_data_path, valid_data_path, models_path, save_labels_path):
    train_datagen = ImageDataGenerator(rescale=1./255)
    valid_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow_from_directory(
        directory='/train',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary' 
    )

    valid_generator = valid_datagen.flow_from_directory(
        directory='/valid',
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary'
    )

    # Save class labels
    class_labels = list(train_generator.class_indices.keys())
    np.save(save_labels_path, class_labels)

    model = create_model()

    checkpoint_callback = ModelCheckpoint(
        filepath=os.path.join(models_path, 'model_{epoch:02d}.h5'),
        save_freq='epoch',
        save_best_only=False,
        save_weights_only=False,
        verbose=1
    )

    model.fit(
        train_generator,
        epochs=5,
        validation_data=valid_generator,
        callbacks=[checkpoint_callback]
    )

    # Save the final model
    model.save(os.path.join(models_path, 'final_model.h5'))

if __name__ == "__main__":
    root_path = "/"
    train_data_path = os.path.join(root_path, "training_dataset")
    models_path = os.path.join(root_path, "models")
    save_labels_path = os.path.join(models_path, "class_labels.npy")

    train_model(train_data_path, train_data_path, models_path, save_labels_path)

The problem is that I cannot figure out how to construct the path for the 'directory' argument of flow_from_directory(). How can I modify the last code block so that the path is built correctly and the model trains from a custom script? Also, in which scenarios can flow_from_directory() work when the data resides in managed S3 storage? I have tried get_path(), but it does not work because the data is hosted in S3 rather than on the local filesystem, so I do not understand how to set the directory so that flow_from_directory() works.

For reference, I have attached the PDF version of the notebook I am using in Dataiku.

Operating system used: Windows

1 Reply
AlexT
Dataiker

Hi,
get_path() is not available for an S3 managed folder because the folder is non-local.

In this case, use tempfile (for example a temporary directory) together with get_download_stream() to copy the files to local disk, and then pass that temporary directory to flow_from_directory().
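As a rough illustration of that pattern, the sketch below copies every file of a managed folder into a local temporary directory while preserving the train/valid/test sub-folders. The helper name download_folder_to_local is hypothetical, and the sketch assumes that list_paths_in_partition() returns paths with a leading slash such as "/train/Cat/1.jpg". It accepts any object exposing list_paths_in_partition() and get_download_stream(), so the dataiku.Folder from the question can be passed in directly.

```python
import os
import shutil
import tempfile


def download_folder_to_local(folder, local_root=None):
    """Copy all files of a managed folder to local disk, keeping sub-paths.

    `folder` is expected to behave like dataiku.Folder (assumption): it must
    provide list_paths_in_partition() and get_download_stream(path).
    Returns the local directory that now mirrors the folder contents.
    """
    if local_root is None:
        # The caller is responsible for deleting this directory afterwards.
        local_root = tempfile.mkdtemp(prefix="training_dataset_")

    for remote_path in folder.list_paths_in_partition():
        # Assumed remote path shape: "/train/Cat/1.jpg"
        parts = remote_path.strip("/").split("/")
        local_path = os.path.join(local_root, *parts)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with folder.get_download_stream(remote_path) as stream:
            with open(local_path, "wb") as out:
                shutil.copyfileobj(stream, out)
    return local_root


# Possible usage inside DSS (not executed here):
#   import dataiku
#   from tensorflow.keras.preprocessing.image import ImageDataGenerator
#   local_root = download_folder_to_local(dataiku.Folder("FvaDN1ly"))
#   train_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
#       directory=os.path.join(local_root, "train"),
#       target_size=(64, 64), batch_size=32, class_mode="binary")
```

With the data mirrored locally, the 'directory' arguments become os.path.join(local_root, "train") and os.path.join(local_root, "valid"). For large datasets the download can take a while and needs enough local disk space, so it may be worth downloading only the sub-folders that training actually uses.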

You can find a detailed example of this usage pattern here:

https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#detailed-examples

More information on managed folder usage here:

https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local
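The same local vs. non-local distinction applies to the "models" output folder in the question, which is also of type S3: ModelCheckpoint and model.save() need a local path, so one option is to write to a local directory first and upload the results afterwards. The sketch below is a minimal illustration under that assumption. The helper name upload_local_dir is hypothetical, and it only relies on the folder object exposing upload_stream(path, file_object), as dataiku.Folder does.

```python
import os


def upload_local_dir(folder, local_dir):
    """Upload every file under local_dir to a managed folder, keeping sub-paths.

    `folder` is expected to behave like dataiku.Folder (assumption): it must
    provide upload_stream(path, file_object).
    Returns the list of remote paths that were written.
    """
    uploaded = []
    for dirpath, _dirnames, filenames in os.walk(local_dir):
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            relative = os.path.relpath(local_path, local_dir)
            # Managed folder paths use forward slashes regardless of OS.
            remote_path = "/" + relative.replace(os.sep, "/")
            with open(local_path, "rb") as f:
                folder.upload_stream(remote_path, f)
            uploaded.append(remote_path)
    return uploaded


# Possible usage inside DSS (not executed here):
#   import dataiku, tempfile
#   with tempfile.TemporaryDirectory() as models_path:
#       ...  # train, saving checkpoints and final_model.h5 into models_path
#       upload_local_dir(dataiku.Folder("EMHPYeer"), models_path)
```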

Thanks
