Hello everyone. I am trying to train a simple Cat vs Dog image classification model using Keras in Dataiku DSS. However, I am having difficulty constructing the path for flow_from_directory(). Before we get started, here is the structure of the training data I am using. The "training_dataset" folder is present on the Dataiku-managed server.
training_dataset
├───test
│   ├───Cat
│   └───Dog
├───train
│   ├───Cat
│   └───Dog
└───valid
    ├───Cat
    └───Dog
Below is the code snippet with necessary comments.
import dataiku

def get_dataset_objects(source_id, target_id):
    # Read recipe inputs
    training_dataset = dataiku.Folder(source_id)
    print("Training Dataset Info.")
    print(training_dataset.get_info())
    # Write recipe outputs
    models_folder = dataiku.Folder(target_id)
    print("Models Folder Info.")
    print(models_folder.get_info())
    return training_dataset, models_folder

source_id = "FvaDN1ly"  # This is the folder id corresponding to the 'training_dataset' folder
target_id = "EMHPYeer"  # This is where my models should be saved
training_dataset, models_folder = get_dataset_objects(source_id, target_id)
Output:
Training Dataset Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'training_dataset', 'id': 'FvaDN1ly', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/FvaDN1ly'}, 'type': 'S3'}
Models Folder Info.
{'projectKey': 'IMAGECLASSIFICATION', 'directoryBasedPartitioning': False, 'name': 'models', 'id': 'EMHPYeer', 'accessInfo': {'bucket': 'gis-data-ap-southeast-1', 'root': '/space-5dfbbb07-dku/node-2f7d11aa/managed-dss-data/IMAGECLASSIFICATION/EMHPYeer'}, 'type': 'S3'}
Code for model creation.
# Imports assumed from the notebook; create_model() is defined elsewhere in it.
import os
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint

def train_model(train_data_path, valid_data_path, models_path, save_labels_path):
    train_datagen = ImageDataGenerator(rescale=1./255)
    valid_datagen = ImageDataGenerator(rescale=1./255)
    train_generator = train_datagen.flow_from_directory(
        directory='/train',  # placeholder: this is the path I cannot construct
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary'
    )
    valid_generator = valid_datagen.flow_from_directory(
        directory='/valid',  # placeholder: same problem as above
        target_size=(64, 64),
        batch_size=32,
        class_mode='binary'
    )
    # Save class labels
    class_labels = list(train_generator.class_indices.keys())
    np.save(save_labels_path, class_labels)
    model = create_model()
    checkpoint_callback = ModelCheckpoint(
        filepath=os.path.join(models_path, 'model_{epoch:02d}.h5'),
        save_freq='epoch',
        save_best_only=False,
        save_weights_only=False,
        verbose=1
    )
    model.fit(
        train_generator,
        epochs=5,
        validation_data=valid_generator,
        callbacks=[checkpoint_callback]
    )
    # Save the final model
    model.save(os.path.join(models_path, 'final_model.h5'))
if __name__ == "__main__":
    root_path = "/"
    train_data_path = os.path.join(root_path, "training_dataset")
    models_path = os.path.join(root_path, "models")
    save_labels_path = os.path.join(models_path, "class_labels.npy")
    train_model(train_data_path, train_data_path, models_path, save_labels_path)
The problem is that I cannot figure out how to construct the path for the 'directory' argument of flow_from_directory(). How can I modify the last code block so that the path is constructed properly and I can train a model with a custom script? Also, in which scenarios can flow_from_directory() work when the data resides in managed S3 storage? I have tried get_path(), but it does not work because the data is hosted in S3 rather than on a local filesystem, so I do not understand how to set the directory so that flow_from_directory() works.
For reference, I have attached the PDF version of the notebook I am using in Dataiku.
Operating system used: Windows
Hi,
get_path() is not available for an S3 managed folder, because the folder is non-local.
In this case, create a temporary directory with tempfile, download the files into it with get_download_stream(), and then pass that temporary directory to flow_from_directory().
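As a rough illustration of that pattern, here is a minimal sketch. It assumes the folder object behaves like dataiku.Folder, that is, list_paths_in_partition() returns paths such as "/train/Cat/1.jpg" and get_download_stream(path) returns a readable binary stream. The helper name download_managed_folder is made up for this example and is not part of the Dataiku API.

```python
import os
import shutil


def download_managed_folder(folder, local_root):
    """Copy every file of a (possibly non-local) managed folder under local_root.

    `folder` is assumed to behave like dataiku.Folder. The sub-directory
    layout (train/Cat, train/Dog, ...) is preserved, which is what
    flow_from_directory() relies on to infer the classes.
    """
    for remote_path in folder.list_paths_in_partition():
        # Managed-folder paths start with "/", so strip it before joining.
        local_path = os.path.join(local_root, *remote_path.strip("/").split("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with folder.get_download_stream(remote_path) as stream, \
                open(local_path, "wb") as out:
            shutil.copyfileobj(stream, out)
    return local_root


# Hypothetical usage inside the recipe or notebook (needs DSS and Keras):
#
#   import tempfile
#   import dataiku
#   training_dataset = dataiku.Folder("FvaDN1ly")
#   with tempfile.TemporaryDirectory() as tmp_dir:
#       download_managed_folder(training_dataset, tmp_dir)
#       train_generator = train_datagen.flow_from_directory(
#           directory=os.path.join(tmp_dir, "train"),
#           target_size=(64, 64), batch_size=32, class_mode="binary")
#       # run model.fit() here, while tmp_dir still exists
```

Note that the generators read images lazily, so training has to finish before the temporary directory is cleaned up.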
You can find a detailed example of this usage pattern here:
https://developer.dataiku.com/latest/concepts-and-examples/managed-folders.html#detailed-examples
More information on managed folder usage here:
https://doc.dataiku.com/dss/latest/connecting/managed_folders.html#local-vs-non-local
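The same local vs non-local distinction applies to the output "models" folder from the question, which is also of type S3, so os.path.join(models_path, ...) will not point at it. One possible approach, sketched below, is to save each file to a temporary local path and then push it with the folder's upload_file(path, file_path) method. The helper name save_and_upload and its save_fn parameter are hypothetical and only serve this example.

```python
import os
import tempfile


def save_and_upload(folder, remote_path, save_fn):
    """Write a file locally via save_fn(local_path), then upload it.

    `folder` is assumed to behave like dataiku.Folder, whose
    upload_file(path, file_path) copies a local file into the managed folder.
    `save_fn` is any callable that writes a file at the path it is given.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(remote_path))
        save_fn(local_path)  # e.g. model.save
        folder.upload_file(remote_path, local_path)
    return remote_path


# Hypothetical usage (needs DSS and a trained Keras model):
#
#   models_folder = dataiku.Folder("EMHPYeer")
#   save_and_upload(models_folder, "/final_model.h5", model.save)
#   save_and_upload(models_folder, "/class_labels.npy",
#                   lambda p: np.save(p, class_labels))
```

Per-epoch checkpoints written by ModelCheckpoint could be handled the same way: write them to a local temporary directory during training and upload them afterwards.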
Thanks