The logs do not give much more information. Exit code 132 corresponds to a SIGILL signal, which the CPU raises when a process tries to execute an illegal instruction. So it probably comes from inside Keras or Tensorflow rather than from your own code.
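For reference, POSIX shells report a process killed by a signal as 128 plus the signal number, and SIGILL is signal 4 on Linux, which is where 132 comes from. A minimal sketch of that arithmetic in Python:

    import signal

    # POSIX convention: a process killed by signal N exits with status 128 + N.
    # SIGILL ("illegal instruction") is signal 4 on Linux, hence exit code 132.
    print(128 + signal.SIGILL)  # prints 132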
To try to reproduce the error, could you please attach, if possible:
- the architecture that you used (content of the "Architecture" tab)
- the definition of the code-env that you used to run the model
- a sample of the data (or at least what it looks like) on which the model is trained
from keras.layers import Input, Dense, Flatten, GlobalAveragePooling2D
from keras.models import Model
from keras.applications import Xception
import os
import dataiku

def build_model(input_shapes, n_classes=None):

    #### DEFINING INPUT AND BASE ARCHITECTURE
    # You need to modify the name and shape of the "image_input"
    # according to the preprocessing and name of your initial feature.
    # This feature should be preprocessed as an "Image", with a
    # custom preprocessing.
    image_shape = (299, 299, 3)
    image_input_name = "path_preprocessed"
    image_input = Input(shape=image_shape, name=image_input_name)

    #### LOADING WEIGHTS OF PRE-TRAINED MODEL
    # To leverage this architecture, it is better to use weights
    # computed on a previous training on a large dataset (Imagenet).
    # To do so, you need to download the file containing the weights
    # and load them into your model.
    # You can do it by using the macro "Download pre-trained model"
    # of the "Deep Learning image" plugin (CPU or GPU version depending
    # on your setup) available in the plugin store. For this architecture,
    # you need to select "Xception trained on Imagenet".
    # This will download the weights and put them into a managed folder.
    folder = dataiku.Folder("xception_weights")
    weights_path = "xception_imagenet_weights_notop.h5"
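The snippet stops at the weights path. For context, here is a minimal sketch of how such a build_model typically continues in this transfer-learning pattern (loading the Xception base without its top and adding a softmax head). This is an assumption based on the tutorial's approach, not necessarily the exact code of the project:

    # Hedged sketch: typical continuation of build_model in this pattern.
    # Load the Xception base (no classification head) and apply the
    # downloaded Imagenet weights from the managed folder.
    base_model = Xception(include_top=False, weights=None, input_tensor=image_input)
    base_model.load_weights(os.path.join(folder.get_path(), weights_path))

    # Add a pooling layer and a softmax classification head.
    x = GlobalAveragePooling2D()(base_model.output)
    predictions = Dense(n_classes, activation="softmax")(x)
    return Model(inputs=image_input, outputs=predictions)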
The environment is a Python environment. I am not sure what "definition" means in this case. The input data are the "Cats_Dogs" images from the transfer learning section of this how-to:
The code-env is the Python environment that you had to set up to be able to run Keras/Tensorflow code. The steps to create it are mentioned in the "Prerequisites" section of the tutorial.
To access it afterwards, go to Administration > Code Envs, select it, and open "Installed packages". You can then send us that list.
On which type of server is your DSS instance installed?
It seems that the issue does not come from your code; we have never seen a similar error when running the tutorial on our side.
What you can try, with no certainty that it will change anything:
- Re-install the code-env: go to the code-env's page, select "Rebuild env" and click on "Update".
- Decrease the batch size, in case this is a hidden out-of-memory error. To do so, go to your model page, open the "Training" tab, and select for example 10 as the batch size (see the sketch below).
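For reference, in this tutorial the batch size also appears directly in the generated training code, so if the "Training" tab shows code rather than a field, the same change can be made there. A minimal sketch, assuming the default build_sequences skeleton:

    # Minimal sketch (assumes the default template): lower batch_size
    # to reduce peak memory usage per training step.
    def build_sequences(build_train_sequence_with_batch_size,
                        build_validation_sequence_with_batch_size):
        batch_size = 10  # default template uses 32
        train_sequence = build_train_sequence_with_batch_size(batch_size)
        validation_sequence = build_validation_sequence_with_batch_size(batch_size)
        return train_sequence, validation_sequence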
In any case, the bug seems to come from Keras/Tensorflow and how it interacts with your server, not from DSS.
I sent the installed packages, but it got stuck in the spam filter. I don't know which server this is on; it is hosted at my university, since this is a project I am doing there. I tried to lower the batch size, but it doesn't work. I even tried to copy the finished model project of this tutorial step by step. That didn't work either. Maybe it's a problem with the training code?
The Code:
from dataiku.doctor.deep_learning.sequences import DataAugmentationSequence
from keras.preprocessing.image import ImageDataGenerator
from keras import callbacks

# A function that builds train and validation sequences.
# You can define your custom data augmentation based on the original
# train and validation sequences.
#   build_train_sequence_with_batch_size - function that returns the train
#       data sequence depending on the batch size
#   build_validation_sequence_with_batch_size - function that returns the
#       validation data sequence depending on the batch size
def build_sequences(build_train_sequence_with_batch_size, build_validation_sequence_with_batch_size):

    # The actual batch size of the train sequence will be (batch_size * n_augmentation)
    batch_size = 32
    n_augmentation = 1  # Number of augmentations per batch; lower means better learning but also slower

# model - compiled model
# train_sequence - train data sequence, returned by build_sequences
# validation_sequence - validation data sequence, returned by build_sequences
# base_callbacks - a list of Dataiku callbacks that are not to be removed.
#     User callbacks can be added to this list.
def fit_model(model, train_sequence, validation_sequence, base_callbacks):
    epochs = 5

    # Add a callback that reduces the learning rate when the model
    # has difficulty improving on the validation data.
    callback = callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.2,
        patience=5
    )
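Both functions above are cut off as pasted: build_sequences never returns the sequences, and the ReduceLROnPlateau callback is never used. A hedged reconstruction of the missing tails, assuming the default DSS template (your generated code may differ slightly):

    # Hedged sketch (assumption, not your exact code): build_sequences
    # typically ends by building and returning the two sequences.
    train_sequence = build_train_sequence_with_batch_size(batch_size)
    validation_sequence = build_validation_sequence_with_batch_size(batch_size)
    return train_sequence, validation_sequence

    # ... and fit_model typically ends by registering the callback
    # and training the model on the sequences.
    base_callbacks.append(callback)
    model.fit_generator(train_sequence,
                        epochs=epochs,
                        callbacks=base_callbacks,
                        shuffle=True,
                        validation_data=validation_sequence)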
The issue does not come from the code of the tutorial.
After a quick search on the internet, it seems that tensorflow >= 1.6 does not work on servers whose CPUs do not support the AVX instruction set, which can end in exit 132 errors (more info here: https://github.com/tensorflow/tensorflow/issues/19584). This may be the case for your server; you can check with the snippet below.
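A quick Linux-only check for the AVX flag (a minimal sketch; run it in a Python shell or notebook on the server):

    # Linux-only: look for the "avx" flag in /proc/cpuinfo.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                cpu_flags = set(line.split(":", 1)[1].split())
                print("AVX supported" if "avx" in cpu_flags else "No AVX support")
                break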
Can you try to downgrade your tensorflow version to 1.4.0 and see if this works?
To achieve that, go to Administration > Code Envs > Your code-env > "Packages to install", replace "tensorflow==1.8.0" with "tensorflow==1.4.0", and click on "Save and update".