What does exit code 132 mean when training a deep learning model?
Dss28
Registered Posts: 8 ✭✭✭✭
Hey I am doing the HowTo "Deep Learning Image Classification" with the Dataset "Cats and Dogs".
But when training the model I get an error with exit code 132.
I already tried to post this question but it seems my post got stuck in the filter (I posted the logs too in the previous question).
Can somebody help me with this issue?
Answers
-
Hello,
Could you please attach the logs of your training? Otherwise it will be difficult to investigate.
Regards,
Nicolas -
Hi Nicolas, here is the last part of the logs (I couldn't post the whole thing):
[2018-11-13 17:57:21,791] [7113/MainThread] [INFO] [root] Realign target series = (1598,)
[2018-11-13 17:57:21,792] [7113/MainThread] [INFO] [root] After realign target: (1598,)
[2018-11-13 17:57:21,792] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DropRowsWhereNoTarget
[2018-11-13 17:57:21,793] [7113/MainThread] [INFO] [root] Deleting 0 rows because no target
[2018-11-13 17:57:21,793] [7113/MainThread] [INFO] [root] MF before = (0, 0) target before = (1598,)
[2018-11-13 17:57:21,795] [7113/MainThread] [INFO] [root] MultiFrame, dropping rows: []
[2018-11-13 17:57:21,798] [7113/MainThread] [INFO] [root] After DRWNT input_df=(1598, 2)
[2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] MF after = (0, 0) target after = (1598,)
[2018-11-13 17:57:21,799] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
[2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] ********* Pipieline state (Before feature selection)
[2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] input_df= (1598, 2)
[2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] current_mf=(0, 0)
[2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] PPR:
[2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] target = ((1598,))
[2018-11-13 17:57:21,800] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:EmitCurrentMFAsResult
[2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] Set MF index len 1598
[2018-11-13 17:57:21,801] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
[2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] ********* Pipieline state (At end)
[2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] input_df= (1598, 2)
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] current_mf=(0, 0)
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] PPR:
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] UNPROCESSED = ((1598, 2))
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] TRAIN = ((0, 0))
[2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] target = ((1598,))
[2018-11-13 17:57:21,804] [7113/MainThread] [INFO] [root] END - Fitting preprocessors
[2018-11-13 17:57:21,804] [7113/MainThread] [INFO] [root] START - Preprocessing train set
[2018-11-13 17:57:21,805] [7113/MainThread] [INFO] [root] END - Preprocessing train set
[2018-11-13 17:57:21,805] [7113/MainThread] [INFO] [root] START - Preprocessing test set
[2018-11-13 17:57:21,811] [7113/MainThread] [INFO] [root] END - Preprocessing test set
[2018-11-13 17:57:21,818] [7113/MainThread] [INFO] [root] START - Fitting model
/home/dataiku/dss/code-envs/python/Python/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[2018/11/13-17:57:22.081] [KNL-python-single-command-kernel-monitor-18717] [INFO] [dku.kernels] - Process done with code 132
[2018/11/13-17:57:22.082] [KNL-python-single-command-kernel-monitor-18717] [INFO] [dip.tickets] - Destroying API ticket for analysis-ml-CATS_DOGS-SrUtxam on behalf of Dataiku28
[2018/11/13-17:57:22.083] [MRT-18713] [INFO] [dku.kernels] - Getting kernel tail
[2018/11/13-17:57:22.084] [MRT-18713] [INFO] [dku.kernels] - Trying to enrich exception: com.dataiku.dip.io.SocketBlockLinkIOException: Failed to get result from kernel from kernel com.dataiku.dip.analysis.coreservices.AnalysisMLKernel@4842e569 process=null pid=?? retcode=132
[2018/11/13-17:57:22.184] [MRT-18713] [INFO] [dku.kernels] - Getting kernel tail
[2018/11/13-17:57:22.186] [MRT-18713] [WARN] [dku.analysis.ml.python] - Training failed
com.dataiku.dip.exceptions.ProcessDiedException: Process died (exit code: 132)
at com.dataiku.dip.kernels.DSSKernelBase.maybeRethrowAsProcessDied(DSSKernelBase.java:219)
at com.dataiku.dip.analysis.ml.prediction.PredictionTrainAdditionalThread.process(PredictionTrainAdditionalThread.java:78)
at com.dataiku.dip.analysis.ml.shared.PRNSTrainThread.run(PRNSTrainThread.java:130)
[2018/11/13-17:57:22.193] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Processing thread joined ...
[2018/11/13-17:57:22.193] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Joining processing thread ...
[2018/11/13-17:57:22.194] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Processing thread joined ...
[2018/11/13-17:57:22.195] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Train done
[2018/11/13-17:57:22.195] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Train done
[2018/11/13-17:57:22.202] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Publishing mltask-train-done reflected event -
Hey Nicolas, I attached the logs.
Are they not enough or is there simply no solution to the problem? -
Hello,
The logs do not bring much more information. Exit code 132 corresponds to a SIGILL signal (128 + 4), which means the process tried to execute an instruction that the CPU does not recognize. So it probably comes from a bug inside Keras or TensorFlow.
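As a small illustration of how such an exit code can be decoded, here is a minimal sketch in Python 3. The helper name `describe_exit_code` is made up for this example, and it assumes the usual Linux convention that a process killed by signal N is reported with code 128 + N.

```python
import signal


def describe_exit_code(code):
    """Return the signal name encoded in a shell-style exit code, or None."""
    # Convention: "killed by signal N" is reported as 128 + N.
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number on this platform.
            return None
    return None


print(describe_exit_code(132))
```

On a Linux machine, signal number 4 is SIGILL, which is what this should report for code 132.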
To try to reproduce the error, could you please attach, if it's possible:
- The architecture that you used (content of the "Architecture" tab)
- The definition of the code-env that you used to run the model
- a sample of the data (or at least what it looks like) on which the model is trained
Thanks in advance,
Nicolas -
I am going to split this.
The code in the feature handling tab:
from keras.preprocessing.image import img_to_array, load_img

# Custom image preprocessing function.
# Must return a numpy ndarray representing the image.
#  - image_file is a file like object
def preprocess_image(image_file):
    img = load_img(image_file,target_size=(299, 299, 3))
    array = img_to_array(img)

    # Normalize image between 0 and 1.
    array /= 255

    return array
-
The code in the Architecture tab:
from keras.layers import Input, Dense, Flatten, GlobalAveragePooling2D
from keras.models import Model
from keras.applications import Xception
import os
import dataiku

def build_model(input_shapes, n_classes=None):

    #### DEFINING INPUT AND BASE ARCHITECTURE
    # You need to modify the name and shape of the "image_input"
    # according to the preprocessing and name of your
    # initial feature.
    # This feature should be preprocessed as an "Image", with a
    # custom preprocessing.
    image_shape = (299, 299, 3)
    image_input_name = "path_preprocessed"
    image_input = Input(shape=image_shape, name=image_input_name)

    base_model = Xception(include_top=False, weights=None, input_tensor=image_input)

    #### LOADING WEIGHTS OF PRE TRAINED MODEL
    # To leverage this architecture, it is better to use weights
    # computed on a previous training on a large dataset (Imagenet).
    # To do so, you need to download the file containing the weights
    # and load them into your model.
    # You can do it by using the macro "Download pre-trained model"
    # of the "Deep Learning image" plugin (CPU or GPU version depending
    # on your setup) available in the plugin store. For this architecture,
    # you need to select:
    #    "Xception trained on Imagenet"
    # This will download the weights and put them into a managed folder
    folder = dataiku.Folder("xception_weights")
    weights_path = "xception_imagenet_weights_notop.h5"

    base_model.load_weights(os.path.join(folder.get_path(), weights_path))

    for layer in base_model.layers:
        layer.trainable = False

    #### ADDING FULLY CONNECTED CLASSIFICATION LAYER
    x = base_model.layers[-1].output
    x = Flatten()(x)
    predictions = Dense(n_classes, activation="softmax")(x)

    model = Model(input=base_model.input, output=predictions)
    return model

def compile_model(model):
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy"
    )
    return model
-
The Environment is a Python environment. I am not sure what definition means in this case.
The input data are images "Cats_Dogs" in the Transfer learning section of this HowTo:
https://www.dataiku.com/learn/guide/visual/machine-learning/deep-learning-images.html
Thanks for your help. I really don't know what I am doing wrong. -
The code-env is the Python environment that you had to set up to be able to run Keras/TensorFlow code. The steps to create it are mentioned in the "Prerequisites" of the tutorial.
To access it afterwards, you can go to Administration > Code Envs, select it and go to the installed packages. You can then send us the list of installed packages.
What type of server is your DSS instance installed on?
It seems that the issue does not come from your code and we've never seen a similar error when we run the tutorial on our side.
What you can try, but there is no certainty that it will change anything:
- re-install the code-env, i.e. go to the page of the code-env, select "rebuild env" and click on update
- decrease the batch size, in case this would be a hidden out of memory error. To do so, go to your model page, to the "Training" tab, and select for example 10 as a batch size.
In any case, the bug seems to come from Keras/Tensorflow and how it interacts with your server, not from DSS.
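To check whether the crash really happens in TensorFlow itself, independently of DSS, one option is to import it in a child process and look at how that process ends. The following is a minimal sketch for Python 3 on Linux; the helper name `run_and_report` is an assumption made for this example, not part of DSS or TensorFlow.

```python
import signal
import subprocess


def run_and_report(args):
    """Run a command and describe how it ended: normal exit or killed by a signal."""
    proc = subprocess.run(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return ("signal", signal.Signals(-proc.returncode).name)
    return ("exit", proc.returncode)
```

A possible usage is `run_and_report([python_bin, "-c", "import tensorflow"])`, where `python_bin` is the path to the Python interpreter of the code-env (the exact path depends on your installation). A result of `("signal", "SIGILL")` would point at the TensorFlow binary rather than at the tutorial code.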
Regards,
Nicolas -
I sent the installed packages, but the post is stuck in the spam filter. I don't know what kind of server this is on. The server is hosted by my university, since this is a project I am doing there. I tried to lower the batch size, but it doesn't work. I even tried to copy the finished model project of this tutorial step by step. That didn't work either. Maybe it's a problem with the training code?
The code:
from dataiku.doctor.deep_learning.sequences import DataAugmentationSequence
from keras.preprocessing.image import ImageDataGenerator
from keras import callbacks

# A function that builds train and validation sequences.
# You can define your custom data augmentation based on the original train and validation sequences
#   build_train_sequence_with_batch_size      - function that returns train data sequence depending on
#                                               batch size
#   build_validation_sequence_with_batch_size - function that returns validation data sequence depending on
#                                               batch size
def build_sequences(build_train_sequence_with_batch_size, build_validation_sequence_with_batch_size):

    # The actual batch size of the train sequence will be (batch_size * n_augmentation)
    batch_size = 32
    n_augmentation = 1 # Number of augmentation per batch, lower means better learning but also slower

    train_sequence = build_train_sequence_with_batch_size(batch_size)
    validation_sequence = build_validation_sequence_with_batch_size(batch_size)

    augmentator = ImageDataGenerator(
        zoom_range=0.2,
        shear_range=0.2,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        horizontal_flip=True
    )
    augmented_sequence = DataAugmentationSequence(
        train_sequence,
        'path_preprocessed',
        augmentator,
        n_augmentation
    )

    return augmented_sequence, validation_sequence

# A function that contains a call to fit a model.
#   model               - compiled model
#   train_sequence      - train data sequence, returned in build_sequence
#   validation_sequence - validation data sequence, returned in build_sequence
#   base_callbacks      - a list of Dataiku callbacks, that are not to be removed. User callbacks can be added to this list
def fit_model(model, train_sequence, validation_sequence, base_callbacks):
    epochs = 5

    # Adding a callback that will reduce the 'learning rate' when the model
    # has difficulty to improve itself on the validation data.
    callback = callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.2,
        patience=5
    )

    base_callbacks.append(callback)

    model.fit_generator(train_sequence,
                        validation_data=validation_sequence,
                        epochs=epochs,
                        callbacks=base_callbacks,
                        shuffle=True)
-
List of packages:
absl-py==0.6.1
astor==0.7.1
backports-abc==0.5
backports.shutil-get-terminal-size==1.0.0
backports.ssl-match-hostname==3.5.0.1
backports.weakref==1.0.post1
bleach==1.5.0
certifi==2018.10.15
chardet==3.0.4
Click==7.0
decorator==4.2.1
enum34==1.1.6
Flask==0.12.4
funcsigs==1.0.2
futures==3.2.0
gast==0.2.0
grpcio==1.16.0
h5py==2.7.1
html5lib==0.9999999
idna==2.6
ipykernel==4.8.2
ipython==5.8.0
ipython-genutils==0.2.0
itsdangerous==1.1.0
Jinja2==2.10
jupyter-client==5.2.2
jupyter-core==4.4.0
Keras==2.1.5
Markdown==3.0.1
MarkupSafe==1.1.0
mock==2.0.0
numpy==1.15.4
pandas==0.20.3
pathlib2==2.3.2
patsy==0.5.1
pbr==5.1.1
pexpect==4.4.0
pickleshare==0.7.4
Pillow==5.1.0
prompt-toolkit==1.0.15
protobuf==3.6.1
ptyprocess==0.5.2
Pygments==2.2.0
python-dateutil==2.6.1
pytz==2018.3
PyYAML==3.13
pyzmq==16.0.4
requests==2.18.4
scandir==1.9.0
scikit-learn==0.19.2
scipy==1.1.0
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.11.0
statsmodels==0.8.0
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
tornado==4.5.3
traitlets==4.3.2
urllib3==1.22
wcwidth==0.1.7
Werkzeug==0.14.1
xgboost==0.71 -
Hello again,
The issue does not come from the code of the tutorial.
After a quick search on the internet, it seems that the prebuilt tensorflow packages >= 1.6 are compiled with AVX instructions and do not work on servers whose CPUs do not support the AVX instruction set, which can end up with exit code 132 errors (more info here https://github.com/tensorflow/tensorflow/issues/19584). That may be the case for your server.
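If you have shell access to the server, you can check whether its CPU reports AVX support. Below is a minimal sketch, assuming an x86 Linux server where /proc/cpuinfo contains a "flags" line; the function names `cpu_supports` and `read_cpuinfo` are made up for this example.

```python
import os


def cpu_supports(flag, cpuinfo_text):
    """Return True if `flag` is listed on a 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that "avx2" is not mistaken for "avx".
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the CPU description file (Linux only)."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    print("AVX supported:", cpu_supports("avx", read_cpuinfo()))
```

If this prints False, the CPU does not advertise AVX, which would be consistent with the SIGILL crash described above.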
Can you try to downgrade your tensorflow version to 1.4.0 and see if this works?
To do that, go to Administration > Code Envs > Your code-env > Packages to Install, replace "tensorflow==1.8.0" with "tensorflow==1.4.0" and click on "Save and update".
Regards,
Nicolas