What does exit code 132 mean when training a deep learning model?

Dss28
Dss28 Registered Posts: 8 ✭✭✭✭
Hey, I am working through the how-to "Deep Learning Image Classification" with the "Cats and Dogs" dataset.

But when training the model I get an error with exit code 132.

I already tried to post this question, but it seems my post got stuck in the filter (I included the logs in that previous post).



Can somebody help me with this issue?

Answers

  • Nicolas_Servel
    Nicolas_Servel Dataiker Posts: 37 Dataiker
    Hello,

    Could you please attach the logs of your training? Otherwise it will be difficult to investigate.

    Regards,

    Nicolas
  • Dss28
    Dss28 Registered Posts: 8 ✭✭✭✭
    Hi Nicolas, here is the last part of the logs (I couldn't post the whole thing):

    [2018-11-13 17:57:21,791] [7113/MainThread] [INFO] [root] Realign target series = (1598,)
    [2018-11-13 17:57:21,792] [7113/MainThread] [INFO] [root] After realign target: (1598,)
    [2018-11-13 17:57:21,792] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DropRowsWhereNoTarget
    [2018-11-13 17:57:21,793] [7113/MainThread] [INFO] [root] Deleting 0 rows because no target
    [2018-11-13 17:57:21,793] [7113/MainThread] [INFO] [root] MF before = (0, 0) target before = (1598,)
    [2018-11-13 17:57:21,795] [7113/MainThread] [INFO] [root] MultiFrame, dropping rows: []
    [2018-11-13 17:57:21,798] [7113/MainThread] [INFO] [root] After DRWNT input_df=(1598, 2)
    [2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] MF after = (0, 0) target after = (1598,)
    [2018-11-13 17:57:21,799] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
    [2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] ********* Pipieline state (Before feature selection)
    [2018-11-13 17:57:21,799] [7113/MainThread] [INFO] [root] input_df= (1598, 2)
    [2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] current_mf=(0, 0)
    [2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] PPR:
    [2018-11-13 17:57:21,800] [7113/MainThread] [INFO] [root] target = ((1598,))
    [2018-11-13 17:57:21,800] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:EmitCurrentMFAsResult
    [2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] Set MF index len 1598
    [2018-11-13 17:57:21,801] [7113/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
    [2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] ********* Pipieline state (At end)
    [2018-11-13 17:57:21,801] [7113/MainThread] [INFO] [root] input_df= (1598, 2)
    [2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] current_mf=(0, 0)
    [2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] PPR:
    [2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] UNPROCESSED = ((1598, 2))
    [2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] TRAIN = ((0, 0))
    [2018-11-13 17:57:21,802] [7113/MainThread] [INFO] [root] target = ((1598,))
    [2018-11-13 17:57:21,804] [7113/MainThread] [INFO] [root] END - Fitting preprocessors
    [2018-11-13 17:57:21,804] [7113/MainThread] [INFO] [root] START - Preprocessing train set
    [2018-11-13 17:57:21,805] [7113/MainThread] [INFO] [root] END - Preprocessing train set
    [2018-11-13 17:57:21,805] [7113/MainThread] [INFO] [root] START - Preprocessing test set
    [2018-11-13 17:57:21,811] [7113/MainThread] [INFO] [root] END - Preprocessing test set
    [2018-11-13 17:57:21,818] [7113/MainThread] [INFO] [root] START - Fitting model
    /home/dataiku/dss/code-envs/python/Python/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
    from ._conv import register_converters as _register_converters
    Using TensorFlow backend.
    [2018/11/13-17:57:22.081] [KNL-python-single-command-kernel-monitor-18717] [INFO] [dku.kernels] - Process done with code 132
    [2018/11/13-17:57:22.082] [KNL-python-single-command-kernel-monitor-18717] [INFO] [dip.tickets] - Destroying API ticket for analysis-ml-CATS_DOGS-SrUtxam on behalf of Dataiku28
    [2018/11/13-17:57:22.083] [MRT-18713] [INFO] [dku.kernels] - Getting kernel tail
    [2018/11/13-17:57:22.084] [MRT-18713] [INFO] [dku.kernels] - Trying to enrich exception: com.dataiku.dip.io.SocketBlockLinkIOException: Failed to get result from kernel from kernel com.dataiku.dip.analysis.coreservices.AnalysisMLKernel@4842e569 process=null pid=?? retcode=132
    [2018/11/13-17:57:22.184] [MRT-18713] [INFO] [dku.kernels] - Getting kernel tail
    [2018/11/13-17:57:22.186] [MRT-18713] [WARN] [dku.analysis.ml.python] - Training failed
    com.dataiku.dip.exceptions.ProcessDiedException: Process died (exit code: 132)
    at com.dataiku.dip.kernels.DSSKernelBase.maybeRethrowAsProcessDied(DSSKernelBase.java:219)
    at com.dataiku.dip.analysis.ml.prediction.PredictionTrainAdditionalThread.process(PredictionTrainAdditionalThread.java:78)
    at com.dataiku.dip.analysis.ml.shared.PRNSTrainThread.run(PRNSTrainThread.java:130)
    [2018/11/13-17:57:22.193] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Processing thread joined ...
    [2018/11/13-17:57:22.193] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Joining processing thread ...
    [2018/11/13-17:57:22.194] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.ml.python] T-I1AGUr2h - Processing thread joined ...
    [2018/11/13-17:57:22.195] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Train done
    [2018/11/13-17:57:22.195] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Train done
    [2018/11/13-17:57:22.202] [FT-TrainWorkThread-ZOFmW2lG-18712] [INFO] [dku.analysis.prediction] T-I1AGUr2h - Publishing mltask-train-done reflected event
  • Dss28
    Dss28 Registered Posts: 8 ✭✭✭✭
    Hey Nicolas, I attached the logs.
    Are they not enough, or is there simply no solution to the problem?
  • Nicolas_Servel
    Nicolas_Servel Dataiker Posts: 37 Dataiker
    Hello,

    The logs do not bring much more information. Exit code 132 corresponds to a SIGILL signal, i.e. the process tried to execute an instruction that the CPU does not support. So it probably comes from a bug inside Keras or TensorFlow.
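
    For what it's worth, the exit code itself can be decoded: a process killed by a signal exits with 128 plus the signal number, and SIGILL is signal 4 on Linux. A minimal sketch in Python:

    import signal

    # A process terminated by a signal exits with status 128 + signal number.
    # Exit code 132 therefore means signal 4, i.e. SIGILL (illegal instruction).
    assert 128 + signal.SIGILL == 132
    print("exit code 132 -> signal", 132 - 128, "(SIGILL)")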

    To try to reproduce the error, could you please attach, if possible:
    - The architecture that you used (content of the "Architecture" tab)
    - The definition of the code-env that you used to run the model
    - A sample of the data (or at least what it looks like) on which the model is trained

    Thanks in advance,

    Nicolas
  • Dss28
    Dss28 Registered Posts: 8 ✭✭✭✭
    I am going to split this.
    The code in the feature handling tab:

    from keras.preprocessing.image import img_to_array, load_img

    # Custom image preprocessing function.
    # Must return a numpy ndarray representing the image.
    # - image_file is a file like object
    def preprocess_image(image_file):
        img = load_img(image_file, target_size=(299, 299, 3))
        array = img_to_array(img)

        # Normalize image between 0 and 1.
        array /= 255

        return array
  • Dss28
    Dss28 Registered Posts: 8 ✭✭✭✭
    The code in the Architecture tab:

    from keras.layers import Input, Dense, Flatten, GlobalAveragePooling2D
    from keras.models import Model
    from keras.applications import Xception
    import os
    import dataiku

    def build_model(input_shapes, n_classes=None):

        #### DEFINING INPUT AND BASE ARCHITECTURE
        # You need to modify the name and shape of the "image_input"
        # according to the preprocessing and name of your
        # initial feature.
        # This feature should be preprocessed as an "Image", with a
        # custom preprocessing.
        image_shape = (299, 299, 3)
        image_input_name = "path_preprocessed"
        image_input = Input(shape=image_shape, name=image_input_name)

        base_model = Xception(include_top=False, weights=None, input_tensor=image_input)

        #### LOADING WEIGHTS OF PRE-TRAINED MODEL
        # To leverage this architecture, it is better to use weights
        # computed on a previous training on a large dataset (Imagenet).
        # To do so, you need to download the file containing the weights
        # and load them into your model.
        # You can do it by using the macro "Download pre-trained model"
        # of the "Deep Learning image" plugin (CPU or GPU version depending
        # on your setup) available in the plugin store. For this architecture,
        # you need to select:
        #   "Xception trained on Imagenet"
        # This will download the weights and put them into a managed folder.
        folder = dataiku.Folder("xception_weights")
        weights_path = "xception_imagenet_weights_notop.h5"

        base_model.load_weights(os.path.join(folder.get_path(), weights_path))

        for layer in base_model.layers:
            layer.trainable = False

        #### ADDING FULLY CONNECTED CLASSIFICATION LAYER
        x = base_model.layers[-1].output
        x = Flatten()(x)
        predictions = Dense(n_classes, activation="softmax")(x)

        model = Model(inputs=base_model.input, outputs=predictions)
        return model

    def compile_model(model):
        model.compile(
            optimizer="adam",
            loss="categorical_crossentropy"
        )
        return model
  • Dss28
    Dss28 Registered Posts: 8 ✭✭✭✭
    The environment is a Python environment. I am not sure what "definition" means in this case.
    The input data are the "Cats_Dogs" images from the transfer learning section of this how-to:

    https://www.dataiku.com/learn/guide/visual/machine-learning/deep-learning-images.html

    Thanks for your help. I really don't know what I am doing wrong.
  • Nicolas_Servel
    Nicolas_Servel Dataiker Posts: 37 Dataiker
    The code-env is the Python environment that you had to set up to be able to run Keras/Tensorflow code. The steps to create it are mentioned in the "Prerequisites" section of the tutorial.

    To access it afterwards, you can go to Administration > Code Envs, select it and go to installed packages. You can then send us the list of installed packages.
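
    If it is easier, here is a minimal sketch for dumping that list from a Python notebook running in the code-env (pkg_resources ships with setuptools, so it should be available there):

    import pkg_resources

    # Print every package installed in the current environment, pip-freeze style.
    for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
        print("{}=={}".format(dist.project_name, dist.version))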

    On which type of server is your DSS instance installed?

    It seems that the issue does not come from your code and we've never seen a similar error when we run the tutorial on our side.

    What you can try, although there is no certainty that it will change anything:
    - Re-install the code-env, i.e. go to the page of the code-env, select "Rebuild env" and click on "Update".
    - Decrease the batch size, in case this is a hidden out-of-memory error. To do so, go to your model page, then to the "Training" tab, and select for example 10 as the batch size.

    In any case, the bug seems to come from Keras/Tensorflow and how it interacts with your server, not from DSS.

    Regards,

    Nicolas
  • Dss28
    Dss28 Registered Posts: 8 ✭✭✭✭
    I sent the installed packages, but the post got stuck in the spam filter. I don't know which kind of server this is on; it is hosted at my university, since this is a project I am doing there. I tried lowering the batch size, but it doesn't work. I even tried to copy the finished model project of this tutorial step by step. Didn't work. Maybe it's a problem with the training code?

    The Code:

    from dataiku.doctor.deep_learning.sequences import DataAugmentationSequence
    from keras.preprocessing.image import ImageDataGenerator
    from keras import callbacks


    # A function that builds the train and validation sequences.
    # You can define your custom data augmentation based on the original train and validation sequences.

    # build_train_sequence_with_batch_size - function that returns the train data sequence for a given
    #                                        batch size
    # build_validation_sequence_with_batch_size - function that returns the validation data sequence for a given
    #                                             batch size
    def build_sequences(build_train_sequence_with_batch_size, build_validation_sequence_with_batch_size):

        # The actual batch size of the train sequence will be (batch_size * n_augmentation)
        batch_size = 32
        n_augmentation = 1  # Number of augmentations per batch, lower means better learning but also slower

        train_sequence = build_train_sequence_with_batch_size(batch_size)
        validation_sequence = build_validation_sequence_with_batch_size(batch_size)

        augmentator = ImageDataGenerator(
            zoom_range=0.2,
            shear_range=0.2,
            rotation_range=20,
            width_shift_range=0.2,
            height_shift_range=0.2,
            horizontal_flip=True
        )

        augmented_sequence = DataAugmentationSequence(
            train_sequence,
            'path_preprocessed',
            augmentator,
            n_augmentation
        )

        return augmented_sequence, validation_sequence


    # A function that contains the call to fit the model.

    # model - compiled model
    # train_sequence - train data sequence, returned by build_sequences
    # validation_sequence - validation data sequence, returned by build_sequences
    # base_callbacks - a list of Dataiku callbacks that must not be removed. User callbacks can be added to this list.
    def fit_model(model, train_sequence, validation_sequence, base_callbacks):
        epochs = 5

        # Adding a callback that will reduce the learning rate when the model
        # has difficulty improving on the validation data.
        callback = callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.2,
            patience=5
        )

        base_callbacks.append(callback)

        model.fit_generator(train_sequence,
                            validation_data=validation_sequence,
                            epochs=epochs,
                            callbacks=base_callbacks,
                            shuffle=True)
  • Dss28
    Dss28 Registered Posts: 8 ✭✭✭✭
    List of packages:

    absl-py==0.6.1
    astor==0.7.1
    backports-abc==0.5
    backports.shutil-get-terminal-size==1.0.0
    backports.ssl-match-hostname==3.5.0.1
    backports.weakref==1.0.post1
    bleach==1.5.0
    certifi==2018.10.15
    chardet==3.0.4
    Click==7.0
    decorator==4.2.1
    enum34==1.1.6
    Flask==0.12.4
    funcsigs==1.0.2
    futures==3.2.0
    gast==0.2.0
    grpcio==1.16.0
    h5py==2.7.1
    html5lib==0.9999999
    idna==2.6
    ipykernel==4.8.2
    ipython==5.8.0
    ipython-genutils==0.2.0
    itsdangerous==1.1.0
    Jinja2==2.10
    jupyter-client==5.2.2
    jupyter-core==4.4.0
    Keras==2.1.5
    Markdown==3.0.1
    MarkupSafe==1.1.0
    mock==2.0.0
    numpy==1.15.4
    pandas==0.20.3
    pathlib2==2.3.2
    patsy==0.5.1
    pbr==5.1.1
    pexpect==4.4.0
    pickleshare==0.7.4
    Pillow==5.1.0
    prompt-toolkit==1.0.15
    protobuf==3.6.1
    ptyprocess==0.5.2
    Pygments==2.2.0
    python-dateutil==2.6.1
    pytz==2018.3
    PyYAML==3.13
    pyzmq==16.0.4
    requests==2.18.4
    scandir==1.9.0
    scikit-learn==0.19.2
    scipy==1.1.0
    simplegeneric==0.8.1
    singledispatch==3.4.0.3
    six==1.11.0
    statsmodels==0.8.0
    tensorboard==1.8.0
    tensorflow==1.8.0
    termcolor==1.1.0
    tornado==4.5.3
    traitlets==4.3.2
    urllib3==1.22
    wcwidth==0.1.7
    Werkzeug==0.14.1
    xgboost==0.71
  • Nicolas_Servel
    Nicolas_Servel Dataiker Posts: 37 Dataiker
    Hello again,

    The issue does not come from the code of the tutorial.

    After a quick search on the internet, it seems that pre-built tensorflow >= 1.6 binaries do not work on servers whose CPUs do not support the AVX instruction set, which can end up in exit code 132 errors (more info here: https://github.com/tensorflow/tensorflow/issues/19584). This may be the case for your server.
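
    If you want to verify this before changing anything, here is a minimal sketch (Linux only) that checks whether the server's CPU advertises AVX by reading /proc/cpuinfo:

    # Check whether the CPU reports the "avx" flag (Linux only).
    def cpu_supports_avx(cpuinfo_path="/proc/cpuinfo"):
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return "avx" in line.split()
        return False

    print("AVX supported:", cpu_supports_avx())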

    Can you try downgrading your tensorflow version to 1.4.0 and see if this works?

    To achieve that, go to Administration > Code Envs > your code-env > "Packages to install", replace "tensorflow==1.8.0" with "tensorflow==1.4.0", and click "Save and update".

    Regards,

    Nicolas