Multivariate LSTM where many features are generated from one feature by dummy encoding of sequence

pratikgujral-sf Registered Posts: 8
edited July 16 in Using Dataiku

Hi,

I am trying to train a multivariate LSTM model. I have two features that are sequences of values.

Example of one row of my dataset:

Feature 1: [1,2,3,4,5,6]

Feature 2: ['a', 'b', 'c', 'd', 'e', 'f']

Feature 2 has to be dummy encoded. Hence, given the example dataset above, my LSTM expects a shape of (num_records, 6, 7).

6 because the time sequence has 6 values.

7 because dummy encoding Feature-2 creates 6 columns, plus 1 column from Feature-1.

Hence, my (second_dim, third_dim) matrix looks like this:

[lstm.PNG: the (second_dim, third_dim) matrix]
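To sanity-check the shape arithmetic above, here is a minimal numpy-only sketch for one row, assuming Feature-2's six categories `'a'..'f'` are label-encoded as integers 0-5 (the encoding scheme is an assumption, not part of the original post):

```python
import numpy as np

# One row of the dataset
feature_1 = np.array([1, 2, 3, 4, 5, 6])        # numeric sequence, length 6
feature_2_codes = np.array([0, 1, 2, 3, 4, 5])  # 'a'..'f' label-encoded as 0..5

# Dummy encode Feature-2: each timestep becomes a 6-dim one-hot vector
one_hot = np.eye(6, dtype=np.uint16)[feature_2_codes]     # shape (6, 6)

# Stack the single Feature-1 column next to the 6 dummy columns
row = np.concatenate([feature_1.reshape(-1, 1), one_hot], axis=1)
print(row.shape)  # (6, 7)
```

This reproduces one (second_dim, third_dim) = (6, 7) slice of the LSTM input.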

I have a windowprocessor.py file in the project library where I define the class windowProcessor:

import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

class windowProcessor:
    def __init__(self, window_size, feature_type='int'):
        """
        feature_type: str
            Set this value to 'int' if dummy encoding is not to be done.
        """
        self.window_size = window_size
        self.feature_type = feature_type

    def _convert(self, x):
        ## parse the string representation of each sequence into a Python list
        x = pd.Series(x).apply(eval)

        ## pad sequences to the correct sequence length
        m = pad_sequences(x, maxlen=self.window_size, padding='pre', dtype=np.uint16)

        if self.feature_type != 'int':
            ## perform dummy encoding of the padded sequence. done if input had categorical values
            m = to_categorical(m, dtype=np.uint16)
        return m

    def fit(self, x):
        ## stateless processor: nothing to fit
        pass

    def transform(self, x):
        return self._convert(x)
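For reference, `pad_sequences(..., padding='pre')` left-pads shorter sequences with zeros up to `maxlen` and, with the default truncation, keeps the last `maxlen` values of longer ones. A numpy-only equivalent for a single sequence (`pre_pad` is a hypothetical helper written here for illustration, not part of Keras):

```python
import numpy as np

def pre_pad(seq, window_size, value=0):
    # Keep the last `window_size` values, left-pad with `value` if shorter
    seq = list(seq)[-window_size:]
    return np.array([value] * (window_size - len(seq)) + seq, dtype=np.uint16)

print(pre_pad([4, 5, 6], 6))           # [0 0 0 4 5 6]
print(pre_pad([1, 2, 3, 4, 5, 6, 7], 6))  # [2 3 4 5 6 7]
```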

Furthermore, in the Feature handling section, I have created a windowProcessor object for each of the two features:

# Feature 1 (numeric sequence)
processor = windowProcessor(window_size, 'int')
# Feature 2 (categorical sequence)
processor = windowProcessor(window_size, 'categorical')

My dataset is partitioned, and my window size depends on which partition the model is being trained on.

Q1) What should be my window size? Should it be 6?

Q2) Where should I declare my window_size variable such that I can reuse it in the Feature Handling of both the features as well as in the architecture definition section without having to define the value at every place independently?

Q3) How do I ensure that my LSTM input will indeed be (num_samples, 6, 7)? Where do I write the code that tells Dataiku that the 6 features from Feature-2's dummy encoding and the 1 feature from Feature-1 must be concatenated together, so there are ultimately 7 features?
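As an illustration of the target shape in Q3: outside of Dataiku's feature handling, the concatenation would be a last-axis `np.concatenate` of the two processed arrays. A numpy-only sketch with made-up data (the array names and sizes are assumptions for illustration):

```python
import numpy as np

num_samples, window_size, n_categories = 4, 6, 6

# Output of the 'int' processor: (num_samples, window_size) -> add a channel axis
f1 = np.random.randint(1, 10, size=(num_samples, window_size))
f1 = f1[..., np.newaxis]                                # (4, 6, 1)

# Output of the 'categorical' processor: one one-hot vector per timestep
codes = np.random.randint(0, n_categories, size=(num_samples, window_size))
f2 = np.eye(n_categories, dtype=np.uint16)[codes]       # (4, 6, 6)

# Concatenate along the feature axis -> the LSTM input tensor
X = np.concatenate([f1, f2], axis=-1)
print(X.shape)  # (4, 6, 7)
```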


Operating system used: Red Hat Enterprise Linux
