Process multiple features through code in Visual Analysis

pratikgujral-sf Registered Posts: 8

Hi Community,

I need to develop a deep learning model on sequential data. My dataset has two features, Column-1 and Column-2, and both hold sequential data. Each cell contains a list whose values are in chronological order. Please see the attached reference image (1.PNG) for clarity.

For example, in the first record, the Column-1 value [a,b,c,d,e,f] holds the values at six time points.

  • Data in Column-1 are text categories and have to be one-hot encoded.
  • Data in Column-2 are integers, and no encoding is needed.

I need to prepare the data for an LSTM model. I already have a Python function that receives the DataFrame as input, transforms both Column-1 and Column-2, and returns a 3-dimensional NumPy array of shape (num_samples, time_sequence, num_features).

For the example dataset above, one-hot encoding Column-1 creates 6 columns. Together with the single integer feature from Column-2, that gives 7 features per time point, so the final NumPy array would have shape (3,6,7).
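To make the expected output concrete, here is a minimal sketch of what such a function could look like. It is not my actual code: the `categories` parameter and the sample values are made up for illustration (only the first Column-1 row matches the example above), and it assumes every list has the same length.

```python
import numpy as np
import pandas as pd


def prepare_data(df, categories):
    """Sketch: turn two list-valued columns into an array of shape
    (num_samples, time_sequence, num_features)."""
    # Map each category to its one-hot position (hypothetical, fixed vocabulary).
    cat_index = {c: i for i, c in enumerate(categories)}
    num_samples = len(df)
    time_steps = len(df["Column-1"].iloc[0])  # assumes equal-length sequences
    num_features = len(categories) + 1        # one-hot columns + integer column

    out = np.zeros((num_samples, time_steps, num_features), dtype=np.float32)
    for i, (cats, ints) in enumerate(zip(df["Column-1"], df["Column-2"])):
        for t, (c, v) in enumerate(zip(cats, ints)):
            out[i, t, cat_index[c]] = 1.0  # one-hot slot for Column-1
            out[i, t, -1] = v              # raw integer from Column-2
    return out


# Hypothetical data; only the first Column-1 row comes from the example above.
df = pd.DataFrame({
    "Column-1": [list("abcdef"), list("fedcba"), list("abcabc")],
    "Column-2": [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], [0, 1, 0, 1, 0, 1]],
})
X = prepare_data(df, categories=list("abcdef"))
print(X.shape)  # (3, 6, 7)
```

The last axis holds the six one-hot columns followed by the Column-2 integer, which is where the 7 comes from.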


I have the following questions:

  • Where do I place the function? In Visual Analysis -> Models -> Design -> Feature Handling, I can only see a way to transform one feature at a time. However, my prepare_data Python function takes in the entire DataFrame, transforms both features, and returns a 3D NumPy array. Where should I write a Python function that handles both columns in one go?
  • How do I let my function return a 3D NumPy array and have Dataiku use it for training? The documentation (available here) mentions that in a custom preprocessor, the transform function must return a 2D NumPy array. However, as explained above, my code returns one row of 7 features for each of the 6 time points of every sample, which makes the array 3-dimensional.
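To illustrate the shape mismatch in the second question, the sketch below uses plain NumPy only (no Dataiku API). It shows that a (3,6,7) array can be flattened to 2D and restored without loss. Whether Dataiku would accept the flattened form and let the model reshape it back is exactly what I am unsure about, so please treat this as an illustration rather than a known workaround. The array contents are placeholder values.

```python
import numpy as np

num_samples, time_sequence, num_features = 3, 6, 7

# Placeholder 3D array standing in for the output of the preparation step.
X3d = np.arange(
    num_samples * time_sequence * num_features, dtype=np.float32
).reshape(num_samples, time_sequence, num_features)

# Flatten the time and feature axes into one: shape (3, 42).
X2d = X3d.reshape(num_samples, time_sequence * num_features)

# Restore the original layout, as a model's first layer might do.
restored = X2d.reshape(num_samples, time_sequence, num_features)

print(X2d.shape)                       # (3, 42)
print(np.array_equal(restored, X3d))   # True
```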

Operating system used: Red Hat Enterprise Linux


