Process multiple features through code in Visual Analysis

pratikgujral-sf
Level 2

Hi Community,

I need to develop a Deep Learning model on sequential data. My dataset has two features, Column-1 and Column-2, and both hold sequential data. Each cell contains a list whose values are in chronological order. Please see the reference image (1.PNG) for clarity.

 

For example, for the first record, the list [a,b,c,d,e,f] in Column-1 holds the values at six time points.

  • Data in Column-1 are text categories and have to be one-hot encoded.
  • Data in Column-2 are integers and need no encoding.

 

I need to prepare the data for an LSTM model. I already have a Python function that receives the DataFrame as input, transforms both Column-1 and Column-2, and returns a 3-dimensional NumPy array of shape (num_samples, time_sequence, num_features).

For the example dataset above, one-hot encoding of Column-1 creates 6 columns, and Column-2 contributes 1 more, giving 7 features per time point. With 3 records and 6 time points each, the final NumPy array has shape (3,6,7).
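To make the target shape concrete, here is a minimal sketch of what such a function could look like. It is not my actual function: the name prepare_data comes from this post, but the categories argument, the sample values, and the assumption that every cell already holds a Python list of equal length are illustrative only.

```python
import numpy as np
import pandas as pd


def prepare_data(df, categories):
    """Illustrative stand-in: build a (num_samples, time_sequence, num_features) array.

    Assumes every cell of Column-1 and Column-2 is a Python list and that
    all lists have the same length.
    """
    cat_index = {c: i for i, c in enumerate(categories)}
    num_samples = len(df)
    time_steps = len(df["Column-1"].iloc[0])
    num_features = len(categories) + 1  # one-hot columns plus the integer column

    out = np.zeros((num_samples, time_steps, num_features), dtype=np.float32)
    for i, (seq_cat, seq_int) in enumerate(zip(df["Column-1"], df["Column-2"])):
        for t, (cat, val) in enumerate(zip(seq_cat, seq_int)):
            out[i, t, cat_index[cat]] = 1.0  # one-hot encode Column-1
            out[i, t, -1] = val              # pass Column-2 through unchanged
    return out


# Made-up data with the same layout as the example: 3 records, 6 time points.
df = pd.DataFrame({
    "Column-1": [list("abcdef"), list("fedcba"), list("abcabc")],
    "Column-2": [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], [1, 1, 2, 2, 3, 3]],
})
X = prepare_data(df, categories=list("abcdef"))
print(X.shape)  # (3, 6, 7)
```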

2.PNG

I have the following questions:

  • Where do I place the function? In Visual Analysis -> Models -> Design -> Feature Handling, I can only transform one feature at a time. However, my prepare_data Python function takes the entire DataFrame, transforms both features, and returns a 3-D NumPy array. Where should I write a Python function that handles both columns in one go?
  • How do I let my function return a 3-D NumPy array and have Dataiku use it for training? The documentation (available here) says that the transform function of a custom preprocessor must return a 2-D NumPy array. However, because of the one-hot step for Column-1, each time point produces several features, which makes my NumPy array 3-dimensional.

Operating system used: Red Hat Enterprise Linux

1 Reply
JordanB
Dataiker

Hi @pratikgujral-sf,

That's correct: the transform method must return a pandas DataFrame, a 2-D numpy array, or a scipy.sparse.csr_matrix containing the preprocessed result. A single processor may output several numerical features, corresponding to several columns of the output. However, returning a 3-D array is not supported at the moment; it is currently a feature request. Please see the following document for details: https://doc.dataiku.com/dss/latest/machine-learning/features-handling/custom.html#implementing-a-cus...
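As a rough illustration of the 2-D contract described above, the sketch below shows a processor with fit and transform methods that turns one column of category sequences into a flat 2-D array with one column per (time point, category) pair. The class name FlatOneHotSequence, its categories argument, and the sample data are hypothetical, and the sketch assumes each cell already holds a Python list of equal length. Reshaping the flat output back to 3-D inside the model is a possible workaround only, not a documented feature.

```python
import numpy as np
import pandas as pd


class FlatOneHotSequence:
    """Hypothetical per-column processor: one-hot encode each time point, flattened to 2-D."""

    def __init__(self, categories):
        self.categories = list(categories)
        self.names = None  # output column names, filled in by fit

    def fit(self, series):
        # Assumption: all sequences have the same length as the first one.
        self.time_steps = len(series.iloc[0])
        self.names = [
            f"t{t}_{c}" for t in range(self.time_steps) for c in self.categories
        ]
        return self

    def transform(self, series):
        index = {c: i for i, c in enumerate(self.categories)}
        n_cat = len(self.categories)
        # 2-D output: (num_samples, time_steps * num_categories)
        out = np.zeros((len(series), self.time_steps * n_cat))
        for i, seq in enumerate(series):
            for t, cat in enumerate(seq):
                out[i, t * n_cat + index[cat]] = 1.0
        return out


# Made-up column with 3 records and 6 time points each.
col = pd.Series([list("abcdef"), list("fedcba"), list("abcabc")])
proc = FlatOneHotSequence(categories=list("abcdef")).fit(col)
flat = proc.transform(col)
print(flat.shape)  # (3, 36)

# The time axis can be recovered later with a plain reshape.
restored = flat.reshape(len(col), proc.time_steps, len(proc.categories))
print(restored.shape)  # (3, 6, 6)
```

Because the columns are laid out time point by time point, the reshape recovers the original (num_samples, time_sequence, num_features) ordering without any reordering step.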

Please let us know if you have any further questions.

Kind regards,

Jordan
