Process multiple features through code in Visual Analysis
Hi Community,
I need to develop a Deep Learning model on sequential data. My dataset has two features, Column1 and Column2, and both hold sequential data. Each cell contains a list whose values are in chronological order. Please check the reference image below for clarity.
For example, for the first record, the list [a,b,c,d,e,f] in Column1 holds the values at six time points.
 Data in Column1 are text categories and have to be one-hot encoded.
 Data in Column2 are integers, and no encoding is needed.
I need to prepare the data for an LSTM model. I already have a Python function that receives the DataFrame as input, transforms both Column1 and Column2, and returns a 3-dimensional NumPy array of shape (num_samples, time_sequence, num_features).
For the example dataset above, one-hot encoding Column1 creates 6 columns. Together with the single Column2 value, that gives 7 features per time step, so the final NumPy array would have shape (3,6,7).
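To make the target shape concrete, here is a minimal sketch of what such a function could look like. It is not my actual prepare_data; the sample values, the `categories` argument and the assumption that every sequence has the same length are made up for illustration.

```python
import numpy as np
import pandas as pd


def prepare_data(df, categories):
    """Build a (num_samples, time_sequence, num_features) array.

    Hypothetical sketch: assumes every list in Column1 and Column2
    has the same length and that `categories` lists every Column1 value.
    """
    index = {cat: i for i, cat in enumerate(categories)}
    num_samples = len(df)
    time_steps = len(df["Column1"].iloc[0])
    # One slot per category, plus one slot for the raw Column2 integer.
    out = np.zeros((num_samples, time_steps, len(categories) + 1), dtype=np.float32)
    for i, (cats, nums) in enumerate(zip(df["Column1"], df["Column2"])):
        for t, (cat, num) in enumerate(zip(cats, nums)):
            out[i, t, index[cat]] = 1.0  # one-hot encoding of Column1
            out[i, t, -1] = num          # Column2 passed through unchanged
    return out


# Made-up data with 3 records, 6 time points and 6 categories.
df = pd.DataFrame({
    "Column1": [list("abcdef"), list("fedcba"), list("abcabc")],
    "Column2": [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], [1, 1, 2, 2, 3, 3]],
})
X = prepare_data(df, categories=list("abcdef"))
print(X.shape)
```

With 3 records, 6 time points and 6 categories plus Column2, the printed shape is (3, 6, 7).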
I have the following questions:
 Where do I place the function? In Visual Analysis > Models > Design > Feature Handling, I can only transform one feature at a time. However, my prepare_data Python function takes in the entire DataFrame, transforms both features and returns a 3D NumPy array. Where should I write a Python function that handles both columns in one go?
 How do I let my function return a 3D NumPy array and have Dataiku use it for training? The documentation (available here) says that the transform function of a custom preprocessor must return a 2D NumPy array. However, because my Python code one-hot encodes Column1, each time step carries several features, which makes my NumPy array 3-dimensional.
Operating system used: Red Hat Enterprise Linux
Answers

JordanB (Dataiker)
Hi @pratikgujralsf,
That's correct: the transform method must return a pandas DataFrame, a 2D numpy array, or a scipy.sparse.csr_matrix containing the preprocessed result. A single processor may output several numerical features, corresponding to several columns of the output. However, returning a 3D array is not supported at the moment and is currently a feature request. Please see the following document for details: https://doc.dataiku.com/dss/latest/machinelearning/featureshandling/custom.html#implementingacustomprocessor
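As an illustration of the 2D constraint, here is a rough sketch of a processor-style class that turns one list-valued column into a 2D array by flattening the time and feature axes. Treat it as an assumption-laden example rather than a supported recipe: the class name `SequenceFlattener` and its `categories` argument are hypothetical, it assumes each cell already arrives as a Python list of equal length (in practice the cells may arrive as strings that need parsing first), and any reshaping back to 3D would have to happen in your own model code.

```python
import numpy as np
import pandas as pd


class SequenceFlattener:
    """Hypothetical processor: one list-valued column -> 2D array."""

    def __init__(self, categories):
        self.categories = list(categories)

    def fit(self, series):
        # Nothing to learn here: the categories are supplied up front.
        pass

    def transform(self, series):
        index = {c: i for i, c in enumerate(self.categories)}
        rows = []
        for seq in series:
            onehot = np.zeros((len(seq), len(index)), dtype=np.float32)
            onehot[np.arange(len(seq)), [index[c] for c in seq]] = 1.0
            # Flatten (time, category) into a single row of time * category values.
            rows.append(onehot.ravel())
        return np.vstack(rows)  # shape: (num_samples, time * num_categories)


# Made-up data with 3 records and 6 time points.
series = pd.Series([list("abcdef"), list("fedcba"), list("abcabc")])
flattener = SequenceFlattener(categories=list("abcdef"))
flattener.fit(series)
X2d = flattener.transform(series)

# Plain NumPy reshape that recovers the time axis afterwards.
X3d = X2d.reshape(len(series), 6, 6)
print(X2d.shape, X3d.shape)
```

The flattened output has one row per record, so it satisfies the 2D requirement, and because `ravel` and `reshape` use the same row-major order, the reshape recovers the original time steps exactly.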
Please let us know if you have any further questions.
Kind regards,
Jordan