Process multiple features through code in Visual Analysis
Hi Community,
I need to develop a deep learning model on sequential data. My dataset has two features, Column-1 and Column-2, and both contain sequential data. Each cell holds a list whose values are in chronological order. Please check the reference image below for clarity.
For example, for the first record, the list [a,b,c,d,e,f] in Column-1 holds the values at six time points.
- Data in Column-1 has text categories and has to be one-hot encoded.
- Data in Column-2 are integers, and no encoding is to be done.
I need to prepare the data for an LSTM model, for which I already have a Python function that receives the DataFrame as input, transforms both Column-1 and Column-2, and returns a 3-dimensional NumPy array of shape (num_samples, time_sequence, num_features).
For the example dataset above, one-hot encoding of Column-1 creates 6 columns; together with the single integer feature from Column-2, that gives 7 features per time step, so the final NumPy array would have shape (3, 6, 7).
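To make the target shape concrete, here is a minimal sketch of what such a preparation function could look like. It is not the actual function from my project: the name `prepare_data` matches the one mentioned in this post, but the `categories` parameter, the sample values (other than the first Column-1 list), and the assumption that every list has the same length are all illustrative.

```python
import numpy as np
import pandas as pd


def prepare_data(df, categories):
    """Return an array of shape (num_samples, time_steps, num_features).

    Assumes every list in Column-1 and Column-2 has the same length and that
    every Column-1 value appears in `categories` (hypothetical parameter).
    """
    cat_index = {c: i for i, c in enumerate(categories)}
    num_samples = len(df)
    time_steps = len(df["Column-1"].iloc[0])
    # One slot per category (one-hot) plus one slot for the Column-2 integer.
    out = np.zeros((num_samples, time_steps, len(categories) + 1), dtype=np.float32)
    for i, (seq_cat, seq_int) in enumerate(zip(df["Column-1"], df["Column-2"])):
        for t, (c, v) in enumerate(zip(seq_cat, seq_int)):
            out[i, t, cat_index[c]] = 1.0  # one-hot position for the category
            out[i, t, -1] = v              # raw integer kept as the last feature
    return out


# Hypothetical data: only the first Column-1 list comes from the post.
df = pd.DataFrame({
    "Column-1": [list("abcdef"), list("fedcba"), list("abcabc")],
    "Column-2": [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], [0, 1, 0, 1, 0, 1]],
})
X = prepare_data(df, categories=list("abcdef"))
print(X.shape)  # (3, 6, 7)
```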
I have the following questions:
- Where do I place the function? In Visual Analysis -> Models -> Design -> Feature Handling, I can only see the ability to transform one feature at a time. However, my prepare_data Python function takes in the entire DataFrame, transforms both the features and returns a 3d Numpy array. Where should I write my Python function that handles both Columns in one go?
- How do I let my function return a 3-D NumPy array and have Dataiku use it for training? The documentation (available here) mentions that in a custom preprocessor, the transform function must return a 2-D NumPy array. However, as explained, my Python code one-hot encodes Column-1 at every time step, so each sample becomes a (time_sequence, num_features) matrix and the full output is 3-dimensional.
Operating system used: Red Hat Enterprise Linux
Answers
JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 297 Dataiker
Hi @pratikgujral-sf,
That's correct: the transform method must return a pandas DataFrame, a 2-D numpy array, or a scipy.sparse.csr_matrix containing the preprocessed result. A single processor may output several numerical features, corresponding to several columns of the output. Returning a 3-D array, however, is not supported at the moment and is currently a feature request. Please see the following document for details: https://doc.dataiku.com/dss/latest/machine-learning/features-handling/custom.html#implementing-a-custom-processor
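As an illustration of the 2-D constraint only, and not an official Dataiku recommendation, one possible workaround is to flatten the time and feature axes inside the processor and restore the 3-D shape later with a reshape. The sketch below assumes a class with `fit` and `transform` methods that receive a single column as a pandas Series; the class name `FlattenedOneHotSequence`, its attributes, and the assumption that each cell already holds an equal-length Python list are hypothetical. In a real dataset the lists may arrive as strings and need parsing first.

```python
import numpy as np
import pandas as pd


class FlattenedOneHotSequence:
    """Hypothetical processor: one-hot encodes a column of equal-length lists
    and returns a 2-D array of shape (num_samples, time_steps * num_categories)."""

    def fit(self, series):
        # Learn the category vocabulary and the sequence length from the data.
        self.categories_ = sorted({c for seq in series for c in seq})
        self.time_steps_ = len(series.iloc[0])
        return self

    def transform(self, series):
        index = {c: i for i, c in enumerate(self.categories_)}
        out = np.zeros((len(series), self.time_steps_, len(index)), dtype=np.float32)
        for i, seq in enumerate(series):
            for t, c in enumerate(seq):
                out[i, t, index[c]] = 1.0  # raises KeyError on unseen categories
        # Collapse (time_steps, num_categories) into one axis to stay 2-D.
        return out.reshape(len(series), -1)


# Hypothetical sample column, used only to show the round trip.
col1 = pd.Series([list("abcdef"), list("fedcba"), list("abcabc")])
proc = FlattenedOneHotSequence().fit(col1)
flat = proc.transform(col1)
print(flat.shape)  # (3, 36)

# Inside the model code, the sequence structure can be recovered:
restored = flat.reshape(len(col1), proc.time_steps_, len(proc.categories_))
print(restored.shape)  # (3, 6, 6)
```

Because NumPy reshapes in row-major order by default, the second reshape undoes the first exactly, so the model sees the same (sample, time step, feature) layout that the 3-D array would have had.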
Please let us know if you have any further questions.
Kind regards,
Jordan