Improvements to how Arrays are used in DSS

Jason

This request comes after reading a post by user @Grixis6 and realizing I have many of the same issues with arrays. I don't want to make this a dumping ground of ideas, but since they are all related to arrays, I'll start by putting them here. If it is more appropriate to separate these out, please let me know and I will do so. So without further ado, here are some suggestions related to arrays:

1) Labels are important. In Python, an array is just a list of values, e.g. [1, 4, 5.55, 17], but these are only half as nice as the labeled data available through a Python dictionary. Please support the Python dictionary as a labeled version of the same arrays. This would allow a user to pass values such as {"week": 1, "Day": 4, "height": 5.55, "delta": 17}. Representing data this way would make the interface more informative and intuitive (e.g. "Most Important Variables" shows "Element #815" when it could be showing me the name of that data point).
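To make the difference concrete, here is a minimal sketch in plain Python. It is not DSS code: `describe_element` is a hypothetical helper that only illustrates how a display name could fall back to "Element #N" when no labels are available.

```python
# Values and labels taken from the example in item 1.
values = [1, 4, 5.55, 17]
labels = ["week", "Day", "height", "delta"]

# The labeled form of the same array: a dictionary keyed by label.
labeled = dict(zip(labels, values))


def describe_element(index, labels=None):
    """Return a display name for the array position `index`.

    Hypothetical helper: uses the label when one exists,
    otherwise falls back to a positional name.
    """
    if labels is not None and 0 <= index < len(labels):
        return labels[index]
    return f"Element #{index}"


print(labeled)
print(describe_element(2, labels))  # label is known
print(describe_element(815))        # no labels, positional name only
```

With labels attached, position 2 can be reported as "height" instead of an anonymous element number.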

2) Arrays during visual analysis: It is currently not possible to perform any meaningful analytics on arrays. Arrays are treated differently depending on where you are in the interface, but in most cases the individual values within the array are not accessible. It would be very useful to be able to cherry-pick values or sets of values from the array (think Python array slices). It would also be useful to have a suite of analyses that could be performed and visualized on the whole array (e.g. if the array is fully numeric, use array position as x and value as y). This would allow sums, averages, heatmaps, etc. that operate in 2D space. So far I have not found any analysis tools in DSS that work on arrays, either by looking directly in the UI or by searching the Dataiku docs.
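For reference, this is roughly the kind of operation meant by slicing and whole-array analysis, sketched with the Python standard library only. The `spectrum` values are made up for illustration and nothing here uses a DSS API.

```python
from statistics import mean

# Made-up numeric array standing in for one array-valued cell.
spectrum = [0.12, 0.15, 0.40, 0.95, 0.60, 0.20, 0.10]

# Cherry-pick a contiguous window and a strided subset with slices.
window = spectrum[2:5]      # positions 2, 3 and 4
every_other = spectrum[::2]  # positions 0, 2, 4 and 6

# Use array position as x and value as y, ready for a 2D plot.
points = list(enumerate(spectrum))

# Simple aggregates over the selected window.
summary = {
    "sum": sum(window),
    "mean": mean(window),
    "max": max(window),
}

print(window, every_other)
print(points[:3])
print(summary)
```

The `points` list of (position, value) pairs is the form a line chart or heatmap row would need.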

3) Arrays as targets of models. I have a series of samples from a laboratory instrument (spectrum data from FTIR). Each sample contains 3500 numeric values as floats. From this data I want to predict the presence of analytes (say 25 or so distinct analytes), and furthermore I want to know the predicted percentage of each (this would be like getting the raw values of the final softmax). The current UI only allows me to pick a single target. Therefore, I cannot perform true multi-label prediction (by one-hot-encoding the target classes), and similarly, I cannot pick an array as a target for regression, which could plausibly give me the numeric output for the relative percentages. It occurs to me that a hacky way to accomplish this may be to re-encode the data stream into a 1x3500 pixel image and try image classification, but it's unclear to me at this time how to trick the bounding box system into doing what I want, and even then I am left without a way to convey the relative proportions.
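As a rough illustration of the two array-shaped targets described here, the sketch below builds a multi-label target vector and turns raw model scores into relative fractions with a softmax. The analyte names and scores are hypothetical, only three analytes are shown instead of 25, and using a softmax assumes the fractions should sum to 1. It is plain Python, not a DSS feature.

```python
import math

# Hypothetical analyte names; a real problem would have ~25 of these.
ANALYTES = ["analyte_a", "analyte_b", "analyte_c"]


def multi_hot(present, classes):
    """Encode the set of analytes present as a 0/1 vector (multi-label target)."""
    return [1 if name in present else 0 for name in classes]


def softmax(scores):
    """Convert raw scores into fractions that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Multi-label target: analytes a and c are present in this sample.
target_labels = multi_hot({"analyte_a", "analyte_c"}, ANALYTES)

# Made-up raw scores from a model's final layer, one per analyte.
fractions = softmax([2.0, 0.5, 1.0])

print(dict(zip(ANALYTES, target_labels)))
print(dict(zip(ANALYTES, fractions)))
```

Both `target_labels` and `fractions` are arrays with one entry per analyte, which is why a single-column target cannot express them.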

