Advice for working around MSSQL limitations

Jason
Level 4

I'm very new to Dataiku, so maybe there is already an answer to this (I searched and could not find one).

I am working with a wide data set (3,500 columns of integers that come from a TFIR spectrum analyzer) and I wanted to use MSSQL to store the intermediate data. However, SQL Server screams at me when I try to create a table wider than 1,024 columns. What is the best practice for dealing with super-wide data like this?

I am using a custom Python script to extract the data from the proprietary file format, so I can imagine a few options:

a) As part of the extraction script, break the data into sets of 1,000 columns and write them out to separate temporary data stores (roughly sketched below).
b) Find a pre-processing scheme and only store the valuable data (we only care about the peaks and their magnitudes).
c) Store it as a blob of some kind (like a JSON string) so it can live in a varchar field. This limits the ability to detect trends in the data directly and requires a decode step later, when the data is passed to a downstream ML process.
d) Something else I'm not seeing as an option right now.
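
For option (a), this is roughly the kind of thing I have in mind (column names and the connection string are just placeholders, not my real setup):

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# Stand-in for the frame my extraction script produces:
# one 'sample_id' key plus ~3,500 integer spectrum columns.
n_cols = 3500
df = pd.DataFrame(
    np.random.randint(0, 100, size=(5, n_cols)),
    columns=[f"col_{i}" for i in range(n_cols)],
)
df.insert(0, "sample_id", range(5))

engine = create_engine("mssql+pyodbc://user:pass@my_dsn")  # placeholder connection

CHUNK = 1000  # stay safely under SQL Server's 1,024-column limit per table
data_cols = [c for c in df.columns if c != "sample_id"]

for i in range(0, len(data_cols), CHUNK):
    part = ["sample_id"] + data_cols[i:i + CHUNK]
    # Each 1,000-column slice becomes its own table, joinable later on sample_id.
    df[part].to_sql(f"spectrum_part_{i // CHUNK}", engine, if_exists="replace", index=False)
```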

Any help would be appreciated.  Thanks.

Manuel
Dataiker Alumni

Hi Jason,

Assuming your rows have a key, can you fold the data into a long format, rather than the wide format you seem to be using now? Our Fold processor does exactly this.
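
If it helps to see the shape change outside the Flow, here is a tiny pandas illustration of the same wide-to-long fold (column names are made up, and this is just a sketch of the idea, not what the processor does internally):

```python
import pandas as pd

# Wide: one row per sample, one column per measurement (made-up names).
wide = pd.DataFrame({
    "sample_id": [1, 2],
    "col_400":   [12, 7],
    "col_401":   [15, 9],
    "col_402":   [11, 30],
})

# Long: one row per (sample, measurement) pair -- always just 3 columns,
# no matter how many measurements there are, so the column limit goes away.
long_df = wide.melt(id_vars="sample_id", var_name="measurement", value_name="intensity")
print(long_df)
```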

The Window recipe may then help you with any calculations you need per key value.
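
As a rough illustration of the kind of per-key calculation a Window recipe partitioned on the key could do, again just a pandas sketch on the long format above with made-up values:

```python
import pandas as pd

# Long-format sketch: one row per (sample, measurement) pair.
long_df = pd.DataFrame({
    "sample_id":   [1, 1, 1, 2, 2, 2],
    "measurement": ["col_400", "col_401", "col_402"] * 2,
    "intensity":   [12, 15, 11, 7, 9, 30],
})

# Rank intensities within each sample and keep the strongest ones (the peaks).
long_df["rank"] = long_df.groupby("sample_id")["intensity"].rank(ascending=False)
peaks = long_df[long_df["rank"] <= 2]
print(peaks)
```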

I hope this helps.
