Hello,
I have a varbinary field in an MS SQL Server table that represents a document attachment (pdf, zip, xml, etc.). Is there a way I can read that raw format into a Python Dataiku recipe? I've stumbled upon the [raw_formatted_data](https://doc.dataiku.com/dss/latest/python-api/datasets.html#dataiku.Dataset.raw_formatted_data) method of the `Dataset` class, which I think might be key to what I'm trying to do. I've even looked at what appears to be the code repo for the Python dataiku package (https://github.com/dataiku/dataiku-api-client-python/blob/master/dataikuapi/dss/dataset.py) but was unable to find anything on this specific method.
Questions:
Currently, when I read this data into Python from a DSS Managed MS SQL Table dataset using `get_dataframe()`, I get only a very short string of incorrectly encoded garbage characters. I was hoping it would at least give me the whole blob of binary as a garbage string, but instead I get only about three characters' worth of data.
**Note: I have already proven that I can convert the varbinary field into a varchar of hex data when going from SQL table to SQL table, and then process that string of hex with Python. But for reasons that aren't relevant to this post, I need a solution for reading this varbinary field directly into a DSS Python recipe.**
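For reference, the hex-processing step mentioned above boils down to `bytes.fromhex`; the hex value and output filename below are just placeholders for illustration:

```python
# Decode a hex string (as produced by a SQL varbinary -> varchar
# conversion) back into the original bytes, then write it to disk.
hex_payload = "255044462d312e34"  # example: the bytes of "%PDF-1.4"

# Strip a leading "0x" if the database conversion included one.
raw_bytes = bytes.fromhex(hex_payload.removeprefix("0x"))

# Write the reconstructed binary payload to a file.
with open("attachment.pdf", "wb") as f:
    f.write(raw_bytes)
```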
I definitely understand that this is an odd "edge-case" type of scenario, but just wanted to reach out to the community to see if anyone had any ideas!
Thanks!
Hi,
It isn't possible to do what you want.
The raw_formatted_data method is about getting a stream of data for a dataset, but that data has already been parsed by Dataiku. On SQL datasets, there is no such notion as a "raw string of bytes from the database", since the data will always have passed first through the JDBC driver in the server part of Dataiku. raw_formatted_data asks DSS to re-encode data that the JDBC driver has already processed.
DSS has no native support for binary fields, and the exact behavior when retrieving them is database- and JDBC-driver-dependent, in most cases returning only a pointer (the few garbage hex characters you got) or nothing at all.
Doing a SQL table -> SQL table sync first forces the database itself to transform the binary field into a string field, which appears to do what you want (presumably encoding it as hex or base64); by contrast, reading the first SQL table directly puts the JDBC driver in the loop, which produces the unwanted behavior.
Thus, the only option we can recommend is to keep doing your SQL table -> SQL table sync first, in order to force the database to transform the binary field.
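As a sketch of that workflow on the Python side: once the sync has produced a string column of hex, decoding each row back into bytes is a one-liner with pandas. The dataset and column names below (`documents_synced`, `attachment_hex`) are assumptions for illustration, not names from your project:

```python
import pandas as pd

# Simulated output of the SQL -> SQL sync: each row holds the
# attachment as a hex string instead of a raw varbinary.
# In a real DSS recipe, df would come from
# dataiku.Dataset("documents_synced").get_dataframe().
df = pd.DataFrame({"attachment_hex": ["255044462d312e34", "504b0304"]})

# Decode every hex string back into the original binary payload.
df["attachment_bytes"] = df["attachment_hex"].map(bytes.fromhex)

print(df["attachment_bytes"].tolist())
# -> [b'%PDF-1.4', b'PK\x03\x04']
```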