Reading varbinary from SQL Server into Python

Taylor
Taylor Registered Posts: 15 ✭✭✭✭

Hello,

I have a varbinary field in a MS SQL Server table that represents a document attachment (pdf, zip, xml, etc). Is there a way I can read that raw format into a Python Dataiku recipe? I've stumbled upon this [raw formatted data](https://doc.dataiku.com/dss/latest/python-api/datasets.html#dataiku.Dataset.raw_formatted_data) method of the `dataset` class, which I think might be key to what I'm trying to do. I've even looked at what appears to be the code repo for the Python dataiku package (https://github.com/dataiku/dataiku-api-client-python/blob/master/dataikuapi/dss/dataset.py) but was unable to find anything on this specific method.

Questions:

  • Am I on the right track?
  • Is there any more documentation on the `format` and `format_params` options of this method? I doubt that I want to use the `tsv-excel-no-header` option, but that is the only one I see listed in documentation or docstring of the method.
  • Is there a "raw" or "bytes" option for the "format" parameter since I'm dealing with binary of a document that I'm already able to handle from within Python if I can just convert it to a string of hex (or raw bytes object)?
  • Are there any other solutions to what I'm trying to do?

The current behavior when I read this data into Python from a DSS Managed MS SQL Table dataset using `get_dataframe()` is that I get a very small string of incorrectly encoded garbage str characters. I was hoping it would at least give me the whole blob of binary in a garbage str of characters, but instead, it's only about three characters worth of data.

**Note: I already have proven that I can convert the varbinary field into a varchar of hex data when going from SQL Table to SQL Table and then process that string of hex with Python. But for reasons that aren't relevant to this post, I need a solution for directly reading in this varbinary field into a DSS Python recipe.**

I definitely understand that this is an odd "edge-case" type of scenario, but just wanted to reach out to the community to see if anyone had any ideas!

Thanks!

Best Answer

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Answer ✓

    Hi,

    It isn't possible to do what you want.

    The raw_formatted_data method is about getting a stream of data for a dataset, but that dataset has already been parsed by Dataiku. On SQL datasets, there is no such notion as a "raw string of bytes from the database", since it will always have gone first through the JDBC driver in the server part of Dataiku. raw_formatted_data asks DSS to re-encode the data that the JDBC driver has already processed.

    DSS has no native support for binary fields, and the exact behavior when retrieving binary fields is database-dependent and JDBC-driver dependent, with most of the case printing only a pointer (the few hex garbage that you got) or nothing.

    Doing a SQL table -> SQL table sync first forces the database itself to transform the binary into a string field, which appears to be doing what you want (I guess base64-encoding it) - a contrario, reading the first SQL table puts the JDBC Driver in the loop, which produces the unwanted behavior.

    Thus, the only option that we can recommend is to keep doing your SQL table -> SQL table sync first in order to force the database to transform the binary field.

Answers

  • Taylor
    Taylor Registered Posts: 15 ✭✭✭✭
    That makes perfect sense! I suspected this would either not be possible or not worth the effort to figure out (in reference to this being database / JDBC driver dependent). Thanks for the quick and in-depth reply!
Setup Info
    Tags
      Help me…