controlled vocabularies and data dictionaries

Options
wormboy
wormboy Registered Posts: 4

Forgive the newbie question. I am a data scientist, and I do a lot of curation, harmonization and ontology work on clinical datasets. A common, and time consuming, task involves coding or decoding data according to a data dictionary. Here is a short example. The sex and diagnosis variables in this dataset are encoded, and the dictionary provides the decoding information.

Variable / Field Name

Form Name

Section Header

Field Type

Field Label

Choices, Calculations, OR Slider Labels

subject_id

enrollment

text

Subject ID

enrollment_date

enrollment

text

Date of Enrollment

sex

enrollment

radio

Sex

1, Male | 2, Female

diagnosis

enrollment

radio

Diagnosis

1, Crohn's disease | 2, Ulcerative colitis | 3, Indeterminate colitis | 4, Control | 99, Unknown

If I import the associated dataset into DSS, it will apply a schema that is based on the coded values (Sex and Diagnosis columns will be assigned Meaning: Integer and have numeric values.) If I need to decode the data, I know I can do find&replace on a column-by-column basis using replace values extracted from this dictionary, but I routinely encounter datasets with thousands of variables, hundreds of which are coded by unique vocabularies. Is there a reasonable way decode a dataset and define the schema from such a dictionary? I can write a script that will do this, but then Dataiku isn't helping me save time on this task. I am brand spanking new to Dataiku, but I have spent a few hours noodling around in the documentation to no avail. Yet it seems like this should be a common enough use-case that there may be such a function built-in...

Answers

  • Sean
    Sean Dataiker, Alpha Tester, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer Posts: 168 Dataiker
    Options

    Hi @wormboy
    , there are likely some visual processors that can help with this kind of task.

    One is the ability to apply certain processor steps to multiple columns. If you switch from the table view to the columns view (look for an icon near the top right), you can select multiple columns and apply the same step to all of them.

    More specifically to what it sounds like you're trying here: try the "Switch case" processor. Click "Raw text edit", and you can copy-paste rules mapping sex -> Sex directly from a spreadsheet.

    More broadly, you be want to look at this tutorial which introduces some other features like the "if-then-else" processor.

  • wormboy
    wormboy Registered Posts: 4
    Options

    Yep. This was helpful. It looks like the

    switch(variable_name, "1", "Male", "2", "Female")

    functionality should be fairly easy to implement and is just what I was looking for. Thanks for the pointer!

  • wormboy
    wormboy Registered Posts: 4
    Options
  • wormboy
    wormboy Registered Posts: 4
    Options

    Welp, it seems that the "switch" function can only be applied to a single column. As I routinely deal with datasets with thousands of columns, this solution is not very helpful.

    I created valid "switch" strings for every needed column transformation in my current dataset (there are 392 distinct columns that need to be remapped). I see two shortcomings though... 1) I can't apply the same formula to multiple columns at once, and 2) a formula with multiple "switch" commands is invalid.

    I was hoping I could apply multiple switch functions to each column such as:

    switch(sex, 1, "Male", 2, "Female")
    switch(ibd_diagnosis, 1, "Crohn's Disease", 2, "Ulcerative Colitis", 3, "Indeterminate Colitis")

    ...seeing as the column name is part of the switch function, you might expect each column would only execute the switch commands that included their column name. But no! A function that is passed is malformed. The "s" in the word "switch" on line 2 is highlighted and tagged with the following error:

    Incorrect formula: Unidentified token "switch". (Parsing error at offset 36), caused by: ParsingException: Unidentified token "switch". (Parsing error at offset 36)

    Is this transformation possible with a script in Dataiku? Or is this ultimately the wrong tool for this sort of data wrangling?

Setup Info
    Tags
      Help me…