controlled vocabularies and data dictionaries

wormboy
Level 1
controlled vocabularies and data dictionaries

Forgive the newbie question. I am a data scientist, and I do a lot of curation, harmonization and ontology work on clinical datasets. A common, and time consuming, task involves coding or decoding data according to a data dictionary. Here is a short example. The sex and diagnosis variables in this dataset are encoded, and the dictionary provides the decoding information.

Variable / Field Name

Form Name

Section Header

Field Type

Field Label

Choices, Calculations, OR Slider Labels

subject_id

enrollment

 

text

Subject ID

 

enrollment_date

enrollment

 

text

Date of Enrollment

 

sex

enrollment

 

radio

Sex

1, Male | 2, Female

diagnosis

enrollment

 

radio

Diagnosis

1, Crohn's disease | 2, Ulcerative colitis | 3, Indeterminate colitis | 4, Control | 99, Unknown

 

If I import the associated dataset into DSS, it will apply a schema that is based on the coded values (Sex and Diagnosis columns will be assigned Meaning: Integer and have numeric values.) If I need to decode the data, I know I can do find&replace on a column-by-column basis using replace values extracted from this dictionary, but I routinely encounter datasets with thousands of variables, hundreds of which are coded by unique vocabularies. Is there a reasonable way decode a dataset and define the schema from such a dictionary? I can write a script that will do this, but then Dataiku isn't helping me save time on this task. I am brand spanking new to Dataiku, but I have spent a few hours noodling around in the documentation to no avail. Yet it seems like this should be a common enough use-case that there may be such a function built-in...

0 Kudos
4 Replies
SeanA
Community Manager
Community Manager

Hi @wormboy , there are likely some visual processors that can help with this kind of task.

One is the ability to apply certain processor steps to multiple columns. If you switch from the table view to the columns view (look for an icon near the top right), you can select multiple columns and apply the same step to all of them.

More specifically to what it sounds like you're trying here: try the "Switch case" processor. Click "Raw text edit", and you can copy-paste rules mapping sex -> Sex directly from a spreadsheet.

More broadly, you be want to look at this tutorial which introduces some other features like the "if-then-else" processor.

 

Dataiku
0 Kudos
wormboy
Level 1
Author

Welp, it seems that the "switch" function can only be applied to a single column. As I routinely deal with datasets with thousands of columns, this solution is not very helpful. 

I created valid "switch" strings for every needed column transformation in my current dataset (there are 392 distinct columns that need to be remapped). I see two shortcomings though... 1) I can't apply the same formula to multiple columns at once, and 2) a formula with multiple "switch" commands is invalid.

I was hoping I could apply multiple switch functions to each column such as:

switch(sex, 1, "Male", 2, "Female")
switch(ibd_diagnosis, 1, "Crohn's Disease", 2, "Ulcerative Colitis", 3, "Indeterminate Colitis")

...seeing as the column name is part of the switch function, you might expect each column would only execute the switch commands that included their column name. But no! A function that is passed is malformed. The "s" in the word "switch" on line 2 is highlighted and  tagged with the following error:

Incorrect formula: Unidentified token "switch". (Parsing error at offset 36), caused by: ParsingException: Unidentified token "switch". (Parsing error at offset 36)

Is this transformation possible with a script in Dataiku? Or is this ultimately the wrong tool for this sort of data wrangling?

0 Kudos
wormboy
Level 1
Author

Yep. This was helpful. It looks like the

switch(variable_name, "1", "Male", "2", "Female")

functionality should be fairly easy to implement and is just what I was looking for. Thanks for the pointer!

0 Kudos
wormboy
Level 1
Author

nope.

0 Kudos