Is it possible to make calls to OpenRefine Server from DSS?

Level 2
Hi,

Is it possible to make calls to DSS's OpenRefine server? We want to connect to the OpenRefine server that DSS is already using, from a "Python Recipe" instead of a "Prepare Recipe".

Please let us know how we can do it.

Waiting for your kind response.

Regards,

Samriddhi
4 Replies
Dataiker

Hi,

The clustering feature in a "Prepare" recipe does not update dynamically when the dataset changes, and it is designed only for small datasets that fit in memory.

For clustering large datasets on text with automated updates, we advise using a clustering recipe: https://doc.dataiku.com/dss/latest/machine_learning/unsupervised.html

Cheers,

Alex

Level 2
Alex,

Traditional clustering, e.g. unsupervised learning that is provided in DSS clustering recipes, is very different from the type of fuzzy matching that is being done here. The problems you describe above, though, are exactly why we need to be able to do this programmatically. The link that Sam posted: http://www.padjo.org/tutorials/open-refine/clustering/ shows the OpenRefine text facet clustering that appears to be the capability that DSS is leveraging within prepare recipes.

So the question is simply whether it is possible to make calls from Python to the OpenRefine server that we believe to be running with DSS (as this shows: https://doomicile.de/story/simple-text-analysis-using-python-identifying-named-entities-tagging-fuzzy-string-matching-and ), or whether we need to install our own OpenRefine server or seek a different programmatic solution.

Thank you for your time and help.

Best,
John
Dataiker
Hi John, Sam,
Thanks for the explanation. Access to the OpenRefine server included in DSS is not currently supported. I have relayed your request to our R&D team.
There are several ways to implement this in DSS.
1. Without code, with visual DSS features: use a clustering algorithm on vectorized text with a high number of clusters. I have used this successfully myself on several occasions; it works well for a moderate number of clusters (<300).
2. With code: many Python libraries offer fuzzy matching functionality. The closest one to your need would be https://github.com/OpenRefine/refine-client-py/blob/master/README.rst. This requires installing an OpenRefine server alongside Dataiku DSS. Otherwise, you can use the fuzzywuzzy Python library, which does not require installing OpenRefine.
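As a rough illustration of the "with code" option, here is a minimal sketch of fuzzy matching a term against a list of known spellings. It uses the standard library's `difflib` as a stand-in for fuzzywuzzy so that it runs without extra packages; the function name `match_to_known`, the sample terms, and the 0.8 cutoff are assumptions made for the example, not anything provided by DSS.

```python
import difflib


def match_to_known(term, known_terms, cutoff=0.8):
    """Return the known term most similar to `term`, or None if nothing is close enough.

    `cutoff` is a similarity ratio between 0 and 1 (hypothetical default).
    """
    matches = difflib.get_close_matches(term, known_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical reference list of correctly spelled terms.
known = ["Google Inc", "Microsoft Corp", "Apple Inc"]

best = match_to_known("Gogle Inc", known)    # close to "Google Inc"
nothing = match_to_known("zzzz", known)      # no candidate above the cutoff
```

Inside a Python recipe, a function like this could be applied to a column of a pandas DataFrame, provided a trusted list of reference terms exists.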
Hope it helps,
Alex
Level 2
Alex,

Thank you very much. Yes, I have successfully used the visual prepare recipe to merge about 3K clusters found from about 700K rows, but the browser becomes very unresponsive. Packages like fuzzywuzzy and fuzzyset are great for matching misspelled terms to a dictionary of known correct terms, but what we have here is a bit different. We have a big list of terms and have no idea which, if any, are spelled correctly, and we just need to cluster together the ones that likely refer to the same entity.
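For anyone finding this later, the dictionary-free case described above can be sketched with a key-collision approach in the spirit of OpenRefine's "fingerprint" method: every term is reduced to a normalized key, and terms that share a key form a cluster. This is a standard-library sketch under my own assumptions, not the code DSS or OpenRefine actually runs; the names `fingerprint` and `cluster_by_key` and the sample terms are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trimmed, lowercased, ASCII-folded, no punctuation,
    with unique tokens sorted alphabetically."""
    text = value.strip().lower()
    # Decompose accented characters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation (anything that is neither a word character nor whitespace).
    text = re.sub(r"[^\w\s]", "", text)
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster_by_key(values, key=fingerprint):
    """Group values that share the same key; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample terms with no reference dictionary.
terms = ["Acme Corp.", "acme corp", "Corp Acme", "ACME  Corp",
         "Widget Co", "Café Noir", "cafe noir"]
clusters = cluster_by_key(terms)
```

A key-collision pass only catches differences in case, punctuation, accents, and word order, and it runs in a single pass over the data, so it should scale to a few hundred thousand rows. Genuine misspellings need a distance-based (nearest-neighbor) method, which is more expensive.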

Thanks for the help and the github link. We'll check it out and get something working!

Best,
John