In a visual prepare recipe, there is a Python function step that lets you run real Python, so that libraries can be used on a row of data without having to go out and create a separate Python recipe.
I'm working on cleaning up some phone numbers. So far I have a function, listed below, that works. However, it is not very performant, processing about 170 thousand phone numbers per minute. It also misses a few edge cases where more than one phone number appears in a single phone-number cell.
The current process already does a good job of eliminating non-phone-number text from the phone number columns (like when a person's name appears in the phone column alongside the phone number), and it recognizes phone numbers that are bogus.
I'm hoping to do a bit better and learn something in the process.
I'm submitting two columns 'MailingCountryCode' and 'Phone_No'. Here is the setup of the visual recipe:
Here is the code snippet that I'm currently using.
```python
# the library phonenumbers was added to my code environment
import phonenumbers

def process(row):
    country = row['MailingCountryCode'] or "US"
    for match in phonenumbers.PhoneNumberMatcher(row['Phone_No'], country):
        return phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.NATIONAL)
```
First of all: yes, I know that the for loop only runs once and returns just the first valid phone number in the phone number column. And yes, in a small number of cases I do have multiple phone numbers in a single Phone_No cell. So it would be great to return more than one output row in those cases where a single input cell contains two or more phone numbers.
I'm working on cleaning up this code; however, I cannot find any example of a successful vectorized Python operation in a visual recipe. I've not found anything in the knowledge base.
I'm wondering if anyone would be willing to share examples of code and visual recipes you are using. This is complicated enough to set up that we could use a visual cookbook with some examples, so any examples folks want to share would be helpful, even if they're not about phone numbers.
Thanks in advance for any insights you can share.
Activating vectorized processing means that your `process()` Python function will receive a dict of column name to Pandas Series instead of a dict of column name to (string) cell value.
A simple UDF to split a string (on a regex):
```python
def process(rows):
    return pd.Series(rows["Name"].str.split("[., ]+"), index=rows.index)
```
In your case, you want to (sometimes) produce additional rows, so you need the "rows" mode in the UDF and must produce a Pandas DataFrame. Here is an example that multiplies each row by the number of parts in the Name column (each part stored in Name_part):
```python
def process(rows):
    df = pd.DataFrame(index=rows.index)
    for k in rows:
        if not k.endswith('_part'):
            df[k] = rows[k]
    split_names = df["Name"].str.split("[., ]+")
    return df.join(split_names.explode(), rsuffix="_part")
```
Note the exclusion of the Name_part column when converting the `rows` input dict to the `df` dataframe: once the code has produced it, the added column comes back in the input, so if it's not excluded the output would hold duplicate columns (and lead to non-obvious errors).
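Outside DSS, the rows-mode pattern above can be exercised directly with a plain dict of Series standing in for the input batch; a minimal standalone sketch (the `id`/`Name` columns and values are made up, and it assumes a pandas version with `Series.explode`, i.e. ≥ 0.25):

```python
import pandas as pd

# Hypothetical input batch, shaped like what DSS passes in "rows" mode:
# a mapping of column name -> Pandas Series sharing one index
rows = {
    "id": pd.Series([10, 20]),
    "Name": pd.Series(["Smith, John", "Doe"]),
}

def process(rows):
    df = pd.DataFrame(index=rows["Name"].index)
    for k in rows:
        if not k.endswith("_part"):
            df[k] = rows[k]
    # A multi-character pattern is treated as a regex by str.split
    split_names = df["Name"].str.split("[., ]+")
    # explode() turns each list element into its own row, repeating the index,
    # so the join duplicates the original row once per name part
    return df.join(split_names.explode(), rsuffix="_part")

out = process(rows)
```

Here `"Smith, John"` fans out into two output rows (Name_part "Smith" and "John") while `"Doe"` stays a single row, with the `id` column repeated accordingly.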
PS: technically, `process()` doesn't get a true `dict` but a lazy-loading version of it, to avoid shuffling all the columns back and forth to the Python subprocess.
Thank you for this feedback.
Because of the lack of examples in the documentation, I'm wondering if others would be willing to share more examples of Python functions you use in Dataiku visual recipes. It would be great to gather together other examples of using this super powerful feature.
I've done a bit of testing of your first example. It seems to need a small update:
```python
import pandas as pd

def process(rows):
    return pd.Series(rows["Customer_Name"].str.split("[., ]+"), index=rows.index)
```
My column name is now a bit different. I changed to "Customer_Name". Not a real issue there.
However, that did not fully solve things for me: I seem to have to import pandas within the context of the recipe step. Is that the case? What impact on performance does re-importing pandas potentially thousands of times have?
This was tested in a DSS V8.0.2 instance.
Here is what I have so far to clean up the capitalization of names.
Here is the code segment.
```python
from nameparser import HumanName
import pandas as pd

def Cap_Name(text):
    name = HumanName(text)
    name.capitalize(force=True)
    return str(name)

def process(rows):
    return pd.Series(rows["Customer_Name"].apply(Cap_Name), index=rows.index)
```
Is there any more efficient way of doing this?
The output of `rows["Customer_Name"].apply(Cap_Name)` should already be a pd.Series, so you probably don't need to wrap it explicitly in `pd.Series(...)`. In terms of performance, you probably won't get more out of vectorization; it will all depend on what Pandas achieves for the `.apply(...)` on the Cap_Name function.
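For comparison, if a cruder capitalization rule is acceptable, Pandas' built-in vectorized string methods avoid the per-row Python call entirely; a sketch with made-up sample names (this is not equivalent to nameparser, which understands suffixes, particles, and the like):

```python
import pandas as pd

names = pd.Series(["JOHN SMITH", "mary o'brien"])

# .str.title() is a vectorized string operation with no per-row Python
# function call; it blindly capitalizes the letter after any non-letter
# character, unlike nameparser's name-aware capitalization
titled = names.str.title()
```

Whether the speed-up is worth the loss of name-awareness (e.g. "Macdonald" vs "MacDonald", "Van Der Berg") depends on the data.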
Dropping the extra `pd.Series(...)` wrapper did not break anything. However, it also did not improve performance.
That said, the template above is not too bad for changing a single column with the results of a Python library, without having to go to a separate Python recipe.
Do you have any sense of the performance difference one might get between a visual recipe with a Python processor step vs. a standalone Python recipe? I've got to imagine that there is at least some overhead in embedding Python in a visual recipe.
The next step is to work out the one-to-many row variant of this question.
(sorry for the late reply)
No need to worry about `import pandas` being run many times: the Python code is run only once, to retrieve the `process()` function object, and it's the function object that is called thousands of times with rows (or row batches). It would be a different story if the `import pandas` were located inside the function, though.
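A quick way to see this outside DSS: module-level statements execute once, when the code is loaded, while only the function body runs per call. A small standalone sketch with a made-up counter:

```python
# Module-level code: runs exactly once, like `import pandas` at the top
# of a recipe step
SETUP_CALLS = []
SETUP_CALLS.append("setup")  # simulated one-time initialization

def process(row):
    # Function body: runs once per row (or per batch); the module-level
    # setup above is never re-executed between calls
    return row["x"] * 2

# Calling process() a thousand times does not repeat the setup
results = [process({"x": i}) for i in range(1000)]
```

This is why a heavy import, compiled regex, or loaded model belongs at module level rather than inside `process()`.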
Using this approach saves me about 1/3 of the compute time given my current setup.
Is this amount of performance improvement what one should be expecting?
Can I do better?
For example, if I could figure out how to get rid of the second function that I'm "applying" to the dataset and make things a single function, would I expect significant performance improvements? (My understanding is that going through a for loop is rather expensive; however, I've got to imagine that making a call to a function is also expensive.)
The Python processor is a convenience tool; for raw speed I'd recommend a Spark/Scala recipe, or thoroughly vectorizing your code (but I don't see a function for capitalizing names being a good candidate).