How to parse / split the date column without split column or date parser processor with a formula

Tsurapaneni
Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

Hi Team,

Hope you are all doing good during Pandemic !

I have got a problem where I have to convert the date column with in the format " YYYY-MM-DDT00:00:00.00Z" to "YYYY-MM-DD" removing the timestamp using a formula as I cannot use a Date parser or a split column processors as the engine is SQL and has a limitation to use these processors (typically split should work but it is showing an error).

Also, can you please help to replace the column values which begins with "00+" with "XY" and the values which are blank with the values of another column

eg:

A B C (actual data)
00+0 "elb"

"Should replace the 00-0 with XY as it begins with "00+" and the blank values in column B should be replaced with the values of Column C

(should be converted to)

A B C.
XY "elb" "elb"

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
    edited July 17

    Hi @Tsurapaneni
    . I've two questions first:

    a) what SQL database are you using? I did a test with an Oracle db, and the date parser and the extract date part are working. What error are you getting?

    b) Do you really need to use the SQL engine? If the amount of data is not that much, and easily fits in memory, maybe it could be a nice idea to use the local engine to avoid the limitation you mentioned. (But if that is not the case, and database is big, my suggestion is not valid)

    About the second part of your question:

    • to replace a value in a column with another, you can use the "Find and replace" processor in the prepare recipe.
    • To fill the empty values in column B with values of column C, you can use a formula processor and the following code:
      if(isBlank(B), C, B)
      if instead of blank values you have null (empty) values, you should use isNull(B) instead of isBlank(B)

    I hope this helps!

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Thank you so much for the reply. I am using In-database sql engine as we are dealing with large datasets. Regarding the date parser, it is showing an error saying that "not supported settings for "date parser" processor. I am not able to remove the time in the date which is in the format "YYYY-MM-DDT00:00:00.00Z" to convert into "YYYY-MM-dd". I did test out with the functions trunc, asDate etc.. but in vain, not showing the desired results.

    In the second part of the question, sorry to use the word replace but it is begins with, if a value begins with "00+" I should replace the entire value with "XY". Find and replace processor is also in the list of limitations.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    No problem @Tsurapaneni
    .

    I understand now why you need to use the SQL engine.

    • did you try the following processor? (for me it does work with the SQL engine):
      date_parse.png
    • The "find and replace" processor can replace substrings, as show here:
      find_and_replace.png
    • Both steps are possible using the SQL engine (again, I'm testing with Oracle, I'm not sure this is valid for all databases)
    • Now, the "Formula" processor can apparently work also with the SLQ engine (again with Oracle at least), as seen here:
      formulasql.png

    I hope this help a little further!

    Ignacio

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    I have tried the format processor for the date but, I am getting the below error and thus the job failed sadly. Regarding the find and replace processor that still didn't server my purpose I have to replace the string which begins with "000+" which means convert the number which "BEGINS" with 000- with elb "000+0000000" into "elb", but not just replacing the "000+" the find and replace processor with regular expression helps but when regular expression is selected it is showing that it is not supported in the in-database sql and it is not helping me in this case.

    For date I am getting the below error. and for the string with replace processor, it said the findreplace processor is not supported (as I mentioned before). Do you have any helpful inputs for these problems please.

    "Job failed: ERROR: column "Date" is of type timestamp without time zone but expression is of type text Hint: You will need to rewrite or cast the expression".

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    @Tsurapaneni
    , to do the kind of replacement you want you should replace "substring" with regexp.

    About the problems with using the SQL engine, I'm not sure I can help you, specially if I still don't know what database are you using (Oracle, mysql, postgresql, etc.?) Maybe the database you are using is not fully supported by the dataiku SQL capabilities, or maybe you have a dataset where (as it might be hinted on the Error message) the timestamps don't have a timezone (and there is a way to deal with that when creating the dataset).

    Cheers

    Ignacio

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    The database is postgre sql @Ignacio_Toledo
    . Can we write a formula which helps to identify the values which begins with "000+...." so that we can replace them like if startswith "000+" then "elb" else the same column name. would there be any possibility for it this way ?

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    @Tsurapaneni
    , I'm confused with the "replacement algorithm" you want to implement. From the post description I understood that

    • "Should replace the 00-0 with XY as it begins with '00+' ".

    Have this changed? Could you please try to clarify the algorithm? Now you are saying something different apparently. Anyway, the "Find and replace" processor, if your SQL is postgresql, will only work with exact match or substring, but not with regexp.

    About the problem with the Date processor, I've just tried with a postgresql and again this works for me.

    Could you check that in the Dataset connected to postgresql you have activated the following?:

    sql_timezones.png

    If you have done that, and still doesn't work, I would either contact your DSS and/or database administrator, or Dataiku support.

    Cheers

    Ignacio

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Hi @Ignacio_Toledo
    ,

    The substring one didn't change, it's the same but the replace processor replaces the substring only but not the entire string (this is what I have seen after testing) and I want to replace the entire string which starts with "000+". regarding the date and time, I am still working on it.

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Hi @Ignacio_Toledo
    ,

    The isblank formula when ran was giving the error " ERROR: COALESCE types integer and character varying cannot be matched". Do you have any idea on why this is coming up and what factors to check for these.

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
    edited July 17

    Hi @Tsurapaneni
    . Ok, I understand the issue now with the "find and replace" processor. In that case this should work (provided the name of the column is A):

    formula_replace.png

    Notice that you need to add explicitly the SQL code needed to make the find and replace, and the full code there is:

    CASE WHEN position('000+' IN "A") > 0 THEN 'XY' ELSE "A" END

    -----

    Last thing about the converting the timestamp to yyyy-MM-dd, there is one extra correction to make it work:

    extracttime.png

    You need to extract the date into a different column, or you can get an error.

    Good luck!

    Ignacio

Setup Info
    Tags
      Help me…