skip first row in file using regex

smp Registered Posts: 4 ✭✭✭

Hi community, I am trying to parse a badly shaped file so I need to parse it with Regex. I am currently looking at my dataset and setting the configuration in the Format / Preview tab, using "type : regular expression".

I am currently struggling to skip the first line, which I know in some regex languages can be done with something like

.*\n\K

The rest of the regex is still crude for now (I am still working on it):

"(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*)"

The problem is that if I launch the regex Pattern, I get this error:

Tried format regexp but configuration is not OK: Illegal/unsupported escape sequence near index 5
.*\n\K"(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*)"
     ^

I have tried adding another backslash, turning \K into \\K. I no longer get the error, but no data is parsed either.
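For readers hitting the same issue: the error text looks like the one raised by Java's java.util.regex, whose flavor has no \K (it is a Perl/PCRE feature). Python's re module also lacks \K, so it can serve as a stand-in to show why doubling the backslash silences the error yet matches nothing: \\K is read as a literal backslash followed by the letter K. A minimal sketch, assuming Python's re behaves like the DSS engine in this respect (not verified against DSS); the sample strings are hypothetical:

```python
import re

# \K is a Perl/PCRE feature; Python's re rejects it as an unknown escape.
try:
    re.compile(r'.*\n\K"(.*);(.*)"')
    k_supported = True
except re.error:
    k_supported = False

# Doubling the backslash compiles, but now means "literal backslash, then K".
escaped = re.compile(r'.*\n\\K"(.*);(.*)"')

sample = 'col_a;col_b\n"1;2"'            # hypothetical two-line input
match_on_data = escaped.search(sample)    # the data holds no literal \K

literal = 'header\n\\K"1;2"'              # text that really contains \K
match_on_literal = escaped.search(literal)

print(k_supported, match_on_data, match_on_literal is not None)
```

In other words, the escaped pattern is valid but only matches input that literally contains the two characters \K after the first line break.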

Can you please suggest how to translate \K into the Dataiku regex dialect?
Or even just point me to the official documentation of the regex language in use...

thanks a lot

Answers

  • VitaliyD
    VitaliyD Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 102 Dataiker
    edited July 17

    Hi, it is hard to provide you with some suggestions without having sample data. A good place to start will be to test your regex in one of the online regex playgrounds like, for example, regex101.com. If you want to use python to parse the file, you can probably try something like the below:

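    import re

    # `string` holds the raw file contents; this sample is hypothetical.
    string = "id;name;\n1;foo;\n2;bar;\n"
    # Requiring a preceding \n means the first line can never match.
    regex = re.compile(r'\n(^.*);', re.MULTILINE)
    matches = [m.groups() for m in regex.finditer(string)]
    for m in matches:
        print(m)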
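If the file is read in Python anyway, one possible alternative to a multi-line regex is to drop the first line before splitting. A minimal sketch: the `parse_rows` helper and the sample text are hypothetical, and it assumes that no field contains a `;` itself.

```python
def parse_rows(text, sep=";"):
    """Split text into rows of fields, skipping the first (header) line."""
    lines = text.splitlines()
    # lines[1:] drops the header; blank lines are ignored.
    return [line.split(sep) for line in lines[1:] if line.strip()]

sample = "id;name;city\n1;foo;Rome\n2;bar;Oslo\n"  # hypothetical data
rows = parse_rows(sample)
for row in rows:
    print(row)
```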


    If the above is not what you were looking for, could you provide a file with the sample data?
    Best.

  • smp
    smp Registered Posts: 4 ✭✭✭

    Hi, thanks for the support!

    I had tried regex101 before, but I don't know which regex syntax the Dataiku dataset node uses. With the default configuration in regex101, I was able to use the special escape \K, which is not available in Dataiku.

    I have found a different solution anyway: I made my parsing more restrictive, for example by forcing the first column to be numeric. This automatically made the parser skip any row that contained text instead of numbers in the first column. In practice, the first row is skipped because it contains the column name instead of a numeric value.
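The workaround described above can be sketched in Python as follows. This is an illustration only: the three-column pattern, the `parse_numeric_rows` helper and the sample lines are hypothetical, and DSS applies its own regex engine, which may differ from Python's `re` in details.

```python
import re

# The first group must be digits, so a header with text in column 1 cannot match.
ROW = re.compile(r'^(\d+);([^;]*);([^;]*)$')

def parse_numeric_rows(lines):
    """Return field tuples for lines whose first column is numeric."""
    parsed = []
    for line in lines:
        m = ROW.match(line)
        if m:  # non-matching lines (e.g. the header) are dropped
            parsed.append(m.groups())
    return parsed

sample_lines = ["id;name;city", "1;foo;Rome", "2;bar;Oslo"]  # hypothetical data
result = parse_numeric_rows(sample_lines)
print(result)
```

Note that this drops every line whose first column is not numeric, not just the first one, which may or may not be what a given file needs.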

    Cheers
