
skip first row in file using regex

smp
Level 1

Hi community, I am trying to parse a badly shaped file so I need to parse it with Regex. I am currently looking at my dataset and setting the configuration in the Format / Preview tab, using "type : regular expression".

I am currently struggling to skip the first line, which I know in some regex languages can be done with something like

.*\n\K

The rest of the regex is naive for now (I am still working on it):

"(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*)"

 

The problem is that if I launch the regex Pattern, I get this error:

Tried format regexp but configuration is not OK: Illegal/unsupported escape sequence near index 5
.*\n\K"(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*)"
     ^

 

I tried escaping the backslash, changing \K to \\K. The error is gone, but no data is parsed either.

Can you please suggest how to translate \K into the Dataiku regex language?
Or could you at least point me to the official documentation of the regex flavor in use?

thanks a lot

2 Replies
VitaliyD
Dataiker

Hi, it is hard to provide you with some suggestions without having sample data. A good place to start will be to test your regex in one of the online regex playgrounds like, for example, regex101.com. If you want to use python to parse the file, you can probably try something like the below:

 

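import re

# Placeholder input (assumed): a header row, then semicolon-separated rows.
string = "id;name;value\n1;foo;10\n2;bar;20\n"
# The leading \n means a match can only start after the first line,
# so the header row is never captured.
regex = re.compile(r'\n(^.*);', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(string)]
for m in matches:
    print(m)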
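
Building on the snippet above, here is a minimal sketch that drops the first row by slicing the lines instead of using the regex, then applies a column pattern similar to the one in the question to each remaining row. The function name `parse_rows`, the three-column sample, and the use of `[^;]*` instead of `.*` are assumptions for illustration; adjust the column count to your file.

```python
import re

def parse_rows(text, n_columns):
    """Skip the first line of `text`, then split each remaining line into n_columns fields."""
    # One capture group per column; [^;]* keeps a field from swallowing a separator.
    row_pattern = re.compile(";".join(["([^;]*)"] * n_columns))
    rows = []
    for line in text.splitlines()[1:]:  # [1:] drops the header row
        match = row_pattern.fullmatch(line)
        if match:  # lines with a different column count are ignored
            rows.append(match.groups())
    return rows

# Hypothetical sample data: header, two valid rows, one malformed row.
sample = "id;name;value\n1;foo;10\n2;bar;20\nbroken line\n"
rows = parse_rows(sample, 3)
print(rows)
```

Skipping the header outside the regex keeps the pattern itself free of flavor-specific constructs such as \K.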

 

If the above is not what you were looking for, could you provide a file with the sample data?
Best.

smp
Level 1
Author

Hi, thanks for the support!

I had tried regex101 before, but I don't know which regex syntax the Dataiku dataset node uses. With the default configuration in regex101, I was able to use the special escape \K, which is not available in Dataiku.
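
For anyone hitting the same wall: the wording of the error above looks like the one produced by Java's java.util.regex, which has no \K, though that DSS uses it here is an assumption on my part. Python's built-in `re` module also lacks \K, so it can serve as a rough analogy for why escaping the backslash removes the error but parses nothing. The helper `compiles` and the sample strings are made up for the illustration.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \K is not a known escape in Python's re, so compilation fails.
keep_out_supported = compiles(r".*\n\K(.*)")
# \\K compiles, but it now means "a literal backslash followed by K".
escaped_compiles = compiles(r".*\n\\K(.*)")

# Ordinary data contains no literal "\K", so the escaped pattern finds nothing.
escaped_match = re.search(r".*\n\\K(.*)", "header\n1;foo\n")
# It only matches when the text really contains a backslash and a K.
literal_match = re.search(r".*\n\\K(.*)", "header\n\\K1;foo")

print(keep_out_supported, escaped_compiles, escaped_match, literal_match.group(1))
```

In other words, \\K does not translate \K into another flavor; it replaces it with a different pattern that the file never satisfies.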

I have found a different solution anyway: I made my pattern more restrictive, for example by forcing the first column to be numeric. The parser then automatically skips any row whose first column contains text instead of numbers. In practice this skips the first row, because it contains the column names rather than numeric values.
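
A minimal sketch of the idea in Python, assuming a hypothetical three-column file (my real pattern has more columns, and the DSS regex flavor may differ from Python's):

```python
import re

# Hypothetical three-column layout; the first field must consist of digits only.
row_pattern = re.compile(r"(\d+);(.*);(.*)")

# Made-up sample: a header row followed by two data rows.
sample = "id;name;value\n1;foo;10\n2;bar;20\n"

# The header fails \d+ in the first column, so it never produces a match.
rows = [m.groups() for line in sample.splitlines()
        if (m := row_pattern.fullmatch(line))]
print(rows)
```

Note that this drops every row with a non-numeric first column, not only the first one, which happens to be what I want for this file.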

Cheers
