Hi community, I am trying to parse a badly shaped file, so I need to use a regex. I am configuring my dataset in the Format / Preview tab, using "type : regular expression".
I am currently struggling to skip the first line, which I know in some regex languages can be done with something like
The rest of the regex is just a crude placeholder for now (still working on it).
The problem is that if I launch the regex Pattern, I get this error:
Tried format regexp but configuration is not OK: Illegal/unsupported escape sequence near index 5
.*\n\K"(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*);(.*)"
     ^
I tried adding another backslash, turning \K into \\K. The error goes away, but no data gets parsed either.
Can you please suggest how to translate \K into Dataiku's regex syntax?
Or even just pointing to the official documentation of the regex language in use...
thanks a lot
Hi, it is hard to provide you with some suggestions without having sample data. A good place to start will be to test your regex in one of the online regex playgrounds like, for example, regex101.com. If you want to use python to parse the file, you can probably try something like the below:
import re

# `string` is assumed to already hold the raw contents of the file
regex = re.compile(r'\n(^.*);', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(string)]
for m in matches:
    print(m)
If the above is not what you were looking for, could you provide a file with the sample data?
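If the only goal is to drop the header line, a simpler route in plain Python might be to skip the first line before splitting the fields. This is only a sketch: the sample text and its three-column layout are made up for illustration, and the real file has more columns.

```python
# Hypothetical sample: a header row first, then quoted, semicolon-separated rows.
text = '"id;name;city"\n"1;Alice;Rome"\n"2;Bob;Milan"\n'

# Drop the first line, then strip the surrounding quotes and split on ';'.
data_lines = text.splitlines()[1:]
parsed = [line.strip('"').split(';') for line in data_lines if line]

for fields in parsed:
    print(fields)
```

This avoids \K entirely, since the header is removed before any pattern matching happens.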
Hi, thanks for the support!
I had tried regex101 before, but I don't know which regex syntax the Dataiku dataset node uses. With the default configuration in regex101, I was able to use the special escape \K, which is not available in Dataiku.
I have found a different solution anyway: I made my pattern more restrictive, for example by forcing the first column to be numeric. This automatically made the parser skip any row that contains text instead of numbers in the first column. In practice this skips the first row, because it contains the column names instead of numeric values.
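For anyone landing here later, the idea can be sketched in Python as follows. The sample data and the three-column pattern are invented for illustration (my real pattern has 15 groups), and Python's re module is only standing in for whatever engine Dataiku uses.

```python
import re

# Hypothetical sample: a header row followed by quoted, semicolon-separated rows.
sample = '"id;name;city"\n"1;Alice;Rome"\n"2;Bob;Milan"\n'

# The first group must be digits, so the header row ("id;...") never matches.
row_pattern = re.compile(r'^"(\d+);([^;]*);([^;"]*)"$', re.MULTILINE)

rows = [m.groups() for m in row_pattern.finditer(sample)]
for row in rows:
    print(row)
```

Only the two data rows are captured; the header is rejected because "id" is not numeric, so no \K or other skip construct is needed.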