RegEx issues

Jose Marro

Hello,
I'm trying to colorize a hex log, and I'm seeing some strange behavior.
E.g., I can colorize some bytes fine:

but others are not colorized at all

and, in a weird case, a value (0x77) colorizes two bytes (0x77 and 0x57)

It seems to colorize low values (below 0x80) better than high ones.

Another thing that could affect this (?) is that the log was captured using "odd" parity... but I'm not sure about that, as it colorizes correctly if I use 0x01 or 0x03.

Regards!
Josep

Vladimir

Hello Josep,

You have to set "Force Latin-1 encoding" when you are colorizing raw byte sequences. Otherwise, RE2 will try to decode UTF-8, and this yields unexpected results when the data stream (or the pattern) contains invalid UTF-8 sequences -- which is a common thing in raw IO streams.
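To illustrate the mechanism, here is a minimal sketch using Python's standard `re` module as a stand-in. This is an assumption for illustration only: IO Ninja uses RE2, whose handling of invalid UTF-8 may differ in detail, and the sample bytes are made up. The general effect is the same, though: in UTF-8 mode a pattern like \x80 means the code point U+0080, which is two bytes on the wire, so a lone 0x80 byte is never matched, while bytes below 0x80 behave identically in both modes.

```python
import re

# Made-up raw capture: two low bytes, then three bytes >= 0x80
# that do not form valid UTF-8.
data = bytes([0x01, 0x77, 0x80, 0xC3, 0xFF])

# Latin-1: every byte maps to exactly one code point U+0000..U+00FF.
latin1_text = data.decode("latin-1")
latin1_hits = [m.start() for m in re.finditer("\x80", latin1_text)]

# UTF-8: the invalid bytes are replaced with U+FFFD, so "\x80" finds nothing.
utf8_text = data.decode("utf-8", errors="replace")
utf8_hits = [m.start() for m in re.finditer("\x80", utf8_text)]

# Low bytes (< 0x80) are plain ASCII and match in both modes.
low_latin1 = [m.start() for m in re.finditer("\x77", latin1_text)]
low_utf8 = [m.start() for m in re.finditer("\x77", utf8_text)]

# In UTF-8, the code point U+0080 is encoded as two bytes.
encoded_u0080 = "\x80".encode("utf-8")

print(latin1_hits, utf8_hits, low_latin1, low_utf8, encoded_u0080)
```

This also matches the observation that values below 0x80 colorize reliably while higher ones do not.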

As a matter of fact, I think we should change the default behaviour -- i.e., use Latin-1 by default and only UTF-8-decode when explicitly asked for.

Vladimir

Also, regarding this:

and, in a weird case, a value (0x77) colorizes two bytes (0x77 and 0x57)

This is actually fine. "Case sensitive" is set to OFF, so 0x77 (w) and 0x57 (W) both match.
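The same effect can be reproduced outside IO Ninja. The sketch below uses Python's `re` on a bytes pattern (an assumption for illustration, not the RE2 engine itself, and the sample bytes are made up). In ASCII, 0x77 is lowercase "w" and 0x57 is uppercase "W", so a case-insensitive search for \x77 hits both.

```python
import re

# Made-up sample: 'W' (0x57), 'w' (0x77), and an unrelated byte.
data = bytes([0x57, 0x77, 0x01])

# Case-insensitive: \x77 ('w') also matches 0x57 ('W').
insensitive = [m.start() for m in re.finditer(rb"\x77", data, re.IGNORECASE)]

# Case-sensitive: only the exact byte 0x77 matches.
sensitive = [m.start() for m in re.finditer(rb"\x77", data)]

print(insensitive, sensitive)
```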

Jose Marro

Hello Vladimir,
You are right! But I would have sworn I tried that (in fact, after writing the post, I thought I had forgotten to mention that "the issue" seemed related to UTF encoding).
Sorry and thank you!
Regards
Josep

Vladimir

No worries; I'm happy to hear that the issue is resolved now!

If you tried Latin-1 and still saw unexpected coloring, it could have been the case sensitivity issue -- this has caught me off guard a few times as well.

As for why we use UTF-8 in regex by default: UTF-8 is the default encoding across IO Ninja (log engine, terminal, transmit pane, etc.), so it makes sense to use UTF-8 in regex for the sake of consistency. But here we have a dilemma. If the regex engine uses UTF-8 by default, then individual bytes could be left uncolored -- RE2 could treat them as part of a UTF-8 sequence. If, on the other hand, we use Latin-1 by default, then multi-byte Unicode characters could be left uncolored.
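The Latin-1 side of the dilemma can be sketched like this, again with Python's `re` standing in for RE2 and a made-up sample string. A pattern containing "é" (U+00E9) matches when the stream is decoded as UTF-8, but not when the same bytes are read as Latin-1, because there the character arrives as the two separate bytes 0xC3 0xA9.

```python
import re

# "café" as it would appear on the wire in UTF-8: b'caf\xc3\xa9'
stream = "caf\u00e9".encode("utf-8")
pattern = "\u00e9"  # é

# UTF-8 decoding reassembles the two bytes into one character: match.
utf8_hits = [m.span() for m in re.finditer(pattern, stream.decode("utf-8"))]

# Latin-1 decoding yields U+00C3 and U+00A9 instead: no match.
latin1_hits = [m.span() for m in re.finditer(pattern, stream.decode("latin-1"))]

print(utf8_hits, latin1_hits)
```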

I guess we could try to be smarter and automatically choose Latin-1 or UTF-8 based on the pattern (i.e., force Latin-1 when the pattern contains \xHH, force UTF-8 when the pattern contains multi-byte Unicode characters). Forcing the encoding is not a good thing, though. Maybe have an "Auto" option or something like that.
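A rough sketch of what such an "Auto" heuristic might look like, in Python. The function name `choose_regex_encoding`, its `default` parameter, and the rule of falling back to the default when the pattern is ambiguous are all assumptions made up for illustration; nothing like this exists in IO Ninja as far as this thread says.

```python
import re

# A \xHH escape that is not itself escaped, i.e. the backslash before 'x'
# is preceded by an even number of backslashes.
HEX_ESCAPE = re.compile(r"(?<!\\)(?:\\\\)*\\x[0-9A-Fa-f]{2}")


def choose_regex_encoding(pattern: str, default: str = "utf-8") -> str:
    """Hypothetical helper: guess the encoding a pattern was written for."""
    has_hex_escape = HEX_ESCAPE.search(pattern) is not None
    # Any character above U+007F takes more than one byte in UTF-8.
    has_multibyte = any(ord(ch) > 0x7F for ch in pattern)

    if has_hex_escape and not has_multibyte:
        return "latin-1"   # raw bytes were clearly intended
    if has_multibyte and not has_hex_escape:
        return "utf-8"     # Unicode text was clearly intended
    return default         # neither or both: the heuristic cannot decide


print(choose_regex_encoding(r"\x80\xff"))  # latin-1
print(choose_regex_encoding("é+"))         # utf-8
print(choose_regex_encoding("abc"))        # utf-8 (the default)
```

A pattern that mixes \xHH escapes with multi-byte characters has no obviously right answer, which is one reason an explicit setting would still be needed alongside "Auto".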

If you encounter anything else, please let me know. Your feedback over the years has been invaluable -- thank you so much!