RegEx issues

Hello,
I'm trying to colorize a hex log, and I'm seeing some strange behavior.
E.g., some bytes are colorized correctly:
2688f46f-0ccb-4798-88b5-22dacb6dd10a-image.png

but others are not colorized at all
1b2451ef-d820-4ceb-aba3-b5a1e9f09cc9-image.png

and, in one weird case, a single value (0x77) colorizes two bytes (0x77 and 0x57)
f86c427c-f0c8-44c8-8539-926dc408cc87-image.png

It seems to colorize low values (below 0x80) better than high ones.

Another thing that might matter (?) is that the log was captured using "odd" parity, but I'm not sure about that, since it colorizes correctly if I use 0x01 or 0x03
c39a59f2-2ed9-4a98-bd5a-4e525eb8b86d-image.png
3bc2eb3d-1eef-4dea-867a-53dc57a5b52d-image.png

Regards!
Josep

Hello Josep,

You have to set "Force Latin-1 encoding" when you are colorizing raw byte sequences. Otherwise, RE2 will try to decode UTF-8, and this yields unexpected results when the data stream (or the pattern) contains invalid UTF-8 sequences -- which is a common thing in raw IO streams.
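To illustrate the difference, here is a minimal sketch using Python's built-in `re` module rather than RE2 or IO Ninja itself, so the exact handling of invalid sequences is an assumption and may differ in the real engine. The sample bytes are made up. It shows a lone 0x85 byte being found when the data is treated byte-by-byte, and missed once the data is viewed as UTF-8:

```python
import re

# Hypothetical raw capture: 0x85 is a lone high byte, C3 A9 is UTF-8 for U+00E9.
data = bytes([0x01, 0x77, 0x85, 0xC3, 0xA9, 0x57])

# Byte-oriented view (what Latin-1 amounts to): every byte is one character.
byte_hits = [m.start() for m in re.finditer(rb"\x85", data)]

# UTF-8 view: the lone 0x85 is not a valid sequence, so it never becomes U+0085.
text = data.decode("utf-8", errors="replace")
text_hits = [m.start() for m in re.finditer("\x85", text)]

print(byte_hits)  # [2]
print(text_hits)  # []
```

Bytes below 0x80 are the same in both views, which would explain why low values appeared to colorize more reliably.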

As a matter of fact, I think we should change the default behaviour -- i.e., use Latin-1 by default and only UTF-8-decode when explicitly asked for.

Also, regarding this:

"and, in a weird case, a value (0x77) colorize two bytes (0x77 and 0x57)"

This is actually fine. "Case sensitive" is set to OFF, so 0x77 (w) and 0x57 (W) both match.
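The same effect can be reproduced with Python's `re` module (used here only as a stand-in; it is not the engine IO Ninja uses). With case-insensitive matching, a pattern for 0x77 also matches 0x57, because the two bytes are the ASCII letters "w" and "W":

```python
import re

data = bytes([0x77, 0x57])  # ASCII "w" followed by "W"

# Case-sensitive: only the exact byte 0x77 matches.
sensitive = [m.start() for m in re.finditer(rb"\x77", data)]

# Case-insensitive: 0x57 ("W") matches as well.
insensitive = [m.start() for m in re.finditer(rb"\x77", data, re.IGNORECASE)]

print(sensitive)    # [0]
print(insensitive)  # [0, 1]
```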

Hello Vladimir,
You are right! But I would have sworn I had tried that (in fact, after writing the post I realized I had forgotten to mention that "the issue" seemed related to a UTF issue).
Sorry and thank you!
Regards
Josep

No worries; I'm happy to hear that the issue is resolved now!

If you tried Latin-1 and still saw unexpected coloring, it could have been the case-sensitivity issue -- this has caught me off guard a few times as well.

As for why we use UTF-8 in regex by default: UTF-8 is the default encoding across IO Ninja (log engine, terminal, transmit pane, etc.), so it makes sense to use UTF-8 in regex for the sake of consistency. But here we have a dilemma. If the regex engine uses UTF-8 by default, then individual bytes could be uncolored -- RE2 could treat them as part of a UTF-8 sequence. If, on the other hand, we use Latin-1 by default, then multi-byte Unicode characters could be uncolored.
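The Latin-1 side of the dilemma can be sketched as follows. This again uses Python's `re` as a stand-in, and `count_matches` is a hypothetical helper written for this illustration, not an IO Ninja or RE2 function. A two-byte UTF-8 character is found under UTF-8 but missed under Latin-1, where it looks like two unrelated one-byte characters:

```python
import re

def count_matches(pattern: str, stream: bytes, encoding: str) -> int:
    """Count matches of `pattern` after viewing `stream` in `encoding`."""
    text = stream.decode(encoding, errors="replace")
    return len(re.findall(pattern, text))

# U+00E9 (e with acute accent) is two bytes in UTF-8: C3 A9.
stream = "\u00e9".encode("utf-8")

utf8_count = count_matches("\u00e9", stream, "utf-8")      # one character, found
latin1_count = count_matches("\u00e9", stream, "latin-1")  # seen as 0xC3, 0xA9

print(utf8_count)    # 1
print(latin1_count)  # 0
```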

I guess we could try to be smarter and automatically choose Latin-1 or UTF-8 based on the pattern (i.e., force Latin-1 when the pattern contains \xHH, force UTF-8 when the pattern contains multi-byte Unicode characters). Forcing the encoding is not a good thing, though. Maybe have an "Auto" option or something like that.
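A rough sketch of what such an "Auto" choice could look like, in Python. Everything here is hypothetical: `guess_encoding` and its fallback rule are assumptions made for illustration, not an existing IO Ninja feature. It picks Latin-1 when the pattern contains a `\xHH` escape, UTF-8 when it contains non-ASCII characters, and falls back to a default when the pattern has neither or both:

```python
import re

# A \xHH escape preceded by an even number of backslashes, so that an
# escaped backslash followed by "x41" is not mistaken for a hex escape.
HEX_ESCAPE = re.compile(r"(?<!\\)(?:\\\\)*\\x[0-9A-Fa-f]{2}")

def guess_encoding(pattern: str, default: str = "utf-8") -> str:
    """Guess the encoding a regex pattern most likely expects."""
    has_hex = HEX_ESCAPE.search(pattern) is not None
    has_non_ascii = any(ord(ch) > 0x7F for ch in pattern)
    if has_hex and not has_non_ascii:
        return "latin-1"   # raw byte values were spelled out
    if has_non_ascii and not has_hex:
        return "utf-8"     # multi-byte Unicode characters were typed
    return default         # no hint, or conflicting hints

print(guess_encoding(r"\x85\x01"))  # latin-1
print(guess_encoding("\u00e9+"))    # utf-8
print(guess_encoding("abc"))        # utf-8
```

Falling back to a default when the hints conflict keeps the guess from overriding an explicit user setting, which is in line with the concern about forcing the encoding.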

If you encounter anything else, please let me know. Your feedback over the years has been invaluable -- thank you so much!