Hello, I'm trying to colorize a hex log, and I'm seeing some strange behavior. E.g. some bytes colorize fine:
but others are not colorized at all
and, in one weird case, a single value (0x77) colorizes two bytes (0x77 and 0x57)
It seems to colorize low values (below 0x80) better than high ones.
Another thing that might have an effect (?) is that the log was captured using "odd" parity... but I'm not sure about that, as it colorizes correctly if I use 0x01 or 0x03
Regards! Josep
Hello Josep,
You have to set "Force Latin-1 encoding" when you are colorizing raw byte sequences. Otherwise, RE2 will try to decode UTF-8, and this yields unexpected results when the data stream (or the pattern) contains invalid UTF-8 sequences -- which is a common thing in raw IO streams.
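To illustrate why lone high bytes trip up a UTF-8 decoder, here is a minimal sketch using Python's built-in codecs. It is not RE2 or IO Ninja code; the sample bytes and the helper name `is_valid_utf8` are made up for the example.

```python
# Hypothetical capture: a control byte, 'w', a lone high byte,
# a valid 2-byte UTF-8 sequence (0xC3 0xA9), and another lone high byte.
raw = bytes([0x01, 0x77, 0x80, 0xC3, 0xA9, 0xFF])

# Latin-1 maps every byte to exactly one code point, so nothing is ever invalid
# and character offsets equal byte offsets.
as_latin1 = raw.decode("latin-1")

# UTF-8 merges 0xC3 0xA9 into a single character and rejects the lone 0x80 and 0xFF
# (replaced with U+FFFD here so the result can be inspected).
as_utf8 = raw.decode("utf-8", errors="replace")


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes as UTF-8 without errors."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(raw), len(as_latin1), len(as_utf8))
print(is_valid_utf8(b"\x77"), is_valid_utf8(b"\x80"))
```

Bytes below 0x80 are valid UTF-8 on their own, which is consistent with low values colorizing fine while high values do not.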
As a matter of fact, I think we should change the default behaviour -- i.e., use Latin-1 by default and only UTF-8-decode when explicitly asked for.
Also, regarding this:
This is actually fine. "Case sensitive" is set to OFF, so 0x77 (w) and 0x57 (W) both match.
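The same effect can be reproduced outside IO Ninja. The sketch below uses Python's `re` module on a bytes pattern (not RE2, so only the general behavior carries over); the sample data is made up.

```python
import re

LOWER_W = 0x77  # ASCII 'w'
UPPER_W = 0x57  # ASCII 'W'; differs from 'w' only in bit 0x20

# Hypothetical log bytes containing both cases of the letter.
data = bytes([0x01, UPPER_W, 0x03, LOWER_W])

# Case-insensitive: the pattern \x77 matches both 0x57 and 0x77.
insensitive = [m.start() for m in re.finditer(rb"\x77", data, re.IGNORECASE)]

# Case-sensitive: only the exact byte 0x77 matches.
sensitive = [m.start() for m in re.finditer(rb"\x77", data)]

print(insensitive, sensitive)
```

For raw byte streams, case-sensitive matching is usually what you want, since a hex escape that happens to be an ASCII letter will otherwise match its other-case counterpart too.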
Hello Vladimir, you are right! But I would have sworn I tried that (in fact, after writing the post I realized I had forgotten to mention that "the issue" seemed related to a UTF issue). Sorry, and thank you! Regards, Josep
No worries; I'm happy to hear that the issue is resolved now!
If you tried Latin-1 and still saw unexpected coloring, it could have been the case sensitivity setting -- this has caught me off guard a few times as well.
As for why we use UTF-8 in regex by default: UTF-8 is the default encoding across IO Ninja (log engine, terminal, transmit pane, etc.), so it makes sense to use UTF-8 in regex for the sake of consistency. But here we have a dilemma. If the regex engine uses UTF-8 by default, then individual bytes could be left uncolored -- RE2 could treat them as part of a UTF-8 sequence. If, on the other hand, we use Latin-1 by default, then multi-byte Unicode characters could be left uncolored.
I guess we could try to be smarter and automatically choose Latin-1 or UTF-8 based on the pattern (i.e., force Latin-1 when the pattern contains \xHH, force UTF-8 when the pattern contains multi-byte Unicode characters). Forcing the encoding is not a good thing, though. Maybe have an "Auto" option or something like that.
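A rough sketch of what such an "Auto" heuristic might look like, written in Python purely for illustration. The function name `choose_encoding`, the fallback default, and the precedence (a \xHH escape wins over non-ASCII characters) are all assumptions, not anything implemented in IO Ninja.

```python
import re

# Matches a \xHH escape in the pattern text. Simplification: it does not
# distinguish an escaped backslash followed by "xHH" from a real hex escape.
HEX_ESCAPE = re.compile(r"\\x[0-9A-Fa-f]{2}")


def choose_encoding(pattern: str, default: str = "utf-8") -> str:
    """Guess which encoding a regex pattern is meant for (hypothetical heuristic)."""
    # A \xHH escape suggests the user is matching raw bytes.
    if HEX_ESCAPE.search(pattern):
        return "latin-1"
    # Non-ASCII characters take more than one byte in UTF-8, so the user
    # most likely wants them matched as whole characters.
    if any(ord(ch) > 0x7F for ch in pattern):
        return "utf-8"
    # Pure ASCII patterns behave the same either way; fall back to the default.
    return default


print(choose_encoding(r"\x80\xFF"), choose_encoding("héllo"), choose_encoding("abc"))
```

A pattern that mixes \xHH escapes with multi-byte characters stays ambiguous under any such rule, which is one reason an explicit override would still be needed next to "Auto".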
If you encounter anything else, please let me know. Your feedback over the years has been invaluable. Thank you so much!