Monday, October 12, 2015

#C: fast parsing of hex strings to char buffers

the C library for parsing INI file syntax (cfg2) that i'm hobby-maintaining has recently been improved quite a lot. supports sections, the parser should be much better now, improved API and so on.

one of the features i've added is to be able to store HEX strings of binary data in the text file that can be optionally parsed into memory char buffers.

for example:
some_key=48656c6c6f77576f726c64

where the HEX string equals "HelloWorld".

problems at hand:
1) detect bad length
2) detect bad characters
3) support case-insensitive input

of course, most programmers can immediately start writing a solution for this type of parser and solve all 3 problems, but the question is - how to write this to be as efficient as possible?

solving 1) is quite easy. since we are working with 8 bit characters, we are going to have 256 values per character or a maximum of 0xFF (255). but instead of parsing bad characters we are strict and require the HEX string length to be divisible by 2:
if (len % 2)
    goto error;
one would approach solving 2) with "if, then, else" branches to see if a character in the HEX string is in the 0-9,A-F range of the ASCII table.

3) can be solved with toupper()/tolower()

but instead, a lookup table can be used to solve both 2) and 3) in a single pass. this lookup table could have 256 indexes, where most are filled with zero, while the valid indexes are filled with values of interested.

the table can be seen in this source file - named "hex_to_char_lookup".

so, when we first traverse the example HEX string we reach a couple of characters "48".
the first one "4" - this character has an ASCII decimal value of 52. the lookup table has mapped this value at index 52 as 5. the reason for 5 instead of 4 is that i wanted the table to look pretty, filled with zeroes and thus i increased each value by one so that "0" is mapped to 1, "1" is mapped to 2 and so on. this can be removed for a small optimization [1], but then one has to fill the table with values such as -1.

"8" which is mapped to ASCII index 56 in the table and a value 9 is returned.

if we reach a bad character, 0 will be returned in which case the program should return an error with the bad character position in the HEX string.

at some point in the example string, "c" will be reached and here we could have a case-sensitivity issue, but the lookup table has mapped both the ASCII "c" and "C" to return the same value which is 13 or the 12th index in hexadecimal, plus 1.
0 1 2 3 4 5 6 7 8 9 A B C D E F
                        ^ (12 + 1 = 13)
now, to obtain the resulting C char we have to form a hexadecimal value combining the two decimal values, as follows:
first = table[52]; // 5
second = table[56]; // 9
if (!first || !second)
    goto error;
first--; second--; // 4, 8; possible optimization [1]
unsigned char result = first * 16 + second; // 72 = "H"
and then the result can be written to the output char buffer.

that's all!
TMK, this parsing method would work for any string encoding UTF-8, UTF-16, etc. as long as sizeof(char) == 1, which should be very much true as it's in the ISO C standard.
 --

0 comments: