Unpacking a bytestream efficiently in C
I’d forgotten how hard this is, but a post on G+ made me revisit my early days of low-level coding.
Other people’s problems seem to just be more interesting sometimes….
I couldn’t remember or Google any handy bit-twiddling hacks, so I thought I’d see what a naive implementation would cost:
#include <stdio.h>
#include <stdint.h>
#pragma pack(1)
typedef struct {
struct three_byte_word_t {
char bytes[3];
} words[4];
} input_t;
uint32_t naively_convert_three_bytes_to_four(uint8_t* x) {
return (0xff & x[2]) + ((0xff & x[1]) << 8) + ((0xff & x[0]) << 16);
}
void convert_four(input_t* in_bytes, uint32_t* out_words) {
out_words[0] = naively_convert_three_bytes_to_four(in_bytes->words[0].bytes);
out_words[1] = naively_convert_three_bytes_to_four(in_bytes->words[1].bytes);
out_words[2] = naively_convert_three_bytes_to_four(in_bytes->words[2].bytes);
out_words[3] = naively_convert_three_bytes_to_four(in_bytes->words[3].bytes);
}
void output_bytes(uint8_t *bytes, int byte_count) {
printf("test_bytes = ");
for (int i = 0; i < byte_count; ++i)
printf("0x%02x ", *bytes++);
printf("\n");
}
uint8_t bench_test_bytes[3000000];
uint32_t out_words[1000000];
long max_loop = sizeof(bench_test_bytes)/12;
uint8_t test_bytes[] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'};
uint32_t target[] = {(uint32_t)0x00616263,
(uint32_t)0x00646566,
(uint32_t)0x00676869,
(uint32_t)0x006a6b6c};
void prepare_to_bench() {
uint8_t* in_ptr = test_bytes;
uint8_t* out_ptr = bench_test_bytes;
for (long i = 0; i < max_loop; ++i)
for (int j = 0; j < 12; ++j)
*out_ptr++ = *in_ptr++;
}
void bench() {
uint8_t* in_ptr = test_bytes;
uint32_t* out_ptr = out_words;
for (long i = 0; i < max_loop; ++i) {
convert_four(in_ptr, out_ptr);
in_ptr += 12;
out_ptr += 4;
}
}
void post_bench_verify() {
for (uint32_t* out_ptr = out_words; out_ptr < max_loop; out_ptr += 4){
if (out_ptr[0] != target[0] ||
out_ptr[1] != target[1] ||
out_ptr[2] != target[2] ||
out_ptr[3] != target[3])
printf("Something went wrong !!!");
}
}
int main() {
output_bytes(test_bytes, 12);
printf(" target = 0x%08x 0x%08x 0x%08x 0x%08x\n", target[0], target[1], target[2], target[3]);
uint32_t out_buffer[4];
convert_four(test_bytes, out_buffer);
printf(" got = 0x%08x 0x%08x 0x%08x 0x%08x\n", out_buffer[0], out_buffer[1], out_buffer[2], out_buffer[3]);
prepare_to_bench();
bench();
post_bench_verify();
}
This simplified version actually seems to work, but what about those precious CPU cycles?
I went on to profile it:
$ gcc -std=c99 -pg -O0 main.c
$ ./a.out
test_bytes = 0x61 0x62 0x63 0x64 0x65 0x66 0x67 0x68 0x69 0x6a 0x6b 0x6c
target = 0x00616263 0x00646566 0x00676869 0x006a6b6c
got = 0x00616263 0x00646566 0x00676869 0x006a6b6c
$ gprof
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 1000004 0.00 0.00 naively_convert_three_bytes_to_four
0.00 0.00 0.00 250001 0.00 0.00 convert_four
0.00 0.00 0.00 1 0.00 0.00 bench
0.00 0.00 0.00 1 0.00 0.00 output_bytes
0.00 0.00 0.00 1 0.00 0.00 post_bench_verify
0.00 0.00 0.00 1 0.00 0.00 prepare_to_bench
We’re going to need a bigger benchmark! I’m told we need to handle 300000 bytes per second, so let’s give it a minute’s worth of load:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 60.85 0.52 0.52 1 523.31 523.31 prepare_to_bench 22.82 0.72 0.20 15000001 0.00 0.00 convert_four 9.36 0.80 0.08 60000004 0.00 0.00 naively_convert_three_bytes_to_four 7.02 0.86 0.06 1 60.38 337.13 bench 0.59 0.87 0.01 frame_dummy 0.00 0.87 0.00 1 0.00 0.00 output_bytes 0.00 0.87 0.00 1 0.00 0.00 post_bench_verify
So in a profiled run of the full debug build with no optimization, my rather un-clever version does the work for a whole minute in about 60ms.
These numbers from a 3rd-generation i5 Ultrabook. YMMV.
This is probably good enough for me to get back to my enormous feature branch merge work now.