Archive | December 2013

Unpacking a bytestream efficiently in C

I’d forgotten how hard this is, but a post on G+ made me revisit my early days of low-level coding.

Other people’s problems seem to just be more interesting sometimes….

I couldn’t remember or Google any handy bit-twiddling hacks, so I thought I’d see what a naive implementation would cost:

#include <stdio.h>
#include <stdint.h>

#pragma pack(1)

typedef struct {
  struct three_byte_word_t {
    char bytes[3];
  } words[4];
} input_t;

uint32_t naively_convert_three_bytes_to_four(uint8_t* x) {
  return (0xff & x[2]) + ((0xff & x[1]) << 8) + ((0xff & x[0]) << 16);
}

void convert_four(input_t* in_bytes, uint32_t* out_words) {
  out_words[0] = naively_convert_three_bytes_to_four(in_bytes->words[0].bytes);
  out_words[1] = naively_convert_three_bytes_to_four(in_bytes->words[1].bytes);
  out_words[2] = naively_convert_three_bytes_to_four(in_bytes->words[2].bytes);
  out_words[3] = naively_convert_three_bytes_to_four(in_bytes->words[3].bytes);
}

void output_bytes(uint8_t *bytes, int byte_count) {
  printf("test_bytes = ");
  for (int i = 0; i < byte_count; ++i)
    printf("0x%02x ", *bytes++);
  printf("\n");
}

uint8_t bench_test_bytes[3000000];
uint32_t out_words[1000000];
long max_loop = sizeof(bench_test_bytes)/12;
uint8_t test_bytes[] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'};
uint32_t target[] = {(uint32_t)0x00616263,
                     (uint32_t)0x00646566,
                     (uint32_t)0x00676869,
                     (uint32_t)0x006a6b6c};


void prepare_to_bench() {
  uint8_t* in_ptr = test_bytes;
  uint8_t* out_ptr = bench_test_bytes;

  for (long i = 0; i < max_loop; ++i)
    for (int j = 0; j < 12; ++j)
      *out_ptr++ = *in_ptr++;
}


void bench() {
  uint8_t* in_ptr = test_bytes;
  uint32_t* out_ptr = out_words;

  for (long i = 0; i < max_loop; ++i) {
    convert_four(in_ptr, out_ptr);
    in_ptr += 12;
    out_ptr += 4;
  }
}


void post_bench_verify() {
  for (uint32_t* out_ptr = out_words; out_ptr < max_loop; out_ptr += 4){
    if (out_ptr[0] != target[0] ||
        out_ptr[1] != target[1] ||
        out_ptr[2] != target[2] ||
        out_ptr[3] != target[3])
      printf("Something went wrong !!!");
  }
}


int main() {
  output_bytes(test_bytes, 12);
  printf("    target = 0x%08x 0x%08x 0x%08x 0x%08x\n", target[0], target[1], target[2], target[3]);

  uint32_t out_buffer[4];
  convert_four(test_bytes, out_buffer);
  printf("       got = 0x%08x 0x%08x 0x%08x 0x%08x\n", out_buffer[0], out_buffer[1], out_buffer[2], out_buffer[3]);

  prepare_to_bench();
  bench();
  post_bench_verify();
}

This simplified version actually seems to work, but what about those precious CPU cycles?

I went on to profile it:

$ gcc -std=c99 -pg -O0 main.c
$ ./a.out
test_bytes = 0x61 0x62 0x63 0x64 0x65 0x66 0x67 0x68 0x69 0x6a 0x6b 0x6c 
    target = 0x00616263 0x00646566 0x00676869 0x006a6b6c
       got = 0x00616263 0x00646566 0x00676869 0x006a6b6c

$ gprof

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00  1000004     0.00     0.00  naively_convert_three_bytes_to_four
  0.00      0.00     0.00   250001     0.00     0.00  convert_four
  0.00      0.00     0.00        1     0.00     0.00  bench
  0.00      0.00     0.00        1     0.00     0.00  output_bytes
  0.00      0.00     0.00        1     0.00     0.00  post_bench_verify
  0.00      0.00     0.00        1     0.00     0.00  prepare_to_bench

We’re going to need a bigger benchmark! I’m told we need to handle 300000 bytes per second, so let’s give it a minute’s worth of load:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 60.85      0.52     0.52        1   523.31   523.31  prepare_to_bench
 22.82      0.72     0.20 15000001     0.00     0.00  convert_four
  9.36      0.80     0.08 60000004     0.00     0.00  naively_convert_three_bytes_to_four
  7.02      0.86     0.06        1    60.38   337.13  bench
  0.59      0.87     0.01                             frame_dummy
  0.00      0.87     0.00        1     0.00     0.00  output_bytes
  0.00      0.87     0.00        1     0.00     0.00  post_bench_verify

So in a profiled run of the full debug build with no optimization, my rather un-clever version does the work for a whole minute in about 60ms.

These numbers from a 3rd-generation i5 Ultrabook. YMMV.

This is probably good enough for me to get back to my enormous feature branch merge work now.