I noticed that the running time of my code seemed to up by worse than O(k^3). Doubling k from 100 to 200 caused the running time to go up by a factor of 12.

Here are the running times of my program using different values of k (median of 3 results):

k = 20, time = 0.054s

30, 0.106

35, 0.137

40, 0.189

45, 0.330

50, 0.268

60, 0.715 (6.75x slower than k = 30)

70, 1.062 (7.75x slower than k = 35)

80, 1.667 (8.82x)

90, 2.439 (7.39)

100, 3.229 (12.05)

120, 6.416 (8.97)

140, 10.868 (10.23)

160, 17.877 (10.72)

180, 28.189 (11.56)

200, 42.269 (13.09)

So, when k was 20, it only took 0.054 seconds. If the time complexity was linear, when k is 200 it should only take 0.54 seconds, but it actually takes 42 seconds - almost 100 times slower than you'd expect.

So I tried valgrind. Here's a screenshot from kcachegrind:

Samples: 31K of event 'cpu-clock:uhppp', Event count (approx.): 7899500000 Children Self Command Shared Object Symbol + 100.00% 0.00% ec_simple_examp ec_simple_example [.] main + 99.99% 0.01% ec_simple_examp libisal.so.2.0.30 [.] test_helper + 99.94% 0.00% ec_simple_examp libc.so.6 [.] __libc_start_call_main + 99.94% 0.00% ec_simple_examp libc.so.6 [.] __libc_start_main_alias_2 (inlined) + 99.87% 0.01% ec_simple_examp libisal.so.2.0.30 [.] recover_data + 99.65% 0.15% ec_simple_examp libisal.so.2.0.30 [.] gf_gen_decode_matrix_simple + 99.47% 73.84% ec_simple_examp libisal.so.2.0.30 [.] gf_invert_matrix + 98.21% 0.00% ec_simple_examp libisal.so.2.0.30 [.] test_exhaustive + 98.18% 0.00% ec_simple_examp ec_simple_example [.] _start + 19.25% 19.25% ec_simple_examp libisal.so.2.0.30 [.] gf_mul + 6.40% 6.40% ec_simple_examp libisal.so.2.0.30 [.] gf_mul@plt + 1.78% 0.00% ec_simple_examp libisal.so.2.0.30 [.] test_random 0.12% 0.00% ec_simple_examp libc.so.6 [.] __printf (inlined) 0.10% 0.03% ec_simple_examp libc.so.6 [.] __vfprintf_internal 0.05% 0.00% ec_simple_examp libc.so.6 [.] __calloc (inlined) 0.04% 0.00% ec_simple_examp libc.so.6 [.] __GI__IO_file_xsputn (inlined)

int gf_invert_matrix(unsigned char *in_mat, unsigned char *out_mat, const int n) { int i, j, k; unsigned char temp; // Set out_mat[] to the identity matrix for (i = 0; i < n * n; i++) // memset(out_mat, 0, n*n) out_mat[i] = 0; for (i = 0; i < n; i++) out_mat[i * n + i] = 1; // Inverse for (i = 0; i < n; i++) { // Check for 0 in pivot element if (in_mat[i * n + i] == 0) { // Find a row with non-zero in current column and swap for (j = i + 1; j < n; j++) if (in_mat[j * n + i]) break; if (j == n) // Couldn't find means it's singular return -1; for (k = 0; k < n; k++) { // Swap rows i,j temp = in_mat[i * n + k]; in_mat[i * n + k] = in_mat[j * n + k]; in_mat[j * n + k] = temp; temp = out_mat[i * n + k]; out_mat[i * n + k] = out_mat[j * n + k]; out_mat[j * n + k] = temp; } } temp = gf_inv(in_mat[i * n + i]); // 1/pivot for (j = 0; j < n; j++) { // Scale row i by 1/pivot in_mat[i * n + j] = gf_mul(in_mat[i * n + j], temp); out_mat[i * n + j] = gf_mul(out_mat[i * n + j], temp); } for (j = 0; j < n; j++) { if (j == i) continue; temp = in_mat[j * n + i]; for (k = 0; k < n; k++) { out_mat[j * n + k] ^= gf_mul(temp, out_mat[i * n + k]); in_mat[j * n + k] ^= gf_mul(temp, in_mat[i * n + k]); } } } return 0; }

So this is a O(n^3) function because of the triple nested for loop at the end of the function. That one nested loop at the end takes up almost all of the function's runtime (my primitive instrumentation says 0.00026s out of 0.00029s). Here's the source code for gf_mul:

unsigned char gf_mul(unsigned char a, unsigned char b) { #ifndef GF_LARGE_TABLES int i; if ((a == 0) || (b == 0)) return 0; return gff_base[(i = gflog_base[a] + gflog_base[b]) > 254 ? i - 255 : i]; #else return gf_mul_table_base[b * 256 + a]; #endif }

So as you can see, gf_mul doesn't do much - there's actually more work done in gf_invert_matrix as the profiler says 64-80% of the work is done in gf_invert_matrix while only 20-36% of the work is done in gf_mul.

Anyways, it was surprising that almost the entirety of the running time of my program was taken up by gf_matrix_invert. If n is 100 then there are 1 million iterations of the innermost loop, if n is 200 then there are 8 million iterations.

But when I timed my program, k=200 is 13 times slower than k=100, which itself is 12 times slower than k=50. If it was O(n^3) then I would expect it to be only 8 times slower, not 12 times. So, that's a mystery to me.

On the other hand, k = 20 was 0.054s, and k=200 is 10 times bigger and 10^3 is 1000, and 0.054s * 1000 = 54 seconds, and indeed k = 200 is 42 seconds so not that far off. So to say my program is O(k^3) is a pretty good approximation.

Anyways, the upshot is that gf_invert_matrix is slow when k is large. This is a problem if you want to do exhaustive testing as I do. It's not a problem if you only want to do encoding and recovery. But if you want to do exhaustive testing then it will be too slow.

**UPDATE: I solved the mystery!** So I just thought about it some more and figured out why my program is 13x slower and not just 8x slower.

Basically it's because the exhaustive testing runs it over m shards so **it runs gf_invert_matrix m times**.

When we increase k from 100 to 200, since p = 50 and m = k+p, m goes from 150 to 250. So we are multiplying 8 by 250/150 = 8 x 1.6666 = 13.33333 which agrees pretty much exactly with the observed running time.

Basically,
I think gf_invert_matrix does indeed get slow as O(k^3) but we run
gf_invert_matrix m times where m is the total number of shards since we're doing exhaustive testing over all the shards. So as k increases m also increases which means we're running
gf_invert_matrix more times. So that's why my program got 13x slower
instead of 8x slower as expected. I think gf_invert_matrix did indeed
get 8x slower, it's just that I was running gf_invert_matrix 1.6 times more than before.