Thursday, May 18, 2023

gf_invert_matrix is very slow - takes up 99% of the total running time...

I noticed that the running time of my code seemed to go up worse than O(k^3): doubling k from 100 to 200 caused the running time to go up by a factor of 13.

Here are the running times of my program using different values of k (median of 3 results):

k      time (s)   slowdown vs k/2
20     0.054
30     0.106
35     0.137
40     0.189
45     0.330
50     0.268
60     0.715      6.75x (vs k = 30)
70     1.062      7.75x (vs k = 35)
80     1.667      8.82x
90     2.439      7.39x
100    3.229      12.05x
120    6.416      8.97x
140    10.868     10.23x
160    17.877     10.72x
180    28.189     11.56x
200    42.269     13.09x

So, when k was 20, it only took 0.054 seconds. If the time complexity were linear, then at k = 200 it should only take 0.54 seconds, but it actually takes 42 seconds - almost 80 times slower than linear scaling would predict.

So I tried valgrind. Here's a screenshot from kcachegrind:

I also used perf to record the run again with the same parameters (k=120). Here's the perf report:
Samples: 31K of event 'cpu-clock:uhppp', Event count (approx.): 7899500000
  Children      Self  Command          Shared Object         Symbol
+  100.00%     0.00%  ec_simple_examp  ec_simple_example     [.] main
+   99.99%     0.01%  ec_simple_examp  libisal.so.2.0.30     [.] test_helper
+   99.94%     0.00%  ec_simple_examp  libc.so.6             [.] __libc_start_call_main
+   99.94%     0.00%  ec_simple_examp  libc.so.6             [.] __libc_start_main_alias_2 (inlined)
+   99.87%     0.01%  ec_simple_examp  libisal.so.2.0.30     [.] recover_data
+   99.65%     0.15%  ec_simple_examp  libisal.so.2.0.30     [.] gf_gen_decode_matrix_simple
+   99.47%    73.84%  ec_simple_examp  libisal.so.2.0.30     [.] gf_invert_matrix
+   98.21%     0.00%  ec_simple_examp  libisal.so.2.0.30     [.] test_exhaustive
+   98.18%     0.00%  ec_simple_examp  ec_simple_example     [.] _start
+   19.25%    19.25%  ec_simple_examp  libisal.so.2.0.30     [.] gf_mul
+    6.40%     6.40%  ec_simple_examp  libisal.so.2.0.30     [.] gf_mul@plt
+    1.78%     0.00%  ec_simple_examp  libisal.so.2.0.30     [.] test_random
     0.12%     0.00%  ec_simple_examp  libc.so.6             [.] __printf (inlined)
     0.10%     0.03%  ec_simple_examp  libc.so.6             [.] __vfprintf_internal
     0.05%     0.00%  ec_simple_examp  libc.so.6             [.] __calloc (inlined)
     0.04%     0.00%  ec_simple_examp  libc.so.6             [.] __GI__IO_file_xsputn (inlined)
As you can see, when I increased k from 70 to 120, the total % of time spent in gf_invert_matrix increased from 97% to 99%. With higher values of k, almost all of the time is spent in gf_invert_matrix, so let's have a look at that function:
int gf_invert_matrix(unsigned char *in_mat, unsigned char *out_mat, const int n)
{
	int i, j, k;
	unsigned char temp;

	// Set out_mat[] to the identity matrix
	for (i = 0; i < n * n; i++)	// memset(out_mat, 0, n*n)
		out_mat[i] = 0;

	for (i = 0; i < n; i++)
		out_mat[i * n + i] = 1;

	// Inverse
	for (i = 0; i < n; i++) {
		// Check for 0 in pivot element
		if (in_mat[i * n + i] == 0) {
			// Find a row with non-zero in current column and swap
			for (j = i + 1; j < n; j++)
				if (in_mat[j * n + i])
					break;

			if (j == n)	// Couldn't find means it's singular
				return -1;

			for (k = 0; k < n; k++) {	// Swap rows i,j
				temp = in_mat[i * n + k];
				in_mat[i * n + k] = in_mat[j * n + k];
				in_mat[j * n + k] = temp;

				temp = out_mat[i * n + k];
				out_mat[i * n + k] = out_mat[j * n + k];
				out_mat[j * n + k] = temp;
			}
		}

		temp = gf_inv(in_mat[i * n + i]);	// 1/pivot
		for (j = 0; j < n; j++) {	// Scale row i by 1/pivot
			in_mat[i * n + j] = gf_mul(in_mat[i * n + j], temp);
			out_mat[i * n + j] = gf_mul(out_mat[i * n + j], temp);
		}

		for (j = 0; j < n; j++) {
			if (j == i)
				continue;

			temp = in_mat[j * n + i];
			for (k = 0; k < n; k++) {
				out_mat[j * n + k] ^= gf_mul(temp, out_mat[i * n + k]);
				in_mat[j * n + k] ^= gf_mul(temp, in_mat[i * n + k]);
			}
		}
	}
	return 0;
}

So this is an O(n^3) function because of the triple-nested for loop at the end of the function. That one nested loop takes up almost all of the function's runtime (my primitive instrumentation says 0.00026s out of 0.00029s per call). Here's the source code for gf_mul:

unsigned char gf_mul(unsigned char a, unsigned char b)
{
#ifndef GF_LARGE_TABLES
	int i;

	if ((a == 0) || (b == 0))
		return 0;

	return gff_base[(i = gflog_base[a] + gflog_base[b]) > 254 ? i - 255 : i];
#else
	return gf_mul_table_base[b * 256 + a];
#endif
}

So as you can see, gf_mul itself doesn't do much - per the profiler, 64-80% of the self time is spent in gf_invert_matrix's own code, while only 20-36% is spent in gf_mul.

Anyways, it was surprising that almost the entirety of the running time of my program was taken up by gf_invert_matrix. If n is 100 then there are about 1 million iterations of the innermost loop; if n is 200 then there are about 8 million iterations.

But when I timed my program, k = 200 was 13 times slower than k = 100, which itself was 12 times slower than k = 50. If it were O(k^3), each doubling of k should make it only 8 times slower, not 12 or 13 times. So that was a mystery to me.

On the other hand, k = 20 took 0.054s, and k = 200 is 10 times bigger, so O(k^3) predicts 0.054s x 10^3 = 54 seconds - and k = 200 actually takes 42 seconds, which is not that far off. So saying my program is O(k^3) is a pretty good approximation.

Anyways, the upshot is that gf_invert_matrix is slow when k is large. That's not a problem if you only want to do encoding and recovery, but if you want to do exhaustive testing, as I do, it's too slow.



UPDATE: I solved the mystery! I thought about it some more and figured out why my program got 13x slower rather than just 8x.

Basically, it's because the exhaustive testing loops over all m shards, so it runs gf_invert_matrix m times.

When we increase k from 100 to 200, since p = 50 and m = k + p, m goes from 150 to 250. So the expected slowdown is 8 x 250/150 = 8 x 1.667 = 13.33, which closely matches the observed 13.09x.

So gf_invert_matrix itself does get slower as O(k^3), but because the exhaustive testing inverts a matrix for every one of the m shards, and m = k + p grows with k, we also run gf_invert_matrix more times as k increases. gf_invert_matrix did indeed get 8x slower; I was just also running it 1.67x more often than before, which is why my program got 13x slower instead of the expected 8x.