
Monday, March 27, 2023

Proof that my Base53 checksum detects all single-character changes and adjacent transpositions

I decided to add a checksum to my Base53 scheme (which I created for URL shortener IDs) for the purpose of detecting typos. The checksum is a generalization of the ISBN-10 checksum (one could perhaps call it the ISBN-n checksum). The scheme is as follows:

Choose any prime p. Our alphabet will consist of p symbols which map to the set of integers from 0 to p-1. For example, if we choose p to be 17, then our alphabet will be {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, g} where 0 maps to 0 and g maps to 16.

The maximum valid length of a string (not including the checksum character) in this scheme shall be p-2.

Given a string of length n where n is less than or equal to p-2 consisting of symbols from the alphabet, we want to compute a single-character error-detection checksum for the string. Denote characters in the string as x_i such that x_1 is the first character of the string, x_2 is the second character and x_n is the last character (n is the length of the string). Denote the checksum character as c.

The checksum character c shall be computed as the sum of products of multipliers with the characters of the string mod p, as follows:

c = sum mod p

Where sum = (p-2) * x_1 + (p-3) * x_2 + ... + (p-1-n) * x_n

Where:

  • n is the length of the string
  • n is less than or equal to p-2
  • the multipliers are (p-2), (p-3), ..., (p-1-n); that is, the multiplier for x_i is p-1-i

As c is a number modulo p, its minimum value is 0 and maximum value is p-1, which means it can be represented using the alphabet.

For example, for p=17 the checksum will be computed as:

c = (15 * x_1 + 14 * x_2 + ... + (16-n) * x_n) mod 17

Where n is less than or equal to 15.

To verify a string, we simply recompute the checksum of the string without its last character (the last character is the checksum character, since we append it to the end of the string) and check that it matches that last character.
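
To make this concrete, here is a minimal Python sketch of the compute and verify steps for p = 17. The function names and the example string are my own choices for illustration; they are not part of the Base53 library.

    P = 17
    ALPHABET = "0123456789abcdefg"  # 17 symbols; ALPHABET[i] maps to the value i

    def compute_checksum(s):
        # the multiplier for the character at 0-based index i is p-2-i,
        # i.e. p-2 for the first character, p-3 for the second, and so on
        assert 1 <= len(s) <= P - 2
        total = sum((P - 2 - i) * ALPHABET.index(ch) for i, ch in enumerate(s))
        return ALPHABET[total % P]

    def verify(s):
        # the last character is the appended checksum character
        return compute_checksum(s[:-1]) == s[-1]

    coded = "3fa" + compute_checksum("3fa")
    assert verify(coded)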

We want to be able to detect 2 types of errors:

  1. Single character typo, where a single character is changed
  2. Adjacent transposition, where two adjacent characters are swapped resulting in a different string.

The checksum character will be appended to the end of the string, therefore an adjacent transposition could swap the checksum character with the character next to it. We want to be able to detect this.

Here is my proof:

Lemma: A change in the string will go undetected iff the original and new strings have the same checksum, which means that the sums are equal mod p.

Therefore: a change will go undetected iff the sum changes by a multiple of p.

=== Property 1. This checksum detects all single-character changes ===

Proof:

Each multiplier lies between 1 and p-2, so none of them is a multiple of p. A single-character change at some position changes the sum by multiplier * (new character - old character). Since p is prime and the multiplier is not a multiple of p, this product is a multiple of p only if the character itself changes by a multiple of p. But the characters are in the range 0 to p-1, so the only way a character can change by a multiple of p is to not change at all (a change of 0 * p).

Therefore, it is impossible for a single-character change to go undetected.

=== Property 2. This checksum detects all adjacent transpositions in the string excluding the checksum character ===

Proof:

Transposition of two adjacent characters will go undetected iff the resulting sum is equal to the original sum plus a multiple of p.

Let x and y be the adjacent characters (where x precedes y in the string), let r be the multiplier of x (so the multiplier of y is r-1), and let k be any integer.

The other characters in the string remain in their positions, so their contribution to the sum is unchanged; the sum therefore changes only if the sum of the two products involving the transposed characters changes.

In order for the change to go undetected the sum of the products involving the two transposed characters must remain unchanged mod p:

r * x + (r-1) * y = r * y + (r-1) * x + p * k where k is any integer

Expanding both sides:

rx + ry - y = ry + rx - x + pk

Subtracting rx + ry from both sides:

-y = -x + pk

x = y + pk

Thus, the only time when a transposition results in an unchanged checksum is when x is equal to y plus a multiple of p. This is impossible if k is nonzero because the characters are in the range 0 to p-1. But if k is 0 then x equals y and there is no change in the string.

Therefore the code detects all adjacent transpositions excluding the checksum character.

=== Property 3. This checksum detects the transposition between the checksum character and the last character ===

Denote the checksum character as c.

Denote the length of the string as n where n is less than or equal to p-2.

Denote the last character (not the checksum) as x_n.

Denote the sum not including the last character, (p-2) * x_1 + (p-3) * x_2 + ... + (p-n) * x_{n-1}, as s.

Denote the last multiplier (the multiplier for x_n) as r.

Then the following equation holds:

s + r * x_n = c + p * k where k is some integer

When the last character x_n is transposed with c, the string will be valid iff:

s + r * c = x_n + p * k where k is some integer. Otherwise the string would be detected as invalid.

Therefore in order for the transposition to go undetected, the following simultaneous equations must hold:

s + r * x_n = c + p * u where u is some integer

s + r * c = x_n + p * v where v is some integer.

Now we can subtract them like this:

s - s + r * x_n - r * c = c - x_n + p(u-v)

r * x_n - r * c - c + x_n = p(u-v)

(r + 1) * x_n - (r + 1) * c = p(u-v)

(r + 1) * (x_n - c) = p(u-v)

Now, p(u-v) is a multiple of p, which means (r + 1) * (x_n - c) must be a multiple of p. Since p is prime, this means that either r+1 is a multiple of p, or (x_n - c) is a multiple of p, or both.

Since the length of the string is at least 1 and at most p-2, the last multiplier r = p-1-n satisfies 1 <= r <= p-2. Therefore r+1 lies between 2 and p-1, so r+1 cannot be a multiple of p.

Since r+1 cannot be a multiple of p, that means x_n - c must be a multiple of p. But the characters go from 0 to p-1, so the difference between two characters must be less than p, which means that if x_n - c is a multiple of p then x_n - c = 0, which means they're the same and therefore the string is unchanged.

Therefore the transposition of the last character with the checksum character cannot go undetected.
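
All three properties can also be sanity-checked by brute force for a small prime. The following sketch (my own test harness, not part of the scheme) takes p = 5, enumerates every 2-character string, and confirms that every single-character change and every adjacent transposition, including the swap with the appended checksum character, is detected:

    from itertools import product

    P = 5  # a small prime so the exhaustive check is fast

    def checksum(xs):
        return sum((P - 2 - i) * x for i, x in enumerate(xs)) % P

    def valid(xs):
        return checksum(xs[:-1]) == xs[-1]

    for body in product(range(P), repeat=2):
        coded = list(body) + [checksum(body)]
        # every single-character change must be detected
        for pos in range(len(coded)):
            for sym in range(P):
                if sym != coded[pos]:
                    assert not valid(coded[:pos] + [sym] + coded[pos + 1:])
        # every adjacent transposition of distinct characters must be detected
        for pos in range(len(coded) - 1):
            if coded[pos] != coded[pos + 1]:
                swapped = coded[:]
                swapped[pos], swapped[pos + 1] = swapped[pos + 1], swapped[pos]
                assert not valid(swapped)
    print("all single-character changes and adjacent transpositions detected")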

See: 

  • https://math.stackexchange.com/questions/4664122/proving-my-checksum-detects-all-single-character-typos-and-adjacent-transpositio
  • https://github.com/1f604/base53a

Tuesday, June 28, 2022

Proof of algorithm to detect cycles in a directed graph

I was writing my own wiki software and needed a way to detect tag cycles e.g. when tag X has tag Y and tag Y has tag Z and tag Z has tag X. This should not be allowed. So I needed an algorithm to detect this kind of error.

Here is a simple implementation of depth first search to produce a post-order traversal:

    def postorderTraversal(self, root):
        result = []
        def dfs(node, result):
            for child in [node.left, node.right]:
                if child is not None:
                    dfs(child, result)
            result.append(node.val)
        if root is not None:
            dfs(root, result)
        return result

As you can see, this algorithm is recursive, which isn't great for large graphs: a deep enough input will hit the recursion limit.

So here is the iterative version using a stack:

    def postorderTraversal(self, root):
        if root is None:
            return []
        result = []
        visited = set()
        stack = [root]
        while stack:
            node = stack[-1]
            if node in visited:
                stack.pop()
                result.append(node.val)
            else:
                visited.add(node)
                children = [node.left, node.right]
                children.reverse() # needed for left-to-right traversal
                stack.extend([child for child in children if child is not None])
        return result
    

Property 1: After a node is pushed onto the stack, its children are pushed on top of it. This means a node cannot be popped until its children have first been popped, and its children cannot be popped until their children have been popped, and so on. Therefore, when a node is visited for the second time, its children have already been popped off, which means all of its descendants have already been visited.
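
For reference, here is a tiny usage sketch of the traversal above. The TreeNode class and the Solution wrapper are assumptions on my part (the snippets in this post are written LeetCode-style); neither is defined in the post:

    class TreeNode:
        def __init__(self, val, left=None, right=None):
            self.val = val
            self.left = left
            self.right = right

    #     1
    #    / \
    #   2   3
    root = TreeNode(1, TreeNode(2), TreeNode(3))
    # Solution is a hypothetical wrapper class holding postorderTraversal
    print(Solution().postorderTraversal(root))  # prints [2, 3, 1]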


    def canFinish(self, numCourses, prerequisites):
        global has_cycle        
        has_cycle = False
        children = defaultdict(list)
        
        for edge in prerequisites:
            children[edge[0]].append(edge[1])
        
        color = defaultdict(int)
        def dfs(node):
            global has_cycle
            if color[node] == 2:
                return
            if color[node] == 1: # 1 means GREY
                has_cycle = True
                return
            
            color[node] = 1
            for child in children[node]:
                dfs(child)
            color[node] = 2 # 2 means BLACK
        
        for node in range(numCourses):
            dfs(node)
            
        return not has_cycle
            

The above is recursive code for detecting cycles in a directed graph. 

Idea:

- We do DFS on every node in the graph.

- When we start DFS on a node, we mark it GREY. 

- When we finish DFS on a node, we mark it BLACK.

- We visit every child.

- If when we visit a node, it is GREY, then we have a cycle. 

- We ignore BLACK nodes.

Informal proof of correctness:

- A node is BLACK if all of its descendants have been visited and none of them are part of any cycle; therefore a BLACK node cannot be part of any cycle, and it does not need to be revisited.

- A node is GREY if we have NOT finished DFS on it, meaning it must be on the recursion stack. Proof: if we have finished DFS on it, then it would have been marked BLACK. Therefore, if it's GREY then it means we haven't finished DFS on it which means that it must be in the recursion stack.

- If we revisit a GREY node then it means we are revisiting a node already on the recursion stack which means there is a cycle.

- If we do not revisit a GREY node then there is no cycle. Proof: suppose there is a cycle reachable from the starting node, say X->A->B where X is reachable from B. DFS from the starting node visits every edge reachable from it. So if A is reachable from the starting node then we will visit the edge A->B, and since X is reachable from B we will also visit the edge X->A. This means we will visit A twice, first from the starting node and then from X. In general, if a cycle is reachable from the starting node, then a DFS that follows all the edges will necessarily revisit a node it has already visited. Therefore, if the DFS never revisits a node, there is no cycle.

Here is the iterative version:

    def canFinish(self, numCourses, prerequisites):
        global has_cycle        
        has_cycle = False
        children = defaultdict(list)
        
        for edge in prerequisites:
            children[edge[0]].append(edge[1])
        
        color = defaultdict(int)
        
        for node in range(numCourses):
            # do dfs on node
            stack = [node]
            while stack:
                node = stack[-1]
                if color[node] == 2: # if already visited, don't visit it again
                    stack.pop()
                    continue # ignore black nodes
                if color[node] == 1: # we finished visiting all of its children
                    stack.pop()
                    color[node] = 2
                else:
                    color[node] = 1
                    for child in children[node]:
                        if color[child] == 1: # found a cycle
                            return False
                        if color[child] == 2: # don't visit it
                            continue
                        stack.append(child)
        return True
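
As a quick check, the tag example from the introduction (tag X has tag Y, Y has Z, Z has X) forms a cycle. Numbering the tags 0, 1, 2 and assuming the method above sits in a LeetCode-style Solution class (an assumption on my part, not shown in the post), the call would look like this:

    # X=0, Y=1, Z=2; each pair [a, b] means "a has tag b", i.e. an edge a -> b
    cyclic = [[0, 1], [1, 2], [2, 0]]
    acyclic = [[0, 1], [1, 2]]

    s = Solution()  # hypothetical wrapper class containing canFinish
    print(s.canFinish(3, cyclic))   # False: the cycle 0 -> 1 -> 2 -> 0 is detected
    print(s.canFinish(3, acyclic))  # True: no cycle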

Properties:

  • DFS is simulated using the stack and while loop.
  • The stack contains the list of "nodes to visit".
  • The first time we "visit" a node, we mark it GREY.
  • The second time we "visit" a node, we mark it BLACK (we first check if it's GREY, if so that means we've already visited it).

Informal proof of correctness:

  • Base case: when a node is a leaf, i.e. it has no children. When it is visited it is at the top of the stack, so we visit it once, mark it GREY, and push its children onto the stack (it has none), so we immediately revisit it, mark it BLACK, and pop it off the stack. In this case the property holds: we popped it off once all of its children (it has none) had been visited.
  • Inductive step: when a node is a non-leaf, i.e. it has children. If all of a node's children are leaves, we push all of its children onto the stack and then pop off each child (as per the previous point), marking each of them BLACK, until we get back to the node. At this point we have visited all of its children, so the property again holds. In general, each node is only revisited and popped off once all of its descendants have been popped off.
  • We want to prove that a node is popped off when and only when all of its descendants have been visited.
  • A node is popped off when all of its descendants have been popped off. This is because when we "visit" a node for the first time, we push all of its descendants onto the stack. We only pop off a node when we visit it for the second time. In order for us to visit it for the second time, we must have popped off all of the nodes on top of it in the stack. Since when we visit a node we push all of its children on the stack, that means we only pop off the node when we pop its children off the stack first. And this property holds for its children and its children's children and so on. Therefore, when we pop off a node, that means we've popped off not only its children but also its children's children and so on. Thus, when we pop off a node, that means we've already popped off all of its descendants from the stack i.e all of the nodes reachable from that node.

Thursday, September 9, 2021

Proof of correctness for Knuth shuffle

== Intuition ==

Imagine you have a pile of cards and you want to create a hand that is completely random. Here "completely random" means it has to be a hand that is uniformly drawn at random from the set of all permutations of hands. That is, all permutations must have the same probability of being chosen. 

The simplest way to do this that I can think of is this: pull a card at random out of the pile, add it to your hand as the first card. Then pull another card at random out of the pile, and add that to your hand as the second card, keeping them in order. And then draw another card randomly out of the pile, and add it to your hand as the third card. And so on and so forth until the pile is empty. 

In Knuth's algorithm, the way you "pull" a card out of the pile is to swap the "current card" A[i] with a card chosen uniformly at random from A[i to n], where n is the index of the last card. This means that A[i] ends up being a card randomly chosen from A[i to n], i.e. a card randomly pulled from that subarray. And since we then move on to A[i+1], for which we choose from the subarray A[i+1 to n], the card that we have just chosen is not put back into the pile.

This basically follows the definition of a random permutation: the first card is chosen uniformly at random from the set of all the cards. The second card is chosen uniformly at random from the set of all the remaining cards - that is, the set of cards that does not include the previously chosen cards. And so on and so forth.
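
For concreteness, here is a minimal Python sketch of the shuffle as described above (0-indexed, so the "pile" at step i is the subarray a[i:]):

    import random

    def knuth_shuffle(a):
        # at step i, pull a uniformly random card out of the remaining
        # pile a[i:] and put it in position i by swapping
        for i in range(len(a) - 1):
            j = random.randint(i, len(a) - 1)  # j is uniform over i..n-1
            a[i], a[j] = a[j], a[i]
        return a

    print(knuth_shuffle(list(range(10))))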

== Informal proof (using the tree diagram) ==

We can draw a tree diagram to represent all the possible runs of the algorithm on a particular input. At the topmost level, the root node has n children (each child node corresponds to a different choice that the algorithm can make). At the second level, each node has n-1 children, and so on and so forth. So it is easy to see that there are n! possible paths through the tree, each with equal probability (because at each node every child is chosen with equal probability, so the probability of reaching each leaf node is the same).

Now all we have to do is show that each of the n! permutations is represented by at least one execution path. Note that each node corresponds to a distinct partial permutation (a distinct prefix of choices made so far), which means that each node on a level produces a set of subtrees that is distinct from the subtrees of every other node on that level. Recursing down the tree we reach the bottom level, whose subtrees each consist of a single node, so the n! leaves are n! distinct permutations. Since there are exactly n! permutations in total, every permutation is represented by exactly one path.

== Formal proof (by Peter J. Haas) ==

The formal proof by Peter J. Haas looks at the probability of picking an arbitrary permutation. 

If we can show that the probability of picking any arbitrary permutation is 1/n! then we have proven that the shuffle is correct (by definition).

Consider an arbitrary permutation [b_1, b_2, ..., b_n]

We produce a permutation x by the random shuffle algorithm.

What is the probability that x is identical to b, that is x_1 = b_1, x_2 = b_2, ..., x_n = b_n?

The probability of the Knuth shuffle producing a permutation with b_1 as its first element is clearly 1/n.

Now, the probability of the shuffle producing a permutation that starts with b_1, b_2 is: 1/n x 1/(n-1). Why? Because having chosen b_1 as the first element, now there are n-1 elements left to choose. The probability of choosing b_2 from those n-1 elements is therefore 1/(n-1). 

And the probability of choosing b_3 as the third element, given that the first two elements are b_1 and b_2? Well, we simply multiply up the probabilities (this follows from the definition of conditional probability) as follows:

P(x_1 = b_1) = 1/n

P(x_2 = b_2 given x_1 = b_1) = 1/(n-1), so P(x_1 = b_1 and x_2 = b_2) = 1/n * 1/(n-1)

P(x_3 = b_3 given x_1 = b_1 and x_2 = b_2) = 1/(n-2), so P(x_1 = b_1, x_2 = b_2, x_3 = b_3) = 1/n * 1/(n-1) * 1/(n-2)

And so on and so forth, such that

P(x_1 = b_1, x_2 = b_2, ..., x_n = b_n) = 1/n * 1/(n-1) * 1/(n-2) * ... * 1/1

Hence:

P(x = b) = 1/n!

Thus the proof concludes: the probability of the shuffling algorithm picking any arbitrary permutation is 1/n!, therefore the algorithm is correct.

Source: https://people.cs.umass.edu/~phaas/CS590M/handouts/Fisher-Yates-proof.pdf

Monday, August 3, 2020

Proof of function to normalize angles using why3

Problem: Write a function to normalize an angle to between -179 and 180 degrees (inclusive).

Solution (as found on StackOverflow here: https://stackoverflow.com/questions/2320986/easy-way-to-keeping-angles-between-179-and-180-degrees/2323034#2323034):

    // reduce the angle
    angle = angle % 360;

    // force it to be the positive remainder, so that 0 <= angle < 360
    angle = (angle + 360) % 360;

    // force into the minimum absolute value residue class, so that -180 < angle <= 180
    if (angle > 180)
        angle -= 360;

Proof (verify on http://why3.lri.fr/try/):

    module NormalizeAngle

      use int.Int
      use int.ComputerDivision

      let normalize (a: int) : int
        requires { true }
        ensures  { result > -180 /\ result <= 180 /\ mod (a - result) 360 = 0 }
      =
        let ref r = a in
        r <- mod r 360;
        r <- mod (r + 360) 360;
        if r > 180 then r <- r - 360;
        r

      let main () =
        normalize (270)

    end


This verifies that the return value of the function is indeed between -179 and 180 (inclusive) and, furthermore, that the result is congruent to the input modulo 360 (recall that a and b are congruent mod n if their difference is divisible by n), which means the output represents the same angle as the input.
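
As an informal cross-check separate from the Why3 proof, the same steps can be brute-force tested in Python. (Note that Python's % already returns a non-negative result for a positive modulus, so the + 360 step is redundant there, but it is kept to mirror the original code.)

    def normalize(angle):
        angle = angle % 360
        angle = (angle + 360) % 360
        if angle > 180:
            angle -= 360
        return angle

    for a in range(-1000, 1001):
        r = normalize(a)
        assert -180 < r <= 180      # i.e. -179 <= r <= 180 for integers
        assert (a - r) % 360 == 0   # congruent to the input mod 360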