Tuesday, June 28, 2022

Proof of algorithm to detect cycles in a directed graph

I was writing my own wiki software and needed a way to detect tag cycles e.g. when tag X has tag Y and tag Y has tag Z and tag Z has tag X. This should not be allowed. So I needed an algorithm to detect this kind of error.

Here is a simple implementation of depth first search to produce a post-order traversal:

    def postorderTraversal(self, root):
        result = []
        def dfs(node, result):
            for child in [node.left, node.right]:
                if child is not None:
                    dfs(child, result)
            result.append(node.val)
        if root is not None:
            dfs(root, result)
        return result

As you can see, this algorithm is recursive which isn't very good for large graphs. 

So here is the iterative version using a stack:

    def postorderTraversal(self, root):
        if root is None:
            return []
        result = []
        visited = set()
        stack = [root]
        while stack:
            node = stack[-1]
            if node in visited:
                stack.pop()
                result.append(node.val)
            else:
                visited.add(node)
                children = [node.left, node.right]
                children.reverse() # needed for left-to-right traversal
                stack.extend([child for child in children if child is not None])
        return result
    

Property 1: After each node is pushed onto the stack, its children are then pushed onto the stack. This means that a node cannot be popped until its children are first popped. And its children cannot be popped until their children are popped and so on. This means that when a node is visited the second time it means its children have already been popped off which means all of its descendants have been visited.


    def canFinish(self, numCourses, prerequisites):
        global has_cycle        
        has_cycle = False
        children = defaultdict(list)
        
        for edge in prerequisites:
            children[edge[0]].append(edge[1])
        
        color = defaultdict(int)
        def dfs(node):
            global has_cycle
            if color[node] == 2:
                return
            if color[node] == 1: # 1 means GREY
                has_cycle = True
                return
            
            color[node] = 1
            for child in children[node]:
                dfs(child)
            color[node] = 2 # 2 means BLACK
        
        for node in range(numCourses):
            dfs(node)
            
        return not has_cycle
            

The above is recursive code for detecting cycles in a directed graph. 

Idea:

- We do DFS on every node in the graph.

- When we start DFS on a node, we mark it GREY. 

- When we finish DFS on a node, we mark it BLACK.

- We visit every child.

- If when we visit a node, it is GREY, then we have a cycle. 

- We ignore black nodes.

Informal proof of correctness:

- A node is BLACK if all of its descendents have been visited and none of them are part of any cycle therefore a BLACK node cannot be part of any cycle, therefore they do not need to be revisited.

- A node is GREY if we have NOT finished DFS on it, meaning it must be on the recursion stack. Proof: if we have finished DFS on it, then it would have been marked BLACK. Therefore, if it's GREY then it means we haven't finished DFS on it which means that it must be in the recursion stack.

- If we revisit a GREY node then it means we are revisiting a node already on the recursion stack which means there is a cycle.

- If we do not revisit a GREY node then it means there is no cycle. Proof: if there is a cycle then that means there is something like this: X->A->B where X is reachable from B. Now if we do DFS then we will visit all the edges from the node and all of the edges from all of its descendants. This means if A is reachable from the node then we will visit the edge A->B, and since X is reachable from B we will also visit the edge X->A. This means we will visit A twice, first from the starting node, and then from X. In general if there is a cycle reachable from the starting node then if we do DFS following all the edges then we will definitely revisit a node that we've already visited. Therefore if we do the DFS and do not revisit any node then that means there isn't a cycle.

Here is the iterative version:

    def canFinish(self, numCourses, prerequisites):
        global has_cycle        
        has_cycle = False
        children = defaultdict(list)
        
        for edge in prerequisites:
            children[edge[0]].append(edge[1])
        
        color = defaultdict(int)
        
        for node in range(numCourses):
            # do dfs on node
            stack = [node]
            while stack:
                node = stack[-1]
                if color[node] == 2: # if already visited, don't visit it again
                    stack.pop()
                    continue # ignore black nodes
                if color[node] == 1: # we finished visiting all of its children
                    stack.pop()
                    color[node] = 2
                else:
                    color[node] = 1
                    for child in children[node]:
                        if color[child] == 1: # found a cycle
                            return False
                        if color[child] == 2: # don't visit it
                            continue
                        stack.append(child)
        return True

Properties:

  • DFS is simulated using the stack and while loop.
  • The stack contains the list of "nodes to visit".
  • The first time we "visit" a node, we mark it GREY.
  • The second time we "visit" a node, we mark it BLACK (we first check if it's GREY, if so that means we've already visited it).

Informal proof of correctness:

  • Base case: when a node is a leaf i.e it has no children, then when it is visited that means it is at the top of the stack, so we visit it once and mark it GREY then we push its children onto the stack (it has no children) so then we immediately revisit it and we mark it BLACK and pop it off the stack. In this case the property holds: we popped it off when we have visited all of its children (it has none). 
  • Inductive step: when a node is a non-leaf i.e it has children. If a node has only children who are leaves, we push all of its children onto the stack and then we pop off each of the children (as per the previous point), marking each of them BLACK, until we get back to the node. At this point we have visited all of its children. So the property again holds. So we have that each node is only revisited and popped off once all of its descendants have been popped off.
  • We want to prove that a node is popped off when and only when all of its descendants have been visited.
  • A node is popped off when all of its descendants have been popped off. This is because when we "visit" a node for the first time, we push all of its descendants onto the stack. We only pop off a node when we visit it for the second time. In order for us to visit it for the second time, we must have popped off all of the nodes on top of it in the stack. Since when we visit a node we push all of its children on the stack, that means we only pop off the node when we pop its children off the stack first. And this property holds for its children and its children's children and so on. Therefore, when we pop off a node, that means we've popped off not only its children but also its children's children and so on. Thus, when we pop off a node, that means we've already popped off all of its descendants from the stack i.e all of the nodes reachable from that node.