YADCA (yet another distributed counting algorithm)

After seeing @gapMindful 's impressive distributed counting example (with really superb explanation as well!), I was inspired to try my own version to see how simple of an algorithm I could come up with. I was surprised at how difficult and interesting this problem turned out to be!

Features:

  • Can count up to 342 blinks in a group (limited by ability to display larger numbers)
  • Can count up to 62 blinks in any one branch (limited by the size of IR_DATA_VALUE_MAX
  • Does not use datagrams - only normal IR send value stuff
  • Generates count basically as quickly as packets can reach the edges of the group and return
  • Count increases in real time as more of group is discovered, stopping when it gets to the last edge
  • Count updates in real-time (limited only by one way propagation delay) to reflect additions and subtractions from the group
  • Memory usage is constant regardless of group size - currently only single timer plus one byte for state/parent pointer (the byte is overloaded).
  • Any tile can become the “master”
  • Builds the node tree in a wide, time-efficient way
  • Handles errant multiple masters gracefully (each gets a part of the total count, and sum is correct)
  • Handles spontaneous changes in topology gracefully (master count is always the number of connected blinks)
  • Handles loops seamlessly

Try the demo out! It should be robust to anything you can throw at it! LMK if you find any edge cases where it either gets stuck or reports a bad total count!

NOTES

Loops turned out to be the hardest wrinkle. I thought I came up with a really sweet way to basically build a spanning tree out from the master that would automatically find the shortest branches using a “hop” counter… but it failed miserably in cases where there was a loop in topology and you disconnected a master from the loop at just the right second so that the loop ended up feeding back on itself forever. I then tried adding a time-to-live to the packets and this worked, but it took an impractically long time for the TTL to expire going around the loop, forcing you to pick either waiting seconds for a loop to time out, or artificially limiting the longest branch allowed - and I did not like either of those.

This new code take a totally different approach. In it, each tile is is in either MASTER, CHILD, or IDLE mode. When in IDLE mode, it waits for a count request which puts it into CHILD mode (and it remembers which face it got the request from as its parent). In CHILD mode, (1) it sends a count request on each face besides its parent, and (2) reports back to it parent the total counts it has for all of its children plus itself. MASTER mode is basically the same as child mode except it does not have a parent so it sends request on all faces.

Locking our parent to the first request we get is important - this insures that no loops form. We will join the first branch that makes it to us from the master and ignore any others that arrive later.

Why do we need IDLE mode? This is important for decommissioning a master. When a master stops being a master, it starts sending IDLE messages and these percolate down though all the branches telling all the children to forget their current parent lock and go into IDLE mode. This must happen for at least as long as the prorogation time to the farthest away leaf or else a loop could develop. To insure this, there is a timer that forces a blink to stay in IDLE mode for at least 250ms (long enough for a very big tree!).

Future directions

It would be nice if the code dealt with overflows. If you have more than 342 blinks then I apologize for letting you down! :slight_smile:

Firstly, we could reserve a value for “too many downstream blinks to count” and this would percolate up the chain so the master would be able to show “more than [total number correctly counted]” which is at least correct if not precise. This is straightforward and only a few more lines of code.

It would be nice if the code at least tried to balance the tree when a branch got too long. This would probably require children to be able to report something like “BTW, I have other parents I could go to if you get too full”, and if a node did get full then it could IDLE one of those faces to force that child over to a different branch.

Finally it would be nice to make it be able to precisely deal with very large groups. This would likely require falling back to datagrams to carry the counts upstream. An unsigned long count should last us a couple years at least, depending on how blinks do next xmas season! :slight_smile:

At least the datagrams would only be in one direction - we could still use the normal face values for all the state control and discovery which makes things much faster. We’d also need to update the IDLE timeout to make sure it is long enough to propagate across the largest groups. How cool would that be to watch the count accumulate on a group of 10,000 blinks! :slight_smile:

4 Likes

This will likely prove very helpful for me while I explore my new idea. Thank you so much!

2 Likes