Timing issue with datagrams

BGA · July 29, 2020, 11:26pm

I stumbled upon an issue that is hindering my progress on my game so I hope someone out there can provide some advice.

It seems there is some timing issue related to sending datagrams. Consider the program below:

#include <blinklib.h>
#include <shared/blinkbios_shared_functions.h>
#include <string.h>

#define TRANSFER_LEN 16
#define PROCESSING_DELAY 0

void setup() {}

byte datagram[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};

Timer wait_timer;

void Send(byte f) { sendDatagramOnFace(datagram, TRANSFER_LEN, f); }

void loop() {
  if (buttonSingleClicked()) {
    FOREACH_FACE(f) { Send(f); }
  }

  if (wait_timer.isExpired()) {
    FOREACH_FACE(f) {
      if (isValueReceivedOnFaceExpired(f)) continue;

      if (!canSendDatagramOnFace(f)) continue;

      if (getDatagramLengthOnFace(f) == 0) continue;

      const byte* data = getDatagramOnFace(f);

      if (memcmp(data, datagram, TRANSFER_LEN) != 0) {
        BLINKBIOS_ABEND_VECTOR(f);
      } else {
        setColorOnFace(GREEN, f);
      }

      markDatagramReadOnFace(f);

      Send(f);

      wait_timer.set(PROCESSING_DELAY);
    }
  } else {
    setColor(OFF);
  }
}

(Keep in mind this is just a hacked-up test case.)

What the program was supposed to do:

If you load it in 2 blinks and connect them, clicking on the button on one would start a datagram message to be sent back and forth between them.

What actually happens is that, after a while, it stops. It stops because one side sends a message that never reaches the other side.

Now, the interesting part is that if you increase the value of PROCESSING_DELAY, you will see it will take longer for the messages to stop. If the delay is big enough, the messages do not seem to stop at all.

I looked at the blinklib code but could not find anything obviously wrong but it might simply mean that the actual bug is inside BlinkBIOS itself.

Now, the even more interesting thing can be seen if you connect several blinks (say, 7, clicking on the center one). In this case the messages will stop again even with the delay so, whatever it is, it is affected by the number of faces you are trying to send data to in a single loop iteration.

@bigjosh?

BGA · July 29, 2020, 11:27pm

BTW, ignore the canSendDatagramOnFace(). I hacked blinklib so it would return true here if the length of the output buffer is 0, and false otherwise. It is not material to the issue I am seeing.

BGA · July 29, 2020, 11:33pm

Another thing: The delay required seems to be directly proportional to the size of the datagram being sent. Bigger ones require a bigger delay. A 1 byte datagram does not appear to require any delay (when sent to a single face).

BGA · July 31, 2020, 10:19pm

Ok, this was definitely a timing issue. There is an easy fix but I am not sure if this would cause some other issues with Blinks.

The actual problem is that if you try to send 16 byte datagrams on all 6 faces (which might happen when broadcasting a message), it takes around 400 ms (with high variability) for all of them.

I looked at the blinklib code expecting to find something that directly depended on faces not being expired, but I could not immediately find it. Still, the long transfer time looked suspicious next to TX_PROBE_TIME_MS (150 ms) and RX_EXPIRE_TIME_MS (200 ms), so I decided to give it a go and increased them to 400 and 450 respectively. Now even datagrams with 16 bytes are being correctly transferred to all faces!

Based on the code, it looks like if these timings are increased, the only side effect is blinks taking a bit longer to detect that a face is not connected, but I am not sure if there are any games that depend on that time not being higher than it is today. Thoughts?

@jbobrow @bigjosh

BGA · July 31, 2020, 10:27pm

One important thing I forgot to mention: the numbers I changed to work, but only for programs that do almost nothing else (my test program only sent and received datagrams).

My broadcast manager required me to up the timings to 550 and 600 to work reliably.

BGA · July 31, 2020, 11:15pm

One last thing and I will shut up until someone who knows more than I do about blinks' inner workings comments:

I am using 16 byte datagrams because it is the worst case scenario. But even 4 byte datagrams already start getting missed with the existing timeout. And, in fact, depending on what one's program is doing, even 3 bytes will be too much (I happen to be using up to 5 byte datagrams in my game).

bigjosh · August 2, 2020, 3:28am

Datagrams are not guaranteed to be delivered. It is part of their semantics. While you can potentially get a long string of successful deliveries on the blinks currently sitting on your desk with the current battery levels and the current firmware and the current temperature and light levels, if your code assumes this case then it violates the documented behavior and will be brittle when things change.

The idiomatic way to use datagrams under blinklib is to use idempotent messages that are controlled using state that is shared with the underlying setValueSentOnFace() values.

This is admittedly awkward. Ideally, I would have liked to have offered a setLargeValueSentOnFace( void *data, byte len) that worked exactly the same way that setValueSentOnFace() works just with more data, but this is impractical because of the way that blinklib does collision avoidance and neighbor presence detection, and it would generally be slow and wasteful sending all that big data over and over again. So we split the baby - but for the existing games that have used datagrams it seems to be an OK compromise.

If your game can live with the much higher latency for detecting a missing neighbor then you are certainly free to fork the blinklib and change timeouts, but again I think you are trying to fit a square peg into a round hole (that is in a square hole). If you are going to fork blinklib anyway, might as well get what you really want and do the packets (and neighbor detection) yourself.

(Note that if you kill all the blinklib coms you will be responsible for making sure that blinks that happened to not get button presses do not sleep prematurely - but you can almost certainly do this better yourself than the current method since you will be able to tailor your approach to how your specific game uses button presses. It is possible this will not even require extra bits in the communication if there is already information about game play events in the packets).

BGA · August 2, 2020, 4:35am

Replying from my phone, so I will be brief and will answer in full tomorrow. In this specific case, the issue seems to be directly related to a timeout. Do you have a theory about what exactly is happening?

bigjosh · August 2, 2020, 3:12pm

This is the documented and expected behavior. I promise.

// Each datagram sent is received at most 1 time.

github.com

bigjosh/Move38-Arduino-Platform/blob/main/cores/blinklib/blinklib.h#LC72:~:text=Datagram processing


// Value should be between 0 and IR_DATA_VALUE_MAX inclusive.
// If a value greater than IR_DATA_VALUE_MAX is specified, IR_DATA_VALUE_MAX will be sent.
// By default we power up with all faces sending the value 0.

void setValueSentOnFace( byte value , byte face );

// Same as setValueSentOnFace(), but sets all faces in one call.

void setValueSentOnAllFaces( byte value );

/* --- Datagram processing */

// A datagram is a set of 1-IR_DATAGRAM_MAX_LEN bytes that are atomically sent over the IR link
// The datagram is sent immediately on a best efforts basis. If it is not received by the other side then
// it is lost forever. Each datagram sent is received at most 1 time. Once you have processed a received datagram
// then you must mark it as read before you can receive the next one on that face. 

// Must be smaller than IR_RX_PACKET_SIZE

#define IR_DATAGRAM_LEN 16

This is also expected behavior. Let’s say that for your blinks on your desk with your battery levels, you would expect statistically that 1 in 1,000 datagrams will get dropped. If you are sending 100 datagrams per second (10ms between sends), I would expect to wait an average of 10 seconds between dropped datagrams. If you increase the delay to 100ms between datagrams, now I would expect to wait an average of 100 seconds between dropped datagrams.

The real effect is probably not so linear, but could explain what you are seeing qualitatively.
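The arithmetic above can be written down as a one-line model. This is only a sketch under the same simplifying assumption the post makes (every datagram is dropped independently with a fixed probability); `meanMsBetweenDrops` and its parameters are hypothetical names and are not part of blinklib.

```cpp
#include <cassert>

// Expected time between dropped datagrams, in milliseconds.
// ASSUMPTION: each datagram is dropped independently with probability
// dropProbability, and one datagram is sent every sendIntervalMs.
double meanMsBetweenDrops(double sendIntervalMs, double dropProbability) {
  assert(dropProbability > 0.0 && dropProbability <= 1.0);
  // On average 1/p datagrams go out per drop, one every sendIntervalMs.
  return sendIntervalMs / dropProbability;
}
```

With a 1 in 1,000 drop rate, a 10 ms interval gives about 10,000 ms (10 seconds) between drops and a 100 ms interval gives about 100,000 ms (100 seconds), matching the figures in the post.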

I would not consider this a “bug” - it is the explicitly documented behavior.

I can think of many possible physical and logical reasons that the chance of a datagram getting dropped would increase with the number of concurrent datagrams happening on other faces. I have spent many dozens of hours of my life staring at oscilloscope traces and logic analyzers looking at these cases!

But fundamentally according to Claude, there is no such thing as a ~~free lunch~~ reliable communication channel. The best we can ever do is pick how we’d like to trade-off speed, fidelity, latency, and complexity.

The logical communications channel between two touching blinks is a surprisingly noisy one. We are using LEDs meant only for transmitting also as receivers. There are no less than 6 air-to-polycarbonate interfaces that each extract their dB toll on every passing photon. There is a sub-$1 MCU running at a pitiful 8MHz that is solely responsible for managing the constant concurrent bidirectional communications on all these LEDs while also blinking the 18 visible LEDs fast enough to look like there is a range of brightnesses, and managing the charge pump, and keeping track of the button states, and monitoring the battery voltage… and running the game!

Like almost all modern stacks, the blinks’ network layer explicitly does not guarantee delivery. That is left to higher level protocols because doing it at the network layer would add complexity, latency, and non-determinism. Higher level protocols are a better place to make these decisions to suit their use cases. The blinklib transport layer uses redundancy rather than acknowledgement to ensure delivery of the data because this is simple and has good latency and low jitter, which is a good fit for many games.

If you need an ACK based transport layer then the best place to do this is directly on top of the network layer. The game downloading mechanism works like this - it uses a sequence-number-based request scheme with timeouts to ensure that the blocks are delivered and delivered in order. Alternatively you could also do a sliding window based system like TCP to deliver an in-order byte stream across the link. It all depends on what is the best fit for the problem you are trying to solve. Let’s figure out how to best get you the communication services you want rather than figuring out why the one that you’ve got is such a bad fit!
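To make the "sequence number plus timeout" idea concrete, here is a minimal receiver-side sketch. It is not the actual game download code, which is not shown in this thread; `BlockReceiver` and all of its members are hypothetical names, and time is passed in explicitly so the logic does not depend on any Blinks API.

```cpp
#include <cassert>
#include <cstdint>

// Receiver-driven transfer: ask for block `wanted`, ask again if nothing
// arrives within the timeout, and accept only the block that was asked for.
// ASSUMPTION: block numbers fit in one byte and the caller supplies the clock.
struct BlockReceiver {
  uint8_t wanted = 0;               // next block number still needed
  uint32_t requestedAtMs = 0;       // when the last request went out
  bool requestOutstanding = false;  // true while waiting for `wanted`

  // Returns true if a request for `wanted` should be (re)sent now.
  bool shouldRequest(uint32_t nowMs, uint32_t timeoutMs) {
    assert(timeoutMs > 0);
    if (requestOutstanding && nowMs - requestedAtMs < timeoutMs) return false;
    requestOutstanding = true;
    requestedAtMs = nowMs;
    return true;
  }

  // Returns true if the block was accepted. Duplicates and out-of-order
  // blocks are ignored, which keeps delivery in order.
  bool onBlock(uint8_t blockNumber) {
    if (blockNumber != wanted) return false;
    ++wanted;
    requestOutstanding = false;
    return true;
  }
};
```

Because the receiver drives the exchange, a lost request and a lost block look the same to it: the timeout expires and the same block number is requested again.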

BGA · August 3, 2020, 1:21am

Ok, datagrams are unreliable, but from my experience they can be made at least less unreliable.

Also, there is a problem with something you mentioned earlier. If someone tries to simply keep sending datagrams, that will prevent face values from being sent at all (as datagrams always take precedence). What about making sure datagrams and face values always get their share? Basically, if there is no datagram pending, send face values (as it is today). If there is a datagram pending, send it instead UNLESS a datagram was sent in the previous iteration. In this case send a face value. This will at least prevent starvation if someone tries to just do a datagram storm to try to get them to be delivered.
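The alternation rule proposed above is small enough to sketch directly. This is an illustration only: blinklib does not expose such a hook, and `Packet`, `lastWasDatagram`, and `nextPacket` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kFaceCount = 6;

enum class Packet { FaceValue, Datagram };

// HYPOTHETICAL per-face flag: was the previous transmission a datagram?
static bool lastWasDatagram[kFaceCount] = {};

// Decides what to transmit on `face` this iteration. A pending datagram
// goes out unless the previous transmission on this face was also a
// datagram, in which case a face value gets its turn.
Packet nextPacket(uint8_t face, bool datagramPending) {
  assert(face < kFaceCount);
  if (datagramPending && !lastWasDatagram[face]) {
    lastWasDatagram[face] = true;
    return Packet::Datagram;
  }
  lastWasDatagram[face] = false;
  return Packet::FaceValue;
}
```

Under a constant stream of pending datagrams this yields Datagram, FaceValue, Datagram, FaceValue on each face, so face values get at least every second slot.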

Now, as datagrams mostly work, I thought of a compromise: Instead of sending 1 datagram, I will send, say, 3 (or whatever number ends up being reasonable). Datagrams will carry a sequence number, so if a peer sees more than one with the same sequence number, it simply discards the repeats. This can still result in failures but then again, someone might just disconnect a blink from another anyway. For now, I will handle the starvation prevention on my side.
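The receiving half of that compromise could look roughly like the sketch below. It assumes the sender puts a one-byte sequence number in each datagram and repeats every datagram a few times with the same number; `acceptDatagram` and `lastAccepted` are hypothetical names, not blinklib functions.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kFaceCount = 6;

// Last sequence number accepted on each face; -1 means "nothing yet".
// HYPOTHETICAL bookkeeping kept by the game, not by blinklib.
static int lastAccepted[kFaceCount] = {-1, -1, -1, -1, -1, -1};

// Returns true if the datagram is new and should be processed, or false
// if it repeats the sequence number already accepted on this face.
bool acceptDatagram(uint8_t face, uint8_t sequence) {
  assert(face < kFaceCount);
  if (lastAccepted[face] == sequence) return false;  // redundant copy
  lastAccepted[face] = sequence;
  return true;
}
```

A real implementation would probably also reset the stored number when a face expires, so that a newly attached neighbor starting from the same sequence number is not mistaken for a repeat.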

bigjosh · August 3, 2020, 3:21am

I think this kind of flow control can only properly be done on the RX side since the sender could send a “forced” face value but it could get dropped - and in life the only thing that matters is what is received, not what is sent.

I think the (admittedly ugly & awkward) way to handle this is to use the face values to control the datagrams. That is, you send a datagram and then you wait until either (1) you get an ACK via face value or (2) you time out. This way there is never more than 1 datagram pending and you also leverage the speed and auto repeat of the face values system.
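The send-then-wait scheme described above is a stop-and-wait protocol, and its sender side can be sketched as a tiny state machine. This is an assumption-laden illustration: `StopAndWaitSender`, `TxState`, and `TxEvent` are hypothetical names, and the caller is expected to work out `ackSeen` itself (on a Blink, presumably by comparing `getLastValueReceivedOnFace(f)` with whatever value the game reserves as an ACK).

```cpp
#include <cassert>
#include <cstdint>

enum class TxState { Idle, WaitingForAck };
enum class TxEvent { None, Delivered, TimedOut };

// One instance per face. At most one datagram is ever pending.
struct StopAndWaitSender {
  TxState state = TxState::Idle;
  uint32_t sentAtMs = 0;
  uint32_t timeoutMs = 0;

  // Call right after handing a datagram to the link.
  // Returns false (and changes nothing) if one is already pending.
  bool start(uint32_t nowMs, uint32_t timeout) {
    assert(timeout > 0);
    if (state != TxState::Idle) return false;
    state = TxState::WaitingForAck;
    sentAtMs = nowMs;
    timeoutMs = timeout;
    return true;
  }

  // Call once per loop(). ackSeen says whether the neighbor's face value
  // currently signals that the datagram was received.
  TxEvent poll(uint32_t nowMs, bool ackSeen) {
    if (state != TxState::WaitingForAck) return TxEvent::None;
    if (ackSeen) {
      state = TxState::Idle;
      return TxEvent::Delivered;
    }
    if (nowMs - sentAtMs >= timeoutMs) {  // unsigned math survives wraparound
      state = TxState::Idle;
      return TxEvent::TimedOut;
    }
    return TxEvent::None;
  }
};
```

On `TimedOut` the game can choose to resend the same datagram or give up; the sketch leaves that policy to the caller.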

BGA · August 3, 2020, 8:25pm

This is more or less what my original guaranteed delivery implementation did:

github.com

brunoga/blinks/blob/8162d47503a85f94bf415957604c6d85eabd9f89/datagram/datagram.cpp


#include "datagram.h"

#include <string.h>  // For memcpy.

#include "payload_bytes.h"

// Use the last 6 available face values for ourselves, effectively making the
// maximum value a game using this can send on a face be 57.
#define DATAGRAM_SENT1 IR_DATA_VALUE_MAX
#define DATAGRAM_SENT2 (IR_DATA_VALUE_MAX - 1)
#define DATAGRAM_RETRY1 (IR_DATA_VALUE_MAX - 2)
#define DATAGRAM_RETRY2 (IR_DATA_VALUE_MAX - 3)
#define DATAGRAM_RECEIVED (IR_DATA_VALUE_MAX - 4)
#define DATAGRAM_DONE (IR_DATA_VALUE_MAX - 5)

namespace datagram {

// Keep track of pending faces.
static byte pending_ack_ = 0;

This file has been truncated.

Now I am not sure why I gave up on it for the datagram-only implementation. Maybe it was due to face starvation when I did not know that there was such a thing. Maybe I should revisit it.

BGA · August 5, 2020, 5:44am

Ok, I had another go at doing guaranteed delivery using face values for signaling. It seems to actually be working after I did 2 specific workarounds:

1 - Make sure I send a single datagram at each loop iteration. I can actually be smarter and do this based on size (1 byte datagrams do not need this). This is required to avoid timing out all connected faces at once on the next loop iteration.

2 - Only send a datagram to a face every other loop iteration. I am still trying to understand why I had to do that but it is definitely related to face starvation. I was expecting (1) would also solve this (as a datagram will only be sent every 6 loops to the same face) but it looks like datagrams are even more unreliable than I originally thought and the 6 iterations were still starving face values (verified by checking the face values a connected blink was getting).
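One way to read the two workarounds together is as a small scheduler: pick at most one face per loop iteration in round-robin order, and leave the iteration after each send free for face values. The sketch below is that reading only, not the code from the linked repository; `DatagramScheduler`, `pickFace`, and `kNoFace` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kFaceCount = 6;
constexpr int kNoFace = -1;

struct DatagramScheduler {
  uint8_t nextFace = 0;       // where the round-robin scan starts
  bool skipThisLoop = false;  // set after a send so the next loop is idle

  // pending[f] is true if face f has a datagram queued.
  // Returns the one face to send on this iteration, or kNoFace.
  int pickFace(const bool* pending) {
    assert(pending != nullptr);
    if (skipThisLoop) {  // workaround 2: rest for one iteration
      skipThisLoop = false;
      return kNoFace;
    }
    for (uint8_t i = 0; i < kFaceCount; ++i) {
      uint8_t face = static_cast<uint8_t>((nextFace + i) % kFaceCount);
      if (pending[face]) {  // workaround 1: only one face per loop
        nextFace = static_cast<uint8_t>((face + 1) % kFaceCount);
        skipThisLoop = true;
        return face;
      }
    }
    return kNoFace;
  }
};
```

In a sketch's `loop()` this would be called once per iteration, and the returned face (if any) is the only one passed to `sendDatagramOnFace()` during that iteration.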

But now I actually have what looks to be very reliable “guaranteed” delivery. In fact, the only reason it is not guaranteed is because someone might physically disconnect a blink and connect it somewhere else and THIS case I can not cover so there is no guaranteed delivery.

Anyway, this is considerably less code than my datagram-only solution used and is even working better (there is still a bug in the other implementation somewhere).

So, without further ado, here it is:

@bigjosh @jbobrow I sent a pull request for blinklib to add some functions that would both simplify this code and make it use considerably less memory. Please check it out when you can and let me know what you think.

BGA · August 6, 2020, 9:18pm

I guess this will never end

I now added a third implementation that is faster and smaller (code-wise) than the other 2. This should be what I will end up using as soon as I clean up the quirks. No face values used and no complex handshake for transfers.