What exactly happens when an IR collision is detected?

BGA · August 6, 2020, 9:15pm

I assume a collision is when both ends try to transmit at the same time, correct? What happens when one is detected? Does it mean that neither dude will transmit or does it mean that one side ends up transmitting while the other fails?

bigjosh · August 7, 2020, 5:20pm

There are a few levels of collision detection and avoidance.

At the blinkBIOS level, when you call BLINKBIOS_IRDATA_SEND_PACKET_VECTOR to send a packet, it first checks to see if there is an RX in progress on that face. If there is, then the TX request will fail (and tells you so) so you can try again. The in-progress RX continues unaffected.
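A rough sketch of how calling code might react to a refused send. `SendFn` and `PendingSend` are hypothetical names made up for illustration; the real signature of the BIOS call is not shown in this thread, so the only thing assumed is the convention discussed here: 0 means the TX was refused because an RX was in progress, non-zero means the packet was handed to the transmitter.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the BIOS send call: returns 0 when the TX was
// refused because an RX was in progress on that face, non-zero otherwise.
using SendFn = std::function<uint8_t(uint8_t face, const uint8_t* data, uint8_t len)>;

// Holds one pending packet and retries it on later loop passes.
struct PendingSend {
  const uint8_t* data = nullptr;
  uint8_t len = 0;
  bool pending = false;

  void queue(const uint8_t* d, uint8_t l) {
    data = d;
    len = l;
    pending = true;
  }

  // Call once per loop iteration. Returns true when the packet was handed to
  // the transmitter on this pass (which still does not guarantee delivery).
  bool service(uint8_t face, const SendFn& try_send) {
    if (!pending) return false;
    if (try_send(face, data, len) == 0) return false;  // RX in progress: retry later
    pending = false;
    return true;
  }
};
```

The in-progress RX is left alone; the caller simply keeps the packet around and tries again on a later pass.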

How does it know that an RX is in progress? It treats an RX as in progress if it has successfully received a sync on that face and has not yet received a stop bit. So there is still the possibility that two faces could have an undetected collision if they both start transmitting within the window of time that a sync takes. This is a very narrow window (currently about 1/2 a millisecond), but since a silent collision is still possible, higher levels must be able to handle a lost packet. Note that when this happens, neither side knows that the other side is transmitting, so they both blindly continue with their transmit to completion and both packets are lost. In theory, it would have been possible to detect this case and abort one or possibly both sides of the TX, but it was not worth the extra code space, since this should be very rare and the best possible benefit would be to save a bit of time before the retransmit attempt… and really you want a random delay here anyway, since if both sides just started the next TX immediately after the abort they would potentially just collide again.
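To make the "sync seen, no stop bit yet" rule concrete, here is a toy model of the per-face state. The names `FaceRx` and `may_start_tx` are invented for this sketch and are not blinkBIOS identifiers; the real implementation is interrupt-driven and more involved.

```cpp
#include <cstdint>

// Illustrative receiver state for one face (hypothetical, not blinkBIOS code).
struct FaceRx {
  bool sync_seen = false;

  void on_sync() { sync_seen = true; }       // start of an incoming packet
  void on_stop_bit() { sync_seen = false; }  // end of the incoming packet

  // "RX in progress" means: sync received, stop bit not yet received.
  bool rx_in_progress() const { return sync_seen; }
};

// A send is refused only while an RX is known to be in progress.
inline bool may_start_tx(const FaceRx& rx) { return !rx.rx_in_progress(); }
```

If both sides start sending before either has seen the other's sync, `may_start_tx` is true on both ends at once. That is the silent collision window described above.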

The blinklib also implements its own synchronous collision avoidance that is extremely effective and is optimized for minimum and predictable latency across the link with small data sizes (good for many games). When two blinks are touching each other, they are continuously doing an IR ping-pong where one blink sends its face value, and then as soon as the other side gets this packet, it goes ahead and transmits its face value, and then back again, etc. See how that avoids collisions? There is also a slow timeout here to catch the case when a packet is lost, or to kickstart the process when a new blink shows up on a face. In practice, this logical ping-pong almost completely eliminates physical layer collisions.
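The ping-pong can be illustrated with a small tick-based simulation. This is a toy model under simplifying assumptions, not blinklib's code: time advances in abstract ticks, a packet sent on one tick arrives on the next, and the `timeout` value is an arbitrary number of ticks rather than the real timer.

```cpp
#include <cstdint>

struct Blink {
  uint32_t next_timeout;  // tick at which we send anyway if nothing has arrived
  bool got_packet;        // true when the neighbor's packet arrived last tick
};

struct Result {
  int sends_a;
  int sends_b;
  int collisions;
};

// Each side sends when it has just received a packet, or when its slow timeout
// expires. If both sides send on the same tick, both packets are lost.
Result simulate(uint32_t ticks, uint32_t timeout, uint32_t a_start, uint32_t b_start) {
  Blink a{a_start, false}, b{b_start, false};
  Result r{0, 0, 0};
  for (uint32_t t = 0; t < ticks; ++t) {
    bool a_sends = a.got_packet || t >= a.next_timeout;
    bool b_sends = b.got_packet || t >= b.next_timeout;
    a.got_packet = b.got_packet = false;
    if (a_sends) { ++r.sends_a; a.next_timeout = t + timeout; }
    if (b_sends) { ++r.sends_b; b.next_timeout = t + timeout; }
    if (a_sends && b_sends) { ++r.collisions; continue; }  // both packets lost
    if (a_sends) b.got_packet = true;
    if (b_sends) a.got_packet = true;
  }
  return r;
}
```

Once one side's timeout has kickstarted the exchange, the two sides strictly alternate in this model, so the timeout never fires again and no tick has both sides transmitting.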

Does this answer the question?

BGA · August 7, 2020, 6:12pm

This was the actual answer to my question. Thanks! That is really useful information for what I will be doing.

Ah! So THIS is the actual reason for packets to get lost, correct? Ignoring this, if BLINKBIOS_IRDATA_SEND_PACKET_VECTOR returns non-zero then the packet should have been received at the other side, correct?

While we are at it, does the return value from BLINKBIOS_IRDATA_SEND_PACKET_VECTOR have any other meaning or is it really a boolean (it only ever returns 0 or 1)?

Understood. Would it be feasible (and less costly code-size-wise) to not try to abort the transfers but simply let them continue BUT return 0 from BLINKBIOS_IRDATA_SEND_PACKET_VECTOR? Because, in the end, for client code what happened was a collision even if the lower layers just kept going. In any case, I can live with the current status quo, especially with the insight that if BLINKBIOS_IRDATA_SEND_PACKET_VECTOR returns non-zero, then the likelihood of the packet having been transmitted is high. This allows optimization for the success case while dealing with failure in a side channel (the timer in the current blinklib implementation).

This is actually what I am doing now and the reason why I am working on doing direct sends instead of using the blinklib support. It looks to me like blinklib was optimized for face values and it probably worked REALLY well until datagrams came into play. The collision avoidance in blinklib simply does not work for big datagrams, as at every send attempt the timers will have expired.

What I am doing is flipping a coin (currently a 50/50 chance) on both ends on every loop iteration after a collision until data goes through. Because the loops in different Blinks do not run in lockstep (which is obviously a good thing), this “sending on every other loop iteration on average” will need to be adapted a bit (especially considering each side might send datagrams with different sizes) but, at least, it does help a bit with collisions.
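A minimal sketch of the 50/50 retry idea, as a toy model in which both sides retry in the same rounds (real Blinks do not run in lockstep, as noted above). `Rng` and `rounds_until_clear` are hypothetical names for this example; the PRNG is a plain xorshift, not whatever the actual code uses.

```cpp
#include <cstdint>

// Small xorshift PRNG; the state must be seeded with a non-zero value.
struct Rng {
  uint32_t s;
  uint32_t next() {
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
  }
  bool coin() { return (next() >> 16) & 1u; }  // roughly 50/50
};

// Both sides have a packet pending and flip a coin each round. A round clears
// the collision when exactly one side transmits. Returns the round number of
// the first such round, or -1 if none occurred within max_rounds.
int rounds_until_clear(Rng& a, Rng& b, int max_rounds) {
  for (int round = 1; round <= max_rounds; ++round) {
    bool a_sends = a.coin();
    bool b_sends = b.coin();
    if (a_sends != b_sends) return round;  // exactly one side sent
  }
  return -1;  // both sent or both stayed silent in every round
}
```

With an independent fair coin on each side, the chance that exactly one side sends in a given round is 2 × 0.5 × 0.5 = 0.5, so it takes two rounds on average. The independence matters: if both sides happen to start from the same seed, they make identical choices forever and never clear.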

BTW, I am trying to avoid timers as much as possible, but I will use them if what I come up with does not work reliably.

Yep, I saw that and, as I mentioned, if all we had were face values this would work amazingly well. But datagrams introduce 2 things:

1 - They always take precedence over face values, so faces might expire.
2 - Bigger datagrams take considerably more time to transmit than the single-byte face values, and this results in the send timer not being an effective mechanism for flow control.

I guess I mentioned this before but I would do a few things to sort this (and this is more or less what I am doing in my code):

1 - Reduce maximum datagram size by 1 byte (btw, this has the added benefit of the new size fitting in 4 bits, so one can even save some memory space in the internal face_t struct by putting incoming and outgoing length in a single byte).
2 - Use that extra byte in the datagram payload to hold the face value.

With this, you can now either send a single face value as it is done today if there is no datagram, or send a datagram AND the face value in a single transfer if a datagram is pending. This both reduces the number of transfers needed and eliminates the face value starvation that currently exists, at the cost of making an average datagram transfer one byte larger. The benefits outweigh the costs here. I have not measured the code-size cost yet, but I will.
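Items 1 and 2 could look roughly like this. It is a sketch under assumptions: the 15-byte limit is simply the largest length that fits in 4 bits, the face value is placed in the last byte (it could just as well go first), and all names here are made up rather than taken from blinklib.

```cpp
#include <cstdint>
#include <cstring>

// Assumed limit: a 4-bit length field can describe payloads of 0..15 bytes.
constexpr uint8_t kMaxDatagramLen = 15;

// Two 4-bit lengths in one byte: incoming in the high nibble, outgoing in the low.
inline uint8_t pack_lengths(uint8_t in_len, uint8_t out_len) {
  return static_cast<uint8_t>(((in_len & 0x0F) << 4) | (out_len & 0x0F));
}
inline uint8_t incoming_len(uint8_t packed) { return static_cast<uint8_t>(packed >> 4); }
inline uint8_t outgoing_len(uint8_t packed) { return static_cast<uint8_t>(packed & 0x0F); }

// Builds one packet laid out as [datagram bytes...][face value]. With len == 0
// this degenerates to the plain single-byte face value packet.
// `out` must have room for len + 1 bytes. Returns the packet size, or 0 if
// the datagram is too long.
inline uint8_t build_packet(uint8_t* out, const uint8_t* datagram, uint8_t len,
                            uint8_t face_value) {
  if (len > kMaxDatagramLen) return 0;
  if (len > 0) std::memcpy(out, datagram, len);
  out[len] = face_value;
  return static_cast<uint8_t>(len + 1);
}
```

The receiver can tell the two cases apart by packet size alone: a one-byte packet is just a face value, anything longer is a datagram followed by the face value.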

3 - Get rid of the send timer if at all possible and use loop iterations as a proxy. This would save some more memory at the cost of needing some tuning (based on transfer size). Note that, assuming this works, you can probably reduce the 200 millisecond timeout to something smaller on average.
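For item 3, a loop-count timeout could be as small as the sketch below. `LoopTimeout` is a hypothetical helper, and the number of loops to wait is exactly the tuning knob mentioned above; nothing here says what a good value is.

```cpp
#include <cstdint>

// Counts loop() passes instead of reading a millisecond clock.
struct LoopTimeout {
  uint8_t remaining = 0;

  void start(uint8_t loops) { remaining = loops; }

  // Call exactly once per loop iteration. Returns true on the pass where the
  // countdown reaches zero, and on every later pass until restarted.
  bool tick() {
    if (remaining > 0) --remaining;
    return remaining == 0;
  }
};
```

This needs a single byte of state per face, but the wall-clock time it represents depends on how long each loop pass takes, so it only approximates a real timer.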

What do you think? Should I send a pull request?

My datagram3 implementation will include all this soon as a proof of concept so you can better evaluate how all of this would work. As I am doing direct sends, the blinklib datagram path is never exercised. This is good because it seems face values are what it does best anyway, but it is bad because there might then be collisions between face values (not in my control) and datagrams, so collision avoidance is not as straightforward.