[feature request] ECC for SPI transfer

timschneider

The current SPI protocol implementation already accounts for some transfer errors and will retry up to Settings.MaxSpiRetries times.
The retry mechanism can fail without any retry, e.g. when bit errors occur at several positions in the header or data transfer.
Moreover, in a noisy environment a single bit error in every transfer can make the transfer fail more than Settings.MaxSpiRetries times in a row.
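
To put a rough number on the noisy-environment case, here is a minimal sketch that estimates how likely it is that every attempt of a transfer is hit by at least one bit error. It assumes independent bit errors with a fixed bit error rate, which real SPI noise may not follow. The function names and the example values (frame size, number of attempts) are hypothetical and not taken from DSF.

```python
def frame_error_probability(bit_error_rate: float, frame_bits: int) -> float:
    """Probability that a frame of `frame_bits` bits contains at least one bit error,
    assuming independent bit errors (a simplification)."""
    return 1.0 - (1.0 - bit_error_rate) ** frame_bits


def all_attempts_fail_probability(bit_error_rate: float, frame_bits: int, attempts: int) -> float:
    """Probability that `attempts` consecutive transfers of the same size are all corrupted."""
    return frame_error_probability(bit_error_rate, frame_bits) ** attempts


if __name__ == "__main__":
    # Hypothetical values: 2 KiB frames and 3 attempts, for illustration only.
    for ber in (1e-9, 1e-6, 1e-4):
        p = all_attempts_fail_probability(ber, frame_bits=2048 * 8, attempts=3)
        print(f"bit error rate {ber:g}: all attempts fail with probability {p:.3e}")
```

With a retry-only scheme, a frame is lost as soon as every attempt contains at least one flipped bit, so the failure probability grows quickly with frame size and bit error rate.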

The code is quite complex.

ExchangeHeader()
https://github.com/Duet3D/DuetSoftwareFramework/blob/079e0158c757f86d954902e409eb3c89fc7e8197/src/DuetControlServer/SPI/DataTransfer.cs#L1385

ExchangeData()
https://github.com/Duet3D/DuetSoftwareFramework/blob/079e0158c757f86d954902e409eb3c89fc7e8197/src/DuetControlServer/SPI/DataTransfer.cs#L1567C14-L1567C14

To make the SPI transfer more robust, I propose implementing some form of error-correcting code (ECC).

For example, a simple Hamming code extended with an overall parity bit (SECDED) provides single-bit error correction and 2-bit error detection.
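
As an illustration of what such a code does, here is a minimal sketch of an extended Hamming(8,4) code: 4 data bits, 3 Hamming parity bits and 1 overall parity bit. It is not the DSF or RRF wire format; the function names and the bit layout are assumptions chosen for readability, and a real implementation would likely use a wider code over whole words.

```python
from itertools import combinations


def hamming84_encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("nibble must be in range 0..15")
    bits = [0] * 8  # index 1..7: Hamming positions, index 0: overall parity
    bits[3], bits[5], bits[6], bits[7] = ((nibble >> i) & 1 for i in range(4))
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    for pos in range(1, 8):
        bits[0] ^= bits[pos]               # overall parity over positions 1..7
    return sum(bit << i for i, bit in enumerate(bits))


def hamming84_decode(codeword: int):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(codeword >> i) & 1 for i in range(8)]
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(bits):
        overall ^= bit
        if bit:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[syndrome] ^= 1  # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity but non-zero syndrome: 2 errors
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = hamming84_encode(0b1011)
    print(hamming84_decode(word))           # no error
    print(hamming84_decode(word ^ 0x20))    # one flipped bit
    a, b = next(combinations(range(8), 2))
    print(hamming84_decode(word ^ (1 << a) ^ (1 << b)))  # two flipped bits
```

The overall parity bit is what distinguishes a correctable single-bit error from a double-bit error; without it, a double error would be silently "corrected" into a wrong value.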

In order not to lose speed, ECC should be enabled for every header but disabled by default for the data exchange (CRC is sufficient there).
If the CRC check fails for the first time, both ends fall back to ECC for the data transfer as well.
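
The proposed fallback can be sketched as a small per-link state: headers always use ECC, data starts with CRC only and switches to ECC after the first CRC mismatch. This is only an illustration of the idea. The class and method names are hypothetical, `zlib.crc32` stands in for whatever checksum the real protocol uses, and how both ends agree on the switch (for example a flag in the ECC-protected header) is left open.

```python
import zlib
from dataclasses import dataclass


@dataclass
class LinkState:
    """Tracks which protection mode one end of the link uses (hypothetical sketch)."""
    ecc_for_data: bool = False  # data starts in CRC-only mode

    def header_mode(self) -> str:
        # Headers are small, so they are always ECC-protected in this proposal.
        return "ecc"

    def data_mode(self) -> str:
        return "ecc" if self.ecc_for_data else "crc"

    def on_data_received(self, payload: bytes, received_crc: int) -> bool:
        """Check the CRC of a data block; switch to ECC on the first mismatch."""
        ok = zlib.crc32(payload) == received_crc
        if not ok:
            self.ecc_for_data = True  # stays enabled for subsequent data transfers
        return ok


if __name__ == "__main__":
    link = LinkState()
    payload = b"G1 X10 Y10"
    print(link.data_mode())                                  # crc
    link.on_data_received(payload, zlib.crc32(payload))      # good block
    link.on_data_received(payload, zlib.crc32(payload) ^ 1)  # corrupted CRC
    print(link.data_mode())                                  # ecc
```

Keeping the switch one-way keeps the sketch simple; a real implementation might instead drop back to CRC-only mode after a number of clean transfers.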

@chrishamm can tell whether speed is a problem in the SPI transfer; if not, ECC should be enabled for every transfer, but this may interfere with the use of DMA.

The background: I often run print jobs for more than a few days, but over such a long period it is very likely that the SPI will reset mid-print in an SBC setup. I do not have this kind of failure in standalone mode: I can run the Duet in standalone mode for months without any error, which is not the case in SBC mode.

For reference:
https://forum.duet3d.com/topic/34460/multiple-print-failures/15?_=1704373031244
https://forum.duet3d.com/topic/34315/rff-3-5-0-rc1-spi-reset-mid-print

chrishamm

@timschneider I don't really see why the retry mechanisms can fail without any retry. I did extensive tests with all sorts of different transfer errors that could be caused by RRF or DSF, and I can say that the CRC-based error detection/recovery worked well when I did. Also note that we have several SBC customers that have been printing 24/7 for months without any problems, so your report really sounds like an issue specific to your setup. I do remember SPI communication issues with a RockPi that I did not see with a RaspPi, so comparing the two might be worthwhile as well.

ECC may help in your scenario, but implementing and testing it sufficiently doesn't seem like a quick change to me. Also, there are several more urgent things I need to take care of at this point, but of course I'd be happy to accept a PR if you fancy implementing and testing it yourself.

timschneider

@chrishamm
I'll check the Raspberry Pi vs RockPi hint. And maybe you are right and my crashes are not related to the SPI bus - I'll check that.