Chasing bad SPI checksums.



  • Duet 3 hardware and software. 6Hc base + one Expander. I'm seeing a few of these, now and then:

    Nov 07 18:55:24 duet3 DuetControlServer[351]: [warn] Bad data checksum (expected 0x3c1a, got 0x3037)
    Nov 07 18:55:24 duet3 DuetControlServer[351]: [warn] Note: RepRapFirmware didn't receive valid data either (code 0x00000000)
    Nov 07 18:55:24 duet3 DuetControlServer[351]: [warn] Restarting transfer because RepRapFirmware received a bad response (data content)
    Nov 07 19:01:57 duet3 DuetControlServer[351]: [warn] Bad data checksum (expected 0x15ce, got 0x1e0a)
    Nov 07 19:05:50 duet3 DuetControlServer[351]: [warn] Bad header checksum (expected 0x0000, got 0x1760)
    Nov 07 19:05:50 duet3 DuetControlServer[351]: [warn] Note: RepRapFirmware didn't receive valid data either (code 0x0001005f)
    Nov 07 19:05:50 duet3 DuetControlServer[351]: [warn] Bad header checksum (expected 0x0000, got 0x1760)
    Nov 07 19:05:50 duet3 DuetControlServer[351]: [warn] Note: RepRapFirmware didn't receive valid data either (code 0x0001005f)
    Nov 07 19:05:50 duet3 DuetControlServer[351]: [warn] Bad header checksum (expected 0x0000, got 0x1760)
    Nov 07 19:05:50 duet3 DuetControlServer[351]: [warn] Note: RepRapFirmware didn't receive valid data either (code 0x0001005f)
    Nov 07 19:05:50 duet3 DuetControlServer[351]: [warn] Restarting transfer because the number of maximum retries has been exceeded
    

    Are these on the CAN bus? Or are the between the SBC (Pi 4 in this case) and the main board? Or somewhere else?


  • administrators

    They are between the Pi and the Duet.


  • administrators

    @Danal what speed are you running the SPI bus at?



  • Hmmm... to all of my reading, this is determined by the application at the "open" of the connection.

    The app specifies this as a divisor of the system clock frequency. That means two things that we should think about:

    1. This is an RPI 4, and therefore may operate at a higher clock speed than other SBCs that have been tested with the Duet 3 6HC. Have you guys used an R4?

    2. When the System-on-a-Chip manages thermals, it can vary the clocks without notice. This does affect the "divided down" SPI speed. And, it can occur in the middle of an SPI transaction. In theory, given that SPI uses explicit clocking, this shouldn't hurt anything. In practice? There seems to be lots of discussion around this and no firm conclusions.



  • And... I found this in the config.json for the SPI device driver in the duetsoftwareframework:

     "SpiDevice": "/dev/spidev0.0",
     "SpiFrequency": 2000000,
     "SpiTransferTimeout": 500,
     "MaxSpiRetries": 3,
     "SpiPollDelay": 25,

  • administrators

    @Danal said in Bad checksums - Which connection?:

    Hmmm... to all of my reading, this is determined by the application at the "open" of the connection.

    The app specifies this as a divisor of the system clock frequency. That means two things that we should think about:

    1. This is an RPI 4, and therefore may operate at a higher clock speed than other SBCs that have been tested with the Duet 3 6HC. Have you guys used an R4?

    2. When the System-on-a-Chip manages thermals, it can vary the clocks without notice. This does affect the "divided down" SPI speed. And, it can occur in the middle of an SPI transaction. In theory, given that SPI uses explicit clocking, this shouldn't hurt anything. In practice? There seems to be lots of discussion around this and no firm conclusions.

    I use a RPI4 on my Toolchanger and i beleive a few other users are using them as well. I do have a heatsink case on mine though.

    Yes in config.json the SPI speed is set. 2Mhz is not that high - it is capable of much higher - but you might want to try 1Mhz to see if it makes any difference to the CRC errors. Also normal things like keeping the ribbon cable length short and not too close to other noise emitters. Also i assume you are using up todate DSF/RRF?



  • The ribbon cable is the one in the box with the 6HC. The RPI is mounted "alongside" (I'll get a photo in a few min).

    Up to date: sudo apt-get update followed by upgrade does not upgrade anything.

    I am on the duet3d.list; should I be on duet3d-unstable.list?


  • administrators

    @Danal I think that stable and unstable match right now. Unstable gives me:

    duetcontrolserver 1.1.0.5
    duetruntime 1.1.0.5
    duetsd 1.0.4
    duetsoftwareframework 1.1.0.5
    duettools 1.1.0.5
    duetwebcontrol 2.0.4-1
    duetwebserver 1.1.2.1

    The ribbon cable out the box is short and the mounting scheme is as i have it so that all looks good.



  • I will give 1000000 a try, for debugging if nothing else.



  • And, I do have a restart that occured about an hour into a multi hour print.

    Edit: There are a few bad checksums before the restart; at the same time, there have been similar messages that did not cause a restart.

    reset-subset.txt


  • administrators

    so you got a CRC arror that caused a reset? of what RRF, DCS or? Hopefully @chrishamm can investigate


  • administrators

    @Danal It looks like your Duet 3 restarted/crashed before the print was cancelled:

    Nov 08 00:41:17 duet3 DuetControlServer[351]: [warn] Controller has been reset

    What RRF version are you using, can you share the output of M122?



  • Just a short one from me: I have it running with a RPi4 at ~20.83MHz (set it to 21MHz and it will be get to the actual value through core clock divisor) stable. But I also have fixed my core clock in /boot/config.txt to not have those uncontrollable frequency changes:

    core_freq=250
    core_freq_min=250
    


  • @chrishamm said in Bad checksums - Which connection?:

    @Danal It looks like your Duet 3 restarted/crashed before the print was cancelled:

    Nov 08 00:41:17 duet3 DuetControlServer[351]: [warn] Controller has been reset

    What RRF version are you using, can you share the output of M122?

    Board: Duet 3 MB6HC (MB6HC)
    DSF Version: 1.1.0.5
    Firmware: RepRapFirmware for Duet 3 MB6HC v0.6 or 1.0 3.0beta12 (2019-11-02b1)
    Duet Web Control 2.0.4

    Unfortunately, I did not obtain M122 prior to a power cycle.



  • @wilriker said in Bad checksums - Which connection?:

    Just a short one from me: I have it running with a RPi4 at ~20.83MHz (set it to 21MHz and it will be get to the actual value through core clock divisor) stable. But I also have fixed my core clock in /boot/config.txt to not have those uncontrollable frequency changes:

    core_freq=250
    core_freq_min=250
    

    I was thinking about trying that, as the Pi4 is plenty fast enough.

    Thanks for the detailed settings, I will do that next.


  • administrators

    @Danal said in Bad checksums - Which connection?:

    core_freq=250
    core_freq_min=250

    @wilriker are there any downsides to this approach?


  • administrators

    @T3P3Tony I also see from the documentation:

    For Pi 4B the default value is 500, 600 is the only other accepted value;



  • @T3P3Tony said in Bad checksums - Which connection?:

    @Danal said in Bad checksums - Which connection?:

    core_freq=250
    core_freq_min=250

    @wilriker are there any downsides to this approach?

    CPU scales independently from the core clock and I could not find what else this has an influence on. So far I have not encountered any downsides.

    Re 500 and 600: that's for overclocking. But you can always underclock. Confirmed on mine.

    I will post some more details findings later. In short SPI clock is only about 40% of what your set it to on RPi 4 unless you fix the core clock.


  • administrators

    @T3P3Tony said in Bad checksums - Which connection?:

    core_freq=250
    core_freq_min=250

    but i i want to be greedy, don't care about power, and have a heatsink i would potentially fix this at 500 or 600?

    I just checked (vcgencmd measure_clock arm) and i am running at 600 right now (probably due to using VNC with a big screen resolution which is quite processor intensive i guess).

    might test:

    core_freq=600
    core_freq_min=600
    

    Also what is not clear is what the actual difference between doing that and using force_turbo=1
    is



  • @T3P3Tony said in Bad checksums - Which connection?:

    @T3P3Tony said in Bad checksums - Which connection?:

    core_freq=250
    core_freq_min=250

    but i i want to be greedy, don't care about power, and have a heatsink i would potentially fix this at 500 or 600?

    I don't think this brings very much improvement but yes, you could fix that at 500 (I would not overclock something I don't even know what it all influences).

    I just had a thought that it might influence RAM clock but I have no evidence to back that up.

    I just checked (vcgencmd measure_clock arm) and i am running at 600 right now (probably due to using VNC with a big screen resolution which is quite processor intensive i guess).

    That's the CPU clock you checked. So although you are running VNC it's still stepped down to it's minimum speed. 😁 It's the clock core you were looking for.

    Also what is not clear is what the actual difference between doing that and using force_turbo=1
    is

    force_turbo=1 will do some more things, some of them being very sketchy when it comes to system stability - but it also blows an "efuse" (i.e. it sets a non-reversible flag) to void your warranty.

    Now to my (most likely confusing) explanation about SPI speed: what you set is the maximum frequency. For the first two Pi generations that is equal to the frequency at any point in time but starting with RPi 3 this changed. Before that the core clock (which is the base for the SPI frequency) was fixed at 250MHz. RPi 3 has a scaling core clock 250-400MHz. And RPi 4 has a range of 200-500MHz (that lower minimum will bite is later).

    SPI frequency is derived from the core frequency divided by an even number and also scales with it. And since all RPis should be interchangeable they all use the common denominator of 250MHz for that. The first even number is 2 and thus 125MHz is the maximum SPI frequency any RPi can achieve.
    But since they introduced possibly higher core frequencies with RPi 3 and later they had to make sure that the above mentioned maximum frequency is not exceeded. They solved it by first deriving the actual maximum by checking for the next lower divider (e.g. setting to 21MHz will result in 125/6 =~ 20.83MHz) and then scale that value again so this will only be reached at maximum core clock.

    Simple example: set SPI frequency to 125MHz. On a RPi 4 with a maximum of 500MHz this means when the core clock actually runs at 250MHz the SPI frequency will also be halfed to 62.5MHz. And now the core clock in RPi 4 can even go down below that to 200MHz and guess what: SPI frequency will follow that as well.

    The biggest problem here is that whenever core clock changes SPI frequency will change with it not awaiting any data transfer to finish - some devices handle this better than others.
    And of course also you get lower speeds than what you requested without even noticing most of the time since with current resource usage when attached to a Duet 3 mine runs at only insignificantly higher usage than fully idle. core clock in mine was virtually fixed at 200MHz.

    That's why I fixed my minimum (new with RPi 4) and maximum to 250MHz. I don't know how it influences SPI frequency if you fix it at any value above that (my scope is orders of magnitude too slow to check that - might get myself a more capable one some day) and whenever I try to formulate a guess in my head I find a reason why this might still be different. So I just don't know.


Log in to reply