Chasing bad SPI checksums.

Danal

Hmmm... to all of my reading, this is determined by the application at the "open" of the connection.

The app specifies this as a divisor of the system clock frequency. That means two things that we should think about:

This is an RPI 4, and therefore may operate at a higher clock speed than other SBCs that have been tested with the Duet 3 6HC. Have you guys used an R4?
When the System-on-a-Chip manages thermals, it can vary the clocks without notice. This does affect the "divided down" SPI speed. And, it can occur in the middle of an SPI transaction. In theory, given that SPI uses explicit clocking, this shouldn't hurt anything. In practice? There seems to be lots of discussion around this and no firm conclusions.

Danal

And... I found this in the config.json for the SPI device driver in the duetsoftwareframework:

 "SpiDevice": "/dev/spidev0.0",
 "SpiFrequency": 2000000,
 "SpiTransferTimeout": 500,
 "MaxSpiRetries": 3,
 "SpiPollDelay": 25,

T3P3Tony

@Danal said in Bad checksums - Which connection?:

Hmmm... to all of my reading, this is determined by the application at the "open" of the connection.

The app specifies this as a divisor of the system clock frequency. That means two things that we should think about:

This is an RPI 4, and therefore may operate at a higher clock speed than other SBCs that have been tested with the Duet 3 6HC. Have you guys used an R4?

When the System-on-a-Chip manages thermals, it can vary the clocks without notice. This does affect the "divided down" SPI speed. And, it can occur in the middle of an SPI transaction. In theory, given that SPI uses explicit clocking, this shouldn't hurt anything. In practice? There seems to be lots of discussion around this and no firm conclusions.

I use a RPI4 on my Toolchanger and i beleive a few other users are using them as well. I do have a heatsink case on mine though.

Yes in config.json the SPI speed is set. 2Mhz is not that high - it is capable of much higher - but you might want to try 1Mhz to see if it makes any difference to the CRC errors. Also normal things like keeping the ribbon cable length short and not too close to other noise emitters. Also i assume you are using up todate DSF/RRF?

Danal

The ribbon cable is the one in the box with the 6HC. The RPI is mounted "alongside" (I'll get a photo in a few min).

Up to date: sudo apt-get update followed by upgrade does not upgrade anything.

I am on the duet3d.list; should I be on duet3d-unstable.list?

T3P3Tony

@Danal I think that stable and unstable match right now. Unstable gives me:

duetcontrolserver 1.1.0.5
duetruntime 1.1.0.5
duetsd 1.0.4
duetsoftwareframework 1.1.0.5
duettools 1.1.0.5
duetwebcontrol 2.0.4-1
duetwebserver 1.1.2.1

The ribbon cable out the box is short and the mounting scheme is as i have it so that all looks good.

Danal

I will give 1000000 a try, for debugging if nothing else.

Danal

And, I do have a restart that occured about an hour into a multi hour print.

Edit: There are a few bad checksums before the restart; at the same time, there have been similar messages that did not cause a restart.

reset-subset.txt

T3P3Tony

so you got a CRC arror that caused a reset? of what RRF, DCS or? Hopefully @chrishamm can investigate

chrishamm

@Danal It looks like your Duet 3 restarted/crashed before the print was cancelled:

Nov 08 00:41:17 duet3 DuetControlServer[351]: [warn] Controller has been reset

What RRF version are you using, can you share the output of M122?

wilriker

Just a short one from me: I have it running with a RPi4 at ~20.83MHz (set it to 21MHz and it will be get to the actual value through core clock divisor) stable. But I also have fixed my core clock in /boot/config.txt to not have those uncontrollable frequency changes:

core_freq=250
core_freq_min=250

Danal

@chrishamm said in Bad checksums - Which connection?:

@Danal It looks like your Duet 3 restarted/crashed before the print was cancelled:

Nov 08 00:41:17 duet3 DuetControlServer[351]: [warn] Controller has been reset

What RRF version are you using, can you share the output of M122?

Board: Duet 3 MB6HC (MB6HC)
DSF Version: 1.1.0.5
Firmware: RepRapFirmware for Duet 3 MB6HC v0.6 or 1.0 3.0beta12 (2019-11-02b1)
Duet Web Control 2.0.4

Unfortunately, I did not obtain M122 prior to a power cycle.

Danal

@wilriker said in Bad checksums - Which connection?:

Just a short one from me: I have it running with a RPi4 at ~20.83MHz (set it to 21MHz and it will be get to the actual value through core clock divisor) stable. But I also have fixed my core clock in /boot/config.txt to not have those uncontrollable frequency changes:
core_freq=250
core_freq_min=250

I was thinking about trying that, as the Pi4 is plenty fast enough.

Thanks for the detailed settings, I will do that next.

T3P3Tony

@Danal said in Bad checksums - Which connection?:

core_freq=250
core_freq_min=250

@wilriker are there any downsides to this approach?

T3P3Tony

@T3P3Tony I also see from the documentation:

For Pi 4B the default value is 500, 600 is the only other accepted value;

wilriker

@T3P3Tony said in Bad checksums - Which connection?:

@Danal said in Bad checksums - Which connection?:

core_freq=250
core_freq_min=250

@wilriker are there any downsides to this approach?

CPU scales independently from the core clock and I could not find what else this has an influence on. So far I have not encountered any downsides.

Re 500 and 600: that's for overclocking. But you can always underclock. Confirmed on mine.

I will post some more details findings later. In short SPI clock is only about 40% of what your set it to on RPi 4 unless you fix the core clock.

T3P3Tony

@T3P3Tony said in Bad checksums - Which connection?:

core_freq=250
core_freq_min=250

but i i want to be greedy, don't care about power, and have a heatsink i would potentially fix this at 500 or 600?

I just checked (vcgencmd measure_clock arm) and i am running at 600 right now (probably due to using VNC with a big screen resolution which is quite processor intensive i guess).

might test:

core_freq=600
core_freq_min=600

Also what is not clear is what the actual difference between doing that and using force_turbo=1
is

wilriker

@T3P3Tony said in Bad checksums - Which connection?:

@T3P3Tony said in Bad checksums - Which connection?:

core_freq=250
core_freq_min=250

but i i want to be greedy, don't care about power, and have a heatsink i would potentially fix this at 500 or 600?

I don't think this brings very much improvement but yes, you could fix that at 500 (I would not overclock something I don't even know what it all influences).

I just had a thought that it might influence RAM clock but I have no evidence to back that up.

I just checked (vcgencmd measure_clock arm) and i am running at 600 right now (probably due to using VNC with a big screen resolution which is quite processor intensive i guess).

That's the CPU clock you checked. So although you are running VNC it's still stepped down to it's minimum speed. It's the clock core you were looking for.

Also what is not clear is what the actual difference between doing that and using force_turbo=1
is

force_turbo=1 will do some more things, some of them being very sketchy when it comes to system stability - but it also blows an "efuse" (i.e. it sets a non-reversible flag) to void your warranty.

Now to my (most likely confusing) explanation about SPI speed: what you set is the maximum frequency. For the first two Pi generations that is equal to the frequency at any point in time but starting with RPi 3 this changed. Before that the core clock (which is the base for the SPI frequency) was fixed at 250MHz. RPi 3 has a scaling core clock 250-400MHz. And RPi 4 has a range of 200-500MHz (that lower minimum will bite is later).

SPI frequency is derived from the core frequency divided by an even number and also scales with it. And since all RPis should be interchangeable they all use the common denominator of 250MHz for that. The first even number is 2 and thus 125MHz is the maximum SPI frequency any RPi can achieve.
But since they introduced possibly higher core frequencies with RPi 3 and later they had to make sure that the above mentioned maximum frequency is not exceeded. They solved it by first deriving the actual maximum by checking for the next lower divider (e.g. setting to 21MHz will result in 125/6 =~ 20.83MHz) and then scale that value again so this will only be reached at maximum core clock.

Simple example: set SPI frequency to 125MHz. On a RPi 4 with a maximum of 500MHz this means when the core clock actually runs at 250MHz the SPI frequency will also be halfed to 62.5MHz. And now the core clock in RPi 4 can even go down below that to 200MHz and guess what: SPI frequency will follow that as well.

The biggest problem here is that whenever core clock changes SPI frequency will change with it not awaiting any data transfer to finish - some devices handle this better than others.
And of course also you get lower speeds than what you requested without even noticing most of the time since with current resource usage when attached to a Duet 3 mine runs at only insignificantly higher usage than fully idle. core clock in mine was virtually fixed at 200MHz.

That's why I fixed my minimum (new with RPi 4) and maximum to 250MHz. I don't know how it influences SPI frequency if you fix it at any value above that (my scope is orders of magnitude too slow to check that - might get myself a more capable one some day) and whenever I try to formulate a guess in my head I find a reason why this might still be different. So I just don't know.

gtj0

@wilriker Can I just say "ew"?

Get a Jetson Nano.

Danal

First, my cable layout. Nothing really near; nothing crossing; anything even sort of near at 90 degrees.

Second, no joy on eliminating thermal management of the core clock.

/boot/config.txt
core_freq=250
core_freq_min=250

And a reboot does not eliminate all of my checksum messages. I'm beginning to think I have a hardware problem. Time to power off and re-seat cables.

By the way, I did check that the commands had the desired effect:

pi@duet3:~ $ vcgencmd measure_clock core
frequency(1)=250000496
pi@duet3:~ $ vcgencmd measure_clock core
frequency(1)=250000496
pi@duet3:~ $ vcgencmd measure_clock core
frequency(1)=250000496

Danal

Well... very VERY interesting. Hard to prove a negative... but... I reseated the ribbon cable at both ends, and as of this moment, I've run about 5x the testing that reliably produced a checksum error in the past, with no errors logged.

Update: Hours and Hours of printing. No Checksum errors.

Looks like re-seating the cables fixed it.