[3.6.0-beta.2 and prev] Random out of sync

droftarts

@leone Thank you for your in-depth analysis. I've highlighted it to @dc42.

Ian

leone

Thanks @droftarts.
Update note: I got a resync also with MaxTimeSyncDelay = 375.

leone

now I'm doing some tests increasing StepTimer::MinSyncInterval from 2000 to 4000. I noticed it was already increased some years ago, it might be the same issue, I am also running on a SAME70.

dc42

@leone thanks for your report. If you run M122 what does it report for "peak Tx sync delay", both when the machine has been idle for several minutes and when it has been printing for several minutes? Note, that value gets reset after each M122 report.

dc42

@leone further to my previous answer, here are some observations:

I made a note in the source code for StepTimer.cpp "Occasionally on the SAME70 we get very large delays reported. These delays are not genuine." This could be relevant, however AFAIR the delays I was seeing were very much greater than 300.
We don't yet use bit rate switching, so the time taken to send a full 64-byte CAN-FD packet is about 600us. So if the CAN clock task tries to send a time sync message just after transmission of another packet starts, the acknowledge delay is expected to be up to 600us.

As to why you see the linearly decreasing acknowledge delay, if you see this when the machine is idle then my best guess is that it starts when the CAN clock task wakes up to send a time sync message just after the main board or an expansion board has started to send one of its standard broadcast messages. For example, the sensor readings are broadcast every 250ms whereas the CAN clock message is broadcast every 200ms. If the sending times of these messages clashes, then they clash again 1 second (5 samples) later. The clash won't be exact, because the sensors broadcast interval varies a little. That may explain the linearly decreasing acknowledge delay.

If am right, we can avoid this clash every 5 messages by changing the CAN clock send interval to be relatively prime with respect to the sensors broadcast interval. Please try changing the definition of CanClockIntervalMillis at CanInterface.cpp(189) to a different value such as 199 or 211 and use your CAN protocol analyser to see if the behaviour has changed.

leone

@dc42 thanks for the suggestion!
From M122, "peak Tx sync delay" is about 480 both when the machine is idle and when is printing. This might correspond to the 600us needed to send the CANFD packet, right?
I tried with CanClockIntervalMillis = 199 and below is the (blue) plot from the CAN analyser. The interesting thing now is that the samples between ~4 and 300 are way more than before.
Then I also increased the MaxTimeSyncDelay to 500, and the number of zeros is considerably reduced, which I think is the goal to avoid resyncs.
Could this value of MaxTimeSyncDelay lead to other issues?

dc42

@leone thanks for doing the test. I think that proves my theory.

Are you certain that the number of high acknowledge delays is now higher? I'm not at all sure that the number of delays between 200 and 300 is any higher than before, although the number between 100 and 200 may be higher.

I suspect that this change alone will fix the loss of sync issue by spreading out the high delays, however I agree with you that MaxTimeSyncDelay should be increased to cover the maximum delay expected.

leone

@dc42 with CanClockIntervalMillis = 200 the number of high acknowledge delays depends on how many clashes with other transmissions we have, and since it seems to happen quite random, the comparison with CanClockIntervalMillis = 199 is difficult to do. I tried to analyse 10 minutes in idle state with different configurations of CanClockIntervalMillis and MaxTimeSyncDelay.
I think I will run some prints with CanClockIntervalMillis = 199 and MaxTimeSyncDelay = 500, to see if the resync issue still occurs. From the analysys below it seems to be the one with less number of zeros and less high delays.
Do you think this is something relevant to include in the next release?

freq199_thr300
Percentage of zeros: 0.50%
Percentage of values between 1 and 10: 96.60%
Percentage of values between 11 and 100: 1.13%
Percentage of values between 101 and 200: 1.10%
Percentage of values greater than 200: 0.67%

freq200_thr500
Percentage of zeros: 0.00%
Percentage of values between 1 and 10: 93.77%
Percentage of values between 11 and 100: 2.30%
Percentage of values between 101 and 200: 2.10%
Percentage of values greater than 200: 1.83%

freq199_thr500
Percentage of zeros: 0.00%
Percentage of values between 1 and 10: 96.63%
Percentage of values between 11 and 100: 1.30%
Percentage of values between 101 and 200: 1.10%
Percentage of values greater than 200: 0.97%

dc42

@leone I've committed a change to 3.6-dev to increase the allowed transmit delay and to change the time sync interval to 211ms. I'll also look at the other point you mentioned about moves being discarded.

leone

@dc42 great, thanks!

dc42

@leone I'm still trying to work out why you were getting loss of sync. Please can you run the original 3.6 firmware in which you noticed the problem, then when the problem is observed run M122 B# where # is the CAN address of the expansion board. I'm interested in whether it is timeout resyncs or jitter resyncs that are causing the loss of sync.

leone

@dc42 I still have one:

All averaging filters OK Never used RAM 164992, free system stack 172 words 
Tasks: Move(3,nWait 7,0.0%,126) TMC(2,delaying,1.4%,51) HEAT(2,nWait 6,0.1%,89) CanAsync(5,nWait 4,0.0%,66) CanRecv(3,nWait 1,0.1%,73) CanClock(5,nWait 1,0.0%,63) LED(2,delaying,0.0%,123) ACCEL(3,nWait 6,0.0%,87) MAIN(1,running,97.4%,319) IDLE(0,ready,0.0%,29) AIN(2,delaying,1.0%,259), total 100.0% 
Owned mutexes: 
Last reset 02:15:02 ago, cause: power up 
Last software reset data not available 
Moves scheduled 37920, hiccups 0 (0.00/0.00ms), segs 37, step errors 0 (types 0x0), maxLate 0 maxPrep 62, ebfmin 0.00 max 0.00 
Peak sync jitter -3/41, peak Rx sync delay 187, resyncs 1/0, no timer interrupt scheduled, next step interrupt due in 4247606105 ticks, disabled 
VIN voltage: min 24.2, current 24.4, max 24.5 
MCU temperature: min 51.7C, current 58.0C, max 60.5C 
Driver 0: pos 12510580, 2056.0 steps/mm, standstill, SG min 0, read errors 0, write errors 0, ifcnt 25, reads 53750, writes 25, timeouts 0, DMA errors 0, CC errors 0 Last sensors broadcast 0x3fe0000000 found 9 225 ticks ago, 0 ordering errs, loop time 0 
CAN messages queued 192060, send timeouts 0, received 305253, lost 0, ignored 16, errs 0, boc 0, free buffers 22, min 22, error reg 0 dup 0, oos 0/0/0/0, bm 0, wbm 0, rxMotionDelay 299, adv 35157/37138 
Accelerometer: ADXL345, status: 00 I2C bus errors 0, naks 0, contentions 0, other errors 0

dc42

@leone thanks. Was that definitely after a sync error? It includes resyncs 1/0 which suggests it was a timeout resync, not a jitter resync. I think a timeout resync should only occur if no time sync message with a valid ack delay field is received over a period of 2 seconds.

leone

@dc42 yes, I logged that message right after the expansion boards got out of sync and the print shifted. I was also able to sniff the CAN messages of another print that failed in the same way and that's where I noticed 'lastTimeAcknowledgeDelay' being equal to zero for more than 10 times consecutively. That's how I started to test higher values for MaxTimeSyncDelay (initially 375), but I didn't know how far I could go.

dc42

@leone thanks again. Looks like the maximum allowed acknowledge delay should be about 900 to allow for maximum length messages. However, I'm still a little surprised that you had the maximum delay exceeded 10 times in succession.