UNSOLVED Slow/Unreliable SPI Communication when Logging with M929??
I've been running a rather long macro from the SBC and found that it cannot complete reliably. I get various symptoms and errors:
- The printer pauses or hesitates between moves for approx. 0.25 to 2 second
- Pauses are more frequent if the expression language is used, particularly loops
- Warning that the Macro has not completed in 3000ms (the macro takes 3+ minutes to run, but I don't always get this error, so its no the total macro runtime its complaining about)
Warning: Controller has been reset- macro execution stops and I have to reset the machine
What Hardware and Software?
Duet 3 and Raspberry Pi 3B+ running 3.2 final
What debugging steps have been taken?
- Re-worked my macro to remove all scripting and expressions so its plain GCode (eliminates parsing weirdness on the SBC as root cause)
- Switch to Ethernet on the SBC, disabled WiFi
- Switched SBC power to an external power adapter, removed the 5v -> SBC jumper
- Checked that there is no SD card in the Duet
- Tried the following config changes in DCS suggested here: https://github.com/Duet3D/DuetSoftwareFramework/issues/137
- Raised the SPI frequency to 16Mhz
- Dropped the poll interval to 5ms
- Increased buffered macro codes to 32
- Moving the Pi board and ribbon cable as far away from any other wiring as is practical in case its interference
- Disable logging in the script by commenting out the
M929 S3 P"/sys/bed-sample.log"line
Increasing the SPI frequency and lowering the poll interval had a positive impact. This resulted in noticeably snappier macro execution. But the intermittent complete stalls remained and the
Warning: Controller has been resetstill happened.
Turning off the logging command in the macro seems to have eliminated the stall problem. My wild guess in that DCS blocks to perform logging file I/O and this causes an SPI stall but I haven't looked at the DCS code to check. Since I need this on to capture the probing data for Python to process I cant accept turning this off as a solution (unless you know another way to get data out of the duet reliably).
Possible Next Steps
- Try a very fast SD card to see if the faster I/O speed helps
- Try raising the logging level from Info to Error and then log all data as 'Error' on the theory that this would have less overhead somehow
- Try a Pi4 on the theory that more CPU might help.
I got one right before a reset SPI Shows some disconnects have occurred and there appears to be an underun logged on the Duet (if I'm reading that correctly).
1/19/2021, 1:06:45 PM Warning: Controller has been reset 1/19/2021, 1:06:33 PM M122 === Diagnostics === RepRapFirmware for Duet 3 MB6HC version 3.2 running on Duet 3 MB6HC v0.6 or 1.0 (SBC mode) Board ID: 08DJM-956L2-G43S4-6JTD4-3SS6K-TA7AH Used output buffers: 1 of 40 (16 max) === RTOS === Static ram: 149788 Dynamic ram: 65312 of which 76 recycled Never used RAM 143656, free system stack 122 words Tasks: Linux(ready,87) HEAT(blocked,296) CanReceiv(blocked,834) CanSender(blocked,344) CanClock(blocked,352) TMC(blocked,19) MAIN(running,1120) IDLE(ready,19) Owned mutexes: Telnet(MAIN) === Platform === Last reset 00:26:13 ago, cause: software Last software reset at 2021-01-19 12:40, reason: User, none spinning, available RAM 143620, slot 0 Software reset code 0x0012 HFSR 0x00000000 CFSR 0x00000000 ICSR 0x0044a000 BFAR 0x00000000 SP 0x00000000 Task Linu Freestk 0 n/a Error status: 0x00 Aux0 errors 0,0,0 Aux1 errors 0,0,0 MCU temperature: min 44.5, current 46.4, max 46.7 Supply voltage: min 24.2, current 24.3, max 24.3, under voltage events: 0, over voltage events: 0, power good: yes 12V rail voltage: min 12.2, current 12.2, max 12.3, under voltage events: 0 Driver 0: position 56774, standstill, reads 48573, writes 8 timeouts 0, SG min/max 0/1023 Driver 1: position 48723, standstill, reads 48573, writes 8 timeouts 0, SG min/max 0/1023 Driver 2: position 1600, standstill, reads 48573, writes 8 timeouts 0, SG min/max 0/1023 Driver 3: position 24800, standstill, reads 48578, writes 2 timeouts 0, SG min/max not available Driver 4: position 0, standstill, reads 48581, writes 0 timeouts 0, SG min/max not available Driver 5: position 0, standstill, reads 48581, writes 0 timeouts 0, SG min/max not available Date/time: 2021-01-19 13:06:32 Slowest loop: 2882.45ms; fastest: 0.04ms === Storage === Free file entries: 10 SD card 0 not detected, interface speed: 37.5MBytes/sec SD card longest read time 0.0ms, write time 0.0ms, max retries 0 === Move === DMs created 125, maxWait 446431ms, bed compensation in use: none, comp offset 0.000 === MainDDARing === Scheduled moves 1506, completed moves 1506, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 3], CDDA state -1 === AuxDDARing === Scheduled moves 0, completed moves 0, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 0], CDDA state -1 === Heat === Bed heaters = 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1, chamberHeaters = -1 -1 -1 -1 === GCodes === Segments left: 0 Movement lock held by HTTP HTTP* is idle in state(s) 45 0, running macro Telnet* is doing "M122" in state(s) 0 File is idle in state(s) 0 USB is idle in state(s) 0 Aux is idle in state(s) 0 Trigger* is idle in state(s) 0 Queue is idle in state(s) 0 LCD is idle in state(s) 0 SBC is idle in state(s) 0 Daemon is idle in state(s) 0 0, running macro Aux2 is idle in state(s) 0 Autopause is idle in state(s) 0 Code queue is empty. === CAN === Messages queued 6765, send timeouts 0, received 17182, lost 0, longest wait 1ms for reply type 6013, free buffers 48 === SBC interface === State: 4, failed transfers: 0 Last transfer: 1ms ago RX/TX seq numbers: 8122/8122 SPI underruns 0, overruns 0 Number of disconnects: 2, IAP RAM available 0x2c884 Buffer RX/TX: 1416/112-4088 === Duet Control Server === Duet Control Server v3.2.0 HTTP: Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: M118 L1 S"Bed probe location: X:236.872 Y:8.128" Buffered code: G1 X236.872 Y20.128 F9000 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: M118 L1 S"Bed probe location: X:196.872 Y:8.128" Buffered code: G1 X196.872 Y20.128 F9000 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: M118 L1 S"Bed probe location: X:170.0 Y:8.128" Buffered code: G1 X170.0 Y20.128 F9000 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: M118 L1 S"Bed probe location: X:130.0 Y:8.128" Buffered code: G1 X130.0 Y20.128 F9000 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: M118 L1 S"Bed probe location: X:103.128 Y:8.128" Buffered code: G1 X103.128 Y20.128 F9000 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 Buffered code: G30 S-1 Buffered code: G1 Z1 F252 ==> 1420 bytes Executing macro 0:/macros/mesh-bed-leveling/mesh-bed-probe.g, started by M98 P"0:/macros/mesh-bed-leveling/mesh-bed-probe.g" > Next stack level Code buffer space: 2696 Configured SPI speed: 16000000 Hz Full transfers per second: 78.97 Maximum length of RX/TX data transfers: 3669/1548 ok 1/19/2021, 1:05:45 PM Connected to duet3
- The printer pauses or hesitates between moves for approx. 0.25 to 2 second
Ooh its all C#! Logging happens here using good async constructs, this shouldn't in principal block the thread:
So I'm going after the SD card first then...
I switched to a SanDisk Extreme 128GB card and that totally cleared up the issue. This is one that I use for recording 4K video.
The card I had before was a Kingston 16GB card I got with the Pi.
I'm happy that I figured out a way around this, but I'm surprised that the
async/awaitprogramming constructs don't protect us from I/O flakiness!
- The file is wrapped in a StreamWriter that should have a 4K buffer
- The whole call chain is
async/await. Log writing doesn't have to complete before the next GCode instruction executes (at least from what I can read without debugging it).
I was going to suggest keeping a queue of logging data and writing it on a separate dedicated thread for I/O. But StreamWriter is basically that. (the only difference being that the log time can be captured without trying to lock the file)
So yeah, not sure how the code should change. But if you are having problems, get a FAST SD card.