CAN bus anomalies with 6HC and 3HC
-
@dc42 said in CAN bus anomalies with 6HC and 3HC:
does the 3HC still have the same problem, responding to M122 but the status LED not flashing in sync with the main board?
Not currently, no. I have not seen it happen again since I initially reported the issue, even before I updated the bootloader this morning. I'm stumped as to why - the machine had been working fine. Then for a couple of days I kept getting the issue with the light blinking out of sync and the motors on that side not working. Then I reported it and have not seen it happen again since.
Is there a typical reason for the light flashing out of sync like that? I'm certainly happy it works now but would be great to nail down why it did that so it doesn't risk it happening again.
-
@adammhaile if communication is working but the LED is not flashing in sync, that means the software frequency locked loop that keeps the clock on the 3HC synchronised with the 6HC was not able to maintain sync. That could mean that the crystal oscillator on the 3HC or on the 6HC drifts too much with temperature.
The peak sync jitter in the M122 B1 report can give a clue that this is happening. The values in your most recent report are minimum +2 and maximum +8 which are OK.
-
@dc42 said in CAN bus anomalies with 6HC and 3HC:
The peak sync jitter in the M122 B1 report can give a clue that this is happening. The values in your most recent report are minimum +2 and maximum +8 which are OK.
Interesting... so the jitter values on the 3HC from when it was flashing out of sync (it was actually doing that at the time of that diagnostic) were 0 and 0... are those still ok?
I would assume that means no jitter - which is odd given it was out of sync at the time.
-
@adammhaile when it's out of sync there are no jitter values available, so it reports 0/0. The jitter values just before it lost sync would have been interesting!
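For anyone wanting to track these values over time, here is a minimal sketch in Python. It assumes the text of an M122 B1 report has already been captured (how to capture it is not covered here) and that the report contains a line of the form "Peak sync jitter <min>/<max>". That line format is an assumption and should be checked against a real report; the function name is hypothetical. Following the explanation above, 0/0 is treated as "no values available" rather than "no jitter".

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: the report contains "Peak sync jitter <min>/<max>".
# Verify this against a real M122 B1 report before relying on it.
JITTER_RE = re.compile(r"[Pp]eak sync jitter\s+([+-]?\d+)/([+-]?\d+)")


def parse_sync_jitter(report: str) -> Optional[Tuple[int, int]]:
    """Return (min, max) peak sync jitter, or None if unavailable."""
    match = JITTER_RE.search(report)
    if match is None:
        return None  # line not present in this report
    lowest, highest = int(match.group(1)), int(match.group(2))
    if lowest == 0 and highest == 0:
        # Per the thread: 0/0 is reported when the board is out of sync,
        # so there is no usable jitter measurement.
        return None
    return lowest, highest


# Hypothetical sample line, using the +2/+8 values quoted in the thread.
sample = "Peak sync jitter +2/+8"
print(parse_sync_jitter(sample))
```

Calling this on each captured report and storing the result with a timestamp would give a record of the values leading up to a loss of sync.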
-
@dc42 said in CAN bus anomalies with 6HC and 3HC:
The jitter values just before it lost sync would have been interesting!
Any way to log those in real time in case it happens again?
Any recommendations as next steps? You seemed to think a replacement was a good idea before but now I'm not sure... Just don't want this happening randomly again in the future. It's a big machine so a random fail mid-print could be hugely wasteful.
-
@dc42 Is it possible it's also either the 6HC or maybe the CAN bus cable?
If it's only possible the issue was caused by the 3HC I'd almost rather just replace it immediately in the interest of making sure it doesn't happen in the future... getting ready to bring the machine to MRRF so I only have so much time to make sure it's running reliably.
-
I'll see what DC42 thinks about 3hc replacement whether it's worthwhile or not.
-
@phaedrux said in CAN bus anomalies with 6HC and 3HC:
I'll see what DC42 thinks about 3hc replacement whether it's worthwhile or not.
Thanks!
-
@phaedrux said in CAN bus anomalies with 6HC and 3HC:
I'll see what DC42 thinks about 3hc replacement whether it's worthwhile or not.
@Phaedrux @dc42 a bit of an update...
<sigh> This is getting frustrating.
Early on in building this machine I had 1 instance where mid-print it just stopped. DWC status page showed the "Print Again?" button and after digging into the duetcontrolserver logs I found a "SPI connection has been reset" message.
I dug into the forums and found lots of mentions of grounding/static issues. So I went to work making sure everything was fully grounded - which it is now. Everything is grounded to the frame and I can even confirm full continuity between ground and the nozzle, all motors, pulleys, etc. Pretty much anything metal is grounded.
I even went to the point of tying 24V negative to ground so that the "ground" on all the electronics would be actually grounded.
I thought that problem had been solved as I had never seen it again. And then the CAN bus issue started happening. Except that within a day of reporting it here that stopped and all seemed to be going fine. Until today, when I got the "SPI connection has been reset" message twice and the CAN bus issue once.
The first SPI issue was when I was just testing out some new meta commands and noticed that DWC disconnected. The CAN bus problem happened in a similar scenario - while I was setting up some new scripting - and then when I went to home for a new print the Y axis bound up because only the left motor was driven. And then just now it stopped again mid print with the SPI issue again.
I think the most frustrating thing about both of these is that there's no real logic to when or why they happen. But I'm about to throw both these boards in the trash.
This is certainly not my first time dealing with a complicated electronic system (I've been building my own computers, printers, plotters, CNC machines, and designing my own PCBs for a long, long time).
I even had a friend, an electrical engineer who specializes in failure analysis, take a look and he couldn't come up with anything I've done that seemed wrong.
I very much do not trust either the 6HC or 3HC at this point - I feel like my only options are to replace both boards or find a completely different controller that meets my needs. I'm not trying to be threatening here or anything, I'm just supremely frustrated.
-
@adammhaile I'm sorry to hear that you are still having problems. Did you run a M122 report after the SPI connection reset message? If the 6HC reset then this may help us determine why. It may not be too late to run one now.
-
@dc42 said in CAN bus anomalies with 6HC and 3HC:
@adammhaile I'm sorry to hear that you are still having problems. Did you run a M122 report after the SPI connection reset message? If the 6HC reset then this may help us determine why. It may not be too late to run one now.
Unfortunately, no - the first time it happened I SSHed into the Pi to confirm it was the SPI error message but forgot to run the diagnostic. The second time it happened I tried but DWC just didn't respond, even after it said it was connected again.
Is there a way to run such commands from the Pi terminal directly?
Also - new datapoint... I just started up the same print it failed on last night again and I wanted to move the bed down so I could clear it of the failed print. Since I couldn't home it first, I did what I normally do and issued M564 H0 to let me move without homing... except nothing happened. The console showed that the command had been run but the axis didn't unlock for movement in DWC.
BTW - between the first time the SPI issue happened and now I've also replaced the Pi itself with a brand new one. Same exact model (Pi 4 w/ 4GB RAM).
That print is running now - if it SPI or CAN bus fails again I will update here ASAP.
-
@dc42 Update: Ran prints all day. Nothing. I was even screen recording DWC and a couple different camera angles to see the exact moment. This is what's most frustrating - it'll happen the moment I get comfortable with it again
-
@adammhaile If you see occasional SPI connection resets, please consider reflashing your microSD card. See here why it could help.
-
@chrishamm said in CAN bus anomalies with 6HC and 3HC:
@adammhaile If you see occasional SPI connection resets, please consider reflashing your microSD card. See here why it could help.
I originally flashed it quite a while ago and it's running buster, not bullseye.
Granted, I have run apt upgrade a few times since - could it still be affected?
pi@rancor:~ $ cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
@chrishamm Update: I found the image I used the last time I did a clean re-flash and it was on Feb 15th, 2022 with 2021-07-12-DuetPi-lite.img
I also realize I should probably note the few things I've done with that image:
- dsf was given a user directory and the ability to log in to that account. This is so that I can SSH to the Pi and directly edit the files in the sys directory. I do this so that I can use VS Code's remote features and have multiple files open at a time. It's SO much faster than going through DWC when you have a lot of edits to make - and I have an extensive system of conditional logic for tool and filament management.
- It's running isc-dhcp-server (dhcpd) to provide an IP to another Pi in the printer that's running android and driving a large touch screen that displays DWC.
- It's running a slightly modified version of the webcamd mjpg-streamer service from OctoPi for on-board camera streaming. I did this before the motion camera plugin was available. And even after, I was never able to get it to serve up the stream larger than 640x480. Since my previous solution worked I just went back to that.
-
At this point I'm ready to just suck it up and buy new controllers unless you are willing to RMA these boards. I'm beyond frustrated.
I was able to get both issues to happen again - at the same time.
I was running yet another simple test print and out of nowhere it just stopped dead... unlike other times though, the connection to the system never really came back. Eventually DWC on the built-in Android screen on the machine came back enough to display "SPI connection has been reset".
But otherwise I could not remotely access the machine at all. Could not get to DWC from any other system on my network and could not SSH into the Pi.
I fortunately have a screen on the Duet Pi and was able to connect a keyboard and run a couple things before rebooting the system.
One note about the DCS log below - the SPI reset seems to happen over and over again.
At the bottom you will see diagnostics for the 6HC and logs from DuetControlServer when the event happened. I was unable to get M122 to output the 3HC diagnostics - it just returned error every time.
Not only did the SPI comms issue occur but when it did the LED on the 3HC was blinking rapidly - sadly that's all I could tell because, as noted, I was unable to grab diagnostics from it - it just seemed completely disconnected.
One new thing of note: I think this is only happening when I use the right tool - the one that's using the 3HC. To recap from previous: The 3HC controls the T1 extruder, T1 X axis (the U axis), and the right side Y motor. So even if T1 isn't being used the 3HC is always involved at least with the Y motor. But it seems to only happen when I'm running jobs either with both tools or only with T1. No idea what that means - hopefully it will make sense to you.
6HC Diagnostics - this was captured about 3-4 minutes after all went to hell. It took me that long to gain access to the system and figure out how to run M122 from the pi terminal.
=== Diagnostics ===
RepRapFirmware for Duet 3 MB6HC version 3.4.0 (2022-03-15 18:57:24) running on Duet 3 MB6HC v1.01 or later (SBC mode)
Board ID: 08DJM-956BA-NA3TN-6J1FG-3S86T-TUBUS
Used output buffers: 1 of 40 (15 max)
=== RTOS ===
Static ram: 151000
Dynamic ram: 69008 of which 0 recycled
Never used RAM 127280, free system stack 114 words
Tasks: SBC(ready,0.4%,438) HEAT(suspended,0.0%,321) TMC(notifyWait,8.0%,58) MAIN(running,91.6%,1147) IDLE(ready,0.0%,30), total 100.0%
Owned mutexes: HTTP(MAIN)
=== Platform ===
Last reset 05:22:14 ago, cause: power up
Last software reset details not available
Error status: 0x00
Aux1 errors 0,0,0
Step timer max interval 127
MCU temperature: min 46.0, current 46.0, max 46.0
Supply voltage: min 23.9, current 23.9, max 23.9, under voltage events: 0, over voltage events: 0, power good: yes
12V rail voltage: min 12.1, current 12.1, max 12.1, under voltage events: 0
Heap OK, handles allocated/used 99/52, heap memory allocated/used/recyclable 2048/1620/986, gc cycles 5
Events: 0 queued, 0 completed
Driver 0: standstill, SG min n/a, mspos 184, reads 12908, writes 0 timeouts 0
Driver 1: standstill, SG min n/a, mspos 504, reads 12907, writes 0 timeouts 0
Driver 2: standstill, SG min n/a, mspos 8, reads 12907, writes 0 timeouts 0
Driver 3: standstill, SG min n/a, mspos 152, reads 12908, writes 0 timeouts 0
Driver 4: standstill, SG min n/a, mspos 152, reads 12908, writes 0 timeouts 0
Driver 5: standstill, SG min n/a, mspos 152, reads 12908, writes 0 timeouts 0
Date/time: 2022-04-11 18:51:59
Slowest loop: 1.55ms; fastest: 0.05ms
=== Storage ===
Free file entries: 10
SD card 0 not detected, interface speed: 37.5MBytes/sec
SD card longest read time 0.0ms, write time 0.0ms, max retries 0
=== Move ===
DMs created 125, segments created 22, maxWait 0ms, bed compensation in use: mesh, comp offset 0.000
=== MainDDARing ===
Scheduled moves 45757, completed 45757, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 0], CDDA state -1
=== AuxDDARing ===
Scheduled moves 0, completed 0, hiccups 0, stepErrors 0, LaErrors 0, Underruns [0, 0, 0], CDDA state -1
=== Heat ===
Bed heaters 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1, chamber heaters -1 -1 -1 -1, ordering errs 0
=== GCodes ===
Segments left: 0
Movement lock held by null
HTTP* is doing "M122 B0" in state(s) 0
Telnet is idle in state(s) 0
File* is idle in state(s) 0
USB is idle in state(s) 0
Aux is idle in state(s) 0
Trigger* is idle in state(s) 0
Queue* is idle in state(s) 0
LCD is idle in state(s) 0
SBC is idle in state(s) 0
Daemon is idle in state(s) 0
Aux2 is idle in state(s) 0
Autopause is idle in state(s) 0
Code queue is empty
=== Filament sensors ===
Extruder 0 sensor: ok
Extruder 1 sensor: no filament
=== CAN ===
Disabled
Longest wait 0ms for reply type 0, peak Tx sync delay 0, free buffers 50 (min 49), ts 0/0/0
Tx timeouts 0,0,0,0,0,0
=== SBC interface ===
Transfer state: 4, failed transfers: 0, checksum errors: 0
RX/TX seq numbers: 41225/1471
SPI underruns 0, overruns 0
State: 5, disconnects: 12, timeouts: 12, IAP RAM available 0x2b880
Buffer RX/TX: 0/0-0, open files: 0
=== Duet Control Server ===
Duet Control Server v3.4.0
Code buffer space: 4096
Configured SPI speed: 8000000Hz, TfrRdy pin glitches: 0
Full transfers per second: 36.63, max time between full transfers: 4566.7ms, max pin wait times: 26.1ms/0.3ms
Codes per second: 0.13
Maximum length of RX/TX data transfers: 3868/1520
DuetControlServer logs - the M800 is just a custom macro that runs for various print events. It sends a serial message to an external arduino that plays some audio.
Apr 11 18:40:38 rancor DuetControlServer[370]: [info] Finished macro file M800.g
Apr 11 18:41:20 rancor DuetControlServer[370]: [info] Starting macro file M800.g on channel File
Apr 11 18:41:20 rancor DuetControlServer[370]: [info] Finished macro file M800.g
Apr 11 18:41:53 rancor DuetControlServer[370]: [info] Starting macro file M800.g on channel File
Apr 11 18:41:53 rancor DuetControlServer[370]: [info] Finished macro file M800.g
Apr 11 18:42:30 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:42:30 rancor DuetControlServer[370]: [warn] SPI connection has been reset
Apr 11 18:42:30 rancor DuetControlServer[370]: [warn] Trigger: Out-of-order reply: ''
Apr 11 18:42:30 rancor DuetControlServer[370]: [info] Aborted job file
Apr 11 18:42:55 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:42:55 rancor DuetControlServer[370]: [warn] SPI connection has been reset
Apr 11 18:42:55 rancor DuetControlServer[370]: [warn] Trigger: Out-of-order reply: ''
Apr 11 18:43:10 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:43:10 rancor DuetControlServer[370]: [warn] SPI connection has been reset
Apr 11 18:43:10 rancor DuetControlServer[370]: [warn] Trigger: Out-of-order reply: ''
Apr 11 18:43:19 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:43:28 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:43:37 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:43:47 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:43:56 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:05 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:05 rancor DuetControlServer[370]: [warn] SPI connection has been reset
Apr 11 18:44:05 rancor DuetControlServer[370]: [warn] Trigger: Out-of-order reply: ''
Apr 11 18:44:14 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:23 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:32 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:41 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:50 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:59 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:44:59 rancor DuetControlServer[370]: [warn] SPI connection has been reset
Apr 11 18:45:00 rancor DuetControlServer[370]: [warn] Trigger: Out-of-order reply: ''
Apr 11 18:45:09 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:45:18 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:45:27 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:45:36 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:45:45 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:45:54 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:45:54 rancor DuetControlServer[370]: [warn] SPI connection has been reset
Apr 11 18:45:54 rancor DuetControlServer[370]: [warn] Trigger: Out-of-order reply: ''
Apr 11 18:46:03 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:46:12 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:46:22 rancor DuetControlServer[370]: [info] System time has been changed
Apr 11 18:46:31 rancor DuetControlServer[370]: [info] System time has been changed
-
Please send an email to warranty@duet3d.com and CC your reseller. Include a link to this forum thread and the details of your original purchase. You'll receive a reply with a form to fill out.
Of course we will continue to try and understand and resolve the issue.
Sorry for the inconvenience and thank you for your patience.
-
@phaedrux Done. Will handle the form as soon as I get it.
Thank you
-
@adammhaile Thanks for the log. You have lots of "System time has been changed" messages in there which indicates an I/O or CPU overload on the SBC that can cause frequent timeouts - in detail, the application on the SBC (DCS) fails to get CPU time from the Linux kernel frequently enough so timeouts are a likely consequence.
If you can confirm the CPU usage is normal on the SBC, please consider replacing your SD card with an A-rated microSD card which is better suited for concurrent IO. That should eliminate those messages, too.
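To put numbers on how often these messages occur, here is a minimal Python sketch. It assumes the DuetControlServer journal has been saved as text with one entry per line in the format shown in the log earlier in this thread; the function name and the `year` parameter are hypothetical (the log lines carry no year). It counts each message and measures the gaps between consecutive "SPI connection has been reset" warnings.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# Apr 11 18:42:30 rancor DuetControlServer[370]: [warn] SPI connection has been reset
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ DuetControlServer\[\d+\]: \[(\w+)\] (.*)$"
)
RESET_MSG = "SPI connection has been reset"


def summarize(lines, year=2022):
    """Return (message counts, seconds between consecutive SPI resets)."""
    counts = Counter()
    resets = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a DCS log entry
        stamp, _level, message = match.groups()
        counts[message] += 1
        if message == RESET_MSG:
            resets.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    gaps = [(b - a).total_seconds() for a, b in zip(resets, resets[1:])]
    return counts, gaps


# A few entries copied from the log posted in this thread.
sample = [
    "Apr 11 18:42:30 rancor DuetControlServer[370]: [info] System time has been changed",
    "Apr 11 18:42:30 rancor DuetControlServer[370]: [warn] SPI connection has been reset",
    "Apr 11 18:42:55 rancor DuetControlServer[370]: [info] System time has been changed",
    "Apr 11 18:42:55 rancor DuetControlServer[370]: [warn] SPI connection has been reset",
    "Apr 11 18:43:10 rancor DuetControlServer[370]: [warn] SPI connection has been reset",
]
counts, gaps = summarize(sample)
print(counts["System time has been changed"], counts[RESET_MSG], gaps)
```

A steadily growing count of "System time has been changed" between resets would be consistent with the overload explanation above, though it does not by itself identify the cause.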
-
@chrishamm Interesting...
I've been using one of these microSD cards which is typical for me on the Pi and especially for one that is in a setup like this where "properly" shutting it down each time is not easy.
I noticed in the docs mention of an SD card speed test, which I ran but I'm thinking that it is only meant for a card mounted in the Duet, not the SBC... because... well, these are horrible numbers:
4/12/2022, 8:35:48 AM M122 P104 S5
Testing SD card write speed...
4/12/2022, 8:36:26 AM SD write speed for 5.0Mbyte file was 0.13Mbytes/sec
4/12/2022, 8:36:26 AM Testing SD card read speed...
4/12/2022, 8:43:50 AM SD read speed for 5.0Mbyte file was 0.01Mbytes/sec
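As a sanity check on those figures, the console timestamps can be used to estimate the throughput independently. This is a rough Python sketch: it assumes the timestamps mark the start and end of each phase, which is only approximately true since they record when DWC logged each message. The helper name is hypothetical.

```python
from datetime import datetime

FMT = "%I:%M:%S %p"  # e.g. "8:35:48 AM"


def implied_speed(start: str, end: str, size_mbyte: float) -> float:
    """Throughput in Mbytes/sec implied by two console timestamps."""
    elapsed = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return size_mbyte / elapsed.total_seconds()


# Timestamps and file size taken from the test output above.
write = implied_speed("8:35:48 AM", "8:36:26 AM", 5.0)  # 38 s elapsed
read = implied_speed("8:36:26 AM", "8:43:50 AM", 5.0)   # 444 s elapsed
print(f"write ~{write:.2f} Mbytes/sec, read ~{read:.2f} Mbytes/sec")
```

The elapsed times (38 s for the write, 444 s for the read) agree with the reported 0.13 and 0.01 Mbytes/sec, so the reported speeds match how long the test actually took.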
As for CPU usage - Note: this is a Pi 4 w/ 4GB RAM. No overclock.
This is at machine idle - just on, no job running:
This is during the text at the bottom of a benchy - so tons of tiny moves:
This is a few seconds after the last, with an mjpg_streamer camera stream started:
-
@adammhaile The CPU usage looks OK but I agree the SD test is pretty disappointing. I've been using these SanDisk Extreme 64GB A2 cards and overwrote all of them countless times for DuetPi tests and they're still perfectly fine.
I'm still happy with the Samsung SSDs I have but I cannot say much about the quality of their microSD cards.