Dead driver or dying board?
-
I have a config with a lot of steppers -- double IDEX and 4 Z steppers. I am running a Duet 3 main board with 6 stepper drivers, two 3-driver expansion boards and 4 toolboards. I've had this config, on Duet 3 electronics, for a couple of years. Recently I've had an odd issue. My primary Y gantry, which runs on two NEMA 23s, started acting weird; more specifically, the right Y stepper, which is on driver 5 of the main board, randomly decides to stop moving. It locks up -- i.e. the other Y stepper keeps trying to move the gantry and fights it.
So enable is on, but clearly step is not doing anything. When this happens, there are no errors from the driver in the console or DWC. M122 also shows no missed steps, only some driver timeouts, and those are there when things are fine too. It's getting worse over time: first the printer ran for 15+ hours with no issues, then 4 hrs, then ~2 hrs. E-stop doesn't fix it, but a power cycle does, temporarily.
I ruled out connection issues (I redid the stepper connector and wiring).
I have a spare 3-driver expansion board, so I moved both Y steppers to that board, along with the Y end stops. I completed a long 19 hour print with no issue. If driver 5 on the main board has developed an issue, should I be expecting more issues with the main Duet 3 board, or can I just cross driver 5 off and run with the expansion board for the Y axis? Are timeouts from drivers on the main board normal? Should I be looking to replace the main board (it's v1.01)?
Edited for length
-
@kazolar I ruined my CNC frame with a similar problem. One Y motor stopped and skipped steps, the other went on.
To avoid that once and for all, you should add an "anti racking" mechanism.
Maybe couple both motors (dual shaft motors preferably) with a shaft extension or do what is common in big machines: add a cord/wire to the crossbeam to eliminate racking.
Check out the first few pages of my hashPrinter thread, where I used that method a lot. There are also some links to my YouTube channel, where I demonstrate the anti racking effect.
PS: If you want more people to read your wall of text, you should edit it for better readability.
-
@o_lampe thank you for the tips. I kinda have to live with the decoupled approach, as I dialed in each Y end stop to perfectly square the gantry, so having the steppers coupled together would defeat the purpose. Also, with each gantry running 2 carriages, there is no room to do that type of kinematic. The gantry is actually designed to allow for a few degrees of flex to self square when homing. I updated my rambling to be more specific to the issue I'm asking about. More to the point (now that I know the machine runs with no issue when the expansion board controls the Y axis): does the failing (or failed) driver 5 signify that the main board is on its way out?
-
@kazolar the issue is likely to be confined to that driver only.
When the driver stops working, are you sure that M122 still reports the driver status as OK? Does the soldering of driver 5 and the components around it look OK?
-
@kazolar PS - here's another test you can do, if you can provoke the problem without damaging your machine:
- After power up and before the driver stops working, send: M569 P5 R1. The response will probably be 0x00000005.
- Send M569 P5 R1 V7 followed by M569 P5 R1. The response should be 0x00000000.
- Provoke the problem.
- Send M569 P5 R1 again and report the response.
Another useful piece of information would be to know whether sending M18 Y followed by M17 Y gets it working again without a power cycle. This will mark the Y axis as not homed so normal movement won't work but homing Y should.
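For reading the responses in the test above, here is a minimal Python sketch. It assumes that M569 P5 R1 returns the raw value of the driver's register at address 1, and that the main board drivers are Trinamic TMC5160-family parts, where that register is GSTAT with three flag bits. Neither assumption is confirmed in this thread, and the function name decode_gstat is made up for illustration.

```python
# Assumption: register 1 is GSTAT as described in the Trinamic TMC5160
# datasheet. The flags are cleared by writing 1 to them, which would match
# the "V7" write in the test steps.
GSTAT_FLAGS = {
    0: "reset",    # driver has been reset since the flag was last cleared
    1: "drv_err",  # driver shut down (overtemperature or short circuit)
    2: "uv_cp",    # charge pump undervoltage
}


def decode_gstat(response: str) -> list[str]:
    """Return the names of the flags set in a hex response like '0x00000005'."""
    value = int(response, 16)  # base 16 accepts an optional '0x' prefix
    return [name for bit, name in sorted(GSTAT_FLAGS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    for reply in ("0x00000005", "0x00000000"):
        print(reply, decode_gstat(reply))
```

Under these assumptions, a reply with drv_err set after the lock-up would point at the driver having shut itself down, while a reply with reset set would suggest the driver lost power or was reset in between.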
-
@dc42 M18/M17 doesn't fix it -- it does allow me to move the axis freely by hand, so the stepper does release. E-stop also disables the stepper. When I try to home the axis, the stepper on driver 5 locks up again. Only a power cycle clears the gremlins (seemingly for a rather short period of time now). I see no physical issues near the driver. I didn't take the board out to examine it under my micro soldering scope, but visual inspection in situ doesn't raise any suspicions.
Yes, when I paused the printer the last time the problem occurred, M122 said all drivers were OK, which is why this is weird.
Can you suggest how I can add this driver back into the config so that it's not part of the kinematics? Can it be the 3rd stepper of my Y axis? Does it need its own end stop? I can plug in a spare stepper, have it sit on the side during the print and watch for it to lock up. Putting it back into the kinematics is not ideal for a test; it's a NEMA 23 with a lot of holding torque. My Y gantries are designed to allow for some flex to self square, but this behavior is very violent.
More to the point -- how isolated is the driver? If it's the worst case, am I losing any performance or risking any issue by using an expansion board for the Y axis? I figured it's better to move both Y steppers to the same expansion board. More specifically, can I run it as is, or is the main board going to degrade more? I've not had any overheating issues; all boards have adequate cooling and stay in the 30 °C reported temp range.
-
@kazolar you can add the driver back as a 3rd Y axis driver, however if your two existing Y axis drivers have separate endstops then you would need a 3rd endstop.
Yes it's better to put both Y drivers on an expansion board (the same one) rather than just one of them. When endstops are triggered there is a small latency before drivers on expansion boards are stopped, so they may overshoot very slightly. They are then reverted to the position they had when the endstop trigger was detected. So when a single endstop is used, if only one driver is on an expansion board then only that one will overshoot and revert slightly.
Another reason to drive both motors from the same board is so that in the event of a CAN bus failure or a board reset, the motors don't move out of sync.
My guess is that either the driver is being affected by heat, or there is a bad solder joint in that area that is affected by heat.
-
@dc42 Yes, that's what I did: I moved the Y axis entirely to the expansion board, so both Y steppers and end stops are on the same expansion board. I recall reading that recommendation before, that a paired axis should be on the same expansion board. The last time the error occurred was after the printer had been off for at least an hour while I re-wired and re-routed the problematic stepper/driver, so the printer hadn't been on long enough for heat to become a problem. The curious part is that the Y axis was previously using driver 0 and driver 5, which are next to each other; driver 0 was fine, and 5 was the one that had problems. If it's localized to an area, it's really specific. As I said, the only thing I saw in M122, and saw again yesterday while the printer was printing normally, was a low count of timeouts on drivers -- all expansion boards showed 0 for timeouts. Are timeouts a normal thing? Should I just run the printer as is until another driver fails and replace the main board then? I'm afraid that if I wait long enough, I may get burned by tariff nonsense.
-
@kazolar it's normal to see zero driver timeouts, although you might see one timeout after VIN is powered up.
-
@dc42 I see non-zero timeouts during normal operation on my main board. My expansion boards show zeros. During the print yesterday I was seeing timeouts of 8-33 when I was polling the main board. Expansion boards said 0. Oddly enough, the one axis that is on the main board, X, briefly showed stalled and then went back to ok, even though things were moving fine and nothing was wrong. Is this a sign of something more serious? The 19 hour print finished and came out perfect. With zsp the bed mesh is working beautifully. I took a zoomed in picture of the board, and see nothing suspicious.
-
@dc42 here is typical m122 from main board
m122
=== Diagnostics ===
RepRapFirmware for Duet 3 MB6HC version 3.6.0-rc.1 (2025-02-28 15:00:13) running on Duet 3 MB6HC v1.01 (standalone mode)
Board ID: 08DJM-956BA-NA3TJ-6J1F4-3S06T-KV8UT
Used output buffers: 3 of 40 (36 max)
=== RTOS ===
Static ram: 137420
Dynamic ram: 135436 of which 0 recycled
Never used RAM 58824, free system stack 130 words
Tasks: NETWORK(1,ready,32.9%,180) ETHERNET(5,nWait 7,0.2%,307) LASER(5,nWait 7,0.7%,167) HEAT(3,nWait 6,0.0%,331) Move(4,nWait 6,0.3%,213) TMC(4,nWait 6,3.2%,341) CanReceiv(6,nWait 1,0.2%,770) CanSender(5,nWait 7,0.0%,327) CanClock(7,delaying,0.0%,348) MAIN(1,running,62.4%,500) IDLE(0,ready,0.0%,29) USBD(3,blocked,0.0%,149), total 100.0%
Owned mutexes:
=== Platform ===
Last reset 01:31:08 ago, cause: power up
Last software reset at 2058-01-24 11:14, reason: HardFault, none spinning, available RAM 222236, slot 2
Software reset code 0x4073 HFSR 0x40000000 CFSR 0x00080000 ICSR 0x00000803 BFAR 0x00000000 SP 0x20415c78 Task MAIN Freestk 2306 ok
Stack: 001d37c0 0000fffe 20412118 00000000 00000000 0049a7ff 0049a7fe a1000000 00000000 ffffffff 00000000 00000000 204137f0 20413608 ffffffff 00000000 2041221c 0049a867 00000050 00000058 00000000 00497a07 20429738 0040055d 00000050 20429728 00000000
=== Storage ===
Free file entries: 19
SD card 0 detected, interface speed: 25.0MBytes/sec
SD card longest read time 2.6ms, write time 3.3ms, max retries 0
=== Move ===
Segments created 229, maxWait 767ms, bed comp in use: mesh, height map offset 0.000, hiccups added 0/0 (0.00/30.23ms), max steps late 0, ebfmin 0.00, ebfmax 0.00
Pos req/act/dcf: 28643.00/28940/-0.17 15681.00/16154/-0.67 419.00/418/0.98 61485.00/61485/0.00 63990.00/63990/0.00 63995.00/63995/-0.00 -3405.00/-3405/0.00
Next step interrupt due in 28 ticks, disabled
Driver 0: standstill, SG min n/a, mspos 8, reads 29077, writes 242 timeouts 23
Driver 1: standstill, SG min n/a, mspos 136, reads 29077, writes 242 timeouts 23
Driver 2: stalled, SG min 0, mspos 367, reads 29077, writes 242 timeouts 23
Driver 3: standstill, SG min n/a, mspos 88, reads 29077, writes 242 timeouts 23
Driver 4: standstill, SG min n/a, mspos 24, reads 29077, writes 242 timeouts 23
Driver 5: standstill, SG min n/a, mspos 8, reads 29077, writes 242 timeouts 23
Phase step loop runtime (us): min=0, max=150, frequency (Hz): min=501, max=15957
=== Heat ===
Bed heaters 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1, chamber heaters -1 -1 -1 -1 -1 -1 -1 -1, ordering errs 0
Heater 0 is on, I-accum = 0.1
Heater 1 is on, I-accum = 0.0
=== GCodes ===
Movement locks held by null, null
HTTP is idle in state(s) 0
Telnet is idle in state(s) 0
File is idle in state(s) 3
USB is idle in state(s) 0
Aux is idle in state(s) 0
Trigger is idle in state(s) 0
Queue is idle in state(s) 0
LCD is idle in state(s) 0
SBC is idle in state(s) 0
Daemon is idle in state(s) 0
Aux2 is idle in state(s) 0
Autopause is idle in state(s) 0
File2 is idle in state(s) 0
Queue2 is idle in state(s) 0
=== CAN ===
Messages queued 84006, received 233772, lost 0, ignored 0, errs 0, boc 0
Longest wait 0ms for reply type 0, peak Tx sync delay 568, free buffers 50 (min 46), ts 9233/9233/0
Tx timeouts 0,0,0,0,0,0
=== Network ===
Slowest loop: 22.09ms; fastest: 0.03ms
Responder states: MQTT(0) HTTP(0) HTTP(0) HTTP(0) HTTP(0) HTTP(0) HTTP(0) FTP(0) Telnet(0) Telnet(0)
HTTP sessions: 4 of 8
=== Multicast handler ===
Responder is inactive, messages received 0, responses 0
= Ethernet =
Interface state: active
Error counts: 0 0 0 0 0 0
Socket states: 6 6 6 2 2 0 0 0 0
-
@kazolar was that M122 report taken after the problem with driver 5 occurred, or before?
-
@dc42 before, this is "normal" for me, when everything is printing fine.
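To compare timeout counts between polls, or between a "normal" report and one taken after a lock-up, the per-driver lines of an M122 report can be pulled out with a small script. This is a rough Python sketch that assumes the line layout seen in the report above ("Driver N: status, ... timeouts T"); the function name parse_driver_lines is made up for illustration, and other firmware versions may format these lines differently.

```python
import re

# Matches lines such as:
#   Driver 2: stalled, SG min 0, mspos 367, reads 29077, writes 242 timeouts 23
# Only the first status word after the colon is captured.
DRIVER_LINE = re.compile(
    r"^Driver (?P<driver>\d+): (?P<status>[^,]+), .*?timeouts (?P<timeouts>\d+)\s*$"
)


def parse_driver_lines(report: str) -> dict[int, tuple[str, int]]:
    """Map driver number -> (status, timeout count) for each driver line found."""
    result = {}
    for line in report.splitlines():
        match = DRIVER_LINE.match(line.strip())
        if match:
            result[int(match["driver"])] = (match["status"], int(match["timeouts"]))
    return result


if __name__ == "__main__":
    sample = (
        "Driver 0: standstill, SG min n/a, mspos 8, reads 29077, writes 242 timeouts 23\n"
        "Driver 2: stalled, SG min 0, mspos 367, reads 29077, writes 242 timeouts 23\n"
    )
    for driver, (status, timeouts) in parse_driver_lines(sample).items():
        print(f"driver {driver}: {status}, timeouts {timeouts}")
```

Saving each M122 report to a text file and running it through this would make it easier to see whether the timeout count on driver 5 diverges from the other main board drivers before a lock-up.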