Duet 2 underruns high loop times stuttering
Continuation of https://forum.duet3d.com/topic/15421/duet-2-05-memory-leak/130
Start of the problem.
Triplication print of shields for a hospital -- if I did anything prior to the start of a print that would power the steppers, the print would fail with high max loop times and eventually stuttering and, if left to run to conclusion, a full freeze and board self-reset.
Initial workaround that worked -- preheat -- press STOP (M999) -- then start the print. That worked for the first couple of weeks.
I tried upgrading to firmware 2.05.1 from 2.05 and the problem appeared to resolve itself, but after the 3rd or 4th consecutive print the same issues would start occurring. Underruns pointed to an SD card issue -- I have gone through 4 different SD cards, currently using a SanDisk high speed 32 GB -- brand new card.
Issue has gotten worse over time, now back to version 2.05 -- as it's more predictable.
On version 2.05 -- the first print after power up, or after sitting idle for hours since completing a print (e.g. when starting an overnight print) -- would inevitably fail. The next print after a reset would succeed, and subsequent prints would succeed also until the board was either powered off for an extended period of time (1-2 hours) or left idle.
Initial observation was that re-uploading the same gcode file would fix the issues, but that was not necessary, as simply letting the first print fail, resetting and printing again allows the next prints to succeed, so the file variable was removed from the equation.
I have also tried printing via OctoPrint to possibly rule out the SD card, and got high loop times and eventual underrun failure. I told OctoPrint I am using RRF, and not to wait for OK replies -- it still failed.
To answer questions @dc42 had in the previous thread
- A power issue, if the PSU can't handle the additional heater and stepper motor load. What PSU are you using, and are the VIN terminal block screws still tight? However, your M122 report does not show any power outages, assuming you ran the M122 report just once at the end of the print because M122 resets the min and max voltages.
Everything is super tight. The PSU is rated for 600 W (24 V) and is a brand name PSU -- eyeboot -- which makes industrial size PSUs. I have current monitoring on the power rail going into the PSU (on AC) and it never goes to 200 W even in triplicate mode, as my bed is AC powered. The PSU is barely warm since it is well over-spec'ed for the wattage used.
- A temperature issue, because more stepper drivers and heaters are running. However, your M122 report shows a low MCU temperature, assuming you ran the M122 report just once at the end of the print because M122 resets the min and max temperatures.
I have really good cooling. I have never seen a driver overtemp, or any rises in temperature; prior to replacing the faulty Z stepper connectors the board never went above 29c. Now with wires soldered to the duex5 for the Z steppers (4) -- board stays at 26c.
- A firmware issue that is causing DriveMovement objects to be lost from the system. However, your M122 report shows that there were never less than 56 of these free.
That may be possible if this wasn't tied to the 40 min "warm up time" as it seems after 40 minutes the board is ready to print -- and print for days -- if I don't sleep and cycle prints every 5 hours.
- Noise on the bus between the Duet and DueX being worse when you drive another stepper motor on the DueX. This could cause I2C data corruption, or possibly spurious interrupts from the DueX to the Duet (which would lengthen the loop time).
I have looked through M122 results and have never seen i2c errors -- I have the shortest possible 14 gauge ground wire running between duet and duex5, and again, why would it magically start working after a failed print of 40 minutes -- and then subsequent prints have very low loop times (5-10ms max).
The big take away is I don't know for sure that a duplication or a single nozzle print would run fine now if I tried from a cold start -- I had done them before the problem got progressively worse in the triplication mode. I will do one Sunday after the pickup of 200 shields for a hospital on Sunday morning -- as I am printing 100 ear relievers for another hospital and that is most efficiently done in duplication mode.
Thanks for starting a new thread.
- Do you think you can find time at some point to upgrade to RRF 3.01-RC10? This will of course involve changes to your config.g file (see https://duet3d.dozuki.com/Wiki/RepRapFirmware_3_overview#Section_Summary_of_what_you_need_to_do_to_convert_your_configuration_and_other_files). You can set up RRF2 and RRF3 subfolders in /sys so that you can easily revert to RRF2; see the M505 command.
- I have also made some changes post-RC10 to record the longest SD card read time and report it in the M122 report, which may help to prove or disprove that SD card read access is implicated.
- I can add further diagnostics to help pin this down, but this is much easier for me to do in RRF3 than in RRF2.
- Your MCU temperature readings look good, but have you calibrated the MCU temperature sensor? See https://duet3d.dozuki.com/Wiki/Calibrating_the_CPU_temperature.
- In your original thread you reported that you were getting "phase A & B warnings". You have since then rewired the stepper motor connections. Have those warnings gone away?
- Upgrading to RRF3 requires the time I don't have. The actual goal would be to move to duet 3 at some later point (after the covid stuff is over and I have time) -- I wanted to do so, but the rewire job would take a significant amount of time, and I can't (our group can't either) have my quad be down for longer than an evening -- the rewire of the stepper connectors was as long of a down time as I could realistically do. I would do this project at a later date -- now, with pressing needs - I can't.
- That would be good, but if these were in 2.05.something, it would be much easier for me to deploy it to test.
- Just reading all the steps I need to do to move to RRF3 - it's over a day of troubleshooting.
Phase A&B warnings were happening after underruns started and after stuttering -- they were simply a symptom of the slow loop. They were happening on steppers which were not even doing anything -- part of the 4th extruder and its X axis -- it was parked, so that didn't make much sense.
I only had legitimate phase warnings when the connectors melted -- and obviously those have been sorted.
- I have not calibrated the MCU temp, but I check everything with a thermal camera, and nothing is hot -- everything is below 30c -- even my external 2209s don't reach body temperature. My cooling is really good, I have a lot of air blowing across the board and exhausting through various vents.
Question -- it's highly likely the problem is 2 fold in 2.05 -- and most likely it's fixed in 2.05.1 -- but the same reason that now results in cold power on or idle - to require 40 minutes of printing to print properly is the reason I'm having this problem. I can literally power a warm system off for a few minutes -- power it up -- hit print and have it work...cold, I get underruns -- my sense is that's an indication of a hardware fault. Anything software is reproducible every time doing the same thing -- hardware tends to act funny if there is cold solder joint somewhere and magically works when warm.
Having said that, why do you not believe this is a hardware issue? Isn't the indication that an idle print fails, and a warmed up print succeeds an indication of that?
Here is how I see it -- if I order a new duet 2, and it fixes the issue, the likelihood of me getting a duet 3 is zero; I would need a duet 3 and probably 4 or 5 expansion modules, that was the eventual plan. The likelihood of me coming out of this with a positive experience is zero also. It's not a question of money, my employer has offered to pay for the cost of the replacement.
Having said that, why do you not believe this is a hardware issue?
one old TV repair hint ... how about you turn off all your fancy cooling of the duet, heat it up with a hairdryer and see if then it will give you a good print without requirement of 40min "warmup time"
@arhi I've actually thought about doing that -- I'm thinking about doing exactly that tomorrow morning. Right now it's already warmed up and printing happily.
Have you described your system completely somewhere else? It seems like you have additional 2209 steppers hooked up externally? Which axes are those driving?
Also, is this the PSU you are using? https://www.eyeboot.com/24v-600w-dc-power-supply.html
@bot yes that's the power supply. The 2209s are driving Y and U axis, and extruder 0, and 1.
duplication mode -- underruns. This worked before, not now:
=== Diagnostics ===
RepRapFirmware for Duet 2 WiFi/Ethernet version 2.05 running on Duet Ethernet 1.02 or later + DueX5
Board ID: 08DGM-9T6BU-FG3S0-7JTD4-3S06K-1A4ZD
Used output buffers: 1 of 24 (21 max)
=== RTOS ===
Static ram: 25708
Dynamic ram: 96332 of which 0 recycled
Exception stack ram used: 448
Never used ram: 8584
Tasks: NETWORK(ready,616) HEAT(blocked,1136) DUEX(blocked,164) MAIN(running,1668) IDLE(ready,156)
Owned mutexes: I2C(DUEX)
=== Platform ===
Last reset 00:44:07 ago, cause: software
Last software reset at 2020-04-25 22:27, reason: User, spinning module GCodes, available RAM 8560 bytes (slot 3)
Software reset code 0x0003 HFSR 0x00000000 CFSR 0x00000000 ICSR 0x0441f000 BFAR 0xe000ed38 SP 0xffffffff Task 0x4e49414d
Error status: 0
Free file entries: 9
SD card 0 detected, interface speed: 20.0MBytes/sec
SD card longest block write time: 16.5ms, max retries 0
MCU temperature: min 24.1, current 26.0, max 26.6
Supply voltage: min 24.2, current 24.5, max 25.0, under voltage events: 0, over voltage events: 0, power good: yes
Driver 0: standstill, SG min/max 0/333
Driver 1: standstill, SG min/max not available
Driver 2: standstill, SG min/max not available
Driver 3: standstill, SG min/max 0/242
Driver 4: standstill, SG min/max not available
Driver 5: standstill, SG min/max not available
Driver 6: standstill, SG min/max 21/422
Driver 7: standstill, SG min/max 144/451
Driver 8: standstill, SG min/max 72/422
Driver 9: standstill, SG min/max 45/438
Date/time: 2020-04-25 23:11:57
Cache data hit count 4294967295
Slowest loop: 215.62ms; fastest: 0.08ms
I2C nak errors 0, send timeouts 0, receive timeouts 0, finishTimeouts 0, resets 0
=== Move ===
Hiccups: 0, FreeDm: 157, MinFreeDm: 105, MaxWait: 574595ms
Bed compensation in use: none, comp offset 0.000
=== DDARing ===
Scheduled moves: 1335, completed moves: 1304, StepErrors: 0, LaErrors: 0, Underruns: 0, 31
=== Heat ===
Bed heaters = 0 -1 -1 -1, chamberHeaters = -1 -1
Heater 0 is on, I-accum = 1.0
Heater 1 is on, I-accum = 0.3
Heater 2 is on, I-accum = 0.4
=== GCodes ===
Segments left: 0
Stack records: 2 allocated, 0 in use
Movement lock held by null
http is idle in state(s) 0
telnet is idle in state(s) 0
file is idle in state(s) 0
serial is idle in state(s) 0
aux is idle in state(s) 0
daemon is idle in state(s) 0
queue is idle in state(s) 0
autopause is idle in state(s) 0
Code queue is empty.
=== Network ===
Slowest loop: 267.21ms; fastest: 0.06ms
Responder states: HTTP(0) HTTP(0) HTTP(0) HTTP(0) FTP(0) Telnet(0) Telnet(0)
HTTP sessions: 2 of 8
Interface state 5, link 100Mbps full duplex
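For comparing reports like the one above across prints, the handful of fields discussed in this thread can be pulled out with a small script. This is only a sketch: the function name `summarize_m122` is made up for illustration, and the patterns assume the exact line wording shown in this RRF 2.05 report, which may differ in other firmware versions.

```python
import re

def summarize_m122(report: str) -> dict:
    """Extract a few health indicators from the text of an M122 report.

    Hypothetical helper: the patterns below are copied from the wording
    of the RRF 2.05 report in this thread, not from any official format.
    """
    summary = {}

    # "Slowest loop" is printed in more than one section; keep the worst value.
    loops = [float(ms) for ms in re.findall(r"Slowest loop: ([\d.]+)ms", report)]
    summary["slowest_loop_ms"] = max(loops) if loops else None

    # The report prints two underrun counters, e.g. "Underruns: 0, 31".
    m = re.search(r"Underruns: (\d+), (\d+)", report)
    summary["underruns"] = tuple(int(g) for g in m.groups()) if m else None

    # Sum of the first three I2C error counters on the I2C line.
    m = re.search(
        r"I2C nak errors (\d+), send timeouts (\d+), receive timeouts (\d+)",
        report,
    )
    summary["i2c_errors"] = sum(int(g) for g in m.groups()) if m else None

    # Which mutexes were held at the moment the report was taken.
    m = re.search(r"Owned mutexes:[ \t]*(.*)", report)
    summary["owned_mutexes"] = m.group(1).strip() if m else ""

    return summary

# Lines copied from the report above.
sample = """Owned mutexes: I2C(DUEX)
Slowest loop: 215.62ms; fastest: 0.08ms
I2C nak errors 0, send timeouts 0, receive timeouts 0, finishTimeouts 0, resets 0
Scheduled moves: 1335, completed moves: 1304, StepErrors: 0, LaErrors: 0, Underruns: 0, 31
Slowest loop: 267.21ms; fastest: 0.06ms
"""
result = summarize_m122(sample)
print(result)
```

Run against a saved copy of each report, this makes it easier to see at a glance whether a print had long loop times with zero reported I2C errors, which is the combination seen here.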
@dc42 at this point I tried a duplication print which worked fine last week even when I had trouble with triplicate -- and this print is now causing underruns and stuttering. I ask again -- is the board under warranty? I did the same thing that has worked in the past, and it didn't. I'm not sure I can get this print to work at all now. I formatted the card, copied the files over and am trying it again.
@dc42 I had trouble getting all the ear relievers to stay put -- so I slowed down the first layer to 15mm/sec -- after 2 hours that froze up with an underrun -- that's an idex print. Now I wanted to see if it underruns sooner, and tried speeding it up and at 30mm/sec it seems to be sticking OK -- so I ran that -- and that print is giving me much lower loop times. I will wait for another 10 minutes, but this has a shot of working -- is a super slow first layer a problem? I know I've used that technique before with no issues.
Question -- it's highly likely the problem is 2 fold in 2.05 -- and most likely it's fixed in 2.05.1 -- but the same reason that now results in cold power on or idle - to require 40 minutes of printing to print properly is the reason I'm having this problem.
There was an important bug fix in 2.05.1 (a 1-byte buffer overflow) and the consequences of that bug are unknown. You should definitely use a firmware build that includes that bug fix. It was in file OutputBuffer.cpp.
I can literally power a warm system off for a few minutes -- power it up -- hit print and have it work...cold, I get underruns -- my sense is that's an indication of a hardware fault. Anything software is reproducible every time doing the same thing -- hardware tends to act funny if there is cold solder joint somewhere and magically works when warm.
Tasks: NETWORK(ready,616) HEAT(blocked,1136) DUEX(blocked,164) MAIN(running,1668) IDLE(ready,156)
Owned mutexes: I2C(DUEX)
That's significant. It means that the Duet was communicating with the DueX when you ran M122. Were you doing anything that might cause that? For example, changing the speed of a fan connected to the DueX; or toggling an endstop or other switch connected to the DueX?
If it is a hardware problem then I think a likely cause is a poor solder joint between the SX1509B chip on the DueX board and the PCB. We've seen trouble with that before. That could cause spurious input transitions on the pin with the bad joint, leading to extra I2C traffic to read the changed input, leading in turn to increased loop times and underruns.
Do you have any normally-open switches connected to endstop inputs on the DueX5, or to GPIO pins on the DueX5?
I will add a counter in RRF3 to count the number of I2C transactions and display a transactions/minute count in the diagnostics. As you are building your own firmware, you could add a similar count in your RRF2 build.
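The counting logic described here (count every transaction, then report a per-minute rate in the diagnostics) can be sketched as follows. This is not RepRapFirmware code: it is written in Python purely to illustrate the idea, and `TransactionCounter`, `record` and `per_minute_and_reset` are hypothetical names. A firmware build would need the equivalent in C++, called from wherever the I2C transfers are issued.

```python
import time

class TransactionCounter:
    """Counts events and reports the rate per minute since the last report.

    Hypothetical illustration only; not taken from the firmware source.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the logic can be tested
        self._count = 0
        self._since = clock()        # start of the current measuring window

    def record(self):
        """Call once for every transaction that is issued."""
        self._count += 1

    def per_minute_and_reset(self):
        """Return transactions/minute for the window, then start a new window."""
        now = self._clock()
        elapsed = now - self._since
        rate = (self._count * 60.0 / elapsed) if elapsed > 0 else 0.0
        self._count = 0
        self._since = now
        return rate

# Usage with a fake clock: 50 transactions over 30 seconds.
fake_time = [0.0]
counter = TransactionCounter(clock=lambda: fake_time[0])
for _ in range(50):
    counter.record()
fake_time[0] = 30.0
rate = counter.per_minute_and_reset()
print(rate)
```

Resetting the window on each report mirrors how M122 already resets its min/max values, so a sudden jump in the rate between two reports would point at extra bus traffic during that interval.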
@dc42 this print is only 2 extruders, but all the Z axis motors are on the duex5, and one of the fans used by the 2 extruders is on the duex5. That's it. The 2nd extruder has some minimal stuff overflowed to the duex5.
Yes I am building my own firmware
Can you give me the code pointer to add the counter and I'll do that.
I'll switch over to 2.05.1 in the process.
If you're saying it's the duex5, it would make some sense. There was one instance when I just rerouted a wire going to it, just moved it for better management, didn't disturb anything, just unplugged an end stop and plugged it back in, and then on power up the duet refused to boot. It kept cycling. I disconnected everything from the duex5, and the duet booted up. Then I plugged it all back in, one at a time, and everything worked again. So then it sounds like I need a new duex5. I've had i2c issues with it before. Seems my duex5 might have been glitchy for a while, and the times the Z axis motors got singed could have caused some issues with it.
@dc42 I don't use normally open, you guys don't recommend normally open, and it makes sense not to use it. I have 3 normally closed switches and the rest are powered optical switches which also act normally closed.
The duex5 is older; it was also purchased from filastruder, in May 2017. Is that under warranty, or do I need to order a replacement? Kinda sucks cause I just had that board out when I replaced the 4 Z axis connectors. I'm not that proficient in microsoldering to try to reflow an SMD chip.
I just want to clarify something:
You have the 5 stepper drivers on the Duet board, are they all being used?
Then you have 5 more stepper drivers on the duex, are they all being used?
And you also have 4 external drivers, driving Y, U, E0 and E1.
It seems that may be more motors than the firmware can handle? I thought 11 or 12 was the most RRF2 could manage. If you are using all of these, that's 14. Can you clarify?
@bot yes. dc42 helped me add 2 more. It's not in the official firmware because it's 2 extra ops which would normally not be used by anyone else with a duet 2. It has worked for the last 3+ years perfectly fine. There is a way to reuse the pt100 pins on the duet and duex5. Same way that the LCD pins are being reused for 2 extra steppers.
@kazolar Ahh, gotcha. Thanks.
@dc42 so we're back to the i2c mess. I looked through some other posts regarding i2c -- and I saw some related to not running wires next to or along the ribbon connecting the 2 boards...agh...that's the change that I did make when I added extra cable chains, I changed some wiring paths inside the enclosure. God help me, i2c with duet+duex5 is so glitchy. I re-routed cables around the ribbon as best I could. It's kinda tricky since the case isn't really designed for that, but OK, I found ways. I just started a print -- same one that failed 2 times yesterday before somehow magically working to completion overnight. Well I am running it now, and all loop times are low -- 3-4ms -- it's early in the print, but that looks promising. So it's not just the heavy grounding wire, but the ribbon needs to be clear of anything -- that might appear to be the issue here -- not defective hardware, or cold solder or firmware -- but cross talk on that ribbon. I'm using some insulated ribbons for my PC -- PCIEX extension ribbons -- they're expensive -- about $30 each, but they have a lot of protection against this kind of cross talk, shielding and such. Is that something that is worth considering? I am going to design a new case to split the duet and duex5 apart from each other. The way it is now, they're one on top of the other (as I believe is intended), but that leads to some odd wiring runs that make it very difficult to avoid the ribbon.
I will do a triple head print later tonight -- see if that works without a reboot -- so far too far into this duplication print with very low loop times to consider a possible failure.
@dc42 question -- as it seems (too early to conclude) - that i2c is to blame for the issues -- why is i2c not showing any errors or timeouts in m122 reports -- freeing up the ribbon from interference from heater and other wires appears (for now) to have solved the issue -- but how come all this i2c interference -- slow loop times and no i2c errors... would have been too obvious and easy to investigate that path if there were some - poor SD cards got blamed and they appear to have been faultless.
well that was it -- I just did a bunch of starts and stops of a triple headed print -- trying to re-acquire my z offset and I did not reset anything and the current print is running with normal loop times, this would have inevitably triggered high loop times previously. Good to know no physical defect on anything -- just that ribbon cable must be treated like it's a newborn -- I didn't even think about it when I rerouted all the wires to the 2nd cable chain how they were connecting to where they had to connect to. I am still confused why i2c never showed any errors or timeouts or anything -- just increased the loop time tremendously. Thank you for your patience in sorting this out -- I do wish this ribbon came shielded -- I'd gladly pay extra for a shielded 50 pin cable -- I've been searching for one -- honestly at this point $100 for a cable that would not be bothered by things around it would be a bargain.
Glad we found the smoking gun.
Doing some googling it would appear that there is shielded ribbon cable available from digikey and mouser, and in certain other hobbies it looks like it's often DIYed with metallic tape.
I've thought about doing that, and I know how the PC shielded PCIEX extensions are made -- but I have tried that with similar cheap PCIEX cables -- putting HVAC tape around it and then putting regular insulation over it so that the aluminum tape wouldn't short anything, and it didn't make the cable perform any better. I had purchased some inexpensive similar IDE style extensions which claimed to support PCIE-4X, and PCIE-4X devices would not work with them, 1X would -- doing the DIY trick of insulating etc. did not help. Getting a cable for 4x the money that was already shielded worked. So I'd have to hunt around for the proper cable that's premade 50 pin and is the right spacing.
this can be marked solved -- I don't know how to.
You have to convert it into a question and then mark it as solved. I don't know where that's done but the option should be available to you somewhere.
Hey I don't suppose you could post a short video of the triplicate printing in action?
@Phaedrux sure: not sure how long I'll keep it on my dropbox:
After covid is over I'll release a full build vlog