Print stopped at 83% for seemingly no reason

shogran

Hi there,

Started an overnight print that should have taken ~10 hours and noticed the next day it was still sitting at 83% via the web interface. Thought nothing of it and checked back an hour later to see it was still at the same point, I assumed the interface had frozen so I went and checked in on the printer and saw that the printhead was stopped in place on the incomplete model.

The bed and nozzle were still hot and according to the interface at the correct temperatures of 75 and 240 degrees respectively and I could pause the print which would move the head back to 0, then resume. However the resumed print would just continue to sit in the same place. There were no errors in the console and no matter how many times I paused and resumed the print it did not continue. In the end I ran M122 and then cancelled the print. I'm running version 2.02RC3 with server version 1.21 and web interface 1.22.4-b1. The GCODE file was sliced in Simplify3D 4 and uploaded via FileZilla, I've downloaded the file and it looks fine to me. It looks like it occured at layer 127 (Z height 17.66mm). The machine is a CoreXY.

Due to the size of the files I've put the GCODE and some screenshots/photos here along with my config.g file and the M122 text which I'll try to get below as a code snippet: https://www.dropbox.com/sh/384kq6ocd85e95e/AADj-WkBKcL7vvWY1bmKvIn8a?dl=0

If I can provide anything else that will make diagnosing this easier then let me know!

EDIT: If I wrap the code below in the ``` code tags I get told I can't submit as "Post content was flagged as spam by Akismet.com".

10:09:35
M122
=== Diagnostics ===
RepRapFirmware for Duet 2 WiFi/Ethernet version 2.02RC3(RTOS) running on Duet WiFi 1.02 or later
Board ID: 08DDM-9FAM2-LW4S8-6JTDG-3SD6T-13YBW
Used output buffers: 1 of 20 (18 max)
=== RTOS ===
Static ram: 28532
Dynamic ram: 98600 of which 0 recycled
Exception stack ram used: 624
Never used ram: 3316
Tasks: NETWORK(ready,328) HEAT(blocked,1232) MAIN(running,3484)
Owned mutexes:
=== Platform ===
Last reset 14:38:10 ago, cause: power up
Last software reset at 2018-12-04 19:40, reason: User, spinning module GCodes, available RAM 3452 bytes (slot 1)
Software reset code 0x0003 HFSR 0x00000000 CFSR 0x00000000 ICSR 0x0041f000 BFAR 0xe000ed38 SP 0xffffffff Task 0x4e49414d
Error status: 0
Free file entries: 9
SD card 0 detected, interface speed: 20.0MBytes/sec
SD card longest block write time: 0.0ms, max retries 0
MCU temperature: min 38.6, current 38.7, max 38.9
Supply voltage: min 23.2, current 24.2, max 24.5, under voltage events: 0, over voltage events: 0
Driver 0: standstill, SG min/max not available
Driver 1: standstill, SG min/max not available
Driver 2: standstill, SG min/max not available
Driver 3: standstill, SG min/max not available
Driver 4: standstill, SG min/max not available
Date/time: 2018-12-07 10:09:32
Cache data hit count 4294967295
Slowest loop: 1.25ms; fastest: 0.07ms
=== Move ===
Hiccups: 0, StepErrors: 0, LaErrors: 0, FreeDm: 240, MinFreeDm: 240, MaxWait: 0ms, Underruns: 0, 0
Scheduled moves: 402064, completed moves: 401876
Bed compensation in use: mesh
Bed probe heights: 0.000 0.000 0.000 0.000 0.000
=== Heat ===
Bed heaters = 0 -1 -1 -1, chamberHeaters = -1 -1
Heater 0 is on, I-accum = 0.4
Heater 1 is on, I-accum = 0.6
=== GCodes ===
Segments left: 0
Stack records: 2 allocated, 0 in use
Movement lock held by null
http is idle in state(s) 0
telnet is idle in state(s) 0
file is doing "M106 S204" in state(s) 0
serial is idle in state(s) 0
aux is idle in state(s) 0
daemon is idle in state(s) 0
queue is idle in state(s) 0
autopause is idle in state(s) 0
Code queue is not empty:
Queued 'M106 S204' for move 401967
Queued 'M106 S102' for move 401968
Queued 'M106 S204' for move 401970
Queued 'M106 S102' for move 401975
Queued 'M106 S204' for move 401978
Queued 'M106 S102' for move 401979
Queued 'M106 S204' for move 401981
Queued 'M106 S102' for move 401986
Queued 'M106 S204' for move 401989
Queued 'M106 S102' for move 401990
Queued 'M106 S204' for move 402016
Queued 'M106 S102' for move 402017
Queued 'M106 S204' for move 402019
Queued 'M106 S102' for move 402024
Queued 'M106 S204' for move 402027
Queued 'M106 S102' for move 402028
16 of 16 codes have been queued.
=== Network ===
Slowest loop: 178.61ms; fastest: 0.08ms
Responder states: HTTP(0) HTTP(0) HTTP(0) HTTP(0) FTP(0) Telnet(0) Telnet(0)
HTTP sessions: 1 of 8

WiFi -
Network state is running
WiFi module is connected to access point
Failed messages: pending 0, notready 0, noresp 0
WiFi firmware version 1.21
WiFi MAC address 5c:cf:7f:ee:68:10
WiFi Vcc 3.39, reset reason Turned on by main processor
WiFi flash size 4194304, free heap 14168
WiFi IP address 192.168.1.120
WiFi signal strength -76dBm, reconnections 0, sleep mode modem
Socket states: 0 0 0 0 0 0 0 0
=== Expansion ===
10:09:33
Message Log cleared!

dc42

I'm sorry, I think you have found a firmware bug. This line from your report:

Scheduled moves: 402064, completed moves: 401876

has a difference of 188, which is much greater than the size of the move queue. So it must be failing to count some completed moves. That has caused the code queue to become full with M106 commands; so the firmware is waiting for the completed move count to catch up.

I think your only hope of saving the print is to pause it, check that a resurrect.g file has been created, use Emergency Stop to reset the Duet, and use the resurrect feature (M916) to restart the print. ~~I think firmware 2.02RC3 inserts a G92 command in resurrect.g prior to calling resurrect-prologue.g, and if that is the case then you probably only need a M116 command in resurrect-prologue.g.~~

EDIT: I just checked the code. To avoid resetting the Duet (which will power the motors down, potentially losing position) you can pause the print, back up the resurrect.g file, then cancel the print. Then use a text editor to remove the G92 command from resurrect.g. Then use M916 to resume the print (as before you will need M116 in the resurrect-prologue.g file).

dc42

PS - I'm trying to track down how the move counts became so different. Did you pause the print a large number of times?

shogran

@dc42 said in Print stopped at 83% for seemingly no reason:

I'm sorry, I think you have found a firmware bug. This line from your report:

Scheduled moves: 402064, completed moves: 401876

has a difference of 188, which is much greater than the size of the move queue. So it must be failing to count some completed moves. That has caused the code queue to become full with M106 commands; so the firmware is waiting for the completed move count to catch up.

I think your only hope of saving the print is to pause it, check that a resurrect.g file has been created, use Emergency Stop to reset the Duet, and use the resurrect feature (M916) to restart the print. ~~I think firmware 2.02RC3 inserts a G92 command in resurrect.g prior to calling resurrect-prologue.g, and if that is the case then you probably only need a M116 command in resurrect-prologue.g.~~

EDIT: I just checked the code. To avoid resetting the Duet (which will power the motors down, potentially losing position) you can pause the print, back up the resurrect.g file, then cancel the print. Then use a text editor to remove the G92 command from resurrect.g. Then use M916 to resume the print (as before you will need M116 in the resurrect-prologue.g file).

@dc42 said in Print stopped at 83% for seemingly no reason:

PS - I'm trying to track down how the move counts became so different. Did you pause the print a large number of times?

Thanks for the feedback dc, much appreciated.

I did pause and resume the file a few times when I noticed it wasn't printing but not while the print was working. I do have stall detection enabled which I know triggered a few times; could that have caused it/acted the same way as a pause?

dc42

@shogran said in Print stopped at 83% for seemingly no reason:

I do have stall detection enabled which I know triggered a few times; could that have caused it/acted the same way as a pause?

Thanks, that may be where the problem is. What action do you have configured when a stall is detected?

shogran

@dc42 said in Print stopped at 83% for seemingly no reason:

@shogran said in Print stopped at 83% for seemingly no reason:

I do have stall detection enabled which I know triggered a few times; could that have caused it/acted the same way as a pause?

Thanks, that may be where the problem is. What action do you have configured when a stall is detected?

This is my stall detection line:

M915 X Y S10 R3 F0 ; Stall detection, set X and Y to sensitivity 10, pause print, rehome, unfiltered

and this is the rehome.g file it's using:

G91                       ; relative mode
G1 Z5 F200 S2; lower bed
G1 S1 X-305 Y274.5 F3000   ; course home X or Y
G1 S1 X-305              ; course home X
G1 S1 Y274.5               ; course home Y
G1 X4 Y-4 F600             ; move away from the endstops
G1 S1 X-305                ; fine home X
G90
G1 Y274.5             ; fine home Y
G91
G1 Z-5 F200 S2; lower bed

dc42

Thanks.

It looks like it occured at layer 127 (Z height 17.66mm)

The GCode file says that layer 127 has Z height 12.700mm. So do you mean layer 127, or 12.700mm height? It's possible that DWC is displaying the wrong layer count.

I've reviewed the part of the firmware that handles an emergency pause (which is done when there is power loss or a stall is detected), and I found two issues:

When an emergency pause occurs, an off-by-one error meant that the number of scheduled moves ends up one higher than it should be;
After an emergency pause, the queue of deferred commands was not being resynced to allow for moves that were discarded from the queue.

What I think happened was that you had a large number of repeated stalls around the same point in the print, during or shortly before bridging moves. Each one caused the number of scheduled moves to increase by one, and some deferred M106 commands to be left in the queue instead of being purged. This explains the following part of the M122 report:

Code queue is not empty:
Queued 'M106 S204' for move 401967
Queued 'M106 S102' for move 401968
Queued 'M106 S204' for move 401970
Queued 'M106 S102' for move 401975
Queued 'M106 S204' for move 401978
Queued 'M106 S102' for move 401979
Queued 'M106 S204' for move 401981
Queued 'M106 S102' for move 401986
Queued 'M106 S204' for move 401989
Queued 'M106 S102' for move 401990
Queued 'M106 S204' for move 402016
Queued 'M106 S102' for move 402017
Queued 'M106 S204' for move 402019
Queued 'M106 S102' for move 402024
Queued 'M106 S204' for move 402027
Queued 'M106 S102' for move 402028

The move numbers are so close together that I think the multiple pairs M106 moves must repetitions of the same sequence of bridging moves.

Do you think it possible that the stall detection was triggered 150 or more times, with at least the last 8 being all around the same place? If you have logging enabled, then the stalls may be recorded in the log file.

shogran

@dc42 said in Print stopped at 83% for seemingly no reason:

Thanks.

It looks like it occured at layer 127 (Z height 17.66mm)

The GCode file says that layer 127 has Z height 12.700mm. So do you mean layer 127, or 12.700mm height? It's possible that DWC is displaying the wrong layer count.

Ah, that would make sense actually as it was printing with a 0.1mm layer height. DWC may have been displaying the wrong info in this instance.

I've reviewed the part of the firmware that handles an emergency pause (which is done when there is power loss or a stall is detected), and I found two issues:

When an emergency pause occurs, an off-by-one error meant that the number of scheduled moves ends up one higher than it should be;

After an emergency pause, the queue of deferred commands was not being resynced to allow for moves that were discarded from the queue.

What I think happened was that you had a large number of repeated stalls around the same point in the print, during or shortly before bridging moves. Each one caused the number of scheduled moves to increase by one, and some deferred M106 commands to be left in the queue instead of being purged. This explains the following part of the M122 report:

Code queue is not empty:
Queued 'M106 S204' for move 401967
Queued 'M106 S102' for move 401968
Queued 'M106 S204' for move 401970
Queued 'M106 S102' for move 401975
Queued 'M106 S204' for move 401978
Queued 'M106 S102' for move 401979
Queued 'M106 S204' for move 401981
Queued 'M106 S102' for move 401986
Queued 'M106 S204' for move 401989
Queued 'M106 S102' for move 401990
Queued 'M106 S204' for move 402016
Queued 'M106 S102' for move 402017
Queued 'M106 S204' for move 402019
Queued 'M106 S102' for move 402024
Queued 'M106 S204' for move 402027
Queued 'M106 S102' for move 402028

The move numbers are so close together that I think the multiple pairs M106 moves must repetitions of the same sequence of bridging moves.

Do you think it possible that the stall detection was triggered 150 or more times, with at least the last 8 being all around the same place? If you have logging enabled, then the stalls may be recorded in the log file.

Unfortunately I didn't have logging enabled, though I have now added it to my config.g for future use. I think it's possible that it was triggered that many times as I've only recently enabled it and am still tuning the sensitivity. I've definitely been in the room a few times when it has triggered a few times in the space of a few seconds on the same point before moving past it.

dc42

Note, the sensitivity to stalls increases with motor temperature. So a stall sensitivity in the M915 command that was OK when the motors were cold may be too sensitive when the motors are hot due to printing for some time.

shogran

@dc42 Thanks, I've restarted the print with a reduced sensitivity on the stall detection and logging enabled. I'll update here with the outcome.

shogran

@dc42 Printed without any issues after reducing the sensitivity and not having the stall detection trigger at all that I could see. Looks like that may have been the cause.

dc42

@shogran said in Print stopped at 83% for seemingly no reason:

@dc42 Printed without any issues after reducing the sensitivity and not having the stall detection trigger at all that I could see. Looks like that may have been the cause.

i'm glad it worked. I've put a fix for the off-by-one issue in the forthcoming 2.02RC6 release. Please accept my apologies for the failed print.