DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1



  • @chrishamm
    I've been having big issues since the latest update to 3.01-R10 / DWC 2.1.5 / DSF 2.1.1 where the Pi has become incredibly unstable to the extent that it drops WiFi connection and totally locks up the Pi. I can move the mouse around on the screen but can't otherwise interact. If left in this state long enough it ends up nuking the CPU such that the max temperature warning icon shows on the LCD, meaning that it hit 85 degrees.

    It's taken a while to track down to provide any proof as the system either works perfectly, or is usually completely unresponsive. It first happened immediately after I updated and had to power cycle the whole system to get things working again. While I haven't found a 100% foolproof way of recreating the issue, it usually seems to happen immediately after a cold start rather than mid print etc. If I power cycle, it seems to boot fine afterwards. I'm wondering whether it's something to do with communicating with the D3 and waiting for a response? Twice it also appears to have happened when a new DWC was opened on a remote PC, but I wouldn't like to say whether those were anything more than coincidence. Either way, it's frequent enough that I'll be going back to RC9 etc for now.

    The last time it happened (turning on from cold), I just about managed to SSH in and grab the screen shot below. 400% is clearly as bad as things can get as it's effectively maxing out all 4 cores. In the 30 seconds or so that I managed to maintain connection, the CPU usage for DCS didn't drop below 374%. After that the whole Pi crashes. Reverting back to RC9 etc and everything seems fine.
    [Screenshot: 2020-04-26 (3).png]



  • Can you do a journalctl -fu duetcontrolserver and see if there are any issues?
    If you do a systemctl stop duetwebserver does the usage go back to normal?



  • To be honest, the time I got the screenshot was the only time that SSH has managed to connect long enough to do anything useful. I was about to try pulling the journal log before it died again. I can't do it straight on the Pi as aside from the mouse moving, the gui is unresponsive. Similarly, I have no way to try stopping until I can maintain a connection long enough, but my suspicion is that stopping the service will recover the system, but it'd be interesting to know what restarting does.

    Since I got the screenshot I've managed to recreate the issue a good number of times, but never in a way that lets me interact with the system 😕

    Edit: Is there an easy way to load a terminal in the gui on boot and run journalctl -fu duetcontrolserver?



  • Try this... With the Pi running, even on the previous DSF release, do a systemctl stop duetwebserver duetcontrolserver and systemctl disable duetwebserver duetcontrolserver. That will keep the services from starting on reboot. Upgrade to the latest packages. I don't remember if the upgrade re-enables the services so to be safe, do the disable again and reboot.

    Once the system is back up and running, in one terminal window do the journalctl -fu duetcontrolserver, in another run top, and in another do a systemctl start duetcontrolserver. See what happens. If everything is stable, do a start on the duetwebserver and connect via DWC and see what happens.
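
    The sequence above could be scripted roughly as follows. This is only a sketch: the run helper and the DRY_RUN variable are made up for illustration, and by default the script just prints each command instead of executing it.

```shell
#!/bin/sh
# Sketch only. `run` and DRY_RUN are illustrative helpers, not part of DSF.
# With DRY_RUN unset (or 1) each command is printed instead of executed;
# set DRY_RUN=0 on the Pi to run them for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop both services and keep them from starting on the next boot.
isolate_dsf() {
    run sudo systemctl stop duetwebserver duetcontrolserver
    run sudo systemctl disable duetwebserver duetcontrolserver
}

# Upgrade, then disable again in case the upgrade re-enabled the services.
upgrade_dsf() {
    run sudo apt update
    run sudo apt upgrade
    run sudo systemctl disable duetwebserver duetcontrolserver
}

isolate_dsf
upgrade_dsf
run sudo reboot

# After the reboot, one command per terminal window:
#   journalctl -fu duetcontrolserver
#   top
#   sudo systemctl start duetcontrolserver
```

    Starting duetcontrolserver on its own first, and duetwebserver only once things look stable, should help show which of the two services triggers the CPU spike.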



  • I'm pretty sure I have also seen this problem, but I've not been able to reproduce it. It was just after updating to RC10/2.1.1. I did the rPi update/upgrade and then went to check if it was working using a browser and DWC. I forced a reload of DWC (to make sure I was using the updated DWC) and it refreshed and said it was trying to connect, at that point the window I had open running ssh to the pi popped up a notification that the pi connection had been lost. Following this I was unable to ssh back to the pi and the DWC web server was no longer responding.

    I rebooted the pi and everything seemed fine. I was then moving around testing various things and the same thing happened again. To be honest I thought at the time it was my pi overheating or something, but having seen this report I now suspect it was the same issue.



  • Okay, so I managed to induce it this time simply by restarting the Pi from the command line. I can tell when it's going to happen before the gui fully loads on the Pi because putty takes an age to show the login prompt. This time I managed to get 3 sessions open: one to run top to see the CPU usage, one for the journal log (below) and one to restart the DCS service. The service session up to 17:43:37 was exhibiting the problem. I then restarted the process with sudo systemctl restart duetcontrolserver.service (at 17:44:38), at which point DCS restarted and everything was fine. The log shows nothing else though 😕

    pi@starttex:~ $ sudo journalctl -u duetcontrolserver -f
    -- Logs begin at Sun 2020-04-26 17:43:32 BST. --
    Apr 26 17:43:35 starttex systemd[1]: Started Duet Control Server.
    Apr 26 17:43:38 starttex DuetControlServer[359]: Duet Control Server v2.1.1
    Apr 26 17:43:38 starttex DuetControlServer[359]: Written by Christian Hammacher for Duet3D
    Apr 26 17:43:38 starttex DuetControlServer[359]: Licensed under the terms of the GNU Public License Version 3
    Apr 26 17:43:39 starttex DuetControlServer[359]: [info] Settings loaded
    Apr 26 17:43:40 starttex DuetControlServer[359]: [info] Environment initialized
    Apr 26 17:43:40 starttex DuetControlServer[359]: [info] Connection to Duet established
    Apr 26 17:43:40 starttex DuetControlServer[359]: [info] IPC socket created at /var/run/dsf/dcs.sock
    Apr 26 17:44:37 starttex DuetControlServer[359]: [info] System time has been changed
    Apr 26 17:44:38 starttex systemd[1]: Stopping Duet Control Server...
    Apr 26 17:44:38 starttex DuetControlServer[359]: [warn] Received SIGTERM, shutting down...
    Apr 26 17:44:38 starttex systemd[1]: duetcontrolserver.service: Main process exited, code=exited, status=143/n/a
    Apr 26 17:44:38 starttex systemd[1]: duetcontrolserver.service: Failed with result 'exit-code'.
    Apr 26 17:44:38 starttex systemd[1]: Stopped Duet Control Server.
    Apr 26 17:44:38 starttex systemd[1]: Started Duet Control Server.
    Apr 26 17:44:38 starttex DuetControlServer[1719]: Duet Control Server v2.1.1
    Apr 26 17:44:38 starttex DuetControlServer[1719]: Written by Christian Hammacher for Duet3D
    Apr 26 17:44:38 starttex DuetControlServer[1719]: Licensed under the terms of the GNU Public License Version 3
    Apr 26 17:44:39 starttex DuetControlServer[1719]: [info] Settings loaded
    Apr 26 17:44:39 starttex DuetControlServer[1719]: [info] Environment initialized
    Apr 26 17:44:39 starttex DuetControlServer[1719]: [info] Connection to Duet established
    Apr 26 17:44:39 starttex DuetControlServer[1719]: [info] IPC socket created at /var/run/dsf/dcs.sock
    Apr 26 17:45:09 starttex DuetControlServer[1719]: [info] System time has been changed
    


  • @gloomyandy said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    ...I forced a reload of DWC (to make sure I was using the updated DWC) and it refreshed and said it was trying to connect, at that point the window I had open running ssh to the pi popped up a notification that the pi connection had been lost. Following this I was unable to ssh back to the pi and the DWC web server was no longer responding...

    This is exactly the behaviour I observe if I actually manage to connect.



  • Concur - RPi is unusable, unpredictable, certainly wouldn't trust this printer to do anything useful currently.

    Can't even access via SSH

    Only way to get it back is power cycle.

    No point going any further with this version, now need to try and roll back this version - can't use it, can't even reliably help debug.



  • If you're unable to access SSH there is a serial console on the GPIO header, might be a bit tricky to get to it with the Duet ribbon cable in place unless you made your own cable for this purpose.

    (of course there is also an HDMI port and a USB port for a keyboard..)



  • More hassle than I'm in the mood for, it is clear that the RPi runs for a while, then no more connection after a time ,,,,

    Currently trying to roll back (never done it before) ... so have RPi SD card on the bench ....



  • @Garfield said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    More hassle than I'm in the mood for, it is clear that the RPi runs for a while, then no more connection after a time ,,,,

    Currently trying to roll back (never done it before) ... so have RPi SD card on the bench ....

    I've done this a few times now. The easiest way I've found is to remove what's there first: sudo apt remove 'duet*' 'reprap*'

    Then to go back you have to install the components by specifying the version for each package. This will get you back to RC9 etc: sudo apt install duetsoftwareframework=2.1.0 duetcontrolserver=2.1.0 duettools=2.1.0 duetwebcontrol=2.1.4 reprapfirmware=2.1.0-1 duetruntime=2.1.0
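
    To keep a later apt upgrade from pulling the newer packages straight back in, the rolled-back packages could also be put on hold. A sketch, assuming the package list above is complete; the PINNED variable and the package_names helper are made up here, and the script only prints the commands so they can be checked before running them:

```shell
#!/bin/sh
# Package=version pairs copied from the post above (the RC9-era set).
PINNED="duetsoftwareframework=2.1.0 duetcontrolserver=2.1.0 duettools=2.1.0 duetwebcontrol=2.1.4 reprapfirmware=2.1.0-1 duetruntime=2.1.0"

# Illustrative helper: strip the "=version" suffix to get bare package names.
package_names() {
    for spec in $PINNED; do
        echo "${spec%%=*}"
    done
}

names=$(package_names | tr '\n' ' ')

# Print the commands for review instead of running them.
echo "apt-cache policy $names"    # show installed and candidate versions
echo "sudo apt install $PINNED"   # reinstall at the pinned versions
echo "sudo apt-mark hold $names"  # keep apt upgrade from replacing them
```

    The hold can be lifted again with sudo apt-mark unhold followed by the same package names once a fixed release is available.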



  • Very much appreciate the heads up.

    Notes taken and stored ....



  • I just had a crash with this in the console

    Warning: Lost connection to Duet (Timeout while waiting for transfer ready pin)
    

    Don't know if it's the same.
    I'm having to reboot the pi to get back in.



  • @jay_s_uk said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    Timeout while waiting for transfer ready pin

    that would imply DCS is blaming the duet, but that's no guarantee DCS isn't still at fault, i guess



  • @bearer

    The pi has been rebooted and as soon as I try and do something, I lose all connection to the pi. I can't even SSH into it.
    Cutting power gets it back up and running again.

    First time round, the first thing I tried to do was run my tool unlock macro and the same thing happened again



  • Spoke too soon. The whole thing has died again.



  • Back down at RC9 but my CPU fan is still not working correctly - really don't want to go back to RC7 but can't handle the constant noise.

    I KNOW this fan can work correctly - has done so since RC1 .... why can't I see fan 2 on the dashboard ? (doesn't even appear in the display filter)

    Has something changed in gcode ????

    M308 S2 Y"mcu-temp" A"CPU"
    M950 F2 C"!out4" A"MCU" Q32000 L5  
    M106 P2 T40:45 H2      ; set Duet cooling fan	
    


  • @Garfield said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    More hassle than I'm in the mood for

    you could probably ssh in, stop the DuetControlServer service and run it in the foreground to try and capture any relevant debugging info. installing and using screen would prevent DCS from being terminated if the ssh session is terminated.

    sudo apt install -y screen and then run sudo systemctl stop duetcontrolserver followed by sudo screen /opt/dsf/bin/DuetControlServer -l debug



  • At the time you couldn't ssh - the RPi wasn't responsive at all, if you had an SSH session open it just stopped responding.

    I will try the screen though - what does that offer? - a non DWC web gui ?



  • @jay_s_uk said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    Spoke too soon.

    if you've got access to console/ssh could you also run something like top and see if it spots something to suggest DCS gets stuck in a loop?



  • @Garfield said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    I will try the screen though - what does that offer?

    it's a terminal multiplexer / window manager or something like that. it means dcs will keep running if you have a network glitch. if you run dcs in the foreground and ssh stops, all the processes in that shell are terminated; with screen they can keep running.
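
    a rough sketch of how that could look in practice. the session name is made up; -L, -S, -ls and -r are standard screen options. -L writes the session output to a log file (screenlog.0 in the current directory by default), so the debug output should still be there after an ssh drop. the snippet only prints the commands rather than running them:

```shell
#!/bin/sh
SESSION=dcs-debug                    # hypothetical session name
DCS=/opt/dsf/bin/DuetControlServer   # binary path as given earlier in the thread

# Stop the service first so only the foreground copy talks to the Duet.
stop_cmd="sudo systemctl stop duetcontrolserver"
# -L turns on output logging, -S names the session so it is easy to find.
start_cmd="sudo screen -L -S $SESSION $DCS -l debug"
# After reconnecting: list the running sessions, then reattach by name.
list_cmd="sudo screen -ls"
attach_cmd="sudo screen -r $SESSION"

printf '%s\n' "$stop_cmd" "$start_cmd" "$list_cmd" "$attach_cmd"
```

    inside the session, Ctrl-A followed by D detaches without stopping DCS.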



  • First message

    [warn] RepRapFirmware got a bad header checksum

    and then [screen is terminating]



  • @Garfield said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    [warn] RepRapFirmware got a bad header checksum

    I was getting those when I first set up my D3 & Pi. Firmly re-seating the ribbon cable on both ends cleared them up.



  • @bearer said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    @jay_s_uk said in DCS Crash with 3.01-R10 / DWC 2.1.5 / DSF 2.1.1:

    Spoke too soon.

    if you've got access to console/ssh could you also run something like top and see if it spots something to suggest DCS gets stuck in a loop?

    Terminal dies as soon as the web connection does.
    I'll try and run DCS through the session and see what it spits out. It'll be later on though as I'm on bedtime duty now.



  • the only thing I can think of with respect to DCS complaining about the RDY pin is basically an interrupt storm, which can grind the Pi to a halt. not sure if it's relevant though.

