Deadlock reading the object model from a plugin
-
Hello,
I am developping a interception plugin on my duet, and I am experiencing occasional deadlocks when reading from the object model, which cause some prints to hang forever.
More details:
- My plugin intercepts some GCode meta-commands and must retrieve a sensor value using a call to
CommandConnection.get_object_model()
from thedsf-python
package. - Most of the time, this just works, but sometimes, it hangs forever, causing the print to never resume because it is stuck in my plugin.
- When it happened, I checked the logs using
sudo journalctl -u duetcontrolserver -r
, and I got the following suspicious lines:
Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel) Oct 04 09:02:35 alpha DuetControlServer[29146]: [warn] Resending packet #0 (request GetObjectModel)
I suspect that the command to retrieve the object model is stuck in a deadlock, which sounds like a synchronization issue.
I have tried to call
CommandConnection.sync_object_model()
before retrieving the object model, but the deadlock also occured once in that case.Do you have any ideas on how to solve this issue ?
Do you think that I am doing something wrong, or could it be an issue in the firmware ?
In that case, should I update it to the latest version (3.5.3) ?Thank you very much for your help!
Antoine
- My plugin intercepts some GCode meta-commands and must retrieve a sensor value using a call to
-
@Ant1 I think you will need @chrishamm to have a look at this issue.
Ian
-
@Ant1 It sounds like the object model response is too long so RRF fails to send it over. A plugin can only request the object model from DCS, which keeps a copy of it for all other applications, so I don't think it's the dsf-python call that is responsible for that problem.
Please do update to the latest version (3.5.3) because I'm quite sure we had to implement a different query method for
tools[]
andmove.axes[]
for large configurations. -
@chrishamm Okay I will update my machines and see if it solves the issue. Thanks!
Otherwise, is there a way to only get a specific key of the object model that prevents full updates to be propagated to the raspberry PI ? Maybe this would help reducing too long responses.
-
@Ant1 No, only plugins can subscribe to specific parts of the object model. Please let me know if this is still an issue with 3.5.3 and if it is, please share your configuration as well.
-
@chrishamm Okay, good to know. I've made the update to 3.5.3 on one of my machines this week, and it seems to be working for now. I will make an update if I still observe issues.
Thanks for your help!
-
@Ant1 Hey! Unfortunately the issue just happened again today
Here is information with regard to my configuration:
config.gAnd here are some information about the plugin that I am developping. I only included the interception part, but if you want to see the other files I can share.
plugin.json
intercept.py -
@Ant1 Sorry to hear that. Did you see the same
Resending packet #0 (request GetObjectModel)
messages this time, too? Your config doesn't look terribly complex, so I find it surprising that you're running out of output buffer space - that's the only reason for that particular log message.You could try to disable PanelDue for testing purposes and check if the problem persists then. PanelDue, networking, and SBC interfaces share the same output buffer pool.
Btw, you should cancel and discard the code being intercepted if
flush
returnsfalse
, else the code action may be still executed even though the underlying code or (macro) file is already cancelled. -
@chrishamm
Yes, it was the same issue withResending package #0
.
Okay I will check if disabling the PanelDue works.
And thank you for the tip, I will add that line of code to the plugin. -
@Ant1 Oh and by the way, we also observed a new, more concerning issue with the plugin. When resuming the GCode execution after a custom command has been intercepted by my plugin, the printer shifts everything vertically, which basically makes the print fail.
I have checked the GCode, and there is always an instruction
G1 Z...
after the call to the macro that gets intercepted, so the machine should move to a specific height. But instead, it moves 1 or 2mm higher than that and it prints in the air.We have checked, and this issue only happens when the plugin is activated. Could it also be some kind of synchronization issue between the plugin and the duet ? I am kinda lost on this one...
-
@Ant1 I can't see what happens in your control class but maybe you open a second command connection and send instructions to a different channel instead of the channel where the code is intercepted. If you have relative positioning on the channel where the code is intercepted from (e.g. DWC ->
HTTP
channel), absolute positioning may be still set on the default code channel for commands (SBC
channel).I generally recommend you use only the intercept connection to send further commands before resuming regular execution. Btw, there seems to be an issue with blocking message boxes that I still need to fix. I've already logged it here:
https://github.com/Duet3D/DuetSoftwareFramework/issues/207 -
@chrishamm Thanks for your issue on GitHub! Is there any way I could help tracking it down ?
Regarding the command connection, I am passing the intercept connection to the different parts of my plugin so that all the commands are sent through it, so this should be working, right ?
The weird thing is that this issue happens randomly. That is, if I launch the same GCode several times in a row, it might succeed, or fail, but at different places. The only common denominator is that it fails to move at the right height after it resumes execution from my plugin.
-
@Ant1 said in Deadlock reading the object model from a plugin:
Is there any way I could help tracking it down ?
Not yet, I need to have a look at this problem again in detail before I can comment on it further. If you can, offload the M291 logic to a separate macro for now and invoke that using
M98
instead. A quick fix for a blocking M291 request from a custom code might be to prioritize M292, however I need evaluate that option again. I'll keep you posted.@Ant1 said in Deadlock reading the object model from a plugin:
Regarding the command connection, I am passing the intercept connection to the different parts of my plugin so that all the commands are sent through it, so this should be working, right ?
Not necessarily. When you run further codes on the same channel but using a different command connection, those codes are queued behind the code that is being intercepted. By sending codes through the intercept connection instead, you can ensure that those codes are executed before the code being intercepted finishes. Perhaps that side-effect upsets your custom logic.
-
@chrishamm
Yes, that's what I meant, currently myInterceptConnection
is shared between all the modules of my code, so that all the commands get executed before the interception finishes.What is weird though is that we used to have a separate
CommandConnection
in a previous version of the plugin and that bug never happened back then... -
@Ant1 I've got fixes ready for the message box deadlock in the DSF v3.5 and v3.6 branches. Prioritizing M292 does indeed fix it. That lets me run your
test.py
script without problems.If you want to diagnose the order of codes being processed, either check out the
CodeLogger
utility in/opt/dsf/bin
or enabledebug
logging via/opt/dsf/conf/config.json
. -
@chrishamm Okay thanks for all the pointers ! I switched all my machines to 3.5.3 and made the changes with the connections that you suggested, and it solves most of my problems.
For the problem that I mentionned due to the machine being shifter along the Z axis after executing a macro from the plugin, I found something that might have caused the issue:
When intercepting a command inside the plugin, I was always starting by sending
G90
andM83
to the machine to make sure that whatever mode it is in, I can always perform moves and extrusions in absolute and relative mode, respectively. I suspect that those two commands might have caused the machine to shift position along Z due to synchronization issues because I removed them and the issue never happened again...