Skip to content

Conversation

@tpwrules
Copy link
Contributor

@tpwrules tpwrules commented Dec 17, 2025

This is the ultimate fix for all the hacks we had to do in #24231 and #27056 to ensure floating point registers are properly saved and restored when handling Lua errors. Unlike those, this fixes the problem at the source, which is newlib's setjmp, used in ChibiOS on STM32. The problem could theoretically happen on Cortex-M4 too, this fixes it there as well. The problem could also happen at any setjmp, of which we had one other that was not protected.

I've included a fixed setjmp under a new name based on newlib's patch. We can't use linker wrapping to replace the existing implementation as we also need to grow jmp_buf by 10% or so. I did add a linker wrap to ensure setjmp can't be used unaware in some other part of the code. At such time as we banish older compilers with the broken newlib we can remove ap_setjmp.

Tested the fault handler script created in a previous PR on both Cube Orange and Cube Black (non-trivial!). I tested to 100k iterations and three scripting restarts. I also manually set up the FP registers in the main thread function and validated with a debugger and with prints that they were correctly preserved (and did not get preserved when the various patches were reverted).

This is a prerequisite to fix another hardfault-on-restart in scripting in the panic handling code. To be clear, this particular change doesn’t fix any crash I discovered or believe is easily possible, it just makes the code cleaner and safer.

@tpwrules tpwrules marked this pull request as ready for review December 17, 2025 04:17
@tridge
Copy link
Contributor

tridge commented Dec 17, 2025

@tpwrules very nice!
what we need to do is ensure we have a lua script that reproduces the problem with the previous fixes reverted.
I think the script I may have used before was based on examples/fault_handling.lua?
I know we did have a script that would reliably trigger the issue. If we can show this PR makes that script run reliably then that would be a very good start

@tpwrules
Copy link
Contributor Author

tpwrules commented Dec 17, 2025

That’s exactly what I did, see the second to last paragraph. I also custom wrote my own which was just math.random(1,0) in its own file.

@tpwrules
Copy link
Contributor Author

tpwrules commented Dec 19, 2025

I am unable to reproduce the "scripting dies" failure from the first PR. The first PR correctly identified the issue, but messed up the fix, corrupting the stack along with the FP registers on a Lua error, causing a high chance of a hard fault. The second PR then fixed the fix, but left other setjmps broken (which is what this PR aims to fix for good).

However, with some instrumentation it is possible to reproduce the original failure after reverting the two original PRs but before applying this one.

git checkout -b setjmp-test 86c8eb034b2a4216a0a6
git revert 188df1312d93aff0e405
git revert a497c06e83cc7cacf3db

Then apply this diff and flash your board (I used CubeOrange, no guarantees others will optimize the same):

diff --git a/libraries/AP_Scripting/AP_Scripting.cpp b/libraries/AP_Scripting/AP_Scripting.cpp
index 4a3ff370be..5d9c365505 100644
--- a/libraries/AP_Scripting/AP_Scripting.cpp
+++ b/libraries/AP_Scripting/AP_Scripting.cpp
@@ -295,6 +295,10 @@ MAV_RESULT AP_Scripting::handle_command_int_packet(const mavlink_command_int_t &
 #endif
 
 void AP_Scripting::thread(void) {
+    // s16 is callee-saved, gcc expects other functions to save it if it is used
+    register float canary asm("s17");
+    asm("vmov.f32 s17, #0.5"); // load here so gcc doesn't know the value
+
     while (true) {
         // reset flags
         _stop = false;
@@ -319,7 +323,7 @@ void AP_Scripting::thread(void) {
             lua->run();
 
             // only reachable if the lua backend has died for any reason
-            GCS_SEND_TEXT(MAV_SEVERITY_CRITICAL, "Scripting: %s", "stopped");
+            GCS_SEND_TEXT(MAV_SEVERITY_CRITICAL, "Scripting: %s canary: %f", "stopped", canary);
         }
         delete lua;
         lua = nullptr;

Then load (only) this script:

print("faulting in 10 seconds")
local goes = 0
function update()
  goes = goes + 1
  if goes < 10 then return update, 1000 end
  math.random(1, 0)
end
return update, 1000

A sample mavproxy transcript follows:

Details
STABILIZE> reboot
STABILIZE> Got COMMAND_ACK: PREFLIGHT_REBOOT_SHUTDOWN: ACCEPTED
Device /dev/serial/by-id/usb-Hex_ProfiCNC_CubeOrange_39004C000E51303231383439-if00 is dead
Device /dev/serial/by-id/usb-Hex_ProfiCNC_CubeOrange_39004C000E51303231383439-if00 reopened OK
AP_Logger_File: buffer size=176128
MS5611 found on bus 4 address 0x02
MS5611 found on bus 1 address 0x03
AP: Initialising ArduPilot
disabling flow control on serial 1
disabling flow control on serial 2
AP: Calibrating barometer
AP: Barometer 1 calibration complete
AP: Barometer 2 calibration complete
Init Gyro**
AP: ArduPilot Ready
AP: AHRS: DCM active
AP: RCOut: PWM:1-6
AP: GPS 1: specified as DroneCAN1-125
AP: faulting in 10 seconds
AP: GPS 1: detected DroneCAN1-125
Time has wrapped
Time has wrapped 12339 94460
AP: EKF3 IMU0 initialised
AP: EKF3 IMU1 initialised
AP: EKF3 IMU2 initialised
AP: AHRS: EKF3 active
AP: EKF primary changed:1
AP: EKF3 IMU0 tilt alignment complete
AP: EKF3 IMU1 tilt alignment complete
AP: EKF3 IMU2 tilt alignment complete
AP: EKF3 IMU0 MAG0 initial yaw alignment complete
AP: EKF3 IMU1 MAG0 initial yaw alignment complete
AP: EKF3 IMU2 MAG0 initial yaw alignment complete

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 0.500000
AP: Scripting: restarted
AP: faulting in 10 seconds

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 0.500000
AP: Scripting: restarted
AP: faulting in 10 seconds

STABILIZE> AP: PreArm: Hardware safety switch
AP: PreArm: 3D Accel calibration needed
AP: PreArm: Compass not calibrated
Lua: /APM/scripts/test.lua:6: bad argument #1 to 'random' (interval is empty)
AP: Lua: /APM/scripts/test.lua:6: bad argument #1 to 'random' (interval is empty)

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 1.797500
AP: Scripting: restarted
AP: faulting in 10 seconds

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 1.797500
AP: Scripting: restarted
AP: faulting in 10 seconds

It shows that the first two restarts happen before the fault, so they show a valid canary value of 0.5. Then the fault happens and math.random changes s17 and does not restore it before longjmping out (which longjmp should, this being the whole bug!). Then restarts after show a bogus canary, even if the script does not actually fault.

Once this PR is applied the canary is always 0.5 even after a fault, showing that the FP registers are correctly preserved and addressing the root cause.

@andyp1per
Copy link
Contributor

Amazing work @tpwrules !

Necessary to work around a bug in newlib. This applies to all hardfloat
targets, including Cortex-M4.

The jmp_buf is now 12 bytes larger, totaling 104 bytes.

Info from newlib commit 15ad816dddf836def06cd0330ec0efa9ce50e5bf.
It properly saves and restores the floating point registers.
No longer necessary; verified with a debugger. These were also never
applied to Cortex-M4.
Should instead use ap_setjmp.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants