AP_Common: add Cortex-M-specific working setjmp implementation #31740

tpwrules · 2025-12-17T04:17:41Z

This is the ultimate fix for all the hacks we had to do in #24231 and #27056 to ensure floating point registers are properly saved and restored when handling Lua errors. Unlike those, this fixes the problem at the source, which is newlib's setjmp, used in ChibiOS on STM32. The problem could theoretically happen on Cortex-M4 too, this fixes it there as well. The problem could also happen at any setjmp, of which we had one other that was not protected.

I've included a fixed setjmp under a new name based on newlib's patch. We can't use linker wrapping to replace the existing implementation as we also need to grow jmp_buf by 10% or so. I did add a linker wrap to ensure setjmp can't be used unaware in some other part of the code. At such time as we banish older compilers with the broken newlib we can remove ap_setjmp.

Tested the fault handler script created in a previous PR on both Cube Orange and Cube Black (non-trivial!). I tested to 100k iterations and three scripting restarts. I also manually set up the FP registers in the main thread function and validated with a debugger and with prints that they were correctly preserved (and did not get preserved when the various patches were reverted).

This is a prerequisite to fix another hardfault-on-restart in scripting in the panic handling code. To be clear, this particular change doesn’t fix any crash I discovered or believe is easily possible, it just makes the code cleaner and safer.

tridge · 2025-12-17T07:06:09Z

@tpwrules very nice!
what we need to do is ensure we have a lua script that reproduces the problem with the previous fixes reverted.
I think the script I may have used before was based on examples/fault_handling.lua?
I know we did have a script that would reliably trigger the issue. If we can show this PR makes that script run reliably then that would be a very good start

tpwrules · 2025-12-17T14:33:05Z

That’s exactly what I did, see the second to last paragraph. I also custom wrote my own which was just math.random(1,0) in its own file.

tpwrules · 2025-12-19T00:27:24Z

I am unable to reproduce the "scripting dies" failure from the first PR. The first PR correctly identified the issue, but messed up the fix, corrupting the stack along with the FP registers on a Lua error, causing a high chance of a hard fault. The second PR then fixed the fix, but left other setjmps broken (which is what this PR aims to fix for good).

However, with some instrumentation it is possible to reproduce the original failure after reverting the two original PRs but before applying this one.

git checkout -b setjmp-test 86c8eb034b2a4216a0a6
git revert 188df1312d93aff0e405
git revert a497c06e83cc7cacf3db

Then apply this diff and flash your board (I used CubeOrange, no guarantees others will optimize the same):

diff --git a/libraries/AP_Scripting/AP_Scripting.cpp b/libraries/AP_Scripting/AP_Scripting.cpp
index 4a3ff370be..5d9c365505 100644
--- a/libraries/AP_Scripting/AP_Scripting.cpp
+++ b/libraries/AP_Scripting/AP_Scripting.cpp
@@ -295,6 +295,10 @@ MAV_RESULT AP_Scripting::handle_command_int_packet(const mavlink_command_int_t &
 #endif
 
 void AP_Scripting::thread(void) {
+    // s16 is callee-saved, gcc expects other functions to save it if it is used
+    register float canary asm("s17");
+    asm("vmov.f32 s17, #0.5"); // load here so gcc doesn't know the value
+
     while (true) {
         // reset flags
         _stop = false;
@@ -319,7 +323,7 @@ void AP_Scripting::thread(void) {
             lua->run();
 
             // only reachable if the lua backend has died for any reason
-            GCS_SEND_TEXT(MAV_SEVERITY_CRITICAL, "Scripting: %s", "stopped");
+            GCS_SEND_TEXT(MAV_SEVERITY_CRITICAL, "Scripting: %s canary: %f", "stopped", canary);
         }
         delete lua;
         lua = nullptr;

Then load (only) this script:

print("faulting in 10 seconds")
local goes = 0
function update()
  goes = goes + 1
  if goes < 10 then return update, 1000 end
  math.random(1, 0)
end
return update, 1000

A sample mavproxy transcript follows:

Details

STABILIZE> reboot
STABILIZE> Got COMMAND_ACK: PREFLIGHT_REBOOT_SHUTDOWN: ACCEPTED
Device /dev/serial/by-id/usb-Hex_ProfiCNC_CubeOrange_39004C000E51303231383439-if00 is dead
Device /dev/serial/by-id/usb-Hex_ProfiCNC_CubeOrange_39004C000E51303231383439-if00 reopened OK
AP_Logger_File: buffer size=176128
MS5611 found on bus 4 address 0x02
MS5611 found on bus 1 address 0x03
AP: Initialising ArduPilot
disabling flow control on serial 1
disabling flow control on serial 2
AP: Calibrating barometer
AP: Barometer 1 calibration complete
AP: Barometer 2 calibration complete
Init Gyro**
AP: ArduPilot Ready
AP: AHRS: DCM active
AP: RCOut: PWM:1-6
AP: GPS 1: specified as DroneCAN1-125
AP: faulting in 10 seconds
AP: GPS 1: detected DroneCAN1-125
Time has wrapped
Time has wrapped 12339 94460
AP: EKF3 IMU0 initialised
AP: EKF3 IMU1 initialised
AP: EKF3 IMU2 initialised
AP: AHRS: EKF3 active
AP: EKF primary changed:1
AP: EKF3 IMU0 tilt alignment complete
AP: EKF3 IMU1 tilt alignment complete
AP: EKF3 IMU2 tilt alignment complete
AP: EKF3 IMU0 MAG0 initial yaw alignment complete
AP: EKF3 IMU1 MAG0 initial yaw alignment complete
AP: EKF3 IMU2 MAG0 initial yaw alignment complete

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 0.500000
AP: Scripting: restarted
AP: faulting in 10 seconds

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 0.500000
AP: Scripting: restarted
AP: faulting in 10 seconds

STABILIZE> AP: PreArm: Hardware safety switch
AP: PreArm: 3D Accel calibration needed
AP: PreArm: Compass not calibrated
Lua: /APM/scripts/test.lua:6: bad argument #1 to 'random' (interval is empty)
AP: Lua: /APM/scripts/test.lua:6: bad argument #1 to 'random' (interval is empty)

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 1.797500
AP: Scripting: restarted
AP: faulting in 10 seconds

STABILIZE> scripting restart
STABILIZE> Got COMMAND_ACK: SCRIPTING: ACCEPTED
AP: Scripting: stopped canary: 1.797500
AP: Scripting: restarted
AP: faulting in 10 seconds

It shows that the first two restarts happen before the fault, so they show a valid canary value of 0.5. Then the fault happens and math.random changes s17 and does not restore it before longjmping out (which longjmp should, this being the whole bug!). Then restarts after show a bogus canary, even if the script does not actually fault.

Once this PR is applied the canary is always 0.5 even after a fault, showing that the FP registers are correctly preserved and addressing the root cause.

andyp1per · 2025-12-19T07:39:39Z

Amazing work @tpwrules !

Necessary to work around a bug in newlib. This applies to all hardfloat targets, including Cortex-M4. The jmp_buf is now 12 bytes larger, totaling 104 bytes. Info from newlib commit 15ad816dddf836def06cd0330ec0efa9ce50e5bf.

It properly saves and restores the floating point registers.

No longer necessary; verified with a debugger. These were also never applied to Cortex-M4.

Should instead use ap_setjmp.

tpwrules requested review from bugobliterator and tridge December 17, 2025 04:17

tpwrules marked this pull request as ready for review December 17, 2025 04:17

tpwrules added the DevCallEU label Dec 17, 2025

tpwrules force-pushed the pr/stm32-setjmp branch from f5c98f0 to babe58c Compare December 17, 2025 04:23

peterbarker removed the DevCallEU label Dec 17, 2025

tpwrules added 4 commits December 19, 2025 20:03

AP_Common: add Cortex-M-specific working setjmp implementation

ee7d437

Necessary to work around a bug in newlib. This applies to all hardfloat targets, including Cortex-M4. The jmp_buf is now 12 bytes larger, totaling 104 bytes. Info from newlib commit 15ad816dddf836def06cd0330ec0efa9ce50e5bf.

AP_Scripting: use ap_setjmp

5e20905

It properly saves and restores the floating point registers.

AP_Scripting: remove old setjmp workarounds

577822f

No longer necessary; verified with a debugger. These were also never applied to Cortex-M4.

ardupilotwaf: error on setjmp use on ChibiOS

e56f088

Should instead use ap_setjmp.

tpwrules force-pushed the pr/stm32-setjmp branch from babe58c to e56f088 Compare December 20, 2025 02:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AP_Common: add Cortex-M-specific working setjmp implementation #31740

AP_Common: add Cortex-M-specific working setjmp implementation #31740

tpwrules commented Dec 17, 2025 •

edited

Loading

Uh oh!

tridge commented Dec 17, 2025

Uh oh!

tpwrules commented Dec 17, 2025 •

edited

Loading

Uh oh!

tpwrules commented Dec 19, 2025 •

edited

Loading

Uh oh!

andyp1per commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

AP_Common: add Cortex-M-specific working setjmp implementation #31740

Are you sure you want to change the base?

AP_Common: add Cortex-M-specific working setjmp implementation #31740

Conversation

tpwrules commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tridge commented Dec 17, 2025

Uh oh!

tpwrules commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tpwrules commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andyp1per commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tpwrules commented Dec 17, 2025 •

edited

Loading

tpwrules commented Dec 17, 2025 •

edited

Loading

tpwrules commented Dec 19, 2025 •

edited

Loading