How do I optimize code for faster execution?

Optimizing code for faster execution, especially on embedded systems such as STM32, means improving how efficiently the CPU runs your program. This involves algorithmic improvements, compiler settings, hardware-aware coding, and memory-access optimization.



Here’s a breakdown of strategies you can apply:


 1. Algorithmic Optimization (Most Important)

 Use efficient algorithms and data structures:

  • Replace O(n²) loops with O(n log n) when possible (e.g., sorting).

  • Minimize nested loops and redundant computations.

  • Cache reused calculations.

 Bad:

```c
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        result += array[i] * array[j];
```

 Better:

```c
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];
}
result = sum * sum;
```

 2. Use Compiler Optimization Flags

If you're using GCC (like in STM32CubeIDE):

  • -O0 – No optimization (default for debugging).

  • -O1, -O2, -O3 – Increasing levels of optimization.

  • -Os – Optimize for size (useful in constrained environments).

  • -Ofast – Aggressive optimizations, might ignore strict IEEE rules.

 Recommended:

Set your Release build to -O2 or -O3 in Project > Properties > C/C++ Build > Settings > Tool Settings > Optimization.
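If most of the project must stay at -O0 for easier debugging, GCC also lets you raise the level for individual hot functions. A minimal sketch (the function name and body are just placeholders):

```c
// GCC-specific: compile only this function at -O3 while the rest of
// the file keeps the project-wide optimization level.
int __attribute__((optimize("O3"))) filter_sample(int x)
{
    return (x * 3) >> 2;  // placeholder processing
}
```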


 3. Use Fixed-Point Instead of Floating-Point (If No FPU)

  • Many STM32 chips (e.g., STM32F103) don’t have an FPU.

  • Floating-point math (float, double) is slow if emulated in software.

  • Use integers or fixed-point libraries.

 Slow:

```c
float result = value * 0.01;
```

 Faster:

```c
int result = (value * 100) / 10000; // Fixed-point equivalent
```
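For repeated scaling by a fractional constant, a common pattern is to store the constant in a fixed-point format such as Q16.16. A minimal sketch (macro and function names are illustrative, not from any particular library):

```c
#include <stdint.h>

// Q16.16 fixed-point: the lower 16 bits hold the fractional part.
#define Q16_ONE         (1 << 16)
#define FLOAT_TO_Q16(f) ((int32_t)((f) * Q16_ONE))

static const int32_t SCALE_Q16 = FLOAT_TO_Q16(0.01); // ~0.01 in Q16.16

int32_t scale_value(int32_t value)
{
    // 64-bit intermediate avoids overflow before shifting back down.
    return (int32_t)(((int64_t)value * SCALE_Q16) >> 16);
}
```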

 4. Avoid Dynamic Memory Allocation

  • Avoid using malloc() and free() inside performance-critical loops.

  • Prefer static or stack allocation (see the sketch below).
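A minimal sketch of replacing per-iteration heap allocation with a buffer allocated once at link time (buffer size and names are illustrative):

```c
#include <stdint.h>

#define SAMPLE_BUF_LEN 256

// Allocated once in .bss instead of calling malloc()/free() in the loop.
static uint16_t sample_buf[SAMPLE_BUF_LEN];

void process_block(void)
{
    for (int i = 0; i < SAMPLE_BUF_LEN; i++) {
        sample_buf[i] = (uint16_t)(i * 2);  // placeholder work
    }
}
```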


 5. Inline Functions and Loop Unrolling

  • Use inline for small, frequently called functions.

  • Unroll loops if they are small and constant-bounded.

Example:

```c
// Instead of:
for (int i = 0; i < 4; i++) result += array[i];

// Do this:
result = array[0] + array[1] + array[2] + array[3];
```
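For the inlining side, a small helper marked static inline avoids call/return overhead on every use (a sketch; the function is just an example):

```c
#include <stdint.h>

// Small, frequently called helper: static inline lets the compiler
// fold the body directly into each caller.
static inline uint8_t clamp_u8(int value)
{
    if (value < 0)   return 0;
    if (value > 255) return 255;
    return (uint8_t)value;
}
```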

 6. Memory Optimization

  • Place time-critical functions in RAM (use __attribute__((section(".ramfunc")))) if flash wait states are an issue; see the sketch after this list.

  • Minimize cache misses or bus contention (if using DMA or peripherals).

  • Optimize flash-to-RAM access when using constant tables.
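A minimal sketch of the RAM-function placement mentioned above. It assumes your linker script already defines a .ramfunc section that the startup code copies to RAM (section names vary between CubeIDE projects and toolchains):

```c
#include <stdint.h>

// Executes from RAM, avoiding flash wait states on a time-critical path.
// Requires a matching ".ramfunc" output section in the linker script.
__attribute__((section(".ramfunc")))
void fast_write(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;  // placeholder time-critical work
}
```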


 7. Use DMA (Direct Memory Access)

Offload data transfer (e.g., ADC, UART, SPI) to DMA so the CPU can focus on processing instead of moving data.
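With the STM32 HAL, for example, the ADC can stream conversions into a buffer while the CPU does other work. A sketch assuming a typical CubeMX-generated project (hadc1, the header name, and the buffer length are assumptions):

```c
#include "stm32f1xx_hal.h"   // adjust to your device family

#define ADC_BUF_LEN 64

extern ADC_HandleTypeDef hadc1;        // generated by CubeMX (assumed)
static uint16_t adc_buf[ADC_BUF_LEN];  // DMA fills this in the background

void start_adc_stream(void)
{
    // After this call the DMA controller moves every sample; the CPU is free.
    HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adc_buf, ADC_BUF_LEN);
}
```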


 8. Profile and Benchmark

  • Use the DWT cycle counter or the SysTick timer to measure the execution time of functions.

  • STM32CubeIDE includes SWV (Serial Wire Viewer) and ITM trace for profiling (on supported MCUs).

```c
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the trace/debug blocks
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // enable the cycle counter
DWT->CYCCNT = 0;                                 // reset the counter

// ... run the function you want to measure ...

uint32_t cycles = DWT->CYCCNT;                   // elapsed CPU cycles
```
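Wrapped into a tiny helper, the same counter can time any function under test (a sketch; assumes a Cortex-M part with the DWT unit, the CMSIS device header already included, and the counter enabled as above):

```c
#include <stdint.h>

// Returns the number of CPU cycles spent inside fn().
static uint32_t measure_cycles(void (*fn)(void))
{
    DWT->CYCCNT = 0;   // reset the cycle counter
    fn();              // function under test
    return DWT->CYCCNT;
}
```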

 9. Avoid Expensive Operations

| Operation | Faster Alternative |
| --- | --- |
| pow(x, 2) | x * x |
| Division / | Bit-shift (for powers of 2) |
| % (modulo) | Bitmask (for powers of 2) |
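For example (a sketch; the shift and mask tricks only apply when the divisor or modulus is a power of two, and the values here are unsigned):

```c
#include <stdint.h>

uint32_t x_squared(uint32_t x) { return x * x; }      // instead of pow(x, 2)
uint32_t div_by_8(uint32_t v)  { return v >> 3; }     // instead of v / 8
uint32_t mod_16(uint32_t v)    { return v & 0xFu; }   // instead of v % 16
```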


Summary

| Technique | Benefit |
| --- | --- |
| Use -O2 or -O3 compiler flags | Basic speedup via the compiler |
| Optimize algorithms | Huge speed gain |
| Replace float with int | Much faster on MCUs without an FPU |
| Use DMA for data movement | Frees CPU cycles |
| Avoid malloc in real-time code | Reduces fragmentation |
| Profile time-critical code | Targets the real bottlenecks |
