How do I optimize code for faster execution?

Optimizing code for faster execution, especially on embedded systems such as STM32, means improving how efficiently the CPU runs your program. This involves algorithmic improvements, compiler settings, hardware-aware coding, and memory-access optimization.



Here’s a breakdown of strategies you can apply:


 1. Algorithmic Optimization (Most Important)

 Use efficient algorithms and data structures:

  • Replace O(n²) loops with O(n log n) when possible (e.g., sorting).

  • Minimize nested loops and redundant computations.

  • Cache reused calculations.

 Bad:

```c
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        result += array[i] * array[j];
```

 Better:

```c
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i];
}
result = sum * sum;
```

 2. Use Compiler Optimization Flags

If you're using GCC (like in STM32CubeIDE):

  • -O0 – No optimization (default for debugging).

  • -O1, -O2, -O3 – Increasing levels of optimization.

  • -Os – Optimize for size (useful in constrained environments).

  • -Ofast – Aggressive optimizations, might ignore strict IEEE rules.

 Recommended:

Set your Release build to -O2 or -O3 in Project > Properties > C/C++ Build > Settings > Tool Settings > Optimization.
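If most of the project must stay at -O0 for easier debugging, GCC also lets you raise the level for individual hot functions. A minimal sketch (the function name and body are just placeholders):

```c
// GCC-specific: compile only this function at -O3 while the rest of
// the file keeps the project-wide optimization level.
int __attribute__((optimize("O3"))) filter_sample(int x)
{
    return (x * 3) >> 2;  // placeholder processing
}
```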


 3. Use Fixed-Point Instead of Floating-Point (If No FPU)

  • Many STM32 chips (e.g., STM32F103) don’t have an FPU.

  • Floating-point math (float, double) is slow if emulated in software.

  • Use integers or fixed-point libraries.

 Slow:

```c
float result = value * 0.01;
```

 Faster:

```c
int result = (value * 100) / 10000; // Fixed-point equivalent
```
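For repeated scaling by a fractional constant, a common pattern is to store the constant in a fixed-point format such as Q16.16. A minimal sketch (macro and function names are illustrative, not from any particular library):

```c
#include <stdint.h>

// Q16.16 fixed-point: the lower 16 bits hold the fractional part.
#define Q16_ONE         (1 << 16)
#define FLOAT_TO_Q16(f) ((int32_t)((f) * Q16_ONE))

static const int32_t SCALE_Q16 = FLOAT_TO_Q16(0.01); // ~0.01 in Q16.16

int32_t scale_value(int32_t value)
{
    // 64-bit intermediate avoids overflow before shifting back down.
    return (int32_t)(((int64_t)value * SCALE_Q16) >> 16);
}
```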

 4. Avoid Dynamic Memory Allocation

  • Avoid using malloc() and free() inside performance-critical loops.

  • Prefer static or stack allocation (see the sketch below).
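A minimal sketch of replacing per-iteration heap allocation with a buffer allocated once at link time (buffer size and names are illustrative):

```c
#include <stdint.h>

#define SAMPLE_BUF_LEN 256

// Allocated once in .bss instead of calling malloc()/free() in the loop.
static uint16_t sample_buf[SAMPLE_BUF_LEN];

void process_block(void)
{
    for (int i = 0; i < SAMPLE_BUF_LEN; i++) {
        sample_buf[i] = (uint16_t)(i * 2);  // placeholder work
    }
}
```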


 5. Inline Functions and Loop Unrolling

  • Use inline for small, frequently called functions.

  • Unroll loops if they are small and constant-bounded.

Example:

```c
// Instead of:
for (int i = 0; i < 4; i++) result += array[i];

// Do this:
result = array[0] + array[1] + array[2] + array[3];
```
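For the inlining side, a small helper marked static inline avoids call/return overhead on every use (a sketch; the function is just an example):

```c
#include <stdint.h>

// Small, frequently called helper: static inline lets the compiler
// fold the body directly into each caller.
static inline uint8_t clamp_u8(int value)
{
    if (value < 0)   return 0;
    if (value > 255) return 255;
    return (uint8_t)value;
}
```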

 6. Memory Optimization

  • Place time-critical functions in RAM (use __attribute__((section(".ramfunc")))) if flash wait states are an issue; see the sketch after this list.

  • Minimize cache misses or bus contention (if using DMA or peripherals).

  • Optimize flash-to-RAM access when using constant tables.
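A minimal sketch of the RAM-function placement mentioned above. It assumes your linker script already defines a .ramfunc section that the startup code copies to RAM (section names vary between CubeIDE projects and toolchains):

```c
#include <stdint.h>

// Executes from RAM, avoiding flash wait states on a time-critical path.
// Requires a matching ".ramfunc" output section in the linker script.
__attribute__((section(".ramfunc")))
void fast_write(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;  // placeholder time-critical work
}
```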


 7. Use DMA (Direct Memory Access)

Offload data transfer (e.g., ADC, UART, SPI) to DMA so the CPU can focus on processing instead of moving data.
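With the STM32 HAL, for example, the ADC can stream conversions into a buffer while the CPU does other work. A sketch assuming a typical CubeMX-generated project (hadc1, the header name, and the buffer length are assumptions):

```c
#include "stm32f1xx_hal.h"   // adjust to your device family

#define ADC_BUF_LEN 64

extern ADC_HandleTypeDef hadc1;        // generated by CubeMX (assumed)
static uint16_t adc_buf[ADC_BUF_LEN];  // DMA fills this in the background

void start_adc_stream(void)
{
    // After this call the DMA controller moves every sample; the CPU is free.
    HAL_ADC_Start_DMA(&hadc1, (uint32_t *)adc_buf, ADC_BUF_LEN);
}
```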


 8. Profile and Benchmark

  • Use the DWT cycle counter or the SysTick timer to measure the execution time of functions.

  • STM32CubeIDE includes SWV (Serial Wire Viewer) and ITM trace for profiling (on supported MCUs).

```c
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the trace/debug blocks
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // enable the cycle counter
DWT->CYCCNT = 0;                                 // reset the counter

// ... run the function you want to measure ...

uint32_t cycles = DWT->CYCCNT;                   // elapsed CPU cycles
```
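Wrapped into a tiny helper, the same counter can time any function under test (a sketch; assumes a Cortex-M part with the DWT unit, the CMSIS device header already included, and the counter enabled as above):

```c
#include <stdint.h>

// Returns the number of CPU cycles spent inside fn().
static uint32_t measure_cycles(void (*fn)(void))
{
    DWT->CYCCNT = 0;   // reset the cycle counter
    fn();              // function under test
    return DWT->CYCCNT;
}
```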

 9. Avoid Expensive Operations

| Operation | Faster Alternative |
| --- | --- |
| pow(x, 2) | x * x |
| Division / | Bit-shift (for powers of 2) |
| % (modulo) | Bitmask (for powers of 2) |
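For example (a sketch; the shift and mask tricks only apply when the divisor or modulus is a power of two, and the values here are unsigned):

```c
#include <stdint.h>

uint32_t x_squared(uint32_t x) { return x * x; }      // instead of pow(x, 2)
uint32_t div_by_8(uint32_t v)  { return v >> 3; }     // instead of v / 8
uint32_t mod_16(uint32_t v)    { return v & 0xFu; }   // instead of v % 16
```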


Summary

| Technique | Benefit |
| --- | --- |
| Use -O2 or -O3 compiler flags | Basic speedup via the compiler |
| Optimize algorithms | Huge speed gain |
| Replace float with int | Much faster on MCUs without an FPU |
| Use DMA for data movement | Frees CPU cycles |
| Avoid malloc in real-time code | Reduces fragmentation |
| Profile time-critical code | Targets the real bottlenecks |
