Latency optimization for image processing pipelines on FPGAs using HLS

Let’s dive deeper into latency optimization for image processing pipelines on FPGAs using HLS. Low latency is critical for real-time applications such as video processing, autonomous vehicles, and medical imaging.




Key Challenges in Image Processing HLS Designs

  1. High Data Volume: Every pixel of every frame must be processed within the frame budget (e.g., <16.7 ms/frame for 60 FPS).

  2. Memory Bottlenecks: Off-chip DDR access can dominate latency.

  3. Dependency Chains: Sequential operations (e.g., filters) introduce delays.


Step-by-Step Latency Optimization Techniques

1. Algorithm-Level Optimizations

A. Window Buffering (Line Buffers)

  • Instead of processing entire frames, use sliding windows (e.g., 3×3 kernels for convolution).

  • Reduces off-chip memory accesses by caching neighboring pixels in on-chip BRAM.

cpp
// line_buffer caches the three most recent image rows in on-chip BRAM.
// MAX_WIDTH is a compile-time bound on the image width (assumed defined elsewhere).
static uint8_t line_buffer[3][MAX_WIDTH];
#pragma HLS ARRAY_PARTITION variable=line_buffer complete dim=1
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        // Shift this column up by one row
        for (int i = 0; i < 2; i++) {
            line_buffer[i][x] = line_buffer[i+1][x];
        }
        line_buffer[2][x] = read_pixel(x, y);  // Newest pixel enters the bottom row
        // Process a 3x3 window centered at column x-1, whose three columns
        // all hold consistent rows (y-2, y-1, y) after the update above
        if (y >= 2 && x >= 2) {
            process_kernel(line_buffer, x - 1);
        }
    }
}

B. Fixed-Point Arithmetic

  • Replace float with ap_fixed<16,8> to reduce DSP usage and pipeline stages.

  • Example: Sobel edge detection (8-bit pixels → 12-bit gradients):

cpp
#include <ap_fixed.h>
#include <hls_stream.h>
#include <cstdint>

typedef ap_fixed<12,12> grad_t;  // 12-bit signed integer range, enough for gradients up to +/-1020

grad_t sobel(hls::stream<uint8_t>& in_stream) {
    #pragma HLS PIPELINE II=1
    uint8_t window[3][3];
    #pragma HLS ARRAY_PARTITION variable=window complete dim=0
    // Load 3x3 window from stream (in practice this comes from the line buffers above)
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            window[i][j] = in_stream.read();
        }
    }
    // Fixed-point Sobel gradients
    grad_t gx = (window[0][0] - window[0][2]) + 2*(window[1][0] - window[1][2]) + (window[2][0] - window[2][2]);
    grad_t gy = (window[0][0] - window[2][0]) + 2*(window[0][1] - window[2][1]) + (window[0][2] - window[2][2]);
    // |gx| + |gy|: cheap approximation of sqrt(gx*gx + gy*gy) that avoids a true square root
    grad_t ax = (gx < 0) ? grad_t(-gx) : gx;
    grad_t ay = (gy < 0) ? grad_t(-gy) : gy;
    return ax + ay;
}

2. Microarchitecture-Level Optimizations

A. Super-Pipelining

  • Break long combinational paths into smaller stages.

  • Use #pragma HLS PIPELINE II=1 with fine-grained task splitting:

cpp
void process_pixel(hls::stream<uint8_t>& in, hls::stream<uint8_t>& out) {
    #pragma HLS PIPELINE II=1
    // Precomputed gamma curve stored on-chip; values assumed filled in elsewhere
    static const uint8_t gamma_lut[256] = { /* gamma-corrected values */ };
    uint8_t pixel = in.read();
    // Stage 1: Gamma correction via table lookup
    uint8_t corrected = gamma_lut[pixel];
    // Stage 2: Thresholding
    uint8_t binary = (corrected > 128) ? 255 : 0;
    out.write(binary);
}

B. Dataflow for Multi-Stage Pipelines

  • Overlap execution of independent stages (e.g., debayer → denoise → edge detect).

cpp
void image_pipeline(hls::stream<uint8_t>& raw, hls::stream<uint8_t>& out) {
    #pragma HLS DATAFLOW
    hls::stream<uint8_t> debayer_out, denoise_out;
    debayer(raw, debayer_out);      // Stage 1
    denoise(debayer_out, denoise_out); // Stage 2
    sobel(denoise_out, out);        // Stage 3
}

C. Coalesced Memory Access

  • Use burst transfers to DDR with #pragma HLS INTERFACE m_axi latency=32.

  • Align data to 512-bit AXI4 bus width:

cpp
// SIZE = number of 32-bit pixels per frame, assumed defined elsewhere
void read_frame(const ap_uint<512>* frame, hls::stream<ap_uint<512>>& out) {
    #pragma HLS INTERFACE m_axi port=frame bundle=gmem0 latency=32 max_read_burst_length=64
    for (int i = 0; i < SIZE / 16; i++) {   // 16 x 32-bit pixels per 512-bit word
        #pragma HLS PIPELINE II=1
        out.write(frame[i]);  // wide, sequential reads let the tool infer long DDR bursts
    }
}

3. Resource-Aware Optimizations

A. DSP vs. LUT Tradeoffs

  • Force HLS to use LUT-based multipliers for low-latency paths:

cpp
#pragma HLS BIND_OP variable=mul op=mul impl=fabric  // Vitis HLS: LUT-based multiplier instead of a DSP48
// (Legacy Vivado HLS form: #pragma HLS RESOURCE variable=mul core=Mul_LUT)
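
In context, the directive is attached to the variable that receives the product. A minimal sketch (the scale_pixel function and its arguments are hypothetical, for illustration only):

cpp
uint16_t scale_pixel(uint8_t pixel, uint8_t gain) {
    #pragma HLS PIPELINE II=1
    uint16_t mul = pixel * gain;
    // Map this multiply onto fabric LUTs rather than a DSP48 slice
    #pragma HLS BIND_OP variable=mul op=mul impl=fabric
    return mul;
}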

B. BRAM vs. URAM Selection

  • Use URAM (UltraRAM) for large buffers (>32 KB) to reduce block RAM usage:

cpp
#pragma HLS BIND_STORAGE variable=frame_buffer type=ram_2p impl=uram  // Vitis HLS: place the buffer in UltraRAM
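
Applied to a declaration, this might look like the following sketch (buffer_frame and FRAME_PIXELS are assumed names used only for illustration):

cpp
void buffer_frame(hls::stream<uint8_t>& in, hls::stream<uint8_t>& out) {
    // Full-frame buffer: bind to UltraRAM so BRAM stays free for line buffers
    static uint8_t frame_buffer[FRAME_PIXELS];
    #pragma HLS BIND_STORAGE variable=frame_buffer type=ram_2p impl=uram
    for (int i = 0; i < FRAME_PIXELS; i++) {
        #pragma HLS PIPELINE II=1
        frame_buffer[i] = in.read();
    }
    for (int i = 0; i < FRAME_PIXELS; i++) {
        #pragma HLS PIPELINE II=1
        out.write(frame_buffer[i]);
    }
}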

4. Verification & Metrics

  1. Latency Measurement:

    • Use #pragma HLS LATENCY min=1 max=10 to constrain the schedule and have violations reported (see the sketch after this list).

  2. Throughput Check:

    • Ensure II=1 is achieved in all pipelines.

  3. Hardware Utilization:

    • Monitor BRAM/DSP/FF usage in synthesis reports.
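
As a placement reference for item 1, the latency constraint goes inside the function (or region) it should bound; the stage below is a hypothetical one-pixel operation used only to show where the pragma sits:

cpp
void invert_stage(hls::stream<uint8_t>& in, hls::stream<uint8_t>& out) {
    // Constrain the per-call schedule to 1-10 cycles; the synthesis report
    // flags the function if the achieved latency falls outside this range.
    #pragma HLS LATENCY min=1 max=10
    #pragma HLS PIPELINE II=1
    uint8_t p = in.read();
    out.write(255 - p);
}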


Example: Optimized 1080p Grayscale Conversion

Goal: Process 1920×1080 @ 60 FPS (≈124.4 M active pixels per second, i.e., one pixel per cycle at a 124.4 MHz pixel clock).
Optimized HLS:

cpp
void grayscale(hls::stream<ap_uint<24>>& rgb_in, hls::stream<ap_uint<8>>& gray_out) {
    #pragma HLS PIPELINE II=1
    #pragma HLS INTERFACE axis port=rgb_in
    #pragma HLS INTERFACE axis port=gray_out
    ap_uint<24> pixel = rgb_in.read();
    ap_uint<8> r = pixel(7,0), g = pixel(15,8), b = pixel(23,16);
    gray_out.write((r * 77 + g * 150 + b * 29) >> 8);  // Fixed-point RGB2Gray
}

Performance:

  • Throughput: 1 pixel per cycle (II=1) → ~8 ns/pixel at 124.4 MHz, meeting the 60 FPS budget.

  • Resources: ~1 DSP48 and ~128 LUTs (synthesis estimate; exact figures depend on device and tool version).
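
To sanity-check the arithmetic in C simulation, a small testbench along these lines can drive the kernel (the pixel values and expected results are illustrative; the kernel reads one pixel per call):

cpp
#include <hls_stream.h>
#include <ap_int.h>
#include <cstdio>

int main() {
    hls::stream<ap_uint<24>> rgb_in;
    hls::stream<ap_uint<8>>  gray_out;
    // R sits in bits 7:0, G in 15:8, B in 23:16, matching the kernel's unpacking
    rgb_in.write(0xFFFFFF);  // white
    rgb_in.write(0x0000FF);  // pure red
    grayscale(rgb_in, gray_out);
    grayscale(rgb_in, gray_out);
    unsigned g_white = gray_out.read().to_uint();  // expected 255
    unsigned g_red   = gray_out.read().to_uint();  // expected (255*77)>>8 = 76
    printf("white -> %u, red -> %u\n", g_white, g_red);
    return (g_white == 255 && g_red == 76) ? 0 : 1;
}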


Final Checklist for Low-Latency HLS

✅ Pipeline all loops with II=1.
✅ Use fixed-point unless floating-point is unavoidable.
✅ Partition arrays for parallel access.
✅ Leverage dataflow for multi-stage pipelines.
✅ Optimize memory access with burst/AXI-Stream.
✅ Verify cycle-accurate behavior with C/RTL co-simulation and timing closure in the implemented design.
