Latency optimization for image processing pipelines on FPGAs using HLS

Let’s dive deeper into latency optimization for image processing pipelines on FPGAs using HLS. Low latency is critical for real-time applications such as video processing, autonomous vehicles, and medical imaging.




Key Challenges in Image Processing HLS Designs

  1. High Data Volume: Every pixel of every frame must be processed within the frame budget (e.g., <16.7 ms/frame for 60 FPS).

  2. Memory Bottlenecks: Off-chip DDR access can dominate latency.

  3. Dependency Chains: Sequential operations (e.g., filters) introduce delays.


Step-by-Step Latency Optimization Techniques

1. Algorithm-Level Optimizations

A. Window Buffering (Line Buffers)

  • Instead of processing entire frames, use sliding windows (e.g., 3×3 kernels for convolution).

  • Reduces off-chip memory accesses by caching neighboring pixels in on-chip BRAM.

cpp
// line_buffer caches the three most recent image rows in on-chip BRAM.
// MAX_WIDTH is a compile-time bound on the image width (assumed defined elsewhere).
static uint8_t line_buffer[3][MAX_WIDTH];
#pragma HLS ARRAY_PARTITION variable=line_buffer complete dim=1
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        // Shift this column up by one row
        for (int i = 0; i < 2; i++) {
            line_buffer[i][x] = line_buffer[i+1][x];
        }
        line_buffer[2][x] = read_pixel(x, y);  // Newest pixel enters the bottom row
        // Process a 3x3 window centered at column x-1, whose three columns
        // all hold consistent rows (y-2, y-1, y) after the update above
        if (y >= 2 && x >= 2) {
            process_kernel(line_buffer, x - 1);
        }
    }
}

B. Fixed-Point Arithmetic

  • Replace float with ap_fixed<16,8> to reduce DSP usage and pipeline stages.

  • Example: Sobel edge detection (8-bit pixels → 12-bit gradients):

cpp
#include <ap_fixed.h>
#include <hls_stream.h>
#include <cstdint>

typedef ap_fixed<12,12> grad_t;  // 12-bit signed integer range, enough for gradients up to +/-1020

grad_t sobel(hls::stream<uint8_t>& in_stream) {
    #pragma HLS PIPELINE II=1
    uint8_t window[3][3];
    #pragma HLS ARRAY_PARTITION variable=window complete dim=0
    // Load 3x3 window from stream (in practice this comes from the line buffers above)
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            window[i][j] = in_stream.read();
        }
    }
    // Fixed-point Sobel gradients
    grad_t gx = (window[0][0] - window[0][2]) + 2*(window[1][0] - window[1][2]) + (window[2][0] - window[2][2]);
    grad_t gy = (window[0][0] - window[2][0]) + 2*(window[0][1] - window[2][1]) + (window[0][2] - window[2][2]);
    // |gx| + |gy|: cheap approximation of sqrt(gx*gx + gy*gy) that avoids a true square root
    grad_t ax = (gx < 0) ? grad_t(-gx) : gx;
    grad_t ay = (gy < 0) ? grad_t(-gy) : gy;
    return ax + ay;
}

2. Microarchitecture-Level Optimizations

A. Super-Pipelining

  • Break long combinational paths into smaller stages.

  • Use #pragma HLS PIPELINE II=1 with fine-grained task splitting:

cpp
void process_pixel(hls::stream<uint8_t>& in, hls::stream<uint8_t>& out) {
    #pragma HLS PIPELINE II=1
    // Precomputed gamma curve stored on-chip; values assumed filled in elsewhere
    static const uint8_t gamma_lut[256] = { /* gamma-corrected values */ };
    uint8_t pixel = in.read();
    // Stage 1: Gamma correction via table lookup
    uint8_t corrected = gamma_lut[pixel];
    // Stage 2: Thresholding
    uint8_t binary = (corrected > 128) ? 255 : 0;
    out.write(binary);
}

B. Dataflow for Multi-Stage Pipelines

  • Overlap execution of independent stages (e.g., debayer → denoise → edge detect).

cpp
void image_pipeline(hls::stream<uint8_t>& raw, hls::stream<uint8_t>& out) {
    #pragma HLS DATAFLOW
    hls::stream<uint8_t> debayer_out, denoise_out;
    debayer(raw, debayer_out);      // Stage 1
    denoise(debayer_out, denoise_out); // Stage 2
    sobel(denoise_out, out);        // Stage 3
}

C. Coalesced Memory Access

  • Use burst transfers to DDR with #pragma HLS INTERFACE m_axi latency=32.

  • Align data to 512-bit AXI4 bus width:

cpp
// SIZE = number of 32-bit pixels per frame, assumed defined elsewhere
void read_frame(const ap_uint<512>* frame, hls::stream<ap_uint<512>>& out) {
    #pragma HLS INTERFACE m_axi port=frame bundle=gmem0 latency=32 max_read_burst_length=64
    for (int i = 0; i < SIZE / 16; i++) {   // 16 x 32-bit pixels per 512-bit word
        #pragma HLS PIPELINE II=1
        out.write(frame[i]);  // wide, sequential reads let the tool infer long DDR bursts
    }
}

3. Resource-Aware Optimizations

A. DSP vs. LUT Tradeoffs

  • Force HLS to use LUT-based multipliers for low-latency paths:

cpp
#pragma HLS BIND_OP variable=mul op=mul impl=fabric  // Vitis HLS: LUT-based multiplier instead of a DSP48
// (Legacy Vivado HLS form: #pragma HLS RESOURCE variable=mul core=Mul_LUT)
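
In context, the directive is attached to the variable that receives the product. A minimal sketch (the scale_pixel function and its arguments are hypothetical, for illustration only):

cpp
uint16_t scale_pixel(uint8_t pixel, uint8_t gain) {
    #pragma HLS PIPELINE II=1
    uint16_t mul = pixel * gain;
    // Map this multiply onto fabric LUTs rather than a DSP48 slice
    #pragma HLS BIND_OP variable=mul op=mul impl=fabric
    return mul;
}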

B. BRAM vs. URAM Selection

  • Use URAM (UltraRAM) for large buffers (>32 KB) to reduce block RAM usage:

cpp
#pragma HLS BIND_STORAGE variable=frame_buffer type=ram_2p impl=uram  // Vitis HLS: place the buffer in UltraRAM
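
Applied to a declaration, this might look like the following sketch (buffer_frame and FRAME_PIXELS are assumed names used only for illustration):

cpp
void buffer_frame(hls::stream<uint8_t>& in, hls::stream<uint8_t>& out) {
    // Full-frame buffer: bind to UltraRAM so BRAM stays free for line buffers
    static uint8_t frame_buffer[FRAME_PIXELS];
    #pragma HLS BIND_STORAGE variable=frame_buffer type=ram_2p impl=uram
    for (int i = 0; i < FRAME_PIXELS; i++) {
        #pragma HLS PIPELINE II=1
        frame_buffer[i] = in.read();
    }
    for (int i = 0; i < FRAME_PIXELS; i++) {
        #pragma HLS PIPELINE II=1
        out.write(frame_buffer[i]);
    }
}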

4. Verification & Metrics

  1. Latency Measurement:

    • Use #pragma HLS LATENCY min=1 max=10 to constrain the schedule and have violations reported (see the sketch after this list).

  2. Throughput Check:

    • Ensure II=1 is achieved in all pipelines.

  3. Hardware Utilization:

    • Monitor BRAM/DSP/FF usage in synthesis reports.
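
As a placement reference for item 1, the latency constraint goes inside the function (or region) it should bound; the stage below is a hypothetical one-pixel operation used only to show where the pragma sits:

cpp
void invert_stage(hls::stream<uint8_t>& in, hls::stream<uint8_t>& out) {
    // Constrain the per-call schedule to 1-10 cycles; the synthesis report
    // flags the function if the achieved latency falls outside this range.
    #pragma HLS LATENCY min=1 max=10
    #pragma HLS PIPELINE II=1
    uint8_t p = in.read();
    out.write(255 - p);
}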


Example: Optimized 1080p Grayscale Conversion

Goal: Process 1920×1080 @ 60 FPS (≈124.4 M active pixels per second, i.e., one pixel per cycle at a 124.4 MHz pixel clock).
Optimized HLS:

cpp
void grayscale(hls::stream<ap_uint<24>>& rgb_in, hls::stream<ap_uint<8>>& gray_out) {
    #pragma HLS PIPELINE II=1
    #pragma HLS INTERFACE axis port=rgb_in
    #pragma HLS INTERFACE axis port=gray_out
    ap_uint<24> pixel = rgb_in.read();
    ap_uint<8> r = pixel(7,0), g = pixel(15,8), b = pixel(23,16);
    gray_out.write((r * 77 + g * 150 + b * 29) >> 8);  // Fixed-point RGB2Gray
}

Performance:

  • Throughput: 1 pixel per cycle (II=1) → ~8 ns/pixel at 124.4 MHz, meeting the 60 FPS budget.

  • Resources: ~1 DSP48 and ~128 LUTs (synthesis estimate; exact figures depend on device and tool version).
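
To sanity-check the arithmetic in C simulation, a small testbench along these lines can drive the kernel (the pixel values and expected results are illustrative; the kernel reads one pixel per call):

cpp
#include <hls_stream.h>
#include <ap_int.h>
#include <cstdio>

int main() {
    hls::stream<ap_uint<24>> rgb_in;
    hls::stream<ap_uint<8>>  gray_out;
    // R sits in bits 7:0, G in 15:8, B in 23:16, matching the kernel's unpacking
    rgb_in.write(0xFFFFFF);  // white
    rgb_in.write(0x0000FF);  // pure red
    grayscale(rgb_in, gray_out);
    grayscale(rgb_in, gray_out);
    unsigned g_white = gray_out.read().to_uint();  // expected 255
    unsigned g_red   = gray_out.read().to_uint();  // expected (255*77)>>8 = 76
    printf("white -> %u, red -> %u\n", g_white, g_red);
    return (g_white == 255 && g_red == 76) ? 0 : 1;
}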


Final Checklist for Low-Latency HLS

✅ Pipeline all loops with II=1.
✅ Use fixed-point unless floating-point is unavoidable.
✅ Partition arrays for parallel access.
✅ Leverage dataflow for multi-stage pipelines.
✅ Optimize memory access with burst/AXI-Stream.
✅ Verify cycle-accurate behavior with C/RTL co-simulation and timing closure in the implemented design.
