How SimAI Predicts Prefill/Decode Time

From GPU profiling to per-request execution time prediction — the complete pipeline


Table of Contents

  1. The Big Picture
  2. Step 1: GPU Profiling (AICB)
  3. Why Execution Time is Not Linear with Token Count
  4. Scheduling & Batching Strategies
  5. Compute Time Calculation in Detail
  6. Training Prediction Models (Vidur)
  7. Per-Request Prediction from Trace
  8. How Tokens are Aggregated in a Batch
  9. Supported GPU & LLM Models
  10. AICB vs Vidur: Two Prediction Paths

The Big Picture

SimAI's execution time prediction answers a single question: given a specific GPU, a specific LLM, and a trace of requests (each with different prefill/decode token counts), how long does each request take? The answer comes from a three-stage pipeline.

End-to-End Prediction Pipeline
STAGE 1: GPU PROFILING Run actual CUDA kernels on target GPU Measure each GEMM & attention op Multiple (m, batch, seq) combinations Output: profiling CSV files Requires: SM90+ GPU (H100/B100) one-time STAGE 2: TRAIN MODELS Fit polynomial regression on profiling data Separate models for each component: attn_prefill, attn_decode, mlp, comm Output: prediction lookup table CPU only, no GPU needed once STAGE 3: PER-REQUEST PREDICTION Load trace (each request has different tokens) Form batches, aggregate token counts Lookup table → per-component time Output: prefill_time + decode_time per request CPU only, runs for every batch in trace Stage 1 requires real GPU hardware; Stages 2 & 3 run on CPU only
Key Insight: AICB (Stage 1) generates raw profiling data. Vidur (Stages 2–3) trains ML models on that data, then uses them to predict execution time for any token count — including counts never profiled. This is how SimAI handles traces where every request has a different prefill length.

Step 1: GPU Profiling (AICB)

AICB profiles actual CUDA kernel execution times on a target GPU. For each model, it runs every operation (GEMM, attention, MLP) with specific tensor dimensions, measures the GPU time in microseconds, and writes the results to files.

What is GEMM?

GEMM (General Matrix Multiply) is the fundamental operation in LLM inference. Every Linear layer in a Transformer is a matrix multiplication:

// Every Linear layer = one GEMM
output = input × weight
//  (m, k)    (k, n)  →  (m, n)
//   ↑         ↑          
//   tokens   fixed by model architecture

// Example: LLaMA-8B MLP up projection
//   m = num_tokens (variable!)
//   k = 4096  (hidden_size, fixed)
//   n = 14336 (ffn_size, fixed)

The critical variable is m — the number of tokens being processed. In prefill, m = seq_length (e.g. 4096); in decode, m = batch_size (e.g. 32). Everything else is fixed by the model architecture.
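To make the prefill/decode distinction concrete, here is a minimal sketch (an illustrative helper, not AICB code) that builds the GEMM shape for the LLaMA-8B up projection and shows how only m changes between the two phases:

```python
def up_proj_gemm(num_tokens, hidden=4096, ffn=14336):
    """GEMM shape and FLOP count for the LLaMA-8B MLP up projection."""
    m, k, n = num_tokens, hidden, ffn
    flops = 2 * m * k * n          # one multiply-add per (m, k, n) triple
    return (m, k), (k, n), flops

# Prefill: m = sequence length
_, _, prefill_flops = up_proj_gemm(4096)
# Decode: m = batch size (one new token per request)
_, _, decode_flops = up_proj_gemm(32)

print(prefill_flops // decode_flops)   # prefill does 128x the FLOPs here
```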

What AICB Profiles per Layer

Operation        GEMM Dimensions                          What It Does
attn_pre_proj    (m, hidden) × (hidden, qkv_dim)          Projects input into Q, K, V
attn_flash       FlashAttention kernel                    Computes softmax(QKᵀ)V
attn_post_proj   (m, head_dim×heads/tp) × (..., hidden)   Output projection back to hidden dim
mlp_up_proj      (m, hidden) × (hidden, ffn×2)            MLP gate + up projection
mlp_down_proj    (m, ffn) × (ffn, hidden)                 MLP down projection

How Timing Works

aicb/utils/utils.py
def cuda_timing_decorator(func):
    def wrapper(*args, **kwargs):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event   = torch.cuda.Event(enable_timing=True)

        start_event.record()
        result = func(*args, **kwargs)     # run actual CUDA kernel
        end_event.record()
        torch.cuda.synchronize()

        elapsed = start_event.elapsed_time(end_event)  # GPU-side time (ms)
        return result, elapsed
    return wrapper

AICB uses torch.cuda.Event for GPU-side timing (not Python's time.time()). This measures actual kernel execution time, excluding CPU launch overhead. Each operation is profiled multiple times (default 10) and averaged.

Important: A single AICB run profiles at one specific (m, batch_size, seq_length) combination. To build a prediction model that generalizes to arbitrary token counts, you need to run AICB at multiple different values of m and collect the results into profiling CSV files.

Why Execution Time is Not Linear with Token Count

A GEMM of (m, k) × (k, n) requires 2 × m × k × n FLOPs — linear in m. But actual GPU execution time is not linear in m, because a GPU has two bottlenecks:

// actual execution time = max(compute_time, memory_time)

// When m is small (decode, m=1~32):
//   FLOPs are few, but weight matrix still must be loaded from HBM
//   → memory-bound: time ≈ weight_size / HBM_bandwidth
//   → nearly independent of m!

// When m is large (prefill, m=4096):
//   FLOPs dominate over memory load time
//   → compute-bound: time ≈ FLOPs / GPU_peak_FLOPS
//   → scales linearly with m

Concrete Example

Consider the MLP up-projection GEMM (m, 4096) × (4096, 14336) on an H100:

m      FLOPs   Weight Load (fixed)   Compute Time   Memory Time   Bottleneck
1      117M    112 MB                ~0.1 µs        ~34 µs        Memory
32     3.7G    112 MB                ~4 µs          ~34 µs        Memory
128    15G     112 MB                ~15 µs         ~34 µs        Memory
512    60G     112 MB                ~60 µs         ~34 µs        Transition
4096   480G    112 MB                ~480 µs        ~34 µs        Compute
Why This Matters: From m=1 to m=128, FLOPs increase 128×, but execution time barely changes (both ~34µs). If you only profiled m=4096 and linearly extrapolated m=32, you'd predict 3.75µs — off by 9× from the actual ~34µs. This is why profiling at a single m and linearly scaling is wrong, and why SimAI needs ML models to capture this non-linear relationship.
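The table's numbers can be reproduced with a simple roofline sketch. The peak figures are assumptions (roughly an H100: ~1 PFLOPS dense BF16, ~3.35 TB/s HBM3); SimAI itself measures rather than models this:

```python
def gemm_time_us(m, k=4096, n=14336, peak_flops=1e15, hbm_bw=3.35e12, elem=2):
    """Roofline estimate in microseconds: time = max(compute, memory)."""
    compute_us = 2 * m * k * n / peak_flops * 1e6
    memory_us = (m * k + k * n + m * n) * elem / hbm_bw * 1e6
    return max(compute_us, memory_us)

for m in (1, 32, 128, 512, 4096):
    print(m, round(gemm_time_us(m), 1))
```

The output stays nearly flat (~35 µs) from m=1 to m=128, then grows linearly once compute dominates — the same regime change the table shows.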

Discussion: What if We Separate Compute Time and Memory Time?

The non-linearity above comes from max(compute_time, memory_time). A natural question is: if we model compute_time and memory_time separately, does each one scale linearly with m?

Component      Formula                                     Linearity in m
Compute Time   2 × m × k × n / peak_FLOPS                  Strictly linear (passes through origin)
Memory Time    (m×k + k×n + m×n) × elem_size / bandwidth   Affine: a×m + b, where b ∝ k×n (weight matrix, constant)
Actual Time    max(compute, memory)                        Non-linear (piecewise, regime-dependent)

So compute_time is strictly linear in m. But memory_time is affine (a×m + b) — it has a constant offset b = k × n × elem_size / bandwidth from loading the weight matrix, which does not depend on m. For small m, this constant dominates, making memory_time appear nearly flat. Separating them makes each component approximately predictable, but memory_time is not a pure linear-through-origin function.
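A quick numeric check of the affine claim, using the shapes from the running (m, 4096) × (4096, 14336) example:

```python
def memory_bytes(m, k=4096, n=14336, elem=2):
    # activation traffic scales with m; the k×n weight term does not
    return (m * k + k * n + m * n) * elem

b = memory_bytes(0)              # constant offset: weight matrix only
slope = memory_bytes(1) - b      # per-token traffic: (k + n) * elem

# memory_bytes(m) == b + slope * m exactly (affine);
# the constant offset dominates until m ≈ b / slope ≈ 3186 tokens
```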

How SimAI's AICB Bypasses This Problem Entirely

SimAI does not build a roofline model or separate compute vs. memory time. Instead, AICB profiles each kernel at the actual micro-batch size on real GPU hardware using CUDA events, so the measured wall-clock time already implicitly captures max(compute, memory).

# AiobMegatron.py — profiles each operation on real GPU
# Records wall-clock time per kernel (already includes max(compute, memory))

for _ in range(epoch_num):
    Emb_output, Emb_time = self.Embedding(input)
    self.time_list["Emb"].append({"time_gpu": Emb_time})

    for _ in range(num_layers):
        # Each op timed individually via CUDA events
        lay_out, layernorm = self.Layernorm(Emb_output)
        atten_output, atten_qkv, ... = self.Attention(lay_out)
        mlp_out, mlp_linear_1, mlp_gelu, mlp_linear_2 = self.Mlp(lay2_out)
        # All times stored in self.time_list

These measured times are then averaged via extract_averages() to build a compute_cache — a lookup table mapping each operation to its profiled time in microseconds:

# utils.py — extract_averages()

compute_cache = {
    "attention_forward": 1250,   # measured avg (µs)
    "attention_backward": 1250,
    "mlp_forward": 890,
    "mlp_backward": 890,
    "grad_forward": 15,
    "grad_backward": 42,
    ...
}

The workload generator then replays these fixed profiled times for each microbatch × each layer, with no scaling applied:

# SimAI_training_workload_generator.py

for _ in range(ga_num):           # iterate over microbatches
    for layer in layers:          # iterate over model layers
        forward_compute_time = _get_aiob_compute_time(
            compute_cache, "forward", layer_name
        )                         # table lookup, no math
Key Insight: AICB sidesteps the roofline modeling question entirely. By profiling at the target micro-batch size, the measured time already reflects whichever regime (memory-bound or compute-bound) the kernel actually operates in. There is no compute/memory decomposition and no linear extrapolation — just measured reality replayed in the simulator. The calculate_stats() function tracks variance and P99 of these measurements for stability analysis, but not a compute vs. memory breakdown.
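The replay step therefore reduces to a dictionary lookup. A minimal sketch (cache keys and values are illustrative, mirroring the structure above):

```python
compute_cache = {
    ("forward", "attention"): 1250,   # µs — illustrative profiled averages
    ("forward", "mlp"): 890,
}

def forward_block_time(cache):
    # Pure table lookup per layer — no scaling with token count
    return sum(cache[("forward", op)] for op in ("attention", "mlp"))

total_us = forward_block_time(compute_cache) * 32   # e.g. 32 layers
```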

Scheduling & Batching Strategies

Before predicting execution time, SimAI must decide which requests go into each batch. This is handled by the replica scheduler, which implements one of three strategies. Each strategy has different constraints on how prefill and decode requests are mixed.

Common Constraints

vidur/config/config.py
class BaseReplicaSchedulerConfig:
    batch_size_cap: int = 128             # max requests per batch
    block_size: int = 16                  # KV cache block size (tokens)
    watermark_blocks_fraction: float = 0.01  # reserved memory fraction

Memory is the hard constraint: each request occupies KV cache blocks proportional to its context length. The scheduler must ensure total allocated blocks do not exceed GPU memory (minus a watermark reserve).

Strategy A: vLLM — Token-Budget Continuous Batching

vidur/scheduler/replica_scheduler/vllm_replica_scheduler.py

The vLLM scheduler packs requests until a token budget is exhausted. Prefill and decode requests can coexist in the same batch.

# Core constraint:
num_batch_tokens = len(all_tokens_list) * max(all_tokens_list)
# Must satisfy: num_batch_tokens ≤ max_tokens_in_batch (default 4096)

# Per-request token count:
if request.is_prefill_complete:
    next_tokens = 1               # decode: 1 token per iteration
else:
    next_tokens = request.num_prefill_tokens  # prefill: all tokens at once
The num_requests × max_seq_len formulation means a single long prefill request can consume the entire token budget, blocking shorter requests. This is the classic vLLM V1 scheduling behavior.
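That budget check can be sketched as a simple predicate (a hypothetical helper, following the num_requests × max formulation above):

```python
def fits_vllm_budget(current_tokens, new_tokens, max_tokens_in_batch=4096):
    """Would adding a request processing `new_tokens` keep the batch under budget?"""
    tokens = current_tokens + [new_tokens]
    return len(tokens) * max(tokens) <= max_tokens_in_batch

# A full 4096-token prefill alone exhausts the budget,
# so not even a 1-token decode request can join that batch.
```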

Strategy B: Sarathi — Chunk-Based Scheduling

vidur/scheduler/replica_scheduler/sarathi_replica_scheduler.py

Sarathi splits long prefills into fixed-size chunks, preventing a single prefill from monopolizing the GPU. Decode requests are always processed (1 token each), and remaining budget goes to prefill chunks.

# Prefill tokens capped by remaining chunk budget:
next_tokens = min(
    request.num_prefill_tokens - request.num_processed_tokens,
    chunk_size - num_batch_tokens,      # remaining budget in this chunk
)
# chunk_size default: 512 tokens
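Applied iteratively, the min() rule walks a long prefill through the GPU in fixed-size pieces. A sketch (assuming, for simplicity, the request gets the whole chunk budget each iteration):

```python
def prefill_schedule(num_prefill_tokens, chunk_size=512):
    """Token counts scheduled per iteration for one prefill request."""
    processed, per_iter = 0, []
    while processed < num_prefill_tokens:
        nxt = min(num_prefill_tokens - processed, chunk_size)
        per_iter.append(nxt)
        processed += nxt
    return per_iter

print(prefill_schedule(1300))   # [512, 512, 276]
```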

Strategy C: LightLLM — Separate Prefill/Decode Batches

vidur/scheduler/replica_scheduler/lightllm_replica_scheduler.py

LightLLM never mixes prefill and decode in the same batch. It alternates between prefill-only and decode-only batches, with a fairness mechanism to prevent decode starvation.

def _get_next_batch(self):
    if not self._preempted_requests:        # no ongoing decode
        return self._get_prefill_batch()

    if self._num_waiting_iters >= max_waiting_iters:
        return self._get_prefill_batch()     # fairness: don't starve prefill

    return self._get_decode_batch()          # default: continue decode

Comparison

Strategy   Prefill/Decode Mixing   Prefill Chunking     Budget Constraint
vLLM       Mixed                   No (full prefill)    n_reqs × max_seq ≤ 4096
Sarathi    Mixed                   Yes (fixed chunks)   total_tokens ≤ chunk_size
LightLLM   Separate batches        No                   max_tokens ≤ 4096

Compute Time Calculation in Detail

Once a batch is formed, SimAI computes its execution time by summing per-component predictions. Every component time comes from a pre-trained polynomial model looked up by the batch's aggregated token count. Here is the complete formula.

Per-Layer Block Time

vidur/entities/execution_time.py

Each transformer layer's execution time is the sum of an attention block and an MLP block:

// Per-layer block time:
block_time = attention_time + mlp_time + residual_add

// Attention block (8 sub-components):
attention_time = pre_proj          // QKV projection        ← f(total_tokens)
               + post_proj         // output projection     ← f(total_tokens)
               + rope              // rotary embedding      ← f(total_tokens)
               + kv_cache_save     // write K,V to cache    ← f(total_tokens)
               + prefill_attn      // prefill attention     ← f(kv_cache, chunk²)
               + decode_attn       // decode attention      ← f(batch_size, kv_cache)
               + tp_allreduce      // tensor parallel comm  ← f(total_tokens)
               + attn_norm         // layer norm            ← f(total_tokens)

// MLP block (5 sub-components):
mlp_time = up_proj                 // gate + up projection  ← f(total_tokens)
         + down_proj               // down projection       ← f(total_tokens)
         + activation              // SiLU / GeLU           ← f(total_tokens)
         + tp_allreduce            // tensor parallel comm  ← f(total_tokens)
         + mlp_norm                // layer norm            ← f(total_tokens)

Total Model Time

// Standard (dense) models:
model_time = block_time × num_layers_per_pipeline_stage
           + pipeline_parallel_communication

// MoE models (DeepSeek, Qwen3-MoE): per-layer calculation
// because dense layers and MoE layers have different times
model_time = ∑ block_time(layer_id) for layer_id in [start..end]
           + pipeline_parallel_communication

// Final output:
total_time = model_time + cpu_overhead

How Each Component Gets Its Input

The key insight is that different components aggregate the batch's tokens differently:

Component               Input Features               How Aggregated from Batch
MLP / Projection / Norm (total_tokens,)              Sum all tokens, round to multiple of 8
Prefill Attention       (kv_cache, chunk_size²)      kv = ∑ per-request context; chunk = L2 norm, then squared
Decode Attention        (batch_size, avg_kv)         Count decode requests; average their kv_cache sizes
TP AllReduce            (total_tokens,)              Same as MLP (message size ∝ tokens)

Concrete Example: Decode Attention Lookup

vidur/execution_time_predictor/sklearn_execution_time_predictor.py
def _get_attention_decode_execution_time(self, batch):
    decode_batch_size, avg_kv_cache = self._get_batch_decode_attention_params(batch)
    if decode_batch_size == 0:
        return 0

    base_time = self._predictions["attn_decode"][(decode_batch_size, avg_kv_cache)]
    # Add batching overhead when batch_size > 1
    return base_time * (1 + overhead_fraction * int(decode_batch_size > 1))

Concrete Example: Prefill Attention Aggregation

def _get_attention_prefill_execution_time(self, batch):
    # Collect (kv_cache_size, prefill_chunk_size) per prefill request
    prefill_params = self._get_batch_prefill_attention_params(batch)
    if len(prefill_params) == 0:
        return 0

    kv_sizes, chunk_sizes = zip(*prefill_params)
    agg_kv    = sum(kv_sizes)
    agg_chunk = sum([x**2 for x in chunk_sizes]) ** 0.5   # L2 norm

    return self._predictions["attn_prefill"][
        (agg_kv, round(agg_chunk)**2)                     # squared back
    ]
Why L2 norm for prefill chunks? Prefill attention is roughly O(chunk_size²) per request (self-attention). When batching N requests, the total work is ∑ chunk_i², not (∑ chunk_i)². The L2 norm √(∑ chunk_i²) preserves this property: squaring it back gives the correct total FLOPs. A simple sum would overestimate by including cross-request attention that doesn't happen.
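A quick numeric check of this property, with chunk sizes 512 and 2048 as an example:

```python
chunks = [512, 2048]

true_work = sum(c * c for c in chunks)                   # ∑ chunk_i²
l2_key = round(sum(c * c for c in chunks) ** 0.5) ** 2   # Vidur-style key
naive_key = sum(chunks) ** 2                             # (∑ chunk_i)²

# The L2 key recovers the true total work to within rounding error,
# while the naive sum overestimates it by ~47% in this case.
```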

Training Prediction Models (Vidur)

Vidur takes the profiling CSV files from Stage 1 and trains separate sklearn models for each type of operation. The key technique is polynomial regression — fitting a curve that captures the memory-bound → compute-bound transition.

Models Trained

Model Name      Input Features                         Predicts
attn_prefill    (kv_cache_size, prefill_chunk_size²)   Prefill attention time
attn_decode     (batch_size, kv_cache_size)            Decode attention time
attn_pre_proj   (num_tokens)                           QKV projection time
mlp_up_proj     (num_tokens)                           MLP up+gate time
mlp_down_proj   (num_tokens)                           MLP down time
all_reduce      (num_tokens)                           TP communication time

Training Pipeline

vidur/execution_time_predictor/sklearn_execution_time_predictor.py
# 1. Load profiling CSV (multiple m values from AICB runs)
# 2. Fit polynomial regression
estimator = make_pipeline(
    PolynomialFeatures(degree=n),    # captures non-linear curve
    LinearRegression(fit_intercept=True)
)
# time = a₀ + a₁·tokens + a₂·tokens² + ... + aₙ·tokensⁿ

# 3. Pre-compute predictions for all possible token counts
num_token_range = np.arange(1, max_tokens + 1)
predictions[model_name] = model.predict(num_token_range)
# Result: { (1,): 5.2, (2,): 5.3, ..., (4096,): 480.1, ... }
Key Insight: After training, all predictions are pre-computed into a lookup dictionary. During simulation, predicting execution time for any token count is a simple O(1) dictionary lookup — no ML inference at runtime.
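The train-then-precompute pattern can be sketched with numpy alone — synthetic roofline-shaped data stands in for real profiling CSVs, and numpy.polyfit stands in for the sklearn pipeline shown above:

```python
import numpy as np

# Synthetic profiling points (µs): flat while memory-bound, then linear
m_vals = np.array([1, 32, 128, 512, 1024, 2048, 4096], dtype=float)
times = np.maximum(34.0, m_vals * 480.0 / 4096)

coeffs = np.polyfit(m_vals, times, deg=3)   # polynomial regression
predict = np.poly1d(coeffs)

# Pre-compute every token count once; simulation then does O(1) lookups
lookup = {m: float(predict(m)) for m in range(1, 4097)}
```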

Why Separate Models for Prefill vs Decode Attention?

Attention is the only component that behaves fundamentally differently between prefill and decode. MLP and projection layers only depend on num_tokens, but attention depends on the interaction pattern:

Prefill Attention

Processes all input tokens at once. Time depends on kv_cache_size (context so far) and prefill_chunk_size² (quadratic in chunk size because every token attends to every other token).

features = ["kv_cache_size",
            "prefill_chunk_size_squared"]

Decode Attention

Generates one token per request. Time depends on batch_size (how many requests) and kv_cache_size (how much context each request has accumulated).

features = ["batch_size",
            "kv_cache_size"]

Per-Request Prediction from Trace

A trace is a CSV file where each row is a request with a specific num_prefill_tokens and num_decode_tokens. Vidur processes the trace by forming batches and predicting execution time per batch.

Prediction Flow for One Batch

vidur/execution_time_predictor/base_execution_time_predictor.py
def get_execution_time(self, batch, pipeline_stage) -> ExecutionTime:
    # For each batch, compute every time component:
    return ExecutionTime(
        attention_prefill_execution_time  = self._get_attention_prefill_execution_time(batch),
        attention_decode_execution_time   = self._get_attention_decode_execution_time(batch),
        attention_layer_pre_proj_time     = self._get_attention_layer_pre_proj_execution_time(batch),
        attention_layer_post_proj_time    = self._get_attention_layer_post_proj_execution_time(batch),
        mlp_layer_up_proj_time            = self._get_mlp_layer_up_proj_execution_time(batch),
        mlp_layer_down_proj_time          = self._get_mlp_layer_down_proj_execution_time(batch),
        tensor_parallel_communication     = self._get_tensor_parallel_communication_time(batch),
        pipeline_parallel_communication   = self._get_pipeline_parallel_communication_time(batch),
        # ... + RoPE, KV cache save, norms, CPU overhead
    )

How model_time is Computed

vidur/entities/execution_time.py
@property
def model_time(self):
    # Per layer:
    attention_time = (pre_proj + post_proj + rope + kv_cache_save
                     + prefill_time + decode_time
                     + tp_communication + norm)
    mlp_time = (up_proj + down_proj + act
               + tp_communication + norm)
    block_time = attention_time + mlp_time + add

    # Total model time:
    return block_time * num_layers_per_pipeline_stage + cpu_overhead
Key Insight: Every component time is computed via dictionary lookup based on the batch's aggregated token count. There is no iterative simulation within a single forward pass — each lookup is O(1). The total model time multiplies per-layer time by the number of layers.
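With illustrative component times (made-up numbers, not profiled values), the summation looks like:

```python
# Hypothetical per-layer component times in µs
t = dict(pre_proj=40, post_proj=35, rope=5, kv_cache_save=8,
         prefill=120, decode=30, tp_comm=25, attn_norm=4,
         up_proj=90, down_proj=85, act=6, mlp_norm=4, add=2)

attention_time = (t["pre_proj"] + t["post_proj"] + t["rope"] + t["kv_cache_save"]
                  + t["prefill"] + t["decode"] + t["tp_comm"] + t["attn_norm"])
mlp_time = t["up_proj"] + t["down_proj"] + t["act"] + t["tp_comm"] + t["mlp_norm"]
block_time = attention_time + mlp_time + t["add"]

model_time = block_time * 32 + 100   # 32 layers per stage + cpu_overhead (assumed)
```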

How Tokens are Aggregated in a Batch

A batch can contain multiple requests with different token counts. The prediction models expect a single set of input features, so Vidur must aggregate tokens from all requests in the batch. The aggregation method differs by component:

Component           Aggregation Method                        Lookup Key
MLP / Linear layers sum(all tokens), round to multiple of 8   (total_tokens_rounded,)
Prefill Attention   kv_cache = sum; chunk = L2 norm           (∑kv, round(√∑chunk²)²)
Decode Attention    count decode reqs; avg kv cache           (batch_size, avg_kv_rounded)
TP Communication    sum(all tokens), round to multiple of 8   (total_tokens_rounded,)

Worked Example

Suppose a batch contains 3 requests:

Request   Phase     Tokens to Process   KV Cache
A         Prefill   512                 0
B         Prefill   2048                0
C         Decode    1                   1000
# MLP lookup:
total_tokens = 512 + 2048 + 1 = 2561
total_tokens_rounded = (2561 + 7) // 8 * 8 = 2568
mlp_time = predictions["mlp_up_proj"][(2568,)]

# Prefill attention lookup (requests A & B):
agg_kv     = 0 + 0 = 0
agg_chunk  = sqrt(512² + 2048²) = sqrt(4456448) ≈ 2111
prefill_time = predictions["attn_prefill"][(0, 2111²)]

# Decode attention lookup (request C only):
decode_batch_size = 1
avg_kv = 1000  # rounded to granularity
decode_time = predictions["attn_decode"][(1, 1000)]
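The three lookup keys above can be reproduced end to end (the request list and variable names are illustrative):

```python
import math

requests = [
    {"phase": "prefill", "tokens": 512,  "kv": 0},     # A
    {"phase": "prefill", "tokens": 2048, "kv": 0},     # B
    {"phase": "decode",  "tokens": 1,    "kv": 1000},  # C
]

# MLP / linear key: sum all tokens, round up to a multiple of 8
total_tokens = sum(r["tokens"] for r in requests)
mlp_key = (total_tokens + 7) // 8 * 8

# Prefill attention key: summed KV cache, L2-normed chunk sizes
prefills = [r for r in requests if r["phase"] == "prefill"]
agg_kv = sum(r["kv"] for r in prefills)
agg_chunk = round(math.sqrt(sum(r["tokens"] ** 2 for r in prefills)))

# Decode attention key: decode request count, averaged KV cache
decodes = [r for r in requests if r["phase"] == "decode"]
decode_key = (len(decodes), sum(r["kv"] for r in decodes) // len(decodes))
```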

Supported GPU & LLM Models

AICB Inference Profiling (Kernel-Level)

AICB requires a hand-written "Mocked Model" + "AIOB Profiler" for each LLM. Currently only 3 models are implemented:

Model                Type        Special Kernels          GPU Requirement
DeepSeek-V3 (671B)   MoE + MLA   DeepGEMM FP8, FlashMLA   SM90+ (H100)
Qwen3-MoE (235B)     MoE         vLLM-based               SM90+
Qwen3-Next (80B)     MoE + GDN   GDN (experimental)       SM90+
No LLaMA: LLaMA is not in AICB's inference profiling, despite being the simplest architecture (standard MHA + FFN, no special kernels needed). This is because SimAI is built by Alibaba — they prioritize their own models (Qwen) and the most popular domestic model (DeepSeek).

Vidur Scheduling Simulation (Model Configs)

Vidur has model configs for many more models. These can be used with the sklearn prediction path (no AICB needed, but requires running Vidur's own profiling scripts on a GPU):

LLaMA-2: 7B, 70B
LLaMA-3: 8B, 70B
CodeLLaMA: 34B
Qwen: 72B
InternLM: 20B, 2-20B
Phi-2: (Microsoft)
DeepSeek-V3: 671B

AICB vs Vidur: Two Prediction Paths

SimAI offers two distinct paths for execution time prediction. Understanding when to use each is critical:

                                AICB + astra-sim                            Vidur sklearn
Question answered               Is this hardware config viable?             How does this scheduling strategy perform?
Prefill length                  Fixed per run                               Varies per request (from trace)
Prediction method               Direct profiling at exact (m, batch, seq)   Polynomial model trained on multi-point profiling
Handles varying token counts?   No — one run per m value                    Yes — generalizes via model
Network simulation              Full NS-3 / SimCCL simulation               Analytical formula or SimAI integration
Accuracy                        Exact at profiled point                     Approximate (polynomial fit)
Supported models                3 (DeepSeek, Qwen3-MoE, Qwen3-Next)         10+ (LLaMA, Qwen, InternLM, Phi, ...)
GPU needed                      Yes (SM90+, per run)                        Yes for profiling, then CPU only
Summary: To predict prefill/decode times from a trace with varying request lengths, the Vidur sklearn path is the right choice. AICB provides the raw profiling data that Vidur trains on, but AICB alone cannot handle varying token counts without re-profiling each one. The two components work together: AICB generates high-fidelity data points, Vidur interpolates between them.