From GPU profiling to per-request execution time prediction — the complete pipeline
AICB · Vidur · Sklearn

SimAI's execution time prediction answers a single question: given a specific GPU, a specific LLM, and a trace of requests (each with different prefill/decode token counts), how long does each request take? The answer comes from a three-stage pipeline.
AICB profiles actual CUDA kernel execution times on a target GPU. For each model, it runs every operation (GEMM, attention, MLP) with specific tensor dimensions, measures the GPU time in microseconds, and writes the results to files.
GEMM (General Matrix Multiply) is the fundamental operation in LLM inference. Every Linear layer in a Transformer is a matrix multiplication:
```
// Every Linear layer = one GEMM
//   output = input × weight
//            (m, k)  (k, n) → (m, n)
//             ↑       ↑
//           tokens   fixed by model architecture
//
// Example: LLaMA-8B MLP up projection
//   m = num_tokens (variable!)
//   k = 4096   (hidden_size, fixed)
//   n = 14336  (ffn_size, fixed)
```
The critical variable is m — the number of tokens being processed. In prefill, m = seq_length (e.g. 4096); in decode, m = batch_size (e.g. 32). Everything else is fixed by the model architecture.
| Operation | GEMM Dimensions | What It Does |
|---|---|---|
| attn_pre_proj | (m, hidden) × (hidden, qkv_dim) | Projects input into Q, K, V |
| attn_flash | FlashAttention kernel | Computes softmax(QKᵀ)V |
| attn_post_proj | (m, head_dim×heads/tp) × (..., hidden) | Output projection back to hidden dim |
| mlp_up_proj | (m, hidden) × (hidden, ffn×2) | MLP gate + up projection |
| mlp_down_proj | (m, ffn) × (ffn, hidden) | MLP down projection |
```python
def cuda_timing_decorator(func):
    def wrapper(*args, **kwargs):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        result = func(*args, **kwargs)  # run actual CUDA kernel
        end_event.record()
        torch.cuda.synchronize()
        elapsed = start_event.elapsed_time(end_event)  # GPU-side time, in ms
        return result, elapsed
    return wrapper
```
AICB uses torch.cuda.Event for GPU-side timing (not Python's time.time()). This measures actual kernel execution time, excluding CPU launch overhead. Each operation is profiled multiple times (default 10) and averaged.
A single AICB run, however, profiles only one (m, batch_size, seq_length) combination. To build a prediction model that generalizes to arbitrary token counts, you need to run AICB at multiple different values of m and collect the results into profiling CSV files.
A GEMM of (m, k) × (k, n) requires 2 × m × k × n FLOPs — linear in m. But actual GPU execution time is not linear in m, because a GPU has two bottlenecks:
```
// actual execution time = max(compute_time, memory_time)
//
// When m is small (decode, m = 1~32):
//   FLOPs are few, but the weight matrix still must be loaded from HBM
//   → memory-bound: time ≈ weight_size / HBM_bandwidth
//   → nearly independent of m!
//
// When m is large (prefill, m = 4096):
//   FLOPs dominate over memory load time
//   → compute-bound: time ≈ FLOPs / GPU_peak_FLOPS
//   → scales linearly with m
```
Consider the MLP up-projection GEMM (m, 4096) × (4096, 14336) on an H100:
| m | FLOPs | Weight Load (fixed) | Compute Time | Memory Time | Bottleneck |
|---|---|---|---|---|---|
| 1 | 117M | 112 MB | ~0.1 µs | ~34 µs | Memory |
| 32 | 3.7G | 112 MB | ~4 µs | ~34 µs | Memory |
| 128 | 15G | 112 MB | ~15 µs | ~34 µs | Memory |
| 512 | 60G | 112 MB | ~60 µs | ~34 µs | Transition |
| 4096 | 480G | 112 MB | ~480 µs | ~34 µs | Compute |
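The regimes in the table can be reproduced with a small roofline sketch. The peak-FLOPS and bandwidth constants below are assumed illustrative H100 numbers, not values SimAI uses — SimAI measures real kernels instead of modeling this.

```python
# Roofline sketch: time = max(compute_time, memory_time) for a (m, k) x (k, n) GEMM.
# PEAK_FLOPS and HBM_BW are illustrative H100-class assumptions.
PEAK_FLOPS = 990e12   # ~BF16 peak, FLOP/s (assumption)
HBM_BW     = 3.35e12  # bytes/s (assumption)
ELEM_SIZE  = 2        # bf16 bytes per element

def gemm_time_us(m, k=4096, n=14336):
    compute = 2 * m * k * n / PEAK_FLOPS              # FLOPs / peak throughput
    memory  = (m*k + k*n + m*n) * ELEM_SIZE / HBM_BW  # bytes moved / bandwidth
    return max(compute, memory) * 1e6, compute < memory  # (time in µs, memory-bound?)

t1, mem_bound_1 = gemm_time_us(1)          # decode-like: ~35 µs, memory-bound
t4096, mem_bound_4096 = gemm_time_us(4096) # prefill-like: ~487 µs, compute-bound
```

Running this reproduces the table's shape: time is nearly flat until m reaches a few hundred, then grows linearly.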
This is why profiling at a single m and scaling linearly is wrong, and why SimAI needs ML models to capture this non-linear relationship.
The non-linearity above comes from max(compute_time, memory_time). A natural question is: if we model compute_time and memory_time separately, does each one scale linearly with m?
| Component | Formula | Linearity in m |
|---|---|---|
| Compute Time | 2 × m × k × n / peak_FLOPS | Strictly linear (passes through origin) |
| Memory Time | (m×k + k×n + m×n) × elem_size / bandwidth | Affine: a×m + b, where b comes from the k×n weight matrix (constant) |
| Actual Time | max(compute, memory) | Non-linear (piecewise, regime-dependent) |
So compute_time is strictly linear in m. But memory_time is affine (a×m + b) — it has a constant offset b = k × n × elem_size / bandwidth from loading the weight matrix, which does not depend on m. For small m, this constant dominates, making memory_time appear nearly flat. Separating them makes each component approximately predictable, but memory_time is not a pure linear-through-origin function.
SimAI does not build a roofline model or separate compute vs. memory time. Instead, AICB profiles each kernel at the actual micro-batch size on real GPU hardware using CUDA events, so the measured wall-clock time already implicitly captures max(compute, memory).
```python
# AiobMegatron.py — profiles each operation on a real GPU.
# Records wall-clock time per kernel (already includes max(compute, memory)).
for _ in range(epoch_num):
    Emb_output, Emb_time = self.Embedding(input)
    self.time_list["Emb"].append({"time_gpu": Emb_time})
    for _ in range(num_layers):
        # Each op timed individually via CUDA events
        lay_out, layernorm = self.Layernorm(Emb_output)
        atten_output, atten_qkv, ... = self.Attention(lay_out)
        mlp_out, mlp_linear_1, mlp_gelu, mlp_linear_2 = self.Mlp(lay2_out)
        # All times stored in self.time_list
```
These measured times are then averaged via extract_averages() to build a compute_cache — a lookup table mapping each operation to its profiled time in microseconds:
```python
# utils.py — extract_averages()
compute_cache = {
    "attention_forward":  1250,  # measured avg (µs)
    "attention_backward": 1250,
    "mlp_forward":  890,
    "mlp_backward": 890,
    "grad_forward": 15,
    "grad_backward": 42,
    ...
}
```
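The averaging step itself is simple. Here is a minimal stand-in for extract_averages() — the function name comes from the source, but this body is an illustrative sketch, not the actual implementation:

```python
# Sketch: collapse repeated per-kernel timings into a compute_cache of averages.
# time_list mirrors the structure AiobMegatron builds; values are in µs.
def extract_averages_sketch(time_list):
    return {
        op: sum(sample["time_gpu"] for sample in samples) / len(samples)
        for op, samples in time_list.items()
    }

time_list = {
    "attention_forward": [{"time_gpu": 1240}, {"time_gpu": 1260}],
    "mlp_forward":       [{"time_gpu": 880},  {"time_gpu": 900}],
}
compute_cache = extract_averages_sketch(time_list)
# compute_cache["attention_forward"] == 1250.0
```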
The workload generator then replays these fixed profiled times for each microbatch × each layer, with no scaling applied:
```python
# SimAI_training_workload_generator.py
for _ in range(ga_num):      # iterate over microbatches
    for layer in layers:     # iterate over model layers
        forward_compute_time = _get_aiob_compute_time(
            compute_cache, "forward", layer_name
        )  # table lookup, no math
```
A calculate_stats() function tracks the variance and P99 of these measurements for stability analysis, but there is no compute vs. memory breakdown.
Before predicting execution time, SimAI must decide which requests go into each batch. This is handled by the replica scheduler, which implements one of three strategies. Each strategy has different constraints on how prefill and decode requests are mixed.
```python
class BaseReplicaSchedulerConfig:
    batch_size_cap: int = 128                 # max requests per batch
    block_size: int = 16                      # KV cache block size (tokens)
    watermark_blocks_fraction: float = 0.01   # reserved memory fraction
```
Memory is the hard constraint: each request occupies KV cache blocks proportional to its context length. The scheduler must ensure total allocated blocks do not exceed GPU memory (minus a watermark reserve).
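A sketch of that block accounting, under assumed names (blocks_needed, can_allocate, and the capacity constants are illustrative, not the scheduler's actual API):

```python
import math

BLOCK_SIZE = 16        # tokens per KV cache block (from the config above)
TOTAL_BLOCKS = 1000    # illustrative GPU capacity in blocks (assumption)
WATERMARK = 0.01       # reserved fraction, per watermark_blocks_fraction

def blocks_needed(context_len):
    # Each request occupies ceil(context / block_size) KV cache blocks
    return math.ceil(context_len / BLOCK_SIZE)

def can_allocate(allocated_blocks, context_len):
    # Admit a request only if it fits under the watermark reserve
    reserve = math.ceil(TOTAL_BLOCKS * WATERMARK)
    return allocated_blocks + blocks_needed(context_len) <= TOTAL_BLOCKS - reserve

# blocks_needed(100) == 7; a nearly full replica rejects new requests
```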
The vLLM scheduler packs requests until a token budget is exhausted. Prefill and decode requests can coexist in the same batch.
```python
# Core constraint:
#   num_batch_tokens = len(all_tokens_list) × max(all_tokens_list)
#   Must satisfy: num_batch_tokens ≤ max_tokens_in_batch (default 4096)

# Per-request token count:
if request.is_prefill_complete:
    next_tokens = 1                           # decode: 1 token per iteration
else:
    next_tokens = request.num_prefill_tokens  # prefill: all tokens at once
```
The num_requests × max_seq_len formulation means a single long prefill request can consume the entire token budget, blocking shorter requests. This is the classic vLLM V1 scheduling behavior.
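The blocking effect is easy to see in a toy version of the admission check (illustrative sketch, not the vLLM source):

```python
MAX_TOKENS_IN_BATCH = 4096

def fits(batch_next_tokens, candidate_next_tokens):
    # vLLM-style budget: num_requests × max(next_tokens) ≤ cap
    tokens = batch_next_tokens + [candidate_next_tokens]
    return len(tokens) * max(tokens) <= MAX_TOKENS_IN_BATCH

# Two decode requests (1 token each) easily fit together...
assert fits([1], 1)
# ...but adding a 2048-token prefill makes 3 × 2048 = 6144 > 4096:
assert not fits([1, 1], 2048)
```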
Sarathi splits long prefills into fixed-size chunks, preventing a single prefill from monopolizing the GPU. Decode requests are always processed (1 token each), and remaining budget goes to prefill chunks.
```python
# Prefill tokens capped by the remaining chunk budget:
next_tokens = min(
    request.num_prefill_tokens - request.num_processed_tokens,
    chunk_size - num_batch_tokens,  # remaining budget in this chunk
)
# chunk_size default: 512 tokens
```
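Iterating that min() shows how a long prefill drains over several scheduler rounds. A sketch, assuming the request has the whole chunk budget to itself (prefill_schedule is a hypothetical helper):

```python
CHUNK_SIZE = 512  # Sarathi default

def prefill_schedule(num_prefill_tokens):
    # Tokens of one prefill request processed per iteration, assuming
    # the full chunk budget is available to it each round.
    processed, per_iter = 0, []
    while processed < num_prefill_tokens:
        next_tokens = min(num_prefill_tokens - processed, CHUNK_SIZE)
        per_iter.append(next_tokens)
        processed += next_tokens
    return per_iter

# A 2048-token prefill becomes four 512-token chunks:
# prefill_schedule(2048) == [512, 512, 512, 512]
```

In a real batch the second min() argument shrinks as decode requests consume part of the chunk budget first.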
LightLLM never mixes prefill and decode in the same batch. It alternates between prefill-only and decode-only batches, with a fairness mechanism to prevent decode starvation.
```python
def _get_next_batch(self):
    if not self._preempted_requests:                  # no ongoing decode
        return self._get_prefill_batch()
    if self._num_waiting_iters >= max_waiting_iters:
        return self._get_prefill_batch()              # fairness: don't starve prefill
    return self._get_decode_batch()                   # default: continue decode
```
| Strategy | Prefill/Decode Mixing | Prefill Chunking | Budget Constraint |
|---|---|---|---|
| vLLM | Mixed | No (full prefill) | n_reqs × max_seq ≤ 4096 |
| Sarathi | Mixed | Yes (fixed chunks) | total_tokens ≤ chunk_size |
| LightLLM | Separate batches | No | max_tokens ≤ 4096 |
Once a batch is formed, SimAI computes its execution time by summing per-component predictions. Every component time comes from a pre-trained polynomial model looked up by the batch's aggregated token count. Here is the complete formula.
Each transformer layer's execution time is the sum of an attention block and an MLP block:
```
// Per-layer block time:
block_time = attention_time + mlp_time + residual_add

// Attention block (8 sub-components):
attention_time = pre_proj        // QKV projection       ← f(total_tokens)
               + post_proj       // output projection    ← f(total_tokens)
               + rope            // rotary embedding     ← f(total_tokens)
               + kv_cache_save   // write K,V to cache   ← f(total_tokens)
               + prefill_attn    // prefill attention    ← f(kv_cache, chunk²)
               + decode_attn     // decode attention     ← f(batch_size, kv_cache)
               + tp_allreduce    // tensor parallel comm ← f(total_tokens)
               + attn_norm       // layer norm           ← f(total_tokens)

// MLP block (5 sub-components):
mlp_time = up_proj        // gate + up projection ← f(total_tokens)
         + down_proj      // down projection      ← f(total_tokens)
         + activation     // SiLU / GeLU          ← f(total_tokens)
         + tp_allreduce   // tensor parallel comm ← f(total_tokens)
         + mlp_norm       // layer norm           ← f(total_tokens)
```
```
// Standard (dense) models:
model_time = block_time × num_layers_per_pipeline_stage
           + pipeline_parallel_communication

// MoE models (DeepSeek, Qwen3-MoE): per-layer calculation,
// because dense layers and MoE layers have different times
model_time = ∑ block_time(layer_id) for layer_id in [start..end]
           + pipeline_parallel_communication

// Final output:
total_time = model_time + cpu_overhead
```
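The dense vs. MoE aggregation above, as a runnable sketch (function names and numbers are illustrative):

```python
def model_time_dense(block_time, num_layers, pp_comm):
    # Dense model: every layer in the pipeline stage costs the same
    return block_time * num_layers + pp_comm

def model_time_moe(block_times, pp_comm):
    # MoE model: dense and expert layers cost differently, so sum per layer
    return sum(block_times) + pp_comm

# 4 identical dense layers vs. alternating dense/MoE layers (times in µs):
# model_time_dense(100, 4, 20)            == 420
# model_time_moe([100, 180, 100, 180], 20) == 580
```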
The key insight is that different components aggregate the batch's tokens differently:
| Component | Input Features | How Aggregated from Batch |
|---|---|---|
| MLP / Projection / Norm | (total_tokens,) | Sum all tokens, round to multiple of 8 |
| Prefill Attention | (kv_cache, chunk_size²) | kv = ∑ per-request context; chunk = L2 norm, then squared |
| Decode Attention | (batch_size, avg_kv) | Count decode requests; average their kv_cache sizes |
| TP AllReduce | (total_tokens,) | Same as MLP (message size ∝ tokens) |
```python
def _get_attention_decode_execution_time(self, batch):
    decode_batch_size, avg_kv_cache = self._get_batch_decode_attention_params(batch)
    if decode_batch_size == 0:
        return 0
    base_time = self._predictions["attn_decode"][(decode_batch_size, avg_kv_cache)]
    # Add batching overhead when batch_size > 1
    return base_time * (1 + overhead_fraction * int(decode_batch_size > 1))
```
```python
def _get_attention_prefill_execution_time(self, batch):
    # Collect (kv_cache_size, prefill_chunk_size) per prefill request
    prefill_params = self._get_batch_prefill_attention_params(batch)
    if len(prefill_params) == 0:
        return 0
    kv_sizes, chunk_sizes = zip(*prefill_params)
    agg_kv = sum(kv_sizes)
    agg_chunk = sum(x**2 for x in chunk_sizes) ** 0.5  # L2 norm
    return self._predictions["attn_prefill"][
        (agg_kv, round(agg_chunk) ** 2)  # squared back
    ]
```
Prefill attention is O(chunk_size²) per request (self-attention). When batching N requests, the total work is ∑ chunk_i², not (∑ chunk_i)². The L2 norm √(∑ chunk_i²) preserves this property: squaring it back gives the correct total FLOPs. A simple sum would overestimate by including cross-request attention that never happens.
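A quick numerical check of the three aggregation choices, using two illustrative chunk sizes:

```python
chunks = [512, 2048]  # prefill chunk sizes of two requests in one batch

per_request_work = sum(c**2 for c in chunks)   # true O(chunk²) total: 4,456,448
naive_sum_work = sum(chunks) ** 2              # (∑chunk)² = 6,553,600 — overestimate
l2 = sum(c**2 for c in chunks) ** 0.5          # the aggregated feature
recovered = l2 ** 2                            # squaring back recovers the true total

# The naive sum inflates work by exactly the 2·512·2048 cross-request term
```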
Vidur takes the profiling CSV files from Stage 1 and trains separate sklearn models for each type of operation. The key technique is polynomial regression — fitting a curve that captures the memory-bound → compute-bound transition.
| Model Name | Input Features | Predicts |
|---|---|---|
| attn_prefill | (kv_cache_size, prefill_chunk_size²) | Prefill attention time |
| attn_decode | (batch_size, kv_cache_size) | Decode attention time |
| attn_pre_proj | (num_tokens) | QKV projection time |
| mlp_up_proj | (num_tokens) | MLP up+gate time |
| mlp_down_proj | (num_tokens) | MLP down time |
| all_reduce | (num_tokens) | TP communication time |
```python
# 1. Load profiling CSV (multiple m values from AICB runs)

# 2. Fit polynomial regression
estimator = make_pipeline(
    PolynomialFeatures(degree=n),   # captures the non-linear curve
    LinearRegression(fit_intercept=True),
)
# time = a₀ + a₁·tokens + a₂·tokens² + ... + aₙ·tokensⁿ

# 3. Pre-compute predictions for all possible token counts
num_token_range = np.arange(1, max_tokens + 1)
predictions[model_name] = model.predict(num_token_range)
# Result: { (1,): 5.2, (2,): 5.3, ..., (4096,): 480.1, ... }
```
At simulation time, every prediction is then an O(1) dictionary lookup — no ML inference at runtime.
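The fit-once, precompute-a-table pattern can be sketched end to end with numpy.polyfit standing in for the sklearn pipeline. The synthetic roofline-shaped data and constants below are illustrative, not real profiling output:

```python
import numpy as np

# Synthetic measurements standing in for AICB profiling: flat while
# memory-bound, linear once compute-bound (constants are assumptions).
m = np.array([1, 8, 32, 128, 512, 1024, 2048, 4096], dtype=float)
t = np.maximum(0.117 * m, 35.0)  # µs

coeffs = np.polyfit(m, t, deg=3)  # degree-3 polynomial fit
poly = np.poly1d(coeffs)

# Pre-compute an O(1) lookup table for every token count:
num_token_range = np.arange(1, 4097)
predictions = {(int(n),): float(poly(n)) for n in num_token_range}
# predictions[(4096,)] sits on the linear compute-bound part of the curve
```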
Attention is the only component that behaves fundamentally differently between prefill and decode. MLP and projection layers only depend on num_tokens, but attention depends on the interaction pattern:
Processes all input tokens at once. Time depends on kv_cache_size (context so far) and prefill_chunk_size² (quadratic in chunk size because every token attends to every other token).
```python
features = ["kv_cache_size", "prefill_chunk_size_squared"]
```
Generates one token per request. Time depends on batch_size (how many requests) and kv_cache_size (how much context each request has accumulated).
```python
features = ["batch_size", "kv_cache_size"]
```
A trace is a CSV file where each row is a request with a specific num_prefill_tokens and num_decode_tokens. Vidur processes the trace by forming batches and predicting execution time per batch.
```python
def get_execution_time(self, batch, pipeline_stage) -> ExecutionTime:
    # For each batch, compute every time component:
    return ExecutionTime(
        attention_prefill_execution_time=self._get_attention_prefill_execution_time(batch),
        attention_decode_execution_time=self._get_attention_decode_execution_time(batch),
        attention_layer_pre_proj_time=self._get_attention_layer_pre_proj_execution_time(batch),
        attention_layer_post_proj_time=self._get_attention_layer_post_proj_execution_time(batch),
        mlp_layer_up_proj_time=self._get_mlp_layer_up_proj_execution_time(batch),
        mlp_layer_down_proj_time=self._get_mlp_layer_down_proj_execution_time(batch),
        tensor_parallel_communication=self._get_tensor_parallel_communication_time(batch),
        pipeline_parallel_communication=self._get_pipeline_parallel_communication_time(batch),
        # ... + RoPE, KV cache save, norms, CPU overhead
    )
```
```python
@property
def model_time(self):
    # Per layer:
    attention_time = (pre_proj + post_proj + rope + kv_cache_save
                      + prefill_time + decode_time + tp_communication + norm)
    mlp_time = (up_proj + down_proj + act + tp_communication + norm)
    block_time = attention_time + mlp_time + add
    # Total model time:
    return block_time * num_layers_per_pipeline_stage + cpu_overhead
```
A batch can contain multiple requests with different token counts. The prediction models expect a single set of input features, so Vidur must aggregate tokens from all requests in the batch. The aggregation method differs by component:
| Component | Aggregation Method | Lookup Key |
|---|---|---|
| MLP / Linear layers | sum(all tokens), round to multiple of 8 | (total_tokens_rounded,) |
| Prefill Attention | kv_cache = sum; chunk = L2 norm | (∑kv, round(√∑chunk²)²) |
| Decode Attention | count decode reqs; avg kv cache | (batch_size, avg_kv_rounded) |
| TP Communication | sum(all tokens), round to multiple of 8 | (total_tokens_rounded,) |
Suppose a batch contains 3 requests:
| Request | Phase | Tokens to Process | KV Cache |
|---|---|---|---|
| A | Prefill | 512 | 0 |
| B | Prefill | 2048 | 0 |
| C | Decode | 1 | 1000 |
```python
# MLP lookup:
total_tokens = 512 + 2048 + 1                  # = 2561
total_tokens_rounded = (2561 + 7) // 8 * 8     # = 2568
mlp_time = predictions["mlp_up_proj"][(2568,)]

# Prefill attention lookup (requests A & B):
agg_kv = 0 + 0                                 # = 0
agg_chunk = sqrt(512**2 + 2048**2)             # = sqrt(4456448) ≈ 2111
prefill_time = predictions["attn_prefill"][(0, 2111**2)]

# Decode attention lookup (request C only):
decode_batch_size = 1
avg_kv = 1000                                  # rounded to granularity
decode_time = predictions["attn_decode"][(1, 1000)]
```
AICB requires a hand-written "Mocked Model" + "AIOB Profiler" for each LLM. Currently only 3 models are implemented:
| Model | Type | Special Kernels | GPU Requirement |
|---|---|---|---|
| DeepSeek-V3 (671B) | MoE + MLA | DeepGEMM FP8, FlashMLA | SM90+ (H100) |
| Qwen3-MoE (235B) | MoE | vLLM-based | SM90+ |
| Qwen3-Next (80B) | MoE + GDN | GDN (experimental) | SM90+ |
Vidur has model configs for many more models. These can be used with the sklearn prediction path — no AICB needed, though Vidur's own profiling scripts must still be run once on a GPU.
SimAI offers two distinct paths for execution time prediction. Understanding when to use each is critical:
| | AICB + astra-sim | Vidur sklearn |
|---|---|---|
| Question answered | Is this hardware config viable? | How does this scheduling strategy perform? |
| Prefill length | Fixed per run | Varies per request (from trace) |
| Prediction method | Direct profiling at exact (m, batch, seq) | Polynomial model trained on multi-point profiling |
| Handles varying token counts? | No — one run per (m) value | Yes — generalizes via model |
| Network simulation | Full NS-3 / SimCCL simulation | Analytical formula or SimAI integration |
| Accuracy | Exact at profiled point | Approximate (polynomial fit) |
| Supported models | 3 (DeepSeek, Qwen3-MoE, Qwen3-Next) | 10+ (LLaMA, Qwen, InternLM, Phi, ...) |
| GPU needed | Yes (SM90+, per run) | Yes for profiling, then CPU only |