From GPU profiling to per-request execution time prediction — the complete pipeline
AICB · Vidur · Sklearn

SimAI's execution time prediction answers a single question: given a specific GPU, a specific LLM, and a trace of requests (each with different prefill/decode token counts), how long does each request take? The answer comes from a three-stage pipeline.
AICB profiles actual CUDA kernel execution times on a target GPU. For each model, it runs every operation (GEMM, attention, MLP) with specific tensor dimensions, measures the GPU time in microseconds, and writes the results to files.
GEMM (General Matrix Multiply) is the fundamental operation in LLM inference. Every Linear layer in a Transformer is a matrix multiplication:
```
// Every Linear layer = one GEMM
//   output = input × weight
//            (m, k)  (k, n) → (m, n)
//             ↑       ↑
//           tokens   fixed by model architecture
//
// Example: LLaMA-8B MLP up projection
//   m = num_tokens (variable!)
//   k = 4096   (hidden_size, fixed)
//   n = 14336  (ffn_size, fixed)
```
The critical variable is m — the number of tokens being processed. In prefill, m = seq_length (e.g. 4096); in decode, m = batch_size (e.g. 32). Everything else is fixed by the model architecture.
| Operation | GEMM Dimensions | What It Does |
|---|---|---|
| attn_pre_proj | (m, hidden) × (hidden, qkv_dim) | Projects input into Q, K, V |
| attn_flash | FlashAttention kernel | Computes softmax(QKᵀ)V |
| attn_post_proj | (m, head_dim×heads/tp) × (..., hidden) | Output projection back to hidden dim |
| mlp_up_proj | (m, hidden) × (hidden, ffn×2) | MLP gate + up projection |
| mlp_down_proj | (m, ffn) × (ffn, hidden) | MLP down projection |
```python
def cuda_timing_decorator(func):
    def wrapper(*args, **kwargs):
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        result = func(*args, **kwargs)  # run actual CUDA kernel
        end_event.record()
        torch.cuda.synchronize()
        elapsed = start_event.elapsed_time(end_event)  # GPU-side time, in ms
        return result, elapsed
    return wrapper
```
AICB uses torch.cuda.Event for GPU-side timing (not Python's time.time()). This measures actual kernel execution time, excluding CPU launch overhead. Each operation is profiled multiple times (default 10) and averaged.
A single AICB run, however, profiles only one (m, batch_size, seq_length) combination. To build a prediction model that generalizes to arbitrary token counts, you need to run AICB at multiple different values of m and collect the results into profiling CSV files.
A GEMM of (m, k) × (k, n) requires 2 × m × k × n FLOPs — linear in m. But actual GPU execution time is not linear in m, because a GPU has two bottlenecks:
```
// actual execution time = max(compute_time, memory_time)
//
// When m is small (decode, m = 1~32):
//   FLOPs are few, but the weight matrix still must be loaded from HBM
//   → memory-bound: time ≈ weight_size / HBM_bandwidth
//   → nearly independent of m!
//
// When m is large (prefill, m = 4096):
//   FLOPs dominate over memory load time
//   → compute-bound: time ≈ FLOPs / GPU_peak_FLOPS
//   → scales linearly with m
```
Consider the MLP up-projection GEMM (m, 4096) × (4096, 14336) on an H100:
| m | FLOPs | Weight Load (fixed) | Compute Time | Memory Time | Bottleneck |
|---|---|---|---|---|---|
| 1 | 117M | 112 MB | ~0.1 µs | ~34 µs | Memory |
| 32 | 3.7G | 112 MB | ~4 µs | ~34 µs | Memory |
| 128 | 15G | 112 MB | ~15 µs | ~34 µs | Memory |
| 512 | 60G | 112 MB | ~60 µs | ~34 µs | Transition |
| 4096 | 480G | 112 MB | ~480 µs | ~34 µs | Compute |
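The regimes in the table can be reproduced with a small roofline sketch. The peak-FLOPS and bandwidth constants below are assumed illustrative H100 numbers, not values SimAI uses — SimAI measures real kernels instead of modeling this.

```python
# Roofline sketch: time = max(compute_time, memory_time) for a (m, k) x (k, n) GEMM.
# PEAK_FLOPS and HBM_BW are illustrative H100-class assumptions.
PEAK_FLOPS = 990e12   # ~BF16 peak, FLOP/s (assumption)
HBM_BW     = 3.35e12  # bytes/s (assumption)
ELEM_SIZE  = 2        # bf16 bytes per element

def gemm_time_us(m, k=4096, n=14336):
    compute = 2 * m * k * n / PEAK_FLOPS              # FLOPs / peak throughput
    memory  = (m*k + k*n + m*n) * ELEM_SIZE / HBM_BW  # bytes moved / bandwidth
    return max(compute, memory) * 1e6, compute < memory  # (time in µs, memory-bound?)

t1, mem_bound_1 = gemm_time_us(1)          # decode-like: ~35 µs, memory-bound
t4096, mem_bound_4096 = gemm_time_us(4096) # prefill-like: ~487 µs, compute-bound
```

Running this reproduces the table's shape: time is nearly flat until m reaches a few hundred, then grows linearly.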
This is why profiling at a single m and scaling linearly is wrong, and why SimAI needs ML models to capture this non-linear relationship.
The non-linearity above comes from max(compute_time, memory_time). A natural question is: if we model compute_time and memory_time separately, does each one scale linearly with m?
| Component | Formula | Linearity in m |
|---|---|---|
| Compute Time | 2 × m × k × n / peak_FLOPS | Strictly linear (passes through origin) |
| Memory Time | (m×k + k×n + m×n) × elem_size / bandwidth | Affine: a×m + b, where b comes from the k×n weight matrix (constant) |
| Actual Time | max(compute, memory) | Non-linear (piecewise, regime-dependent) |
So compute_time is strictly linear in m. But memory_time is affine (a×m + b) — it has a constant offset b = k × n × elem_size / bandwidth from loading the weight matrix, which does not depend on m. For small m, this constant dominates, making memory_time appear nearly flat. Separating them makes each component approximately predictable, but memory_time is not a pure linear-through-origin function.
SimAI does not build a roofline model or separate compute vs. memory time. Instead, AICB profiles each kernel at the actual micro-batch size on real GPU hardware using CUDA events, so the measured wall-clock time already implicitly captures max(compute, memory).
```python
# AiobMegatron.py — profiles each operation on a real GPU.
# Records wall-clock time per kernel (already includes max(compute, memory)).
for _ in range(epoch_num):
    Emb_output, Emb_time = self.Embedding(input)
    self.time_list["Emb"].append({"time_gpu": Emb_time})
    for _ in range(num_layers):
        # Each op timed individually via CUDA events
        lay_out, layernorm = self.Layernorm(Emb_output)
        atten_output, atten_qkv, ... = self.Attention(lay_out)
        mlp_out, mlp_linear_1, mlp_gelu, mlp_linear_2 = self.Mlp(lay2_out)
        # All times stored in self.time_list
```
These measured times are then averaged via extract_averages() to build a compute_cache — a lookup table mapping each operation to its profiled time in microseconds:
```python
# utils.py — extract_averages()
compute_cache = {
    "attention_forward":  1250,  # measured avg (µs)
    "attention_backward": 1250,
    "mlp_forward":  890,
    "mlp_backward": 890,
    "grad_forward": 15,
    "grad_backward": 42,
    ...
}
```
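The averaging step itself is simple. Here is a minimal stand-in for extract_averages() — the function name comes from the source, but this body is an illustrative sketch, not the actual implementation:

```python
# Sketch: collapse repeated per-kernel timings into a compute_cache of averages.
# time_list mirrors the structure AiobMegatron builds; values are in µs.
def extract_averages_sketch(time_list):
    return {
        op: sum(sample["time_gpu"] for sample in samples) / len(samples)
        for op, samples in time_list.items()
    }

time_list = {
    "attention_forward": [{"time_gpu": 1240}, {"time_gpu": 1260}],
    "mlp_forward":       [{"time_gpu": 880},  {"time_gpu": 900}],
}
compute_cache = extract_averages_sketch(time_list)
# compute_cache["attention_forward"] == 1250.0
```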
The workload generator then replays these fixed profiled times for each microbatch × each layer, with no scaling applied:
```python
# SimAI_training_workload_generator.py
for _ in range(ga_num):      # iterate over microbatches
    for layer in layers:     # iterate over model layers
        forward_compute_time = _get_aiob_compute_time(
            compute_cache, "forward", layer_name
        )  # table lookup, no math
```
A calculate_stats() function tracks the variance and P99 of these measurements for stability analysis, but there is no compute vs. memory breakdown.
Before predicting execution time, SimAI must decide which requests go into each batch. This is handled by the replica scheduler, which implements one of three strategies. Each strategy has different constraints on how prefill and decode requests are mixed.
```python
class BaseReplicaSchedulerConfig:
    batch_size_cap: int = 128                 # max requests per batch
    block_size: int = 16                      # KV cache block size (tokens)
    watermark_blocks_fraction: float = 0.01   # reserved memory fraction
```
Memory is the hard constraint: each request occupies KV cache blocks proportional to its context length. The scheduler must ensure total allocated blocks do not exceed GPU memory (minus a watermark reserve).
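A sketch of that block accounting, under assumed names (blocks_needed, can_allocate, and the capacity constants are illustrative, not the scheduler's actual API):

```python
import math

BLOCK_SIZE = 16        # tokens per KV cache block (from the config above)
TOTAL_BLOCKS = 1000    # illustrative GPU capacity in blocks (assumption)
WATERMARK = 0.01       # reserved fraction, per watermark_blocks_fraction

def blocks_needed(context_len):
    # Each request occupies ceil(context / block_size) KV cache blocks
    return math.ceil(context_len / BLOCK_SIZE)

def can_allocate(allocated_blocks, context_len):
    # Admit a request only if it fits under the watermark reserve
    reserve = math.ceil(TOTAL_BLOCKS * WATERMARK)
    return allocated_blocks + blocks_needed(context_len) <= TOTAL_BLOCKS - reserve

# blocks_needed(100) == 7; a nearly full replica rejects new requests
```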
The vLLM scheduler packs requests until a token budget is exhausted. Prefill and decode requests can coexist in the same batch.
```python
# Core constraint:
#   num_batch_tokens = len(all_tokens_list) × max(all_tokens_list)
#   Must satisfy: num_batch_tokens ≤ max_tokens_in_batch (default 4096)

# Per-request token count:
if request.is_prefill_complete:
    next_tokens = 1                           # decode: 1 token per iteration
else:
    next_tokens = request.num_prefill_tokens  # prefill: all tokens at once
```
The num_requests × max_seq_len formulation means a single long prefill request can consume the entire token budget, blocking shorter requests. This is the classic vLLM V1 scheduling behavior.
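The blocking effect is easy to see in a toy version of the admission check (illustrative sketch, not the vLLM source):

```python
MAX_TOKENS_IN_BATCH = 4096

def fits(batch_next_tokens, candidate_next_tokens):
    # vLLM-style budget: num_requests × max(next_tokens) ≤ cap
    tokens = batch_next_tokens + [candidate_next_tokens]
    return len(tokens) * max(tokens) <= MAX_TOKENS_IN_BATCH

# Two decode requests (1 token each) easily fit together...
assert fits([1], 1)
# ...but adding a 2048-token prefill makes 3 × 2048 = 6144 > 4096:
assert not fits([1, 1], 2048)
```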
Sarathi splits long prefills into fixed-size chunks, preventing a single prefill from monopolizing the GPU. Decode requests are always processed (1 token each), and remaining budget goes to prefill chunks.
```python
# Prefill tokens capped by the remaining chunk budget:
next_tokens = min(
    request.num_prefill_tokens - request.num_processed_tokens,
    chunk_size - num_batch_tokens,  # remaining budget in this chunk
)
# chunk_size default: 512 tokens
```
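Iterating that min() shows how a long prefill drains over several scheduler rounds. A sketch, assuming the request has the whole chunk budget to itself (prefill_schedule is a hypothetical helper):

```python
CHUNK_SIZE = 512  # Sarathi default

def prefill_schedule(num_prefill_tokens):
    # Tokens of one prefill request processed per iteration, assuming
    # the full chunk budget is available to it each round.
    processed, per_iter = 0, []
    while processed < num_prefill_tokens:
        next_tokens = min(num_prefill_tokens - processed, CHUNK_SIZE)
        per_iter.append(next_tokens)
        processed += next_tokens
    return per_iter

# A 2048-token prefill becomes four 512-token chunks:
# prefill_schedule(2048) == [512, 512, 512, 512]
```

In a real batch the second min() argument shrinks as decode requests consume part of the chunk budget first.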
LightLLM never mixes prefill and decode in the same batch. It alternates between prefill-only and decode-only batches, with a fairness mechanism to prevent decode starvation.
```python
def _get_next_batch(self):
    if not self._preempted_requests:                  # no ongoing decode
        return self._get_prefill_batch()
    if self._num_waiting_iters >= max_waiting_iters:
        return self._get_prefill_batch()              # fairness: don't starve prefill
    return self._get_decode_batch()                   # default: continue decode
```
| Strategy | Prefill/Decode Mixing | Prefill Chunking | Budget Constraint |
|---|---|---|---|
| vLLM | Mixed | No (full prefill) | n_reqs × max_seq ≤ 4096 |
| Sarathi | Mixed | Yes (fixed chunks) | total_tokens ≤ chunk_size |
| LightLLM | Separate batches | No | max_tokens ≤ 4096 |
Once a batch is formed, SimAI computes its execution time by summing per-component predictions. Every component time comes from a pre-trained polynomial model looked up by the batch's aggregated token count. Here is the complete formula.
Each transformer layer's execution time is the sum of an attention block and an MLP block:
```
// Per-layer block time:
block_time = attention_time + mlp_time + residual_add

// Attention block (8 sub-components):
attention_time = pre_proj        // QKV projection       ← f(total_tokens)
               + post_proj       // output projection    ← f(total_tokens)
               + rope            // rotary embedding     ← f(total_tokens)
               + kv_cache_save   // write K,V to cache   ← f(total_tokens)
               + prefill_attn    // prefill attention    ← f(kv_cache, chunk²)
               + decode_attn     // decode attention     ← f(batch_size, kv_cache)
               + tp_allreduce    // tensor parallel comm ← f(total_tokens)
               + attn_norm       // layer norm           ← f(total_tokens)

// MLP block (5 sub-components):
mlp_time = up_proj        // gate + up projection ← f(total_tokens)
         + down_proj      // down projection      ← f(total_tokens)
         + activation     // SiLU / GeLU          ← f(total_tokens)
         + tp_allreduce   // tensor parallel comm ← f(total_tokens)
         + mlp_norm       // layer norm           ← f(total_tokens)
```
```
// Standard (dense) models:
model_time = block_time × num_layers_per_pipeline_stage
           + pipeline_parallel_communication

// MoE models (DeepSeek, Qwen3-MoE): per-layer calculation,
// because dense layers and MoE layers have different times
model_time = ∑ block_time(layer_id) for layer_id in [start..end]
           + pipeline_parallel_communication

// Final output:
total_time = model_time + cpu_overhead
```
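The dense vs. MoE aggregation above, as a runnable sketch (function names and numbers are illustrative):

```python
def model_time_dense(block_time, num_layers, pp_comm):
    # Dense model: every layer in the pipeline stage costs the same
    return block_time * num_layers + pp_comm

def model_time_moe(block_times, pp_comm):
    # MoE model: dense and expert layers cost differently, so sum per layer
    return sum(block_times) + pp_comm

# 4 identical dense layers vs. alternating dense/MoE layers (times in µs):
# model_time_dense(100, 4, 20)            == 420
# model_time_moe([100, 180, 100, 180], 20) == 580
```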
The key insight is that different components aggregate the batch's tokens differently:
| Component | Input Features | How Aggregated from Batch |
|---|---|---|
| MLP / Projection / Norm | (total_tokens,) | Sum all tokens, round to multiple of 8 |
| Prefill Attention | (kv_cache, chunk_size²) | kv = ∑ per-request context; chunk = L2 norm, then squared |
| Decode Attention | (batch_size, avg_kv) | Count decode requests; average their kv_cache sizes |
| TP AllReduce | (total_tokens,) | Same as MLP (message size ∝ tokens) |
```python
def _get_attention_decode_execution_time(self, batch):
    decode_batch_size, avg_kv_cache = self._get_batch_decode_attention_params(batch)
    if decode_batch_size == 0:
        return 0
    base_time = self._predictions["attn_decode"][(decode_batch_size, avg_kv_cache)]
    # Add batching overhead when batch_size > 1
    return base_time * (1 + overhead_fraction * int(decode_batch_size > 1))
```
```python
def _get_attention_prefill_execution_time(self, batch):
    # Collect (kv_cache_size, prefill_chunk_size) per prefill request
    prefill_params = self._get_batch_prefill_attention_params(batch)
    if len(prefill_params) == 0:
        return 0
    kv_sizes, chunk_sizes = zip(*prefill_params)
    agg_kv = sum(kv_sizes)
    agg_chunk = sum(x**2 for x in chunk_sizes) ** 0.5  # L2 norm
    return self._predictions["attn_prefill"][
        (agg_kv, round(agg_chunk) ** 2)  # squared back
    ]
```
Prefill attention is O(chunk_size²) per request (self-attention). When batching N requests, the total work is ∑ chunk_i², not (∑ chunk_i)². The L2 norm √(∑ chunk_i²) preserves this property: squaring it back gives the correct total FLOPs. A simple sum would overestimate by including cross-request attention that never happens.
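A quick numerical check of the three aggregation choices, using two illustrative chunk sizes:

```python
chunks = [512, 2048]  # prefill chunk sizes of two requests in one batch

per_request_work = sum(c**2 for c in chunks)   # true O(chunk²) total: 4,456,448
naive_sum_work = sum(chunks) ** 2              # (∑chunk)² = 6,553,600 — overestimate
l2 = sum(c**2 for c in chunks) ** 0.5          # the aggregated feature
recovered = l2 ** 2                            # squaring back recovers the true total

# The naive sum inflates work by exactly the 2·512·2048 cross-request term
```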
Vidur takes the profiling CSV files from Stage 1 and trains separate sklearn models for each type of operation. The key technique is polynomial regression — fitting a curve that captures the memory-bound → compute-bound transition.
| Model Name | Input Features | Predicts |
|---|---|---|
| attn_prefill | (kv_cache_size, prefill_chunk_size²) | Prefill attention time |
| attn_decode | (batch_size, kv_cache_size) | Decode attention time |
| attn_pre_proj | (num_tokens) | QKV projection time |
| mlp_up_proj | (num_tokens) | MLP up+gate time |
| mlp_down_proj | (num_tokens) | MLP down time |
| all_reduce | (num_tokens) | TP communication time |
```python
# 1. Load profiling CSV (multiple m values from AICB runs)

# 2. Fit polynomial regression
estimator = make_pipeline(
    PolynomialFeatures(degree=n),   # captures the non-linear curve
    LinearRegression(fit_intercept=True),
)
# time = a₀ + a₁·tokens + a₂·tokens² + ... + aₙ·tokensⁿ

# 3. Pre-compute predictions for all possible token counts
num_token_range = np.arange(1, max_tokens + 1)
predictions[model_name] = model.predict(num_token_range)
# Result: { (1,): 5.2, (2,): 5.3, ..., (4096,): 480.1, ... }
```
At simulation time, every prediction is then an O(1) dictionary lookup — no ML inference at runtime.
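The fit-once, precompute-a-table pattern can be sketched end to end with numpy.polyfit standing in for the sklearn pipeline. The synthetic roofline-shaped data and constants below are illustrative, not real profiling output:

```python
import numpy as np

# Synthetic measurements standing in for AICB profiling: flat while
# memory-bound, linear once compute-bound (constants are assumptions).
m = np.array([1, 8, 32, 128, 512, 1024, 2048, 4096], dtype=float)
t = np.maximum(0.117 * m, 35.0)  # µs

coeffs = np.polyfit(m, t, deg=3)  # degree-3 polynomial fit
poly = np.poly1d(coeffs)

# Pre-compute an O(1) lookup table for every token count:
num_token_range = np.arange(1, 4097)
predictions = {(int(n),): float(poly(n)) for n in num_token_range}
# predictions[(4096,)] sits on the linear compute-bound part of the curve
```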
Attention is the only component that behaves fundamentally differently between prefill and decode. MLP and projection layers only depend on num_tokens, but attention depends on the interaction pattern:
Processes all input tokens at once. Time depends on kv_cache_size (context so far) and prefill_chunk_size² (quadratic in chunk size because every token attends to every other token).
```python
features = ["kv_cache_size", "prefill_chunk_size_squared"]
```
Generates one token per request. Time depends on batch_size (how many requests) and kv_cache_size (how much context each request has accumulated).
```python
features = ["batch_size", "kv_cache_size"]
```
A trace is a CSV file where each row is a request with a specific num_prefill_tokens and num_decode_tokens. Vidur processes the trace by forming batches and predicting execution time per batch.
```python
def get_execution_time(self, batch, pipeline_stage) -> ExecutionTime:
    # For each batch, compute every time component:
    return ExecutionTime(
        attention_prefill_execution_time=self._get_attention_prefill_execution_time(batch),
        attention_decode_execution_time=self._get_attention_decode_execution_time(batch),
        attention_layer_pre_proj_time=self._get_attention_layer_pre_proj_execution_time(batch),
        attention_layer_post_proj_time=self._get_attention_layer_post_proj_execution_time(batch),
        mlp_layer_up_proj_time=self._get_mlp_layer_up_proj_execution_time(batch),
        mlp_layer_down_proj_time=self._get_mlp_layer_down_proj_execution_time(batch),
        tensor_parallel_communication=self._get_tensor_parallel_communication_time(batch),
        pipeline_parallel_communication=self._get_pipeline_parallel_communication_time(batch),
        # ... + RoPE, KV cache save, norms, CPU overhead
    )
```
```python
@property
def model_time(self):
    # Per layer:
    attention_time = (pre_proj + post_proj + rope + kv_cache_save
                      + prefill_time + decode_time + tp_communication + norm)
    mlp_time = (up_proj + down_proj + act + tp_communication + norm)
    block_time = attention_time + mlp_time + add
    # Total model time:
    return block_time * num_layers_per_pipeline_stage + cpu_overhead
```
A batch can contain multiple requests with different token counts. The prediction models expect a single set of input features, so Vidur must aggregate tokens from all requests in the batch. The aggregation method differs by component:
| Component | Aggregation Method | Lookup Key |
|---|---|---|
| MLP / Linear layers | sum(all tokens), round to multiple of 8 | (total_tokens_rounded,) |
| Prefill Attention | kv_cache = sum; chunk = L2 norm | (∑kv, round(√∑chunk²)²) |
| Decode Attention | count decode reqs; avg kv cache | (batch_size, avg_kv_rounded) |
| TP Communication | sum(all tokens), round to multiple of 8 | (total_tokens_rounded,) |
Suppose a batch contains 3 requests:
| Request | Phase | Tokens to Process | KV Cache |
|---|---|---|---|
| A | Prefill | 512 | 0 |
| B | Prefill | 2048 | 0 |
| C | Decode | 1 | 1000 |
```python
# MLP lookup:
total_tokens = 512 + 2048 + 1                  # = 2561
total_tokens_rounded = (2561 + 7) // 8 * 8     # = 2568
mlp_time = predictions["mlp_up_proj"][(2568,)]

# Prefill attention lookup (requests A & B):
agg_kv = 0 + 0                                 # = 0
agg_chunk = sqrt(512**2 + 2048**2)             # = sqrt(4456448) ≈ 2111
prefill_time = predictions["attn_prefill"][(0, 2111**2)]

# Decode attention lookup (request C only):
decode_batch_size = 1
avg_kv = 1000                                  # rounded to granularity
decode_time = predictions["attn_decode"][(1, 1000)]
```
AICB requires a hand-written "Mocked Model" + "AIOB Profiler" for each LLM. Currently only 3 models are implemented:
| Model | Type | Special Kernels | GPU Requirement |
|---|---|---|---|
| DeepSeek-V3 (671B) | MoE + MLA | DeepGEMM FP8, FlashMLA | SM90+ (H100) |
| Qwen3-MoE (235B) | MoE | vLLM-based | SM90+ |
| Qwen3-Next (80B) | MoE + GDN | GDN (experimental) | SM90+ |
Vidur has model configs for many more models. These can be used with the sklearn prediction path — no AICB needed, though Vidur's own profiling scripts must still be run once on a GPU.
SimAI offers two distinct paths for execution time prediction. Understanding when to use each is critical:
| | AICB + astra-sim | Vidur sklearn |
|---|---|---|
| Question answered | Is this hardware config viable? | How does this scheduling strategy perform? |
| Prefill length | Fixed per run | Varies per request (from trace) |
| Prediction method | Direct profiling at exact (m, batch, seq) | Polynomial model trained on multi-point profiling |
| Handles varying token counts? | No — one run per (m) value | Yes — generalizes via model |
| Network simulation | Full NS-3 / SimCCL simulation | Analytical formula or SimAI integration |
| Accuracy | Exact at profiled point | Approximate (polynomial fit) |
| Supported models | 3 (DeepSeek, Qwen3-MoE, Qwen3-Next) | 10+ (LLaMA, Qwen, InternLM, Phi, ...) |
| GPU needed | Yes (SM90+, per run) | Yes for profiling, then CPU only |