Step-by-step walkthrough of how a request flows through SimAI's 6 stages, with detailed input/output file formats, CSV headers, backend comparison, and configuration parameters at each step.
When a simulated inference request enters SimAI, it travels through six distinct stages spanning all five components (vidur, alibabacloud, AICB, astra-sim, NS-3). The following step-flow traces this journey from arrival to metrics collection.
A RequestArrivalEvent fires in the discrete-event simulation loop. The event carries num_prefill_tokens, num_decode_tokens, and an arrival timestamp. A Request entity is created with an empty DAG (nx.DiGraph()) that will be populated as the request progresses through the system.
Requests are produced by a Request Generator (vidur/request_generator/). SimAI supports two top-level modes:
**Synthetic.** Arrival intervals and token lengths are generated independently: you pick one Interval Generator and one Length Generator.
**Interval Generators**

| Type | Config Parameters | Formula / Logic | Source File |
|---|---|---|---|
| `POISSON` | `qps` (float, default 0.5), `seed` | `interval = −ln(1−rand()) / qps`, capped at 3σ | `vidur/request_generator/poisson_request_interval_generator.py` |
| `GAMMA` | `qps` (float, default 0.2), `cv` (float, default 0.5), `seed` | `shape = 1/cv²`, `scale = 1/(qps×shape)` | `vidur/request_generator/gamma_request_interval_generator.py` |
| `STATIC` | fixed interval value | constant inter-arrival time | `vidur/request_generator/static_request_interval_generator.py` |
| `TRACE` | `trace_file` (CSV path), `start_time`, `end_time`, `time_scale_factor` | reads CSV column `arrival_time` (datetime), computes inter-arrival via `.diff()`, applies `time_scale_factor` | `vidur/request_generator/trace_request_interval_generator.py` |

The interval trace CSV requires an `arrival_time` column with datetime strings (e.g. `2021-01-04 12:00:15`). Default trace: `data/processed_traces/AzureFunctionsInvocationTraceForTwoWeeksJan2021Processed.csv`
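The Poisson interval formula is easy to sketch; the standalone helper below (hypothetical, not the vidur class itself) mirrors `interval = −ln(1−rand()) / qps` with the 3σ cap:

```python
import math
import random

def poisson_intervals(qps: float, n: int, seed: int = 42) -> list[float]:
    """Draw n exponential inter-arrival times for a Poisson process,
    mirroring interval = -ln(1 - rand()) / qps, capped at 3 sigma."""
    rng = random.Random(seed)
    sigma = 1.0 / qps   # std-dev of the exponential distribution
    cap = 3 * sigma     # 3-sigma cap, as in the table above
    return [min(-math.log(1.0 - rng.random()) / qps, cap) for _ in range(n)]

intervals = poisson_intervals(qps=100.0, n=5)
# mean inter-arrival ≈ 1/qps = 0.01 s; every sample ≤ 0.03 s
```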
**Length Generators**

| Type | Config Parameters | Logic | Source File |
|---|---|---|---|
| `FIXED` | `prefill_tokens` (int, default 2048), `decode_tokens` (int, default 512) | returns a constant (prefill, decode) pair | `vidur/request_generator/fixed_request_length_generator.py` |
| `ZIPF` | `min_tokens` (1024), `max_tokens` (4096), `theta` (0.6), `prefill_to_decode_ratio` (20.0), `scramble`, `seed` | Zipf-distributed total tokens, split by ratio | `vidur/request_generator/zipf_request_length_generator.py` |
| `UNIFORM` | `min_tokens` (1024), `max_tokens` (4096), `prefill_to_decode_ratio` (20.0), `seed` | uniform distribution between min/max, split by ratio | `vidur/request_generator/uniform_request_length_generator.py` |
| `TRACE` | `trace_file` (CSV path), `prefill_scale_factor` (1.0), `decode_scale_factor` (1.0), `max_tokens` (4096), `seed` | reads CSV columns `num_prefill_tokens`, `num_decode_tokens`; applies scale factors and clips | `vidur/request_generator/trace_request_length_generator.py` |

The length trace CSV requires integer columns `num_prefill_tokens` and `num_decode_tokens`. Default trace: `data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv`
**Trace replay.** Reads arrival times and token counts from a single CSV file. Config: `TraceRequestGeneratorConfig`.

Example trace (`data/processed_traces/splitwise_conv.csv`):

```
arrived_at,num_prefill_tokens,num_decode_tokens
0.0102006,1024,10
0.0105234,2048,15
0.0215440,1536,8
```

Columns: `arrived_at` (float, seconds) — absolute arrival timestamp; `num_prefill_tokens` (int) — input/prompt tokens; `num_decode_tokens` (int) — output/generation tokens. Config parameters: `trace_file`, `prefill_scale_factor` (1.0), `decode_scale_factor` (1.0), `time_scale_factor` (1.0), `max_tokens` (4096), `seed` (42).
```sh
# Synthetic: Poisson arrival + Fixed length
python -m vidur.main \
  --request_generator_config_type synthetic \
  --interval_generator_config_type poisson \
  --poisson_request_interval_generator_config_qps 100 \
  --length_generator_config_type fixed \
  --fixed_request_length_generator_config_prefill_tokens 1024 \
  --fixed_request_length_generator_config_decode_tokens 10

# Trace Replay
python -m vidur.main \
  --request_generator_config_type trace_replay \
  --trace_request_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
  --trace_request_generator_config_time_scale_factor 0.5
```
Defined in vidur/entities/request.py. At this stage only the identity and token fields are set; timing/DAG fields are populated in later steps.
```
Request(
    # ── Identity & tokens (set at creation) ──
    _id: int                          # auto-incremented unique ID
    _arrived_at: float                # arrival timestamp (seconds)
    _num_prefill_tokens: int          # input / prompt tokens
    _num_decode_tokens: int           # output / generation tokens
    _num_processed_tokens: int = 0    # tokens processed so far

    # ── DAG (empty now, populated in Step 2) ──
    dag: nx.DiGraph()                 # task dependency graph
    node_id: int = 0                  # counter for unique node IDs in DAG
    nodes: dict = {}                  # node_id → Node object mapping
    root_node: Node = None            # entry point (PromptTask)
    request_type: RequestType         # PREFILL=1 / DECODE=2 / MIXED=0

    # ── Replica assignment (set in Step 2) ──
    prefill_replica_id: int = None
    decode_replica_id: int = None

    # ── Timing fields (populated in Steps 3–6) ──
    _scheduled_at: float = 0          # first scheduling time
    _prefill_completed_at: float = 0  # TTFT base timestamp
    _completed_at: float = 0          # request completion time
    _execution_time: float = 0        # GPU execution time
    _model_execution_time: float = 0  # model-specific execution time
    _scheduling_delay: float = 0      # wait before first scheduling
    _preempted_time: float = 0        # time lost to preemption
    prefill_arrived_at: float         # prefill start (= arrived_at)
    decode_arrived_at: float = inf    # decode start (after KV transfer)

    # ── PD communication (set in Step 5) ──
    pd_p2p_comm_size: float = inf     # KV cache transfer bytes
    pd_p2p_comm_time: float = inf     # KV cache transfer time (s)
    pd_p2p_comm_bandwidth: float = 0  # effective bandwidth
    pd_p2p_bytes_per_token: int = None  # 2 (fp16) / 4 (fp32) / 8 (fp64)
    pd_p2p_comm_dtype: str = None     # "float16", "float32", etc.

    # ── Status flags ──
    _scheduled: bool = False
    _preempted: bool = False
    _completed: bool = False
    _is_prefill_complete: bool = False
    _num_restarts: int = 0
)
```
The generator also creates a RequestArrivalEvent(_time=arrived_at, _request=request) which is pushed into the simulation's min-heap event queue.
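The min-heap event queue can be sketched with `heapq` and a comparable event type (simplified; the real events carry more fields than this):

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # monotonically increasing id, like the simulator's event ids

@dataclass(order=True)
class SimEvent:
    """Toy stand-in for RequestArrivalEvent: ordered by (time, id)."""
    time: float
    id: int = field(default_factory=lambda: next(_ids))
    name: str = field(default="", compare=False)

queue: list[SimEvent] = []
heapq.heappush(queue, SimEvent(0.5, name="RequestArrivalEvent"))
heapq.heappush(queue, SimEvent(0.1, name="RequestArrivalEvent"))
first = heapq.heappop(queue)   # earliest timestamp pops first
```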
The global scheduler routes the request to a replica pool. In PD disaggregation mode, it separates prefill and decode into different replica pools.
| Input | `Request` object from Step 1 |
|---|---|
| Global Scheduler Types | `RANDOM`, `ROUND_ROBIN`, `LOR` (Least Outstanding Requests), `SPLIT_WISE` (PD-aware) |
| Key Config | `pd_node_ratio` (float, default 0.5) — fraction of replicas dedicated to prefill; `num_replicas` (int, default 1) — total GPU replicas in cluster |
Sets `request.prefill_replica_id` and `request.decode_replica_id`, then builds the DAG. The exact DAG shape depends on whether the prefill and decode replicas are the same or different (code in `splitwise_global_scheduler.py`):

Same replica (colocated prefill and decode):

```
PromptTask ──→ TokenTask
# linked via prefill_task.chain = [decode_task]
# no KV transfer needed
```

Different replicas (PD disaggregation):

```
PromptTask ──→ KVCacheTransferFlow ──→ TokenTask
# edges in request.dag (nx.DiGraph)
# KV transfer bridges the two replicas
```
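The cross-replica DAG can be reproduced directly with `networkx`, the library vidur's `request.dag` uses. Node payloads here are plain strings, whereas the real graph holds `PromptTask` / `KVCacheTransferFlow` / `TokenTask` objects:

```python
import networkx as nx

# Build the PD-disaggregated task chain as a directed graph
dag = nx.DiGraph()
dag.add_edge("PromptTask", "KVCacheTransferFlow")
dag.add_edge("KVCacheTransferFlow", "TokenTask")

# A chain has exactly one valid execution order
order = list(nx.topological_sort(dag))
# → ['PromptTask', 'KVCacheTransferFlow', 'TokenTask']
```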
```
# ── Base Node (all DAG nodes inherit from this) ──
Node(BaseEntity):
    node_id: int              # unique ID within this request's DAG
    state: NodeState          # NONE=0 → QUEUED=1 → RUNNING=2 → COMPLETED=4
                              # (or BLOCKED=3 if preempted, ABORTED=5)
    request: Request          # back-reference to owning request
    chain: list[Node]         # consecutive nodes (used in same-replica case)
    num_preemptions: int = 0

# ── PromptTask (prefill stage) ──
PromptTask(Task → Node):
    task_type: TaskType = PROMPT    # = 1
    prompt_size: int                # = request.num_prefill_tokens
    tokens_per_iteration: int       # = prompt_size (process all at once)
    processing_tokens: int = 0      # tokens being processed this iteration
    processed_tokens: int = 0       # total tokens processed
    generating_tokens: int = 0      # tokens being generated this iteration
    generated_tokens: int = 0       # total generated (max 1 for prefill)
    is_prefill_complete: bool = False
    cleanup_memory: bool = False    # keep KV cache for decode
    batch_size: int = 1
    duration: float = 0.0           # set by execution time predictor
    # memory = 2 × num_tokens × mlp_hidden_dim × num_layers × bytes_per_token

# ── TokenTask (decode stage) ──
TokenTask(Task → Node):
    task_type: TaskType = TOKEN     # = 2
    token_size: int                 # = request.num_decode_tokens − 1
                                    # (1 token already generated in prefill)
    tokens_per_iteration: int = 1   # auto-regressive: 1 token at a time
    is_prefill_complete: bool = True
    # same processing/generated fields as PromptTask

# ── KVCacheTransferFlow (PD transfer) ──
KVCacheTransferFlow(Flow → Node):
    flow_type: FlowType = KVCacheTransfer   # = 1
    src: Replica              # prefill replica
    dest: Replica             # decode replica
    size: float               # KV cache size in bytes
                              # = estimate_kv_cache_size(prefill_tokens, replica)
    duration: float = 0.0     # set after transfer simulation
    notify: bool = False
    batch_size: int = 1
```
After DAG construction, the global scheduler emits a GlobalScheduleEvent which triggers ReplicaScheduleEvent(s) for the assigned replica(s).
The replica scheduler on the target replica builds a batch from waiting requests, considering KV cache memory, batch size limits, and token limits.
| Input | List of `Request` objects with DAG nodes assigned to this replica |
|---|---|
| Replica Scheduler Types | `SARATHI` (chunk_size=512), `VLLM` (max_tokens_in_batch=4096), `ORCA`, `LIGHTLLM` (max_waiting_iters=10), `FASTER_TRANSFORMER`, `SPLIT_WISE` (max_tokens_in_batch=4096) |
| Common Config | `batch_size_cap` (128), `block_size` (16), `watermark_blocks_fraction` (0.01), `num_blocks` (optional) |
```
Batch(BaseEntity):
    _replica_id: int                  # which replica executes this batch
    _requests: List[Request]          # requests grouped into this batch
    _num_tokens: List[int]            # tokens per request in this batch
    _total_num_tokens: int            # sum of all tokens
    _num_prefill_tokens: int          # prefill tokens (non-completed only)
    _total_num_tokens_rounded: int    # tokens rounded to 8-token boundary
    _scheduled_at: float = None       # when batch was scheduled
    _completed_at: float = None       # when batch completed

    # Properties:
    num_decode_tokens → _total_num_tokens − _num_prefill_tokens
    all_requests_completed → bool     # True if all done
    preempted_requests → List[Request]
    completed_requests → List[Request]
```
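The 8-token rounding behind `_total_num_tokens_rounded` is a one-liner; this helper (hypothetical name, not vidur's code) shows the usual round-up-to-multiple idiom:

```python
def round_to_boundary(num_tokens: int, multiple: int = 8) -> int:
    """Round a token count up to the next multiple-of-8 boundary,
    a sketch of how _total_num_tokens_rounded relates to _total_num_tokens."""
    return ((num_tokens + multiple - 1) // multiple) * multiple

round_to_boundary(1027)  # → 1032
```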
The simulation is event-driven. Each event generates successor events that are pushed into a min-heap sorted by timestamp. The full chain from scheduling to completion:
```
# ── Event chain for a single batch ──
RequestArrivalEvent(time, request)
  → GlobalScheduleEvent(time)                  # route to replica
  → ReplicaScheduleEvent(time, replica_id)     # build batch
  → BatchStageArrivalEvent(time, replica_id, stage_id=0, batch)
  → ReplicaStageScheduleEvent(time, replica_id, stage_id)
      # calls execution_time_predictor → gets duration
  → BatchStageEndEvent(time + duration, replica_id, stage_id, batch)

# If pipeline has more stages (PP > 1):
BatchStageEndEvent → BatchStageArrivalEvent(next stage) + ReplicaStageScheduleEvent(current)

# If last pipeline stage:
BatchStageEndEvent → BatchEndEvent(time, replica_id, batch)

# BatchEndEvent handles PD logic:
BatchEndEvent:
  # For each request that completed prefill:
  #   calculates pd_p2p_comm_time
  #   sets request.decode_arrived_at = prefill_completed_at + pd_p2p_comm_time
  #   → ReplicaScheduleEvent(decode_arrived_at, decode_replica_id)
```
Every event carries `_time` (float), `_id` (auto-incremented int), `_event_type` (an `EventType` enum), and `_priority_number = (time, id, event_type)` for min-heap ordering.
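The `_priority_number` tuple relies on plain lexicographic comparison, so ties on `_time` fall back to insertion `_id` before event type (illustrative values):

```python
# (time, id, event_type) tuples compare field by field, left to right
a = (1.0, 0, 5)   # pushed first (smaller id)
b = (1.0, 1, 2)   # same timestamp, later id → ordered after a
c = (0.5, 2, 9)   # earlier timestamp always wins

order = sorted([b, a, c])
# → [c, a, b]: time first, then insertion id breaks the tie
```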
This step answers: “How long does this batch take to execute?” The answer is always composed of two parts:
```
execution_time = (block_time × num_layers_per_stage + pp_comm_time) / 1000
block_time     = attention_layer_time + mlp_layer_time + add_time
```

| Backend | Compute Time Method | Granularity | Speed |
|---|---|---|---|
| `vidur` | RandomForest trained on GPU profiling CSVs → O(1) lookup table | Homogeneous (all layers identical) | Fastest |
| `aicb` | Pre-profiled per-layer CSV (real GPU, no ML training) | Per-layer heterogeneous + phase-aware (prefill/decode) | Very fast |
| `simai_analytical` | ← same as `vidur` (identical RandomForest compute path) | | |
| `simai_simulation` | ← same as `vidur` (identical RandomForest compute path) | | |
| Backend | TP (AllReduce) | PP (SendRecv) | EP/MoE (AllToAll) |
|---|---|---|---|
| `vidur` | ✓ RandomForest on `all_reduce.csv` | ✓ RandomForest on `send_recv.csv` (all 4 backends) | ✗ not modeled |
| `aicb` | ✗ = 0 (TODO in code) | ✓ shared `send_recv.csv` path | ✓ `comm_size / bandwidth` |
| `simai_analytical` | ✓ `SimAI_analytical` binary + `busbw.yaml` | ✓ shared `send_recv.csv` path | ✗ not modeled |
| `simai_simulation` | ✓ NS-3 packet-level simulation | ✓ shared `send_recv.csv` path | ✗ not modeled |

- `vidur`, `simai_analytical`, and `simai_simulation` share the same RandomForest compute path; they differ only in TP communication estimation.
- `aicb` is the only backend that models EP/MoE AllToAll communication (via a bandwidth formula), but it currently does NOT model TP AllReduce (= 0, with a TODO in `base_execution_time_predictor.py:84`).
- All four backends predict PP (SendRecv) communication from the same `send_recv.csv` profile.
**Backend A (`vidur`).** Trains 11 separate RandomForest models from pre-collected GPU profiling data. After training, predictions are pre-generated for all possible token counts and stored as O(1) lookup tables (Python dicts).
```
# Flow:
Batch(batch_size, num_tokens_per_request)
  → extract features (num_tokens, kv_cache_size)
  → lookup pre-computed dict: {(num_tokens,): time_ms}
  → sum across 11 sub-models per layer
  → execution_time
```
Pre-collected by running each model operator on a real GPU at various token counts. The RandomForest learns f(num_tokens) → time_ms for each operator.
```
# File: data/profiling/compute/{DEVICE}/{MODEL}/mlp.csv
# Example: data/profiling/compute/h100/meta-llama/Llama-2-7b-hf/mlp.csv
#
# Feature columns (input to RandomForest):
num_tokens                    # int — tokens in this profiling run
num_tensor_parallel_workers   # int — TP degree used during profiling
n_head, n_kv_head, n_embd     # model architecture params
n_expanded_embd, vocab_size   # model architecture params
use_gated_mlp                 # bool

# Target columns (what RandomForest predicts) — each with min/max/mean/median/std:
time_stats.emb.*                       # embedding layer time
time_stats.input_layernorm.*           # input LayerNorm time
time_stats.attn_pre_proj.*             # QKV projection time
time_stats.attn_rope.*                 # rotary position encoding time
time_stats.attn_post_proj.*            # output projection time
time_stats.post_attention_layernorm.*  # post-attention LayerNorm time
time_stats.mlp_up_proj.*               # MLP up-projection time
time_stats.mlp_act.*                   # activation function time
time_stats.mlp_down_proj.*             # MLP down-projection time
time_stats.add.*                       # residual add time
```

```
# File: data/profiling/compute/{DEVICE}/{MODEL}/attention.csv
#
# Additional feature columns:
num_tokens, kv_cache_size     # input features

# Target columns:
time_stats.attn_prefill.*        # prefill attention kernel time
time_stats.attn_decode.*         # decode attention kernel time
time_stats.attn_kv_cache_save.*  # KV cache write time
```
Pre-collected by running collective operations at various message sizes. Also trained via RandomForest: f(message_size) → time_ms.
```
# File: data/profiling/network/{NETWORK_DEVICE}/all_reduce.csv
# Example: data/profiling/network/h100_pairwise_nvlink/all_reduce.csv
#
# Columns:
size                          # int — message size in bytes
num_workers                   # int — number of GPUs
rank                          # int — GPU rank
collective                    # str — operation type ("all_reduce")
devices_per_node              # int — GPUs per machine
max_devices_per_node          # int
time_stats.all_reduce.min     # float — min observed time (ms)
time_stats.all_reduce.max     # float
time_stats.all_reduce.mean    # float — training target
time_stats.all_reduce.median  # float
time_stats.all_reduce.std     # float

# File: data/profiling/network/{NETWORK_DEVICE}/send_recv.csv
# Same structure but for point-to-point send/recv (used for PP communication)
```
RandomForest hyperparameter grid: `n_estimators` ∈ {250, 500, 750}, `max_depth` ∈ {8, 16, 32}, `min_samples_split` ∈ {2, 5, 10}. Scoring = MAPE. Trained models are cached as pickle files in `cache/`.
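A minimal sketch of this train-then-lookup pipeline, using the grid and MAPE scoring above but toy profiling data (the real feature extraction and column handling live in vidur's predictor code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for a profiling CSV: num_tokens feature → mean layer time (ms)
X = np.array([[128], [256], [512], [1024]])
y = np.array([0.11, 0.21, 0.40, 0.82])

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [250, 500, 750],
        "max_depth": [8, 16, 32],
        "min_samples_split": [2, 5, 10],
    },
    # sklearn's MAPE scorer is negated so that larger is better
    scoring="neg_mean_absolute_percentage_error",
    cv=2,
)
grid.fit(X, y)

# Pre-generate an O(1) lookup table keyed by token count
lookup = {(t,): float(grid.predict([[t]])[0]) for t in (128, 256, 512, 1024)}
```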
**Backend B (`aicb`).** Completely different architecture: no RandomForest training. Instead, AICB pre-profiles every model layer on a real GPU (DeepGEMM for matmul, FlashMLA for attention) and saves per-layer compute time plus communication size to a CSV. At simulation time, it performs a direct per-layer lookup.
```
# Flow:
Batch(batch_size, num_tokens, phase=prefill|decode)
  → determine AICB CSV file by (model, tp, pp, ep, bs, seq, phase)
  → load CSV: per-layer comp_time + comm_size
  → for each layer:
  │     compute = comp_time (from CSV)
  │     comm = comm_size / bandwidth (from replica config)
  │     layer_time = compute + comm
  → sum across all layers
  → execution_time
```
```
# File: results/workload/vidur-{model}-world_size{ws}-tp{tp}-pp{pp}-ep{ep}-bs{bs}-seq{seq}-{phase}.csv
# Example: results/workload/vidur-deepseek-671B-world_size16-tp2-pp1-ep8-bs1-seq1024-prefill.csv
#
# Format: TAB-separated, no header row
# Columns:
#   layer_id    # int — layer index (0, 1, 2, ...)
#   layer_name  # str — "attention", "mlp", "moe", etc.
#   comp_time   # float — compute time in NANOSECONDS (from real GPU profiling)
#   comm_size   # float — communication size in BYTES (for collective ops)

# Example rows:
0  attention  125000  0
0  mlp        98000   0
0  moe        210000  67108864
1  attention  125000  0
1  mlp        98000   0
1  moe        210000  67108864
```
```
# For MoE layers (expert parallelism):
if phase == "prefill":
    bandwidth = rdma_bandwidth      # cross-node (default 800 Gbps)
else:                               # decode
    bandwidth = nvlink_bandwidth    # intra-node (default 1600 Gbps)
moe_comm_time = comm_size / (bandwidth * 1024**3 / 8)   # Gbps → bytes/s
layer_time = comp_time * 1e-9 + moe_comm_time           # ns → s

# For attention/MLP layers:
# TP communication time = 0 (currently disabled in code)
layer_time = comp_time * 1e-9
```
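The MoE formula above as a runnable helper (hypothetical function name; the input numbers come from the example CSV rows):

```python
def aicb_layer_time(comp_time_ns: float, comm_size_bytes: float, phase: str,
                    rdma_gbps: float = 800.0,
                    nvlink_gbps: float = 1600.0) -> float:
    """MoE layer time = compute (ns → s) + AllToAll size / bandwidth,
    with the bandwidth chosen by phase as in the formula above."""
    bandwidth_gbps = rdma_gbps if phase == "prefill" else nvlink_gbps
    bytes_per_s = bandwidth_gbps * 1024**3 / 8   # Gbps → bytes/s
    return comp_time_ns * 1e-9 + comm_size_bytes / bytes_per_s

# MoE layer from the example rows: 210000 ns compute, 64 MiB AllToAll
t_prefill = aicb_layer_time(210000, 67108864, "prefill")
# ≈ 8.35e-4 s (0.625 ms AllToAll over RDMA + 0.21 ms compute)
```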
**Backend C (`simai_analytical`).** Compute time: identical to `vidur` (same RandomForest, same profiling CSVs). Communication time: replaces vidur's RandomForest with an external `SimAI_analytical` binary that uses a topology-aware analytical model.
```
# Flow:
Batch(batch_size, num_tokens)
  → [COMPUTE] same RandomForest as vidur → comp_time
  → [COMM] generate workload file (AllReduce size = hidden_dim × tokens × 2 bytes)
  → run: SimAI_analytical -w {workload} -g {world_size} -g_p_s 8
  → parse output CSV → comm_time (μs → ms)
  → execution_time = comp_time + comm_time + NCCL_overhead
```
Uses the exact same mlp.csv, attention.csv files described in Backend A for compute time.
Specifies the effective bandwidth (GB/s) for each collective operation within each parallelism group. The analytical model uses these values to estimate communication latency without packet-level simulation.
```yaml
# File: example/busbw.yaml
test                    # scenario name (first line)
TP:                     # Tensor Parallelism group
  allreduce,: 300       # AllReduce bandwidth in GB/s
  allgather,: 280       # AllGather bandwidth in GB/s
  reducescatter,: 280   # ReduceScatter bandwidth in GB/s
  alltoall,: 230        # AllToAll bandwidth in GB/s
EP:                     # Expert Parallelism group
  allreduce,: null
  allgather,: 45
  reducescatter,: 45
  alltoall,: 80
PP:                     # Pipeline Parallelism group
  busbw: 47.5           # single value for send_recv (GB/s)
```

```
# Topology: example/topo (network topology definition)
# Config:   astra-sim-alibabacloud/inputs/config/SimAI.conf
#           (collective algorithm selection, buffer sizes, etc.)
# Output:   results/analytical_EndToEnd_{id}.csv
# Parser reads: last row, 6th column (index 5) = total comm time in μs
# Converts to ms: latency = float(rows[-1][5]) * 1e-3
# Results are cached by MD5(workload_params + topo + conf)
# to avoid redundant runs
```
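The MD5-keyed result cache can be sketched as follows (hypothetical key derivation; the real backend decides exactly which parameters enter the hash):

```python
import hashlib

def comm_cache_key(workload_params: str, topo_path: str, conf_path: str) -> str:
    """Derive a deterministic cache key from the simulation inputs,
    so identical (workload, topo, conf) triples reuse the cached result."""
    blob = "|".join([workload_params, topo_path, conf_path]).encode()
    return hashlib.md5(blob).hexdigest()

key = comm_cache_key("allreduce:size=67108864:ws=16", "example/topo",
                     "astra-sim-alibabacloud/inputs/config/SimAI.conf")
# same inputs → same 32-char hex key → cached comm_time reused
```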
**Backend D (`simai_simulation`).** Compute time: identical to `vidur` and `simai_analytical`. Communication time: replaces the analytical formula with a full NS-3 packet-level RDMA simulation. Same workload generation as `simai_analytical`, but runs through NS-3 instead.
```
# Flow (only the comm path differs from simai_analytical):
Batch(batch_size, num_tokens)
  → [COMPUTE] same RandomForest as vidur → comp_time
  → [COMM] generate same workload file
  → run: AS_SEND_LAT=6 AS_NVLS_ENABLE=1 \
         SimAI_simulator -t 16 -w {workload} -n {topo} -c {conf}
  → parse output CSV → comm_time (μs → ms)
  → execution_time = comp_time + comm_time + NCCL_overhead

# Output: results/ncclFlowModel_EndToEnd_{id}.csv
# Parser reads: last row, 2nd column (index 1) = total comm time in μs
```
SimAI_analytical uses bandwidth formulas for O(1) estimation; SimAI_simulator runs a full NS-3 simulation that models packet-level congestion, ECMP routing, and PFC back-pressure. More accurate for large-scale topologies with contention, but orders of magnitude slower.
This text format is used by the astra-sim / SimAI binaries for training workload simulation (not inference). Included here for reference as it shares the same SimAI ecosystem.
```
# Line 1: model config metadata
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 16 pp: 12 ...
# Line 2: total number of operations
1789
# Remaining lines: one operation per line (space-delimited, 12 fields)
#   op_name | layer_id | iterations | fwd_comm_type | fwd_comm_size | fwd_compute
#   | bwd_comm_type | bwd_comm_size | bwd_compute | dp_comm_type | dp_comm_size | priority
attention_column -1 1750840 ALLGATHER 50331648 875420 REDUCESCATTER 0 875420 NONE 0 100
mlp_moelayer -1 1750840 ALLTOALL 67108864 875420 ALLTOALL 67108864 875420 NONE 0 100
# comm_type values: ALLREDUCE | ALLGATHER | REDUCESCATTER | ALLTOALL | NONE
# comm_size: bytes   fwd/bwd_compute: microseconds   priority: typically 100
```
```
# Regardless of backend, the final output is:
execution_time: float   # seconds — written into BatchStage

# This becomes:
BatchStageEndEvent(_time = current_time + execution_time, batch, stage_id)
# → eventually → BatchEndEvent when last pipeline stage completes
```

```
# Final time breakdown for all backends:
model_time = (
    (attention_layer_time + mlp_layer_time + add_time)   # per-layer block
    × num_layers_per_stage                               # = num_layers / PP
    + pipeline_parallel_communication_time               # send_recv between stages
) / 1000                                                 # ms → s

attention_layer_time = attn_pre_proj + attn_post_proj + attn_rope
                     + attn_kv_cache_save + attn_prefill_or_decode
                     + tensor_parallel_comm_time         # ← THIS is what differs per backend
                     + attn_norm_time

tensor_parallel_comm_time =
      backend_predicted_comm_ms
    + nccl_cpu_launch_overhead_ms                        # default 0.02
    + nccl_cpu_skew_overhead_per_device_ms × tp**1.25    # non-linear scaling
```
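The NCCL overhead formula is easy to evaluate for concrete values. In this sketch only the 0.02 ms launch-overhead default comes from the text above; the skew coefficient used in the example call is illustrative:

```python
def tp_comm_time_ms(backend_predicted_comm_ms: float, tp: int,
                    launch_overhead_ms: float = 0.02,
                    skew_overhead_per_device_ms: float = 0.0) -> float:
    """tensor_parallel_comm_time with the two NCCL CPU overheads:
    a fixed launch cost plus a skew term that scales as tp**1.25."""
    return (backend_predicted_comm_ms
            + launch_overhead_ms
            + skew_overhead_per_device_ms * tp ** 1.25)

# e.g. 0.5 ms predicted comm, TP=8, illustrative skew of 0.01 ms/device:
t = tp_comm_time_ms(0.5, tp=8, skew_overhead_per_device_ms=0.01)
# ≈ 0.5 + 0.02 + 0.01 × 8^1.25 ≈ 0.6545 ms
```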
In PD disaggregation mode, once the prefill replica completes, the KV cache must be transferred to the decode replica.
| Input | `KVCacheTransferFlow` DAG node from the request, containing the KV cache size (derived from the model's `num_kv_heads`, `attention_head_dim`, `num_layers`, `num_prefill_tokens`) |
|---|---|
| Network Config | `pd_p2p_comm_bandwidth` (int, default 800 Gbps), `pd_p2p_comm_dtype` (str, default `"float16"`), `nvlink_bandwidth` (int, default 1600 Gbps), `rdma_bandwidth` (int, default 800 Gbps) |
| Simulation Config (NS-3) | `simai_simulation_topo` — topology file (generated by `gen_Topo_Template.py`); `simai_simulation_config` — `astra-sim-alibabacloud/inputs/config/SimAI.conf` |
Writes into the request: pd_p2p_comm_time (float, seconds), pd_p2p_comm_size (int, bytes), pd_p2p_comm_bandwidth (float), pd_p2p_bytes_per_token (float). Triggers DecodeCompletionEvent scheduling on the decode replica.
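A back-of-the-envelope sketch of the transfer cost, assuming the standard 2 × tokens × kv_heads × head_dim × layers × dtype-bytes KV-cache layout and the same Gbps → bytes/s conversion used elsewhere in this walkthrough (the model numbers in the example call are hypothetical):

```python
def kv_cache_transfer_time_s(num_prefill_tokens: int, num_layers: int,
                             num_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2,        # float16
                             bandwidth_gbps: float = 800.0) -> float:
    """KV transfer time = size / bandwidth, with size following the
    2 (K and V) × tokens × kv_heads × head_dim × layers × dtype-bytes layout."""
    size_bytes = (2 * num_prefill_tokens * num_kv_heads * head_dim
                  * num_layers * bytes_per_elem)
    return size_bytes / (bandwidth_gbps * 1024**3 / 8)   # Gbps → bytes/s

# e.g. a 7B-class model (hypothetical): 32 layers, 32 KV heads, head_dim 128,
# 1024 prefill tokens in fp16 → 512 MiB over 800 Gbps
t = kv_cache_transfer_time_s(1024, 32, 32, 128)
# → 0.005 s
```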
Once all DAG nodes complete, vidur records the full metrics suite and exports them to CSV files.
Completed Request objects with all timing fields populated by Steps 1–5.
Output directory:

```
simulator_output/YYYY-MM-DD_HH-MM-SS-XXXXXX/
```

```
# request_metrics.csv columns (from metrics/constants.py):
Request Id,
request_e2e_time,             # total latency (s)
request_execution_time,       # GPU execution time (s)
request_model_execution_time, # model-specific execution (s)
request_preemption_time,      # preemption wait (s)
request_scheduling_delay,     # pre-schedule wait (s)
prefill_e2e_time,             # TTFT: prefill_completed_at - arrived_at
decode_time,                  # decode duration (s)
tbt,                          # time-between-tokens: decode_time / num_decode_tokens
arrived_at,                   # arrival timestamp (s)
scheduled_at,                 # first scheduling timestamp (s)
prefill_completed_at,         # TTFT base timestamp (s)
decode_arrived_at,            # decode phase start (s)
completed_at,                 # completion timestamp (s)
request_num_prefill_tokens,   # input tokens (int)
request_num_decode_tokens,    # output tokens (int)
prefill_replica_id,           # prefill replica (PD mode)
decode_replica_id,            # decode replica (PD mode)
pd_p2p_comm_size,             # KV cache transfer bytes
pd_p2p_comm_time,             # KV cache transfer time (s)
pd_p2p_comm_bandwidth,        # effective bandwidth
pd_p2p_bytes_per_token,       # bytes per token in transfer
pd_p2p_comm_dtype             # data type (float16, etc.)
```
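The derived columns can be recomputed from the timestamp columns, which makes a handy sanity check when post-processing the CSV (the row values here are synthetic, not real simulator output):

```python
# One request_metrics.csv row as a dict (synthetic values)
row = {
    "arrived_at": 0.0102006,
    "prefill_completed_at": 0.0502006,
    "completed_at": 0.2502006,
    "request_num_decode_tokens": 10,
}

ttft = row["prefill_completed_at"] - row["arrived_at"]     # prefill_e2e_time
decode_time = row["completed_at"] - row["prefill_completed_at"]
tbt = decode_time / row["request_num_decode_tokens"]       # time-between-tokens
# ttft ≈ 0.04 s, decode_time ≈ 0.2 s, tbt ≈ 0.02 s
```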
| File | Config Toggle | Description |
|---|---|---|
| `batch_metrics.csv` | `store_batch_metrics` (true) | `batch_size`, `batch_num_tokens`, `batch_num_prefill_tokens`, `batch_num_decode_tokens`, `batch_execution_time` |
| `chrome_trace.json` | `enable_chrome_trace` (true) | Chrome DevTools timeline for visual profiling |
| `event_trace.json` | `write_json_trace` (false) | Detailed timestamped event log |
| CDF/histogram plots (PNG) | `store_plots` (true) | Latency CDFs, batch size histograms, utilization time series |