
fix: restore http metrics for V0 engine #17471


Draft · davidxia wants to merge 1 commit into main

Conversation

davidxia (Contributor) commented Apr 30, 2025

Restores HTTP metrics for the V0 engine by lazily importing `prometheus_client` in `vllm/v1/spec_decode/metrics.py`.

FIX #17406
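For context, a minimal sketch of the lazy-import pattern this change applies. Only the `import prometheus_client` line inside `__init__` mirrors the diff further down; the constructor body and metric name here are simplified, illustrative stand-ins.

```python
# sketch_lazy_prometheus.py -- illustrative, not the actual vLLM source
class SpecDecodingProm:
    """Defer the prometheus_client import until the metrics object is
    constructed, so importing this module has no import-time side effects."""

    def __init__(self, speculative_config, labelnames: list[str],
                 labelvalues: list[str]):
        # Lazy import (the line added by this PR's diff): prometheus_client
        # is only pulled in when an instance is created, not at module
        # import time.
        import prometheus_client

        if speculative_config is None:
            return

        # Illustrative metric; the real class registers its own counters.
        self.counter_drafts = prometheus_client.Counter(
            name="vllm:spec_decode_num_drafts",
            documentation="Number of speculative decoding drafts (sketch).",
            labelnames=labelnames).labels(*labelvalues)
```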


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

davidxia (Contributor, Author) left a comment:

cc @markmc

@@ -116,6 +115,8 @@ class SpecDecodingProm:

     def __init__(self, speculative_config: Optional[SpeculativeConfig],
                  labelnames: list[str], labelvalues: list[str]):
+        import prometheus_client
davidxia (Contributor, Author) commented Apr 30, 2025:

Does this work because of this?

These types are defined in this file to avoid importing vllm.engine.metrics
and therefore importing prometheus_client.
This is required due to usage of Prometheus multiprocess mode to enable
metrics after splitting out the uvicorn process from the engine process.
Prometheus multiprocess mode requires setting PROMETHEUS_MULTIPROC_DIR
before prometheus_client is imported. Typically, this is done by setting
the env variable before launch, but since we are a library, we need to
do this in Python code and lazily import prometheus_client.

Is lazily importing here the right approach, or do we want to define these types in a central file as the docstring above describes?
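For reference, a minimal sketch of the ordering constraint the quoted docstring describes: PROMETHEUS_MULTIPROC_DIR has to be set before `prometheus_client` is first imported anywhere in the process. The helper name and temp-dir handling below are illustrative, not vLLM's actual API.

```python
import os
import tempfile


def setup_multiprocess_prometheus() -> str:
    """Illustrative: set PROMETHEUS_MULTIPROC_DIR before the first
    prometheus_client import so multiprocess mode takes effect."""
    if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
        # prometheus_client reads this variable at import time to decide
        # whether to use multiprocess value files.
        os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()

    # Only now is it safe to import prometheus_client; an earlier
    # module-level import anywhere in the process would have already
    # fixed it in single-process mode.
    import prometheus_client  # noqa: F401

    return os.environ["PROMETHEUS_MULTIPROC_DIR"]
```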

Member commented:

We're not currently using multiprocess mode in V1, so we don't do this sort of hack

See also https://docs.vllm.ai/en/stable/design/v1/metrics.html#multi-process-mode

Member commented:

I guess something changed to cause vllm.v1.spec_decode.metrics to be imported even when you're using V0

davidxia (Contributor, Author) commented:

Is vllm.v1.spec_decode.metrics not supposed to be imported with V0? Is the fix to find the cause of this import and remove it for V0?

Member commented:

> Is vllm.v1.spec_decode.metrics not supposed to be imported with V0? Is the fix to find the cause of this import and remove it for V0?

Maybe. I'd prefer to avoid the lazy import hack, unless it causes a lot of pain to avoid it
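As an aside, when a module only needs a name for type annotations, a `TYPE_CHECKING` guard is one common way to break an import chain without a lazy import inside `__init__`. The sketch below is purely illustrative and assumes the importing module (for example `vllm/v1/metrics/stats.py` in the trace later in this thread) uses the name only in annotations; if it constructs the objects at runtime, this alone would not help.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Imported only by type checkers; at runtime this module no longer
    # pulls in vllm.v1.spec_decode.metrics (and, transitively,
    # prometheus_client).
    from vllm.v1.spec_decode.metrics import SpecDecodingStats


class SchedulerStats:  # illustrative container, not the real vLLM class
    def __init__(self,
                 spec_decoding_stats: Optional[SpecDecodingStats] = None):
        self.spec_decoding_stats = spec_decoding_stats
```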

davidxia (Contributor, Author) commented Apr 30, 2025:

Looks like this module is imported before the engine version is determined. In the trace I see vllm/v1/spec_decode/metrics.py:9 before the "device type=cpu is not supported by the V1 Engine. Falling back to V0." warning.

So I think `import prometheus_client` is executed from here for every code path. The output below is from running `VLLM_LOGGING_LEVEL=DEBUG vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0` on this branch. Happy to contribute a fix if you have any guidance.

vllm serve cmd stack trace for vllm/v1/spec_decode/metrics.py and debug logs
$ VLLM_LOGGING_LEVEL=DEBUG vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
DEBUG 04-30 15:58:44 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-30 15:58:44 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-30 15:58:44 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-30 15:58:44 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-30 15:58:44 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
DEBUG 04-30 15:58:44 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
DEBUG 04-30 15:58:44 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-30 15:58:44 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-30 15:58:44 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-30 15:58:44 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-30 15:58:44 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-30 15:58:44 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-30 15:58:44 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-30 15:58:44 [__init__.py:162] Confirmed CPU platform is available because vLLM is built with CPU.
DEBUG 04-30 15:58:44 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-30 15:58:44 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-30 15:58:44 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-30 15:58:44 [__init__.py:162] Confirmed CPU platform is available because vLLM is built with CPU.
INFO 04-30 15:58:44 [__init__.py:239] Automatically detected platform cpu.
/home/dxia/src/github.com/vllm-project/vllm/.venv/bin/vllm:33 in <module>
/home/dxia/src/github.com/vllm-project/vllm/.venv/bin/vllm:25 in importlib_load_entry_point
/home/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/cli/main.py:7 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/cli/benchmark/main.py:6 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/cli/benchmark/throughput.py:4 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/benchmarks/throughput.py:26 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/openai/api_server.py:42 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/entrypoints/launcher.py:19 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/v1/engine/__init__.py:14 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/v1/metrics/stats.py:7 in <module>
/home/dxia/src/github.com/vllm-project/vllm/vllm/v1/spec_decode/metrics.py:9 in <module>
DEBUG 04-30 15:58:47 [utils.py:136] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 04-30 15:58:47 [__init__.py:28] No plugins for group vllm.general_plugins found.
INFO 04-30 15:58:49 [api_server.py:1043] vLLM API server version 0.8.5.dev367+g77073c77b
INFO 04-30 15:58:49 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='TinyLlama/TinyLlama-1.1B-Chat-v1.0', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', task='auto', tokenizer=None, tokenizer_mode='auto', trust_remote_code=False, dtype='auto', seed=None, hf_config_path=None, allowed_local_media_path='', revision=None, code_revision=None, rope_scaling={}, rope_theta=None, tokenizer_revision=None, max_model_len=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, max_logprobs=20, disable_sliding_window=False, disable_cascade_attn=False, skip_tokenizer_init=False, served_model_name=None, disable_async_output_proc=False, config_format='auto', hf_token=None, hf_overrides={}, override_neuron_config={}, override_pooler_config=None, logits_processor_pattern=None, generation_config='auto', override_generation_config={}, enable_sleep_mode=False, model_impl='auto', load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, guided_decoding_backend='auto', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, reasoning_parser=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, use_v2_block_manager=True, disable_log_stats=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', compilation_config=None, kv_transfer_config=None, kv_events_config=None, worker_cls='auto', worker_extension_cls='', additional_config=None, enable_reasoning=False, 
disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x757594ecc720>)
WARNING 04-30 15:58:49 [utils.py:2267] Found ulimit of 32000 and failed to automatically increase with error current limit exceeds maximum limit. This can cause fd limit errors like `OSError: [Errno 24] Too many open files`. Consider increasing with ulimit -n
INFO 04-30 15:58:58 [config.py:749] This model supports multiple tasks: {'reward', 'score', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
WARNING 04-30 15:58:58 [arg_utils.py:1504] device type=cpu is not supported by the V1 Engine. Falling back to V0. 
INFO 04-30 15:58:58 [config.py:1835] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 04-30 15:58:58 [cpu.py:106] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 04-30 15:58:58 [cpu.py:119] uni is not supported on CPU, fallback to mp distributed executor backend.
DEBUG 04-30 15:58:58 [api_server.py:223] Multiprocessing frontend to use ipc:///tmp/596e461c-6840-4f8e-b1e4-0adc4ab3c52a for IPC Path.
INFO 04-30 15:58:58 [api_server.py:246] Started engine process with PID 181610
DEBUG 04-30 15:59:01 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-30 15:59:01 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-30 15:59:01 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-30 15:59:01 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-30 15:59:01 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
DEBUG 04-30 15:59:01 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
DEBUG 04-30 15:59:01 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-30 15:59:01 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-30 15:59:01 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-30 15:59:01 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-30 15:59:01 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-30 15:59:01 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-30 15:59:01 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-30 15:59:01 [__init__.py:162] Confirmed CPU platform is available because vLLM is built with CPU.
DEBUG 04-30 15:59:01 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-30 15:59:01 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-30 15:59:01 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-30 15:59:01 [__init__.py:162] Confirmed CPU platform is available because vLLM is built with CPU.
INFO 04-30 15:59:01 [__init__.py:239] Automatically detected platform cpu.
DEBUG 04-30 15:59:03 [__init__.py:28] No plugins for group vllm.general_plugins found.
INFO 04-30 15:59:03 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.dev367+g77073c77b) with config: model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', speculative_config=None, tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=TinyLlama/TinyLlama-1.1B-Chat-v1.0, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 04-30 15:59:05 [cpu.py:45] Using Torch SDPA backend.
DEBUG 04-30 15:59:05 [config.py:4262] enabled custom ops: Counter()
DEBUG 04-30 15:59:05 [config.py:4264] disabled custom ops: Counter()
DEBUG 04-30 15:59:05 [parallel_state.py:867] world_size=1 rank=0 local_rank=-1 distributed_init_method=tcp://10.172.0.153:54461 backend=gloo
INFO 04-30 15:59:05 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
DEBUG 04-30 15:59:05 [config.py:4262] enabled custom ops: Counter()
DEBUG 04-30 15:59:05 [config.py:4264] disabled custom ops: Counter()
DEBUG 04-30 15:59:05 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 04-30 15:59:05 [config.py:4262] enabled custom ops: Counter({'rms_norm': 45, 'silu_and_mul': 22, 'rotary_embedding': 1})
DEBUG 04-30 15:59:05 [config.py:4264] disabled custom ops: Counter()
INFO 04-30 15:59:05 [weight_utils.py:265] Using model weights format ['*.safetensors']
INFO 04-30 15:59:05 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
DEBUG 04-30 15:59:05 [utils.py:156] Loaded weight lm_head.weight with shape torch.Size([32000, 2048])
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.22it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.22it/s]
INFO 04-30 15:59:06 [loader.py:458] Loading weights took 0.54 seconds
INFO 04-30 15:59:06 [executor_base.py:112] # cpu blocks: 11915, # CPU blocks: 0
INFO 04-30 15:59:06 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 93.09x
INFO 04-30 15:59:06 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 0.61 seconds
DEBUG 04-30 15:59:06 [engine.py:155] Starting Startup Loop.
DEBUG 04-30 15:59:07 [engine.py:157] Starting Engine Loop.
DEBUG 04-30 15:59:07 [api_server.py:321] vLLM to use /tmp/tmpfgzcr5kf as PROMETHEUS_MULTIPROC_DIR
INFO 04-30 15:59:07 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-30 15:59:07 [launcher.py:28] Available routes are:
INFO 04-30 15:59:07 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 04-30 15:59:07 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 04-30 15:59:07 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-30 15:59:07 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 04-30 15:59:07 [launcher.py:36] Route: /health, Methods: GET
INFO 04-30 15:59:07 [launcher.py:36] Route: /load, Methods: GET
INFO 04-30 15:59:07 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 04-30 15:59:07 [launcher.py:36] Route: /version, Methods: GET
INFO 04-30 15:59:07 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /pooling, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /score, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /rerank, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /invocations, Methods: POST
INFO 04-30 15:59:07 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [181418]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
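To double-check the observation above without starting a server, one could verify that merely importing the CLI entrypoint pulls in `prometheus_client`. This is a sketch to run from a vLLM checkout; the module paths come from the stack trace above.

```python
# check_import_side_effect.py -- illustrative verification sketch
import sys

# Importing the CLI entrypoint mirrors what `vllm serve` does before any
# engine-version decision is made.
import vllm.entrypoints.cli.main  # noqa: F401

# If the chain shown in the stack trace above is active, prometheus_client
# is already loaded here, before the V0/V1 fallback is decided.
print("prometheus_client imported:", "prometheus_client" in sys.modules)
print("spec_decode metrics imported:",
      "vllm.v1.spec_decode.metrics" in sys.modules)
```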

Successfully merging this pull request may close this issue: [Bug]: http* metrics missing when running with V0 engine (#17406)