Tritonserver Fails to Start with TensorRT-LLM Backend with lookahead_decoding mode - Assertion Failure in lookaheadDecodingLayer.cpp #710

shaylapid · 2025-02-18T21:33:07Z

System Info

CPU architecture: x86_64
GPU: NVIDIA H100 80GB
TensorRT-LLM backend tag: v0.17.0
Container used: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
OS: Debian GNU/Linux 11 (bullseye)

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Build the model:

Start the container:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v </path/to/git/tensorrtllm_backend>:/tensorrtllm_backend \
    -v </path/to/engines>:/model/engine \
    -v <path/to/hf-checkpoint>:/model/src \
    nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

Quantize the model:

cd /tensorrtllm_backend/tensorrt_llm/examples/quantization;
python quantize.py \
    --model_dir /model/src  \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /model/build

Build:

trtllm-build \
    --checkpoint_dir /model/build \
    --output_dir /model/engine \
    --gpt_attention_plugin auto \
    --gemm_plugin fp8 \
    --gemm_swiglu_plugin fp8 \
    --low_latency_gemm_swiglu_plugin fp8 \
    --remove_input_padding enable \
    --context_fmha enable \
    --max_beam_width 1 \
    --max_num_tokens 1000 \
    --max_seq_len 250 \
    --max_input_len 200 \
    --max_batch_size 4 \
    --use_fused_mlp enable \
    --use_fp8_context_fmha enable \
    --use_paged_context_fmha enable \
    --speculative_decoding_mode lookahead_decoding \
    --max_draft_len 39

Adapt model repo:

Add the following to config.pbtxt:

parameters: {
  key: "decoding_mode"
  value: {
    string_value: "lookahead"
  }
}

Run with Tritonserver:

Start the container:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v <path/to/model>:/models \
    nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

Start tritonserver:

tritonserver --model-repository=/models

Expected behavior

Tritonserver should start successfully, and model inference should be available.

Actual behavior

Tritonserver fails to start with the following assertion error:

E0218 20:57:33.147956 130 model_lifecycle.cc:654] "failed to load 'tensorrt_llm_2beam' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: 16 != 40 (/workspace/tensorrt_llm/cpp/tensorrt_llm/layers/lookaheadDecodingLayer.cpp:56)
1 0x7ff34f6bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2 0x7ff34f9d890c tensorrt_llm::layers::LookaheadDecodingLayer<__half>::CpuAlgorithmResources::CpuAlgorithmResources(tensorrt_llm::layers::DecoderDomain const&) + 4396
3 0x7ff34f9d90c1 tensorrt_llm::layers::LookaheadDecodingLayer<__half>::LookaheadDecodingLayer(tensorrt_llm::layers::DecoderDomain const&, std::shared_ptr<tensorrt_llm::runtime::BufferManager>) + 241
4 0x7ff34f97e862 tensorrt_llm::layers::DecodingLayer<__half>::DecodingLayer(tensorrt_llm::executor::DecodingMode const&, tensorrt_llm::layers::DecoderDomain const&, std::shared_ptr<tensorrt_llm::runtime::BufferManager>) + 978
5 0x7ff34f994c88 tensorrt_llm::layers::DynamicDecodeLayer<__half>::initializeLayers() + 872
6 0x7ff34f995bf9 tensorrt_llm::layers::DynamicDecodeLayer<__half>::initialize() + 1321
7 0x7ff34f995dfa tensorrt_llm::layers::DynamicDecodeLayer<__half>::DynamicDecodeLayer(tensorrt_llm::executor::DecodingMode const&, tensorrt_llm::layers::DecoderDomain const&, std::shared_ptr<tensorrt_llm::runtime::BufferManager>) + 202
8 0x7ff34fa8da0b tensorrt_llm::runtime::GptDecoder<__half>::GptDecoder(tensorrt_llm::executor::DecodingMode const&, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::shared_ptr<tensorrt_llm::runtime::SpeculativeDecodingModule const>) + 603
9 0x7ff34fa9a1bc tensorrt_llm::runtime::GptDecoderBatched::setup(tensorrt_llm::executor::DecodingMode const&, int, int, int, int, int, int, nvinfer1::DataType, tensorrt_llm::runtime::ModelConfig const&) + 3372
10 0x7ff3504e7a99 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createDecoder(std::optional<tensorrt_llm::executor::DecodingMode> const&) + 825
11 0x7ff3504fdec0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3168
12 0x7ff350476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489
13 0x7ff350597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185
14 0x7ff3505979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229
15 0x7ff350598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474
16 0x7ff35057e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87
17 0x7ff5e803588e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7ff5e803588e]
18 0x7ff5e8032049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185
19 0x7ff5e8032592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
20 0x7ff5e801f929 TRITONBACKEND_ModelInstanceInitialize + 153
21 0x7ff5f6bd7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7ff5f6bd7649]
22 0x7ff5f6bd80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7ff5f6bd80d2]
23 0x7ff5f6bbdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7ff5f6bbdcf3]
24 0x7ff5f6bbe0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7ff5f6bbe0a4]
25 0x7ff5f6bc768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7ff5f6bc768d]
26 0x7ff5f6134ec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7ff5f6134ec3]
27 0x7ff5f6bb4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7ff5f6bb4f02]
28 0x7ff5f6bc2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7ff5f6bc2ddc]
29 0x7ff5f6bc6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7ff5f6bc6e12]
30 0x7ff5f6cc78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7ff5f6cc78e1]
31 0x7ff5f6ccac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7ff5f6ccac3c]
32 0x7ff5f6e27305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7ff5f6e27305]
33 0x7ff5f6391db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7ff5f6391db4]
34 0x7ff5f612fa94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7ff5f612fa94]
35 0x7ff5f61bca34 __clone + 68"
I0218 20:57:33.148431 130 model_lifecycle.cc:789] "failed to load 'tensorrt_llm_2beam'"
I0218 20:57:33.148569 130 server.cc:604]

Additional notes

Changing --max_draft_len to 15 allows Tritonserver to start, but this prevents selecting the desired max_draft_len value.


skyCreateXian · 2025-03-05T03:41:36Z

This appears to be caused by the missing lookahead parameters: the max_draft_len value is determined by the (W, N, G) parameters. Setting the decoding mode to lookahead alone is not enough to make it run; the lookahead decoding parameters need to be set as well.

1. Add the (W, N, G) parameters to config.pbtxt.
2. Modify the source code to read (W, N, G) and set the corresponding parameters, then recompile.
3. Enjoy the decoding speedup that lookahead brings.
In our tests, lookahead reduced latency by 57% on the qwen2-7B model. Because lookahead decoding involves no random sampling, its output equals the top k=1 result, so it can be considered fully aligned in terms of accuracy.
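As a rough illustration of how max_draft_len follows from (W, N, G), here is a minimal Python sketch. The formula is an assumption taken from the TensorRT-LLM lookahead example README and should be checked against the installed version; the function names are hypothetical helpers, not part of any library. The triple (4, 3, 4) is used only because it yields 15, the value reported to work above (15 + 1 = 16, matching the left-hand side of the assertion); the issue does not confirm that these are the runtime defaults.

```python
from itertools import product


def lookahead_max_draft_len(window_size: int, ngram_size: int,
                            verification_set_size: int) -> int:
    """Draft length implied by (W, N, G), using the assumed README formula."""
    if ngram_size == 1:
        return window_size - 1 + verification_set_size
    return (ngram_size - 2) + (window_size - 1 + verification_set_size) * (ngram_size - 1)


def configs_for_draft_len(target: int, limit: int = 8) -> list[tuple[int, int, int]]:
    """List (W, N, G) triples with each value in 1..limit that imply `target`."""
    return [
        (w, n, g)
        for w, n, g in product(range(1, limit + 1), repeat=3)
        if lookahead_max_draft_len(w, n, g) == target
    ]


if __name__ == "__main__":
    print(lookahead_max_draft_len(4, 3, 4))  # 15
    print(lookahead_max_draft_len(5, 5, 5))  # 39
    # Candidate (W, N, G) settings for an engine built with --max_draft_len 39
    print(configs_for_draft_len(39))
```

Under this assumption, an engine built with --max_draft_len 39 needs a runtime (W, N, G) such as (5, 5, 5), which is consistent with the suggestion that the decoding mode alone is not sufficient.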

shaylapid added the bug (Something isn't working) label on Feb 18, 2025
