
Release v3.5.0

Released by @ScottTodd on 11 Jun 15:52 · 90 commits to main since this release · commit e8d9819

Highlights in this release

This release introduces key enhancements to inference performance and user experience. In particular, Llama 3.1 8B FP8 models show notable reductions in average inference latency on AMD Instinct MI300 accelerators: local tests demonstrate a 13% reduction in prefill latency for longer context inputs, and decode latency reductions of 33.7% for long contexts and 18% for short contexts, indicating more efficient performance across a range of prompt sizes.

On the UI side, SHARK-UI v0.2 brings usability improvements, better error handling, and foundational updates in preparation for upcoming features. More details are available in the full release notes below.

Llama Performance Enhancements

As part of ongoing IREE-based optimization efforts, Llama 3.1 8B FP8 models have shown measurable performance gains in single-device local testing on MI300 accelerators. The reported numbers are average inference latencies (in milliseconds) from a single local run of IREE-compiled models on the IREE runtime. Results include:

  • 13% reduction in prefill latency for longer context inputs.
  • 33.7% reduction in decode latency for longer contexts.
  • 18% reduction in decode latency for shorter contexts.

These results demonstrate promising single-device efficiency and are a step toward delivering optimized LLM inference on diverse hardware backends using IREE.
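
For readers who want to reproduce similar measurements, here is a minimal timing sketch. It is not the harness behind the numbers above; `run_prefill` and `run_decode` are hypothetical placeholders standing in for calls into the IREE-compiled model through the IREE runtime.

```python
import statistics
import time

def mean_latency_ms(fn, *args, warmup=3, iters=10):
    """Average wall-clock latency of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # discard warm-up iterations (compilation, caches)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples)

# Hypothetical stand-ins for the compiled Llama 3.1 8B FP8 entry points;
# a real harness would invoke the compiled prefill/decode functions
# through the IREE runtime instead.
def run_prefill(token_ids):
    pass

def run_decode(kv_cache):
    pass

long_prompt = list(range(2048))  # stand-in for a longer-context input
print(f"prefill: {mean_latency_ms(run_prefill, long_prompt):6.1f} ms")
print(f"decode:  {mean_latency_ms(run_decode, None):6.1f} ms")
```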

Figure 1: Inference latency comparison for Llama 3.1 8B FP8 on longer context inputs.

Figure 2: Inference latency comparison for Llama 3.1 8B FP8 on shorter context inputs.

Improvements in SHARK-UI

SHARK-UI v0.2 introduces usability improvements along with under-the-hood changes that prepare for upcoming features. See the SHARK-UI release notes for details.

New Contributors

Full changelog: v3.4.0...v3.5.0