Highlights in this release
This release introduces key enhancements to inference performance and user experience. In particular, Llama 3.1 8B FP8 models show notable reductions in average inference time on Instinct MI300 accelerators. Specifically, local tests demonstrate a 13% improvement in prefill latency for longer context inputs, and decode latency reductions of 33.7% for long contexts and 18% for short contexts, indicating more efficient performance across a range of prompt sizes.
On the UI side, SHARK-UI v0.2 brings usability improvements, better error handling, and foundational updates in preparation for upcoming features. More details are available in the full release notes below.
Llama Performance Enhancements
As part of ongoing IREE-based optimization efforts, Llama 3.1 8B FP8 models have shown measurable performance gains in single-device local testing on MI300s. The reported numbers represent the average inference delay (in milliseconds) from a single local run of IREE-compiled models using the IREE runtime. Results include:
- 13% improvement in prefill latency for longer context inputs.
- 33.7% reduction in decode latency for longer contexts.
- 18% reduction in decode latency for shorter contexts.
These local results are promising and mark a step toward delivering optimized LLM inference on diverse hardware backends using IREE.
Figure 1: Inference delay comparison for Llama3.1 8B FP8 on longer context inputs.
Figure 2: Inference delay comparison for Llama3.1 8B FP8 on shorter context inputs.
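The figures above compare averaged per-invocation latencies before and after the optimization work. For readers who want to gather comparable numbers, the minimal sketch below shows one way to average wall-clock latency around an already-loaded IREE invocation and turn two such measurements into the percentage reductions quoted above. The helper names, iteration counts, and the `run_fn` callable are illustrative assumptions, not part of this release.

```python
import statistics
import time

def average_latency_ms(run_fn, iters=16, warmup=2):
    """Average wall-clock latency in milliseconds of a callable.

    run_fn is assumed to be a zero-argument callable that performs one
    prefill or decode invocation of the IREE-compiled model (for example,
    a function loaded through the IREE runtime) and blocks until done.
    """
    for _ in range(warmup):
        run_fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

def percent_reduction(baseline_ms, optimized_ms):
    """Latency reduction as a percentage, as reported in the figures above."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100.0

# Illustrative numbers only: a 148 ms -> 98 ms decode latency drop is ~33.8%.
print(percent_reduction(148.0, 98.0))
```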
Improvements in SHARK-UI
SHARK-UI v0.2 introduces UX improvements along with under-the-hood changes that prepare for upcoming features. See the SHARK-UI release notes for more details!
New Contributors
- @oyazdanb made their first contribution in #1400
- @pravg-amd made their first contribution in #1419
- @vivekkhandelwal1 made their first contribution in #1452
Full changelog
- Update project license and drop classifer by @marbre in #1383
- Bump actions/create-github-app-token from 2.0.2 to 2.0.6 in the github-actions group by @dependabot in #1385
- [docs] Update command to install editable shortfin by @jinchen62 in #1370
- Switch to `actions/python-setup` by @marbre in #1387
- Bump version to 3.5.0 after 3.4.0 release. by @ScottTodd in #1389
- Sharktank LLM split argmax op by @stbaione in #1360
- [docs] Updating documentation for may 5th 2025 release by @Muzammiluddin-Syed-ECE in #1369
- Drop support for torch 2.3 by @sogartar in #1392
- Remove seq_len parameter from kv cache by @rsuderman in #1380
- Refactor pipeline parallelism work and fix block offset by @rsuderman in #1379
- Adding benchmark_client.py from demo branch by @pdhirajkumarprasad in #1395
- [sharktank] General implementation for trivially replicable ops by @sogartar in #1384
- Fixes for recent pipeline parallelism refactor by @Alex-Vasile in #1398
- [sharktank] Tensor-parallel MoE block by @sogartar in #1393
- Bump IREE requirement pins to 3.4.0rc20250505 by @shark-pr-automator in #1382
- Perplexity test for pipeline parallelized Llama 8B by @Alex-Vasile in #1401
- address kv_cache_quantizer issues in quark dataset importer by @dan-garvey in #1402
- Bump pyenv to v2.5.5 and Python to 3.12.10 by @marbre in #1390
- Fixed missing index by @Alex-Vasile in #1399
- improving topk by @oyazdanb in #1400
- Refactor IREE PPL Tests by @Alex-Vasile in #1406
- Improve performance of ops.mean by @oyazdanb in #1366
- Bump IREE requirement pins to 3.5.0rc20250507 by @shark-pr-automator in #1405
- Cleanup InferenceTensor.invert by @Alex-Vasile in #1411
- Refactor Torch PPL Tests by @Alex-Vasile in #1413
- [tuner] Improve support for convolutions by @rkayaith in #1407
- Add PPL path for TP2 sharded Llama 8B by @Alex-Vasile in #1409
- IREE Bump to `3.5.0rc20250509` + Disable/Enable async_caching in Shortfin by @stbaione in #1414
- [sharktank] Add reduce_scatter and split ops by @sogartar in #1420
- Restore separate threads for prefill and decode by @rsuderman in #1421
- Use numpy.argpartition instead of sfnp for select_top_k by @pravg-amd in #1419
- Bump transformers from 4.48.0 to 4.50.0 in /sharktank by @dependabot in #1339
- Bump IREE requirement pins to 3.5.0rc20250512 by @shark-pr-automator in #1425
- [shark Tank] Implementation of sharded_gather by @oyazdanb in #1415
- Modify the queue size to account for number of beams by @zphoenixrises in #1428
- Shortfin llm numpy postprocessing by @stbaione in #1430
- [sharktank] Refactor tokenizer and dump_gguf by @archana-ramalingam in #1423
- Re-enable quark parity test after bug fixes by @KyleHerndon in #1426
- [sharktank] Add exporting of sharded toy sized Resnet block IREE test data by @sogartar in #1033
- [sharktank] Add CLI tool for operations on models by @sogartar in #1212
- Make improved reservation scheduler with strobing by @rsuderman in #1432
- Decrease number of `await device` in Batcher by @rsuderman in #1437
- [sharktank] Refactor llm sharding by @archana-ramalingam in #1424
- llama4 - patching by @oyazdanb in #1445
- Add support for rejecting requests and making sure that the server is robust. by @zphoenixrises in #1438
- [Shortfin][LLM] Create a worker/fiber pool for CPU work by @vinayakdsci in #1436
- [SharkTank] improving compare_safetensors.py tool by @oyazdanb in #1453
- [sharktank] Refactor ffn_norm and add_residual from ffn blocks by @archana-ramalingam in #1433
- Fix relative import by @Alex-Vasile in #1460
- Provide useful error message for invalid sharding by @Alex-Vasile in #1461
- [SharkTank] patching_bug by @oyazdanb in #1462
- Complete overhaul of the KVCache by @rsuderman in #1404
- Normalize negative dim for reshard_split by @Alex-Vasile in #1466
- Add .device to InferenceTensor, matching torch.Tensor.device by @Alex-Vasile in #1469
- Bug fix in dump_gguf by @Alex-Vasile in #1471
- Add missing names to toy weights by @Alex-Vasile in #1470
- Shortfin llm gpu argmax indices by @stbaione in #1454
- [sharktank] Add chunk attention support for Llama 4 model by @vivekkhandelwal1 in #1452
- Deepseek tests by @Alex-Vasile in #1467
- Bump IREE requirement pins to 3.5.0rc20250513 by @shark-pr-automator in #1434
- Bump IREE requirement pins to 3.5.0rc20250520 by @shark-pr-automator in #1476
- Fix for pipeline parallelism example by @Alex-Vasile in #1427
- Add base case to pipeline_parallelize_theta by @Alex-Vasile in #1443
- Update meta package README by @marbre in #1485
- Bump IREE requirement pins to 3.5.0rc20250521 by @shark-pr-automator in #1492
- Shortfin gpu topk by @stbaione in #1474
- View and transpose operations for TensorScaledLayout quantized tensors by @KyleHerndon in #1364
- [sharktank] Add Latent attention changes and Deepseek config by @archana-ramalingam in #1486
- Reorder KV Cache contraction dimensions by @rsuderman in #1465
- Fix CLI script by @zphoenixrises in #1473
- Copy all device memory then slice / select by @rsuderman in #1488
- Add ttft and tpot to cli script by @zphoenixrises in #1475
- Fix KVCache for PP>1 TP=1 by @Alex-Vasile in #1484
- Change setup_cache to use existing placement info by @Alex-Vasile in #1464
- [sharktank] Add fixes for deepseek toy model by @archana-ramalingam in #1497
- [mlir_kernel] Use mlir_kernel for attention kernels by @Groverkss in #1477
- Bump IREE requirement pins to 3.5.0rc20250522 by @shark-pr-automator in #1503
- Fix InferenceTensor.to missing kwarg specified dtype by @Alex-Vasile in #1504
- [mlir_kernel] Add documentation and user guide by @Groverkss in #1502
- [sharktank] Handle tensor_regex and num_blocks in dump_gguf by @archana-ramalingam in #1501
- [sharktank] Add toy theta for Deepseek by @archana-ramalingam in #1500
- Fixed missing device param in KVCache by @Alex-Vasile in #1505
- [PagedAttention] Add mlir_kernel for KVCache reads by @Groverkss in #1481
- [sharktank] Add pipeline parallelism changes after cache refactor by @archana-ramalingam in #1498
- [sharktank] Add Deepseek sharding changes by @archana-ramalingam in #1499
- [sharktank] Enable Deepseek v3 in sharktank by @archana-ramalingam in #1256
- [shortfin] Fix extra fiber appends in FiberPool.return_fiber by @vinayakdsci in #1513
- Bump IREE requirement pins to 3.5.0rc20250523 by @shark-pr-automator in #1510
- Fix how fp8 attention quantizers are used by @KyleHerndon in #1494
- [sharktank] Fix module patching and safetensors comparison tool by @sogartar in #1506
- [RoPE] Cleanup RoPE implementation by @Groverkss in #1514
- [sharktank] add to/form properties methods in LlamaModelConfig by @sogartar in #1522
- [sharktank] add vocab_size to LlamaHParams by @sogartar in #1523
- [sharktank] xfail with support for regex matching in the error message by @sogartar in #1521
- move quantizer after linear layers by @dan-garvey in #1526
- Fix the MLA case for forward_decode by @KyleHerndon in #1528
- [LLM Server] Merge `Greedy` strategies into one class by @stbaione in #1529
- [RoPE] Only use llama3 scaling if requested by @Groverkss in #1527
- [sharktank] Add mixture of experts(moe) support for Llama 4 model by @vivekkhandelwal1 in #1491
- [sharktank] move and rename get_iree_compiler_flags by @sogartar in #1524
- Bump IREE requirement pins to 3.5.0rc20250528 by @shark-pr-automator in #1520
- Revert "[RoPE] Cleanup RoPE implementation (#1514)" by @stbaione in #1533
- [sharktank] Add toy Deepseek tensor-parallel model comparison of IREE vs eager by @sogartar in #1525
- Bump IREE requirement pins to 3.5.0rc20250530 by @shark-pr-automator in #1537
- Fix _maximally_negative_value to return correct datatype by @Alex-Vasile in #1539
- [sharktank] Fix unused scale arg for paged attention by @Alex-Vasile in #1538
- [sharktank] Fix which prefill logits IREE perplexity calculations are using by @Alex-Vasile in #1535
- [tuner] create interface for constraint generation by @bangtianliu in #1511
- Add a cacheing allocator for avoiding reallocations by @rsuderman in #1519
- Add --use-attention-mask flag to eager mode by @Alex-Vasile in #1542
- Add missing device arg to _maximally_negative_value by @Alex-Vasile in #1543
- [sharktank] Add toy deepseek eager perplexity test by @archana-ramalingam in #1531
- Revert "Add a cacheing allocator for avoiding reallocations (#1519)" by @stbaione in #1548
- [sharktank] Add toy Deepseek IREE Perplexity Tests by @Alex-Vasile in #1507
- remove redundant object by @dan-garvey in #1552
- [sharktank] in device creation add arg to specify tensor parallelism by @sogartar in #1554
- [shortfin] Fix runner label for nightly workflow. by @ScottTodd in #1556
- Reland caching allocator for avoiding reallocations (#1519) by @rsuderman in #1561
- [sharktank] Fix deepseek sharding by @archana-ramalingam in #1547
- shortfin_apps.sd: Updates artifacts version for sdxl vmfbs, adds gfx1201 flagfile by @monorimet in #1563
- [sharktank] Make attention kernel functions private by @IanWood1 in #1565
- Bump IREE requirement pins to 3.5.0rc20250602 by @shark-pr-automator in #1553
- [sharktank] do not use scatter_add in MoE to improve performance by @sogartar in #1570
- [sharktank] improve doc for trace_tensor op by @sogartar in #1569
- First pass implementation of hsaco topk kernel integration by @rsuderman in #1574
- Hookup `topk` to `iree_linalg_ext.topk` by @rsuderman in #1575
- Bump IREE requirement pins to 3.5.0rc20250604 by @shark-pr-automator in #1571
- Re-enable tests by @Alex-Vasile in #1572
- [sharktank] In xfail use re.search to be consistent with pytest.raises by @sogartar in #1573
- [sharktank] Fix which prefill logits eager perplexity calculations are using by @Alex-Vasile in #1540
- Bump IREE requirement pins to 3.5.0rc20250605 by @shark-pr-automator in #1579
- [sharktank] more tensor tracing doc for auto submodule key prefixing by @sogartar in #1578
- Remove `argmax` and `topk` functions from `PagedLlmModelV1` by @stbaione in #1536
- [shortfin][LLM] Lock FiberPool when index_queue is modified by @vinayakdsci in #1582
- Rework topk kernel and hip code #1577 by @rsuderman in #1584
- Add multiple version of topk kernel for {fp16/fp32} by @rsuderman in #1585
- Update transformers version to 4.52.4 by @vivekkhandelwal1 in #1581
- Re-enable and fix rotary.py by @rsuderman in #1534
- Change perplexity logit padding to align logits with tokens by @Alex-Vasile in #1588
- [sharktank] Fix pipeline parallelism perplexity regression for toy deepseek by @archana-ramalingam in #1545
- Bump IREE requirement pins to 3.5.0rc20250606 by @shark-pr-automator in #1587
Commit history: v3.4.0...v3.5.0