Highlights in this release
This release introduces key enhancements to inference performance and user experience. In particular, Llama 3.1 8B FP8 models show notable reductions in average inference time on Instinct MI300 accelerators. Specifically, local tests demonstrate a 13% improvement in prefill latency for longer context inputs, and decode latency reductions of 33.7% for long contexts and 18% for short contexts, indicating more efficient performance across a range of prompt sizes.
On the UI side, SHARK-UI v0.2 brings usability improvements, better error handling, and foundational updates in preparation for upcoming features. More details are available in the full release notes below.
Llama Performance Enhancements
As part of ongoing IREE-based optimization efforts, Llama 3.1 8B FP8 models have shown measurable performance gains in single-device local testing on MI300s. The reported numbers represent the average inference delay (in milliseconds) from a single local run of IREE-compiled models using the IREE runtime. Results include:
- 13% improvement in prefill latency for longer context inputs.
- 33.7% reduction in decode latency for longer contexts.
- 18% reduction in decode latency for shorter contexts.
These local results are promising and mark a step toward delivering optimized LLM inference on diverse hardware backends using IREE.
Figure 1: Inference delay comparison for Llama3.1 8B FP8 on longer context inputs.
Figure 2: Inference delay comparison for Llama3.1 8B FP8 on shorter context inputs.
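The figures above compare averaged per-invocation latencies before and after the optimization work. For readers who want to gather comparable numbers, the minimal sketch below shows one way to average wall-clock latency around an already-loaded IREE invocation and turn two such measurements into the percentage reductions quoted above. The helper names, iteration counts, and the `run_fn` callable are illustrative assumptions, not part of this release.

```python
import statistics
import time

def average_latency_ms(run_fn, iters=16, warmup=2):
    """Average wall-clock latency in milliseconds of a callable.

    run_fn is assumed to be a zero-argument callable that performs one
    prefill or decode invocation of the IREE-compiled model (for example,
    a function loaded through the IREE runtime) and blocks until done.
    """
    for _ in range(warmup):
        run_fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

def percent_reduction(baseline_ms, optimized_ms):
    """Latency reduction as a percentage, as reported in the figures above."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100.0

# Illustrative numbers only: a 148 ms -> 98 ms decode latency drop is ~33.8%.
print(percent_reduction(148.0, 98.0))
```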
Improvements in SHARK-UI
SHARK-UI v0.2 introduces UX improvements along with under-the-hood changes that prepare for upcoming features. See the SHARK-UI release notes for more details!
New Contributors
- @oyazdanb made their first contribution in #1400
- @pravg-amd made their first contribution in #1419
- @vivekkhandelwal1 made their first contribution in #1452
Full changelog
- Update project license and drop classifer by @marbre in #1383
- Bump actions/create-github-app-token from 2.0.2 to 2.0.6 in the github-actions group by @dependabot in #1385
- [docs] Update command to install editable shortfin by @jinchen62 in #1370
- Switch to `actions/python-setup` by @marbre in #1387
- Bump version to 3.5.0 after 3.4.0 release. by @ScottTodd in #1389
- Sharktank LLM split argmax op by @stbaione in #1360
- [docs] Updating documentation for may 5th 2025 release by @Muzammiluddin-Syed-ECE in #1369
- Drop support for torch 2.3 by @sogartar in #1392
- Remove seq_len parameter from kv cache by @rsuderman in #1380
- Refactor pipeline parallelism work and fix block offset by @rsuderman in #1379
- Adding benchmark_client.py from demo branch by @pdhirajkumarprasad in #1395
- [sharktank] General implementation for trivially replicable ops by @sogartar in #1384
- Fixes for recent pipeline parallelism refactor by @Alex-Vasile in #1398
- [sharktank] Tensor-parallel MoE block by @sogartar in #1393
- Bump IREE requirement pins to 3.4.0rc20250505 by @shark-pr-automator in #1382
- Perplexity test for pipeline parallelized Llama 8B by @Alex-Vasile in #1401
- address kv_cache_quantizer issues in quark dataset importer by @dan-garvey in #1402
- Bump pyenv to v2.5.5 and Python to 3.12.10 by @marbre in #1390
- Fixed missing index by @Alex-Vasile in #1399
- improving topk by @oyazdanb in #1400
- Refactor IREE PPL Tests by @Alex-Vasile in #1406
- Improve performance of ops.mean by @oyazdanb in #1366
- Bump IREE requirement pins to 3.5.0rc20250507 by @shark-pr-automator in #1405
- Cleanup InferenceTensor.invert by @Alex-Vasile in #1411
- Refactor Torch PPL Tests by @Alex-Vasile in #1413
- [tuner] Improve support for convolutions by @rkayaith in #1407
- Add PPL path for TP2 sharded Llama 8B by @Alex-Vasile in #1409
- IREE Bump to `3.5.0rc20250509` + Disable/Enable async_caching in Shortfin by @stbaione in #1414
- [sharktank] Add reduce_scatter and split ops by @sogartar in #1420
- Restore separate threads for prefill and decode by @rsuderman in #1421
- Use numpy.argpartition instead of sfnp for select_top_k by @pravg-amd in #1419
- Bump transformers from 4.48.0 to 4.50.0 in /sharktank by @dependabot in #1339
- Bump IREE requirement pins to 3.5.0rc20250512 by @shark-pr-automator in #1425
- [shark Tank] Implementation of sharded_gather by @oyazdanb in #1415
- Modify the queue size to account for number of beams by @zphoenixrises in #1428
- Shortfin llm numpy postprocessing by @stbaione in #1430
- [sharktank] Refactor tokenizer and dump_gguf by @archana-ramalingam in #1423
- Re-enable quark parity test after bug fixes by @KyleHerndon in #1426
- [sharktank] Add exporting of sharded toy sized Resnet block IREE test data by @sogartar in #1033
- [sharktank] Add CLI tool for operations on models by @sogartar in #1212
- Make improved reservation scheduler with strobing by @rsuderman in #1432
- Decrease number of `await device` in Batcher by @rsuderman in #1437
- [sharktank] Refactor llm sharding by @archana-ramalingam in #1424
- llama4 - patching by @oyazdanb in #1445
- Add support for rejecting requests and making sure that the server is robust. by @zphoenixrises in #1438
- [Shortfin][LLM] Create a worker/fiber pool for CPU work by @vinayakdsci in #1436
- [SharkTank] improving compare_safetensors.py tool by @oyazdanb in #1453
- [sharktank] Refactor ffn_norm and add_residual from ffn blocks by @archana-ramalingam in #1433
- Fix relative import by @Alex-Vasile in #1460
- Provide useful error message for invalid sharding by @Alex-Vasile in #1461
- [SharkTank] patching_bug by @oyazdanb in #1462
- Complete overhaul of the KVCache by @rsuderman in #1404
- Normalize negative dim for reshard_split by @Alex-Vasile in #1466
- Add .device to InferenceTensor, matching torch.Tensor.device by @Alex-Vasile in #1469
- Bug fix in dump_gguf by @Alex-Vasile in #1471
- Add missing names to toy weights by @Alex-Vasile in #1470
- Shortfin llm gpu argmax indices by @stbaione in #1454
- [sharktank] Add chunk attention support for Llama 4 model by @vivekkhandelwal1 in #1452
- Deepseek tests by @Alex-Vasile in #1467
- Bump IREE requirement pins to 3.5.0rc20250513 by @shark-pr-automator in #1434
- Bump IREE requirement pins to 3.5.0rc20250520 by @shark-pr-automator in #1476
- Fix for pipeline parallelism example by @Alex-Vasile in #1427
- Add base case to pipeline_parallelize_theta by @Alex-Vasile in #1443
- Update meta package README by @marbre in #1485
- Bump IREE requirement pins to 3.5.0rc20250521 by @shark-pr-automator in #1492
- Shortfin gpu topk by @stbaione in #1474
- View and transpose operations for TensorScaledLayout quantized tensors by @KyleHerndon in #1364
- [sharktank] Add Latent attention changes and Deepseek config by @archana-ramalingam in #1486
- Reorder KV Cache contraction dimensions by @rsuderman in #1465
- Fix CLI script by @zphoenixrises in #1473
- Copy all device memory then slice / select by @rsuderman in #1488
- Add ttft and tpot to cli script by @zphoenixrises in #1475
- Fix KVCache for PP>1 TP=1 by @Alex-Vasile in #1484
- Change setup_cache to use existing placement info by @Alex-Vasile in #1464
- [sharktank] Add fixes for deepseek toy model by @archana-ramalingam in #1497
- [mlir_kernel] Use mlir_kernel for attention kernels by @Groverkss in #1477
- Bump IREE requirement pins to 3.5.0rc20250522 by @shark-pr-automator in #1503
- Fix InferenceTensor.to missing kwarg specified dtype by @Alex-Vasile in #1504
- [mlir_kernel] Add documentation and user guide by @Groverkss in #1502
- [sharktank] Handle tensor_regex and num_blocks in dump_gguf by @archana-ramalingam in #1501
- [sharktank] Add toy theta for Deepseek by @archana-ramalingam in #1500
- Fixed missing device param in KVCache by @Alex-Vasile in #1505
- [PagedAttention] Add mlir_kernel for KVCache reads by @Groverkss in #1481
- [sharktank] Add pipeline parallelism changes after cache refactor by @archana-ramalingam in #1498
- [sharktank] Add Deepseek sharding changes by @archana-ramalingam in #1499
- [sharktank] Enable Deepseek v3 in sharktank by @archana-ramalingam in #1256
- [shortfin] Fix extra fiber appends in FiberPool.return_fiber by @vinayakdsci in #1513
- Bump IREE requirement pins to 3.5.0rc20250523 by @shark-pr-automator in #1510
- Fix how fp8 attention quantizers are used by @KyleHerndon in #1494
- [sharktank] Fix module patching and safetensors comparison tool by @sogartar in #1506
- [RoPE] Cleanup RoPE implementation by @Groverkss in #1514
- [sharktank] add to/form properties methods in LlamaModelConfig by @sogartar in #1522
- [sharktank] add vocab_size to LlamaHParams by @sogartar in #1523
- [sharktank] xfail with support for regex matching in the error message by @sogartar in #1521
- move quantizer after linear layers by @dan-garvey in #1526
- Fix the MLA case for forward_decode by @KyleHerndon in #1528
- [LLM Server] Merge `Greedy` strategies into one class by @stbaione in #1529
- [RoPE] Only use llama3 scaling if requested by @Groverkss in #1527
- [sharktank] Add mixture of experts(moe) support for Llama 4 model by @vivekkhandelwal1 in #1491
- [sharktank] move and rename get_iree_compiler_flags by @sogartar in #1524
- Bump IREE requirement pins to 3.5.0rc20250528 by @shark-pr-automator in #1520
- Revert "[RoPE] Cleanup RoPE implementation (#1514)" by @stbaione in #1533
- [sharktank] Add toy Deepseek tensor-parallel model comparison of IREE vs eager by @sogartar in #1525
- Bump IREE requirement pins to 3.5.0rc20250530 by @shark-pr-automator in #1537
- Fix _maximally_negative_value to return correct datatype by @Alex-Vasile in #1539
- [sharktank] Fix unused scale arg for paged attention by @Alex-Vasile in #1538
- [sharktank] Fix which prefill logits IREE perplexity calculations are using by @Alex-Vasile in #1535
- [tuner] create interface for constraint generation by @bangtianliu in #1511
- Add a cacheing allocator for avoiding reallocations by @rsuderman in #1519
- Add --use-attention-mask flag to eager mode by @Alex-Vasile in #1542
- Add missing device arg to _maximally_negative_value by @Alex-Vasile in #1543
- [sharktank] Add toy deepseek eager perplexity test by @archana-ramalingam in #1531
- Revert "Add a cacheing allocator for avoiding reallocations (#1519)" by @stbaione in #1548
- [sharktank] Add toy Deepseek IREE Perplexity Tests by @Alex-Vasile in #1507
- remove redundant object by @dan-garvey in #1552
- [sharktank] in device creation add arg to specify tensor parallelism by @sogartar in #1554
- [shortfin] Fix runner label for nightly workflow. by @ScottTodd in #1556
- Reland caching allocator for avoiding reallocations (#1519) by @rsuderman in #1561
- [sharktank] Fix deepseek sharding by @archana-ramalingam in #1547
- shortfin_apps.sd: Updates artifacts version for sdxl vmfbs, adds gfx1201 flagfile by @monorimet in #1563
- [sharktank] Make attention kernel functions private by @IanWood1 in #1565
- Bump IREE requirement pins to 3.5.0rc20250602 by @shark-pr-automator in #1553
- [sharktank] do not use scatter_add in MoE to improve performance by @sogartar in #1570
- [sharktank] improve doc for trace_tensor op by @sogartar in #1569
- First pass implementation of hsaco topk kernel integration by @rsuderman in #1574
- Hookup `topk` to `iree_linalg_ext.topk` by @rsuderman in #1575
- Bump IREE requirement pins to 3.5.0rc20250604 by @shark-pr-automator in #1571
- Re-enable tests by @Alex-Vasile in #1572
- [sharktank] In xfail use re.search to be consistent with pytest.raises by @sogartar in #1573
- [sharktank] Fix which prefill logits eager perplexity calculations are using by @Alex-Vasile in #1540
- Bump IREE requirement pins to 3.5.0rc20250605 by @shark-pr-automator in #1579
- [sharktank] more tensor tracing doc for auto submodule key prefixing by @sogartar in #1578
- Remove `argmax` and `topk` functions from `PagedLlmModelV1` by @stbaione in #1536
- [shortfin][LLM] Lock FiberPool when index_queue is modified by @vinayakdsci in #1582
- Rework topk kernel and hip code #1577 by @rsuderman in #1584
- Add multiple version of topk kernel for {fp16/fp32} by @rsuderman in #1585
- Update transformers version to 4.52.4 by @vivekkhandelwal1 in #1581
- Re-enable and fix rotary.py by @rsuderman in #1534
- Change perplexity logit padding to align logits with tokens by @Alex-Vasile in #1588
- [sharktank] Fix pipeline parallelism perplexity regression for toy deepseek by @archana-ramalingam in #1545
- Bump IREE requirement pins to 3.5.0rc20250606 by @shark-pr-automator in #1587
Commit history: v3.4.0...v3.5.0