sync : ggml #3071
Merged
Conversation
* add bf16 support
* use convert_from_bf16_cuda instead of convert_unary_cuda for f32
* revert 7ec5085
* move functionality into convert_unary with constexpr
* ggml : simplify Arm fp16 CPU logic
  ggml-ci
* cont : bring back CUDA/MUSA checks
  ggml-ci
* llama : add option to override tensor buffers
* ggml : fix possible underflow in ggml_nbytes
…llama/12559)

When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each; now we run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but a subsequent change will enable split_k, which will scale much better with 4x fewer workgroups.
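A minimal sketch of the workgroup arithmetic in the example above; the variable names are illustrative, not the actual shader dispatch code:

```cpp
// 32 Q heads share 8 K/V heads, so the grouped-query ratio is 4 and adjacent
// Q batches sharing the same K/V can be folded into one workgroup.
#include <cstdio>

int main() {
    const int n_q_heads  = 32; // q(128,1,32,1)
    const int n_kv_heads = 8;  // k(128,16640,8,1)
    const int gqa_ratio  = n_q_heads / n_kv_heads;        // 4 results per workgroup

    const int workgroups_before = n_q_heads;              // 32, one result each
    const int workgroups_after  = n_q_heads / gqa_ratio;  // 8, four results each

    printf("before: %d workgroups, after: %d workgroups\n",
           workgroups_before, workgroups_after);
    return 0;
}
```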
When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.
* CANN: Fix memory waste in aclnn_tensor
* CANN: fix backend ops fail
* CANN: fix acl_tensor memory alloc.
* CANN: format
* CANN: remove trailing whitespace
…s (llama/9017)

* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers

  Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to copy kernels associated with K and V cache pointers. This patch simplifies by using indirection to avoid such parameters frequently changing, avoiding the need for frequent graph updates.

  Fixes #12152

* Addressed comments
* fix HIP builds
* properly sync to stream
* removed ggml_cuda_cpy_fn_ptrs
* move stream sync before free
* guard to only use indirection with graphs
* style fixes
* check for errors

---------

Co-authored-by: slaren <[email protected]>
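A host-side sketch of the indirection idea, not the actual CUDA code: instead of baking the destination pointer into the kernel arguments (which changes every token and would force a graph update), the copy reads its destination through a fixed pointer-to-pointer that the host rewrites in place.

```cpp
#include <cstdio>

static char kv_cache_a[16];
static char kv_cache_b[16];

// stand-in for a copy kernel: the argument (dst_indirect) never changes,
// only the memory it points to does
void copy_kernel(char ** dst_indirect, const char * src, int n) {
    char * dst = *dst_indirect;          // resolved at "kernel" run time
    for (int i = 0; i < n; ++i) dst[i] = src[i];
}

int main() {
    char * dst_ptr = kv_cache_a;         // lives at a fixed address
    const char src[4] = {1, 2, 3, 4};

    copy_kernel(&dst_ptr, src, 4);       // captured once into the graph ...

    dst_ptr = kv_cache_b;                // ... host only updates the slot
    copy_kernel(&dst_ptr, src, 4);       // same arguments, new destination

    printf("a[0]=%d b[0]=%d\n", kv_cache_a[0], kv_cache_b[0]);
    return 0;
}
```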
* [CANN]support sin cos argmax
  Signed-off-by: noemotiovon <[email protected]>
* [CANN]codestyle adjustment
  Signed-off-by: noemotiovon <[email protected]>
* [CANN]Remove redundant code
  Signed-off-by: noemotiovon <[email protected]>

---------

Signed-off-by: noemotiovon <[email protected]>
Co-authored-by: noemotiovon <[email protected]>
* fix MUSA compiler warning
* replace (void) with GGML_UNUSED
* Prefer vector flash decoding kernel for Gemma models

  Vector flash decoding kernel was not being picked for models with head dimension 256. Gemma models are in this category. Removing this limit improves e2e performance by up to 12% in gen phase throughput for Gemma models.

* Update ggml/src/ggml-cuda/fattn.cu
  Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
…llama/12630)

There seems to be a bubble waking up from waitForFences, which costs a few percent of performance and also increases variance. This change inserts an "almost_ready" fence when the graph is about 80% complete; we waitForFences on the almost_ready fence and then spin (with _mm_pause) waiting for the final fence to be signaled.
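A CPU-side sketch of the two-stage wait, with std::atomic flags standing in for the Metal fences (the names and the x86 _mm_pause spin are assumptions for illustration, not the backend's actual synchronization objects):

```cpp
#include <atomic>
#include <thread>
#include <cstdio>
#include <immintrin.h>   // _mm_pause (x86)

std::atomic<bool> almost_ready{false};
std::atomic<bool> done{false};

void gpu_like_worker() {
    // ... roughly 80% of the graph executed ...
    almost_ready.store(true, std::memory_order_release);
    // ... remaining ~20% ...
    done.store(true, std::memory_order_release);
}

int main() {
    std::thread worker(gpu_like_worker);

    // cheap blocking wait for the coarse fence (stands in for waitForFences)
    while (!almost_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // then spin with pause hints for the short remaining window, avoiding the
    // wake-up bubble of a full blocking wait on the final fence
    while (!done.load(std::memory_order_acquire)) {
        _mm_pause();
    }

    worker.join();
    printf("graph complete\n");
    return 0;
}
```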
…2747) fixes error for compiler paths with spaces
…io project/solution (llama/12625)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.
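For reference, the padding is a simple round-up; GGML_PAD is the rounding macro from ggml.h, and the pad value below is illustrative (the real value comes from the llama.cpp headers):

```cpp
#include <cstdio>

#define GGML_PAD(x, n) (((x) + (n) - 1) / (n) * (n))
#define GGML_KQ_MASK_PAD 32   // illustrative value

int main() {
    const int n_tokens = 5;                                // nem1 before padding
    const int padded   = GGML_PAD(n_tokens, GGML_KQ_MASK_PAD);
    printf("mask rows: %d -> %d\n", n_tokens, padded);     // 5 -> 32
    return 0;
}
```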
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
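A common reason for this choice, shown in a minimal sketch (not the shader code itself): if every score in a row is masked to -inf, an -inf seed makes exp(x - max) evaluate exp(-inf - -inf) = NaN, whereas a large-but-finite seed keeps the arithmetic finite.

```cpp
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    const float scores[4] = {-INFINITY, -INFINITY, -INFINITY, -INFINITY}; // fully masked row

    float m = -FLT_MAX/2;              // initial value for the running maximum
    for (float s : scores) {
        m = fmaxf(m, s);
    }

    // with m finite, score - m stays -inf and exp(...) is 0, not NaN
    printf("max = %g, exp(score - max) = %g\n", m, expf(scores[0] - m));
    return 0;
}
```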
Signed-off-by: Xiaodong Ye <[email protected]>
* CANN: Refactor to reduce duplicate code
* CANN: fix review comment
…et_tensor (llama/12734)
…uffer_set_tensor" (llama/12812)

* Revert "sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_s…"
  This reverts commit 518a01480eb3a7c80a4951b430db9dee55428310.
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* Update ggml/src/ggml-sycl/ggml-sycl.cpp
* rm tail space
…/1183)

* ggml : add more generic ggml_custom op
* ggml : remove deprecated custom ops
* Add AVX512 implementation of GEMM - q4kx8
* Update changes to remove unnecessary whitespaces
* SYCL: Add ROPE vision kernel
* Add comment about rope mode
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
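A sketch of how an environment-variable switch like this is typically read at run time (the real check lives inside the CUDA/HIP backend; this standalone program only illustrates the pattern):

```cpp
#include <cstdlib>
#include <cstdio>

int main() {
    const char * env = std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY");
    const bool use_unified_memory = env != nullptr && env[0] == '1';
    printf("unified memory allocations: %s\n", use_unified_memory ? "on" : "off");
    return 0;
}
```

With this in place, setting GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in the environment before launching selects the behaviour at run time instead of at build time.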
* CANN: Add x86 build ci
* CANN: fix code format
…llama/12931)

The grouped query attention optimization doesn't require a power-of-two ratio; the only thing relying on it was the modulo operation written as a bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicate a larger FA operation that wouldn't need splitting.
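A sketch of the non-power-of-two fix described above (variable names are illustrative): the bitwise-AND form of the modulo is only correct when the ratio is a power of two, while the generic % form works for any ratio.

```cpp
#include <cstdio>

int main() {
    const int gqa_ratio = 6;               // no longer required to be 2^n
    for (int head = 0; head < 12; ++head) {
        // old: head & (gqa_ratio - 1)     // only correct when gqa_ratio is 2^n
        const int slot = head % gqa_ratio; // correct for any ratio
        printf("head %2d -> slot %d\n", head, slot);
    }
    return 0;
}
```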
Submit operators using asynchronous threads to improve performance. Use the environment variable GGML_CANN_ASYNC_MODE to control whether asynchronous submission is enabled. It is disabled by default. Testing shows a 10%–20% performance improvement in scenarios with small parameter sizes, especially in quantized models.
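A minimal sketch of the "submit on a worker thread" pattern, gated by the environment variable; everything other than the GGML_CANN_ASYNC_MODE name is hypothetical (the real implementation submits CANN operators, not lambdas):

```cpp
#include <cstdlib>
#include <cstdio>
#include <functional>
#include <queue>
#include <mutex>
#include <condition_variable>
#include <thread>

class async_submitter {
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool stop = false;
    std::thread worker;
public:
    async_submitter() : worker([this] {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [this] { return stop || !tasks.empty(); });
                if (stop && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task(); // issue the operator to the device from the worker thread
        }
    }) {}
    ~async_submitter() {
        { std::lock_guard<std::mutex> lock(mtx); stop = true; }
        cv.notify_one();
        worker.join(); // drains any remaining queued work before exiting
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mtx); tasks.push(std::move(task)); }
        cv.notify_one();
    }
};

int main() {
    const char * env = std::getenv("GGML_CANN_ASYNC_MODE");
    const bool async_mode = env != nullptr && env[0] == '1'; // disabled by default

    async_submitter submitter;
    auto launch_op = [] { printf("operator submitted\n"); };

    if (async_mode) {
        submitter.submit(launch_op); // enqueue; the worker thread issues it
    } else {
        launch_op();                 // synchronous path
    }
    return 0;
}
```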
…a/12953)

* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
  ggml-ci
* llama : minor naming updates
  ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
  ggml-ci
Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org. Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like #12465.
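A sketch of the kind of semantic-version handshake RPC_CMD_HELLO enables; the struct layout and compatibility policy below are illustrative, not the exact wire format used by the ggml RPC server.

```cpp
#include <cstdint>
#include <cstdio>

struct rpc_version {
    uint8_t major, minor, patch;
};

// per semver: breaking changes bump major, so client and server must agree on
// it; a newer server minor is expected to remain backward compatible
static bool rpc_compatible(const rpc_version & client, const rpc_version & server) {
    return client.major == server.major && client.minor <= server.minor;
}

int main() {
    const rpc_version client = {1, 2, 0};
    const rpc_version server = {1, 3, 1}; // as reported in the HELLO response

    if (!rpc_compatible(client, server)) {
        fprintf(stderr, "RPC protocol mismatch: client %d.%d.%d vs server %d.%d.%d\n",
                client.major, client.minor, client.patch,
                server.major, server.minor, server.patch);
        return 1;
    }
    printf("protocol OK\n");
    return 0;
}
```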
* SYCL: refactor move to a separate file
* Fix binbcast
* Remove duplicates
* fix include formatting
* fix typo
…2871)

* ggml : add SSE 4.2 variant for CPUs without AVX
* ggml : add x64 base ABI variant
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
* fix logic for RoPE support, CUDA graphs
* tune matmul for gcn
* this one is more power efficient
* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp
  Co-authored-by: 0cc4m <[email protected]>
* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <[email protected]>
…lama/13090) ggml-ci
…2886)

---------

Co-authored-by: Shangqing Gu <[email protected]>
ggml-ci