
sync : ggml #2342

Merged

merged 70 commits into master from sync on Aug 8, 2024
Conversation

ggerganov
Member

No description provided.

ggerganov and others added 30 commits August 8, 2024 14:09
…893)

This prevents invalid frees when destroying a partially initialized
vk_buffer_struct. For example, this could happen in ggml_vk_create_buffer
when running out of device memory.

Co-authored-by: Tony Wasserka <[email protected]>
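A minimal C sketch of the pattern behind this fix, with hypothetical names standing in for vk_buffer_struct: zero-initialize the struct up front so the cleanup path can safely run even when creation failed halfway.

```c
#include <stdlib.h>

// hypothetical stand-in for vk_buffer_struct; members are illustrative only
typedef struct {
    void * device_memory; // a VkDeviceMemory handle in the real code
    void * host_mapping;
} buffer_sketch;

static void buffer_destroy(buffer_sketch * buf) {
    // safe even after a half-finished create: unset members are still NULL,
    // and freeing a NULL pointer (or destroying a null handle) is a no-op
    free(buf->host_mapping);
    free(buf->device_memory);
    buf->host_mapping  = NULL;
    buf->device_memory = NULL;
}

static int buffer_create(buffer_sketch * buf, size_t size) {
    *buf = (buffer_sketch){0}; // zero-init first, so destroy is always valid
    buf->device_memory = malloc(size);
    if (buf->device_memory == NULL) return -1; // e.g. out of device memory
    buf->host_mapping = malloc(size);
    if (buf->host_mapping == NULL) { buffer_destroy(buf); return -1; }
    return 0;
}
```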
…ml/895)

* Add support for float16 tensors in 1d pooling operations

* Add support for float16 input tensors in 2d pooling operations

* code cleanup

remove unnecessary casting during srow ptr initialization

---------

Co-authored-by: vanaka11 <[email protected]>
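A hedged sketch of what "float16 input support" for a pooling op typically looks like, not ggml's actual code: read the source row in either precision and accumulate in fp32. The f16_to_f32 helper is a hand-rolled stand-in for ggml's GGML_FP16_TO_FP32.

```c
#include <math.h>
#include <stdint.h>

// hand-rolled IEEE half -> float conversion, a stand-in for ggml's
// GGML_FP16_TO_FP32 (handles normals, subnormals, inf, and NaN)
static float f16_to_f32(uint16_t h) {
    const int sign = (h >> 15) & 1;
    const int exp  = (h >> 10) & 0x1f;
    const int man  =  h        & 0x3ff;
    float v;
    if      (exp == 0)  v = ldexpf((float) man, -24);               // subnormal
    else if (exp == 31) v = man ? NAN : INFINITY;
    else                v = ldexpf((float)(man | 0x400), exp - 25); // normal
    return sign ? -v : v;
}

// shape of the change: one pooling loop that accepts either source type and
// accumulates in fp32 (illustrative; the real loop also handles strides)
static void avg_pool_1d_row(const void * srow, int src_is_f16,
                            int n, int k, float * dst) {
    for (int i = 0; i + k <= n; i += k) {
        float sum = 0.0f;
        for (int j = 0; j < k; ++j) {
            sum += src_is_f16
                ? f16_to_f32(((const uint16_t *) srow)[i + j])
                : ((const float *) srow)[i + j];
        }
        dst[i / k] = sum / (float) k;
    }
}
```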
Apply a loop tiling technique to the generic path, which provides
performance upside for ISAs with enough registers to take advantage
of it. Also helps the compiler optimize this path.
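For readers unfamiliar with the technique, here is a generic loop-tiled GEMM inner kernel in the spirit of the change (illustrative only, not the PR's code): the accumulator tile stays live across the whole K loop, so a machine with enough registers never spills it.

```c
// A is M x K row-major; B is N x K row-major (pre-transposed, dot-product
// formulation); remainder rows/columns are omitted for brevity
enum { TM = 4, TN = 4 };

static void gemm_tiled(int M, int N, int K,
                       const float * A, const float * B, float * C) {
    for (int i = 0; i + TM <= M; i += TM) {
        for (int j = 0; j + TN <= N; j += TN) {
            // the TM x TN accumulator tile is reused across all of K,
            // letting the compiler keep it entirely in registers
            float acc[TM][TN] = {{0.0f}};
            for (int k = 0; k < K; ++k) {
                for (int ii = 0; ii < TM; ++ii) {
                    for (int jj = 0; jj < TN; ++jj) {
                        acc[ii][jj] += A[(i+ii)*K + k] * B[(j+jj)*K + k];
                    }
                }
            }
            for (int ii = 0; ii < TM; ++ii) {
                for (int jj = 0; jj < TN; ++jj) {
                    C[(i+ii)*N + (j+jj)] = acc[ii][jj];
                }
            }
        }
    }
}
```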
* SYCL : Reenabled mmvq path for the SYCL Nvidia Backend

* Reduced verbosity of comment
* Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add optimized GEMV and GEMM asm kernels for q4_0_q8_0 quantization and refactor code to address llama.cpp pr#5780 suggestions

* Arm AArch64: add copyright claim only to ggml-aarch64.cpp and ggml-aarch64.h files

* Arm AArch64: minor code refactoring for rebase

* Arm AArch64: minor code refactoring for resolving a build issue with cmake

* Arm AArch64: minor code refactoring to split the Q4_0_AARC64 type into three separate types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code change for resolving a build issue with server-windows

* retrigger checks

* Arm AArch64: minor code changes for rebase

* Arm AArch64: minor changes to skip the pr#7433 vec_dot code for arm cpus with SVE VL not equal to 256 bits

* Arm AArch64: remove stale LLAMA_QKK_64 from CMakeLists.txt and delete build.zig

* Arm AArch64: add reference scalar gemm and gemv, and avoid dynamic memory allocations during quantization for Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: add multithreaded quantization support for the new types: Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8

* Arm AArch64: minor code refactoring

* Arm AArch64: simplify logic for calling gemm and gemv functions in ggml_compute_forward_mul_mat

* Arm AArch64: minimize changes in ggml_compute_forward_mul_mat

* Arm AArch64: minor code refactoring, and add reference scalar code to quantize routines for new quant types

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* Arm AArch64: minor code refactoring

* rebase on the latest master commit 3fd62a6 and adapt to the new directory structure

* Arm AArch64: remove a redundant comment

* Arm AArch64: add pragma in ggml-aarch64.c to turn -Woverlength-strings warning off

* Arm AArch64: use __aarch64__ check to guard 64-bit neon kernels

* Arm AArch64: update docs/build.md README to include compile time flags for building the Q4_0_4_4 quant type
* CUDA: optimize and refactor MMQ

* explicit q8_1 memory layouts, add documentation
* cuda : suppress 'noreturn' warn in no_device_code

This commit adds a while(true) loop to the no_device_code function in
common.cuh. This is done to suppress the warning:

```console
/src/ggml-cuda/template-instances/../common.cuh:346:1: warning:
function declared 'noreturn' should not return [-Winvalid-noreturn]
  346 | }
      | ^
```

The motivation for this is to reduce the number of warnings when
compiling with GGML_HIPBLAS=ON.

Signed-off-by: Daniel Bevenius <[email protected]>

* squash! cuda : suppress 'noreturn' warn in no_device_code

Update __trap macro instead of using a while loop to suppress the
warning.

Signed-off-by: Daniel Bevenius <[email protected]>

---------

Signed-off-by: Daniel Bevenius <[email protected]>
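A standalone C sketch of the issue and both fixes, for context (names are hypothetical; the real change updated the `__trap` macro on the HIP path):

```c
#include <stdlib.h>

// why clang warns: a 'noreturn' function whose body can fall off the end
// triggers -Winvalid-noreturn; either fix makes the end unreachable
__attribute__((noreturn)) static void no_device_code_sketch(void) {
    // first attempt: spin forever, so control can never reach the end
    // for (;;) { }

    // final approach: end with a call the compiler already knows is
    // noreturn, which also silences the warning
    abort();
}
```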
* ggml : add NVPL BLAS support

* ggml : replace `<BLASLIB>_ENABLE_CBLAS` with `GGML_BLAS_USE_<BLASLIB>`

---------

Co-authored-by: ntukanov <[email protected]>
* fix part of mul_mat_id

* skip the bfloat 16 sycl ut

Signed-off-by: Chen Xi <[email protected]>

---------

Signed-off-by: Chen Xi <[email protected]>
Co-authored-by: Meng, Hengyu <[email protected]>
Co-authored-by: Chen Xi <[email protected]>
* ggml : minor naming changes

ggml-ci

* ggml : use PRId64 [no ci]

* ggml : revert FA K/Q names
* Add Vulkan to CMake pkg

* Add Sycl to CMake pkg

* Add OpenMP to CMake pkg

* Split generated shader file into separate translation unit

* Add CMake target for Vulkan shaders

* Update README.md

* Add make target for Vulkan shaders

* Use pkg-config to locate vulkan library

* Add vulkan SDK dep to ubuntu-22-cmake-vulkan workflow

* Clean up tabs

* Move sudo to apt-key invocation

* Forward GGML_EXTRA_LIBS to CMake config pkg

* Update vulkan obj file paths

* Add shaderc to nix pkg

* Add python3 to Vulkan nix build

* Link against ggml in cmake pkg

* Remove Python dependency from Vulkan build

* code review changes

* Remove trailing newline

* Add cflags from pkg-config to fix w64devkit build

* Update README.md

* Remove trailing whitespace

* Update README.md

* Remove trailing whitespace

* Fix doc heading

* Make glslc required Vulkan component

* remove clblast from nix pkg
* Fix incoherence by adding missing LOAD_VEC_A parameter

* Fix Vulkan op result checker build error
* add concat through dim 1/2
* lora: load to device buft

* add patch tensor function

* correct tensor patch

* llama_lora_adapter_apply

* correct ggml_backend_tensor_copy

* add llm_build_mm

* fix auto merge

* update based on review comments

* add convert script

* no more transpose A

* add f16 convert

* add metadata check

* add sanity check

* fix ftype

* add requirements

* fix requirements

* fix outfile

* conversion: only allow selected models

* fix types

* cuda : do not use dmmv if the tensor does not have enough cols

* llama : lora fixes

* do not disable mmap with lora

Co-authored-by: slaren <[email protected]>

* llm_build_lora_mm_id

* convert_lora : MoE LoRA conversion support

* convert_lora : prefer safetensors, similarly to convert_hf

* convert_hf : simplify modify_tensors for InternLM2

* convert_lora : lazy conversion

* llama : load and use alpha from LoRA adapters

* llama : use llm_build_lora_mm in most model graphs

* auto scale

* Revert "auto scale"

This reverts commit 42415a4874e0f963e4aca6796ea5dfb97cd17464.

* remove redundant params

* Apply suggestions from code review

Co-authored-by: slaren <[email protected]>

* change kv metadata

* move add_type to __init__

* convert_hf : move add_type to main()

* convert_lora : use the GGUFWriter from Model instead of overwriting it

---------

Co-authored-by: slaren <[email protected]>
Co-authored-by: Francis Couture-Harpin <[email protected]>
* [CANN] Add Ascend NPU backend

Ascend is a full-stack AI computing infrastructure for industry
applications and services based on Huawei Ascend processors and
software.

CANN (Compute Architecture of Neural Networks), developed by
Huawei, is a heterogeneous computing architecture for AI.

Co-authored-by: wangshuai09 <[email protected]>

* delete trailing whitespaces

* Modify the code based on review comment

* Rename LLAMA_CANN to GGML_CANN

* Make ggml-common.h private

* add ggml_cann prefix for acl funcs

* Add logging for CANN backend

* Delete Trailing whitespace

---------

Co-authored-by: wangshuai09 <[email protected]>
* Add additional error information when model files fail to load.

* Adding additional error information to most instances of fopen.
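A minimal sketch of the kind of reporting this adds (illustrative, not the PR's exact code): include the path and the errno description instead of failing silently.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

// hedged sketch: report which file failed to open and why
static FILE * open_model_file(const char * path) {
    FILE * f = fopen(path, "rb");
    if (f == NULL) {
        fprintf(stderr, "failed to open '%s': %s\n", path, strerror(errno));
    }
    return f;
}
```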
* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix odd blocks for ARM_NEON (llama/8556)

* ggml : fix iq4_nl dot product with odd number of blocks

* ggml : fix q4_1

* ggml : fix q5_0

* ggml : fix q5_1

* ggml : fix iq4_nl metal

ggml-ci

* ggml : fix q4_0

* ggml : fix q8_0

ggml-ci

* ggml : remove special Q4_0 code for first 2 blocks

* ggml : fix sumf redefinition

---------

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* CUDA: MMQ code deduplication + iquant support

* 1 less parallel job for CI build
0cc4m and others added 28 commits August 8, 2024 14:09
* Fix Vulkan repeat op

* Implement Vulkan concat op

* Delete old Vulkan shader generator

* Implement Vulkan im2col op

* Implement Vulkan unary gelu_quick op

* Implement Vulkan group_norm op

* Implement Vulkan timestep_embedding op

* Implement Vulkan upscale op

* Fix Vulkan vk_context tensor extra index issue

* Fix Vulkan matmul shader parameter bug

* Properly fix Vulkan matmul shader parameter bug

* Add Vulkan ADD f16 + f32 -> f16 operator support

* Implement Vulkan tanh op

* Fix Vulkan group count too large Validation error on non-Nvidia GPUs

* Throw error when too much memory is requested

* Fix another Vulkan group count too large Validation error on non-Nvidia GPUs

* Fix matmul MMQ condition

* Implement Vulkan pad op

* Fix Vulkan crash when tensor is used multiple times in a compute graph

* Add Vulkan CONCAT f16 + f16 -> f16 op

* Add Vulkan LEAKY_RELU op
* Update doc for MUSA

Signed-off-by: Xiaodong Ye <[email protected]>

* Add GGML_MUSA in Makefile

Signed-off-by: Xiaodong Ye <[email protected]>

* Add GGML_MUSA in CMake

Signed-off-by: Xiaodong Ye <[email protected]>

* CUDA => MUSA

Signed-off-by: Xiaodong Ye <[email protected]>

* MUSA adds support for __vsubss4

Signed-off-by: Xiaodong Ye <[email protected]>

* Fix CI build failure

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
… (llama/8748)

In this code, we want elements to retain the value they previously held
when mask[i] is false, so we should use the undisturbed policy. With the
default agnostic policy of the RVV intrinsics, those elements may either
keep their old values or be overwritten with 1s.

Co-authored-by: carter.li <[email protected]>
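A small C sketch of the policy difference described above, assuming the standard RVV C intrinsics in `<riscv_vector.h>` (this is not the PR's code, just the pattern):

```c
#include <riscv_vector.h>

// with the plain masked intrinsic (mask-agnostic), inactive elements of
// the result are unspecified and may come back as all-1s; the _tumu
// (tail-undisturbed, mask-undisturbed) variant keeps the old values
static vfloat32m1_t masked_add(vbool32_t mask, vfloat32m1_t prev,
                               vfloat32m1_t a, vfloat32m1_t b, size_t vl) {
    // agnostic: where mask[i] is false, result[i] is unspecified
    // vfloat32m1_t bad = __riscv_vfadd_vv_f32m1_m(mask, a, b, vl);

    // undisturbed: where mask[i] is false, result[i] keeps prev[i]
    return __riscv_vfadd_vv_f32m1_tumu(mask, prev, a, b, vl);
}
```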
…751)

* added android implementation of ggml_print_backtrace_symbols

* Update ggml/src/ggml.c

Co-authored-by: slaren <[email protected]>

* Update ggml/src/ggml.c

Co-authored-by: slaren <[email protected]>

* Update ggml/src/ggml.c

Co-authored-by: slaren <[email protected]>

* Update ggml/src/ggml.c

Co-authored-by: slaren <[email protected]>

* Update ggml/src/ggml.c

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
* cuda : fix dmmv cols requirement to 2*GGML_CUDA_DMMV_X

* update asserts

* only use dmmv for supported types

* add test
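A hedged sketch of the guard this implies; GGML_CUDA_DMMV_X is the real ggml macro, but the surrounding predicate is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#ifndef GGML_CUDA_DMMV_X
#define GGML_CUDA_DMMV_X 32 // ggml's default dmmv column chunk size
#endif

// the dmmv kernel processes columns in chunks, so it is only selected
// when the column count is a multiple of 2*GGML_CUDA_DMMV_X
static bool can_use_dmmv(int64_t ncols) {
    return ncols % (2*GGML_CUDA_DMMV_X) == 0;
}
```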
…a/8783)

* Only enable backtrace on GLIBC linux systems

* fix missing file from copy

* use glibc macro instead of defining a custom one
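The shape of the guard, as a self-contained C sketch: `backtrace()` lives in glibc's `<execinfo.h>`, so the call is compiled only when `__GLIBC__` is defined.

```c
#if defined(__GLIBC__)
#include <execinfo.h>
#include <unistd.h>

static void print_backtrace_sketch(void) {
    void * trace[64];
    int n = backtrace(trace, 64);
    backtrace_symbols_fd(trace, n, STDERR_FILENO);
}
#else
static void print_backtrace_sketch(void) {
    // no-op on musl, Windows, and other non-glibc targets
}
#endif
```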
* Adding support for unified memory

* adding again the documentation about unified memory

* refactoring: Moved the unified memory code to the correct location.

* Fixed compilation error when using hipblas

* cleaning up the documentation

* Updating the documentation

Co-authored-by: Johannes Gäßler <[email protected]>

* adding one more case where the PR should not be enabled

---------

Co-authored-by: matteo serva <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
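A minimal sketch of the mechanism, assuming the GGML_CUDA_ENABLE_UNIFIED_MEMORY environment variable this PR documents; the surrounding allocation logic is illustrative, not the PR's code:

```c
#include <stdlib.h>
#include <cuda_runtime.h>

// opt into CUDA unified (managed) memory via an environment variable;
// managed allocations can be oversubscribed and paged between host/device
static cudaError_t alloc_device_buffer(void ** ptr, size_t size) {
    if (getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY") != NULL) {
        return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
    }
    return cudaMalloc(ptr, size);
}
```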
* add truncate_bf16

* truncate intermediate fp32 if converting bf16 to bf16

* fix masking in __compute_fp32_to_bf16

* np.int16 no longer used

* missing cast and additional numpy 2.x fix

* ggml-impl : do not flush bf16 subnormals to zero

* ggml : add reference fp32 to bf16 conversion

The fast version is no longer equivalent for all platforms
because of the handling of subnormal values.

* gguf-py : remove flush to zero for bf16 subnormals

* gguf-py : remove float32 truncation to bf16

Rounding achieves the same thing in the cases where this was used.

* missed prototype update in merge

* merge cleanup

---------

Co-authored-by: Francis Couture-Harpin <[email protected]>
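A reference-style sketch of the fp32 -> bf16 conversion this series converges on, rounding to nearest-even without flushing subnormal inputs to zero (close in spirit to ggml's code, but written here as a hedged illustration):

```c
#include <stdint.h>
#include <string.h>

static uint16_t fp32_to_bf16_rne(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffff) > 0x7f800000) {
        return (uint16_t)((bits >> 16) | 64); // NaN: quiet it, keep the sign
    }
    // add 0x7fff plus the lowest kept bit, then truncate: round-to-nearest-
    // even; subnormal inputs round like any other value, no flush to zero
    bits += 0x7fff + ((bits >> 16) & 1);
    return (uint16_t)(bits >> 16);
}
```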
* ggml : read the runtime SVE config of the CPU

* change to one time init to prevent performance drop

* prefix variable to avoid possible conflicts

* revert xxhash fix and add brackets

---------

Co-authored-by: domke <[email protected]>
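A hedged Linux-only sketch of reading the runtime SVE vector length via prctl (the kernel API ggml uses here), cached once because querying it per call is what caused the performance drop mentioned above:

```c
#include <sys/prctl.h>

static int sve_vl_bytes = 0; // 0 = not yet initialized (not thread-safe here)

static int get_sve_vl_bytes(void) {
    if (sve_vl_bytes == 0) {
        // PR_SVE_GET_VL returns the vector length in bytes OR'd with flags;
        // PR_SVE_VL_LEN_MASK strips the flag bits
        sve_vl_bytes = prctl(PR_SVE_GET_VL) & PR_SVE_VL_LEN_MASK;
    }
    return sve_vl_bytes;
}
```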
It's helpful to use expm1f(x), because computing expf(x)-1 directly
suffers catastrophic cancellation (a severe loss of precision) for
25% of single-precision floating point numbers.
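A quick standalone demonstration (not from the PR): for small x, expf(x) rounds to exactly 1.0f, so the subtraction cancels to zero, while expm1f keeps the answer.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 1e-8f;
    // expf(1e-8f) rounds to exactly 1.0f, so the -1 cancels everything
    printf("expf(x)-1 = %g\n", expf(x) - 1.0f); // prints 0
    printf("expm1f(x) = %g\n", expm1f(x));      // prints ~1e-08, correct
    return 0;
}
```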
* Updated device filter to depend on default_selector (fixes non-Intel device issues)
* Small related update to example/sycl Readme
* ggml-backend : fix async copy from CPU

* cuda : more reliable async copy, fix stream used when the devices are the same
@ggerganov ggerganov merged commit 6eac067 into master Aug 8, 2024
90 checks passed
@ggerganov ggerganov deleted the sync branch August 8, 2024 19:48