KoboldCPP does not support 16-bit, 8-bit, 4-bit (GPTQ), or AWQ models. For such support, see KoboldAI.
KoboldCPP is a text generation backend based on llama.cpp and KoboldAI Lite, serving GGUF models on both GPU and CPU.
Download KoboldCPP and place the executable somewhere on your computer that you can write data to.
AMD users will have to download the ROCm version of KoboldCPP from YellowRoseCx's fork of KoboldCPP.
Linux installation requires you to compile KoboldCPP yourself (this applies to both Concedo's and YellowRoseCx's versions). First, install the following dependencies:
python3
python3-pip
libopenblas-dev
libclblast-dev
gcc
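On Debian- or Ubuntu-based distributions, for example, these can all be installed in one command (package names may differ on other distributions):
sudo apt update
sudo apt install python3 python3-pip libopenblas-dev libclblast-dev gcc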
If you are on an NVIDIA GPU, follow the NVIDIA CUDA instructions for your OS here, then complete the Post-Installation instructions here.
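You can verify that the CUDA toolkit is available before building:
nvcc --version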
Copy the following line into your terminal depending on the version of KoboldCPP you are using:
Concedo's KoboldCPP
git clone https://github.com/LostRuins/koboldcpp && cd koboldcpp
YellowRoseCx's KoboldCPP
git clone https://github.com/YellowRoseCx/koboldcpp-rocm && cd koboldcpp-rocm
Copy the following line into your terminal, depending on your GPU:
NVIDIA GPUs
make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_CUBLAS=1
AMD GPUs using YellowRoseCx's Fork
The argument -j4 tells make to use 4 of your CPU's cores while compiling. You can adjust this value accordingly (-j8, -j14) or leave it off altogether.
make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_HIPBLAS=1 -j4
Other GPUs
make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1
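Whichever variant you build, you can sanity-check the result afterwards by asking the launcher for its help text (koboldcpp.py sits in the repository root):
python3 koboldcpp.py -h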
Setting the right number of GPU layers is a trial-and-error process. Experiment with different values; if KoboldCPP crashes, lower the layer count until it launches reliably. Do note that the bigger a model's parameter count, the more layers it has.
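For example, a cautious approach (a sketch using a hypothetical model.gguf; substitute your own file and backend flags) is to start low and work upward:
# launch with a conservative number of layers
python3 koboldcpp.py --model model.gguf --usecublas normal 0 --gpulayers 10
# if it launches cleanly, quit and relaunch with more layers offloaded
python3 koboldcpp.py --model model.gguf --usecublas normal 0 --gpulayers 20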
For this section, we will be using the Q4_K_M GGUF version of Pygmalion-2-7B as the model we want to load. Replace the model filename in these examples with the model you want to load.
In the KoboldCPP GUI, select either Use CuBLAS (for NVIDIA GPUs), Use hipBLAS (ROCm) (for AMD GPUs using YellowRoseCx's fork), or Use CLBlast (for other GPUs), select how many layers you wish to offload to your GPU, and click Launch.
There are many more options you can use in KoboldCPP. Run .\koboldcpp.exe -h (Windows) or python3 koboldcpp.py -h (Linux) to see all available arguments.
All commands here are boilerplate. Adjust them to fit your system's needs.
In a Command Prompt that is set to the KoboldCPP folder, copy the following commands depending on your use case:
NVIDIA GPUs
.\koboldcpp.exe --model pygmalion-2-7b.Q4_K_M.gguf --usecublas normal 0 --gpulayers 17
Other GPUs
.\koboldcpp.exe --model pygmalion-2-7b.Q4_K_M.gguf --useclblast 0 0 --gpulayers 17
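The values after --usecublas set the VRAM mode and the GPU device ID. If you run out of memory during prompt processing, switching the mode from normal to lowvram may help, for example:
.\koboldcpp.exe --model pygmalion-2-7b.Q4_K_M.gguf --usecublas lowvram 0 --gpulayers 17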
Run KoboldCPP by copying the following commands into your Terminal, depending on your use case:
NVIDIA GPUs
python3 koboldcpp.py --model pygmalion-2-7b.Q4_K_M.gguf --usecublas normal 0 --gpulayers 17
Other GPUs
python3 koboldcpp.py --model pygmalion-2-7b.Q4_K_M.gguf --useclblast 0 0 --gpulayers 17
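If you launch KoboldCPP often, a small wrapper script saves retyping the arguments (a convenience sketch; adjust the flags to your setup):
#!/bin/bash
# launch.sh - start KoboldCPP with our usual settings
python3 koboldcpp.py --model pygmalion-2-7b.Q4_K_M.gguf --usecublas normal 0 --gpulayers 17
Make it executable with chmod +x launch.sh and run it with ./launch.sh.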
This section only works on NVIDIA GPUs.
To use more than one GPU with KoboldCPP, you may do the following.
Windows:
.\koboldcpp.exe --model pygmalion-2-7b.Q4_K_M.gguf --usecublas normal 0 1 --gpulayers 17
Linux:
python3 koboldcpp.py --model pygmalion-2-7b.Q4_K_M.gguf --usecublas normal 0 1 --gpulayers 17
Connecting to KoboldCPP is the same as connecting to KoboldAI; however, change :5000 in the URL to :5001.
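If something else on your machine already uses port 5001, KoboldCPP accepts a --port argument, for example:
python3 koboldcpp.py --model pygmalion-2-7b.Q4_K_M.gguf --usecublas normal 0 --gpulayers 17 --port 5002
Remember to use the same port in the URL when connecting.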
KoboldCPP only supports manual downloads at this time. Download a GGUF model of your choosing, or pick one from LLM Models by following this guide.
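For example, on Linux you can fetch the model used in this guide directly from its Hugging Face repository (the URL is an assumption based on TheBloke's GGUF uploads; copy the exact download link from the model page you choose):
wget https://huggingface.co/TheBloke/Pygmalion-2-7B-GGUF/resolve/main/pygmalion-2-7b.Q4_K_M.gguf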