The Aphrodite Engine is an inference engine designed to run large language models (LLMs) efficiently, for both large-scale deployments and individual use, with an emphasis on scalability.
Inspired and heavily influenced by vLLM, Aphrodite Engine is designed to power the PygmalionAI chat website. Individuals can also run it at home like many other backends, with one caveat: model weights cannot be offloaded to the CPU. The memory footprint for quantized models, however, is the same as (and in some cases lower than) that of other backends.
Aphrodite supports the most popular model architectures, including Llama, Mistral, Mixtral, Phi, and GPT-J. Models can be loaded in several formats:
For 16-bit and 32-bit precision, PyTorch bins and safetensors are supported.
For quantized models, GPTQ, AWQ, and SqueezeLLM are currently supported (with Exllamav2 and QuIP# coming soon).
Aphrodite is currently developed for Linux operating systems, but Windows users can install the engine via Windows Subsystem for Linux (WSL).
Supported GPUs: NVIDIA (sm_60), AMD (gfx9).
For NVIDIA GPUs, building from source requires CUDA 11 to 11.8. If you install via the pip package, the CUDA version does not matter, and the package will also work with higher CUDA versions.
Only the MI series AMD cards are currently supported, due to limitations in Flash Attention and xFormers.
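The CUDA requirement for source builds can be expressed as a small check. This is a minimal sketch, not part of Aphrodite: the helper names `parse_version` and `cuda_ok_for_source_build` are hypothetical, and the function only encodes the range stated above (CUDA 11 to 11.8). How you obtain the version string (for example from your toolkit installation) is left to you.

```python
from __future__ import annotations


def parse_version(text: str) -> tuple[int, int]:
    """Turn a version string such as '11.8' or '12' into (major, minor)."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_ok_for_source_build(version: str) -> bool:
    """Return True when the version lies in the 11.0 to 11.8 range (inclusive)."""
    return (11, 0) <= parse_version(version) <= (11, 8)
```

For instance, `cuda_ok_for_source_build("11.8")` is true while `cuda_ok_for_source_build("12.1")` is false. The check is only relevant when building from source, since the pip package is not tied to this range.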
The step below is only required for Windows users.
Thanks to @pyroserenus for writing the WSL guide.
This step assumes two things: that you have an NVIDIA GPU with up-to-date Studio Drivers (Game-Ready Drivers are untested), and that your Windows installation is up-to-date.
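Run the following command, then press Enter to install WSL:
wsl --install
Search for Ubuntu in the Start Menu and open the first result (it has a Tux icon).
Run nvidia-smi in the Ubuntu shell.
Then run the following commands in the Ubuntu shell:
sudo apt update && sudo apt upgrade -y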
sudo apt install python3 python3-pip ipython3 git wget bzip2 tar -y
That should set up WSL with everything you need. You can now move on to the next step.
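To confirm that the packages from the apt step ended up on your PATH, a short Python check along these lines may help. It is a sketch under assumptions: the executable names in `REQUIRED_TOOLS` are guesses derived from the package names above (in particular, `pip3` is assumed to be the command provided by `python3-pip`), and `missing_tools` is a hypothetical helper, not something Aphrodite ships.

```python
import shutil

# Executable names assumed to correspond to the apt packages installed above.
REQUIRED_TOOLS = ["python3", "pip3", "ipython3", "git", "wget", "bzip2", "tar"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All tools found.")
```

If anything is reported as missing, re-running the apt install command inside the Ubuntu shell is the first thing to try.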
Aphrodite can be easily installed via the provided pip package. Open a terminal and simply run:
pip install aphrodite-engine
That's it! You've successfully installed Aphrodite Engine. For Windows users, make sure you're in a WSL shell before running the above command.
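If you want to double-check the installation from Python, the standard library can report which version of a distribution is installed. This is a minimal sketch: `installed_version` is a hypothetical helper, and it assumes the distribution name matches the name used in the pip command above.

```python
from importlib import metadata


def installed_version(distribution: str = "aphrodite-engine"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("aphrodite-engine is not installed in this environment.")
    else:
        print("aphrodite-engine " + version)
```

Windows users should run this inside the same WSL shell they used for the pip command, since a different shell may point at a different Python environment.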
You can also build Aphrodite from source if you want to receive the latest updates, since the pip package is only updated with each release.
If you already have an environment set up with CUDA 11.8, you can simply run:
pip install git+https://github.com/PygmalionAI/aphrodite-engine@dev
The command will clone and install from the dev branch of the Aphrodite Engine GitHub repository.
If you don't have an environment set up, then run the following commands:
git clone https://github.com/PygmalionAI/aphrodite-engine -b dev
cd aphrodite-engine
./update-runtime.sh
Remove the -b dev at the end of the first command if you wish to use the stable main branch instead. The third command will create an embedded micromamba environment with everything necessary installed. Note that if you installed Aphrodite Engine this way, all your commands related to Aphrodite moving forward will need the ./runtime.sh prefix; e.g., ./runtime.sh python -m aphrodite... . Make sure you're in the cloned aphrodite-engine directory first.
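If you script your launches, the prefix rule can be captured in one place. The sketch below is illustrative only: `with_runtime` is a hypothetical helper, and the module path and model name in the example are taken from the OpenAI server launch command elsewhere in this guide.

```python
import shlex


def with_runtime(command, from_source_env: bool):
    """Prefix `command` with ./runtime.sh when the embedded environment is used."""
    args = list(command)
    return ["./runtime.sh", *args] if from_source_env else args


launch = [
    "python", "-m", "aphrodite.endpoints.openai.api_server",
    "--model", "TheBloke/Pygmalion-2-7B-GPTQ",
]

# Print the command line to run from inside the cloned aphrodite-engine directory.
print(shlex.join(with_runtime(launch, from_source_env=True)))
```

Passing `from_source_env=False` returns the command unchanged, which matches a pip installation where no prefix is needed.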
That should be it. If you followed all the steps correctly, you should now have a working Aphrodite Engine installation.
Aphrodite Engine by default uses 90% of your VRAM. To limit this, use --gpu-memory-utilization or -gmu with a lower value. By default, this is set to 0.9. To use only 50% of the VRAM, you can use -gmu 0.5.
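As a rough illustration of what the fraction means, the arithmetic can be written out as below. This is a back-of-the-envelope sketch: `vram_budget_gib` is a hypothetical helper, the 24 GiB card is an invented example, and the engine's real memory accounting may differ from a simple multiplication.

```python
def vram_budget_gib(total_gib: float, gpu_memory_utilization: float = 0.9) -> float:
    """Approximate VRAM (in GiB) the engine may claim for a given fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in the range (0, 1]")
    return total_gib * gpu_memory_utilization


# Hypothetical 24 GiB card launched with -gmu 0.5: half of the memory is budgeted.
print(vram_budget_gib(24.0, 0.5))  # prints 12.0
```

Lowering the value leaves more VRAM free for other applications on the same GPU.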
Aphrodite provides two REST API servers, OpenAI and KoboldAI. The OpenAI server is recommended for personal usage, and KoboldAI for hosting on Kobold Horde.
To launch an OpenAI server for Pygmalion-2-7b GPTQ, run the following command:
python -m aphrodite.endpoints.openai.api_server --model TheBloke/Pygmalion-2-7B-GPTQ -q gptq --dtype float16
To see all the options, refer to this page.
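Once the server is running, you can query it from Python using only the standard library. The sketch below rests on several assumptions that are not stated in this guide: that the server exposes the OpenAI-style `/v1/completions` route, that it returns the usual `choices[0].text` field, and that it listens on `http://localhost:2242` (use whatever address your server reports on startup). The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for an OpenAI-style completions route (assumed)."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, max_tokens=64):
    """Send the request and return the first completion's text."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumes the OpenAI-style response layout: {"choices": [{"text": ...}]}.
    return body["choices"][0]["text"]
```

With a server running, a call might look like `complete("http://localhost:2242", "TheBloke/Pygmalion-2-7B-GPTQ", "Hello,")`. Building the request separately from sending it makes the payload easy to inspect before anything goes over the network.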
After you've launched an OpenAI server with your model of choice, launch SillyTavern and configure your API endpoint like this:
Refer to this page.