Guides Archives - Infermatic

Simple Guide to Convert an FP16 Model to FP8

Overview

This guide walks you through converting a model from FP16 to FP8, an 8-bit data format that significantly improves inference efficiency without sacrificing output quality. FP8 is well suited to quantizing large language models (LLMs), enabling faster and more cost-effective deployments.

Requirements for quantization

  • VM with GPUs: Ensure your VM has enough GPU memory to download the FP16 model and run the conversion process.
  • Supported GPU Architectures: The conversion process requires GPUs with NVIDIA Ada Lovelace or Hopper architectures, such as the L4 or H100 GPUs.

Step 1: Setup the Environment

  1. Access your VM or GPU environment and open a terminal.
  2. Install Python and Pip:
    sudo apt install python3-pip
  3. Install the required Python packages:
    pip install transformers
    pip install -U "huggingface_hub[cli]"
  4. Clone the AutoFP8 repository:
    git clone https://github.com/neuralmagic/AutoFP8.git
  5. Navigate to the AutoFP8 directory:
    cd AutoFP8
  6. Install AutoFP8:
    pip install -e .
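With the environment set up, you can optionally run a quick sanity check (a minimal sketch; the file name is just a suggestion) to confirm that the packages installed above import cleanly:

    # sanity_check.py - confirm the Step 1 installs are importable
    import transformers
    import auto_fp8  # provided by the editable AutoFP8 install above

    print("transformers", transformers.__version__)
    print("AutoFP8 import OK")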

Step 2: Download the FP16 Model

In a new terminal, use the Hugging Face CLI to download the FP16 model:

huggingface-cli download [modelName]
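If you prefer to stay in Python, the same download can be done with huggingface_hub's snapshot_download (a small sketch; the model name matches the one used in the quantization script in Step 3):

    # Python alternative to the CLI download above
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(repo_id="meta-llama/Meta-Llama-3-8B-Instruct")
    print("Model downloaded to", local_path)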

Step 3: Quantize the Model to FP8

  1. Create (or open) the quantize_model.py script in a text editor:
    nano quantize_model.py
  2. Add the following content, setting pretrained_model_dir to the model you downloaded:
    from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
    
    pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
    quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
    
    # Define quantization config with dynamic activation scales
    quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
    # For dynamic activation scales, there is no need for calibration examples
    examples = []
    
    # Load the model, quantize, and save checkpoint
    model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
    model.quantize(examples)
    model.save_quantized(quantized_model_dir)
  3. Run the quantization script:
    python3 quantize_model.py
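Before moving on, you can quickly confirm that the quantized checkpoint was written out (a sketch, assuming the quantized_model_dir used in the script above):

    # Check that the FP8 checkpoint was saved; expect config.json and .safetensors files
    import os

    out_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
    print(sorted(os.listdir(out_dir)))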

Step 4: Upload the Quantized FP8 Model

  1. Log in to Hugging Face:
    huggingface-cli login
  2. Paste your Hugging Face token when prompted.
  3. Navigate to the model’s weight directory:
    cd [path_to_model_weights]
  4. Upload the FP8 model:
    huggingface-cli upload [modelName]
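If you prefer the Python API, huggingface_hub can create the repository and upload the folder in one short script (a sketch; the repo id below is hypothetical, so replace it with your own namespace):

    # Upload the quantized checkpoint with the huggingface_hub Python API
    from huggingface_hub import HfApi

    api = HfApi()  # reuses the token stored by `huggingface-cli login`
    repo_id = "your-username/Meta-Llama-3-8B-Instruct-FP8-Dynamic"  # hypothetical repo id
    api.create_repo(repo_id=repo_id, exist_ok=True)
    api.upload_folder(folder_path="Meta-Llama-3-8B-Instruct-FP8-Dynamic", repo_id=repo_id, repo_type="model")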

Conclusion

You have successfully converted your FP16 model to FP8 and uploaded it to Hugging Face! This conversion allows for faster and more efficient inference, especially for large language models.

Check out our FP8 models.

Understanding FP8 Quantization

TL;DR: FP8 is an 8-bit data format that offers an alternative to INT8 for quantizing LLMs. Thanks to its higher dynamic range, FP8 is suitable for quantizing more of an LLM’s components, most notably its activations, making inference faster and more efficient. FP8 quantization is also safer for smaller models, like 7B parameter LLMs, than INT8 quantization, offering better performance improvements with less degradation of output quality.

An Introduction to Floating Point Numbers

Floating point number formats were a breakthrough in the math that underpins computer science, and their history stretches back over 100 years. Today, they are codified in the IEEE 754-2019 spec, which sets international standards for how floating point numbers are expressed.

A floating point number has 3 parts:

  • Sign: A single bit indicating if the number is positive or negative.
  • Range (Exponent): The power of two that scales the number, determining its magnitude.
  • Precision (Mantissa): The significant digits of the number.

In contrast, an integer representation is mostly significant digits (precision). It may or may not have a sign bit depending on the format, but no exponent.
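To make those three fields concrete, here is a small sketch (using numpy, which is assumed to be available) that unpacks an FP16 value into its sign, exponent, and mantissa bits:

    # Decompose an FP16 number into sign / exponent / mantissa
    import numpy as np

    bits = int(np.array(-1.5, dtype=np.float16).view(np.uint16))
    sign     = (bits >> 15) & 0x1    # 1 sign bit
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits (FP16)
    mantissa = bits & 0x3FF          # 10 mantissa bits
    # -1.5 -> sign=1, exponent=15 (bias 15), mantissa=512: (-1)^1 * 2^0 * (1 + 512/1024) = -1.5
    print(sign, exponent, mantissa)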

FP8 vs INT8 Data Formats

FP8 and INT8 are both 8-bit values, but the way they use those bits determines their utility as data formats for model inference. Here’s a comparison of the dynamic range of each format:

  • INT8 dynamic range: 2^8
  • E4M3 FP8 dynamic range: 2^18
  • E5M2 FP8 dynamic range: 2^32

This higher dynamic range means that after FP16 values are mapped to FP8, it’s easier to tell them apart and retain more of the encoded information from the model parameters, making FP8 quantization more reliable for smaller models.
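As a rough back-of-the-envelope check (a sketch using the FP8 spec constants: E4M3 has max normal 448 and min subnormal 2^-9, E5M2 has max normal 57344 and min subnormal 2^-16), the log2 ratio between the largest and smallest positive representable magnitudes lands near the figures above:

    # Approximate dynamic range (in powers of two) of the two FP8 variants
    import math

    e4m3_max, e4m3_min = 448.0, 2 ** -9     # max normal, min subnormal
    e5m2_max, e5m2_min = 57344.0, 2 ** -16  # max normal, min subnormal

    print(round(math.log2(e4m3_max / e4m3_min)))  # ~18 -> dynamic range ~2^18
    print(round(math.log2(e5m2_max / e5m2_min)))  # ~32 -> dynamic range ~2^32
    # INT8, by contrast, covers only 2^8 evenly spaced integer values (-128..127)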

Applying FP8 in Production

In practice, FP8 enables quantizing not just an LLM’s weights but also the activations and KV cache, avoiding expensive calculations in FP16 during model inference. FP8 is supported on latest-generation GPUs such as the NVIDIA H100 GPU, where alongside other optimizations, it can deliver remarkable performance with minimal quality degradation.

Alternative with vLLM: Quick Start with Online Dynamic Quantization

Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying --quantization="fp8" in the command line or setting quantization="fp8" in the LLM constructor.
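For reference, here is a minimal sketch of that second option with the vLLM Python API (the model name is just an example):

    # Online dynamic FP8 quantization in vLLM; no calibration data needed
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
    outputs = llm.generate(["Explain FP8 quantization in one sentence."], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)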

In this mode, all Linear modules (except for the final lm_head) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.

vLLM Quantization Documentation

Using Infermatic.ai API with SillyTavern

SillyTavern is one of the most popular interfaces for interacting with LLMs. We have been working on an API, and SillyTavern was one of the first interfaces we wanted to integrate with. We have done just that.

Requirements: Infermatic.ai Plus Tier subscription ($15/month)

Steps to integrate:

  1. After you subscribe to Infermatic.ai you can generate an API key. You will see a new option on the left-side menu bar called API Keys. Select it and a modal will open.

  2. In the modal, generate a new key or copy an existing key.

  3. Run SillyTavern locally. To connect to our API, select the power socket icon and match each setting:

  • API: Text Completion
  • API Type: Infermatic
  • Custom Endpoint (only if vLLM or Aphrodite is selected): https://api.totalgpt.ai
  • Custom API Key: The key you copied in step 2 above.

  4. Hit ‘Connect’ to connect to our API. The available models dropdown will populate with the models our API supports. Select one and use the same name listed in the Enter a Model ID field.

  5. And that’s all! You’re ready to enjoy it.
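If the connection fails, you can sanity-check your key outside SillyTavern with a quick request (a sketch; it assumes the endpoint exposes an OpenAI-style /v1/models route, as vLLM-backed servers typically do, so check the Infermatic docs if yours differs):

    # Verify the API key by listing the available models
    import requests

    headers = {"Authorization": "Bearer YOUR_INFERMATIC_API_KEY"}  # the key from step 2
    resp = requests.get("https://api.totalgpt.ai/v1/models", headers=headers, timeout=30)
    print(resp.status_code, resp.json())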

We are still learning, so we would love feedback on this integration. Feel free to join us on Discord to share your thoughts.
