Llama 2 with TensorRT-LLM – When You Have The Need for Speed

Llama 2 is an open-source large language model (LLM) created by Meta to compete with the likes of ChatGPT and Gemini. Unlike some of its competitors, Llama 2 distinguishes itself by its performance, which on many metrics comes close to the accuracy of GPT-3.5 Turbo, while being much smaller than massive models such as GPT-3. That smaller size makes it possible to run the model on less expensive hardware.

There are currently three Llama 2 versions publicly available: 7B, 13B and 70B, where B stands for billions of parameters. The more parameters a version has, the greater its capacity to capture and use the nuances of language. That increased capacity can improve how well the model understands the context of the input prompt, enabling it to generate more accurate and contextually relevant results. The trade-off is that, at these parameter counts, inference speed can still be relatively slow, especially for the larger variants.

As a rule of thumb, the more parameters a model has, the larger it is. The larger the model, the more memory it uses, and the more memory it uses, the slower it runs. Because Llama 2 keeps its parameter counts comparatively low and its models small, speed and modest hardware requirements have been its selling points.
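
As a rough back-of-the-envelope illustration (weights only, ignoring activations and runtime overhead), the short sketch below estimates the FP16 memory footprint of each Llama 2 variant at 2 bytes per parameter:

# Rough weight-memory estimate: parameter count x 2 bytes (FP16), ignoring activations and overhead.
for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"Llama 2 {name}: ~{params * 2 / 1e9:.0f} GB of memory for the weights alone")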

A Date That Changed the LLM Game

On October 19, 2023, the LLM game changed forever. On that day, NVIDIA publicly released its TensorRT-LLM library, a library that drastically increases the inference speed of many open-source LLMs, including Llama 2. Several techniques enable this speed-up, including precision optimization, layer and tensor fusion, and kernel auto-tuning, to name a few. Thanks to TensorRT-LLM, regular computer users working on less expensive hardware can take advantage of this faster inference without the need for costly servers. Thus, game-changer status was achieved.
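
To give a feel for one of these techniques, precision optimization, here is a toy sketch in plain NumPy (not TensorRT-LLM’s actual implementation) showing how casting weights from FP32 to FP16 halves their memory footprint while introducing only a small rounding error:

import numpy as np

weights_fp32 = np.random.randn(4096, 4096).astype(np.float32)  # mock weight matrix
weights_fp16 = weights_fp32.astype(np.float16)                 # reduced-precision copy

print(f"FP32: {weights_fp32.nbytes / 1e6:.0f} MB, FP16: {weights_fp16.nbytes / 1e6:.0f} MB")
print(f"Max rounding error: {np.abs(weights_fp32 - weights_fp16).max():.2e}")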

Since LLMs are typically used in a conversational setting, finding the right balance between accuracy (built with context from increasingly large datasets) and speed (accuracy’s traditional arch-nemesis) has been a primary challenge. With the increased speed provided by the TensorRT-LLM library, we no longer have to compromise on either. This allows any user of the model to have a real-time, coherent conversation with an LLM like Llama 2.

Diving Deeper Into the Tech

To understand the opportunities and challenges, it helps to take a deeper look at the technology. In this section, we’re going to demonstrate how we use TensorRT-LLM to speed up the inference of a Llama 2 7B model… and you can follow along!

*Note: In order to follow along and implement the steps, you’re going to need these minimum hardware requirements (a quick check script follows the list):

  • NVIDIA GPU with at least 15 GB of memory
  • CPU: any CPU that does not bottleneck the GPU
  • RAM: 32 GB
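
Before you start, you can quickly confirm what your GPU reports. The check below is just a convenience sketch and assumes the nvidia-smi command-line tool that ships with the NVIDIA driver is on your PATH:

# Query the GPU name and total memory via nvidia-smi (assumes the NVIDIA driver is installed).
import subprocess

gpu_info = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(gpu_info.stdout.strip())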

Let’s get started. 

Step 1:

Install git-lfs (Git Large File Storage). This is used to pull some files that are not actually stored directly in the TensorRT-LLM GitHub repository.

apt-get update && apt-get -y install git git-lfs

Step 2:

Clone the repo.

git clone https://github.com/NVIDIA/TensorRT-LLM.git

cd TensorRT-LLM

Step 3:

Update git submodules (other repositories embedded within this repository):

git submodule update --init --recursive

Step 4:

Configure git lfs in the local git environment by modifying the global git configuration. 

git lfs install

Step 5:

Download the files specified in the tracking files that are in this repository.

git lfs pull

Step 6:

TensorRT-LLM has a command to create its own docker image (this takes a long time to finish):

make -C docker release_build

This command runs a Makefile inside a folder called docker, which comes with the TensorRT-LLM repo.

Step 7:

Execute the container:

make -C docker release_run

Step 8:

Now we have to build the TensorRT-LLM code with this command (commands from this step and step 9 take some time to finish):

python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

Step 9:

This installs the TensorRT-LLM wheel. The installation will automatically pull in any other required packages.

pip install ./build/tensorrt_llm*.whl
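
To confirm the wheel installed correctly, a quick sanity check (run inside the container) is to import the package and print its reported version:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"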

Step 10:

Here comes what we were waiting for: building the Llama 2 7B model. For the model directory we just pass the name of a Hugging Face model, which can also be a custom Llama 2 7B model. For this example we use the NousResearch/Llama-2-7b-hf model (https://huggingface.co/NousResearch/Llama-2-7b-hf).

python ./examples/llama/build.py --model_dir NousResearch/Llama-2-7b-hf \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./examples/llama/tmp/llama/7B/trt_engines/fp16/1-gpu/
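
When the build completes, the output directory should contain the serialized engine and its configuration. A quick listing (just a sketch, assuming the --output_dir used above) makes it easy to confirm the files are there:

# List the generated engine files and their sizes (path matches the --output_dir above).
import os

engine_dir = "./examples/llama/tmp/llama/7B/trt_engines/fp16/1-gpu/"
for name in sorted(os.listdir(engine_dir)):
    size_mb = os.path.getsize(os.path.join(engine_dir, name)) / (1024 * 1024)
    print(f"{name}: {size_mb:.0f} MB")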

Step 11:

For this last step we run the run.py example found in the examples directory. This example inputs an incomplete sentence into the model. The model will then complete the sentence as its inference result:

python3 ./examples/run.py --max_output_len=50 \
                          --tokenizer_dir NousResearch/Llama-2-7b-hf \
                          --engine_dir=./examples/llama/tmp/llama/7B/trt_engines/fp16/1-gpu/

This command takes just a few seconds to run, whereas without this library inference would have taken more than a minute.
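
If you want to verify the speed-up on your own hardware, a rough wall-clock measurement is enough. The sketch below simply times the run.py invocation, assuming the same paths and model name used above:

# Time the example inference end to end (includes engine loading, so it overstates per-token latency).
import subprocess, time

cmd = [
    "python3", "./examples/run.py",
    "--max_output_len=50",
    "--tokenizer_dir", "NousResearch/Llama-2-7b-hf",
    "--engine_dir=./examples/llama/tmp/llama/7B/trt_engines/fp16/1-gpu/",
]
start = time.perf_counter()
subprocess.run(cmd, check=True)
print(f"Example finished in {time.perf_counter() - start:.1f} s")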

Performance Benchmark

The official benchmark table, taken from the TensorRT-LLM website, shows the performance of the library when running several models in FP16 on A100 GPUs.

Conclusion

Beyond speeding up Llama 2, TensorRT-LLM’s improvement in inference speed has brought many important benefits to the LLM world:

  1. Reduced Latency: Faster inference directly translates to reduced latency, which is crucial for applications like chatbots, natural language processing, and other real-time systems. Lower latency improves the user experience and responsiveness of the model.
  2. Energy Efficiency: Accelerating LLMs can lead to more energy-efficient inference, which is essential for applications running on edge devices with limited power resources. This can make LLMs more feasible for deployment in resource-constrained environments.
  3. Scalability: Optimized inference allows for better scalability, enabling deployment of LLMs on a broader range of hardware, from edge devices to high-performance GPU servers.
  4. Deployment in Production Environments: Faster inference speed is crucial for deploying LLMs in production environments, where responsiveness and efficiency are paramount. TensorRT provides tools for optimizing models for deployment in various scenarios.
  5. Real-Time Applications: Many applications, such as virtual assistants, language translation, and sentiment analysis, require real-time processing of natural language. Accelerating LLMs can make these applications more practical and responsive.

As Llama 2 continues to evolve and develop, we can expect even more enhancements that will deliver benefits to the LLM world and beyond. 


Posted on February 13, 2024