AI Open Source · Model Inference & Deployment
huggingface/text-generation-inference
Hugging Face's TGI is a production-oriented serving framework for text generation, providing efficient inference for models such as BLOOM, Falcon, StarCoder, and the GPT family. It sits in the same category as vLLM, and it is what powers Hugging Face's own Inference Endpoints.
Large Language Model Text Generation Inference
- Stars: ★ 11k
- Language: Python
- License: Apache-2.0
- Last push: 2 months ago
- Created: 2022-10-08
- Topics: bloom, deep-learning, falcon, gpt, inference, nlp
README
[Making TGI deployment optimal](https://www.youtube.com/watch?v=jlMAX2Oaht0) (video)

> [!CAUTION]
> text-generation-inference is now in maintenance mode. Going forward, we will accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks.
TGI initiated the movement for optimized inference engines to rely on `transformers` model architectures. This approach has since been adopted by downstream inference engines, which we contribute to and recommend using going forward: vLLM and SGLang, as well as inter-compatible local engines such as llama.cpp or MLX.
Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:
- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Messages API compatible with the OpenAI Chat Completion API (see the streaming chat sketch after this list)
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization support (e.g. bitsandbytes, GPT-Q, AWQ)
- Safetensors weight loading
- Watermarking with A Watermark for Large Language Models
- Logits warpers (temperature scaling, top-p, top-k, repetition penalty; see `transformers.LogitsProcessor` for details)
- Stop sequences
- Log probabilities
- Speculative decoding (~2x lower latency)
- Guidance/JSON: constrain the output format to speed up inference and guarantee the output is valid against a given spec (a sketch follows below)
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
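To make the Messages API and token streaming concrete, here is a minimal sketch using the `openai` Python client against TGI's OpenAI-compatible route. It assumes a TGI server is already running at `http://localhost:8080`; the host, port, and prompt are placeholders for illustration.

```python
# A minimal sketch, assuming a TGI server is running on http://localhost:8080.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # TGI's OpenAI-compatible route
    api_key="-",                          # TGI does not validate the key by default
)

# Messages API request; stream=True delivers tokens as they are generated (SSE).
stream = client.chat.completions.create(
    model="tgi",  # TGI serves one model per instance; the name is not used for routing
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```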
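The Guidance/JSON feature can be exercised through the raw `/generate` route by attaching a grammar to the request parameters. The following is a hedged sketch: the `grammar` field layout follows TGI's Guidance documentation, but you should verify it against the API schema of your TGI version, and the schema itself is a made-up example.

```python
# A sketch of grammar-constrained generation via TGI's /generate route.
import requests

payload = {
    "inputs": "Extract the name and age: 'David is 25 years old.'",
    "parameters": {
        "max_new_tokens": 64,
        # Constrain the output to match a JSON Schema (example schema).
        "grammar": {
            "type": "json",
            "value": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(resp.json()["generated_text"])
```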
Hardware support
- Nvidia
- AMD (ROCm)
- Inferentia
- Intel GPU
- Gaudi
- Google TPU
Get Started
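TGI is typically launched via its official Docker image (`ghcr.io/huggingface/text-generation-inference`) with a `--model-id` pointing at a Hub model. Once a server is up, it can be queried over plain HTTP. The sketch below assumes a server on `http://localhost:8080`; the parameter values are illustrative, and `details` / `stop` follow TGI's documented generate parameters.

```python
# A minimal sketch for querying a running TGI server's /generate route.
import requests

payload = {
    "inputs": "Deep learning is",
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,   # logits-warper settings from the feature list above
        "top_p": 0.95,
        "stop": ["\n\n"],     # stop sequences
        "details": True,      # request per-token details such as log probabilities
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
data = resp.json()
print(data["generated_text"])
```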
Other projects in the same category