AI Open Source · Model Inference & Deployment
huggingface/text-generation-inference
Hugging Face's TGI is a production-oriented serving framework for text generation, providing efficient inference for models such as BLOOM, Falcon, StarCoder, and the GPT family. It sits in the same category as vLLM, and it is what powers Hugging Face's own Inference Endpoints.
Large Language Model Text Generation Inference
- Stars: ★ 11k
- Language: Python
- License: Apache-2.0
- Last push: 2 months ago
- Created: 2022-10-08
- Topics: bloom, deep-learning, falcon, gpt, inference, nlp
README
[Making TGI deployment optimal](https://www.youtube.com/watch?v=jlMAX2Oaht0) (video)

> [!CAUTION]
> text-generation-inference is now in maintenance mode. Going forward, we will accept pull requests for minor bug fixes, documentation improvements and lightweight maintenance tasks.
TGI initiated the movement for optimized inference engines to rely on `transformers` model architectures. This approach has since been adopted by downstream inference engines, which we contribute to and recommend using going forward: vLLM and SGLang, as well as inter-compatible local engines such as llama.cpp or MLX.
Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.
Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:
- Simple launcher to serve most popular LLMs
- Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Messages API compatible with the OpenAI Chat Completion API (see the streaming chat sketch after this list)
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization support (e.g. bitsandbytes, GPT-Q, AWQ)
- Safetensors weight loading
- Watermarking with A Watermark for Large Language Models
- Logits warpers (temperature scaling, top-p, top-k, repetition penalty; see `transformers.LogitsProcessor` for details)
- Stop sequences
- Log probabilities
- Speculative decoding (~2x lower latency)
- Guidance/JSON: constrain the output format to speed up inference and guarantee the output is valid against a given spec (a sketch follows below)
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
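To make the Messages API and token streaming concrete, here is a minimal sketch using the `openai` Python client against TGI's OpenAI-compatible route. It assumes a TGI server is already running at `http://localhost:8080`; the host, port, and prompt are placeholders for illustration.

```python
# A minimal sketch, assuming a TGI server is running on http://localhost:8080.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # TGI's OpenAI-compatible route
    api_key="-",                          # TGI does not validate the key by default
)

# Messages API request; stream=True delivers tokens as they are generated (SSE).
stream = client.chat.completions.create(
    model="tgi",  # TGI serves one model per instance; the name is not used for routing
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```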
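The Guidance/JSON feature can be exercised through the raw `/generate` route by attaching a grammar to the request parameters. The following is a hedged sketch: the `grammar` field layout follows TGI's Guidance documentation, but you should verify it against the API schema of your TGI version, and the schema itself is a made-up example.

```python
# A sketch of grammar-constrained generation via TGI's /generate route.
import requests

payload = {
    "inputs": "Extract the name and age: 'David is 25 years old.'",
    "parameters": {
        "max_new_tokens": 64,
        # Constrain the output to match a JSON Schema (example schema).
        "grammar": {
            "type": "json",
            "value": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(resp.json()["generated_text"])
```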
Hardware support
- Nvidia
- AMD (ROCm)
- Inferentia
- Intel GPU
- Gaudi
- Google TPU
Get Started
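TGI is typically launched via its official Docker image (`ghcr.io/huggingface/text-generation-inference`) with a `--model-id` pointing at a Hub model. Once a server is up, it can be queried over plain HTTP. The sketch below assumes a server on `http://localhost:8080`; the parameter values are illustrative, and `details` / `stop` follow TGI's documented generate parameters.

```python
# A minimal sketch for querying a running TGI server's /generate route.
import requests

payload = {
    "inputs": "Deep learning is",
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,   # logits-warper settings from the feature list above
        "top_p": 0.95,
        "stop": ["\n\n"],     # stop sequences
        "details": True,      # request per-token details such as log probabilities
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
data = resp.json()
print(data["generated_text"])
```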
Other projects in the same category