Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Category: developer (developer tools) · Author: davila7 · Version: @main · License: MIT
Tags: Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU
This skill does not yet have any documentation files.
tensorrt-llm is an AI Agent skill developed by davila7 in the "developer" category. It optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency, and is intended for production deployment on NVIDIA GPUs (A100/H100), for cases that need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling. The skill can be integrated directly into compatible AI Agent platforms. Once installed, the Agent gains the tools, prompts, or workflows the skill defines and can invoke them automatically during a conversation.
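Installing tensorrt-llm adds new capabilities to your Agent so that it can carry out a wider range of tasks.
To install, click the install command on the right side of the page to copy it, then run it in a terminal. Most skills are installed with the npx skills add command; some also support manually downloading a ZIP file.
This skill is available for free on AgentCC. Some skills depend on third-party APIs or services, so check the skill's documentation to see whether an additional API key or paid service is required.
After a successful installation, the skill is registered with your Agent platform automatically. When a request in conversation falls within the skill's scope, the Agent invokes the skill to complete the task.
Skills differ in implementation, coverage, and author. Compare the similar options under "Related skills" at the bottom of the page and choose the one that best fits your needs.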