vLLM
vLLM is the production path for serving open-weight models at scale: paged attention, continuous batching, and tensor parallelism. Pair it with TapPass when you want Bedrock-tier performance on your own GPUs.
Requirements
Server-side (point TapPass at your vLLM OpenAI-compatible endpoint):

```
VLLM_BASE_URL=http://vllm.internal:8000/v1
VLLM_API_KEY=<optional-shared-secret>
```

Run vLLM with the OpenAI server:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --api-key "$VLLM_API_KEY"
```

Option A: SDK
```python
from tappass import Agent

agent = Agent("https://tappass.example.com", "tp_...")
response = agent.chat("Hello", model="vllm/meta-llama/Llama-3.1-70B-Instruct")
```

Option B: OpenAI SDK, zero-code
```shell
export OPENAI_BASE_URL=https://tappass.example.com/v1
export OPENAI_API_KEY=tp_...
```

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="vllm/meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
```

What’s supported
- chat.completions (streaming + non-streaming)
- completions (legacy)
- embeddings (when using an embedding model)
- Tool calls (model-dependent)
- Prefix caching, speculative decoding — all invisible to the client
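
Since streaming chat.completions is listed as supported, a streamed request through the Option B setup might look like the sketch below. It assumes the OpenAI Python SDK v1 or later and that `OPENAI_BASE_URL` and `OPENAI_API_KEY` are exported as shown in Option B. The helper names `iter_text` and `stream_chat` are hypothetical, chosen for illustration only, and are not part of TapPass or vLLM.

```python
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text fragments carried by chat.completions stream chunks."""
    for chunk in chunks:
        # Some chunks (for example a trailing usage chunk) carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # Role-only and tool-call deltas have no text content.
        if delta.content:
            yield delta.content


def stream_chat(prompt: str,
                model: str = "vllm/meta-llama/Llama-3.1-70B-Instruct") -> str:
    """Hypothetical helper: stream a reply and return the full text."""
    # Imported lazily so iter_text can be used without the openai package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for text in iter_text(stream):
        print(text, end="", flush=True)  # show tokens as they arrive
        parts.append(text)
    print()
    return "".join(parts)
```

Calling `stream_chat("Hello")` prints the reply incrementally and returns the assembled text. Skipping chunks without choices or content keeps the loop robust to role-only deltas and tool-call fragments, which matter when the served model emits tool calls.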