vLLM

vLLM is the production path for serving open-weight models at scale, with paged attention, continuous batching, and tensor parallelism. Pair it with TapPass when you want Bedrock-tier performance on your own GPUs.

Server-side (point TapPass at your vLLM OpenAI-compatible endpoint):

Terminal window
VLLM_BASE_URL=http://vllm.internal:8000/v1
VLLM_API_KEY=<optional-shared-secret>
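
Before wiring the endpoint into TapPass, it can help to confirm that vLLM itself accepts the base URL and key. The sketch below builds a request for the OpenAI-compatible model listing route (`/models` under the base URL). `build_models_request` and `list_model_ids` are hypothetical helper names, and sending the key as a Bearer token assumes vLLM was started with `--api-key`.

```python
import json
import urllib.request


def build_models_request(base_url, api_key=None):
    """Build a GET request for the model listing route under base_url."""
    url = base_url.rstrip("/") + "/models"
    request = urllib.request.Request(url, method="GET")
    if api_key:
        # Assumption: the shared secret is sent as a standard Bearer token.
        request.add_header("Authorization", f"Bearer {api_key}")
    return request


def list_model_ids(base_url, api_key=None, timeout=5.0):
    """Return the model ids the server reports (performs a network call)."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload["data"]]
```

Calling `list_model_ids(os.environ["VLLM_BASE_URL"], os.environ.get("VLLM_API_KEY"))` from a host that can reach vLLM should return the served model name. A 401 response usually means the key does not match the one vLLM was started with.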

Run vLLM with the OpenAI server:

Terminal window
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --api-key "$VLLM_API_KEY"
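
As a rough guide to why the command shards the model across four GPUs, the arithmetic below estimates weight memory per GPU. It is a back-of-the-envelope sketch, not a sizing tool: it assumes roughly 70 billion parameters stored in 16 bits each, split evenly across GPUs, and it ignores the KV cache, activations, and runtime overhead, which all need additional headroom. `weight_gib_per_gpu` is a hypothetical helper name.

```python
def weight_gib_per_gpu(num_params, bytes_per_param, tensor_parallel_size):
    """Approximate weight memory per GPU in GiB, assuming even sharding."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 2**30


# Assumed inputs: ~70e9 parameters, 2 bytes each (16-bit), 4-way tensor parallelism.
per_gpu = weight_gib_per_gpu(70e9, 2, 4)
print(f"{per_gpu:.1f} GiB of weights per GPU")  # 32.6 GiB of weights per GPU
```

Under the same assumptions, a single GPU would need about 130 GiB for weights alone, which is why a 70B model is typically served with tensor parallelism.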

Client-side, with the TapPass SDK:

from tappass import Agent
agent = Agent("https://tappass.example.com", "tp_...")
response = agent.chat("Hello", model="vllm/meta-llama/Llama-3.1-70B-Instruct")

Client-side, with the OpenAI SDK (point it at TapPass):

Terminal window
export OPENAI_BASE_URL=https://tappass.example.com/v1
export OPENAI_API_KEY=tp_...

from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="vllm/meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)

Supported through TapPass:

  • chat.completions (streaming + non-streaming)
  • completions (legacy)
  • embeddings (when using an embedding model)
  • Tool calls (model-dependent)
  • Prefix caching and speculative decoding (both server-side and invisible to the client)
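
For the streaming case in the list above, the sketch below shows how a client could reassemble the reply from OpenAI-style server-sent event lines. It assumes the stream arrives in the standard `chat.completions` chunk format (`data: {...}` lines ending with `data: [DONE]`). The `sample` lines are hand-written for illustration, not captured from a real server, and `collect_stream_text` is a hypothetical helper name.

```python
import json


def collect_stream_text(sse_lines):
    """Concatenate the delta content carried by chat.completions stream lines."""
    parts = []
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Illustrative input only; real chunks carry more fields (id, model, ...).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Hello
```

In practice the OpenAI SDK does this parsing when `stream=True` is passed to `client.chat.completions.create`, so hand-rolled parsing is mainly useful for debugging raw responses.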