While I was at ACME Autos, we shipped one of the first AI-enabled customer support agents, before agents were a thing. Tool calling was hand-coded: pre-harness, pre-SDK, pre-MCP. The case study below runs a hardened, period-styled version. Talk to it. Try to break it. (You're welcome to. It won't.)
Function calling shipped at OpenAI in mid-2023. The dealer-chat work was already in production using a hand-rolled XML-tag protocol that did the same job: structured tool descriptions, structured argument parsing, structured tool results. When the official spec landed, migration was a renaming exercise.
metric · 02
1 tool, hardened
Inventory lookup. Single function, narrow contract, well-tested. The first principle of useful tool calling: fewer tools, sharper boundaries, more reliable behavior. The demo on this page exposes exactly that one tool, scoped to tongue-in-cheek answers.
metric · 03
Adversarial-tested
The live demo is hardened against prompt injection, code extraction, off-topic abuse, role-play takeovers, and request flooding. The defenses: per-IP minute and hour caps, a per-message length cap, a session token budget, and refusal patterns for known abuse shapes. Try to make it break character.
The 2023 build
ACME Autos (a major auto-dealer platform) wanted AI in the
consumer-facing chat. The market opening was real: a customer browsing
inventory at 11pm has no salesperson available. The risk was just as real:
one bad answer in front of a brand puts the dealer in an awkward seat.
OpenAI didn't ship structured tool calling until mid-2023, so we
built it ourselves. A small protocol of XML-tagged tool descriptions in the
system prompt, structured argument parsing on the response, and a tool
runner that executed inventory lookups against the dealer's actual stock
feed. When the OpenAI spec landed, the migration to the official format
was effectively a rename.
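The shape of such a protocol can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original ACME code: the tag names (`tool_call`, `tool_result`), the prompt wording, the in-memory stock table, and the helper `run_tool_call` are all hypothetical. Only the tool name `lookup_inventory(query)` comes from this page.

```python
import re
from typing import Callable, Dict, Optional
from xml.sax.saxutils import escape

# Hypothetical tool description that would sit in the system prompt.
TOOL_PROMPT = """You have one tool. To call it, reply with exactly:
<tool_call name="lookup_inventory"><query>SEARCH TERMS</query></tool_call>
Tool description:
<tool name="lookup_inventory">
  <description>Search the dealer's stock feed.</description>
  <arg name="query" type="string"/>
</tool>"""

# Structured argument parsing: pull the tool name and query out of the reply.
TOOL_CALL_RE = re.compile(
    r'<tool_call name="(?P<name>[a-z_]+)">\s*'
    r"<query>(?P<query>.*?)</query>\s*</tool_call>",
    re.DOTALL,
)


def lookup_inventory(query: str) -> str:
    """Stand-in for the real stock-feed lookup (assumed data)."""
    stock = {"civic": "2 in stock", "f-150": "1 in stock"}
    return stock.get(query.strip().lower(), "no match")


# The allowlist is the boundary: anything not listed here cannot run.
TOOLS: Dict[str, Callable[[str], str]] = {"lookup_inventory": lookup_inventory}


def run_tool_call(model_reply: str) -> Optional[str]:
    """Return an XML-tagged tool result, or None if no tool call was made."""
    match = TOOL_CALL_RE.search(model_reply)
    if match is None:
        return None
    name = match.group("name")
    tool = TOOLS.get(name)
    if tool is None:
        return '<tool_result status="error">unknown tool</tool_result>'
    result = escape(tool(match.group("query")))
    return f'<tool_result name="{name}">{result}</tool_result>'
```

In a loop like this, the tagged result would be appended to the conversation and the model asked to summarize it. Because the description, the arguments, and the result are already structured, moving to an official function-calling format is mostly a matter of renaming fields.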
The discipline that mattered
What made it work wasn't the model. It was the discipline: one tool,
narrow boundary, ruthless refusal posture for everything else.
Wide-open chat surfaces became liabilities; chat surfaces with one
well-scoped tool became products.
The demo, hardened
Below is a faithful re-implementation of that pattern, on a 2026 stack
(small modern model, modern hosting), with the same posture. Talk to it.
Ask about a car, ask for a trade-in valuation, try to convince it to
write you a Python script, try to override its system prompt. The
refusal patterns and the rate limits are real defenses, not theater.
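For a sense of what a per-IP rate limit involves, here is a minimal sliding-window sketch in Python using the caps this page lists for the demo (5 messages per minute, 15 per hour). The class name, the in-memory storage, and the injectable `now` parameter are assumptions made for illustration; the demo's actual limiter is not shown here and may work differently.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Per-IP limiter with a minute cap and an hour cap (in-memory sketch)."""

    def __init__(self, per_minute: int = 5, per_hour: int = 15) -> None:
        self.per_minute = per_minute
        self.per_hour = per_hour
        # Timestamps of accepted messages, oldest first, keyed by IP.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record and allow the message, or return False if a cap is hit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the one-hour window.
        while hits and now - hits[0] >= 3600:
            hits.popleft()
        in_last_minute = sum(1 for t in hits if now - t < 60)
        if len(hits) >= self.per_hour or in_last_minute >= self.per_minute:
            return False
        hits.append(now)
        return True
```

Rejected messages are not recorded, so a blocked client is not pushed further over the cap by retrying. A deployment with more than one server process would need shared storage in place of the in-memory dictionary.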
ACME AUTOS · LIVE CHAT · v1.0 · 2023
online
ACME-CHAT-1.0
Welcome to the live demo of the dealer chat I built back in 2023.
Hand-rolled tool calling, GPT-3.5 under the hood (this version runs on
a small modern model, but the shape is the same). Try a make/model, or
ask about your trade-in.
What's running under it
Persona lock: vintage 2023 dealer assistant. Stays in character through every refusal.
Untrusted-input isolation: every user message gets wrapped in a <untrusted_user_input> tag, and a system directive tells the model to treat anything inside it as data, never as instructions.
Hard refusal patterns: regex-matched abuse shapes (prompt injection, code generation, jailbreaks) get refused before a model token is spent.
Single tool: lookup_inventory(query). Always returns a tongue-in-cheek result. The model summarizes in voice.
Rate limits: 5 messages per minute and 15 per hour per IP, plus a 10K-token budget per session.
Per-message cap: 800 characters. Counter visible in the chat footer.
No state leakage: the model never sees other users' conversations or any internal config, and it won't reveal its system prompt.
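Three of the items above (the 800-character cap, the regex refusals, and the <untrusted_user_input> wrapper) can be sketched as one pre-model gate in Python. The patterns, the refusal wording, and the function names `gate` and `wrap_untrusted` are illustrative assumptions; the demo's real patterns are not published here and are presumably broader.

```python
import html
import re
from typing import Optional

MAX_CHARS = 800  # per-message cap from the list above

# Illustrative abuse shapes only (assumed, not the demo's actual regexes).
REFUSAL_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"\b(system prompt|jailbreak)\b", re.I),
    re.compile(r"\bwrite (me )?(a |some )?(python|javascript|code|script)\b", re.I),
]

# Canned, in-voice refusal (hypothetical wording).
REFUSAL = "I only talk about cars. What make or model are you after?"


def gate(message: str) -> Optional[str]:
    """Return a canned reply if the message is blocked, else None.

    This runs before any model call, so a blocked message costs no tokens.
    """
    if len(message) > MAX_CHARS:
        return f"Please keep messages to {MAX_CHARS} characters or fewer."
    if any(pattern.search(message) for pattern in REFUSAL_PATTERNS):
        return REFUSAL
    return None


def wrap_untrusted(message: str) -> str:
    """Wrap user text in the isolation tag.

    Angle brackets are escaped so the text cannot close the tag early
    and smuggle content outside it.
    """
    safe = html.escape(message, quote=False)
    return f"<untrusted_user_input>{safe}</untrusted_user_input>"
```

A request handler would call `gate` first and send `wrap_untrusted(message)` to the model only when it returns None. Regex matching catches known shapes cheaply; the wrapper and the system directive are what cover the shapes no pattern anticipated.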
↳ build for the brand, not for the lab. The one-tool surface beats the
everything-tool surface every time the boss reads the chat logs.
~ on the workbench ~
The tooling.
GPT-5 Nano (this demo)
GPT-3.5 (the original 2023 build)
OpenAI Chat Completions API
Hand-rolled function calling
Astro SSR routes
Per-IP rate limit
<untrusted_user_input> tag isolation
~ counterfactual ~
What would have been worse.
Without the constrained tool-calling discipline, a chat surface attached to a brand-name dealer site becomes a free LLM playground. Customers ask for code samples, recipe ideas, political opinions, and competitors' inventory. The brand wears every one of those answers. With it, the wide model gets narrowed by careful prompt design, deterministic tool boundaries, and a refusal posture that stays in voice, and the site becomes a dealer chat that talks about cars and only cars.