- The biggest efficiency win with an LLM is deciding when not to call it. Most app logic shouldn't touch the model.
- Treat the LLM as one well-bounded service behind a clean interface, not a thing your whole app talks to directly.
- Structured, specific prompts with the context the model needs — and nothing it doesn't — cut both cost and latency.
- Design for the model being slow or wrong: stream output, fail gracefully, and let the user stay in control.
It's easy to bolt an LLM onto an app. It's harder to do it so the feature is fast, affordable, and genuinely useful instead of a gimmick that burns tokens. I built BidHound — a mobile app that reads a freelance job post and drafts a tailored proposal — and the interesting engineering wasn't "call the API." It was everything around the call that made it efficient.
Here's what actually matters when you put an LLM into a real product, drawn from building one.
The first efficiency decision: when NOT to call the model
The most expensive mistake in LLM apps is routing things through the model that never needed it. An LLM call is typically the slowest and costliest operation in your app, often by orders of magnitude compared with ordinary code. So the first design question isn't "how do I prompt this"; it's "does this step need the model at all?"
In BidHound, none of the following touches the LLM: parsing a job post's structure, validating input, formatting the final output, storing drafts, handling auth. Ordinary code does all of it faster and for free. The model is reserved for the one thing only it can do well: turning a job description plus the user's background into persuasive, tailored prose. Everything that can be deterministic should be.
An LLM is a specialist, not a general-purpose runtime. Every token you send and receive costs money and time. Use the model only where its actual strength — open-ended language generation — is what the task requires.
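One way to picture that split is a handler that does all the deterministic work in plain code and only reaches the model when the request passes. This is a minimal sketch, not BidHound's actual code: `DraftRequest`, `validate`, `draft_proposal`, the `MAX_POST_CHARS` limit, and the injected `call_model` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_POST_CHARS = 8000  # hypothetical limit, chosen only for this sketch


@dataclass
class DraftRequest:
    job_post: str
    background: str


def validate(req: DraftRequest) -> Optional[str]:
    """Return an error message, or None if the request is worth a model call."""
    if not req.job_post.strip():
        return "Job post is empty."
    if len(req.job_post) > MAX_POST_CHARS:
        return "Job post is too long."
    if not req.background.strip():
        return "Add your background first."
    return None


def draft_proposal(req: DraftRequest, call_model: Callable[[str], str]) -> dict:
    # Deterministic checks run first; a bad request never costs a token.
    error = validate(req)
    if error is not None:
        return {"ok": False, "error": error, "model_called": False}

    # Only the open-ended writing step goes to the model.
    prompt = (
        f"Job post:\n{req.job_post.strip()}\n\n"
        f"Background:\n{req.background.strip()}"
    )
    text = call_model(prompt)

    # Output cleanup is ordinary code again.
    return {"ok": True, "draft": text.strip(), "model_called": True}
```

Passing `call_model` in as a parameter keeps the sketch independent of any particular provider and makes the "model was never called" path easy to check in a test.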
Put the model behind one clean boundary
A production LLM feature should sit behind a single, well-defined service on the backend — not be called ad hoc from all over the app. In BidHound, the Android client never talks to the model directly; it calls the backend, and the backend owns the one path that talks to the LLM API.
That boundary buys you a lot:
- Your API key never ships to the client, where it could be extracted. It lives only on the server.
- You can change models or providers without touching the app — the contract between client and backend stays the same.
- Cost and rate-limit controls live in one place — you can add caching, throttling, or quotas server-side without redeploying the mobile app.
- Prompt logic is centralized, so improving the prompt improves every user immediately.
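The benefits above could be sketched as a single backend class that owns the only path to the model. Everything here is an assumption for illustration: `DraftService`, `QuotaExceeded`, the per-user quota, and the injected `provider` callable are hypothetical, and the in-memory cache and counters stand in for whatever storage a real backend would use.

```python
import hashlib
from typing import Callable, Dict


class QuotaExceeded(Exception):
    """Raised when a user has used up their allowed model calls."""


class DraftService:
    """The one server-side place that talks to the LLM provider."""

    def __init__(self, provider: Callable[[str], str], quota: int = 20):
        # The provider wraps the real API call and holds the API key,
        # so the key stays on the server. Swapping models or vendors
        # means swapping this one callable.
        self._provider = provider
        self._quota = quota
        self._cache: Dict[str, str] = {}
        self._usage: Dict[str, int] = {}  # never reset in this sketch

    def draft(self, user_id: str, prompt: str) -> str:
        # Identical prompts are served from the cache at no cost.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]

        used = self._usage.get(user_id, 0)
        if used >= self._quota:
            raise QuotaExceeded(f"user {user_id} reached the limit of {self._quota}")

        text = self._provider(prompt)
        self._usage[user_id] = used + 1
        self._cache[key] = text
        return text
```

Because the client only ever sees the backend's own contract, caching, quotas, and the choice of provider can all change behind `draft` without a new app release.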
Prompt structure is a performance decision, not just a quality one
People treat prompt engineering as purely about output quality. It's also about efficiency. Every unnecessary word in a prompt is latency and cost on every single call, multiplied across all users.
The discipline that works: give the model exactly the context it needs to do the job, structured clearly, and nothing extraneous. For bid drafting that means the job post and the relevant slice of the user's background — not their entire profile, not boilerplate instructions repeated every call. Be specific about the desired output shape so you don't pay for a rambling response you then have to trim.
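As a rough illustration of sending only "the relevant slice" of a profile, the sketch below ranks background entries by word overlap with the job post and keeps the top few. The keyword-overlap heuristic, the stopword list, and the name `relevant_background` are assumptions made for this example, not a description of how BidHound selects context.

```python
import re
from typing import List, Set

# Small hypothetical stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "for", "with", "in", "of", "and", "to", "on"}


def _words(text: str) -> Set[str]:
    """Lowercased content words, keeping tokens like 'c++' or 'c#' intact."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower())) - STOPWORDS


def relevant_background(job_post: str, entries: List[str], limit: int = 3) -> List[str]:
    """Return up to `limit` entries that share the most words with the job post."""
    job_words = _words(job_post)
    scored = [(len(job_words & _words(entry)), index, entry)
              for index, entry in enumerate(entries)]
    # Drop entries with no overlap; sort by score, then by original order.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [entry for _, _, entry in ranked[:limit]]
```

Even a crude filter like this keeps unrelated profile entries out of the prompt, which is where the per-call savings come from.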
Specific beats clever
A precise instruction (the tone, the length, the structure you want) produces usable output on the first try more often, which means fewer retries. Each retry roughly doubles the cost and the wait for that request. Getting it right the first time is itself the optimization.
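A prompt builder that spells out tone, length, and structure might look like the sketch below. The wording of the instructions, the default values, and the helper names `build_prompt` and `within_limit` are hypothetical; the point is only that the output shape is stated explicitly and can be checked in code afterwards.

```python
from typing import List


def build_prompt(job_post: str, background: List[str],
                 tone: str = "professional", max_words: int = 150) -> str:
    """Assemble a short prompt that states the desired output shape up front."""
    lines = [
        "Write a proposal for the job post below.",
        f"Tone: {tone}.",
        f"Length: at most {max_words} words.",
        "Structure: one opening sentence, two or three sentences on "
        "relevant experience, one closing sentence with a next step.",
        "Return only the proposal text.",
        "",
        "Job post:",
        job_post.strip(),
        "",
        "Relevant background:",
    ]
    lines.extend(f"- {item.strip()}" for item in background)
    return "\n".join(lines)


def within_limit(text: str, max_words: int) -> bool:
    """Cheap deterministic check on the model's reply before showing it."""
    return len(text.split()) <= max_words
```

Checking the reply with `within_limit` is ordinary code, so the app can decide whether a draft is acceptable without spending another model call.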
Design for the model being slow or wrong
An LLM call is slow by app standards — often seconds. If your UI freezes while it waits, the feature feels broken even when it's working. The fixes are architectural:
- Stream the output so the user sees words appearing immediately instead of staring at a spinner.
- Keep the user in control. The model drafts; the human edits and decides. That's both a better product and a safer one — the user is the quality gate.
- Fail gracefully. The API will occasionally be slow, rate-limited, or return something off. Handle it as a normal path: a clear message and a retry, never a crash or a hang.
The model is a draft engine, not an oracle. The feature that respects that — fast first output, human in the loop, graceful failure — is the one people trust and keep using. Overpromising autonomy is how LLM features lose users.
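The three fixes can be combined in one small loop: forward chunks to the UI as they arrive, retry a transient failure once with a short delay, and fall back to a clear message instead of raising. This is a sketch under stated assumptions: `TransientModelError`, `stream_draft`, `draft_with_fallback`, and the callback names are hypothetical, and `start_stream` stands in for whatever streaming call the provider's SDK offers.

```python
import time
from typing import Callable, Iterable


class TransientModelError(Exception):
    """Hypothetical error type for timeouts and rate limits."""


def stream_draft(chunks: Iterable[str], on_chunk: Callable[[str], None]) -> str:
    """Forward each chunk to the UI as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


def draft_with_fallback(
    start_stream: Callable[[], Iterable[str]],
    on_chunk: Callable[[str], None],
    on_reset: Callable[[], None],
    attempts: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    for attempt in range(attempts):
        on_reset()  # clear any partial text left over from a failed attempt
        try:
            text = stream_draft(start_stream(), on_chunk)
            return {"ok": True, "draft": text}
        except TransientModelError:
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # simple exponential backoff
    # Failure is a normal result, not an exception the UI has to survive.
    on_reset()
    return {"ok": False,
            "message": "The draft could not be generated right now. Please try again."}
```

The returned draft is only a starting point: the caller puts it into an editable field, so the user stays the final quality gate.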
A reusable foundation pays off
Much of what makes an LLM feature production-grade — the auth, the clean client/backend boundary, the request handling — isn't specific to one app. Building those once as a reusable foundation means the next AI-powered feature starts from a working, secure base instead of from scratch. The efficiency compounds beyond any single product.
The pattern underneath all of this is simple: an LLM is a powerful, expensive specialist. Build deliberately around it — call it only when it earns its place, wrap it in one clean boundary, give it precisely what it needs, and design for it being slow and occasionally wrong — and you get a feature that's genuinely useful instead of a costly novelty.