Some workloads can't leave the building. Health records, legal discovery, financial transactions, anything under an air-gap mandate — for these, the question isn't "which model is smartest," it's "which model can I run where the data already lives." Our default open-weight pick is DeepSeek V4: strong reasoning and code at very low cost, fully self-hostable, with one real limit — it's text-only. Switch to Llama 4 Maverick when you need open-weight multimodality and a 1M context, and to Mistral Large 3 for European data residency and strong multilingual coverage. And if you genuinely can't self-host, a closed model on a zero-data-retention tier is often the more honest call.
How We Evaluated
Private and self-hosted workloads are scored on a different axis than everything else in the "Which AI Model for X" series. The question that decides the project isn't raw capability — it's whether you can put capability where the data already lives, under a control regime you can defend to an auditor. A model that's a point smarter but forces you to ship sensitive bytes to someone else's API is the wrong answer for a hospital, a bank, or a defense contractor.
So the matrix is limited to the three open-weight options that you can actually download and run — DeepSeek V4, Llama 4 Maverick, and Mistral Large 3 — and the dimensions are the ones that determine whether a private deployment succeeds:
- Open weights / self-hostable — can you legally and practically download the weights and run them in your own environment? All three qualify; this column is the table stakes that closed models can't meet.
- Reasoning & code quality — once the data is safe, is the model actually good enough to do the work? This is where open-weight options used to lag and increasingly don't.
- Multimodality — can it natively read images, not just text? This is the single biggest dividing line in this matrix, and the one that most often forces a switch off the default.
- Context window — how much private context (long documents, whole repos, full case files) fits in a single pass without external retrieval machinery.
- Data residency / compliance — how cleanly does the deployment story map to GDPR, HIPAA-style constraints, sector rules, and air-gap mandates? Self-hostability gives all three a strong baseline; jurisdiction of the vendor is the tiebreaker.
- Hosting cost & complexity — the operational reality: GPU footprint, serving-stack maturity, and how much platform work it takes to run reliably. The model is free; running it is not.
Honesty disclaimer. Capability ratings here (Best-in-class, Strong, Adequate, Limited) are qualitative judgments from real private-deployment workloads as of June 2026, not synthetic benchmark scores — public open-weight leaderboards shift every time a new checkpoint lands, so a stale percentage is worse than a careful qualitative read. Context-window ceilings on open-weight models also depend on your host and serving configuration, so we rate them relatively rather than quoting a single fixed number. And one thing the table deliberately can't show: the closed-model alternative for teams that can't self-host. We cover that in prose below, because pretending a closed ZDR tier is "open-weight" would be dishonest.
The Decision Matrix
All three models clear the one gate that closed models can't: you can self-host them. From there the decision is driven mostly by two dimensions, multimodality and the operational cost of running the thing. Read the matrix that way and the picks fall out cleanly: DeepSeek V4 for the cheapest strong text reasoning, Llama 4 Maverick when you need to see images, Mistral Large 3 when the data has to stay in Europe.
| Dimension | DeepSeek V4 | Llama 4 Maverick | Mistral Large 3 |
|---|---|---|---|
| Open weights / self-hostable | Yes | Yes | Yes |
| Reasoning & code quality | Best-in-class | Strong | Strong |
| Multimodality | Limited (text-only) | Best-in-class | Strong |
| Context window | Strong | Best-in-class | Strong |
| Data residency / compliance | Best-in-class | Best-in-class | Best-in-class |
| Hosting cost & complexity | Strong | Adequate | Strong |
The two dimensions that most often separate these models are multimodality and hosting cost. DeepSeek V4 wins on cost-per-capability for pure text reasoning but is text-only. Llama 4 Maverick wins on multimodality and context window but costs more to run well. Mistral Large 3 sits in the middle on capability and earns its place on European residency: the residency row is a three-way tie because any of these models can be self-hosted wherever you choose, so Mistral's edge is the vendor's jurisdiction rather than the rating itself. If your workload is text-only, the default is obvious. The moment an image enters the pipeline, the default moves.
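The selection logic described above can be written down as a small rule, which is handy when the choice has to be documented for an audit. This is a minimal sketch of this article's defaults only; the `Workload` fields and the `pick_model` function are hypothetical names introduced for illustration, and the ordering (residency first, then vision) is one reasonable reading of the guidance, not a fixed procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a private workload (field names are illustrative)."""
    can_self_host: bool      # the team can run GPU infrastructure to a high standard
    needs_vision: bool       # some step must read images, scans, or charts
    needs_eu_vendor: bool    # European jurisdiction is part of the requirement


def pick_model(w: Workload) -> str:
    """Return this article's default pick for the given workload."""
    if not w.can_self_host:
        # No open-weight pick applies if the deployment cannot be run well.
        return "closed model on a ZDR tier"
    if w.needs_eu_vendor:
        # Mistral Large 3 covers residency and some image handling together.
        return "Mistral Large 3"
    if w.needs_vision:
        return "Llama 4 Maverick"
    return "DeepSeek V4"


print(pick_model(Workload(can_self_host=True, needs_vision=False, needs_eu_vendor=False)))
```

Real decisions usually involve more inputs (context length, budget, language mix), so treat this as a starting checklist rather than a complete decision tree.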
DeepSeek V4: When It's the Right Call
DeepSeek V4 is the default for private and self-hosted work because it gives you the best ratio of capability to running cost in the open-weight world. It's a mixture-of-experts model, so it activates only a slice of its parameters per token — inference is meaningfully cheaper and faster than a dense model of comparable quality, which is the number that dominates your spreadsheet once you own the GPUs instead of paying per token. Its reasoning and code quality are the strongest in this matrix, and it reasons while calling tools, so it slots into agentic pipelines running entirely behind your firewall.
Where it shines:
- Air-gapped and on-prem reasoning, code generation, and analysis where no token may ever touch a third-party API.
- Cost-controlled scale: high-volume internal pipelines where owning the model economics beats per-token API pricing.
- Agentic workflows behind the firewall — V4 reasons through tool calls, so internal agents work without external dependencies.
- Caching-heavy pipelines: V4 supports context caching that cuts cost further on repeated stable prefixes.
- Regulated-data backends (health, finance, legal) where the deployment story has to be "the data never left."
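For the caching point in the list above, the practical habit is to keep the instruction block byte-identical across calls and put the variable record last, so that every request begins with the same prefix. The sketch below only illustrates that prompt layout. How a cache is actually keyed depends on your serving stack, and `STATIC_PREFIX`, `build_prompt`, and `shared_prefix_chars` are hypothetical names, not part of any DeepSeek API.

```python
import os

# Hypothetical instruction block; in a real pipeline this is your fixed system prompt.
STATIC_PREFIX = (
    "You are an internal document-processing engine.\n"
    "Return only the final JSON object.\n"
    "Record:\n"
)


def build_prompt(record_text: str) -> str:
    """Put the stable instructions first and the variable record last."""
    return STATIC_PREFIX + record_text


def shared_prefix_chars(prompts: list[str]) -> int:
    """Length, in characters, of the longest prefix shared by every prompt."""
    return len(os.path.commonprefix(prompts))


prompts = [build_prompt("Invoice 17 overdue"), build_prompt("Contract renewal due")]
# The whole instruction block is shared, so it is a candidate for prefix reuse.
print(shared_prefix_chars(prompts) == len(STATIC_PREFIX))
```

Anything that varies per request (timestamps, record IDs, user names) belongs after the stable block; putting it at the top shortens the shared prefix to almost nothing.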
The one limit to plan around. DeepSeek V4 is text-only. There is no native vision — it cannot read an image, a scanned form, or a chart. For a great many private workloads (log analysis, document text, code, structured records) that's irrelevant. But if any step needs to understand a picture, V4 can't do it, and bolting on a separate vision model adds a second system to secure and operate. When multimodality is in scope, the default switches.
Watch-outs. Self-hosting V4 still means you own the serving stack, the scaling, and the security perimeter — open weights make data private, but only if you run them well. And as with any open-weight model, you're responsible for your own eval harness and safety layer; there's no vendor moderation endpoint to lean on. For the broader build-vs-buy and prompt-vs-tune decision around models like this, our fine-tuning vs prompting vs RAG guide walks through when customizing an open-weight base actually pays off.
Llama 4 Maverick: When It's the Right Call
Llama 4 Maverick is the model you switch to the moment a private workload has to understand images. It's Meta's open-weight flagship and is natively multimodal — text and images together — which is precisely the capability DeepSeek V4 lacks. It also carries a 1M-token context window (the exact ceiling depends on your host), so long private documents, large case files, and whole repositories fit in a single pass without standing up a retrieval pipeline first.
Where it shines:
- Multimodal open-weight needs: invoices, scanned forms, screenshots, charts, and mixed image-and-text documents that must be processed privately.
- Long-context private analysis: hundreds of pages or a full codebase in one prompt, kept entirely in-house.
- Teams that want open-weight portability but not the ops burden of bare-metal self-hosting: Maverick is available on managed open-weight providers like Groq, Together, and Fireworks, so you can run it without operating your own hardware while staying off the big closed-API vendors.
- Mixed pipelines where one model needs to handle both the document text and the document images.
Why it earns the switch. Native multimodality in an open-weight model is rare and valuable. If your private corpus is full of PDFs that are really images, or forms that need to be read as pictures, a text-only model forces a fragile OCR-plus-LLM contraption; Maverick reads the page directly. The 1M context is the second draw — it collapses retrieval-heavy designs into single-shot reads for documents that fit.
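Whether a corpus "fits" is worth checking before committing to a single-shot design. The sketch below is a rough pre-flight estimate, assuming about four characters per token. That ratio is a common rule of thumb for English text, not the model's tokenizer, and the function names and the output reserve are hypothetical. Use your serving stack's real tokenizer and configured limit for anything that matters.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the chars-per-token ratio is an assumption, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_one_pass(
    documents: list[str],
    context_limit: int,
    reserved_for_output: int = 4_000,
) -> bool:
    """True if all documents plus an output reserve fit under the configured limit."""
    needed = sum(estimate_tokens(d) for d in documents) + reserved_for_output
    return needed <= context_limit


case_file = ["page text " * 500, "appendix " * 2_000]  # placeholder documents
print(fits_in_one_pass(case_file, context_limit=1_000_000))
```

If the check fails, or passes only narrowly, a retrieval step is still the safer design, since the usable ceiling depends on your host.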
Watch-outs. A natively multimodal model is heavier to serve, so hosting cost and complexity are a notch above DeepSeek V4 — that's the trade you're making for vision. Reasoning and code quality are strong but not quite at V4's level for pure text reasoning, so if multimodality isn't actually required, you're paying operational overhead for a feature you won't use. For the head-to-head on vision-heavy document work specifically, see our companion piece on vision, chart, and PDF understanding.
Mistral Large 3: When It's the Right Call
Mistral Large 3 earns its place on jurisdiction and language. Mistral is a European vendor with an open-weight flagship, so you can self-host the model inside an EU region or on-prem in Europe and keep both the weights and the data under European jurisdiction. That is a materially cleaner GDPR and sector-compliance story than running a non-European vendor's model, even when that model is itself open-weight. Its multilingual coverage is a genuine strength, particularly across European languages, where models tuned primarily on English tend to underperform.
Where it shines:
- European data residency: regulated EU workloads where the data and ideally the vendor relationship must stay within European jurisdiction.
- Multilingual private workloads: customer data, documents, or support content across many European languages.
- Multimodal-but-EU needs: it's multimodal and inexpensive, a reasonable middle path when you need some image handling and European residency together.
- Cost-controlled European scale where Mistral's pricing and self-host story both apply.
Why it earns the switch. When "the data must stay in Europe" or "we serve many European languages well" is non-negotiable, Mistral Large 3 answers both in one model. The European-vendor angle isn't cosmetic — for some buyers, contracting with an EU company materially simplifies the compliance and procurement conversation.
Watch-outs. For pure text reasoning at the lowest running cost, DeepSeek V4 usually wins; for the most capable open-weight multimodality and the largest context, Llama 4 Maverick usually wins. Mistral Large 3 is the right pick when its specific strengths — European residency and multilingual quality — are the deciding constraints, not when you simply want the strongest or cheapest model in the abstract.
Which to Pick by Sub-Segment
Regulated data (health, finance, legal)
Default: DeepSeek V4. When the requirement is "this data physically cannot leave our perimeter," a self-hosted open-weight model is the answer, and V4 gives you the strongest text reasoning per GPU dollar. Run it in an isolated VPC or on-prem cluster, log nothing externally, and your "the data never left" story holds. Switch to Llama 4 Maverick if the regulated documents are images or scans that must be read as pictures. Switch to Mistral Large 3 if the regulator is European and jurisdiction is part of the requirement.
On-prem and air-gapped
Default: DeepSeek V4. Air-gap is the purest self-host case — no internet egress at all — so an efficient, capable open-weight model that you can run on a fixed GPU budget is exactly right, and V4's mixture-of-experts efficiency keeps that budget sane. Switch to Llama 4 Maverick only if the air-gapped workload genuinely needs vision, since the higher serving cost is harder to amortize inside a fixed on-prem footprint. Either way, plan the GPU sizing and the offline eval harness up front — there's no API to fall back on.
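A first-pass GPU sizing estimate can be done on paper before any hardware is ordered. The sketch below assumes a typical serving setup in which every expert of a mixture-of-experts model is held in GPU memory, so memory scales with total parameters even though per-token compute scales with the active slice. All numbers in the example are placeholders rather than published figures for any model in this article, and the 30% headroom for KV cache and activations is an assumption you should replace with measurements.

```python
import math


def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1e9 params x bytes per param / 1e9)."""
    return total_params_billion * bytes_per_param


def gpus_needed(
    total_params_billion: float,
    bytes_per_param: float,
    gpu_memory_gb: float,
    overhead_fraction: float = 0.3,  # assumed headroom for KV cache and activations
) -> int:
    """GPU count needed to hold the weights plus the assumed headroom."""
    required = weight_memory_gb(total_params_billion, bytes_per_param) * (1 + overhead_fraction)
    return math.ceil(required / gpu_memory_gb)


# Placeholder example: a 400B-parameter model at 8-bit precision on 80 GB GPUs.
print(gpus_needed(total_params_billion=400, bytes_per_param=1.0, gpu_memory_gb=80))
```

Long contexts and high concurrency grow the KV cache well beyond a flat percentage, so rerun the estimate with your real batch sizes before fixing the footprint.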
Multimodal open-weight needs
Default: Llama 4 Maverick. This is the segment where the default flips. If you must understand images privately and keep the model open-weight, Maverick is the pick — native multimodality plus a 1M context. Mistral Large 3 is the alternative when you also need European residency. Don't reach for DeepSeek V4 here — it's text-only, and stitching a separate vision model onto it usually costs more in complexity than just running Maverick.
Multilingual and EU residency
Default: Mistral Large 3. When the work spans many European languages and the data must stay in Europe, Mistral answers both in one model from a European vendor. Stay on DeepSeek V4 only if the multilingual need is light and the real constraint is cost-controlled text reasoning. For the cost-tradeoff lens specifically, our cost-sensitive workloads guide covers how to think about running these models at volume.
Cost-controlled scale
Default: DeepSeek V4. At high volume, owning the model economics beats per-token API pricing, and V4's MoE efficiency plus context caching make it the cheapest strong option to run yourself. Consider Mistral Large 3 if European residency is also in play. Llama 4 Maverick is the cost-controlled pick only when multimodality is mandatory, since its serving cost is higher. The deeper cost math — when self-hosting actually beats an API bill — lives in the cost-sensitive workloads guide.
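The core of that cost comparison is a break-even volume: the monthly token count above which a fixed self-hosting bill is cheaper than paying per token. A minimal sketch follows, with hypothetical function names and placeholder prices. It deliberately ignores staffing, utilization, and redundancy, all of which push the real break-even higher.

```python
def breakeven_tokens_per_month(
    monthly_self_host_cost: float,
    api_price_per_million_tokens: float,
) -> float:
    """Monthly token volume at which self-hosting and API pricing cost the same."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return monthly_self_host_cost / api_price_per_million_tokens * 1_000_000


def self_host_is_cheaper(
    monthly_tokens: float,
    monthly_self_host_cost: float,
    api_price_per_million_tokens: float,
) -> bool:
    """True when the expected volume exceeds the break-even point."""
    return monthly_tokens > breakeven_tokens_per_month(
        monthly_self_host_cost, api_price_per_million_tokens
    )


# Placeholder numbers: 20,000 per month for the cluster, 2.00 per million API tokens.
print(breakeven_tokens_per_month(20_000, 2.0))
```

With these placeholder inputs the break-even is ten billion tokens a month, which shows why self-hosting for cost alone only pays off at sustained high volume.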
When a closed ZDR tier is the better call
Default: a closed model on a zero-data-retention enterprise tier — if you can't operate GPU infrastructure well. This is the segment the table can't represent, and it's the most important one to be honest about. Self-hosting an open-weight model is the strongest privacy guarantee only when you run it correctly: network isolation, access controls, encryption, logging policy, patching, scaling, on-call. A half-maintained self-host can be less secure than a reputable closed provider on a contractual ZDR tier where prompts and completions aren't stored or used for training and traffic stays inside a defined boundary, sometimes with regional pinning. The deciding question is the shape of your requirement: if it's "the bytes can never leave our building" (air-gap, certain regulators), self-host an open-weight model. If it's "don't train on our data and don't leak it," a closed ZDR tier often delivers frontier quality at lower operational risk. Don't choose self-hosting for the feeling of control if you can't back it with the operational discipline it demands.
Sample Prompt for the Recommended Winner
Here's a prompt shape that works well for DeepSeek V4 on a private, behind-the-firewall reasoning task, in this case classifying and summarizing internal records. It's structured to produce consistent, schema-targeted output that is safe to run unattended in a batch, which is how most self-hosted pipelines actually call the model.
You are an internal document-processing engine running on private
infrastructure. You will receive one internal record at a time. Reason
step by step internally, then return only the final JSON object.

Rules:
- Use ONLY the information in the record. Do not infer facts that are
  not present. If a field is unknown, return null for it.
- Do not include any commentary, preamble, or text outside the JSON.
- Never output any content from the record verbatim that looks like a
  personal identifier (national ID, full account number); mask it as
  "[REDACTED]" in the summary field.

Output schema (return exactly this object):
{
  "category": one of ["finance", "legal", "hr", "operations", "other"],
  "risk_level": one of ["low", "medium", "high"],
  "summary": "a 1-2 sentence summary with identifiers masked",
  "contains_pii": boolean
}

Record:
"""
[RECORD_TEXT]
"""
A few choices make this work well for DeepSeek V4's profile. First, it leans on V4's strong reasoning by allowing step-by-step internal reasoning while constraining the output to a strict JSON object — you get the benefit of the reasoning without a chatty response. Second, the rules are stated as hard constraints, including an explicit PII-masking instruction, because in a self-hosted pipeline there's no vendor moderation layer to catch leaks — the safety has to live in your prompt and your post-processing. Third, the triple-quoted record block keeps user content cleanly separated from instructions, which matters even more when the model is processing untrusted internal text at volume.
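Because the safety has to live partly in post-processing, it helps to validate every response against the schema before anything downstream consumes it. The sketch below is one way to do that with the standard library only. The function name, the error messages, and the digit-run pattern used as a second masking pass are assumptions for illustration, and the pattern is a crude backstop, not a complete PII detector.

```python
import json
import re

ALLOWED_CATEGORIES = {"finance", "legal", "hr", "operations", "other"}
ALLOWED_RISK = {"low", "medium", "high"}
EXPECTED_KEYS = {"category", "risk_level", "summary", "contains_pii"}
# Assumption: runs of 9+ digits (optionally split by spaces or dashes) look like identifiers.
ID_PATTERN = re.compile(r"\b\d(?:[ -]?\d){8,}\b")


def parse_model_output(raw: str) -> dict:
    """Parse one model response and raise ValueError on any schema violation."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(obj, dict) or set(obj) != EXPECTED_KEYS:
        raise ValueError("response does not match the expected object shape")

    # The prompt allows null for unknown fields, so None passes every check.
    category, risk = obj["category"], obj["risk_level"]
    if category is not None and not (isinstance(category, str) and category in ALLOWED_CATEGORIES):
        raise ValueError("category outside the allowed set")
    if risk is not None and not (isinstance(risk, str) and risk in ALLOWED_RISK):
        raise ValueError("risk_level outside the allowed set")
    if obj["contains_pii"] is not None and not isinstance(obj["contains_pii"], bool):
        raise ValueError("contains_pii must be a boolean or null")

    summary = obj["summary"]
    if summary is not None:
        if not isinstance(summary, str):
            raise ValueError("summary must be a string or null")
        # Second line of defence in case the model skipped the masking rule.
        obj["summary"] = ID_PATTERN.sub("[REDACTED]", summary)
    return obj


raw = (
    '{"category": "finance", "risk_level": "high", '
    '"summary": "Account 1234-5678-9012 flagged", "contains_pii": true}'
)
print(parse_model_output(raw)["summary"])
```

In a batch job, a `ValueError` is a natural signal to retry the record or route it to a review queue instead of writing a malformed row.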
Closing
The default pick for private and self-hosted workloads in 2026 is DeepSeek V4: open-weight, self-hostable, best-in-class open reasoning and code, very low running cost — with the single honest caveat that it's text-only. Switch to Llama 4 Maverick when the workload needs open-weight multimodality and a 1M context, and to Mistral Large 3 when European data residency and multilingual quality are the deciding constraints. And if you can't actually operate GPU infrastructure to a high standard, don't force a self-host for the feeling of control — a closed model on a zero-data-retention tier is frequently the better risk-adjusted choice.
Two reads pair naturally with this one: the AI model selection guide for the full decision tree across every task, and which AI model should you use for the quick-start version. If you're weighing whether to customize an open-weight base for your private data, the fine-tuning vs prompting vs RAG guide covers that decision directly.
Once you've picked your base model, the prompt is what makes it reliable behind your firewall. Describe what you need and let our AI prompt generator build the structured, schema-targeted prompt for you — no tuning loop required.
