Vendor evaluations fail in two directions. In the first, the team has no real criteria — the decision gets made by whoever ran the best demo. In the second, criteria exist but are shaped to fit the preferred vendor, and the scoring is a post-hoc justification of a decision already made. AI can help with the mechanical work around both failures — generating a rubric, applying it consistently, surfacing risks — but only if the evaluator feeds the model real vendor materials. Generic "compare vendors X and Y" prompts produce feature claims the model invented, and those inventions are hard to catch because they sound plausible.
This post sits in the operations track of our prompt engineering for business teams guide and pairs with AI SOP writing prompts, AI process automation prompts, and AI competitor analysis.
Why "Compare Vendor X and Vendor Y" Fails
The generic ask is the fastest way to produce confidently wrong output. Ask a model to compare two SaaS vendors it has not been given documents for, and it will return a clean table — feature by feature, with checkmarks and a concluding paragraph on trade-offs. Much of the table will be wrong.
The failure mode is hallucinated features. The model has seen enough vendor marketing pages to have a prior on what a vendor in a given category probably does, and it fills the table with that prior. Sometimes the prior is current. Often it is stale, confused across competitors, or invented to complete the shape of the answer. The reviewer cannot tell, because the output has the same visual authority as a correct comparison.
The second failure is feature-list thinking. Real vendor decisions turn on trade-offs the feature list hides — support responsiveness, data portability, integration depth, roadmap alignment, financial stability. A checkmark grid flattens those into binaries. A vendor that "supports SSO" and a vendor where SSO is a $20,000-a-year add-on get the same checkmark.
The third is no weighting. Two vendors tied on feature count can be radically different fits because the features that matter to this team are not the ones the model weighted. Without a rubric tied to the buyer's actual requirements, the comparison is a shape, not a decision.
Pattern 1: Scoring Criteria Generation
The first useful job for AI in vendor evaluation is generating a rubric. Given a vendor category and a short description of what the buying team needs, the model can produce a starting list of dimensions — functional coverage, integration fit, security posture, commercial terms, vendor viability, support model — with suggested weightings and concrete scoring anchors.
The prompt has to do three things. Name the category specifically (not "a CRM" but "a CRM for a 30-person B2B sales team moving off spreadsheets, with a Slack-first workflow"). Ask for dimensions with sub-criteria, not top-level abstractions. And require scoring anchors — what a 1, 3, and 5 look like on each dimension — because anchor-less rubrics become whatever the evaluator wants them to be.
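To make the anchor requirement concrete, here is what anchors for one sub-criterion might look like. The dimension, sub-criterion, and thresholds are hypothetical, written for the Slack-first buyer described above, not taken from any real rubric.

Dimension: Integration fit (suggested weight: 20%)
Sub-criterion: Slack integration depth
  Level 1: No native Slack integration; the workaround is Zapier or manual export.
  Level 3: One-way Slack notifications for deal updates; no actions possible from Slack.
  Level 5: Two-way integration; deal updates post to channels and records can be updated from Slack without opening the CRM.

Each level names a state an evaluator can verify during a trial, which is what keeps scoring consistent across evaluators.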
The output is a draft. The team edits it: some dimensions get cut, some get reweighted, and the anchors get rewritten to match the team's actual thresholds. The point of generating the rubric with AI is not to skip that work; it is to start from a complete draft instead of a blank page.
Pattern 2: Weighted Comparison Against Real Materials
Once the rubric is set, the second job is applying it consistently across vendors. This is where the hallucination risk is highest and where the prompt design matters most. The model cannot apply the rubric to vendors it has no information about. It will pretend it can, and the output will look plausible.
The fix is mechanical. Feed the model actual vendor materials for each vendor — product docs, demo transcripts, security questionnaires, pricing pages, reference-call notes, RFP responses. Require a document citation for each score. When the materials contain no evidence for a dimension, require "insufficient evidence" rather than a score.
| Rubric dimension | What to feed | What it prevents |
|---|---|---|
| Functional coverage | Product docs, demo notes, feature pages captured during trial. | Feature claims the vendor does not actually make. |
| Integration fit | API docs, integration partner lists, implementation case studies. | Assumed integrations that are on the roadmap or do not exist. |
| Security posture | SOC 2 reports, security questionnaire responses, data processing agreements. | Invented certifications. |
| Commercial terms | Quote PDFs, pricing pages, contract redlines. | Hallucinated discounts or fabricated pricing tiers. |
| Support model | SLA documents, community forum samples, reference-call notes. | "Responsive support" with no evidence behind it. |
| Vendor viability | Funding announcements, customer count disclosures, company updates. | Plausible-sounding stability claims pulled from nowhere. |
The cited-evidence rule is non-negotiable. A rubric that scores every cell is less trustworthy than one that marks half "insufficient evidence." The gaps are the finding — they tell the team what to ask for in the next round of due diligence.
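A minimal sketch of the comparison prompt, in the same format as the rubric example later in this post. The bracketed placeholders stand in for the real rubric and the per-vendor document packs; substitute your own.

ROLE:
You are a procurement analyst applying a fixed rubric to vendor materials.
You score only from the documents supplied below. You do not use prior
knowledge of either vendor.
CONTEXT:
Rubric: [paste the calibrated rubric, with weights and scoring anchors]
Vendor A materials: [product docs, demo transcript, security questionnaire,
pricing quote, reference-call notes]
Vendor B materials: [same document types, same order]
TASK:
Score each vendor on each rubric dimension.
For every score, cite the document and the passage that supports it.
Where the supplied materials contain no evidence for a dimension, write
"insufficient evidence" instead of a score.
End with the three largest evidence gaps per vendor, phrased as questions
for the next round of due diligence.
ACCEPTANCE:
- Every score carries a document citation.
- No score rests on general knowledge of the vendor or its category.
- "Insufficient evidence" appears wherever the materials are silent.

The evidence-gap list at the end turns the "insufficient evidence" cells into the due-diligence checklist described above.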
Pattern 3: Risk Flagging
The third job is surfacing risks, which tend to hide in long documents the evaluators do not read closely. Security questionnaires, data processing addenda, standard contracts, and pricing schedules contain the non-standard terms that cause trouble after signing. A risk-flagging prompt takes those documents as input and produces a categorized list.
The categories worth separating:
- Financial — pricing escalators, minimum commitments, overage rates, currency and invoicing terms, auto-renewal and termination conditions.
- Security and compliance — data residency, sub-processor disclosures, breach notification windows, certifications that have lapsed or never applied to the relevant product line.
- Integration and data portability — export formats, export fees, API rate limits, dependencies on proprietary formats that make exit painful.
- Vendor lock-in and operational — contract length, termination-for-convenience clauses (or their absence), service credits that are meaningless in practice, professional-services dependencies.
- Roadmap and viability — features marketed as "coming soon" that appear to have been coming soon for multiple years, funding patterns, customer concentration signals.
For each flagged risk, the prompt should require the source document and the exact clause reference. A risk with no source is the model inventing risk from a pattern-match against similar vendors — plausible, sometimes correct, not trustworthy. The discipline is the same as the comparison pattern: evidence or "insufficient evidence," never confident synthesis.
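A sketch of the risk-flagging prompt under the same evidence rule. The document list and severity labels are illustrative.

ROLE:
You are reviewing one vendor's contract and security documents for a buyer.
You flag a risk only when you can point to a specific clause or passage.
CONTEXT:
Documents: [data processing addendum, standard contract, pricing schedule,
security questionnaire responses]
Risk categories: financial; security and compliance; integration and data
portability; vendor lock-in and operational; roadmap and viability.
TASK:
List every risk the documents support. For each risk give:
- The category and a one-sentence description of why it matters to this buyer.
- The source document and the exact clause or section reference.
- A severity label: dealbreaker, negotiate, or monitor.
If a category has no supportable risks, write "no evidence found" for that
category rather than inferring risks from similar vendors.
ACCEPTANCE:
- Every flagged risk names a document and a clause.
- Categories without evidence say so explicitly.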
Feeding Real Vendor Materials
This is the part teams skip because it is tedious. Demo transcripts have to be captured, security questionnaires collected, pricing PDFs retrieved, and the verbal "we can probably get you a discount" written down. Every shortcut here becomes a hallucination opportunity.
Useful rules for assembling the input pack. Capture demos as transcripts — video recordings are less useful to the model than text. Collect documents per vendor in a consistent structure so the prompt can reference "Vendor A's security questionnaire" reliably. Separate claims by source (vendor website vs. vendor-provided document vs. reference customer) so the prompt can weight them. And never paste a competitor's disparaging marketing content as evidence about another vendor; it contaminates the evaluation.
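One way to lay out the pack so the prompt can reference "Vendor A's security questionnaire" without ambiguity. The folder and file names are illustrative, not a required convention.

vendor-eval/
  rubric.md
  vendor-a/
    01-product-docs.md
    02-demo-transcript.md
    03-security-questionnaire.pdf
    04-pricing-quote.pdf
    05-reference-call-notes.md
  vendor-b/
    (same file names, same order)

Keeping the same file names and order for every vendor makes the citations in the scored output easy to check by hand.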
Never trust the model's prior. Training data is old, features change, products get renamed, companies get acquired. If a claim cannot be traced to a supplied document, it should not appear in the evaluation output.
Calibrating the Rubric
The first run of the rubric is never the final run. Calibration means applying the rubric to two or three vendors as a pilot, comparing the model's scoring against what the evaluation team would have scored manually, and adjusting until the two converge. Dimensions that everyone rates the same get cut for not discriminating. Dimensions where the model and the team disagree get rewritten — usually the anchors were ambiguous.
The calibration output is worth saving as a prompt template the team can reuse for the next vendor category. Each new category starts from the calibrated template and gets recalibrated — lighter the second time because the structure is proven.
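A hypothetical calibration pass might produce a log like the following. The dimensions, scores, and adjustments are invented to show the decision logic, not taken from a real evaluation.

| Dimension | Model score | Team score | Adjustment |
|---|---|---|---|
| Integration fit | 4 | 2 | Anchors were ambiguous about one-way vs. two-way Slack integration; rewritten to name the observable difference. |
| Security posture | 5 | 5 | Kept as-is. |
| Support model | 4 | 4 for every pilot vendor | Cut; the dimension was not discriminating among finalists. |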
Example: Scoring Rubric Prompt (Hypothetical)
A prompt for generating a starting rubric for a CRM evaluation. The example is hypothetical — category, buyer profile, and weightings are illustrative.
ROLE:
You are a procurement analyst drafting a vendor evaluation rubric.
You produce dimensions with concrete sub-criteria and scoring anchors.
You flag gaps rather than inventing criteria to complete the shape.
CONTEXT:
Category: CRM for a 30-person B2B sales team.
Buyer profile:
- Current state: Google Sheets pipeline, manual reporting.
- Integrations required: HubSpot (marketing), Slack, Gmail, Zoom.
- Team maturity: first CRM, low appetite for implementation cost.
- Annual budget ceiling: $30,000 all-in for year one.
Known constraints: data must stay in a US region; SSO required;
evaluator has four weeks end-to-end.
TASK:
Draft the rubric. For every dimension, include:
- A named dimension with one-sentence scope.
- Three to five sub-criteria, each observable.
- A suggested weight (percentage of total score).
- Scoring anchors for levels 1, 3, and 5. Each anchor must name
an observable state, not a quality adjective.
Cover at minimum: functional coverage, integration fit, security,
commercial terms, support model, and vendor viability. Add others
only if the buyer profile implies them.
When the buyer profile does not support a sub-criterion or anchor,
flag it as "[GAP: <what is missing>]" instead of inventing content.
ACCEPTANCE:
- Every dimension has a weight and at least three sub-criteria.
- Every scoring anchor names an observable state, not "good" /
"excellent" / "best-in-class."
- Weights sum to 100.
- Gaps are flagged, not filled with plausible guesses.
- No vendor names appear in the rubric. This is category-generic.
The "no vendor names" rule matters because rubrics written with specific vendors in mind tend to be rubrics the preferred vendor wins. Category-generic rubrics get applied to vendors afterward — the order that keeps the exercise honest.
Common Anti-Patterns
- "Compare vendor X and vendor Y" with no documents fed. Produces hallucinated feature tables. Fix: feed real materials per vendor; require cited evidence or "insufficient evidence."
- Quality adjectives as scoring anchors. "Best-in-class," "industry-leading," "robust." None are observable. Fix: rewrite anchors as observable states.
- Rubric written with the preferred vendor in mind. The rubric shape reveals the bias — weights concentrated on dimensions where the preferred vendor wins. Fix: write the rubric category-generic, then apply.
- Dimensions no one scores differently. If every vendor scores a 4 on a dimension, the dimension is not discriminating. Fix: cut in calibration, or rewrite the anchors so they discriminate.
- Risks flagged with no source clause. Confident synthesis, not evidence. Fix: require a document reference and clause for every flagged risk.
- Trusting the model's prior on vendor features, pricing, or certifications. Training data ages fast; features get renamed, certifications lapse, companies get acquired. Fix: treat unsupplied claims as unknown, not as fact.
FAQ
How many vendors should we run through the rubric?
Three to five is the useful range for most categories. Fewer and you have not tested the rubric; more and evaluation cost exceeds the value of a better decision. The rubric is for discriminating among finalists, not for initial screening. Screening can use a lighter prompt that filters on dealbreakers only.
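A lighter screening sketch under that division of labor. The dealbreakers listed are illustrative, echoing the hypothetical CRM buyer from the rubric example above; substitute the real ones.

TASK:
For each vendor below, check only these dealbreakers against the supplied
materials: US data residency, SSO included in the quoted tier, native
HubSpot and Slack integrations, year-one cost at or under $30,000.
For each vendor output: pass, fail (with the document and passage that
fails it), or "insufficient evidence" (with the document to request).
Score nothing else. Vendors that pass move to the full rubric.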
What if we cannot get a security questionnaire from the vendor?
That is itself a finding. A vendor that will not complete a standard security questionnaire in the evaluation phase is not going to be more forthcoming after contract signature. Flag it as a risk, mark the rubric dimensions "insufficient evidence," and let the absence weigh in the decision.
How do we handle the reference-call bias problem?
References are chosen by the vendor; they are not random customers. Treat them as one input among several, not as ground truth. Ask the reference questions the vendor would not have prepped for — which features they have not used, what the worst support interaction was, what the renewal conversation was like. Feed the notes into the rubric as evidence, but weight them accordingly.
Can AI run the whole evaluation without a human?
No, and the failure modes are specific. The model cannot judge strategic fit — whether this vendor's direction aligns with the team's plans two years out. It cannot read the room on a demo call. It cannot weigh intangibles like whether the vendor's team is pleasant to work with. Use AI for the mechanical work — rubric drafting, consistent scoring, risk surfacing, pattern-matching across documents — and keep humans on the judgment. For adjacent patterns on automating structured operational work, see AI process automation prompts.
How do we keep vendor evaluations consistent across teams?
Save the calibrated rubric as a prompt template in a shared snippet library, and require new evaluations to start from it. The shared version evolves as teams find gaps — the procurement ops function owns the template the way engineering owns shared code. When a category changes enough that the old template no longer fits, fork it rather than letting the shared version drift silently. See AI SOP writing prompts for the same discipline applied to operating procedures.
The point of a vendor evaluation is to surface trade-offs, not to produce a number. Structured prompts with real materials surface the trade-offs the feature list hides. Generic prompts produce confident hallucinations in the shape of an evaluation — worse than no evaluation because it reads like one.