This article is the 0th installment of the series "AI Recommendation Lab." All subsequent installments (SaaS edition, job-change edition, finance edition…) are built on this methodology page as their foundation. If you're unsure how to read the numbers, come back here.
Why We Started This Experiment
"Best CRM tools," "job placement agency comparison" — more and more people are asking these kinds of questions to ChatGPT or Perplexity rather than Google. And AI doesn't return a list of search result links — it names specific services directly.
This creates an urgent question for marketers: "When that happens, is our brand being mentioned by AI?"
Yet there is almost no public data to verify this. What exists is either private dashboards kept inside individual companies or self-interested commentary from vendors selling their own tools. Data that is independent, cross-industry, and published with its full measurement methodology is almost nowhere to be found.
So let's measure it ourselves — that's what this series is.
What We Want to Know: "Do Recommendations Change With Query Specificity?"
Simply listing "which AI mentions what" isn't interesting enough. What this experiment focuses on is how recommended brands shift as questions become more specific. We submit the same topic at three levels of specificity.
- KW (keyword): "CRM tool comparison"
- NL (natural question): "What CRM do you recommend for a sales team?"
- Contextual NL (with conditions): "What CRM is easy to use and cost-effective even for a small sales team?"
The question we're posing is this:
As questions become more specific, are established players safe? Or does opportunity emerge for challengers?
Either outcome makes for a good article. If challengers overtake, it's "a map of winning strategies." If established players hold firm even with context, it's "a warning that bias doesn't break." We're not deciding the conclusion in advance. We let the data speak.
Methodology
We call each AI's API (developer interface) directly and submit the same question a minimum of 3 times. The AI models used are as follows (exact versions are recorded as the snapshot returned by the API at the time of each response).
AI Models Used and Web Search Status
| AI | Model | Web Search |
| --- | --- | --- |
| ChatGPT (OpenAI) | gpt-4o | Off |
| Gemini (Google) | gemini-2.5-flash | Off |
| Claude (Anthropic) | claude-sonnet-4 | Off |
| Grok (xAI) | grok-3 | Off |
| Perplexity | sonar | On (due to the model's nature) |
Note: Copilot is excluded this time because it has no public API.
This matters a great deal when reading the results. Four of the five (ChatGPT, Gemini, Claude, Grok) answer from training data without using web search. Only Perplexity (sonar) searches the web on the spot to answer. In other words, when "different AIs mention different brands," that reflects not only differences in preference but also this structural difference. In particular, we expect new services and domestically focused tools to appear less often in the four models that rely on training data and more often in Perplexity, which searches live. This series reads every result with that difference in mind.
Conditions are controlled. Each request is a single standalone question (no conversation history, no system prompt, no user identification or memory), starting fresh each time. Temperature and other settings are at their defaults. So the results are not influenced by individual search histories or past interactions — they reflect "a single bare-bones question."
Why a minimum of 3 times? AI responses are probabilistic and vary with each submission of the same question. So one run isn't enough to measure. We repeat and look at "appearance rate (how many times out of how many attempts the brand appeared)".
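The repeat-and-collect loop described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the series: `ask` is a hypothetical callable standing in for whichever vendor client is used, and `Run`, `collect_runs`, and the model names in the example are made up for the sketch.

```python
from typing import Callable, NamedTuple


class Run(NamedTuple):
    model: str    # which AI answered
    style: str    # query style label, e.g. "KW", "NL", "CTX"
    attempt: int  # 1-based repetition number
    text: str     # raw response, kept for later verification


def collect_runs(ask: Callable[[str, str], str],
                 models: list[str],
                 queries: dict[str, str],
                 n_runs: int = 3) -> list[Run]:
    """Send every query to every model n_runs times.

    `ask(model, prompt)` is assumed to issue one standalone request
    (no history, no system prompt) and return the response text.
    """
    if n_runs < 3:
        raise ValueError("the methodology calls for at least 3 runs")
    runs = []
    for model in models:
        for style, prompt in queries.items():
            for attempt in range(1, n_runs + 1):
                runs.append(Run(model, style, attempt, ask(model, prompt)))
    return runs


# Usage with a stand-in for a real API client (hypothetical).
def fake_ask(model: str, prompt: str) -> str:
    return f"[{model}] reply to: {prompt}"


runs = collect_runs(fake_ask, ["model-a", "model-b"],
                    {"KW": "CRM tool comparison"}, n_runs=3)
print(len(runs))  # 2 models x 1 query x 3 runs = 6
```

Passing the client in as a function keeps the loop identical across vendors, so only the thin `ask` wrapper would differ per AI.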
Counting Rules (Unglamorous, But This Is the Heart of It)
AI responses cannot be aggregated as-is. We organize them with the following rules.
- Standardize to the brand level: "Salesforce" and "Salesforce Sales Cloud" are counted as the same Salesforce. Variation in expression and differences in product editions are consolidated under the parent brand.
- Exclude out-of-category results: If Office or Teams appears in a CRM question, it's not a CRM recommendation, so it's excluded from the count. (However, the original data is retained to allow for later verification.)
- Count each brand once per response: if the same brand appears multiple times in one response, it is counted once, at its highest-ranked position.
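The three rules above could be applied to one response roughly like this. It is a sketch under stated assumptions: the `ALIASES` and `OUT_OF_CATEGORY` tables are illustrative placeholders rather than the series' actual lists, and we assume ranks are the raw positions in the response (not renumbered after exclusions), which the rules do not specify.

```python
# Illustrative lookup tables (assumptions, not the real lists).
ALIASES = {
    "salesforce": "Salesforce",
    "salesforce sales cloud": "Salesforce",
}
OUT_OF_CATEGORY = {"office", "teams"}


def apply_counting_rules(mentions: list[str]) -> tuple[dict[str, int], list[str]]:
    """Turn an ordered list of mentioned names into {brand: best rank}.

    Returns the counted brands plus the excluded raw names, which are
    retained so the exclusion can be verified later.
    """
    ranks: dict[str, int] = {}
    excluded: list[str] = []
    for position, raw in enumerate(mentions, start=1):
        key = raw.strip().lower()
        if key in OUT_OF_CATEGORY:
            excluded.append(raw)  # kept, but not counted
            continue
        brand = ALIASES.get(key, raw.strip())  # unknown names pass through
        # Positions ascend, so the first occurrence is the best rank.
        ranks.setdefault(brand, position)
    return ranks, excluded


ranks, excluded = apply_counting_rules(
    ["Salesforce", "Teams", "HubSpot", "Salesforce Sales Cloud"]
)
print(ranks)     # {'Salesforce': 1, 'HubSpot': 3}
print(excluded)  # ['Teams']
```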
Metrics — All Tools for Measuring "Do Recommendations Change With Specificity?"
In brief: ① how often brands appear, ② and ③ how highly they rank, ④ differences between AIs, ⑤ AI accuracy, ⑥ the balance of power between established players and challengers. All of them verify "do recommendations change when asked specifically?" from different angles.
① Appearance Rate — The Most Basic "Face Time" Metric
What percentage of responses include the brand name for a given query style (number of responses with the brand ÷ total responses; e.g., 12 out of 15 for 5 AIs × 3 runs = 80%). Lining up the same brand's appearance rate from KW → NL → Context reveals "cliffs" like "established players at 100% on keywords, then a sharp drop on specific questions" — the main number in this experiment.
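As a quick check of the arithmetic, here is a minimal sketch of the appearance rate. The function name and the input shape (one set of normalized brand names per response) are assumptions for illustration, and "BrandA" / "BrandB" are placeholder names.

```python
def appearance_rate(responses: list[set[str]], brand: str) -> float:
    """Percentage of responses that mention `brand` at least once."""
    if not responses:
        raise ValueError("no responses to measure")
    hits = sum(1 for brands in responses if brand in brands)
    return 100 * hits / len(responses)


# 5 AIs x 3 runs = 15 responses; the brand appears in 12 of them.
responses = [{"BrandA", "BrandB"}] * 12 + [{"BrandB"}] * 3
print(appearance_rate(responses, "BrandA"))  # 80.0
```

Computing this separately for the KW, NL, and contextual responses gives the three numbers that are lined up to look for a "cliff."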
② Visibility Score — A Comprehensive Score That Includes "How High It Ranked"
Appearance rate alone treats "appeared in 1st place" and "appeared in last place" as the same single count. But only the top of an AI response gets read. So for each run we take "1 ÷ rank" (1st = 1.0, 2nd = 0.5, 3rd ≈ 0.33, and so on; 0 if the brand does not appear), average it across all runs, and multiply by 100. This is the same logic as Mean Reciprocal Rank (MRR) in information retrieval. Even when appearance rates are identical, the score captures "gradual sinking," where rank slowly drops as queries become more specific.
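A minimal sketch of that score, assuming each run is represented by the brand's rank or `None` when it did not appear (the function name and input shape are ours, for illustration only):

```python
from typing import Optional


def visibility_score(ranks: list[Optional[int]]) -> float:
    """Mean of 1/rank over all runs (0 for absent runs), times 100."""
    if not ranks:
        raise ValueError("no runs to score")
    total = sum(1 / r for r in ranks if r is not None)
    return 100 * total / len(ranks)


# 1st place once, 2nd place once, absent once: (1.0 + 0.5 + 0) / 3 * 100
print(visibility_score([1, 2, None]))  # 50.0

# Same appearance rate (3 of 3 runs), very different visibility.
always_first = visibility_score([1, 1, 1])
always_third = visibility_score([3, 3, 3])
print(always_first > always_third)  # True
```

The second pair is the "gradual sinking" case: a brand that still appears in every run but has slid from 1st to 3rd keeps its appearance rate while its score falls to about a third.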
③ Average Rank — Supporting Evidence for the Score
When the brand appears, what was its average position (smaller = higher)? This shows whether a brand "appears but tends toward the bottom" or "always appears near the top." Displaying it alongside the score keeps the score transparent.
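The difference from the visibility score is that runs where the brand is absent are left out instead of counted as 0. A small sketch, using the same assumed input shape (rank per run, `None` when absent) and a hypothetical function name:

```python
from typing import Optional


def average_rank(ranks: list[Optional[int]]) -> Optional[float]:
    """Mean rank over the runs where the brand appeared, else None."""
    present = [r for r in ranks if r is not None]
    if not present:
        return None  # never appeared, so there is no rank to average
    return sum(present) / len(present)


print(average_rank([1, 3, None]))  # 2.0
print(average_rank([None, None]))  # None
```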
④ Differences Between AIs — "Which AI Is It More Likely to Appear In?"
Even for the same question, ChatGPT and Perplexity mention different brands. Lining up the tendencies of each AI shows whether the five AIs align or diverge when queries become specific. This also means the optimization approach differs by AI. (Brands are also classified as "domestic / overseas," examining how much each AI picks up domestic brands.)
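One way to line up the per-AI tendencies is to compute the appearance rate separately for each model. The sketch below assumes each run is a `(model, brands)` pair after the counting rules have been applied; the function name and the brand "B" are placeholders.

```python
from collections import defaultdict


def appearance_rate_by_model(runs: list[tuple[str, set[str]]],
                             brand: str) -> dict[str, float]:
    """Appearance rate (%) of `brand`, computed per model."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for model, brands in runs:
        totals[model] += 1
        if brand in brands:
            hits[model] += 1
    return {model: 100 * hits[model] / totals[model] for model in totals}


runs_by_model = [
    ("gpt-4o", {"A"}),
    ("gpt-4o", {"B"}),
    ("sonar", {"A", "B"}),
    ("sonar", {"B"}),
]
print(appearance_rate_by_model(runs_by_model, "B"))
# {'gpt-4o': 50.0, 'sonar': 100.0}
```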
⑤ Category Deviation Rate — AI "Accuracy"
How much out-of-category content was mixed in (e.g., mentioning Word or Teams for a CRM question; out-of-category count ÷ total items mentioned). A supplementary measure examining AI behavior: does it become more accurate as questions become more specific?
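The ratio itself is simple. Here is a sketch that assumes it is reported as a percentage, like the appearance rate, and that a stage with no mentions at all is reported as 0 (both are our assumptions, as is the function name):

```python
def category_deviation_rate(out_of_category: int, total_mentions: int) -> float:
    """Out-of-category items as a percentage of all items mentioned."""
    if out_of_category < 0 or out_of_category > total_mentions:
        raise ValueError("out_of_category must be between 0 and total_mentions")
    if total_mentions == 0:
        return 0.0  # nothing mentioned, so nothing deviated (assumption)
    return 100 * out_of_category / total_mentions


# Illustrative numbers only: 2 off-topic items among 20 mentions.
print(category_deviation_rate(2, 20))  # 10.0
```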
⑥ Established vs. Challenger Share — The "Final Answer" Across Industries
At each stage, what percentage of mentioned brands are established players vs. challengers? Brand names differ by industry and can't be compared directly, but "established or challenger" allows comparison across all industries on one chart. The line between established and challenger is drawn using published rankings accepted in each industry, to avoid subjectivity (ITreview for SaaS, Oricon customer satisfaction for job-change, account numbers and market share for finance, etc. — source and retrieval date noted per industry). If established player share drops and challenger share rises from KW → contextual, that means "specificity creates opportunity for challengers." If it doesn't change, "bias doesn't break even with specificity." This is the backbone number of the series.
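A sketch of the share calculation for one stage follows. Two things are assumptions here: that shares are computed over mentions (one entry per brand per response) rather than over distinct brands, and that the established set has already been drawn from the published ranking for the industry. "BigCo" and "NewCo" are placeholder names, not measured brands.

```python
def established_vs_challenger(mentions: list[str],
                              established: set[str]) -> dict[str, float]:
    """Share (%) of mentions going to established brands vs. challengers."""
    if not mentions:
        raise ValueError("no mentions at this stage")
    est = sum(1 for brand in mentions if brand in established)
    total = len(mentions)
    return {
        "established": 100 * est / total,
        "challenger": 100 * (total - est) / total,
    }


established = {"BigCo"}
kw_stage = ["BigCo", "BigCo", "NewCo", "BigCo"]
ctx_stage = ["BigCo", "NewCo", "NewCo", "NewCo"]
print(established_vs_challenger(kw_stage, established))
# {'established': 75.0, 'challenger': 25.0}
print(established_vs_challenger(ctx_stage, established))
# {'established': 25.0, 'challenger': 75.0}
```

Because the output contains only the two labels, the same chart can hold every industry regardless of which brand names sit behind them.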
Being Honest (The Limitations of This Experiment)
- AI behavior changes with updates. This is "a snapshot at a specific point in time." So we record dates and continue as longitudinal observations.
- We're not deciding the conclusion in advance. If the data contradicts the hypothesis, we'll write that honestly too.
- Conflict of interest disclosure. The author provides a GEO strategy tool that optimizes how brands are displayed by AI. This survey was conducted independently, without requests or compensation from any of the brands measured. The methodology and source data are fully disclosed in a form that third parties can verify.
The Series Going Forward
We'll keep testing the same question across different industries. SaaS → job-change → finance → … After examining "how do recommendations shift with query specificity?" in each industry, we'll publish a cross-industry summary at the end.
Next installment: results from the first test subject, "SaaS." At the moment a question becomes specific, who does AI cut — and who does it choose?
(The method for reading numbers and aggregation rules in each installment all follow this methodology page.)