[Experiment] Which Brands Does AI "Recommend"? | Aims and Verification Method

Experiment 2026-06-12

Published: June 5, 2026

Author: Kiyoto Yoshida (CMO, FID Inc. / PM, Genview)

This article is installment 0 of the series "AI Recommendation Lab." Every later installment (SaaS, job placement, finance, and so on) builds on this methodology page. If you are unsure how to read the numbers, come back here.

Why We Started This Experiment

"Best CRM tools," "job placement agency comparison" — more and more people are asking these kinds of questions to ChatGPT or Perplexity rather than Google. And AI doesn't return a list of search result links — it names specific services directly.

This creates an urgent question for marketers: "When that happens, is our brand being mentioned by AI?"

Yet there is almost no public data to check this against. What exists is mostly either vendors' private dashboards or self-interested commentary meant to sell the vendor's own tool. Data that is independent, cross-industry, and published with its full measurement method is almost nowhere to be found.

So let's measure it ourselves — that's what this series is.

What We Want to Know: "Do Recommendations Change With Query Specificity?"

Simply listing "which AI mentions what" isn't interesting enough. What this experiment focuses on is how recommended brands shift as questions become more specific. We submit the same topic at three levels of specificity.

KW (keyword): "CRM tool comparison"
NL (natural question): "What CRM do you recommend for a sales team?"
Contextual NL (with conditions): "What CRM is easy to use and cost-effective even for a small sales team?"

The question we're posing is this:

As questions become more specific, are established players safe? Or does opportunity emerge for challengers?

Either outcome makes for a good article. If challengers overtake, it's "a map of winning strategies." If established players hold firm even with context, it's "a warning that bias doesn't break." We're not deciding the conclusion in advance. We let the data speak.

Methodology

We call each AI's API (developer interface) directly and submit the same question a minimum of 3 times. The AI models used are as follows (exact versions are recorded as the snapshot returned by the API at the time of each response).

AI Models Used and Web Search Status
AI	Model	Web Search
ChatGPT (OpenAI)	gpt-4o	Off
Gemini (Google)	gemini-2.5-flash	Off
Claude (Anthropic)	claude-sonnet-4	Off
Grok (xAI)	grok-3	Off
Perplexity	sonar	On (due to the model's nature)

Note: Copilot is excluded this time because it has no public API.

This matters a great deal for reading the results. Four of the five (ChatGPT, Gemini, Claude, Grok) answer from what they learned in training, without web search. Only Perplexity (sonar) searches the web at request time. So when "different AIs mention different brands," part of that gap comes from this structural difference, not only from each model's preferences. In particular, we expect new services and domestic, locally focused tools to surface less often in the four that rely on training data and more often in Perplexity, which searches live. The series reads every result with this difference in mind.

Conditions are controlled. Each request is a single standalone question (no conversation history, no system prompt, no user identification or memory), starting fresh each time. Temperature and other settings are at their defaults. So the results are not influenced by individual search histories or past interactions — they reflect "a single bare-bones question."

Why a minimum of 3 times? AI responses are probabilistic and vary with each submission of the same question. So one run isn't enough to measure. We repeat and look at "appearance rate (how many times out of how many attempts the brand appeared)".
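
The collection procedure described above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the series' actual harness: `ask` is a hypothetical stand-in for a function that sends one stateless request to a provider's API and returns the reply text, and the function and field names are invented for illustration. Only the model names come from the table.

```python
from typing import Callable, Dict, List

# Model names are taken from the table above; everything else is illustrative.
MODELS = {
    "ChatGPT": "gpt-4o",
    "Gemini": "gemini-2.5-flash",
    "Claude": "claude-sonnet-4",
    "Grok": "grok-3",
    "Perplexity": "sonar",
}

def collect(ask: Callable[[str, str], str], query: str, level: str,
            runs: int = 3) -> List[Dict]:
    """Send the same query `runs` times to every model, one fresh request each."""
    records = []
    for ai, model in MODELS.items():
        for run in range(1, runs + 1):
            # Each call carries only the question: no history, no system prompt.
            text = ask(model, query)
            records.append({"ai": ai, "model": model, "level": level,
                            "run": run, "query": query, "response": text})
    return records

# Hypothetical stub in place of a real API client, so the sketch runs offline.
def fake_ask(model: str, query: str) -> str:
    return f"[{model}] reply to: {query}"

records = collect(fake_ask, "CRM tool comparison", level="KW")
print(len(records))  # 5 AIs x 3 runs = 15
```

Passing the request function in as a parameter keeps the loop independent of any one provider's client library, which is one way to hold the conditions identical across the five AIs.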

Counting Rules (Unglamorous, But This Is the Heart of It)

AI responses cannot be aggregated as-is. We organize them with the following rules.

Standardize to the brand level: "Salesforce" and "Salesforce Sales Cloud" are counted as the same Salesforce. Variation in expression and differences in product editions are consolidated under the parent brand.
Exclude out-of-category results: If Office or Teams appears in a CRM question, it's not a CRM recommendation, so it's excluded from the count. (However, the original data is retained to allow for later verification.)
Deduplicate within a response: If the same brand appears more than once in one response, it is counted once, at its highest position.
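
The three rules above could be applied to one response roughly as follows. This is a sketch, not the author's pipeline: the alias table, the out-of-category set, and the name `ExampleCRM` are hypothetical, and it assumes a brand's rank is its position in the original list (the article does not say whether ranks are renumbered after out-of-category items are dropped).

```python
from typing import Dict, List, Tuple

# Hypothetical lookup tables; the real ones would be maintained per category.
ALIASES = {
    "salesforce": "Salesforce",
    "salesforce sales cloud": "Salesforce",
}
OUT_OF_CATEGORY = {"office", "teams", "word"}

def normalize(items: List[str]) -> Tuple[Dict[str, int], List[str]]:
    """Turn one response's ordered list into {brand: best_rank} plus excluded items."""
    ranks: Dict[str, int] = {}
    excluded: List[str] = []
    for position, raw in enumerate(items, start=1):
        key = raw.strip().lower()
        if key in OUT_OF_CATEGORY:
            excluded.append(raw)  # set aside, not discarded, so it can be audited
            continue
        brand = ALIASES.get(key, raw.strip())
        # The first sighting is the highest position, so later repeats are ignored.
        ranks.setdefault(brand, position)
    return ranks, excluded

ranks, excluded = normalize(
    ["Salesforce", "Teams", "ExampleCRM", "Salesforce Sales Cloud"]
)
print(ranks)     # {'Salesforce': 1, 'ExampleCRM': 3}
print(excluded)  # ['Teams']
```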

Metrics — All Tools for Measuring "Do Recommendations Change With Specificity?"

In brief: ① how often brands appear, ② and ③ how highly they rank, ④ differences between AIs, ⑤ AI accuracy, ⑥ the balance of power between established players and challengers. All of them verify "do recommendations change when asked specifically?" from different angles.

① Appearance Rate — The Most Basic "Face Time" Metric

The percentage of responses that name the brand for a given query style (responses naming the brand ÷ total responses). For example, 5 AIs × 3 runs gives 15 responses; a brand named in 12 of them has an appearance rate of 80%. Lining up one brand's appearance rate across KW → NL → Contextual NL exposes "cliffs," such as an established player at 100% on keywords that drops sharply on specific questions. This is the headline number of the experiment.
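
As a worked example of the arithmetic, here is a minimal sketch. It assumes each response has already been reduced to a `{brand: rank}` dictionary; that shape and the name `BrandA` are illustrative, not the author's data format.

```python
from typing import Dict, List

def appearance_rate(responses: List[Dict[str, int]], brand: str) -> float:
    """Percentage of responses in which `brand` is named at least once."""
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(1 for ranks in responses if brand in ranks)
    return 100.0 * hits / len(responses)

# Made-up data matching the example in the text: named in 12 of 15 responses.
responses = [{"BrandA": 1}] * 12 + [{}] * 3
print(appearance_rate(responses, "BrandA"))  # 80.0
```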

② Visibility Score — A Comprehensive Score That Includes "How High It Ranked"

Appearance rate alone treats "appeared in 1st place" and "appeared in last place" as the same one count. But only the top of AI responses gets read. So we average "1 ÷ rank" per run (1st = 1.0, 2nd = 0.5, 3rd ≈ 0.33… 0 if not appearing) across all runs and multiply by 100 (same logic as Mean Reciprocal Rank = MRR used in information retrieval). Even if appearance rates are the same, this captures "gradual sinking" where rank slowly drops with specificity.
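
The score can be sketched in a few lines. As before, the `{brand: rank}` response shape and the sample values are assumptions made for illustration; they are not results from the experiment.

```python
from typing import Dict, List

def visibility_score(responses: List[Dict[str, int]], brand: str) -> float:
    """Mean of 1/rank over all responses (0 when absent), times 100."""
    if not responses:
        raise ValueError("need at least one response")
    total = sum(1.0 / ranks[brand] for ranks in responses if brand in ranks)
    return 100.0 * total / len(responses)

# Made-up data: the brand appears in every response at both levels,
# so its appearance rate is 100% both times, but its rank slips.
kw = [{"BrandA": 1}, {"BrandA": 1}, {"BrandA": 1}]
contextual = [{"BrandA": 2}, {"BrandA": 3}, {"BrandA": 2}]

print(round(visibility_score(kw, "BrandA"), 1))          # 100.0
print(round(visibility_score(contextual, "BrandA"), 1))  # 44.4
```

The two lists have the same appearance rate, and only the visibility score separates them, which is the "gradual sinking" case described above.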

③ Average Rank — Supporting Evidence for the Score

The average position at which the brand was listed, counting only the responses where it appeared (smaller is higher). This shows whether a brand "appears, but near the bottom" or "ranks near the top whenever it appears." Reporting it next to the visibility score also keeps the score transparent.
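
A minimal sketch of this metric, again assuming the illustrative `{brand: rank}` response shape. Returning `None` for a brand that never appears is a design choice of this sketch, not something the article specifies.

```python
from typing import Dict, List, Optional

def average_rank(responses: List[Dict[str, int]], brand: str) -> Optional[float]:
    """Mean rank over the responses that name `brand`; None if it never appears."""
    ranks = [r[brand] for r in responses if brand in r]
    if not ranks:
        return None
    return sum(ranks) / len(ranks)

# Made-up data: named twice (1st and 3rd), absent once.
responses = [{"BrandA": 1}, {"BrandA": 3}, {}]
print(average_rank(responses, "BrandA"))  # 2.0
print(average_rank(responses, "BrandB"))  # None
```

Unlike the visibility score, the absent response does not pull this number down, which is why the two are read side by side.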

④ Differences Between AIs — "Which AI Is It More Likely to Appear In?"

Even for the same question, ChatGPT and Perplexity mention different brands. Lining up the tendencies of each AI shows whether the five AIs align or diverge when queries become specific. This also means the optimization approach differs by AI. (Brands are also classified as "domestic / overseas," examining how much each AI picks up domestic brands.)
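
One way to line up the AIs is to compute the appearance rate separately for each. The sketch below assumes records are `(ai, {brand: rank})` pairs; the brand name `LocalCRM` and the values are invented for illustration and are not findings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def appearance_rate_by_ai(
    records: List[Tuple[str, Dict[str, int]]], brand: str
) -> Dict[str, float]:
    """Appearance rate of `brand`, in percent, computed per AI."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ai, ranks in records:
        totals[ai] += 1
        hits[ai] += int(brand in ranks)
    return {ai: round(100.0 * hits[ai] / totals[ai], 1) for ai in totals}

# Made-up data: three runs for each of two AIs.
records = [
    ("ChatGPT", {"BigCRM": 1}),
    ("ChatGPT", {"BigCRM": 1}),
    ("ChatGPT", {"BigCRM": 1}),
    ("Perplexity", {"BigCRM": 1, "LocalCRM": 2}),
    ("Perplexity", {"LocalCRM": 1, "BigCRM": 2}),
    ("Perplexity", {"BigCRM": 1}),
]
print(appearance_rate_by_ai(records, "LocalCRM"))
# {'ChatGPT': 0.0, 'Perplexity': 66.7}
```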

⑤ Category Deviation Rate — AI "Accuracy"

How much out-of-category content was mixed in (e.g., mentioning Word or Teams for a CRM question; out-of-category count ÷ total items mentioned). A supplementary measure examining AI behavior: does it become more accurate as questions become more specific?
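
The ratio can be sketched as below. Two assumptions are made here that the article does not spell out: items are pooled across all responses before dividing, and "total items mentioned" means the raw list before brand deduplication. The out-of-category set and the brand names are hypothetical.

```python
from typing import List, Set

def category_deviation_rate(responses: List[List[str]],
                            out_of_category: Set[str]) -> float:
    """Out-of-category items as a percentage of all items, pooled over responses."""
    total = sum(len(items) for items in responses)
    if total == 0:
        raise ValueError("no items mentioned")
    off = sum(1 for items in responses
              for item in items if item.strip().lower() in out_of_category)
    return 100.0 * off / total

# Made-up responses to a CRM question.
responses = [
    ["Salesforce", "Teams", "ExampleCRM"],
    ["Salesforce", "Word", "Teams", "ExampleCRM", "OtherCRM"],
]
print(category_deviation_rate(responses, {"word", "teams", "office"}))  # 37.5
```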

⑥ Established vs. Challenger Share — The "Final Answer" Across Industries

At each stage, what percentage of the brands mentioned are established players, and what percentage are challengers? Brand names differ by industry and cannot be compared directly, but "established or challenger" can be compared across all industries on one chart. To avoid subjectivity, the line between the two is drawn from the top of a published ranking accepted in each industry (ITreview for SaaS, Oricon customer satisfaction for job placement, number of accounts and market share for finance, and so on; the source and retrieval date are stated per industry). If the established share falls and the challenger share rises from KW to Contextual NL, that means "specificity creates opportunity for challengers." If it does not change, "bias doesn't break even with specificity." This is the backbone number of the series.
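
A sketch of how the share might be computed for one query level. It assumes every brand mention in a response counts once and that any brand outside the established set is a challenger; the brand names, the set, and the numbers are invented for illustration.

```python
from typing import Dict, List, Set

def tier_share(responses: List[Dict[str, int]],
               established: Set[str]) -> Dict[str, float]:
    """Split all brand mentions into established vs. challenger percentages."""
    est = sum(1 for ranks in responses for b in ranks if b in established)
    total = sum(len(ranks) for ranks in responses)
    if total == 0:
        raise ValueError("no brands mentioned")
    return {"established": 100.0 * est / total,
            "challenger": 100.0 * (total - est) / total}

# Hypothetical tier list; in the article it comes from a published ranking.
ESTABLISHED = {"BigCRM", "MegaCRM"}

# Made-up responses at two specificity levels.
kw = [
    {"BigCRM": 1, "MegaCRM": 2, "SmallCRM": 3},
    {"BigCRM": 1, "MegaCRM": 2},
    {"BigCRM": 1, "MegaCRM": 2, "SmallCRM": 3},
]
contextual = [
    {"SmallCRM": 1, "BigCRM": 2},
    {"BigCRM": 1, "NewCRM": 2},
]
print(tier_share(kw, ESTABLISHED))          # {'established': 75.0, 'challenger': 25.0}
print(tier_share(contextual, ESTABLISHED))  # {'established': 50.0, 'challenger': 50.0}
```

In this invented case the established share falls from 75% to 50% between the two levels, which is the pattern the article would read as an opening for challengers.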

Being Honest (The Limitations of This Experiment)

AI behavior changes with updates. This is "a snapshot at a specific point in time." So we record dates and continue as longitudinal observations.
We're not deciding the conclusion in advance. If the data contradicts the hypothesis, we'll write that honestly too.
Conflict of interest disclosure. The author provides a GEO (generative engine optimization) tool that optimizes how brands appear in AI answers. This study was conducted independently, with no requests or compensation from any of the brands measured. The methodology and source data are fully published so that third parties can verify them.

The Series Going Forward

We'll keep testing the same question across different industries: SaaS → job placement → finance → … After examining "how do recommendations shift with query specificity?" in each industry, we'll publish a cross-industry summary at the end.

Next installment: results from the first test subject, "SaaS." At the moment a question becomes specific, who does AI cut — and who does it choose?

(The method for reading numbers and aggregation rules in each installment all follow this methodology page.)
