1000ページより、AIに読まれる10ページを作れ

コラム 2026-06-16

公開日：2026年6月16日

1000ページのサイトがある。ユーザーがAIに「おすすめのGEO対策ツールを教えて」と聞いた。このとき、検索・Groundingが有効なAIは、サイト全体を読むわけではありません。リアルタイムで取得できる情報の中から、回答に使えそうなページやチャンクを選びます。

Groundingとは、AIが回答を生成する際に外部のWebページをリアルタイムで検索・取得し、その情報を根拠として使う仕組みです。ChatGPTのWeb検索やGeminiの検索連携がこれにあたります。

では、その数ページはどう選ばれるのか。何が選ばれて、何が無視されるのか。

AIは1000ページを読まない

ここで扱うのは、AIの学習済みデータに何が含まれているかではありません。ChatGPTのWeb検索、GeminiのGrounding、Perplexityのように、回答時に外部情報をリアルタイムで取得する場面で、どのページが取得候補になるかという話です。

RAGや検索連携型のAIでは、ユーザーのクエリに答えるために、外部情報を検索・取得し、その情報を根拠として回答を生成します。このプロセスは「全ページを読む」ではありません。クエリをベクトル（数値）に変換し、取得候補となるページやチャンクの中から意味的に近いものを絞り込んで取得します。

サイト内の全ページ（例：1000ページ）

↓ クエリをベクトル変換・類似度スコアリング

意味的に近い上位チャンク（数〜十数件）

↓ コンテキストとして使用

回答生成

取得される単位はページ全体ではなくチャンクと呼ばれる断片です。1ページが複数のチャンクに分割されることもあります。極端に言えば、1000ページあるサイトでも、あるクエリに対して実際に回答生成へ使われるのは全体の1%未満かもしれません。

Groundingで参照されにくいページ

私が観察する限り、以下のようなページは一般的な商材・サービス系クエリに対して参照されにくい傾向があります。

採用情報・求人ページ：サービスに関するクエリには意味的に遠い
プライバシーポリシー・利用規約：法的文書は一般的な質問への回答には使われにくい
会社沿革・IR情報：企業の歴史は製品・サービスのクエリとは離れている
カテゴリ一覧・タグページ：コンテンツへのリストであり、内容自体が薄い
ページネーション（2ページ目以降）：情報の密度が低く、クエリとの類似度も下がりやすい
テキストが薄い・画像中心のページ：取得できるテキスト情報がそもそも少ない

また、404エラーやリダイレクトループなど技術的な問題を抱えるページは、そもそも取得対象になりにくい可能性があります。1000ページと言っても、実際にAIが読める状態にあるページは思っているよりずっと少ないかもしれません。

Groundingで参照されやすいページ

正確には、AIはページ種別ではなく「クエリへの適合度」で取得します。その結果として、以下のようなページが取得されやすくなる傾向があります。

FAQページ：Q&A形式は「答え」として抽出しやすく、チャンクに分割しやすい
比較記事・比較ページ：「〇〇と△△の違いは？」系のクエリに対して意味的に近い
用語集・定義ページ：「〇〇とは？」系クエリに対して正確に一致する
導入事例・実績ページ：「実際に使っている会社は？」系クエリに対応
機能説明ページ：具体的な機能に関するクエリに対して意味的に近い

海外のGEO調査メディアAI+Automationは「クエリのインテントが取得プールを決める」という2段階モデルを指摘しています。FAQや比較ページが取得されやすいのは、それらのページがユーザーの問いに直接答える構造を持っているからです。またNiaraは「FAQや仕様ページなど、事実ベースで具体的なデータを持つページを優先せよ」と述べています。

Genviewが用語集・FAQをコンテンツの軸に置いている理由はここにあります。AIが参照しやすい形のページを意図的に作ることが、Grounding対策の基本です。

内部リンク100本よりクエリ一致

よく聞かれる誤解があります。「内部リンクが多いページはAIにも優先される」という考え方です。これは半分正しく、半分違います。

内部リンクが多いページはクローラーに先に発見されやすくなります。しかし発見されることと、Grounding時に取得されることは別の話です。

例：「GEO対策ツール比較」というクエリが来た場合

内部リンク100本ホームページ（会社紹介・サービス概要）参照されにくい

内部リンク3本 GEO対策ツール比較ページ（詳細な比較コンテンツ）参照されやすい

Grounding時の取得優先度には、クエリとの意味的な近さが強く影響します。やみくもに内部リンクを増やすのではなく、引用されたいページのコンテンツをクエリに対して充実させることが重要です。ただし、クロールすらされていなければ土台に立てないため、引用されたいページへの内部リンクは最低限必要です。

llms.txtはどこまで効くのか

llms.txtは、AIに対してサイト構造や重要ページを伝えるための仕組みで、AI向けサイトマップのような構想として注目されています。一部のAIシステムやクローラーでは、発見・理解の補助情報として利用される可能性があります。

ただし、2026年現在でもllms.txtの実効性は限定的であり、すべてのAIシステムが対応しているわけではありません。記載したからといって必ず参照されるわけでもない。llms.txtを整備することは有効ですが、それだけでGrounding時の参照が保証されるものではないという点は押さえておく必要があります。

1000ページより、AIに読まれる10ページを作れ

1000ページ作ることより、AIに読まれる10ページを作る方が重要です。

これはページ数を減らせという意味ではありません。1000ページ作っても、Grounding時にAIが参照するのはその中の一部です。技術的な問題で読めないページがあり、意味的に遠いページは取得されない。だからこそ、「どのページがAIに読まれるか」を意識してページを設計することに意味があります。

引用されたいクエリを決める。そのクエリに答えるページを特定する。そのページをFAQ・比較・定義など参照されやすい構造で作る。クロールされるよう内部リンクを整える。これが、Grounding時代のコンテンツ設計の基本だと私は考えています。

まとめ

この記事が扱うのは学習済みデータではなく、Grounding・検索連携が有効な状態でのリアルタイム取得の話
AIはクエリをベクトル変換し、意味的に近い上位チャンクだけを取得して回答する。1000ページのうち参照されるのはごく一部
404エラーやリダイレクトループなど技術的な問題を抱えるページは取得対象になりにくい
採用・プライバシーポリシー・カテゴリ一覧・ページネーションなどは一般的なクエリに対して参照されにくい
AIはページ種別ではなくクエリへの適合度で取得する。結果としてFAQ・比較記事・用語集が取得されやすい
内部リンクが多い＝Grounding時に優先されるは誤解。取得優先度にはクエリとの意味的な近さが強く影響する
llms.txtはAI向けサイトマップのような構想として注目されているが、実効性はまだ限定的
1000ページより、AIに読まれる10ページを作ることの方が重要

関連用語：RAGの仕組みについてはRAG（Retrieval-Augmented Generation）をご覧ください。

関連用語：コンテンツの分割単位についてはChunk（チャンク）をご覧ください。

関連用語：AIへのページ指示についてはllms.txtをご覧ください。

この記事をまとめながら、ふと気になったことがあります。UTMパラメータ付きのURLが大量に存在するサイトでは、Grounding時の取得結果はどう変わるのか。調べてみましたが、「誰も分からない」が現時点で最も正確な答えでした。ChatGPT・Gemini・Perplexityのいずれも、取得候補生成のアルゴリズムを公開していないためです。ブラックボックスでした。いつか「UTM付きURLはGrounding時にどう扱われるか」を実際に実験して、記事にしたいと思っています。GEOの世界は、まだ「分かっていること」より「分かっていないこと」の方が多いのかもしれません。

Author: Kita Yohei

Published: June 16, 2026

A site has 1,000 pages. A user asks an AI: "What GEO tools do you recommend?" At that moment, an AI with search and Grounding enabled doesn't read the entire site. From what it can retrieve in real time, it selects the pages and chunks that look useful for generating a response.

Grounding refers to the mechanism by which AI retrieves external web pages in real time when generating a response, using that information as its basis. ChatGPT's web search and Gemini's search integration are examples of this.

So how are those few pages chosen? What gets selected, and what gets ignored?

AI Doesn't Read All 1,000 Pages

This article isn't about what is contained in an AI's training data. It's about which pages become retrieval candidates when an AI fetches external information in real time, as with ChatGPT's web search, Gemini's Grounding, or Perplexity.

In RAG and search-integrated AI systems, the model searches for and retrieves external information to answer a query, then uses that information as the basis for its response. This process doesn't mean "reading all pages." The query is converted into a vector, and only the most semantically similar pages and chunks from the retrieval candidates are selected.

All pages on the site (e.g. 1,000 pages)

↓ Query converted to vector · similarity scoring

Top semantically similar chunks (a handful to a dozen or so)

↓ Used as context

Response generated

The unit of retrieval is a chunk — a fragment, not a full page. A single page can be split into multiple chunks. To put it starkly: even on a 1,000-page site, the content actually used in generating a response to any given query might be less than 1% of the total.
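
The narrowing step described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how ChatGPT, Gemini, or Perplexity actually work: simple word counts stand in for real embeddings, and the chunk IDs and texts are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. Real systems use dense neural vectors."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query: str, chunks: dict, k: int = 2) -> list:
    """Score every chunk against the query and keep only the k best."""
    q = embed(query)
    scored = [(chunk_id, cosine(q, embed(text))) for chunk_id, text in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical chunks from a hypothetical site.
chunks = {
    "compare#1": "Comparison of GEO tools: features, pricing, and supported AI engines.",
    "faq#3": "Which GEO tools do you recommend for a small team?",
    "careers#1": "We are hiring backend engineers. See open positions and benefits.",
    "privacy#2": "We collect cookies and process personal data under this policy.",
}

results = top_k_chunks("What GEO tools do you recommend?", chunks, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.2f}")
```

In this sketch the FAQ chunk and the comparison chunk are the two that survive, while the careers and privacy chunks share no words with the query and score zero. Only the surviving chunks would be passed on as context.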

Pages Less Likely to Be Referenced in Grounding

From what I have observed, the following types of pages tend to be rarely referenced for general product and service queries.

Recruitment and job listing pages: Semantically distant from service-related queries
Privacy policies and terms of service: Legal documents rarely useful for answering general questions
Company history and IR information: Corporate history is far removed from product and service queries
Category index and tag pages: Lists of links to content, with little substance of their own
Pagination (page 2 and beyond): Lower content density, lower similarity to most queries
Thin content or image-heavy pages: Simply not enough text to retrieve

Pages with technical problems — 404 errors or redirect loops — may not be retrievable in the first place. Even if a site claims 1,000 pages, the number AI can actually read may be far smaller than expected.
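
As a rough way to estimate how many pages are actually readable, the check can be sketched against a simulated response table. Everything here is an assumption for illustration: the `responses` table, the paths, and the `max_hops` limit are hypothetical, and a real audit would issue HTTP requests instead of looking up a dictionary.

```python
from typing import Dict, Optional, Tuple

REDIRECT_CODES = (301, 302, 307, 308)


def classify_url(
    url: str,
    responses: Dict[str, Tuple[int, Optional[str]]],
    max_hops: int = 5,
) -> str:
    """Follow redirects in a simulated table and report whether the URL is readable.

    responses maps a URL to (status_code, redirect_target or None).
    URLs missing from the table are treated as 404.
    """
    seen = set()
    current = url
    for _ in range(max_hops + 1):
        if current in seen:
            return "redirect_loop"
        seen.add(current)
        status, target = responses.get(current, (404, None))
        if status in REDIRECT_CODES and target:
            current = target
            continue
        return "readable" if status == 200 else f"error_{status}"
    return "too_many_redirects"


# Hypothetical site: one healthy page, one clean redirect, one loop.
responses = {
    "/pricing": (200, None),
    "/old-pricing": (301, "/pricing"),
    "/a": (302, "/b"),
    "/b": (302, "/a"),
}

for path in ["/pricing", "/old-pricing", "/a", "/missing"]:
    print(path, classify_url(path, responses))
```

Running a check like this over a full URL list gives a rough count of pages that a fetcher could read at all, which is the number that matters before any similarity scoring happens.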

Pages More Likely to Be Referenced in Grounding

To be precise: AI retrieves pages based on query fit, not page type. As a result, the following types of pages tend to be retrieved more often.

FAQ pages: Q&A format is easy to extract as "answers" and chunk cleanly
Comparison articles and pages: Semantically close to "what's the difference between X and Y?" queries
Glossary and definition pages: Directly match "what is X?" queries
Case studies and implementation examples: Match "who's actually using this?" queries
Feature description pages: Semantically close to specific feature-related queries

GEO research publication AI+Automation describes a two-level model in which "query intent determines the retrieval pool." FAQ and comparison pages tend to be retrieved because they're structured to directly answer what users are asking. Niara similarly advises prioritizing "pages with factual, unique data — product specs, pricing, FAQs, and documentation."

This is why Genview builds its content around glossaries and FAQs. Intentionally creating pages that AI can easily reference is the foundation of Grounding strategy.
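
To see why Q&A content chunks cleanly, here is a minimal sketch that splits FAQ text into one chunk per question. The `Q:` / `A:` markers and the sample text are assumptions for illustration; real retrieval systems use their own, undisclosed chunking rules.

```python
import re


def split_faq(text: str) -> list:
    """Split 'Q: ... A: ...' text into one self-contained chunk per question."""
    pattern = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\s*Q:|\Z)", re.DOTALL)
    return [
        {"question": q.strip(), "answer": a.strip()}
        for q, a in pattern.findall(text)
    ]


# Hypothetical FAQ content.
faq_text = """
Q: What is Grounding?
A: A mechanism where AI retrieves external web pages at answer time.

Q: Does llms.txt guarantee retrieval?
A: No. Support varies by system.
"""

faq_chunks = split_faq(faq_text)
for chunk in faq_chunks:
    print(chunk["question"], "->", chunk["answer"])
```

Each resulting chunk carries both the question and its answer, so it still makes sense when retrieved without the rest of the page.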

Query Match Beats 100 Internal Links

There's a common misconception worth addressing: "Pages with lots of internal links get prioritized by AI too." This is half right and half wrong.

Pages with many internal links are more likely to be discovered by crawlers first. But being discovered and being retrieved during Grounding are different things.

Example: the query "compare GEO strategy tools"

Homepage (company overview, service summary), 100 internal links: Less likely to be retrieved
GEO tool comparison page (detailed comparison content), 3 internal links: More likely to be retrieved

Semantic closeness to the query has a strong influence on retrieval priority during Grounding. Rather than adding internal links indiscriminately, the more important task is making the content of pages you want cited as relevant as possible to the queries you're targeting. That said, pages that aren't crawled at all can't be retrieved, so a minimum of internal links pointing to key pages is still necessary.
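
The distinction between being discovered and being retrieved can be sketched as a two-step filter. This is a simplified model under stated assumptions: the page records, the `crawled` flag, and the word-overlap score are all hypothetical stand-ins, and real systems do not publish how they rank candidates.

```python
import re


def tokens(text: str) -> set:
    """Lowercase word set, used as a crude stand-in for semantic content."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, text: str) -> float:
    """Jaccard overlap between query words and page words."""
    q, t = tokens(query), tokens(text)
    return len(q & t) / len(q | t) if q | t else 0.0


def rank_for_query(query: str, pages: list) -> list:
    """Step 1: keep only crawled pages. Step 2: rank by query overlap.

    The internal link count is deliberately not part of the score.
    """
    candidates = [p for p in pages if p["crawled"]]
    return sorted(candidates, key=lambda p: overlap(query, p["text"]), reverse=True)


# Hypothetical pages.
pages = [
    {"url": "/", "internal_links": 100, "crawled": True,
     "text": "Company overview and summary of our services."},
    {"url": "/compare-geo-tools", "internal_links": 3, "crawled": True,
     "text": "Compare GEO strategy tools: a detailed comparison of features and pricing."},
    {"url": "/orphan-comparison", "internal_links": 0, "crawled": False,
     "text": "Compare GEO strategy tools in depth."},
]

ranked = rank_for_query("compare GEO strategy tools", pages)
for page in ranked:
    print(page["url"], page["internal_links"])
```

In this sketch the comparison page with 3 internal links ranks above the homepage with 100, while the orphan page never enters the ranking because it was not crawled, however relevant its text is.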

How Far Does llms.txt Actually Go?

llms.txt is a mechanism for communicating site structure and key pages to AI — a concept gaining attention as a kind of sitemap for AI systems. For AI crawlers that support it, it may serve as supplementary information that aids discovery and understanding.

That said, as of 2026 the practical effectiveness of llms.txt remains limited. Not all AI systems support it, and listing a page doesn't guarantee it will be retrieved. Setting up llms.txt is a valid step, but it shouldn't be treated as a guarantee of Grounding retrieval.
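
For readers who have not seen one, the sketch below generates a small llms.txt body. It assumes the layout described in the public llms.txt proposal as I understand it (an H1 title, a blockquote summary, then H2 sections containing link lists). The site name, URLs, and notes are hypothetical, and nothing here implies that any particular AI system reads the file.

```python
def build_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Render an llms.txt-style Markdown document.

    sections maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site and pages.
llms_txt = build_llms_txt(
    "Example Site",
    "Guides and definitions about GEO.",
    {
        "Key pages": [
            ("FAQ", "https://example.com/faq", "Common questions with short answers"),
            ("Glossary", "https://example.com/glossary", "Definitions of GEO terms"),
        ],
    },
)
print(llms_txt)
```

The file would normally be served from the site root, listing the same pages you want cited for your target queries.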

Make the 10 Pages AI Actually Reads, Not 1,000

Building 1,000 pages matters less than building the 10 that AI actually reads.

This isn't a call to reduce page count. The point is that even with 1,000 pages, AI only references a fraction during Grounding — some pages are technically unreadable, others are semantically too distant to retrieve. That's why designing pages with "which ones will AI read?" in mind is what actually matters.

Identify the queries you want to be cited for. Find or create the pages that answer those queries. Structure them as FAQs, comparisons, or definitions that are easy to retrieve. Ensure they're reachable through internal links. This is what I believe is the foundation of content design in the Grounding era.

Summary

This article covers real-time retrieval during Grounding and search-integrated AI, not training data
AI converts queries to vectors and retrieves only the most semantically similar chunks. Of 1,000 pages, what gets used may be less than 1%
Pages with technical issues like 404 errors or redirect loops may not be retrievable in the first place
Recruitment, privacy policies, category indexes, and pagination are rarely referenced for most general queries
AI retrieves based on query fit, not page type — FAQ and comparison articles tend to score well on query fit
Many internal links ≠ prioritized in Grounding. Semantic closeness to the query has a strong influence on retrieval priority
llms.txt is a promising concept but its effectiveness remains limited in practice
Build the pages AI actually reads — not more pages

Related term: For how RAG works, see RAG (Retrieval-Augmented Generation).

Related term: For how content is split for retrieval, see Chunk.

Related term: For how to signal pages to AI crawlers, see llms.txt.

While putting this article together, something caught my attention. What actually happens to Grounding results on a site with large numbers of UTM-parameterized URLs? I looked into it, and "nobody knows" turned out to be the most accurate answer available right now. ChatGPT, Gemini, and Perplexity all keep their retrieval candidate generation algorithms private. A black box. One day I'd like to run an actual experiment — "how does Grounding handle UTM-tagged URLs?" — and write it up. The world of GEO may still have more unknowns than knowns.
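
As a possible starting point for that experiment, the sketch below groups UTM-tagged URLs by their stripped form so you can count how many variants point at the same page. It says nothing about how any AI system treats these URLs, which remains unknown. The URLs are hypothetical, and treating every parameter that starts with `utm_` as tracking-only is an assumption.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm(url: str) -> str:
    """Return the URL with all utm_* query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def group_variants(urls: list) -> dict:
    """Group raw URLs by their UTM-stripped form."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_utm(url)].append(url)
    return dict(groups)


# Hypothetical URLs.
urls = [
    "https://example.com/faq?utm_source=newsletter&utm_medium=email",
    "https://example.com/faq?utm_source=x",
    "https://example.com/faq",
    "https://example.com/blog?page=2&utm_campaign=spring",
]

for canonical, variants in group_variants(urls).items():
    print(canonical, len(variants))
```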
