A site has 1,000 pages. A user asks AI: "What GEO tools do you recommend?" At that moment, an AI with search and Grounding enabled doesn't read the entire site. It selects pages and chunks that look useful for generating a response from what it can retrieve in real time.
Grounding refers to the mechanism by which AI retrieves external web pages in real time when generating a response, using that information as its basis. ChatGPT's web search and Gemini's search integration are examples of this.
So how are those few pages chosen? What gets selected, and what gets ignored?
AI Doesn't Read All 1,000 Pages
This article isn't about what's contained in AI's trained data. It's about which pages become retrieval candidates when AI fetches external information in real time — as in ChatGPT's web search, Gemini's Grounding, or Perplexity.
In RAG and search-integrated AI systems, the model searches for and retrieves external information to answer a query, then uses that information as the basis for its response. None of this involves reading every page. In a typical setup, the query is converted into a vector, and only the most semantically similar pages and chunks among the retrieval candidates are selected.
All pages on the site (e.g. 1,000 pages)
↓ Query converted to vector · similarity scoring
Top semantically similar chunks (a handful to a few dozen)
↓ Used as context
Response generated
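The flow above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words counter standing in for a real embedding model, the sample chunks are invented, and none of this reflects how ChatGPT, Gemini, or Perplexity actually score candidates.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every candidate chunk against the query and keep the k best."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented sample chunks, one per page type discussed in this article.
chunks = [
    "Comparison of GEO tools: features, pricing, and supported AI engines.",
    "We are hiring a backend engineer. Apply through the careers form.",
    "Privacy policy: how we collect and process personal data.",
    "FAQ: What GEO tools do you recommend for a small marketing team?",
]

results = top_k_chunks("What GEO tools do you recommend?", chunks, k=2)
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

With these sample chunks, the FAQ chunk ranks first and the comparison chunk second, while the hiring and privacy chunks share no words with the query and score zero. A real embedding model would capture meaning rather than word overlap, but the selection step has the same shape.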
The unit of retrieval is a chunk — a fragment, not a full page. A single page can be split into multiple chunks. To put it starkly: even on a 1,000-page site, the content actually used in generating a response to any given query might be less than 1% of the total.
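To make the "less than 1%" figure concrete, here is a rough sketch. The fixed word-window chunking and the numbers (5 chunks per page, 20 chunks passed to the model) are assumptions chosen for illustration, since real systems chunk and budget context in ways they do not publish.

```python
def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split a page into fixed-size word windows (one simple chunking scheme)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def share_used(pages: int, chunks_per_page: int, chunks_in_context: int) -> float:
    """Fraction of all site chunks that end up in the context for one query."""
    return chunks_in_context / (pages * chunks_per_page)


# A 120-word page becomes three chunks: 50 + 50 + 20 words.
page_text = " ".join(f"word{i}" for i in range(120))
page_chunks = split_into_chunks(page_text, max_words=50)
print(len(page_chunks))  # 3

# Assumed figures: 1,000 pages, 5 chunks each, 20 chunks passed to the model.
print(f"{share_used(1000, 5, 20):.2%}")  # 0.40%
```

Even with a fairly generous context of 20 chunks, the share of the site that reaches the model stays well under 1% in this scenario.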
Pages Less Likely to Be Referenced in Grounding
From what I observe, the following types of pages are rarely referenced for general product and service queries.
- Recruitment and job listing pages: Semantically distant from service-related queries
- Privacy policies and terms of service: Legal documents rarely useful for answering general questions
- Company history and IR information: Corporate history is far removed from product and service queries
- Category index and tag pages: Lists of links to content, with little substance of their own
- Pagination (page 2 and beyond): Lower content density, lower similarity to most queries
- Thin content or image-heavy pages: Simply not enough text to retrieve
Pages with technical problems — 404 errors or redirect loops — may not be retrievable in the first place. Even if a site claims 1,000 pages, the number AI can actually read may be far smaller than expected.
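A quick audit for this kind of problem might look like the sketch below. The `classify` helper, the `fetch` callback signature, and the `site` table are all hypothetical. In a real audit `fetch` would issue an HTTP request without following redirects automatically, whereas here it is faked with a dictionary so the logic can be followed offline.

```python
from typing import Callable, Optional

# A fetch function returns (status_code, redirect_target_or_None).
Fetch = Callable[[str], tuple[int, Optional[str]]]


def classify(url: str, fetch: Fetch, max_hops: int = 5) -> str:
    """Follow redirects by hand and report whether the page is readable."""
    seen = set()
    current = url
    for _ in range(max_hops + 1):
        if current in seen:
            return "redirect loop"
        seen.add(current)
        status, target = fetch(current)
        if status in (301, 302, 307, 308) and target:
            current = target
            continue
        return "ok" if status == 200 else f"error {status}"
    return "too many redirects"


# Hypothetical site: each path maps to (status, redirect target).
site = {
    "/faq": (200, None),
    "/old-pricing": (301, "/pricing"),
    "/pricing": (200, None),
    "/a": (302, "/b"),
    "/b": (302, "/a"),
    "/deleted": (404, None),
}


def fake_fetch(url: str) -> tuple[int, Optional[str]]:
    """Stand-in for a real HTTP client; unknown paths return 404."""
    return site.get(url, (404, None))


report = {url: classify(url, fake_fetch) for url in site}
for url, status in report.items():
    print(f"{url}: {status}")
```

Counting only the URLs reported as "ok" gives a more honest estimate of how many pages are readable at all.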
Pages More Likely to Be Referenced in Grounding
To be precise: AI retrieves pages based on query fit, not page type. As a result, the following types of pages tend to be retrieved more often.
- FAQ pages: Q&A format is easy to extract as "answers" and chunk cleanly
- Comparison articles and pages: Semantically close to "what's the difference between X and Y?" queries
- Glossary and definition pages: Directly match "what is X?" queries
- Case studies and implementation examples: Match "who's actually using this?" queries
- Feature description pages: Semantically close to specific feature-related queries
GEO research publication AI+Automation describes a two-level model in which "query intent determines the retrieval pool." FAQ and comparison pages tend to be retrieved because they're structured to directly answer what users are asking. Niara similarly advises prioritizing "pages with factual, unique data — product specs, pricing, FAQs, and documentation."
This is why Genview builds its content around glossaries and FAQs. Intentionally creating pages that AI can easily reference is the foundation of a Grounding strategy.
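The claim that Q&A format chunks cleanly can be illustrated with a small sketch. The "Q:" and "A:" line prefixes and the `faq_to_chunks` helper are assumptions made for this example. Real FAQ pages are marked up in many different ways, and retrieval systems do their own chunking.

```python
def faq_to_chunks(faq_text: str) -> list[dict[str, str]]:
    """Turn 'Q: ... / A: ...' text into one self-contained chunk per question."""
    chunks = []
    question = None
    answer_lines: list[str] = []
    for raw in faq_text.splitlines():
        line = raw.strip()
        if line.startswith("Q:"):
            # A new question closes the previous question/answer pair.
            if question is not None:
                chunks.append({"question": question, "answer": " ".join(answer_lines)})
            question = line[2:].strip()
            answer_lines = []
        elif line.startswith("A:"):
            answer_lines.append(line[2:].strip())
        elif line and question is not None:
            # Continuation line of a multi-line answer.
            answer_lines.append(line)
    if question is not None:
        chunks.append({"question": question, "answer": " ".join(answer_lines)})
    return chunks


faq = """
Q: What is Grounding?
A: AI retrieving external web pages in real time as the basis for a response.
Q: What is a chunk?
A: A fragment of a page
that is retrieved on its own.
"""

faq_chunks = faq_to_chunks(faq)
for chunk in faq_chunks:
    print(chunk["question"], "->", chunk["answer"])
```

Each resulting chunk carries both the question and its answer, so it still makes sense when retrieved without the rest of the page.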
Query Match Beats 100 Internal Links
There's a common misconception worth addressing: "Pages with lots of internal links get prioritized by AI too." This is half right and half wrong.
Pages with many internal links are more likely to be discovered by crawlers first. But being discovered and being retrieved during Grounding are different things.
Example: query "compare GEO strategy tools"
- Homepage (company overview, service summary), 100 internal links: less likely to be retrieved
- GEO tool comparison page (detailed comparison content), 3 internal links: more likely to be retrieved
Semantic closeness to the query has a strong influence on retrieval priority during Grounding. Rather than adding internal links indiscriminately, the more important task is making the content of pages you want cited as relevant as possible to the queries you're targeting. That said, pages that aren't crawled at all can't be retrieved, so a minimum of internal links pointing to key pages is still necessary.
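The distinction between being discovered and being retrieved can be sketched as a two-stage filter. Everything here is a simplification: the `overlap` score is a crude word-set measure standing in for semantic similarity, the page data is invented, and treating "zero inbound links" as "never crawled" ignores other discovery routes such as sitemaps.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query: str, text: str) -> float:
    """Jaccard overlap of word sets, a crude stand-in for semantic similarity."""
    q, t = tokens(query), tokens(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def rank(query: str, pages: list[dict]) -> list[dict]:
    # Stage 1 (discovery): pages with no inbound links are assumed uncrawled.
    discovered = [p for p in pages if p["inbound_links"] > 0]
    # Stage 2 (retrieval): order by similarity only; link count is not scored.
    return sorted(discovered, key=lambda p: overlap(query, p["text"]), reverse=True)


# Invented pages mirroring the example above, plus one orphan page.
pages = [
    {"url": "/", "inbound_links": 100,
     "text": "Company overview and summary of our services."},
    {"url": "/geo-tool-comparison", "inbound_links": 3,
     "text": "Compare GEO strategy tools: a detailed comparison of features and pricing."},
    {"url": "/orphan-comparison", "inbound_links": 0,
     "text": "Compare GEO strategy tools side by side."},
]

ranked = rank("compare GEO strategy tools", pages)
print([p["url"] for p in ranked])  # ['/geo-tool-comparison', '/']
```

The comparison page outranks the homepage despite having far fewer links, while the orphan page never enters the ranking at all, however relevant its text is.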
How Far Does llms.txt Actually Go?
llms.txt is a mechanism for communicating site structure and key pages to AI — a concept gaining attention as a kind of sitemap for AI systems. For AI crawlers that support it, it may serve as supplementary information that aids discovery and understanding.
That said, as of 2026 the practical effectiveness of llms.txt remains limited. Not all AI systems support it, and listing a page doesn't guarantee it will be retrieved. Setting up llms.txt is a valid step, but it shouldn't be treated as a guarantee of Grounding retrieval.
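For reference, a minimal llms.txt can be generated as below. The layout (a title heading, a short quoted summary, then sections of Markdown links) follows the public llms.txt proposal as I understand it. The `render_llms_txt` helper, the site name, and the example.com URLs are hypothetical.

```python
def render_llms_txt(site_name: str, summary: str,
                    sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render a small llms.txt file: title, summary, and sections of links."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical key pages that a site might want AI systems to find first.
llms_txt = render_llms_txt(
    "Example Site",
    "Guides and definitions about GEO and Grounding.",
    {
        "Key pages": [
            ("FAQ", "https://example.com/faq", "Common questions about GEO tools"),
            ("Glossary", "https://example.com/glossary", "Definitions of GEO terms"),
        ],
    },
)
print(llms_txt)
```

Publishing a file like this costs little, but as noted above, nothing obliges an AI system to read it or to retrieve the pages it lists.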
Stop Making 1,000 Pages. Make the 10 AI Actually Reads.
Building 1,000 pages matters less than building the 10 AI actually reads.
This isn't a call to reduce page count. The point is that even with 1,000 pages, AI only references a fraction during Grounding — some pages are technically unreadable, others are semantically too distant to retrieve. That's why designing pages with "which ones will AI read?" in mind is what actually matters.
Identify the queries you want to be cited for. Find or create the pages that answer those queries. Structure them as FAQs, comparisons, or definitions that are easy to retrieve. Ensure they're reachable through internal links. This, I believe, is the foundation of content design in the Grounding era.
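The last step, checking that target pages are reachable through internal links, can be sketched as a breadth-first search from the homepage. The link graph and the query-to-page mapping below are invented for illustration. In practice they would come from a crawl of your own site and from your own list of target queries.

```python
from collections import deque


def reachable_from(start: str, links: dict[str, list[str]]) -> set[str]:
    """Breadth-first search over internal links, returning every page reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical internal link graph: page -> pages it links to.
links = {
    "/": ["/faq", "/blog"],
    "/blog": ["/blog/geo-tool-comparison"],
    "/faq": [],
    "/glossary/grounding": [],  # exists, but nothing links to it
}

# Hypothetical mapping from target queries to the page meant to answer each.
target_pages = {
    "what GEO tools do you recommend": "/blog/geo-tool-comparison",
    "what is grounding": "/glossary/grounding",
    "how do I get started": "/faq",
}

reachable = reachable_from("/", links)
unreachable = sorted(p for p in target_pages.values() if p not in reachable)
print(unreachable)  # ['/glossary/grounding']
```

Any page that shows up in the unreachable list needs at least one internal link before its content can matter.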
Summary
- This article covers real-time retrieval during Grounding and search-integrated AI — not trained data
- AI converts queries to vectors and retrieves only the most semantically similar chunks. Of 1,000 pages, what gets used may be less than 1%
- Pages with technical issues like 404 errors or redirect loops may not be retrievable in the first place
- Recruitment, privacy policies, category indexes, and pagination are rarely referenced for most general queries
- AI retrieves based on query fit, not page type — FAQ and comparison articles tend to score well on query fit
- Many internal links ≠ prioritized in Grounding. Semantic closeness to the query has a strong influence on retrieval priority
- llms.txt is a promising concept but its effectiveness remains limited in practice
- Build the pages AI actually reads — not more pages
Related term: For how RAG works, see RAG (Retrieval-Augmented Generation).
Related term: For how content is split for retrieval, see Chunk.
Related term: For how to signal pages to AI crawlers, see llms.txt.
While putting this article together, something caught my attention. What actually happens to Grounding results on a site with large numbers of UTM-parameterized URLs? I looked into it, and "nobody knows" turned out to be the most accurate answer available right now. ChatGPT, Gemini, and Perplexity all keep their retrieval candidate generation algorithms private. A black box. One day I'd like to run an actual experiment — "how does Grounding handle UTM-tagged URLs?" — and write it up. The world of GEO may still have more unknowns than knowns.