A chunk is the unit of meaning into which documents are divided for the Retrieval phase of RAG. In GEO strategy, "one topic per H2/H3 heading" is the fundamental design principle. Content that is easy to handle with structure-based splitting (content whose meaning is self-contained at the heading level) is considered effective, and BLUF and FAQ formats may be more likely to be split into appropriate chunks. However, being retrieved and being cited are separate matters: final citation is determined after post-Retrieval ranking and credibility evaluation. Three measures that content teams can implement immediately are: ① BLUF implementation, ② one topic per section, and ③ FAQ format.
What Is a Chunk?
"Chunk" is an English word that originally means "a lump or block." In the context of RAG (Retrieval-Augmented Generation), it refers to the manageable-size units into which long documents are divided for Retrieval.
LLMs have an upper limit on the amount of text they can process at one time. In addition, when long documents are searched as-is, it becomes harder to pinpoint which part is relevant to the question. For this reason, RAG systems generally divide documents into chunks and index them in advance, so that only the chunks highly relevant to the question are retrieved during Retrieval.
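The divide, index, and retrieve flow described above can be sketched in a few lines. This is a minimal illustration only: it assumes a bag-of-words cosine similarity as a stand-in for the embedding-based search that real RAG systems typically use, and the function names (`build_index`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Pre-compute a word-count vector for every chunk (the indexing step)."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def retrieve(query: str, index: list[tuple[str, Counter]], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query (the Retrieval step)."""
    q = Counter(tokenize(query))
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# Hypothetical chunks, each covering exactly one topic.
chunks = [
    "A chunk is the unit a document is divided into for retrieval.",
    "BLUF places the conclusion directly under the heading.",
    "FAQ format pairs each question with its answer.",
]
index = build_index(chunks)
top = retrieve("What is a chunk?", index)
```

Because each chunk covers a single topic, only the first chunk shares vocabulary with the query, so it is the one returned.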
Chunk Splitting Methods
Chunk splitting methods differ by implementation, but there are three main types.
Main Chunk Splitting Methods and Characteristics
| Splitting Method | Overview | Characteristics |
| --- | --- | --- |
| Fixed-size splitting | Mechanically split by character count or token count | Simple to implement, but may cut in the middle of meaning |
| Structure-based splitting | Split following document structure such as headings, paragraphs, and sections | Meaning tends to be self-contained; easier to improve Retrieval accuracy |
| Semantic splitting | Split by semantic coherence of content | High accuracy but high processing cost |
From a GEO strategy perspective, content design that is easy to handle with structure-based splitting (structures in which meaning is self-contained at the heading level) is considered effective. However, many details of each AI service's chunking implementation are not public, so this assessment, as of May 2026, is partly based on inference.
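To make the difference between the first two methods concrete, here is a small sketch that compares them on the same text. It assumes the source is Markdown-style text with `##` and `###` headings. The helper names are hypothetical, and no particular AI service is known to split this way.

```python
import re


def split_fixed(text: str, size: int) -> list[str]:
    """Fixed-size splitting: cut every `size` characters, ignoring meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_heading(markdown: str) -> list[str]:
    """Structure-based splitting: start a new chunk at each '## ' or '### ' heading."""
    # The zero-width lookahead splits *before* each heading line, keeping the heading.
    parts = re.split(r"(?m)^(?=#{2,3} )", markdown)
    return [part.strip() for part in parts if part.strip()]


doc = (
    "## What is a chunk?\n"
    "A chunk is a retrieval unit.\n"
    "## What is BLUF?\n"
    "BLUF puts the conclusion first.\n"
)

fixed_chunks = split_fixed(doc, 40)     # boundaries fall wherever the count lands
heading_chunks = split_by_heading(doc)  # one chunk per heading plus its body
```

With a 40-character window, the fixed-size split cuts partway through the first answer, while the heading-based split keeps each heading together with its body.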
Example: Not Suited vs. Suited for Chunking
This table compares how the state of content affects how it is handled as a chunk.
Differences in Content State and How It Is Handled as a Chunk
| Status | Content State | How It Is Handled as a Chunk |
| --- | --- | --- |
| ❌ Not suited | Multiple topics are mixed within a single H2 section. The heading and body content do not match. | When divided into chunks, it may become difficult to determine "what this chunk is about" |
| ✅ Suited | Each H2/H3 heading corresponds to one topic, and a conclusion is placed immediately below the heading. | When divided with structure-based splitting, one chunk is more likely to be self-contained as one meaning unit, potentially increasing the likelihood of being retrieved for related queries |
Genview's Definition
In the context of GEO strategy, Genview defines a chunk as "the meaning unit for processing documents in RAG Retrieval, and one of the concepts that explains why content structure optimization is necessary."
This definition represents Genview's perspective and does not reflect an industry-wide consensus.
Genview adopts this positioning for three reasons.
- The 2025 WebFAQ study (arXiv) demonstrated that FAQ-format Q&A data is well-suited for Dense Retrieval (semantic search). Because the FAQ format clearly pairs "questions" and "answers," each Q&A pair can be interpreted as likely to form a chunk that is self-contained as a unit of meaning.
- The BLUF principle (placing a conclusion immediately below a heading) plays the role of explicitly stating "what this chunk is about" at the beginning when divided by structure-based chunking. The semantic clarity at the chunk level may influence Retrieval accuracy.
- Semantic HTML tags such as <article>, <section>, and <h2> may function as cues for splitting in structure-based chunking. However, this is Genview's inference as of May 2026 and has not been officially disclosed by any of the companies involved.
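As an illustration of the third point, the sketch below uses Python's standard `html.parser` module to start a new chunk at each `<h2>` or `<h3>`. It shows how heading tags could serve as split cues, under the assumption that a chunker relies on them. The `HeadingChunker` class is hypothetical and does not describe any AI service's actual implementation.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Collect text into chunks, starting a new chunk at each <h2> or <h3>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[dict[str, str]] = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.chunks.append({"heading": "", "body": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.chunks:
            return  # skip whitespace and any text before the first heading
        key = "heading" if self._in_heading else "body"
        self.chunks[-1][key] = (self.chunks[-1][key] + " " + text).strip()


# Hypothetical page fragment with one topic per <h2>.
page = """
<article>
  <section><h2>What is a chunk?</h2><p>A chunk is a retrieval unit.</p></section>
  <section><h2>What is BLUF?</h2><p>BLUF puts the conclusion first.</p></section>
</article>
"""

parser = HeadingChunker()
parser.feed(page)
parser.close()
html_chunks = parser.chunks
```

Each resulting chunk carries its own heading, which is what makes "what this chunk is about" explicit when the heading matches the body.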
Parent Concepts and Related Terms
Chunks are positioned as the basic unit for processing documents in the Retrieval phase of RAG. The following organizes the concepts related to chunks.
Parent Concepts
- RAG (Retrieval-Augmented Generation): The mechanism by which AI searches for and retrieves external information before generating a response. Chunks are the basic unit for processing documents in RAG's Retrieval phase.
- Retrieval: The first phase of RAG. The process of searching for and retrieving relevant chunks based on the user's question.
Related Terms
- BLUF (Bottom Line Up Front): The writing structure principle of placing the conclusion directly under the heading. Related as an implementation principle for creating content whose meaning is self-contained when divided into chunks.
- Semantic HTML: HTML marked up with tags that correctly convey the meaning of each element. Tags such as <section> and <h2> may function as cues for splitting in structure-based chunking.
- Vector Search: Technology that searches for related chunks based on the semantic similarity of text. Widely used in RAG's Retrieval phase, where the semantic clarity of chunks affects search precision.
- FAQ format: A structure describing questions and answers as a set in "Q: ~ / A: ~" format. Each Q&A tends to become a semantically self-contained chunk, and is attracting attention as a structure that tends to improve retrieval precision.
- Context Window: The maximum number of tokens an LLM can process in a single inference. Chunks retrieved during Retrieval are placed in the context window and used for LLM response generation within that limit.
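The relationship between retrieved chunks and the context window can be sketched as a simple budget check. This is an assumption-laden illustration: `estimate_tokens` counts whitespace-separated words rather than using a real tokenizer, and the greedy `pack_context` strategy is hypothetical, not a description of how any specific service fills its context.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated word count (not a real tokenizer)."""
    return len(text.split())


def pack_context(ranked_chunks: list[str], budget: int) -> list[str]:
    """Take chunks in rank order, skipping any chunk that would exceed the budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # this chunk does not fit; try the next, smaller one
        selected.append(chunk)
        used += cost
    return selected


# Hypothetical chunks, already ordered by relevance.
ranked = [
    "A chunk is a retrieval unit.",
    "BLUF places the conclusion directly under the heading of each section.",
    "FAQ pairs questions with answers.",
]
context = pack_context(ranked, budget=12)
```

Under this sketch, a compact, self-contained chunk is easier to fit into the remaining budget than a long one that mixes several topics.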
Common Misconceptions
The following three misconceptions about chunks are frequently observed.
Misconception 1: "Being mindful of chunks means being cited by AI."
Chunk design may influence Retrieval accuracy, but being retrieved and being ultimately cited in an AI response are separate matters. Citation is determined after multiple subsequent processes including post-Retrieval ranking, credibility evaluation, and answer synthesis. Chunk design is one of the structural preparations that serve as its prerequisite.
Misconception 2: "Chunks are determined at the web page level."
Documents are chunked not at the page level but at finer units such as sections, paragraphs, and Q&A pairs within a page. Since a single page is divided into multiple chunks and indexed, what matters is not only the quality of the page as a whole but also "the self-containedness of meaning at the section level."
Misconception 3: "Chunks are managed by engineers and have nothing to do with content teams."
Chunking implementation is in the engineer's domain, but the perspective of "writing content in which meaning is more likely to be self-contained as a chunk" overlaps with the content design domain. Self-containedness of meaning at the heading level, BLUF implementation, and utilizing FAQ format are measures that content teams can work on as chunk-design-conscious practice.
FAQ
- Q: What should I do for chunk-conscious content design?
- A: The basic principle is "one topic per H2/H3 heading." Specifically, three practices considered effective are: ① placing a conclusion immediately below each heading (BLUF); ② not mixing multiple topics within a single section; and ③ writing Q&A in FAQ format as independent pairs.
- Q: What is the appropriate size for a chunk?
- A: Since the chunking implementations of each AI service are not public, appropriate sizes cannot be stated definitively. In general RAG implementations, 200–500 tokens (approximately 300–700 Japanese characters) is cited as one benchmark, but this varies by service. Prioritizing "is the meaning self-contained?" over size is the practical approach.
- Q: Are chunks and sections the same thing?
- A: The concepts are similar but not the same. A section is a division in the HTML document structure (a range divided by <section> tags or headings), while a chunk is the unit into which a RAG system divides a document for Retrieval. In structure-based chunking, sections are often used as chunk divisions, so the two frequently have a corresponding relationship.
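The idea that each Q&A pair can stand alone as a chunk can be sketched with a small parser for the "Q: ~ / A: ~" layout used in this FAQ. The function name `faq_to_chunks` and the one-line-per-question-and-answer input format are assumptions made for illustration.

```python
def faq_to_chunks(text: str) -> list[dict[str, str]]:
    """Pair each 'Q:' line with the 'A:' line that follows it."""
    chunks: list[dict[str, str]] = []
    question = None
    for raw in text.splitlines():
        line = raw.strip().lstrip("- ").strip()  # drop list bullets and padding
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            chunks.append({"question": question, "answer": line[2:].strip()})
            question = None  # an answer closes the current pair
    return chunks


# Hypothetical FAQ text in the same list style as above.
faq_text = (
    "- Q: Are chunks and sections the same thing?\n"
    "- A: Similar, but not the same.\n"
    "- Q: What is the appropriate size for a chunk?\n"
    "- A: It varies by service.\n"
)
faq_chunks = faq_to_chunks(faq_text)
```

Each resulting pair contains both the question and its answer, so it remains understandable even when retrieved without the rest of the page.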