
Benchmate: Chunking

Chunking is one of the core abstractions behind Retrieval Augmented Generation (RAG), so it’s a first-class citizen in Benchmate. In short, RAG works by searching chunks of text for possible answers to user queries, then passing the most relevant chunks to the LLM.

At its simplest, a chunker is a piece of software that takes documents and breaks them into smaller pieces of text. But while that sounds straightforward, creating high-quality chunks is surprisingly nuanced and critical if you want accurate results from your RAG model.

Not all chunks are created equal. A high-quality chunk should be complete, specific and coherent:

  • Complete: It should capture the full concept it’s referencing, without splitting key information across chunks. This way, when you pass a chunk to the LLM, the model gets the full context it needs to answer correctly.
  • Specific: Each chunk should represent a single concept or topic. You want to avoid cramming multiple ideas into one chunk, making it easier for the system to find and retrieve exactly what the user needs.
  • Coherent: The chunk should make sense on its own. When a chunk is both complete and specific, coherence usually follows naturally.

At the end of the day, it all boils down to one idea: an ideal chunk captures one complete thought or answer to one question.

While that sounds simple, generating high-quality chunks algorithmically, especially across large document collections, is anything but simple. It takes a thoughtful combination of strategies.

That’s why Benchmate was designed to support many different chunking approaches. We group them into three main classes: simple, structural, and semantic. Each has its own strengths and trade-offs.


Simple Chunkers

Simple chunkers use straightforward rules to break text apart:

  • Token Count Chunker: Splits documents into chunks based on a fixed number of tokens (the word-like units LLMs operate on).
  • Character Splitter Chunker: Breaks text based on a character count but overlaps the boundaries slightly to preserve flow between chunks.
  • Regular Expression Chunker: Gives fine-grained control over where to break chunks, typically using punctuation as default boundaries.
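To make the idea concrete, here is a minimal sketch of a character splitter with overlapping boundaries. The function name and parameters are illustrative, not Benchmate's actual API:

```python
def split_by_characters(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks whose boundaries overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts this many characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap means the last few characters of one chunk reappear at the start of the next, so a sentence cut by a chunk boundary still appears whole in at least one chunk.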

Structural Chunkers

Structural chunkers use the document’s built-in structure to guide the splits:

  • Recursive Character Text Chunker: Breaks text on natural syntactic boundaries (words, sentences) and creates overlapping hierarchies — a paragraph becomes a chunk, but so might each sentence within it.
  • Markdown Chunker: Leverages the structure of Markdown documents to split text by sections.
  • HTML Chunker: Does the same for HTML by looking for header tags (h1 through h6) and grouping each heading with its associated content.
  • JSON Chunker: Walks through a JSON document, creating chunks based on its keys and values.

Semantic Chunkers

Semantic chunkers use natural language processing to create meaning-driven chunks:

  • Cosine Similarity Chunker: Creates embeddings for each sentence and measures cosine similarity between them. When the similarity drops below a threshold, it starts a new chunk, grouping semantically similar sentences together.
  • Change Point Detection Chunker: Also detects shifts in meaning but uses spaCy to identify where the change happens and creates new chunks at those points.
  • Topic Chunker: Clusters sentences into topics by embedding them, reducing dimensionality with UMAP, and clustering with HDBSCAN. Instead of returning slices of the original text, it generates summaries of the topics identified within the document.
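The cosine-similarity approach can be sketched with a toy bag-of-words "embedding" so the mechanics are visible end to end. A real implementation would use a neural embedding model instead; everything below is illustrative:

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Toy bag-of-words vector; a real chunker would call an embedding model here.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def chunk_by_similarity(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev, vec) < threshold:
            chunks.append([])  # semantic break detected: open a new chunk
        chunks[-1].append(sentence)
        prev = vec
    return chunks
```

Sentences about the same topic share vocabulary and stay together; a topic shift drives the similarity toward zero and opens a new chunk.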

Each of these chunkers has different trade-offs. Some are fast and simple, while others are slower but more precise. Some work better with long documents; others shine with shorter content. Many are configurable so you can tune the behavior to match the needs of each data source.

We built these options into Benchmate so we could quickly experiment with different chunking algorithms as we develop our LLM-based features. Fast experimentation is key to building a great product; equally important is being able to measure the quality of the results. That’s why chunking in Benchmate ties directly into our LLM Testing Framework, which we’ll cover in a separate post.
