Pragmatic AI - RAG series part 3: Making Data AI-Ready
The first step in any RAG system is making your raw data usable.
Yes, you can pass 10,000 documents to an LLM in AWS Bedrock if you have an infinite budget and no practical constraints on latency or cloud costs. In the real world, that does not scale.
So when people say "make your data AI-ready", they do not mean "Claude can't read PDFs". It can. The issue is that doing this over and over, for lots of documents, for lots of users, is too slow and too expensive.
What you do instead: split documents into chunks, embed each chunk as a vector, and store those vectors in a searchable index.
This whole pipeline - chunking -> embedding -> indexing - is what we call ingestion. And it is the first step if you want RAG that is fast, cheap, and scalable.
Chunking
There are multiple strategies for chunking text, each with its own trade-offs. In this project, we use a sliding window chunking strategy.
To make this concrete, let's look at an example. We indexed the file Pattern-Oriented Software Architecture, Volume 1 – A System of Patterns.pdf. Below are the text fields for chunks 27, 28, and 29 of that document.
Chunk #27
"
...
The creation of specific
architectures is still based on intuition and experience.
Patterns effectively complement these general problem-independent
architectural techniques with specific problem-oriented ones. Note
that pa
"
Chunk #28
"
...
with specific problem-oriented ones. Note
that patterns do not make existing approaches to software architec-
ture obsolete-instead, they fill a gap that is not covered by existing
techniques.
...
These micro-methods complement
general but problem-independent analysis and design methods. such
as Booch [Boo941 and Object Modeling Technique [RBPELS 11, by pro-
viding methodological steps for solving concrete recumng problems
in software development. Section 5.4, Pattern Syste
"
Chunk #29
"
eral but problem-independent analysis and design methods. such
as Booch [Boo941 and Object Modeling Technique [RBPELS 11, by pro-
viding methodological steps for solving concrete recumng problems
in software development. Section 5.4, Pattern Systems as lmplemen-
tation Guidelines discusses this issue in detail.
...
"
Notice how the end of chunk 27 appears at the beginning of chunk 28, and chunk 29 starts slightly before the end of chunk 28.
To simplify: each chunk contains a small overlap with the previous one. This is done because language and meaning often span chunk boundaries, and overlap ensures that important context is not lost during retrieval.
We configure our chunking as fixed-size overlapping chunking: the window is 2,500 characters and the overlap is 250 characters.
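In Python, a sliding-window chunker with these numbers could look like this. This is a minimal sketch: `chunk_text` is a hypothetical helper name, and real pipelines often split on token counts or sentence boundaries rather than raw characters.

```python
def chunk_text(text: str, window: int = 2500, overlap: int = 250) -> list[str]:
    """Fixed-size overlapping chunking: each chunk repeats the last
    `overlap` characters of the previous chunk."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap  # how far the window slides each iteration
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]
```

Note the invariant this gives you: the first 250 characters of chunk N are exactly the last 250 characters of chunk N-1, which is what we observed in chunks #27-#29 above.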
Chunking is not a detail - it is a critical design decision and must be evaluated per project based on the data and the expected queries. Once your content is indexed, changing chunking is not a simple tweak: you must re-chunk, re-embed, and re-index everything.
Embedding
Embedding is the step where you transform your chunks into vectors. Each chunk is passed through an embedding model and converted into a fixed-size array of numbers. These vectors are what you store in your vector index and later use for similarity search.
Vector size (also called dimensionality) matters.
In this project, we use AWS Titan Text Embeddings v2, which produces vectors with 1024 dimensions by default. We keep that size end-to-end to stay consistent with both the model and the index configuration.
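Calling Titan Text Embeddings v2 through Bedrock looks roughly like this. It is a sketch assuming boto3 and configured AWS credentials; `build_request` and `embed` are illustrative helper names, not part of any SDK.

```python
import json

MODEL_ID = "amazon.titan-embed-text-v2:0"

def build_request(text: str, dimensions: int = 1024) -> str:
    # We pin 1024 dimensions so the output always matches the index mapping.
    return json.dumps({"inputText": text, "dimensions": dimensions})

def embed(text: str, client=None) -> list[float]:
    # Lazy import so the module loads without AWS dependencies installed.
    if client is None:
        import boto3
        client = boto3.client("bedrock-runtime")
    response = client.invoke_model(modelId=MODEL_ID, body=build_request(text))
    return json.loads(response["body"].read())["embedding"]
```

Each call returns one 1024-element vector per chunk, ready to be stored in the index.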
Choosing the vector size is not arbitrary: larger vectors can capture finer semantic distinctions, but they cost more to store and more to search.
As with chunking, there is no universal "best" choice. The right embedding model and vector size depend on your data, your queries, and your cost and latency constraints.
Below is an example of chunk #27 after embedding:
[
...
-0.022368588,0.03426118,-0.021072196,0.048965596,0.0401257,
-0.0003573049,0.10447999,0.02597144,0.006521694,0.02481099,
...
]
No worries if you can't read it - neither can the rest of us. What matters is that, as explained in Pragmatic AI – RAG Series part 2: It Finds the Right Information, retrieval can now perform efficient similarity searches over these values.
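What the retrieval step actually computes over those numbers is a directional comparison between vectors. Here is a minimal cosine-similarity sketch, for intuition only; real vector indexes use approximate nearest-neighbor structures rather than brute-force comparison.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 means the vectors point the same way (very similar meaning),
    # values near 0.0 mean the chunks are unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Retrieval then boils down to: embed the query, score every candidate chunk with this function, and return the top scorers.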
One important rule: stick to the same embedding model.
Same family. Same version. Same configuration. You cannot embed documents with one model today and query with another tomorrow, even if the dimensionality matches. Once vectors are generated and indexed, changing dimensionality or embedding models means re-embedding and re-indexing everything.
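One cheap way to enforce this rule is to record the model id and dimensionality used at ingestion time and fail fast on any mismatch at query time. A sketch; the `EXPECTED` constant and `check_embedding_config` helper are hypothetical names:

```python
# Frozen at ingestion time, e.g. stored next to the index configuration.
EXPECTED = {"model_id": "amazon.titan-embed-text-v2:0", "dimensions": 1024}

def check_embedding_config(model_id: str, vector: list[float]) -> None:
    """Reject queries (or re-ingests) that use a different embedding model
    or dimensionality than the one the index was built with."""
    if model_id != EXPECTED["model_id"]:
        raise ValueError(f"index was built with {EXPECTED['model_id']}, got {model_id}")
    if len(vector) != EXPECTED["dimensions"]:
        raise ValueError(f"expected {EXPECTED['dimensions']} dimensions, got {len(vector)}")
```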
Indexing
Now that we have vectors, the next question is simple: where do we store them? In this project, we use OpenSearch, which supports vector indexes out of the box. This allows us to store embeddings and run similarity search directly on them.
The only hard requirement is that the index configuration matches the embedding output. In practice, that means the vector field's dimension must equal the number of dimensions the embedding model produces - 1024 in our case.
Below is the index definition used in this project:
{
  "docs": {
    "aliases": {},
    "mappings": {
      "properties": {
        ...
        "chunk_id": { "type": "keyword" },
        "chunk_index": { "type": "integer" },
        "drive_link": { "type": "keyword" },
        "file_id": { "type": "keyword" },
        "file_name": { "type": "text" },
        "mime_type": { "type": "keyword" },
        "modified_time": { "type": "date" },
        "text": { "type": "text" },
        "vec": {
          "type": "knn_vector",
          "dimension": 1024
        }
      }
    },
    "settings": {
      "index": {
        ...
        "provided_name": "docs",
        "knn": "true",
        ...
      }
    }
  }
}
Notice that the index does not only store vectors. It also stores metadata alongside them. This metadata is critical: depending on your use case, you may use it to filter results, trace an answer back to its source file, or enforce data-handling requirements.
For example, PII or regulated data may require separate indexes or even separate databases. For internal documentation, metadata within the same index might be sufficient.
Finally, once data is indexed, the index must be refreshed so that subsequent queries can see the most up-to-date data.
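With the opensearch-py client, indexing a batch of chunks and then refreshing could be sketched like this. The helper names are illustrative; the field names follow the mapping shown above.

```python
def build_doc(chunk_id: str, chunk_index: int, text: str, vec: list[float],
              file_id: str, file_name: str) -> dict:
    # Field names mirror the "docs" index mapping.
    return {"chunk_id": chunk_id, "chunk_index": chunk_index, "text": text,
            "vec": vec, "file_id": file_id, "file_name": file_name}

def index_chunks(client, docs: list[dict]) -> None:
    # client is an opensearchpy.OpenSearch instance.
    for doc in docs:
        client.index(index="docs", id=doc["chunk_id"], body=doc)
    # Without a refresh, newly indexed chunks may not yet be visible to search.
    client.indices.refresh(index="docs")
```

In production you would use the bulk API rather than one request per chunk, but the refresh step stays the same.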
Checkpoint
At this point in the series, we have a functional RAG system.
It can ingest data, chunk it, generate embeddings, and index everything for retrieval. Retrieval supports hybrid search - combining exact term matching with semantic similarity - and the LLM reasons over that context to return accurate responses.
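One simple way to express such a hybrid query in OpenSearch is a bool query combining a lexical match clause with a knn clause. This is a sketch: `build_hybrid_query` is a hypothetical helper, and OpenSearch also offers a dedicated hybrid query type with score-normalization pipelines for more control over how the two scores are blended.

```python
def build_hybrid_query(query_text: str, query_vec: list[float], k: int = 5) -> dict:
    # Scores from the two clauses are added, so chunks that match both
    # lexically and semantically rank highest.
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"text": query_text}},                  # exact term matching
                    {"knn": {"vec": {"vector": query_vec, "k": k}}},  # semantic similarity
                ]
            }
        },
    }
```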
What does it actually take to move from "it works" to "we trust it in production"?
If you would like to go deeper or take a look at the code, feel free to reach out!
Next episode in this series: Pragmatic AI - RAG series part 4: Cost Control - because a system you can't afford to run… isn't a system you can ship.