Pragmatic AI - RAG series part 3: Making Data AI-Ready
The first step in any RAG system is making your raw data usable.
Yes, you can pass 10,000 documents to an LLM in AWS Bedrock if you have an infinite budget and no practical constraints on latency or cloud costs. In the real world, that does not scale.
So when people say "make your data AI-ready", they do not mean "Claude can't read PDFs". It can. The issue is that doing this over and over, for lots of documents, for lots of users, is too slow and too expensive.
What you do instead: split documents into chunks, embed each chunk as a vector, and store those vectors in a searchable index.
This whole pipeline - chunking -> embedding -> indexing - is what we call ingestion. And it is the first step if you want RAG that is fast, cheap, and scalable.
Chunking
There are multiple strategies for chunking text, each with its own trade-offs. In this project, we use a sliding window chunking strategy.
To make this concrete, let's look at an example. We indexed the file Pattern-Oriented Software Architecture, Volume 1 – A System of Patterns.pdf. Below are the text fields for chunks 27, 28, and 29 of that document.
Chunk #27
"
...
The creation of specific
architectures is still based on intuition and experience.
Patterns effectively complement these general problem-independent
architectural techniques with specific problem-oriented ones. Note
that pa
"
Chunk #28
"
...
with specific problem-oriented ones. Note
that patterns do not make existing approaches to software architec-
ture obsolete-instead, they fill a gap that is not covered by existing
techniques.
...
These micro-methods complement
general but problem-independent analysis and design methods. such
as Booch [Boo941 and Object Modeling Technique [RBPELS 11, by pro-
viding methodological steps for solving concrete recumng problems
in software development. Section 5.4, Pattern Syste
"
Chunk #29
"
eral but problem-independent analysis and design methods. such
as Booch [Boo941 and Object Modeling Technique [RBPELS 11, by pro-
viding methodological steps for solving concrete recumng problems
in software development. Section 5.4, Pattern Systems as lmplemen-
tation Guidelines discusses this issue in detail.
...
"
Notice how the end of chunk 27 appears at the beginning of chunk 28, and chunk 29 starts slightly before the end of chunk 28.
To simplify: each chunk contains a small overlap with the previous one. This is done because language and meaning often span chunk boundaries, and overlap ensures that important context is not lost during retrieval.
We configure our chunking as fixed-size overlapping chunking: the window is 2,500 characters and the overlap is 250 characters.
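In Python, a sliding-window chunker with these numbers could look like this. This is a minimal sketch: `chunk_text` is a hypothetical helper name, and real pipelines often split on token counts or sentence boundaries rather than raw characters.

```python
def chunk_text(text: str, window: int = 2500, overlap: int = 250) -> list[str]:
    """Fixed-size overlapping chunking: each chunk repeats the last
    `overlap` characters of the previous chunk."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap  # how far the window slides each iteration
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]
```

Note the invariant this gives you: the first 250 characters of chunk N are exactly the last 250 characters of chunk N-1, which is what we observed in chunks #27-#29 above.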
Chunking is not a detail - it is a critical design decision and must be evaluated per project based on the data and the expected queries. Once your content is indexed, changing chunking is not a simple tweak: you must re-chunk, re-embed, and re-index everything.
Embedding
Embedding is the step where you transform your chunks into vectors. Each chunk is passed through an embedding model and converted into a fixed-size array of numbers. These vectors are what you store in your vector index and later use for similarity search.
Vector size (also called dimensionality) matters.
In this project, we use AWS Titan Text Embeddings v2, which produces vectors with 1024 dimensions by default. We keep that size end-to-end to stay consistent with both the model and the index configuration.
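Calling Titan Text Embeddings v2 through Bedrock looks roughly like this. It is a sketch assuming boto3 and configured AWS credentials; `build_request` and `embed` are illustrative helper names, not part of any SDK.

```python
import json

MODEL_ID = "amazon.titan-embed-text-v2:0"

def build_request(text: str, dimensions: int = 1024) -> str:
    # We pin 1024 dimensions so the output always matches the index mapping.
    return json.dumps({"inputText": text, "dimensions": dimensions})

def embed(text: str, client=None) -> list[float]:
    # Lazy import so the module loads without AWS dependencies installed.
    if client is None:
        import boto3
        client = boto3.client("bedrock-runtime")
    response = client.invoke_model(modelId=MODEL_ID, body=build_request(text))
    return json.loads(response["body"].read())["embedding"]
```

Each call returns one 1024-element vector per chunk, ready to be stored in the index.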
Choosing the vector size is not arbitrary: larger vectors can capture finer semantic distinctions, but they cost more to store and more to search.
As with chunking, there is no universal "best" choice. The right embedding model and vector size depend on your data, your queries, and your cost and latency constraints.
Below is an example of chunk #27 after embedding:
[
...
-0.022368588,0.03426118,-0.021072196,0.048965596,0.0401257,
-0.0003573049,0.10447999,0.02597144,0.006521694,0.02481099,
...
]
No worries if you can't read it - neither can the rest of us. What matters is that, as explained in Pragmatic AI – RAG Series part 2: It Finds the Right Information, retrieval can now perform efficient similarity searches over these values.
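What the retrieval step actually computes over those numbers is a directional comparison between vectors. Here is a minimal cosine-similarity sketch, for intuition only; real vector indexes use approximate nearest-neighbor structures rather than brute-force comparison.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 means the vectors point the same way (very similar meaning),
    # values near 0.0 mean the chunks are unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Retrieval then boils down to: embed the query, score every candidate chunk with this function, and return the top scorers.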
One important rule: stick to the same embedding model.
Same family. Same version. Same configuration. You cannot embed documents with one model today and query with another tomorrow, even if the dimensionality matches. Once vectors are generated and indexed, changing dimensionality or embedding models means re-embedding and re-indexing everything.
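One cheap way to enforce this rule is to record the model id and dimensionality used at ingestion time and fail fast on any mismatch at query time. A sketch; the `EXPECTED` constant and `check_embedding_config` helper are hypothetical names:

```python
# Frozen at ingestion time, e.g. stored next to the index configuration.
EXPECTED = {"model_id": "amazon.titan-embed-text-v2:0", "dimensions": 1024}

def check_embedding_config(model_id: str, vector: list[float]) -> None:
    """Reject queries (or re-ingests) that use a different embedding model
    or dimensionality than the one the index was built with."""
    if model_id != EXPECTED["model_id"]:
        raise ValueError(f"index was built with {EXPECTED['model_id']}, got {model_id}")
    if len(vector) != EXPECTED["dimensions"]:
        raise ValueError(f"expected {EXPECTED['dimensions']} dimensions, got {len(vector)}")
```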
Indexing
Now that we have vectors, the next question is simple: where do we store them? In this project, we use OpenSearch, which supports vector indexes out of the box. This allows us to store embeddings and run similarity search directly on them.
The only hard requirement is that the index configuration matches the embedding output. In practice, that means the vector field's dimension must equal the number of dimensions the embedding model produces - 1024 in our case.
Below is the index definition used in this project:
{
  "docs": {
    "aliases": {},
    "mappings": {
      "properties": {
        ...
        "chunk_id": { "type": "keyword" },
        "chunk_index": { "type": "integer" },
        "drive_link": { "type": "keyword" },
        "file_id": { "type": "keyword" },
        "file_name": { "type": "text" },
        "mime_type": { "type": "keyword" },
        "modified_time": { "type": "date" },
        "text": { "type": "text" },
        "vec": {
          "type": "knn_vector",
          "dimension": 1024
        }
      }
    },
    "settings": {
      "index": {
        ...
        "provided_name": "docs",
        "knn": "true",
        ...
      }
    }
  }
}
Notice that the index does not only store vectors. It also stores metadata alongside them. This metadata is critical: depending on your use case, you may use it to filter results, trace an answer back to its source file, or enforce data-handling requirements.
For example, PII or regulated data may require separate indexes or even separate databases. For internal documentation, metadata within the same index might be sufficient.
Finally, once data is indexed, the index must be refreshed so that subsequent queries can see the most up-to-date data.
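With the opensearch-py client, indexing a batch of chunks and then refreshing could be sketched like this. The helper names are illustrative; the field names follow the mapping shown above.

```python
def build_doc(chunk_id: str, chunk_index: int, text: str, vec: list[float],
              file_id: str, file_name: str) -> dict:
    # Field names mirror the "docs" index mapping.
    return {"chunk_id": chunk_id, "chunk_index": chunk_index, "text": text,
            "vec": vec, "file_id": file_id, "file_name": file_name}

def index_chunks(client, docs: list[dict]) -> None:
    # client is an opensearchpy.OpenSearch instance.
    for doc in docs:
        client.index(index="docs", id=doc["chunk_id"], body=doc)
    # Without a refresh, newly indexed chunks may not yet be visible to search.
    client.indices.refresh(index="docs")
```

In production you would use the bulk API rather than one request per chunk, but the refresh step stays the same.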
Checkpoint
At this point in the series, we have a functional RAG system.
It can ingest data, chunk it, generate embeddings, and index everything for retrieval. Retrieval supports hybrid search - combining exact term matching with semantic similarity - and the LLM reasons over that context to return accurate responses.
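One simple way to express such a hybrid query in OpenSearch is a bool query combining a lexical match clause with a knn clause. This is a sketch: `build_hybrid_query` is a hypothetical helper, and OpenSearch also offers a dedicated hybrid query type with score-normalization pipelines for more control over how the two scores are blended.

```python
def build_hybrid_query(query_text: str, query_vec: list[float], k: int = 5) -> dict:
    # Scores from the two clauses are added, so chunks that match both
    # lexically and semantically rank highest.
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"text": query_text}},                  # exact term matching
                    {"knn": {"vec": {"vector": query_vec, "k": k}}},  # semantic similarity
                ]
            }
        },
    }
```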
What does it actually take to move from "it works" to "we trust it in production"?
If you would like to go deeper or take a look at the code, feel free to reach out!
Next episode in this series: Pragmatic AI - RAG series part 4: Cost Control - because a system you can't afford to run… isn't a system you can ship.