Ve Sharma’s Post

Ve Sharma · Microsoft · 5K followers

Had a great time on the Vancouver.dev AI & RAG panel earlier this week! 🙌 Here are some additional quick tips & tricks on LLMs and RAG that I didn't get the chance to share on the panel! 🤓

Strategizing with RAG:
- ✅ RAG for Grounded Truth: Use RAG when answers must stem from your specific, verifiable, up-to-date knowledge for trust & accuracy.
- 🤔 Rapid Prototyping First: Consider skipping complex RAG for V1 if a large-context model (like Gemini 2.5 Pro) suffices. Move fast, optimize later.
- 🔑 Retrieval is King: RAG success often hinges more on smart retrieval engineering (finding the right data) than on the LLM itself.
- 🚀 Beyond Q&A: The trend is toward agentic RAG & complex workflows for sophisticated, multi-step tasks.

Taming LLM & RAG Costs:
- 💰 Optimize Embedding Costs: Embedding isn't free! Consider smaller, efficient open-source embedding models (self-hosted?) vs. pricey APIs, especially for large datasets.
- 🔍 Pre-Filter Before Vector Search: Apply metadata filters (dates, categories) first to narrow the vector search space, reducing compute and improving relevance.
- ✂️ Context Compression/Summarization: Before feeding retrieved context to the LLM, summarize or compress it (possibly with another, cheaper LLM call) to cut down expensive final LLM tokens.
- 🔄 Incremental Indexing: Avoid re-embedding/re-indexing your entire knowledge base constantly; only process new or updated documents to save compute & API calls.
- 🤏 Right-Size Your Model: Defaulting to the biggest, most expensive LLM? Choose the smallest, most efficient model that meets your specific needs first.
- 💡 Smart Infra Choices: Explore open-source models on optimized, pay-per-use infra (like Cloud Run GPU) for potentially huge savings on predictable workloads vs. always-on managed endpoints.
- 🌊 Model Cascading for RAG: Try answering with a cheaper LLM first using the retrieved context; only escalate to a premium model if the cheaper one fails or the query is complex.
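The pre-filter-before-vector-search tip is easy to sketch in plain Python. This is a minimal, self-contained illustration (toy 2-D vectors and a hand-rolled cosine similarity; all document IDs and metadata fields are made up, and a real system would use a vector DB's built-in metadata filtering):

```python
import math

# Toy corpus: each doc has an embedding plus metadata.
# IDs, vectors, and metadata fields here are purely illustrative.
DOCS = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"year": 2024, "category": "billing"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"year": 2021, "category": "billing"}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"year": 2024, "category": "legal"}},
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, docs, top_k=2, **filters):
    # 1) Cheap metadata pre-filter narrows the candidate set...
    candidates = [d for d in docs
                  if all(d["meta"].get(k) == v for k, v in filters.items())]
    # 2) ...so the expensive similarity pass runs over fewer items.
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]
```

In production the same idea shows up as a `where`/metadata clause on the vector store's query, but the ordering principle is identical: filter cheaply first, rank expensively second.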
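The context-compression point can be approximated without any LLM at all. This sketch keeps only the retrieved chunks with the most word overlap with the question — a deliberately crude stand-in for the "cheaper LLM call" the post suggests, just to show where the compression step sits in the pipeline:

```python
def compress_context(chunks, question, max_chunks=2):
    # Naive extractive compression: rank retrieved chunks by word
    # overlap with the question and keep only the top few, so fewer
    # tokens reach the final (expensive) LLM call.
    q_words = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:max_chunks]
```

Swapping the overlap score for a small summarization model gives true compression rather than selection, but the cost logic is the same: spend a little compute up front to shrink the expensive final prompt.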
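Incremental indexing usually comes down to change detection. A hedged sketch using content hashes (the state store here is just a dict; a real pipeline would persist it and then embed only the returned IDs):

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(documents, index_state):
    """Return only the docs whose content changed since the last run.

    documents: {doc_id: text}; index_state: {doc_id: last_seen_hash}.
    Mutates index_state so the next run skips unchanged docs.
    """
    changed = []
    for doc_id, text in documents.items():
        h = content_hash(text)
        if index_state.get(doc_id) != h:
            changed.append(doc_id)
            index_state[doc_id] = h
    return changed
```

Only the IDs this returns need re-embedding, which is where the compute and API savings come from.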
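Model cascading is just a fallback chain. In this sketch both "models" are hypothetical stubs (a real version would call a small model, check its confidence or a validator, and only then pay for the premium call):

```python
def cheap_model(question, context):
    # Stand-in for a small, cheap LLM: "succeeds" only when the
    # question term appears in the retrieved context, else signals
    # low confidence by returning None.
    if question.lower() in context.lower():
        return "answer from cheap model"
    return None

def premium_model(question, context):
    # Stand-in for the expensive model, used only as a fallback.
    return "answer from premium model"

def cascaded_answer(question, context):
    # Try the cheap model first; escalate only when it fails.
    answer = cheap_model(question, context)
    if answer is not None:
        return answer, "cheap"
    return premium_model(question, context), "premium"
```

The design choice that matters is the escalation signal: token-level confidence, a self-check prompt, or a lightweight validator all work, as long as most traffic resolves on the cheap tier.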
#cantech #vantech #ai #rag

Jaron Sander · Inform Growth · 1K followers · 11mo

Super insightful! I'm curious whether you have any insight into how to figure out the best chunking strategy and embedding model for documents in a specific domain, or do these things not matter as much as the retrieval method?
