Paul Chiusano’s Post

For search applications (vector, full-text, or some hybrid), people often just statically choose some general-purpose encoding or embedding. But as new documents are added to the indexed corpus, that encoding can drift out of step with the data and lose the fine distinctions users care about. As a drastic example, imagine a bunch of DNA sequences get added to the corpus while the embedding or tokenizing function was trained only on generic "text on the internet". You really want to evolve this contextual information as the corpus evolves. The question is: how do you do that conveniently?

I was noodling on this, and I think log-structured merge trees (like the one I have in progress for Arcella https://lnkd.in/giE-6UfF) have the right sort of machinery. As new values are inserted, the data structure is periodically rebuilt on an exponential schedule. The moment you're merging two levels is the perfect time to (lazily) rebuild the corpus-contextual information for that level. You don't even have to make the arbitrary encoding decisions ahead of time, since the data structure is effectively "learning" better encodings for the kind of information it is actually indexing. And because merges happen on an exponential schedule, the cost is amortized away.
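To make the idea concrete, here is a minimal Python sketch (my illustration, not Arcella's actual implementation — the `Level` and `LSMIndex` names and the vocabulary-as-encoder stand-in are all assumptions). Each level carries its own "encoder", here just a token vocabulary fit to that level's documents; in a real system it could be a tokenizer or embedding model fine-tuned on the level's contents. The encoder is rebuilt exactly and only at merge time, so larger levels refit exponentially less often:

```python
from collections import Counter

class Level:
    """One LSM level: a batch of docs plus an encoder fit to them.

    The 'encoder' here is a toy stand-in: a vocabulary built from this
    level's documents. A real system might refit a tokenizer or embedding
    model on the level's contents instead.
    """
    def __init__(self, docs):
        self.docs = docs
        self.vocab = self.fit_vocab(docs)  # rebuilt whenever a Level is made

    @staticmethod
    def fit_vocab(docs):
        counts = Counter(tok for d in docs for tok in d.split())
        return set(counts)

class LSMIndex:
    """Toy LSM structure: level i holds ~memtable_size * 2**i docs."""
    def __init__(self, memtable_size=2):
        self.memtable = []
        self.memtable_size = memtable_size
        self.levels = []  # None marks an empty slot

    def insert(self, doc):
        self.memtable.append(doc)
        if len(self.memtable) >= self.memtable_size:
            self._flush()

    def _flush(self):
        new = Level(self.memtable)
        self.memtable = []
        i = 0
        # Cascade merges down the levels. Each merge constructs a fresh
        # Level, which is precisely where the encoder gets refit — lazily,
        # and on the exponential schedule the level sizes impose.
        while i < len(self.levels) and self.levels[i] is not None:
            new = Level(self.levels[i].docs + new.docs)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(new)
        else:
            self.levels[i] = new
```

After four inserts with `memtable_size=2`, two flushes occur: the second one merges level 0 into a new level 1, refitting that level's vocabulary over all four documents in the process. The point of the sketch is the placement of `fit_vocab`: it runs inside `Level.__init__`, so encoder rebuilds ride along with merges for free rather than being scheduled separately.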

Curious what the advantages are over a trie. Does it use a combination of in-memory and on-disk storage?


FYI: There is an extra ')' in the link to Arcella.
