Paul Chiusano’s Post

For search applications (vector, full-text, or some hybrid), people often just statically choose some general-purpose encoding or embedding. But as new documents are added to the indexed corpus, that encoding can drift out of step with the data and lose the fine distinctions users care about. As a drastic example, imagine a bunch of DNA sequences get added to the corpus while the embedding or tokenizing function was trained only on generic "text on the internet". You really want to evolve this contextual information as the corpus evolves. The question is: how do you do that conveniently?

I was noodling on this, and I think log-structured merge trees (like the one I have in progress for Arcella https://lnkd.in/giE-6UfF) have the right sort of machinery. As new values are inserted, the data structure is periodically rebuilt on an exponential schedule. The moment you're merging two levels is the perfect time to (lazily) rebuild the corpus-contextual information for that level. You don't even have to make the arbitrary encoding decisions ahead of time, since the data structure is effectively "learning" better encodings for the kind of information it is actually indexing. And because merges happen on an exponential schedule, the cost is amortized away.
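To make the idea concrete, here is a minimal Python sketch (my illustration, not Arcella's actual implementation — the `Level` and `LSMIndex` names and the vocabulary-as-encoder stand-in are all assumptions). Each level carries its own "encoder", here just a token vocabulary fit to that level's documents; in a real system it could be a tokenizer or embedding model fine-tuned on the level's contents. The encoder is rebuilt exactly and only at merge time, so larger levels refit exponentially less often:

```python
from collections import Counter

class Level:
    """One LSM level: a batch of docs plus an encoder fit to them.

    The 'encoder' here is a toy stand-in: a vocabulary built from this
    level's documents. A real system might refit a tokenizer or embedding
    model on the level's contents instead.
    """
    def __init__(self, docs):
        self.docs = docs
        self.vocab = self.fit_vocab(docs)  # rebuilt whenever a Level is made

    @staticmethod
    def fit_vocab(docs):
        counts = Counter(tok for d in docs for tok in d.split())
        return set(counts)

class LSMIndex:
    """Toy LSM structure: level i holds ~memtable_size * 2**i docs."""
    def __init__(self, memtable_size=2):
        self.memtable = []
        self.memtable_size = memtable_size
        self.levels = []  # None marks an empty slot

    def insert(self, doc):
        self.memtable.append(doc)
        if len(self.memtable) >= self.memtable_size:
            self._flush()

    def _flush(self):
        new = Level(self.memtable)
        self.memtable = []
        i = 0
        # Cascade merges down the levels. Each merge constructs a fresh
        # Level, which is precisely where the encoder gets refit — lazily,
        # and on the exponential schedule the level sizes impose.
        while i < len(self.levels) and self.levels[i] is not None:
            new = Level(self.levels[i].docs + new.docs)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(new)
        else:
            self.levels[i] = new
```

After four inserts with `memtable_size=2`, two flushes occur: the second one merges level 0 into a new level 1, refitting that level's vocabulary over all four documents in the process. The point of the sketch is the placement of `fit_vocab`: it runs inside `Level.__init__`, so encoder rebuilds ride along with merges for free rather than being scheduled separately.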

Curious what the advantages are over a trie. Does it use a combination of in-memory and on-disk storage?


FYI: There is an extra ')' in the link to Arcella.
