Large Language Models (LLMs) are powerful, but how we augment, structure, and orchestrate them truly defines their impact. Here's a simple yet powerful breakdown of how AI systems are evolving:

1. LLM (Basic Prompt → Response)
↳ This is where it all started. You give a prompt, and the model predicts the next tokens. It's useful, but limited. No memory. No tools. Just raw prediction.

2. RAG (Retrieval-Augmented Generation)
↳ A significant leap forward. Instead of relying only on the LLM's training, we retrieve relevant context from external sources (like vector databases). The model then crafts a much more relevant, grounded response. This is the backbone of many current AI search and chatbot applications.

3. Agentic LLMs (Autonomous Reasoning + Tool Use)
↳ Now we're entering a new era. Agent-based systems don't just answer; they think, plan, retrieve, loop, and act. They:
- Use tools (APIs, search, code)
- Access memory
- Apply reasoning chains
- And most importantly, decide what to do next

These architectures are foundational for building autonomous AI assistants, copilots, and decision-makers. The future is not just about what the model knows, but how it operates. If you're building in this space, RAG and agent architectures are where the real innovation is happening.
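The retrieval step in pattern 2 above can be sketched in a few lines. The `embed` function below is a deliberately toy stand-in (a character-frequency vector); a real system would use a trained embedding model and a vector database, but the retrieve-then-prompt flow is the same.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding: character-frequency vector (stand-in for a real model)."""
    vec = np.zeros(128)
    for ch in text.lower():
        if ord(ch) < 128:
            vec[ord(ch)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query and keep the top k."""
    q = embed(query)
    scored = sorted(documents, key=lambda d: float(embed(d) @ q), reverse=True)
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Ground the model by prepending retrieved context to the user question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Vector databases store embeddings for fast similarity search.",
    "Bananas are rich in potassium.",
    "RAG retrieves relevant context before the model generates an answer.",
]
prompt = build_prompt("How does RAG use a vector database?", docs)
print(prompt)
```

The prompt that comes out of `build_prompt` is what actually gets sent to the LLM, which is why the response ends up grounded in the retrieved documents rather than in the model's parametric memory alone.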
Large Language Models Insights
-
-
If you're an AI engineer trying to optimize your LLMs for inference, here's a quick guide for you 👇

Efficient inference isn't just about faster hardware; it's a multi-layered design problem. From how you compress prompts to how your memory is managed across GPUs, everything impacts latency, throughput, and cost. Here's a structured taxonomy of inference-time optimizations for LLMs:

1. Data-Level Optimization
Reduce redundant tokens and unnecessary output computation.
→ Input Compression:
- Prompt Pruning: remove irrelevant history or system tokens
- Prompt Summarization: use model-generated summaries as input
- Soft Prompt Compression: encode static context using embeddings
- RAG: replace long prompts with retrieved documents plus compact queries
→ Output Organization:
- Pre-structure output to reduce decoding time and minimize sampling steps

2. Model-Level Optimization
(a) Efficient Structure Design
→ Efficient FFN Design: use gated or sparsely activated FFNs (e.g., SwiGLU)
→ Efficient Attention: FlashAttention, linear attention, or sliding-window attention for long context
→ Transformer Alternatives: e.g., Mamba, Reformer for memory-efficient decoding
→ Multi-/Grouped-Query Attention: share keys/values across heads to reduce KV cache size
→ Low-Complexity Attention: replace full softmax with approximations (e.g., Linformer)
(b) Model Compression
→ Quantization:
- Post-Training Quantization: no retraining needed
- Quantization-Aware Training: better accuracy, especially below 8-bit
→ Sparsification: weight pruning, sparse attention
→ Structure Optimization: neural architecture search, structure factorization
→ Knowledge Distillation:
- White-box: student learns internal states
- Black-box: student mimics output logits
→ Dynamic Inference: adaptive early exits or skipping blocks based on input complexity

3. System-Level Optimization
(a) Inference Engine
→ Graph & Operator Optimization: use ONNX, TensorRT, or BetterTransformer for op fusion
→ Speculative Decoding: use a smaller model to draft tokens, validate with the full model
→ Memory Management: KV cache reuse, paging strategies (e.g., PagedAttention in vLLM)
(b) Serving System
→ Batching: group requests with similar lengths for throughput gains
→ Scheduling: token-level preemption (e.g., TGI, vLLM schedulers)
→ Distributed Systems: use tensor, pipeline, or model parallelism to scale across GPUs

My Two Cents 🫰
→ Always benchmark end-to-end latency, not just token decode speed
→ For production, 8-bit or 4-bit quantized models with MQA and PagedAttention give the best price/performance
→ If using long context (>64k tokens), consider sliding-window attention plus RAG, not full dense memory
→ Use speculative decoding and batching for chat applications with high concurrency
→ LLM inference is a systems problem. Optimizing it requires thinking holistically, from tokens to tensors to threads.

Image inspo: "A Survey on Efficient Inference for Large Language Models"

Follow me (Aishwarya Srinivasan) for more AI insights!
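The post-training quantization entry in the taxonomy above can be illustrated with a minimal symmetric int8 scheme. Real toolchains use per-channel scales, zero-points, and calibration data; treat this as a sketch of the core idea only.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights onto [-127, 127] with a single symmetric scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)  # pretend these are FFN weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```

The reconstruction error is bounded by half the scale, which is why quantization below 8 bits usually needs quantization-aware training, as the post notes.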
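The speculative decoding entry above can be sketched with a toy greedy variant: a small "draft" model proposes a few tokens, the full model checks them, and the longest agreeing prefix is kept. Production implementations (e.g., in vLLM or TGI) verify probabilistically in one batched forward pass; the two callables below are stand-ins, not real models.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Propose k draft tokens, then keep those the target model agrees with."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Verify phase: accept while the full model would emit the same token.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))  # fall back to the target's token
            break
    return accepted

# Toy models: the draft echoes the last token; the target counts upward.
draft = lambda ctx: ctx[-1]
target = lambda ctx: ctx[-1] + 1
print(speculative_step(draft, target, [0], k=3))
```

When the draft model agrees with the target, several tokens are committed per full-model pass, which is where the latency win comes from.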
-
I've been saying for over a year that multimodal large language models will become the ultimate interface between physicians and a range of AI-based solutions. Here is the proof! In this study, the authors developed and evaluated an autonomous clinical AI agent that leverages GPT-4 with multimodal precision-oncology tools to support personalized clinical decision-making. The agent drew on multiple data sources, such as histopathology slides and radiological images, along with search tools like OncoKB, PubMed, and Google. "Evaluated on 20 realistic multimodal patient cases, the AI agent autonomously used appropriate tools with 87.5% accuracy, reached correct clinical conclusions in 91.0% of cases and accurately cited relevant oncology guidelines 75.5% of the time. Compared to GPT-4 alone, the integrated AI agent drastically improved decision-making accuracy from 30.3% to 87.2%." Source: https://lnkd.in/dwjGvxcH
-
AI models are at risk of degrading in quality as they increasingly train on AI-generated data, leading to what researchers call "model collapse." New research published in Nature reveals a concerning trend in AI development: as AI models train on data generated by other AI, their output quality diminishes. This degradation, likened to taking photos of photos, threatens the reliability and effectiveness of large language models. The study highlights the importance of using high-quality, diverse training data and raises questions about the future of AI if the current trajectory continues unchecked.

🖥️ Deteriorating Quality with AI Data: Research indicates that AI models progressively degrade in output quality when trained on content generated by preceding AI models, a cycle that worsens with each generation.

📉 The Phenomenon of Model Collapse: Described as the process where AI output becomes increasingly nonsensical and incoherent, "model collapse" mirrors the loss seen in repeatedly copied images.

🌐 Critical Role of Data Quality: High-quality, diverse, human-generated data is essential to maintaining the integrity and effectiveness of AI models and preventing the degradation observed with reliance on synthetic data.

🧪 Strategies for Mitigating Degradation: Measures such as allowing models to access a portion of the original, high-quality dataset have been shown to reduce some of the adverse effects of training on AI-generated data.

🔍 Importance of Data Provenance: Establishing robust methods to track the origin and nature of training data (data provenance) is crucial for ensuring that AI systems train on reliable and representative samples, which is vital for their accuracy and utility.

#AI #ArtificialIntelligence #ModelCollapse #DataQuality #AIResearch #NatureStudy #TechTrends #MachineLearning #DataProvenance #FutureOfAI
-
Can large language models be used in biotech? The short answer is yes. While LLMs are often associated with chatbots, their capabilities extend beyond that.

In biotech, much of the data comes in the form of sequences, like nucleotides in DNA or amino acids in proteins. Similar to sentences in natural language, these biological sequences carry semantic meaning that depends on the arrangement of their components. When input data is fed into an LLM, a transformer converts these sequences into contextual vectors using its attention mechanism. This allows the model to capture the context and relationships within the data, enabling it to predict subsequent elements.

One such use case is the prediction of neoantigens, which enable the targeting of tumor cells in personalized cancer immunotherapies. Neoantigens are tumor-specific mutated peptides presented on the surface of tumor cells because they bind to human leukocyte antigen (HLA) molecules. LLMs can predict this binding affinity, enabling the development of personalized therapies that use the patient's own immune system to kill tumor cells without damaging healthy tissue.
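The "sequences as sentences" idea above starts with tokenization: each amino acid becomes a token id, just as words do in text, before a transformer embeds it. A minimal sketch follows; the vocabulary ordering and padding scheme are illustrative assumptions, not a real model's, and this is not a binding-affinity predictor.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding

def tokenize(peptide: str, max_len: int = 12) -> list[int]:
    """Map a peptide to fixed-length token ids, padding with 0."""
    ids = [VOCAB[aa] for aa in peptide.upper()]
    return (ids + [0] * max_len)[:max_len]

# SIINFEKL is a well-known model peptide from ovalbumin.
print(tokenize("SIINFEKL"))
```

From here, an embedding layer plus attention would turn these ids into the contextual vectors the post describes, with a final head scoring peptide-HLA binding.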
-
The researchers at Google DeepMind just introduced "Matryoshka Quantization" (MatQuant), a clever new technique that could make deploying large language models much more efficient.

The key insight? Rather than creating separate models for different quantization levels (int8, int4, int2), MatQuant leverages the nested "Matryoshka" structure naturally present in integer data types. Think of it like Russian nesting dolls: the int2 representation is nested within int4, which is nested within int8.

Here are the major innovations:

1. Single Model, Multiple Precisions
>> MatQuant trains one model that can operate at multiple precision levels (int8, int4, int2)
>> You can extract lower-precision models by simply slicing out the most significant bits
>> No need to maintain separate models for different deployment scenarios

2. Improved Low-Precision Performance
>> Int2 models extracted from MatQuant are up to 10% more accurate than standard int2 quantization
>> This is a huge breakthrough, since int2 quantization typically severely degrades model quality
>> The researchers achieved this through co-training and co-distillation across precision levels

3. Flexible Deployment
>> MatQuant enables "Mix'n'Match": using different precisions for different layers
>> You can interpolate to intermediate bit-widths like int3 and int6
>> This allows fine-grained control over the accuracy vs. efficiency trade-off

The results are impressive. When applied to the FFN parameters of Gemma-2 9B:
>> Int8 and int4 models perform on par with individually trained baselines
>> Int2 models show significant improvements (8%+ better on downstream tasks)
>> Remarkably, an int2 FFN-quantized Gemma-2 9B outperforms an int8 FFN-quantized Gemma-2 2B

This work represents a major step forward in model quantization, making it easier to deploy LLMs across different hardware constraints while maintaining high performance. The ability to extract multiple precision levels from a single trained model is particularly valuable for real-world applications.

Looking forward to seeing how this technique gets adopted by the community and what further improvements it enables in model deployment efficiency! Let me know if you'd like me to elaborate on any aspect of the paper. I'm particularly fascinated by how they managed to improve int2 performance through the co-training approach. https://lnkd.in/g6mdmVjx
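The nesting described above can be illustrated on unsigned integer codes: the int4 code is simply the top 4 bits of the int8 code, and the int2 code the top 2 bits. (MatQuant's actual scheme also involves scales and co-training across precisions; this shows only the bit-slicing intuition.)

```python
def slice_bits(value_int8: int, target_bits: int) -> int:
    """Keep the `target_bits` most significant bits of an 8-bit code."""
    assert 0 <= value_int8 < 256 and 1 <= target_bits <= 8
    return value_int8 >> (8 - target_bits)

x = 0b10110110  # an 8-bit quantized weight code (182)
print(slice_bits(x, 4))  # top nibble: 0b1011 = 11
print(slice_bits(x, 2))  # top two bits: 0b10 = 2
```

Because the lower-precision code is a prefix of the higher-precision one, a single stored int8 model yields int4 and int2 variants for free at load time, which is the deployment win the post highlights.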
-
Large Language Diffusion Models (LLaDA)

Proposes a diffusion-based approach that can match or beat leading autoregressive LLMs on many tasks. If true, this could open a new path for large-scale language modeling beyond autoregression. More on the paper:

Questioning autoregressive dominance
While almost all large language models (LLMs) use the next-token prediction paradigm, the authors propose that key capabilities (scalability, in-context learning, instruction-following) actually derive from general generative principles rather than strictly from autoregressive modeling.

Masked diffusion + Transformers
LLaDA is built on a masked diffusion framework that learns by progressively masking tokens and training a Transformer to recover the original text. This yields a non-autoregressive generative model, potentially addressing the left-to-right constraints of standard LLMs.

Strong scalability
Trained on 2.3T tokens (8B parameters), LLaDA performs competitively with top LLaMA-based LLMs across math (GSM8K, MATH), code (HumanEval), and general benchmarks (MMLU). It demonstrates that the diffusion paradigm scales similarly well to autoregressive baselines.

Breaks the "reversal curse"
LLaDA shows balanced forward/backward reasoning, outperforming GPT-4 and other AR models on reversal tasks (e.g., reversing a poem line). Because diffusion does not enforce left-to-right generation, it is robust at backward completions.

Multi-turn dialogue and instruction-following
After supervised fine-tuning, LLaDA can carry on multi-turn conversations. It exhibits strong instruction adherence and fluency similar to chat-based AR LLMs, further evidence that advanced LLM traits do not necessarily rely on autoregression.

https://lnkd.in/eYp9Hi5y
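The masked-diffusion training objective sketched above (progressively mask tokens, train a Transformer to recover them) can be illustrated with the corruption step alone; the Transformer itself is omitted, and the `[MASK]` token name is an illustrative convention.

```python
import random

MASK = "[MASK]"

def corrupt(tokens: list[str], ratio: float, rng: random.Random) -> list[str]:
    """Mask each position independently with probability `ratio`."""
    return [MASK if rng.random() < ratio else tok for tok in tokens]

rng = random.Random(0)
sentence = "the cat sat on the mat".split()
t = rng.random()  # masking level, drawn fresh each training step
noisy = corrupt(sentence, t, rng)
# The model's training targets are the original tokens at the masked positions.
targets = [tok for tok, noised in zip(sentence, noisy) if noised == MASK]
print(t, noisy, targets)
```

Because every position can be masked or revealed independently of order, the recovery model is never forced into left-to-right generation, which is the property behind the "reversal curse" results above.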
-
Chain-of-Thought has been a fundamental architecture driving LLM performance. Now "Chain of Continuous Thought" (Coconut) significantly improves reasoning performance by working in latent space rather than language space. This paper from Meta's AI research group lays out the logic and results:

💡 Continuous Reasoning Unlocks Efficiency: Large Language Models (LLMs) traditionally reason in "language space," where reasoning steps are expressed as explicit tokens, leading to inefficiencies. The Coconut (Chain of Continuous Thought) paradigm instead reasons in a continuous latent space by feeding the model's hidden state back as input. This reduces reliance on explicit tokens and improves reasoning efficiency, especially for complex tasks requiring backtracking.

📊 Higher Accuracy in Complex Reasoning Tasks: Coconut achieves significant accuracy improvements on complex tasks requiring planning and logic. On ProsQA, a reasoning-intensive task, Coconut attains 97.0% accuracy, far exceeding Chain-of-Thought (CoT) at 77.5%. Similarly, on logical reasoning tasks like ProntoQA, it achieves near-perfect performance at 99.8% accuracy, outperforming or matching other baselines while demonstrating superior planning capabilities.

⚡ Greater Efficiency with Fewer Tokens: Coconut sharply reduces the number of generated tokens. For example, on GSM8k (math reasoning), Coconut achieves 34.1% accuracy using just 8.2 tokens, compared to CoT's 42.9% accuracy requiring 25 tokens: on this task, latent-space reasoning trades some accuracy for far fewer explicit steps.

🌟 Parallel Reasoning Explores Multiple Alternative Steps: Coconut enables LLMs to simultaneously explore multiple reasoning paths by encoding alternative next steps in the continuous latent space. This parallel reasoning behavior mimics breadth-first search (BFS), allowing the model to avoid premature decisions and progressively narrow down the correct solution.

🔄 Multi-Stage Training Accelerates Learning: Coconut leverages a curriculum-based training strategy in which the reasoning chain is gradually replaced with latent thoughts. This phased approach facilitates model learning, improving performance on math problems (GSM8k) and logical tasks and outperforming baselines like No-CoT and iCoT.

🔍 Latent Reasoning Improves Planning and Focus: By reasoning in latent space, the model avoids premature decisions and progressively narrows down possibilities. Coconut shows reduced hallucinations and improved accuracy compared to CoT, demonstrating its ability to prioritize promising reasoning paths while pruning irrelevant ones.

New model architectures are consistently improving LLM performance and efficiency. Even without more training data and underlying model progress, we are seeing consistent advances. Link to paper in comments.
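The "hidden state fed back as input" mechanism above can be shown schematically: in latent mode, the final hidden state is appended to the input sequence as the next embedding instead of being decoded into a token. `transformer_step` below is a toy stand-in for a real forward pass, not Coconut's actual model.

```python
import numpy as np

def transformer_step(h_seq: np.ndarray) -> np.ndarray:
    """Stand-in forward pass: returns one hidden state for the sequence."""
    return np.tanh(h_seq.mean(axis=0))  # toy mixing of all positions

def latent_reasoning(embeddings: np.ndarray, n_thoughts: int) -> np.ndarray:
    """Append n continuous 'thoughts', each the previous step's hidden state."""
    seq = embeddings
    for _ in range(n_thoughts):
        thought = transformer_step(seq)  # hidden state; no token is decoded
        seq = np.vstack([seq, thought])  # fed straight back as the next input
    return seq

prompt = np.random.default_rng(0).normal(size=(5, 16))  # 5 prompt embeddings
out = latent_reasoning(prompt, n_thoughts=3)
print(out.shape)  # 5 prompt embeddings + 3 latent thoughts
```

Skipping the decode-and-re-embed round trip is what lets a single continuous thought carry richer information than one discrete token, including superpositions of alternative next steps.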
-
Generative AI cannot write what hasn't been written. This is a subtle but profound truth the tech industry is only now fully coming to understand.

Ask a question of any large language model, from ChatGPT to Grok to Gemini, and you'll notice that LLMs cannot deliver reliable answers on topics that haven't been covered. Gen AI programs work well when you are researching big companies like Nvidia and famous people like Tim Cook, but not at all for the obscure or unsung. Since there is no way humans can keep up, the somewhat counterintuitive solution to this problem is to tap Gen AI to create a vast library of digital content. Effectively, we need machines to write articles so other machines can read them.

To illustrate the challenge, consider Perplexity, an AI search engine that is building a finance vertical. The biggest challenge won't be the speed or depth of the LLMs it leverages, but the lack of historical news about companies. In many cases, these are small or mid-sized companies that were never covered by reporters at Bloomberg or Reuters or the New York Times. Remember: Generative AI cannot write what hasn't been written.

The importance of specialized content was driven home by the recent announcement from DeepSeek, a Chinese AI lab, that it had unveiled a high-performing open-source large language model at a fraction of the cost of the version created by OpenAI. The DeepSeek announcement signaled that model enhancements will continue to come fast and furious, each leapfrogging the previous one and driving down inference costs. Developers seeking to build moats around their applications will rely less on model performance and more on the quality, reliability, and comprehensiveness of content.
-
GenAI's black box problem is becoming a real business problem. Large language models are racing ahead of our ability to explain them. That gap (the "representational gap" for the cool kids) is no longer just academic; it is now a #compliance and risk-management issue.

Why it matters:
• Reliability: If you can't trace how a model reached its conclusion, you can't validate accuracy.
• Resilience: Without interpretability, you can't fix failures or confirm fixes.
• Regulation: From the EU AI Act to sector regulators in finance and health care, transparency is quickly becoming non-negotiable.

Signals from the frontier:
• Banks are stress-testing GenAI the same way they test credit models, using surrogate testing, statistical analysis, and guardrails.
• Researchers at firms like #Anthropic are mapping millions of features inside LLMs, creating "control knobs" to adjust behavior and probes that flag risky outputs before they surface.

As AI shifts from answering prompts to running workflows and making autonomous decisions, traceability will move from optional to mandatory.

The takeaway: Interpretability is no longer a nice-to-have. It is a license to operate. Companies that lean in will not only satisfy regulators but also build the trust of customers, partners, and employees.

Tip of the hat to Alison Hu, Sanmitra Bhattacharya, PhD, Gina Schaefer, Rich O'Connell, and Beena Ammanath's whole team for this great read.