User Experience for Voice Interfaces


  • Andrew Ng

    DeepLearning.AI, AI Fund and AI Aspire

    2,463,151 followers

    The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future posts.

    Foundation models trained to directly take in, and often also directly generate, audio have contributed to this growth, but they are only part of the story. OpenAI’s Realtime API makes it easy for developers to write prompts for systems that deliver voice-in, voice-out experiences. This is great for building quick-and-dirty prototypes, and it also works well for low-stakes conversations where making an occasional mistake is okay. I encourage you to try it!

    However, compared to text-based generation, it is still hard to control the output of voice-in, voice-out models. When we use an LLM to generate text instead of directly generating audio, we have many tools for building guardrails, and we can double-check the output before showing it to users. We can also use sophisticated agentic reasoning workflows to compute high-quality outputs. Before a customer-service agent shows a user the message, “Sure, I’m happy to issue a refund,” we can make sure that (i) issuing the refund is consistent with our business policy and (ii) we will call the API to issue the refund (and not just promise a refund without issuing it). In contrast, the tools to prevent a voice-in, voice-out model from making such mistakes are much less mature. In my experience, the reasoning capability of voice models also seems inferior to that of text-based models, and they give less sophisticated answers. (Perhaps this is because voice responses have to be briefer, leaving less room for chain-of-thought reasoning to arrive at a more thoughtful answer.)
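    The refund example above can be sketched as a text-stage guardrail: verify policy and actually perform the action before the promise ever reaches the user. This is a minimal illustration, not OpenAI's or any production system's API; `check_refund_policy`, `issue_refund`, and the 30-day rule are hypothetical stand-ins.

    ```python
    # Hypothetical guardrail applied to a drafted text reply before it is
    # shown (or spoken) to the user. The policy rule and helpers are
    # invented for illustration.

    def check_refund_policy(order: dict) -> bool:
        # Hypothetical rule: refunds allowed within 30 days of purchase.
        return order["days_since_purchase"] <= 30

    def issue_refund(order: dict) -> None:
        order["refunded"] = True  # stand-in for a real billing API call

    def guarded_reply(draft: str, order: dict) -> str:
        if "refund" in draft.lower():
            if not check_refund_policy(order):
                # (i) the draft contradicts policy: replace it
                return "I'm sorry, this order is outside our refund window."
            # (ii) actually perform the action before promising it
            issue_refund(order)
        return draft
    ```

    The key point, which is easy in text and hard in direct audio generation, is that the model's draft is just a proposal that deterministic code can veto or act on before the user hears anything.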
When building applications where I need more control over the output, I use agentic workflows to reason at length about the user’s input. In voice applications, this means I end up using a pipeline that includes speech-to-text (STT) to transcribe the user’s words, then processes the text using one or more LLM calls, and finally returns an audio response to the user via text-to-speech (TTS). Because the reasoning is done in text, this allows for more accurate responses. However, the pipeline introduces latency, and users of voice applications are very sensitive to latency.

When DeepLearning.AI worked with RealAvatar (an AI Fund portfolio company led by Jeff Daniel) to build an avatar of me, we found that getting TTS to generate a voice that sounded like me was not very hard, but getting it to respond to questions using words similar to those I would choose was. Even after much tuning, it remains a work in progress. You can play with it at https://lnkd.in/gcZ66yGM [At length limit. Full text, including latency reduction technique: https://lnkd.in/gjzjiVwx ]
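The cascaded STT → LLM → TTS pipeline described above can be sketched as follows. All three stage functions here are hypothetical stand-ins for real STT, LLM, and TTS services; only the wiring, with reasoning done in text between transcription and synthesis, is the point.

```python
# A minimal sketch of the cascaded voice pipeline: speech-to-text,
# text-stage reasoning (where guardrails can run), then text-to-speech.
# Each stage is a stand-in for a real service call.

def speech_to_text(audio: bytes) -> str:
    return "what is your refund policy"  # stand-in transcription

def reason_over_text(user_text: str) -> str:
    # In practice: one or more LLM calls, plus guardrails and checks,
    # all operating on text where they are easiest to build.
    return f"Here is our policy regarding: {user_text}"

def text_to_speech(reply_text: str) -> bytes:
    return reply_text.encode("utf-8")  # stand-in audio synthesis

def voice_turn(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)
    reply = reason_over_text(transcript)  # reasoning happens in text
    return text_to_speech(reply)
```

Each hop adds latency, which is why the trade-off the post describes (control and accuracy versus responsiveness) is inherent to this architecture.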

  • Reid Hoffman

    Co-Founder, LinkedIn, Manas AI & Inflection AI. Founding Team, PayPal. Author of Superagency. Podcaster of Possible and Masters of Scale.

    2,762,137 followers

    I Am Voicepilled. A major step forward in human-computer interaction won’t come from bigger models alone, but from how we talk to them. Natively. With our voices. And that's why, if you haven't started communicating with ChatGPT and other AI models simply by speaking to them, you should. It might change how you prompt in general.

    What does “voicepilled” mean exactly? Per The Matrix, it's that moment of sudden mental clarity, a new way of seeing the world once you've taken the red pill. Being “voicepilled” is the moment you realize that once you start seriously using your voice to interact with technology, you unlock a new way to amplify your ability. If you use products like Wispr or ChatGPT Voice, you know what I mean.

    Human-computer interaction has always been a blend of art, psychology, and physiology. When DARPA first sponsored Douglas Engelbart’s team to create the mouse, it was a reimagining of what computing could feel like. Voice user interfaces do the same. We can use them to send quicker texts to friends, to query AI models, and eventually to interface with a Waymo as it takes us to our destinations.

    Will people abandon keyboards completely? Not if they spend a lot of their time working with spreadsheets, writing complex documents, working with imaging and video-editing software, or love communicating with GIFs and emojis. But for many everyday purposes, voice is simply faster, more natural, and more flexible than typing. And what’s changed now is that state-of-the-art AI models can genuinely process what we say.

    This in turn can change how we prompt models like ChatGPT. Voice is iterative by default. You stumble, you rephrase, you interrupt yourself. You can do unstructured brain dumps: “just take this picture, and you figure it out.” So voice input can be a powerful antidote to perfectionism. It teaches you that iteration is the workflow. That the value is not in getting it right once, but in getting it better, repeatedly.
And once the iteration stabilizes, you can simply say: “Take this discussion and generate a reusable prompt for me.”

For companies, voice creates richer signals about what their users want. A shopping platform that receives a voice query gets more information about what the customer is searching for, and can deliver better results, leading to increased sales.

Voice might even reshape hardware and the architecture of work. Right now, sitting next to someone dictating all day feels unusual. But as voice becomes primary, we may design office spaces knowing that most of our time will be spent in dialogue with our workstations rather than typing on them.

Every major leap in computing has arrived not just from smarter machines, but from better interfaces. As AI models scale, one of the most profound changes will be that everyone can now interact with them in the most human way possible. To be voicepilled is to glimpse this future.

  • Sumanth P

    Machine Learning Developer Advocate | LLMs, AI Agents & RAG | Shipping Open Source AI Apps | AI Engineering

    81,436 followers

    Microsoft just fixed a major speech recognition problem! They open-sourced VibeVoice-ASR, a speech-to-text model that processes 60 minutes of audio in a single pass.

    Here's the problem with most ASR models: they slice audio into short chunks, usually 30 seconds or less, process each chunk separately, and lose speaker context between segments. You get disconnected transcripts that can't track who said what across a full meeting.

    VibeVoice-ASR handles 60 minutes of continuous audio without chunking. The model maintains global context across the entire hour, and the output is structured: who spoke, when they spoke, what they said. Speaker diarization, timestamps, and transcription all in one pass.

    Key features:
    • 60-minute single-pass processing without chunking audio
    • Structured output: speaker labels, timestamps, and content combined
    • Customized hotwords: provide specific names or technical terms to improve accuracy
    • Multilingual support: 50+ languages
    • Joint ASR, diarization, and timestamping in one model

    The model is 7B parameters and fully open source. I've shared the repo link in the comments!
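    The structured output the post describes, where each record carries speaker, timestamps, and text together, might be modeled like this. The schema and the `speaker_turns` helper are illustrative assumptions, not VibeVoice-ASR's actual output format.

    ```python
    # Illustrative data model for diarized, timestamped transcription:
    # one record per segment, combining speaker label, times, and text.
    from dataclasses import dataclass

    @dataclass
    class Segment:
        speaker: str
        start: float  # seconds from start of audio
        end: float
        text: str

    def speaker_turns(segments: list[Segment]) -> dict[str, str]:
        """Collect everything each speaker said, in time order.

        This is the kind of query that only works if speaker identity is
        consistent across the whole recording rather than per-chunk.
        """
        turns: dict[str, str] = {}
        for seg in sorted(segments, key=lambda s: s.start):
            prefix = " " if turns.get(seg.speaker) else ""
            turns[seg.speaker] = turns.get(seg.speaker, "") + prefix + seg.text
        return turns
    ```

    With chunked ASR, "speaker A" in minute 3 and "speaker A" in minute 40 may be different people; a single-pass model with global context is what makes an aggregation like this meaningful.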

  • Aishwarya Srinivasan
    626,016 followers

    Cartesia Sonic-3 is the first AI voice model I’ve seen that nails Hindi perfectly. For years, even the best text-to-speech (TTS) models struggled with Hindi. The rhythm, tonality, and emotional micro-expressions just didn’t sound human, and the accent was inaccurate. This model doesn’t just translate Hindi. It is specially trained for it, with precise control over pacing, expression, and tonality, all rendered in real time.

    Under the hood, Sonic-3 is engineered for low-latency voice generation optimized for conversational AI agents, clocking in 3–5x faster than OpenAI’s TTS while maintaining superior transcript fidelity. What makes it stand out technically:

    → 𝗚𝗿𝗮𝗻𝘂𝗹𝗮𝗿 𝗰𝗼𝗻𝘁𝗿𝗼𝗹 𝘁𝗮𝗴𝘀 let developers dynamically modulate speed, volume, and emotion inside the transcript itself. ("Can you repeat that slower?" now works in production.)
    → 𝟰𝟮-𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗺𝗼𝗱𝗲𝗹 built on a single unified speaker embedding, so one voice can switch between languages like Hindi, Tamil, and English natively while maintaining accent continuity.
    → 𝟯-𝘀𝗲𝗰𝗼𝗻𝗱 𝘃𝗼𝗶𝗰𝗲 𝗰𝗹𝗼𝗻𝗶𝗻𝗴 powered by a low-sample adaptive cloning pipeline that enables instant personalization at scale.
    → 𝗥𝗲𝗮𝗹-𝘁𝗶𝗺𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝘀𝘁𝗮𝗰𝗸 achieving sub-300 ms end-to-end latency at p90, tuned for live interactions like support agents, NPCs, and healthcare assistants.
    → 𝗙𝗶𝗻𝗲-𝗴𝗿𝗮𝗶𝗻𝗲𝗱 𝘁𝗿𝗮𝗻𝘀𝗰𝗿𝗶𝗽𝘁 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 that handles heteronyms, acronyms, and structured text (emails, IDs, phone numbers), which usually break realism in production systems.

    🎧 Here is an example of me trying Sonic-3’s Hindi. You have to hear it to believe it. If you’re building voice agents, conversational AI, or multimodal assistants, keep an eye on Cartesia. They’ve raised $100M to build the most human-sounding voice models in the world, and Sonic-3 just set a new benchmark for multilingual voice AI. #CartesiaPartner
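    The inline control-tag idea can be illustrated generically: markup embedded in the transcript is parsed out before synthesis and applied as rendering settings. Note this is not Cartesia's actual tag syntax; the `[speed:0.8]`-style markup and the `parse_controls` helper are invented for illustration.

    ```python
    # Generic sketch of inline control tags in a TTS transcript.
    # The tag format here is hypothetical, not Sonic-3's real syntax.
    import re

    TAG = re.compile(r"\[(speed|volume|emotion):([\w.]+)\]")

    def parse_controls(transcript: str) -> tuple[str, list[tuple[str, str]]]:
        """Split a marked-up transcript into clean text plus control settings."""
        controls = [(m.group(1), m.group(2)) for m in TAG.finditer(transcript)]
        clean = TAG.sub("", transcript)
        return " ".join(clean.split()), controls  # normalize whitespace
    ```

    The design point is that the controls travel with the text, so an agent can react mid-conversation ("Can you repeat that slower?") by re-emitting the same sentence with a different tag rather than switching voices or models.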

  • Vasu Gupta

    L&D Leader | E-Leaning | Instructional Design | LMS | MF, PMS, AIF, Bonds, Unlisted, Insurance - Coach | NISM VA Certified | LIII | Centricity Wealthtech | Views are personal

    3,637 followers

    India just got its own multilingual AI stack. Not a demo. A real platform.

    Most AI still speaks English first. India does not. We keep talking about AI scale, but ignore language reality. Sarvam AI just shipped something important: an open-source foundational model suite built for 10 Indian languages and designed voice-first. That changes who AI is for.

    Here’s what stands out to me:
    • India’s first open-source 2B Indic LLM trained on ~4 trillion tokens
    • Voice agents deployable via phone, WhatsApp, and in-app workflows
    • Speech → text → translation → synthesis in a single Indic stack
    • Legal AI workbench for drafting, redaction, and regulatory Q&A
    • Pricing that starts around ₹1 per minute for multilingual agents

    This is not chasing Silicon Valley scale. It’s solving Indian constraints: smaller, efficient models that run where India actually is; voice interfaces for users who skip keyboards; agentic workflows, not just chat responses.

    And the quiet but big idea: sovereign AI infrastructure. Data stays local. Models align with Indian regulation. Control stays domestic. That matters for BFSI, legal, telecom, and any sector touching sensitive data.

    The real unlock is inclusion. AI that works in Hindi, Tamil, Telugu, Malayalam, Punjabi, Odia, Gujarati, Marathi, Kannada, and Bengali. AI that listens before it types.

    We keep saying India will be an AI market. This is India building AI rails. Open-source, voice-first, enterprise-ready. That combination is rare. If this ecosystem compounds, India does not just consume AI, it exports it.

    Watching this space closely. Local language AI is the next growth curve. What sectors do you think adopt first?

  • Allys Parsons

    Co-Founder at techire ai. ICASSP ‘26 Sponsor. Hiring in AI since ’19 ✌️ Speech AI, TTS, LLMs, Multimodal AI & more! Top 200 Women Leaders in Conversational AI ‘23 | No.1 Conversational AI Leader ‘21

    17,952 followers

    VoiceTextBlender introduces a novel approach to augmenting LLMs with speech capabilities through single-stage joint speech-text supervised fine-tuning. The researchers from Carnegie Mellon and NVIDIA have developed a more efficient way to create models that can handle both speech and text without compromising performance in either modality.

    The team's 3B parameter model demonstrates superior performance compared to previous 7B and 13B SpeechLMs across various speech benchmarks whilst preserving the original text-only capabilities, addressing the critical challenge of catastrophic forgetting that has plagued earlier attempts.

    Their technical approach employs LoRA adaptation of the LLM backbone, combining text-only SFT data with three distinct types of speech-related data: multilingual ASR/AST, speech-based question answering, and an innovative mixed-modal interleaving dataset created by applying TTS to randomly selected sentences from text SFT data.

    What's particularly impressive is the model's emergent ability to handle multi-turn, mixed-modal conversations despite being trained only on single-turn speech interactions. The system can process user input in pure speech, pure text, or any combination, showing impressive generalisation to unseen prompts and tasks.

    The researchers have committed to publicly releasing their data generation scripts, training code, and pre-trained model weights, which should significantly advance research in this rapidly evolving field of speech language models.

    Paper: https://lnkd.in/dutRcaAA
    Authors: Yifan Peng, Krishna C. Puvvada, Zhehuai Chen, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, Boris Ginsburg

    #SpeechLM #MultimodalAI #SpeechAI
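    The mixed-modal interleaving idea, applying TTS to randomly selected sentences of text SFT data, can be sketched as follows. The `tts` stand-in and the per-sentence probability `p` are assumptions for illustration, not the paper's exact recipe.

    ```python
    # Sketch of building a mixed-modal interleaved training sample:
    # each sentence of a text SFT example is independently replaced by
    # (placeholder) TTS audio with probability p, yielding a sequence
    # that interleaves speech and text.
    import random

    def tts(sentence: str) -> bytes:
        return sentence.encode("utf-8")  # stand-in for a real TTS system

    def interleave(sentences: list[str], p: float, rng: random.Random) -> list:
        """Return a mixed list of text (str) and audio (bytes) items."""
        return [tts(s) if rng.random() < p else s for s in sentences]
    ```

    Training on sequences like this is presumably what teaches the model to accept speech, text, or any combination within one input, which matches the generalisation behaviour the post highlights.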

  • Andreas Tussing

    charles | Marketing Automation & AI for WhatsApp, RCS & Co | 249% ROI by Forrester TEI

    17,000 followers

    “Make it sound like us.” Sounds easy. It isn’t.

    I smiled when I saw the post on X about finally getting ChatGPT to stop using em-dashes. Two things can be true: it’s a tiny UX detail, and it took serious work to make it reliable. For sure, that habit must have sat deep.

    It brings me to a topic we deal with a lot: expectations for AI are sky-high. We feel that every day. But LLMs don’t “follow rules”, they follow likelihood. Injecting deterministic expectations into probabilistic models is like steering a sailboat in shifting winds: you can set the course, but the wind still has a say.

    We learned this early. Being “on point (or on dash)” from day one matters for brand voices. What actually makes it work in production? Strong data hygiene, crisp guardrails, agent evaluation, reasoning, and iterating: instructions living in one source of truth (not in five docs and a Slack thread), and evaluation loops that flag drift fast in tone, phrasing, and compliance.

    ✅ We once had a client upload nearly 100 PDF pages on tone, words to avoid, gendering rules, style, you name it. Overkill? Maybe. Effective? Absolutely, because conversations with customers carry the brand every second.

    Will I miss the em-dash? A bit. It became part of the ChatGPT “voice.” 🙂 But consistency beats charm when you represent a brand at scale. Any brand needs to think about what its brand-voice prompt should look like, how to make it as deterministic as it can get, and what can stay “likely magic” ✨

    #conversationalai #aiagents #aiselling
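    The "evaluation loops that flag drift" idea boils down to running deterministic checks over a probabilistic model's output. A minimal sketch, assuming a hypothetical rule set (the no-em-dash, banned-word, and exclamation rules are invented examples, not charles's actual checks):

    ```python
    # Deterministic post-checks over a probabilistic model's output:
    # each named rule returns True if the text complies. Rules here are
    # illustrative stand-ins for a real brand-voice specification.
    import re

    STYLE_RULES = {
        "no_em_dash": lambda text: "\u2014" not in text,
        "no_banned_words": lambda text: not re.search(r"\b(cheap|guys)\b", text, re.I),
        "max_one_exclamation": lambda text: text.count("!") <= 1,
    }

    def flag_drift(model_output: str) -> list[str]:
        """Return the name of every style rule the output violates."""
        return [name for name, ok in STYLE_RULES.items() if not ok(model_output)]
    ```

    In production the flagged output would be regenerated or escalated; the point is that the likelihood-driven model never ships text the deterministic layer has vetoed, which is how "probabilistic" and "on brand" coexist.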

  • Ruth Zive

    4x CMO; Driving Growth at the Intersection of Voice + AI; Passionate about Ethical Tech & Brand Purpose

    18,300 followers

    Training a text-based LLM is NOT the same as training a voice-based model. Data for voice is scarce and the nuance is considerable.

    Recognizing the inevitability of a voice-first interface for apps and interactions, AI companies are scrambling to access data and get their models trained. This is where the world is headed. Lightning fast. The latest voice models are convincing.

    But when all is said and done, what will matter most to brands ISN'T the technology (which is fast becoming commoditized). It's the actual voice BEHIND the technology. Is it distinct? Is it ethically and fairly sourced? Is it yours? Has it been trained responsibly? Is it aligned to your brand identity? Do you have access to the human behind the voice?

    Just like your company office/storefront, letterhead, logo, website, and app have defined your brand identity over the last century, the next brand frontier is your literal voice. And if you're not thinking about that today, you're likely lagging behind your competitors.

  • Toby Coppel

    Co-founder and Partner @ Mosaic Ventures | Startups

    18,269 followers

    Screens are optional; conversation isn’t. Voice agents have finally crossed the line from “nice demo” to mass-scale live production.

    A Fortune 100 health insurer has replaced swaths of its call-centre workforce with an AI agent that listens to symptom descriptions, gauges urgency and benefit details, and steers members to the right in-house nurse or in-network provider. Early results show mis-routed calls collapsing while human nurses concentrate on the most complex cases, evidence that, when trained on medical nuance, automation can still deliver empathy.

    The same capability is trickling down to Main Street. A neighbourhood dental clinic now relies on a 24/7 AI receptionist that fills midnight cancellations, takes deposits and syncs instantly with the practice-management calendar, eliminating the Monday-morning voicemail backlog. Nearby, an auto body shop lets its voice agent quote repairs and capture credit-card details while mechanics sleep, winning leads that used to hang up after three rings.

    Why does this feel inevitable? Voice is simply higher bandwidth than text; tone, pace and sighs carry layers of meaning a text interaction cannot. Studies show people (and agents) read emotion and feel connection more accurately when they hear a voice. As latency drops below half a second and costs reach pennies per minute, talking will again beat typing for many tasks. Only this time, the “person” on the other end might be generated by silicon.

    Now imagine the next step: every brand offers you a personal concierge that remembers the hiking boots you bought last spring, the hotel room you preferred in Tel Aviv or your preference for classical hold music. It greets you by name, picks up the last conversation mid-sentence and suggests dinner before you even think to ask. Conversation becomes the API.

    Optimism doesn’t erase risk. Voice-cloning scams already account for more than 40 percent of fraud attempts in finance, up twenty-fold in three years.
    Protecting both brands and callers will demand a new security layer: real-time likeness checks, rotating pass-phrases, and cryptographic watermarks baked into synthetic speech, so a courtroom (or a phone) can tell the difference between a genuine agent and a deepfake. That challenge is an opening for startups.

    I’m curious: if you’re experimenting with voice, how are you balancing speed, empathy and security? And what surprised you when real customers finally started talking back? Happy to compare notes.
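    The rotating pass-phrase defence could work like a TOTP-style challenge: both parties derive a short-lived code from a shared secret, which a voice clone without the secret cannot produce, however convincing it sounds. A minimal sketch under those assumptions (not a production scheme; the window size and code length are arbitrary):

    ```python
    # TOTP-style rotating code: both ends derive the same short code from
    # a shared secret and the current time window, then one party speaks
    # it and the other verifies. Illustrative only.
    import hashlib
    import hmac
    import time

    def rotating_code(secret: bytes, window_s: int = 30, at: float | None = None) -> str:
        """Derive a 6-character code valid for one time window."""
        step = int((time.time() if at is None else at) // window_s)
        digest = hmac.new(secret, str(step).encode(), hashlib.sha256).hexdigest()
        return digest[:6]  # short enough to say aloud

    def verify(secret: bytes, spoken: str, window_s: int = 30, at: float | None = None) -> bool:
        return hmac.compare_digest(rotating_code(secret, window_s, at), spoken)
    ```

    The security comes from the secret, not the voice: a cloned voice replaying yesterday's call fails because the code has rotated, which is exactly the property a likeness check alone cannot provide.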

  • Vaibhav Goyal

    Agentic AI | Collections | IITM RP Mentor | Educator

    12,693 followers

    Imagine trying to get a workout recommendation while running, navigate a complex route while driving, or get tech support while cooking, all without touching a screen. This is the promise of voice-enabled LLM agents, a technological leap that's redefining how we interact with machines.

    Traditional text-based chatbots are like trying to dance with two left feet. They're clunky, impersonal, and frustratingly limited. Consider these real-world friction points:
    - A visually impaired user struggling to type support queries
    - A fitness enthusiast unable to get real-time guidance mid-workout
    - A busy professional multitasking who can't pause to type a complex question

    Voice AI breaks these barriers, mimicking how humans have communicated for millennia. We start learning to speak in infancy, but writing takes years, a testament to speech's fundamental naturalness.

    Real-World Transformation Examples:
    1️⃣ Healthcare: Emotion-recognizing AI can detect patient stress levels through voice modulation, enabling more empathetic remote consultations.
    2️⃣ Fitness: Hands-free coaching that adapts workout intensity based on your breathing and vocal energy.
    3️⃣ Customer Service: Intelligent voice systems that understand context and emotional undertones, and personalize responses in real time.

    The magic of voice lies in its nuanced communication:
    - Tone reveals emotional landscapes
    - Intensity signals urgency or excitement
    - Rhythm creates conversational flow
    - Inflection adds layers of meaning beyond mere words

    And modern voice-enabled agents can:
    - Recognize emotional states with unprecedented accuracy
    - Support rich, multimodal interactions combining voice, visuals, and context
    - Differentiate speakers in complex conversations
    - Extract subtle contextual intentions
    - Provide personalized responses based on voice characteristics

    In short, this technology is about creating more human-centric technology that listens, understands, and responds like a thoughtful companion.
The future of AI isn't about machines talking at us, but talking with us.
