Data Quality for AI

Explore top LinkedIn content from expert professionals.

  • View profile for Andreas Horn

    Head of AIOps @ IBM || Speaker | Lecturer | Advisor

    241,674 followers

    𝗜𝗳 𝘆𝗼𝘂 𝘄𝗮𝗻𝘁 𝘁𝗼 𝗯𝘂𝗶𝗹𝗱 𝗮𝗻 𝗔𝗜 𝘀𝘁𝗿𝗮𝘁𝗲𝗴𝘆 𝗳𝗼𝗿 𝘆𝗼𝘂𝗿 𝗰𝗼𝗺𝗽𝗮𝗻𝘆, 𝘆𝗼𝘂 𝗳𝗶𝗿𝘀𝘁 𝗻𝗲𝗲𝗱 𝘁𝗼 𝗯𝘂𝗶𝗹𝗱 𝗮 𝘀𝗼𝗹𝗶𝗱 𝗱𝗮𝘁𝗮 𝗶𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗮𝗻𝗱 𝗲𝗻𝗳𝗼𝗿𝗰𝗲 𝘀𝘁𝗿𝗶𝗰𝘁 𝗱𝗮𝘁𝗮 𝗵𝘆𝗴𝗶𝗲𝗻𝗲. Getting your house in order is the foundation for delivering on any AI ambition. The MIT Technology Review — based on insights from 205 C-level executives and data leaders — lays it out clearly: 𝗠𝗼𝘀𝘁 𝗰𝗼𝗺𝗽𝗮𝗻𝗶𝗲𝘀 𝗱𝗼 𝗻𝗼𝘁 𝗳𝗮𝗰𝗲 𝗮𝗻 𝗔𝗜 𝗽𝗿𝗼𝗯𝗹𝗲𝗺. 𝗧𝗵𝗲𝘆 𝗳𝗮𝗰𝗲 𝗰𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀 𝗶𝗻 𝗱𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆, 𝗶𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲, 𝗮𝗻𝗱 𝗿𝗶𝘀𝗸 𝗺𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁. That is why many firms are still stuck in pilots, not production. Changing that requires strong data foundations, scalable architectures, trusted partners, and a shift in how companies think about creating real value with AI. Pilots are easy, BUT scaling AI across the enterprise is hard.

    𝗛𝗲𝗿𝗲 𝗮𝗿𝗲 𝘁𝗵𝗲 𝗸𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀: ⬇️

    1. 95% 𝗼𝗳 𝗰𝗼𝗺𝗽𝗮𝗻𝗶𝗲𝘀 𝗮𝗿𝗲 𝘂𝘀𝗶𝗻𝗴 𝗔𝗜 — 𝗯𝘂𝘁 76% 𝗮𝗿𝗲 𝘀𝘁𝘂𝗰𝗸 𝗮𝘁 𝗷𝘂𝘀𝘁 1–3 𝘂𝘀𝗲 𝗰𝗮𝘀𝗲𝘀: ➜ The gap between ambition and execution is huge. Scaling AI across the full business will define competitive advantage over the next 24 months.

    2. 𝗗𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆 𝗮𝗻𝗱 𝗹𝗶𝗾𝘂𝗶𝗱𝗶𝘁𝘆 𝗮𝗿𝗲 𝘁𝗵𝗲 𝗿𝗲𝗮𝗹 𝗯𝗼𝘁𝘁𝗹𝗲𝗻𝗲𝗰𝗸𝘀: ➜ Without curated, accessible, and trusted data, no AI strategy can succeed — no matter how powerful the models are.

    3. 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲, 𝘀𝗲𝗰𝘂𝗿𝗶𝘁𝘆, 𝗮𝗻𝗱 𝗽𝗿𝗶𝘃𝗮𝗰𝘆 𝗮𝗿𝗲 𝘀𝗹𝗼𝘄𝗶𝗻𝗴 𝗔𝗜 𝗱𝗲𝗽𝗹𝗼𝘆𝗺𝗲𝗻𝘁 — 𝗮𝗻𝗱 𝘁𝗵𝗮𝘁 𝗶𝘀 𝗮 𝗴𝗼𝗼𝗱 𝘁𝗵𝗶𝗻𝗴: ➜ 98% of executives say they would rather be safe than first. Trust, not speed, will win in the next AI wave.

    4. 𝗦𝗽𝗲𝗰𝗶𝗮𝗹𝗶𝘇𝗲𝗱, 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀-𝘀𝗽𝗲𝗰𝗶𝗳𝗶𝗰 𝗔𝗜 𝘂𝘀𝗲 𝗰𝗮𝘀𝗲𝘀 𝘄𝗶𝗹𝗹 𝗱𝗿𝗶𝘃𝗲 𝘁𝗵𝗲 𝗺𝗼𝘀𝘁 𝘃𝗮𝗹𝘂𝗲: ➜ Generic generative AI (chatbots, text generation) is table stakes. True differentiation will come from custom, domain-specific applications.

    5. 𝗟𝗲𝗴𝗮𝗰𝘆 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 𝗮𝗿𝗲 𝗮 𝗺𝗮𝗷𝗼𝗿 𝗱𝗿𝗮𝗴 𝗼𝗻 𝗔𝗜 𝗮𝗺𝗯𝗶𝘁𝗶𝗼𝗻𝘀: ➜ Firms sitting on fragmented, outdated infrastructure are finding that retrofitting AI into legacy systems is often more costly than building new foundations.

    6. 𝗖𝗼𝘀𝘁 𝗿𝗲𝗮𝗹𝗶𝘁𝗶𝗲𝘀 𝗮𝗿𝗲 𝗵𝗶𝘁𝘁𝗶𝗻𝗴 𝗵𝗮𝗿𝗱: ➜ From GPUs to energy bills, AI is not cheap — and mid-sized companies face the biggest barriers. Smart firms are building realistic ROI models that go beyond hype.

    𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗮 𝗳𝘂𝘁𝘂𝗿𝗲-𝗿𝗲𝗮𝗱𝘆 𝗔𝗜 𝗲𝗻𝘁𝗲𝗿𝗽𝗿𝗶𝘀𝗲 𝗶𝘀𝗻’𝘁 𝗮𝗯𝗼𝘂𝘁 𝗰𝗵𝗮𝘀𝗶𝗻𝗴 𝘁𝗵𝗲 𝗻𝗲𝘅𝘁 𝗺𝗼𝗱𝗲𝗹 𝗿𝗲𝗹𝗲𝗮𝘀𝗲. 𝗜𝘁’𝘀 𝗮𝗯𝗼𝘂𝘁 𝘀𝗼𝗹𝘃𝗶𝗻𝗴 𝘁𝗵𝗲 𝗵𝗮𝗿𝗱 𝗽𝗿𝗼𝗯𝗹𝗲𝗺𝘀 — 𝗱𝗮𝘁𝗮, 𝗶𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲, 𝗴𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲, 𝗮𝗻𝗱 𝗥𝗢𝗜 — 𝘁𝗼𝗱𝗮𝘆.

  • View profile for Santiago Valdarrama

    Computer scientist and writer. I teach hard-core Machine Learning at ml.school.

    121,914 followers

    Machine learning education is broken, especially for those who aspire to start solving real-world problems at a company. Most classes, courses, and books start with a dataset and show you how to train a model:

    dataset → model

    This is, at best, 5% of the work you'll need to do. Real-life problems never start with a "dataset," and they never end after you finish training a model. I've never seen a company with a "dataset" ready to go. In fact, most companies don't even have any data at all. It's your job to determine what data you need and how to collect it. Here is a simplified process that will give you a better idea of how people solve real problems:

    problem → framing → data → model → feedback → repeat

    Until you understand the problem and decide how you'll frame it, you can't start thinking about datasets. A few other challenges:

    1. How do you get data from its source?
    2. Is the data diverse enough to solve the problem?
    3. Do you have enough data?
    4. How is the data biased?
    5. How frequently does the data change?
    6. How sensitive is the data?
    7. Are there missing, inconsistent, or incorrect values?
    8. How noisy is the data?
    9. How can you trace back every piece of data to its source?
    10. Are there any legal restrictions on the use of the data?
    11. How do you scale as data grows?
    12. How quickly does the data become stale?

    Building systems that work requires a lot of effort. I wish more people would talk about this.
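Several of the checks in the list above (missing values, invalid values, diversity, traceability) can be automated even with plain Python. The records, field names, and validity range below are hypothetical; this is a minimal sketch of a data audit, not a production one.

```python
# Toy records standing in for whatever your collection pipeline produces.
records = [
    {"id": 1, "age": 34, "country": "US", "source": "crm_export"},
    {"id": 2, "age": None, "country": "US", "source": "crm_export"},
    {"id": 3, "age": -7, "country": "DE", "source": "web_form"},
]

def audit(rows):
    """Return simple counts for missing, invalid, diversity, and provenance."""
    missing = sum(1 for r in rows if any(v is None for v in r.values()))
    invalid = sum(1 for r in rows
                  if r["age"] is not None and not 0 <= r["age"] <= 120)
    countries = len({r["country"] for r in rows})        # crude diversity proxy
    traceable = sum(1 for r in rows if r.get("source"))  # can we trace each row?
    return {"missing": missing, "invalid": invalid,
            "countries": countries, "traceable": traceable}

print(audit(records))
```

Even a toy audit like this surfaces the questions that matter: one row has a hole, one row has an impossible value, and diversity is thin. The real work is deciding what to do about each finding.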

  • View profile for Jim Fan
    Jim Fan is an Influencer

    NVIDIA Director of AI & Distinguished Scientist. Co-Lead of Project GR00T (Humanoid Robotics) & GEAR Lab. Stanford Ph.D. OpenAI's first intern. Solving Physical AGI, one motor at a time.

    237,768 followers

    Exciting updates on Project GR00T! We discovered a systematic way to scale up robot data, tackling the most painful bottleneck in robotics. The idea is simple: a human collects demonstrations on a real robot, and we multiply that data 1,000x or more in simulation. Let’s break it down:

    1. We use Apple Vision Pro (yes!!) to give the human operator first-person control of the humanoid. Vision Pro parses the human hand pose and retargets the motion to the robot hand, all in real time. From the human’s point of view, they are immersed in another body, like the Avatar. Teleoperation is slow and time-consuming, but we can afford to collect a small amount of data this way.

    2. We use RoboCasa, a generative simulation framework, to multiply the demonstration data by varying the visual appearance and layout of the environment. In Jensen’s keynote video below, the humanoid is now placing the cup in hundreds of kitchens with a huge diversity of textures, furniture, and object placement. We only have 1 physical kitchen at the GEAR Lab in NVIDIA HQ, but we can conjure up infinite ones in simulation.

    3. Finally, we apply MimicGen, a technique to multiply the above data even further by varying the *motion* of the robot. MimicGen generates a vast number of new action trajectories based on the original human data and filters out failed ones (e.g. those that drop the cup) to form a much larger dataset.

    To sum up: given 1 human trajectory with Vision Pro -> RoboCasa produces N (varying visuals) -> MimicGen further augments to NxM (varying motions). This is the way to trade compute for expensive human data via GPU-accelerated simulation. A while ago, I mentioned that teleoperation is fundamentally not scalable, because we are always limited by 24 hrs/robot/day in the world of atoms. Our new GR00T synthetic data pipeline breaks this barrier in the world of bits. Scaling has been so much fun for LLMs, and it's finally our turn to have fun in robotics!

    We are creating tools to enable everyone in the ecosystem to scale up with us:
    - RoboCasa: our generative simulation framework (Yuke Zhu). It's fully open-source! Here you go: http://robocasa.ai
    - MimicGen: our generative action framework (Ajay Mandlekar). The code is open-source for robot arms, but we will have another version for humanoids and 5-finger hands: https://lnkd.in/gsRArQXy
    - We are building a state-of-the-art Apple Vision Pro -> humanoid robot "Avatar" stack. Xiaolong Wang group’s open-source libraries laid the foundation: https://lnkd.in/gUYye7yt
    - Watch Jensen's keynote yesterday. He cannot hide his excitement about Project GR00T and robot foundation models! https://lnkd.in/g3hZteCG

    Finally, the GEAR Lab is hiring! We want the best roboticists in the world to join us on this moon-landing mission to solve physical AGI: https://lnkd.in/gTancpNK
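At the bookkeeping level, the 1 -> N -> NxM multiplication described in the post is simple arithmetic: each human demo is expanded by scene variation, each scene variant by motion variation, and a filter discards failed trajectories. The function and all the numbers below are hypothetical, just to make the leverage concrete.

```python
def synthetic_budget(human_demos, scene_variants, motion_variants, pass_rate):
    """Estimate trajectories from a demo -> scene -> motion pipeline.

    human_demos:     real teleoperated demonstrations (the expensive part)
    scene_variants:  visual/layout variations per demo (RoboCasa-style)
    motion_variants: motion variations per scene variant (MimicGen-style)
    pass_rate:       fraction of generated trajectories surviving the filter
    """
    scenes = human_demos * scene_variants   # 1 -> N
    motions = scenes * motion_variants      # N -> N*M
    return int(motions * pass_rate)         # keep only successful trajectories

# Hypothetical: 10 demos, 100 scenes each, 50 motions each, 70% pass the filter.
print(synthetic_budget(10, 100, 50, 0.7))  # 35000
```

Even with a conservative pass rate after failure filtering, ten teleoperated demos become tens of thousands of training trajectories, which is exactly the compute-for-human-time trade the post describes.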

  • View profile for Armand Ruiz
    Armand Ruiz is an Influencer

    building AI systems @meta

    206,613 followers

    Sure, anybody can call OpenAI APIs to access cutting-edge models, but let’s be real: the true opportunity for businesses isn’t just plugging into those APIs. It’s about leveraging your most unique competitive advantage: your data.

    Data is the foundation of any successful AI system. Yet the journey from raw data to actual value has many challenges:

    1. Not enough data? Your model can’t generalize.
    2. Poor-quality data? Expect poor-quality results.
    3. Nonrepresentative data? Say hello to biased predictions.
    4. Too many irrelevant features? You’re adding noise, not value.
    5. Not enough diversity? Your model won’t be robust.

    Garbage in, garbage out. Even the most advanced model is only as good as the data it learns from. For businesses, the opportunity lies in building data pipelines tailored to their unique context — clean, representative, and enriched with meaningful features. This is how you create AI that’s not just smart, but aligned with your business goals.

    The frontier isn’t just in using AI. It’s in using AI to transform your data into a moat your competitors can’t cross.

  • View profile for Jeff Winter
    Jeff Winter is an Influencer

    Industry 4.0 & Digital Transformation Enthusiast | Business Strategist | Avid Storyteller | Tech Geek | Public Speaker

    172,752 followers

    *𝑆𝑖𝑔ℎ* Yet again, I hear another company excitedly talking about implementing AI—integrating it, scaling it, “revolutionizing everything”—and yet they gloss over the need for a robust data strategy. It takes all my energy not to pull my hair out as I cringe, listening to the words. But instead of yelling into the void, I’ve learned a better approach: I ask questions. Good ones. The kind that make leaders pause and realize that AI without solid data foundations is just a very expensive experiment.

    𝐐𝐮𝐞𝐬𝐭𝐢𝐨𝐧𝐬 𝐥𝐢𝐤𝐞:
    1) What percentage of your data is truly usable—normalized, contextualized, indexed, and properly mapped?
    2) How much of your data is “dark” (produced but unused), and what’s your plan to leverage it?
    3) Do you have a defined data governance and data management framework, or is it mostly ad hoc?
    4) What’s your process for ensuring data accuracy, completeness, and relevance for AI models?
    5) How scalable is your data infrastructure to support AI at an enterprise level?
    6) If AI solutions depend on a continuous flow of clean data, how confident are you that your processes can deliver that over time?

    This is when the lightbulb flickers. Because here’s the reality: you already produce more data than you know what to do with. And yet no one is asking whether your data is reliable, clean, and strategically aligned. Oh, and let’s not forget—you’re probably not even collecting the right strategic data to unlock AI’s full potential.

    AI doesn’t live in isolation. It thrives on organized, high-quality data. Your first step to scaling AI shouldn’t be building models—it should be building a foundation:
    ✅ 𝐃𝐚𝐭𝐚 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞
    ✅ 𝐃𝐚𝐭𝐚 𝐠𝐨𝐯𝐞𝐫𝐧𝐚𝐧𝐜𝐞
    ✅ 𝐃𝐚𝐭𝐚 𝐦𝐚𝐧𝐚𝐠𝐞𝐦𝐞𝐧𝐭
    ✅ And, most importantly, a 𝐝𝐚𝐭𝐚 𝐬𝐭𝐫𝐚𝐭𝐞𝐠𝐲.

    𝐒𝐨 𝐛𝐞𝐟𝐨𝐫𝐞 𝐲𝐨𝐮 𝐝𝐢𝐯𝐞 𝐢𝐧𝐭𝐨 𝐀𝐈, 𝐚𝐬𝐤 𝐲𝐨𝐮𝐫𝐬𝐞𝐥𝐟: “If AI is the engine of innovation, do we even have the fuel to power it?” (Trust me, the answer might surprise you.)

  • View profile for Ryan Law

    Director of Content Marketing at Ahrefs

    34,459 followers

    In the last 3 months at Ahrefs, we analyzed over 1 billion data points across 11 studies*. Here's what we learned about AI search optimization:

    1. YouTube mentions are the single strongest predictor of AI visibility (correlation: 0.737) – stronger than Domain Rating, backlinks, or any traditional SEO factor. YouTube is heavily cited in AI responses, and both Google and OpenAI train on YouTube content.
    2. For a given query, AI Mode and AI Overviews reach the same conclusions 86% of the time – but cite almost entirely different sources (only 13.7% citation overlap). AI Mode responses are 4x longer and mention 3x more entities.
    3. Content length has essentially zero correlation with AI citations (0.04). 53% of all AI Overview citations go to pages under 1,000 words. Writing ultra-long content isn't necessary for AI visibility.
    4. Google still sends 345x more traffic than ChatGPT, Gemini, and Perplexity combined – but ChatGPT accounts for 80%+ of all AI-driven website traffic.
    5. AI Overviews have a 70% chance of changing from one observation to the next, with content lasting an average of just 2.15 days. But semantic meaning stays remarkably consistent (0.95 cosine similarity).
    6. "Best X" blog lists make up 43.8% of all page types cited in ChatGPT responses. 35% of those lists come from low-authority domains.
    7. 79% of blog lists cited by ChatGPT were updated in 2025, and 76% of top-cited pages were refreshed within the last 30 days. Freshness matters more than ever.
    8. When asked questions without valid answers, AI systems choose fabricated content with specific numbers almost every time. ChatGPT resisted best (84% accuracy), but Grok and Copilot were fully manipulated.
    9. Domain Rating correlates weakly with AI visibility (just 0.266–0.326 across platforms). Number of site pages is even weaker at 0.194.
    10. 67% of ChatGPT's top 1,000 citations are essentially off-limits to marketers – Wikipedia alone accounts for 29.7%, followed by homepages (23.8%) and educational content (19.4%).

    *I'll share all the study links in a comment!

  • View profile for Michael Streit

    I help leaders build human–AI organizations that outperform. AI Strategist | Keynote Speaker | Executive Coach

    8,055 followers

    Your AI isn’t hallucinating. It’s just accurately reflecting your messy data.

    "There is no AI without IA." – Seth Earley

    Your Information Architecture (IA) becomes your asset. As Harari said: "𝙄𝙣𝙛𝙤𝙧𝙢𝙖𝙩𝙞𝙤𝙣 𝙞𝙨 𝙩𝙝𝙚 𝙖𝙩𝙩𝙚𝙢𝙥𝙩 𝙩𝙤 𝙧𝙚𝙛𝙡𝙚𝙘𝙩 𝙧𝙚𝙖𝙡𝙞𝙩𝙮, 𝙩𝙝𝙪𝙨 𝙩𝙝𝙚 𝙩𝙧𝙪𝙩𝙝."

    If you want your AI solution or tool to add value to your business (which I think you do), you need to make sure your model understands your business reality. Your data is that reality. Your IA is the foundation.

    Here are my 5 Pillars of Data Governance for making data your strategic asset:

    → 𝟭/ 𝗗𝗮𝘁𝗮 𝗖𝗼𝗹𝗹𝗲𝗰𝘁𝗶𝗼𝗻, 𝗔𝗰𝗾𝘂𝗶𝘀𝗶𝘁𝗶𝗼𝗻 & 𝗥𝗲𝘁𝗶𝗿𝗲𝗺𝗲𝗻𝘁
    𝘏𝘰𝘸 𝘴𝘩𝘰𝘶𝘭𝘥 𝘥𝘢𝘵𝘢 𝘦𝘯𝘵𝘦𝘳 𝘢𝘯𝘥 𝘦𝘹𝘪𝘵 𝘺𝘰𝘶𝘳 𝘰𝘳𝘨𝘢𝘯𝘪𝘻𝘢𝘵𝘪𝘰𝘯?
    - Define legal, ethical, and transparent acquisition channels.
    - Capture consent and regulatory compliance at source.
    - Set clear rules for retention and clean, timely deletion.

    → 𝟮/ 𝗗𝗮𝘁𝗮 𝗦𝘁𝗼𝗿𝗮𝗴𝗲, 𝗢𝗿𝗴𝗮𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻 & 𝗗𝗼𝗰𝘂𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻
    𝘏𝘰𝘸 𝘥𝘰 𝘸𝘦 𝘴𝘵𝘳𝘶𝘤𝘵𝘶𝘳𝘦, 𝘴𝘵𝘢𝘯𝘥𝘢𝘳𝘥𝘪𝘻𝘦, 𝘢𝘯𝘥 𝘶𝘴𝘦 𝘥𝘢𝘵𝘢 𝘦𝘧𝘧𝘦𝘤𝘵𝘪𝘷𝘦𝘭𝘺?
    - Build a data strategy that handles volume, velocity, and variety.
    - Ensure data marts are business-ready, FAIR, and MECE.
    - Centralize business rules, logic, and KPIs as a single source of truth (SSoT).

    → 𝟯/ 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆, 𝗢𝘄𝗻𝗲𝗿𝘀𝗵𝗶𝗽 & 𝗦𝘁𝗲𝘄𝗮𝗿𝗱𝘀𝗵𝗶𝗽
    𝘏𝘰𝘸 𝘥𝘰 𝘸𝘦 𝘦𝘯𝘴𝘶𝘳𝘦 𝘵𝘳𝘶𝘴𝘵 𝘢𝘯𝘥 𝘢𝘤𝘤𝘰𝘶𝘯𝘵𝘢𝘣𝘪𝘭𝘪𝘵𝘺?
    - Monitor data accuracy, completeness, and consistency.
    - Assign clear ownership and stewardship roles.
    - Establish accountability through data KPIs.

    → 𝟰/ 𝗗𝗮𝘁𝗮 𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆, 𝗔𝗰𝗰𝗲𝘀𝘀 & 𝗣𝗿𝗶𝘃𝗮𝗰𝘆
    𝘏𝘰𝘸 𝘥𝘰 𝘸𝘦 𝘱𝘳𝘰𝘵𝘦𝘤𝘵 𝘰𝘶𝘳 𝘥𝘢𝘵𝘢 𝘢𝘯𝘥 𝘴𝘩𝘢𝘳𝘦 𝘪𝘵 𝘳𝘦𝘴𝘱𝘰𝘯𝘴𝘪𝘣𝘭𝘺?
    - Grant access on a “right people, right data, right time” basis.
    - Apply anonymization and role-based access control.
    - Stay compliant (GDPR, HIPAA) and conduct audits.

    → 𝟱/ 𝗗𝗮𝘁𝗮 𝗨𝘀𝗮𝗴𝗲, 𝗘𝘁𝗵𝗶𝗰𝘀 & 𝗖𝗼𝗺𝗽𝗹𝗶𝗮𝗻𝗰𝗲
    𝘏𝘰𝘸 𝘥𝘰 𝘸𝘦 𝘢𝘱𝘱𝘭𝘺 𝘥𝘢𝘵𝘢 𝘪𝘯 𝘱𝘳𝘢𝘤𝘵𝘪𝘤𝘦?
    - Set clear AI ethics rules, and monitor bias and fairness.
    - Align with internal policies, laws, and social expectations.
    - Track data lineage and usage logs for transparency.

    On a scale of 1 to 10, what priority does Data Governance currently have in your company?
    1–3: Data what?
    4–7: We're trying, but it's messy.
    8–10: It's a strategic pillar.

  • View profile for Tony Seale

    The Knowledge Graph Guy

    40,908 followers

    The AI Wave Finally Hit - and AI Data Readiness will be next.

    Over the past few weeks, something has shifted. OpenClaw has been making waves in the open-source world. Claude Cowork's plugins landed with a thud, particularly the automated contract-analysis tool. In days, hundreds of billions in market value were wiped off established software and IT services stocks. Markets don't reprice like that over a single product. They reprice when a deeper assumption breaks.

    🔵 The Threshold

    Those of us working closely with systems like Claude Code have seen this coming - especially since Opus-class models made agentic workflows viable. But this feels like the moment the conversation crossed a threshold. What was niche is now mainstream. Here is a prediction: people will soon wake up to the importance of how these systems are grounded. Frameworks like OpenClaw are powerful, but they rely on emergent behaviour over loosely structured context. An ontology-backed data structure gives you something tighter: clearer constraints, more predictable reasoning, and far less ambiguity about what the system is allowed to conclude. That difference shows up as reliability, and it will become impossible to ignore once people start to engage seriously.

    🔵 A Personal Resonance

    For years, my argument was simple: AI is coming, and organisations need to get their data ready. It was never about chasing the latest technology. It was about recognising that once AI arrived, the limiting factor would not be the models - it would be the data. What I didn't anticipate is how unprepared I would be when that moment truly arrived. Last week, as I fully "wire-headed" into one of our internal agents - with direct access to our ontology and knowledge graph - something crystallised. The speed with which organisational context became usable, the way complex structures turned into leverage, was both exhilarating and unsettling. We are not psychologically prepared for what this is going to feel like. You can see it in the slightly manic look in the eyes of those who have already wire-headed.

    🔵 The Principles That Still Hold

    As things get increasingly volatile, it's worth returning to core principles I've been repeating for a decade. First, focus on your data. AI is like an iceberg: what you see above the surface gets the attention, but what matters is what sits underneath. Second, "getting your data ready" means two things: linking it together richly, and organising it semantically. Without that, AI systems either underperform or produce confident nonsense. Finally, stick to open standards. They are the only reliable way to maintain flexibility as tools, vendors, and architectures change faster than organisations can react.

    The recent market reaction wasn't panic over a single tool. It was a delayed recognition of a reality building for years. AI didn't arrive overnight. But now that it's here, the cost of not being data-ready will become visible - all at once.

  • View profile for Pooja Jain

    Open to collaboration | Storyteller | Lead Data Engineer@Wavicle| Linkedin Top Voice 2025,2024 | Linkedin Learning Instructor | 2xGCP & AWS Certified | LICAP’2022

    194,209 followers

    When a dashboard crashes, the finger-pointing starts. Is it the engineer? The analyst? The steward?

    Think of data governance like building a bridge. One engineer can design brilliant steel beams, but if the foundation team uses weak concrete and the inspection team skips safety checks—the bridge collapses. You can't blame the steel.

    🎬 𝗗𝗮𝘁𝗮 𝗚𝗼𝘃𝗲𝗿𝗻𝗮𝗻𝗰𝗲 = 𝗔 𝗦𝘆𝗺𝗽𝗵𝗼𝗻𝘆, 𝗡𝗼𝘁 𝗮 𝗦𝗼𝗹𝗼
    🔧 Engineers: Build the pipelines (the stage)
    📊 Analysts: Define the metrics (the script)
    🔮 Scientists: Extract insights (direct the plot)
    📜 Stewards: Own data quality (manage backstage)
    📈 Business: Drive decisions (deliver the finale)
    Miss one cue? The entire show derails.

    💥 What Happens Without Governance?
    ❌ Wrong data flows into dashboards → Bad decisions made with confidence
    ❌ Silos form between teams → Duplicate work, conflicting sources
    ❌ Finger-pointing replaces fixing → Problems fester, trust erodes
    ❌ Reactive patches, not root fixes → Same fires, different day

    The Damage: When Governance Fails
    → $12.9M lost per organization annually (per Gartner's research): wasted spend, bad decisions, endless rework, missed opportunities.
    → 15–25% revenue leakage: decisions made on incomplete data, inconsistent sources, duplicate records.
    → $4.5M average breach cost (2025); in the U.S. and U.K., often $9–10M per incident. Security isn't optional anymore.
    → $3.1T drained from the U.S. economy: failed initiatives, wasted effort, poor quality compounding across industries.

    𝘕𝘰𝘵 𝘢 𝘵𝘦𝘤𝘩 𝘱𝘳𝘰𝘣𝘭𝘦𝘮. 𝘈 𝘵𝘦𝘢𝘮𝘸𝘰𝘳𝘬 𝘱𝘳𝘰𝘣𝘭𝘦𝘮.

    💡 The Fix: Build Systems, Not Silos
    ✅ Automate quality checks → Catch issues before analysts do
    ✅ Track lineage & metadata → Know where data comes from, where it goes
    ✅ Design for observability → Monitor pipelines like you monitor apps
    ✅ Embed compliance early → Privacy isn't a checkbox—it's architecture
    ✅ Break down role barriers → Engineers, analysts, stewards—one team

    🎯 The Bottom Line
    Governance isn't bureaucracy. It's the blueprint for data systems that don't crumble under pressure.
𝘎𝘳𝘦𝘢𝘵 𝘥𝘢𝘵𝘢 𝘪𝘴𝘯'𝘵 𝘣𝘶𝘪𝘭𝘵 𝘪𝘯 𝘪𝘴𝘰𝘭𝘢𝘵𝘪𝘰𝘯—𝘪𝘵'𝘴 𝘰𝘳𝘤𝘩𝘦𝘴𝘵𝘳𝘢𝘵𝘦𝘥 𝘵𝘰𝘨𝘦𝘵𝘩𝘦𝘳. What's your biggest data governance challenge? Drop it below. 👇
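"Automate quality checks → catch issues before analysts do" can start as small as a gate function that refuses to publish a dashboard dataset when checks fail. The field names and the 1% null-rate threshold below are hypothetical, a sketch of the pattern rather than any particular tool.

```python
def quality_gate(rows, required=("order_id", "amount"), max_null_rate=0.01):
    """Return (ok, report); the pipeline publishes only when ok is True."""
    report = {}
    for field in required:
        nulls = sum(1 for r in rows if r.get(field) is None)
        report[field] = nulls / len(rows)  # null rate per required field
    ok = all(rate <= max_null_rate for rate in report.values())
    return ok, report

# Toy batches: one clean, one with 5% missing amounts.
good = [{"order_id": i, "amount": 10.0} for i in range(100)]
bad = good[:95] + [{"order_id": i, "amount": None} for i in range(95, 100)]

print(quality_gate(good)[0])  # True  -> publish
print(quality_gate(bad)[0])   # False -> block and alert, instead of finger-pointing later
```

The design point is where the check runs: before the dashboard, as part of the pipeline, so a failure becomes an engineering alert rather than a business-user discovery.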

  • View profile for Llewyn Paine, Ph.D.

    📊 Outcomes over output: Validated AI research guidance for product leaders | Training workshops | Speaking | Consulting

    2,961 followers

    I invited 31 researchers to test AI research synthesis by running the exact same prompt. They learned LLM analysis is overhyped, but evaluating it is something you can do yourself.

    Last month I ran an #AI for #userresearch workshop with Rosenfeld Media. Our first cohort was full of smart, thoughtful researchers (if you participated in the workshop, I hope you'll tag yourself and weigh in in the comments!).

    A major limitation of a lot of AI-for-UXR "thought leadership" right now is that too much of it is anecdotal: researchers run datasets a few times through a commercial tool and decide whether or not the output is good enough based on only a handful of results. But for nondeterministic systems like generative AI, repeated testing under controlled conditions is the only way to know how well they actually work. So that's what we did in the workshop. Our participants produced a lot of interesting findings about qualitative research synthesis with AI:

    1️⃣ LLMs can produce vastly different output even with the exact same prompt and data. The number of themes alone ranged from 5 to 18, with a median of about 10.5.

    2️⃣ Our AI-generated themes mapped pretty well to human-generated themes, but there were some notable differences. This led to a discussion of whether mapping to human themes is even the right metric for evaluating AI synthesis (how are we evaluating whether the human-generated themes were right in the first place?).

    3️⃣ The bigger concern for the researchers in the workshop was the lack of supporting evidence for themes. The supporting quotes the LLM provided looked okay superficially, but on closer investigation *every single participant* found examples of data being misquoted or entirely fabricated. One person commented that validating the output was ultimately more work than performing the analysis themselves.
Now, I want to acknowledge that this is one dataset, one prompt (although, a carefully vetted one, written by an industry expert), and one model (GPT 4o 2024-11-20). Some researchers claim that GPT 4o is worse for research hallucinations–and perhaps it is–but it is still a heavily utilized model in current off-the-shelf AI research tools (and if you’re using off-the-shelf tools, you won’t always know which models they’re using unless you read a whole lot of fine print). But the point is–I think this is exactly the level at which we should be scrutinizing the output of *all* LLMs in research. AI absolutely has its place in the modern researcher’s toolkit. But until we systematically evaluate its strengths and weaknesses, we're rolling the dice every time we use it. We'll be running a second round of my workshop in June as part of Rosenfeld Media’s Designing with AI conference (ticket prices go up tomorrow; register with code PAINE-DWAI2025 for a discount). Or, to hear about other upcoming workshops and events from me, sign up for my mailing list (links below).
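The workshop's core method, running the exact same prompt many times and summarizing the spread of outputs, needs no special tooling. In this sketch, `call_model` is a hypothetical stub standing in for a real, nondeterministic LLM client; its fake theme counts exist only so the measurement code has something to summarize.

```python
from statistics import median

def call_model(prompt, seed):
    """Hypothetical stand-in for an LLM call. Real runs are nondeterministic;
    here we fabricate a varying theme count to demonstrate the measurement."""
    return {"themes": [f"theme-{i}" for i in range(5 + (seed * 5) % 14)]}

def measure_spread(prompt, runs=31):
    """Run the same prompt `runs` times and summarize output variability."""
    counts = [len(call_model(prompt, s)["themes"]) for s in range(runs)]
    return min(counts), median(counts), max(counts)

lo, mid, hi = measure_spread("Synthesize the themes in these interview notes.")
print(lo, mid, hi)
```

With a real client you would replace `call_model` with actual API calls (and pay per run); the point is that the min/median/max over dozens of runs, not a single output, is the unit of evaluation for a nondeterministic system.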
