Pandas was not built like a database. Wes McKinney knew that. He shipped it anyway.

"If I had taken three years longer to get something useful into the market, to do things the right way, it would have been too late."

People needed to read CSV files. They needed basic data wrangling. The fact that pandas was architecturally imperfect did not matter because it solved real problems at the right moment. The "perfect" version (something like Polars) came 10 years later, when the scale actually demanded it.

Shipping early with known limitations created the most-used Python data library in the world.

Check out the full conversation on the Data Renegades Podcast wherever you listen to your favorite podcasts.

#pandasPython #startuplessons #dataengineering #DataRenegades #shipearly
Recce
Data Infrastructure and Analytics
Helping data teams preview, validate, and ship data changes with confidence.
About us
Recce helps modern data teams preview, validate, and ship data changes with confidence. By turning pull requests into structured, context-rich reviews, Recce makes it easy to spot meaningful changes, verify intent and impact, and reduce cognitive load for authors and reviewers alike. Curate reproducible checklists that compare data across environments — so you can catch what matters, skip what doesn’t, and align your team before merging. Accelerate development, cut down on manual QA, and bring visibility, verifiability, and velocity to your data workflows.
- Website
- https://datarecce.io
- Industry
- Data Infrastructure and Analytics
- Company size
- 2-10 employees
- Headquarters
- San Francisco
- Type
- Privately Held
- Specialties
- dbt, Modern Data Stack, code review, Data Engineering, SQL, Data Lineage, Query Diff, Lineage Diff, and Data Model Diff
Locations
- Primary: San Francisco, US
Updates
AI coding tools generate wrong SQL all the time. Not syntax errors. Logic errors where the query runs, the numbers look plausible, and the dashboard updates without complaint. The fix isn't a smarter model. It's giving the model the context it needs, when it needs it.

AI skills are markdown files that encode domain knowledge, workflows, and guardrails into AI coding tools. No framework. No SDK. Structured text in a repo, version-controlled like dbt docs or YAML configs.

The real power is the loop. Code guided by domain rules. Review catches what the code got wrong. Handoff captures the fix into persistent context. Updated skills make the next session smarter. Every cycle compounds.

At Recce, one aggregation bug turned into a permanent rule the system now enforces automatically. Seven columns across three models, fixed once and remembered forever.

Dori Wilson broke down the full framework at March's Data Debug SF, from skill anatomy to the self-improving loop to scaling skills into team-wide plugins. Full writeup on our blog. Link in comments.

#DataEngineering #AnalyticsEngineering #AI #dbt
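What does a skill like that look like on disk? A minimal hypothetical sketch, assuming the post's description (plain markdown in the repo; the file name and rules below are invented for illustration, not Recce's actual skill):

```markdown
<!-- skills/aggregation-review.md (hypothetical example) -->
# Skill: Aggregation Review

## When to apply
Any change to a dbt model that adds or modifies an aggregate column.

## Rules
- Before SUM or COUNT across a join, confirm the join grain:
  one-to-many joins inflate counts unless deduplicated first.
- Prefer COUNT(DISTINCT user_id) when the metric is "users".

## On failure
Record the bug and the fix here so the next session inherits the rule.
```

Because it is just version-controlled text, a review that catches a new bug can append a rule, and every later session starts with that rule already loaded.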
Text-to-SQL works well on benchmarks. It falls apart on real schemas.

On ep 9 of our Data Renegades Podcast, Wes McKinney recalled the CIDR database conference, where Michael Stonebraker presented the Beaver benchmark -- a text-to-SQL eval built on actual institutional database schemas from inside MIT. Frontier models struggled to generate correct SQL, even though MIT is home to many of the researchers who trained those models.

The core problem: LLMs do not understand the subtleties of table relationships. Is this a one-to-one join or one-to-many? Should this metric count distinct users or total sessions? The join semantics and counting logic that determine whether a dashboard shows the right numbers -- these are exactly where models break down.

Wes sees semantic modeling as the answer. The idea: predefine table relationships and metric logic so LLMs generate queries within understood boundaries instead of free-forming SQL across complex schemas. Tools like Malloy (Lloyd Tabb's successor to LookML) are the current best implementation of this approach. Without guardrails? "Otherwise, people are just going to be pointing Claude code at their production databases and blowing their feet off basically."

Benn Stancil (founder of Mode Analytics and Data Renegades guest on ep 5) calls it the "vibe and verify" revolution. Analytics and data engineering resist full vibe coding because there are too many subtleties where models fail silently. Wes agrees. And he sees this as job security for data practitioners who understand their schemas deeply.

Listen wherever you catch your favorite podcasts.

#texttoSQL #AI #dataanalytics #semanticmodeling #dataengineering #DataRenegades
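The join-grain trap described above fits in a few lines of pandas. This is a toy illustration (the table and column names are invented, not from the episode): counting rows after a one-to-many join silently counts sessions, while the intended metric may be distinct users.

```python
import pandas as pd

# Toy data: two users, one of whom has three sessions (hypothetical example)
users = pd.DataFrame({"user_id": [1, 2], "plan": ["free", "pro"]})
sessions = pd.DataFrame({"user_id": [1, 1, 1, 2],
                         "session_id": [101, 102, 103, 104]})

# One-to-many join: each user row fans out to one row per matching session
joined = users.merge(sessions, on="user_id")

rows_after_join = len(joined)                 # 4 -- this counts sessions
distinct_users = joined["user_id"].nunique()  # 2 -- this counts users
```

Both numbers "look plausible" on a dashboard; only schema knowledge (or a semantic model that pins the metric to distinct users) tells you which one is right.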
Can AI coding agents build something as intricate as Apache Arrow? Not yet, says co-creator Wes McKinney on the Data Renegades Podcast.

"Arrow is a project that has the intricacy of a fine Swiss watch. There's a lot of very small details that were created very painstakingly over a long period of time."

Wes McKinney uses AI agents daily. He runs parallel Claude Code sessions and shipped two new open source projects in the last month. But he draws a sharp line at core infrastructure: file formats, processing engines, metadata management. These require deliberation and architectural nuance that current agents lack.

Data infrastructure remains one of AI's hardest frontiers.

Listen wherever you catch your favorite podcasts.

#ApacheArrow #datainfrastructure #AIlimits #dataengineering #DataRenegades
Wes McKinney dropped this thesis on the Data Renegades Podcast: AI has created radical accountability for every software vendor.

Building software just got dramatically cheaper. One engineer with a Claude subscription can prototype a replacement for tools that entire teams used to tolerate. Customers no longer have to accept mediocre products because the cost of leaving has collapsed.

Wes's message to vendors shipping broken tools: "This is bad. Why haven't you fixed this yet? If I was on your engineering team, I would have already fixed this. I would have done it like today with Claude code."

The flip side: a flood of AI-generated projects that only make sense to their creators. Hyper-personalized software with bad taste. The barrier to entry dropped, but the bar for credibility rose. "People are going to decide which companies to pay attention to on the basis of how credible the people are involved."

Full episode linked in comments, or wherever you listen to podcasts.

#AIcodingagent #radicalaccountability #agenticworkflows #dataengineering #DataRenegades
Data Debug brought three practitioners to Mux's office this past Tuesday to answer one question: how do you make AI actually reliable for data work? The talks told a complete story.

Claire Gouze, CEO/Founder at nao Labs (YC X25), benchmarked 21 AI analytics tools on text-to-SQL accuracy. The headline finding: going from no context to a cleaned data model jumped accuracy from 17% to 86%. Semantic layers alone? 4% correct. Context quality is everything.

Our own Dori Wilson shared the AI skills framework she built to operationalize that context. Skills are markdown files that encode domain knowledge, workflows, and guardrails into AI coding tools. Structured as a self-improving loop, every session compounds. She walked through a real aggregation bug Claude introduced, how a review skill caught it, and how the fix became a permanent rule the system enforces automatically.

Kasia Rachuta (Lead Data Scientist) showed the breadth of what's possible today: analyzing CS tickets with Snowflake Cortex AI, fuzzy address matching that beat regex by 20%, automated Slack responses from documentation, and ETL doc generation. The practical filter: knowing when AI saves time versus when it's faster to write the code yourself.

All three full talks are now on YouTube. See them here: https://lnkd.in/g6f_TxSP

Data Debug SF runs monthly. If you're building with AI in data, this is the room to be in.

#DataDebugSF #DataEngineering #AnalyticsEngineering #AI #dbt
The creator of pandas went full-time on an unpaid open source library at 26. No salary. A year of savings. A mouse-infested apartment in the East Village.

"I would just wake up and write Python code. Take a break to eat and go to yoga. Then work until midnight or one o'clock in the morning, pretty much every day, seven days a week."

Wes McKinney built the most-used Python data library in the world on founder hours before the term existed.

Full conversation on Data Renegades. Link in comments.

#pandasPython #opensource #dataengineering #DataRenegades
Recce reposted this
One question we got during our last webinar: "Is the underlying data warehouse Iceberg? What are the options?"

Short answer: yes. Apache Iceberg is the open format handshake between Bauplan and the rest of your stack. Your data never moves — it stays in your storage layer, your existing source of truth. No migration, no lock-in. Turns out that matters quite a bit when AI agents are running dozens of experiments in parallel.

In this webinar, we show exactly how that works end to end — with our friends at Recce:
→ An agent builds a user segmentation pipeline from scratch, in full branch isolation
→ A second agent adds bot detection to that same pipeline
→ Recce's review agent compares the branches, surfaces schema diffs + lineage impact, and generates an auditable merge report

Zero production risk. Full human oversight. Structured workflow.

Trusting AI agents with your data is one of the hardest unsolved problems in data engineering. This is how we're solving it.

Full recording 👇 https://lnkd.in/g4FpT8Nc
Trusting AI with Your Data: Safe Automation from Branch to Production
What's the worst production failure you've seen? Scott Breitenother borrows a comedian's line to reframe the question: "When an escalator breaks, it just becomes stairs. When your data workload fails, it often just results in stale data."

Early in his career, a failed pipeline meant panic. Page the team, drop everything, scramble to fix it. Over time, the real lesson was learning to separate the severity levels. A real error (wrong numbers going to an exec) is fundamentally different from a pipeline that didn't run and left the data three hours old instead of one.

Most data failures fall into the second category. The dashboard is stale. The report is delayed. But the numbers, when they arrive, are correct. Understanding that distinction changes how teams build alerting, handle on-call rotations, and decide what actually deserves a 2am page.

"I think we'll be OK."

Listen to the full Data Renegades Podcast episode with Scott wherever you get your favorite podcasts.

#DataEngineering #DataReliability #Analytics
"The data team can't afford to be two years behind. You need to be using AI to generate the code. You need to get that exoskeleton going now."

Scott Breitenother on Data Renegades, on why the gap between data teams and engineering teams is no longer acceptable.

Data teams have historically trailed engineers by about two years in adopting modern software practices. Git, CI/CD, testing frameworks. Every cycle, data was a step behind. Scott says this time the technology is moving too fast and the cost of falling behind is too high to wait.

The answer is the full-stack data person who owns the entire pipeline and uses AI agents to move at engineering speed. Not five specialized roles. One person, full stack, augmented.

Our own CL Kao asks if every data team should operate this way. Scott doesn't hedge: "Yes. We need to be making full stack data folks that can move faster and use their exoskeletons."

The question isn't whether the shift is coming. It's whether your team is already behind.

Listen to the full episode with Scott wherever you get your favorite podcasts.

#DataEngineering #DataScience #AIAgents #Analytics