We opened our London office this month to be closer to our experts and customers who are advancing the frontier of AI. We're currently hiring Strategic Project Leads to join our London-based team. Apply at the link in the comments.
Mercor
Software Development
San Francisco, California · 694,862 followers
Defining the future of work
About us
Mercor is defining the future of work. We connect human expertise with leading AI labs and enterprises to train frontier models.
- Website
- mercor.com
- Industry
- Software Development
- Company size
- 51-200 employees
- Headquarters
- San Francisco, California
- Type
- Privately Held
- Founded
- 2023
Locations
- Primary
- San Francisco, California 94105, US
Updates
-
Kimi K2.6 from Moonshot AI scores 27.9% at pass@1 on APEX-Agents-AA from Artificial Analysis. The score is evaluated on 452 of the 480 public tasks from our benchmark for long-horizon professional work in investment banking, management consulting, and corporate law. K2.6 (27.9%) is a substantial improvement over K2.5 (11.5%), putting it within about five points of GPT-5.4 (xhigh) and Claude Opus 4.6 (Max) on professional services work.
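For readers new to the metric, here is a minimal sketch of how a pass@1 mean over benchmark tasks could be computed; the task names and results are hypothetical, not actual APEX-Agents-AA data.

```python
# Minimal sketch of a pass@1 mean over benchmark tasks (hypothetical data,
# not actual APEX-Agents-AA results). With one attempt per task, pass@1 per
# task is simply pass/fail; the benchmark score is the mean across tasks.

def pass_at_1_mean(results: dict[str, bool]) -> float:
    """Fraction of tasks whose single attempt passed."""
    return sum(results.values()) / len(results)

# Three hypothetical long-horizon tasks, one solved:
results = {
    "ib_valuation_memo": True,
    "consulting_market_entry": False,
    "corp_law_diligence": False,
}
print(f"pass@1: {pass_at_1_mean(results):.1%}")  # pass@1: 33.3%
```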
-
Anthropic's Claude Opus 4.7 (Max) is only the second model ever to cross a 50% mean score on APEX-Agents, our benchmark for complex, long-horizon professional work in investment banking, corporate law, and management consulting. GPT-5.4 was first; Opus 4.7 is second. It places 3rd overall on the leaderboard at 33.9% pass@1, and tops the investment banking leaderboard at 37.2%, beating out GPT-5.2 (xhigh). The most interesting finding is that Opus 4.7 thinks harder than its predecessor, and that comes at a token cost: roughly 2x Opus 4.6 at the same effort level. Check out the latest leaderboard at the link in the comments.
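One way to read the two numbers side by side: if mean score averages partial-credit rubric scores while pass@1 counts only fully passed tasks (an assumption on our part, not a detail confirmed in the post), the former can sit well above the latter. A sketch with made-up scores:

```python
# Hypothetical illustration of why a mean rubric score can cross 50% while
# pass@1 stays near 34%. Assumes (not confirmed above) that mean score
# averages partial-credit rubric scores and pass@1 requires a full pass.

rubric_scores = [1.0, 0.8, 0.6, 0.35, 0.3]  # per-task scores in [0, 1], made up

mean_score = sum(rubric_scores) / len(rubric_scores)                    # 0.61
pass_rate = sum(s == 1.0 for s in rubric_scores) / len(rubric_scores)   # 0.20

print(f"mean score: {mean_score:.0%}, pass@1: {pass_rate:.0%}")
# mean score: 61%, pass@1: 20%
```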
-
Ayushi spent years building at the intersection of AI and healthcare, most recently as the founder of a healthcare AI startup. She knew what it felt like to search for product-market fit from the inside, and what it cost when you didn't find it. When she started thinking about what came next, she was deliberate: she wanted colleagues who understood founder life without her having to explain it. About 30% of people at Mercor are former founders. As she puts it: "After years of trying to build something from nothing, there is a specific energy in joining a team that's already sprinting and finding out you can keep pace." At Mercor, she's working on problems that only exist at scale, helping build the infrastructure that connects human expertise to AI advancement. Read Ayushi's story at the link in the comments.
-
We are excited to announce our collaboration with Artificial Analysis on APEX-Agents-AA — an independent, live leaderboard evaluating AI agents on the professional tasks that knowledge workers do every day. The leaderboard is built on APEX-Agents, Mercor's open-source benchmark of 480 tasks across investment banking, management consulting, and corporate law — including tool implementations, rubrics, and grading workflows, all available to the community for evaluation and training. Artificial Analysis runs a subset of these tasks through their open-source Stirrup harness, providing a reproducible, independent baseline that any team can verify and build on.
APEX-Agents-AA results:
🥇 GPT-5.4: 33.3%
🥈 Claude Opus 4.6: 33.0%
🥉 Gemini 3.1 Pro Preview: 32.0%
The top three frontier models are separated by just 1.3 percentage points. The leaderboard will update with key model releases. Check it out at the link in the comments.
-
The privacy and security of our customers and contractors are foundational to everything we do at Mercor. We recently identified that we were one of thousands of companies impacted by a supply chain attack involving LiteLLM. Our security team moved promptly to contain and remediate the incident. We are conducting a thorough investigation supported by leading third-party forensics experts. We will continue to communicate with our customers and contractors directly as appropriate and devote the resources necessary to resolving the matter as soon as possible.
-
Does Training on the APEX-Agents Dev Set Generalize Beyond the Benchmark? Applied Compute post-trained GLM-4.7 on ~2,000 expert Mercor tasks and achieved state-of-the-art legal performance on APEX-Agents. We then evaluated that model, AC-Small, on benchmarks outside its training distribution. On GDPVal, AC-Small's win+tie rate rose from 55.0% to 62.7% (+7.7pp), placing it 5th overall and ahead of Opus 4.5. To understand where the gain came from, we ran two ablations. On Toolathlon, AC-Small improved by roughly 8 points, from 26.5% to 34.6%. On APEX, which removes tool use and agent loops, AC-Small moved up seven spots, beating Opus 4.5, Sonnet 4.5, and Grok 4. The biggest surprise was medicine: AC-Small placed 4th at 64.8%, ahead of GPT-5.4, Gemini 3.1 Pro, and o3, despite zero medical tasks in training. The gains appear to come from stronger procedural discipline: preserving sub-details, checking intermediate outputs, and catching logical errors. Read more at the links in the comments.
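The percentage-point deltas above are easy to sanity-check; a tiny sketch, where the helper name is ours and the before/after scores are the ones quoted in the post:

```python
# Sanity-check of the percentage-point gains quoted above; pp_delta is an
# illustrative helper, and the scores are those reported in the post.

def pp_delta(before: float, after: float) -> float:
    """Improvement in percentage points between two scores."""
    return round(after - before, 1)

print(pp_delta(55.0, 62.7))  # GDPVal win+tie rate: +7.7pp
print(pp_delta(26.5, 34.6))  # Toolathlon: roughly +8pp
```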
-
"The most important problem in the world is what we do all day for work and how the knowledge work economy operates." - Brendan Foody, at Upfront Ventures Summit. Brendan sat down with Sundeep Peechu of Felicis to talk about the future of work, what's blocking enterprise AI, and why humans become more valuable as AI advances. Watch the full video at the link in the comments.
-
Mercor reposted this
Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with Cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems that work and debug them when they don't.
APEX-SWE Leaderboard | Pass@1
🥇 OpenAI GPT-5.3 Codex (High): 41.5%
🥈 Anthropic Opus 4.6 (High): 40.5%
🥉 Anthropic Opus 4.5 (High): 38.7%
Every frontier model fails on nearly 60% of real production tasks.
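The "nearly 60%" framing follows directly from the pass@1 scores; a quick check, with model labels copied from the leaderboard above:

```python
# Quick arithmetic behind "every frontier model fails on nearly 60% of real
# production tasks": failure rate = 100% - pass@1, using the scores above.

leaderboard = {
    "GPT-5.3 Codex (High)": 41.5,
    "Opus 4.6 (High)": 40.5,
    "Opus 4.5 (High)": 38.7,
}
for model, p1 in leaderboard.items():
    print(f"{model}: fails {100 - p1:.1f}% of tasks")
# Even the top model fails 58.5% of tasks.
```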