<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Benchling Engineering - Medium]]></title>
        <description><![CDATA[The official blog of the Benchling engineering team. - Medium]]></description>
        <link>https://benchling.engineering?source=rss----3d4aa8fb07ea---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Benchling Engineering - Medium</title>
            <link>https://benchling.engineering?source=rss----3d4aa8fb07ea---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 17 Apr 2026 08:42:39 GMT</lastBuildDate>
        <atom:link href="https://benchling.engineering/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Fragmentation to framework: Spec-first development at Benchling]]></title>
            <link>https://benchling.engineering/fragmentation-to-framework-spec-first-development-at-benchling-9b97302bddcf?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/9b97302bddcf</guid>
            <category><![CDATA[platform-engineering]]></category>
            <category><![CDATA[biotechnology]]></category>
            <category><![CDATA[benchling]]></category>
            <dc:creator><![CDATA[Eli Levine]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 17:29:16 GMT</pubDate>
            <atom:updated>2026-02-19T17:29:14.847Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y8ZsBmhYb7gPTvkFdbCOmw.png" /></figure><h3>Reaching the limit of manual platform development</h3><p>Benchling’s platform handles diverse scientific data, such as DNA sequences, antibodies, notebook entries, inventory containers, workflow runs, and dozens more. Each object type carries unique domain logic: how it is validated, what relationships it holds, and what actions users can perform on it.</p><p>As Benchling matured, capabilities were added that customers expected to work across all these objects, including REST APIs for integration, a data warehouse for analytics, search indexing, and configuration migration tools for moving setups between tenants, among many others.</p><p>Each product team is expected to expose its data in all platform surface areas. However, because this process is manual, it is also brittle and costly. With M object types and N platform capabilities, where each object requires custom integration with each capability, you’re maintaining M×N integration points. Add a new object? You’ll need to integrate it with every platform capability. Add a new capability? You’ll need to integrate it with every object.</p><p>In practice, this meant product and tech debt: some objects were available via API but missing from the warehouse, or a feature was exposed in the UI but not in other platform surface areas. It also meant behavioral drift. The same object would have slightly different field names or validation logic depending on which surface you accessed it through.</p><p>As Benchling grew, so did its customers. Enterprise customers expect platforms designed for multi-modal integration covering the full spectrum of Benchling’s data and functionality.</p><p>AI is quickly reshaping how knowledge work is done across all industries. But some fundamentals have not shifted. 
The same integration capabilities that make enterprise architectures more powerful are what make agents more powerful too: both require data access and interoperability.</p><p>Thus a different approach was needed: ideally, one where the cost of adding new types of domain-specific scientific data and functionality, or new platform capabilities, did not grow multiplicatively. And one where platform functionality could be built quickly and uniformly to unlock the power of enterprise and AI for our customers.</p><h3>Decoupling apps from platform</h3><p>Most platform capabilities were asking the same questions about every object. What is the object’s data model: what is it called, what fields does it have, and what are its relationships to other objects? How do I find and read one? How do I create or update one?</p><p>These questions were being answered separately for each object &lt;&gt; platform capability pair. API shapes were defined via OpenAPI specs. Warehouse mappers (which internally need to find and read objects) defined warehouse table shapes. The search team maintained its own index configurations (which also needed to find and read objects). Each team solved the same fundamental problems, such as implementing the shape and behavior of domain objects, in isolation.</p><p>The alternative is the spec-first approach: define each object once, in a way that any platform capability can consume. Instead of objects integrating with platform capabilities, objects declare their shape and behavior through a unified contract. This enables platform capabilities to read from that schema and operate generically across all objects that conform to it. The integration point moves from per-capability implementations to a single shared one.</p><p>This reframes the work for both sides. Object owners focus on defining their objects correctly and implementing domain-specific logic. They don’t write API endpoints or warehouse mappers. 
Platform teams focus on making these capabilities robust and performant.</p><p>We call our implementation of this approach the Object Framework.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*afKvi_JLS6kb-jHLycxi4Q.png" /></figure><h3>Introducing the Object Framework</h3><p>Benchling’s core service follows a fairly classic three-tier architecture. Data is stored in a relational database. SQLAlchemy models represent persistent data. Business logic related to various product areas resides in a largely monolithic codebase. The code is well-structured internally but lacks an overall organization that would let higher levels of the stack automatically “work” with all types of Benchling data and functionality, and consistently expose it externally to customers through various platform touch points, such as APIs.</p><p>The Object Framework does not replace the core architecture. Instead, it provides a wrapper layer that gives it enough structure so that Benchling’s product functionality can be leveraged uniformly. The framework consists of three major components, outlined below.</p><p><strong>Domain Graph</strong> is the schema layer. It is a unified type system that serves as the single source of truth for all object definitions. We use <a href="https://www.apollographql.com/tutorials/lift-off-part1/03-schema-definition-language-sdl">GraphQL Schema Definition Language</a> (SDL) as the specification language, chosen because it is declarative, has good support for relationships between types, and allows for metadata extensibility via directives. This is where every object in Benchling is defined. The Domain Graph declares field names, types, and relationships, and allows types to specify various kinds of metadata, such as stability levels and feature flags. This controls the object’s shape and declarative aspects of its behavior. 
It also allows object owners to declaratively control the behavior of platform capabilities with SDL directives.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zBMV952AIwPlJNzLu15iRA.png" /><figcaption>Example of an object defined in the domain graph</figcaption></figure><p><strong>Data Connectors</strong> are the public service interface of objects. They encapsulate domain-specific business logic behind a standardized and dependency-free interface, abstracting away an object’s internal implementation details. A connector is written in Python and implements standard CRUD operations, such as get, list, create, update, and archive, as well as other domain-specific methods that make up its public interface. When another domain or a platform capability needs to interact with an object, it calls the connector. Connectors are registered and retrieved by domain object type, which avoids direct coupling.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SUMgkOwWHr6ao7deEOq9zA.png" /><figcaption>Example of a registered data connector for a domain object</figcaption></figure><p><strong>Data Mappers</strong> are the persistence layer. They translate between storage representations and domain objects. This abstraction isolates business logic from database details.</p><h3>Impact across the stack</h3><p>For platform teams, the framework means building a capability once and having it work everywhere. The new <a href="https://benchling.com/api/v3-alpha/reference#/Benchling.AaSequence/Benchling.AaSequence.BulkCreate">Bulk Import API</a> is a good example: a platform capability that exposes a file-based API for ingesting large amounts of Benchling object data. Each data connector implements standard methods for creating and updating batches of objects. 
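</p><p>To make the registration-and-retrieval pattern concrete, here is a minimal, hypothetical Python sketch of a connector registry with a default batch method. All names here are illustrative assumptions, not Benchling’s actual code:</p>

```python
from typing import Any

# Hypothetical registry mapping a domain object type name to its connector,
# so callers never import a domain's internal modules directly.
_REGISTRY: dict[str, "DataConnector"] = {}


def register_connector(object_type: str, connector: "DataConnector") -> None:
    _REGISTRY[object_type] = connector


def get_connector(object_type: str) -> "DataConnector":
    return _REGISTRY[object_type]


class DataConnector:
    """Standardized, dependency-free interface a domain object exposes."""

    def get(self, obj_id: str) -> dict[str, Any]:
        raise NotImplementedError

    def create(self, fields: dict[str, Any]) -> str:
        raise NotImplementedError

    def bulk_create(self, rows: list[dict[str, Any]]) -> list[str]:
        # Default batch method: a generic capability (e.g. bulk import)
        # can call this for any registered object type.
        return [self.create(row) for row in rows]


class PlasmidConnector(DataConnector):
    """Toy in-memory implementation for one hypothetical object type."""

    def __init__(self) -> None:
        self._store: dict[str, dict[str, Any]] = {}
        self._count = 0

    def get(self, obj_id: str) -> dict[str, Any]:
        return self._store[obj_id]

    def create(self, fields: dict[str, Any]) -> str:
        self._count += 1
        obj_id = f"plasmid_{self._count}"
        self._store[obj_id] = dict(fields)
        return obj_id


register_connector("Plasmid", PlasmidConnector())

# A platform capability interacts only through the registry:
ids = get_connector("Plasmid").bulk_create([{"name": "pUC19"}, {"name": "pBR322"}])
```

<p>The point of the sketch is the indirection: a capability written against the connector interface works for every object type that registers, without any object-specific code.</p><p>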
Bulk Import API builds on top of this to create a higher-level API that accepts large amounts of data and orchestrates calls to data connectors.</p><p>Significant complexity goes into building a robust and scalable system. Bulk Import API endpoints automatically chunk large payloads, distribute work across our job infrastructure, and handle per-record results and errors. The team that built it wrote zero object-specific code, allowing it to focus on the uniquely challenging features of this horizontal capability, not object coverage. When a domain team onboards a new object to the framework, bulk import just works for that object, with no additional integration required.</p><p>The same pattern holds for all other platform capabilities built on the framework.</p><p>For domain teams, the framework means focusing on what they know best. The engineers who understand the intricacies of notebook entries or molecular biology workflows spend their time on domain logic and user experience, not on wiring up API endpoints or warehouse mappers. Define the object in the Domain Graph, implement the connector interface, and the platform capabilities follow.</p><p>For customers, the impact is consistency and coverage. Field names are the same whether you’re querying the API or looking at the warehouse. Validation logic behaves identically across surfaces. If an object exists in Benchling, it’s accessible and consistent across the whole platform.</p><p>Several platform capabilities have shipped on top of the framework: the V3 REST API (in beta), the V3 Bulk Import and Export APIs (in beta), V3 Events (in beta), and the V3 GraphQL API (internal), with more on the way.</p><h3>Lessons learned</h3><p><strong>Developer experience is key.</strong> Earlier initiatives at Benchling aimed at similar goals but required teams to adopt rigid patterns that didn’t easily adapt to their domains’ needs. 
The Object Framework has been more successful because it focuses on interfaces over implementation: define your object in the Domain Graph and implement the connector contract. How the code is organized internally is up to domain teams. This, of course, is not black-and-white. More standardization is better in some cases. But when you are adopting a framework on top of existing systems, especially those that handle varied life science domains, you must be mindful of the flexibility that domain teams need to retain.</p><p><strong>A framework only delivers value when teams actually use it.</strong> To facilitate adoption, we have invested in an explicit cross-team program: monthly adoption goals, tracked coverage across domain teams, and coordination between platform engineers providing support and domain engineers doing the migration work. Dashboards show which objects are in the framework, at what stability level, and what’s blocking the rest.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CB7Qu_kHjKlS0_jUYkJL-w.png" /><figcaption>Example of internal tooling built for tracking adoption</figcaption></figure><p><strong>Stability and flexibility are in tension.</strong> The Benchling platform defines <a href="https://docs.benchling.com/docs/stability">multiple levels of stability</a>. Teams feel the pull in both directions: stay at Alpha too long, and internal consumers won’t depend on your object and customers can’t confidently build on it. Promote too early, and you’re locked into field names and behaviors you’ll want to change. While this is by design, we learned that stability lock-in is an acute concern for object owners.</p><p><strong>The codebase becomes self-documenting.</strong> Language models excel when they have clear structure, such as well-defined interfaces, consistent patterns, and explicit contracts. The Domain Graph is both human-readable and machine-parseable by design. 
When an engineer or an AI assistant needs to access data from another domain, the answer is always “call the connector.” The more objects that conform to the framework, the more examples exist for a model or a human to learn from.</p><h3>Building toward a unified platform</h3><p>The Object Framework represents a maturation in how Benchling builds applications. The project isn’t finished. We think of it as a flywheel, where each object added to the framework increases the value of building new platform capabilities on top of it, and each capability added increases the value of onboarding the next object.</p><p>For customers, we hope this means a platform that gets more consistent and complete over time. One that delivers value and innovation faster.</p><h3>Acknowledgements</h3><p>The Object Framework represents work by engineers across Benchling. Special thanks to the Domain Graph team and the many application and platform teams who refined the framework through real-world usage and feedback.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9b97302bddcf" width="1" height="1" alt=""><hr><p><a href="https://benchling.engineering/fragmentation-to-framework-spec-first-development-at-benchling-9b97302bddcf">Fragmentation to framework: Spec-first development at Benchling</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Leveling Up: Your Roadmap to Senior Engineering Manager]]></title>
            <link>https://benchling.engineering/leveling-up-your-roadmap-to-senior-engineering-management-593545039822?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/593545039822</guid>
            <category><![CDATA[leadership]]></category>
            <category><![CDATA[career-development]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[engineering-management]]></category>
            <category><![CDATA[software-development]]></category>
            <dc:creator><![CDATA[Swathi Sundar]]></dc:creator>
            <pubDate>Wed, 28 Jan 2026 00:21:09 GMT</pubDate>
            <atom:updated>2026-01-28T00:21:08.239Z</atom:updated>
            <content:encoded><![CDATA[<h4>Tactical Reflections and Self-Assessment for Your Journey from Team Management to Organizational Leadership</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0W62u-PrdO26h4hx_OgOPg.png" /></figure><p>Are you an individual contributor (IC)<a href="https://medium.com/@swathi-sundar/should-i-become-an-engineering-manager-em-1011f0372481"> who’s just transitioned into an Engineering Manager</a> and trying to find your path to the next level? Or are you a seasoned Engineering Manager (EM) figuring out how to break through to a Senior Engineering Manager role? Read on!</p><p>Most companies have a career matrix that calls out the expectations for moving from Engineering Manager (M1) to Senior Engineering Manager (M2), also known as Manager of Managers (MoM). In this blog, I’ll share reflections and frameworks to help you find the right direction and progress your career.</p><h3>What is a Senior Engineering Manager Role?</h3><p>You might wonder what a Senior Engineering Manager (M2) or Manager of Managers (MoM) actually does.</p><p>As an M2, you draw on years of hands-on leadership experience. You typically oversee large or multiple teams, or lead major software initiatives that span several projects or departments. Your responsibilities often include managing complex programs with broad organizational impact and longer-term goals, going well beyond the scope of a single application or scrum team.</p><p>Here are some prerequisites to consider that will impact your path to Senior Engineering Manager:</p><p><strong>Time in Role</strong><br>One of the key factors is the actual wall-clock time you spend as an effective engineering manager. This will include managing a single team through multiple quarters and overseeing several product releases before stepping up to Senior EM.</p><p>In most companies, the Engineering Manager ladder includes an M0 or transition band for those moving internally from IC roles. 
You typically stay at that level for about a year before moving into M1, which is considered the entry-level EM role. Some companies, like Uber or Facebook, only hire externally for M1 (not for M0). In other organizations, there may not be a transition band, and M1 itself is broad enough to include both new managers and those hired from outside to manage a single team. The M1 role can be a terminal level if you prefer to continue managing a single team. Depending on your company, it may take a couple of years to excel as a frontline EM and transition through M0 → M1 → M2.</p><p>There are also some experiences you’ll likely only encounter with time. For example, you might start with a strong team and not have to worry about managing a low performer for a year or two. Or your company or team may not have had open headcount to fill, so you won’t gain experience in hiring or dealing with the challenges of a rapidly growing team right away.</p><p><strong>Breadth vs Depth</strong> <br>You’ll also need to choose between a deep-technical manager path and a broader scope managing multiple teams.</p><p>Historically, opportunities for deep-technical management roles were limited. However, with the rapid adoption of AI across industries, the demand for leaders with strong technical expertise is growing significantly. Today, being a deep-technical manager is more important than ever for driving innovation and guiding teams through complex, technology-driven challenges. Most career ladders don’t do a great job of calling out the split between these two tracks as they do for ICs. The key difference is the scope of your impact and how you achieve it.</p><p>You could be a manager of a single scrum team at the M2/Senior EM level if your work has significant leverage and strategic impact across the organization, or if you play an architect-type role for a team that requires it. For example, areas like search, infrastructure, and machine learning often have deep M2 managers. 
If you go down the deep path, you’ll likely specialize in an area (such as search or ML) and stick to that area for much of your career, becoming known as a specialist.</p><p>Most people, however, follow the broad path as generalists. This is the more standard trajectory:</p><ul><li>You manage a team.</li><li>You organically grow the team with an expanded charter.</li></ul><blockquote>The keyword here is <strong>organic</strong>: growing as the business grows, NOT engaging in empire-building.</blockquote><ul><li>You split the team into two or more. If you prove yourself competent, you may have the opportunity to run both teams.</li><li>You hire a manager for one team and continue to manage the other directly, eventually moving into manager-of-managers roles as your group expands.</li></ul><p>Another version of the broad path sometimes emerges: you manage one team very well, and as the need for another team arises in an adjacent area — or an existing team in an adjacent area needs a manager — you are asked to take on a second team in a rapidly growing organization.</p><p>As you can see, both paths depend on business needs and opportunities that arise over multiple years.</p><p>Having said that, let’s look at some tactical ways you can move from M1 to M2, or from Engineering Manager to Senior Engineering Manager:</p><h3><strong>1. People Management</strong></h3><p>The key here is to improve your team’s efficiency through <strong>active</strong> people management — enabling and helping your people stay engaged and succeed at your company.</p><p>Today, most companies are highly selective in their hiring, so you’ll likely work with engineers who are driven and focused, and who can often self-organize and deliver results. 
The difference you make as their manager is in how you help them grow and become better engineers.</p><blockquote><em>Aiming for and achieving mediocrity in People Management is a reachable goal for most Engineering Managers at M1, but striving for excellence here is important as you scale into Senior Engineering Manager.</em></blockquote><p>To excel as an Engineering Manager, you need to consider:</p><ul><li>Are you working on a career plan for each engineer that clearly outlines what they need to grow and improve to be successful in your company?</li><li>Are you guiding your team members through improvement and feedback loops multiple times, covering different skill sets?</li><li>Are you helping some of your engineers get promoted from one level to the next (for example, from L to L+1) over the course of a couple of performance feedback cycles? Does this include growing junior and mid-level engineers, as well as shaping senior and staff-level leaders?</li><li>If you have several new team members who are just starting to be productive at your company, are you considering what support you can provide to help them be effective at their level?</li><li>Are you holding a high bar for performance, and, if needed, actively managing the performance of those who do not meet expectations?</li></ul><p><strong>🌟 Your path to Senior EM🌟</strong></p><ul><li>Are you making your engineers stewards of your company, not just your team? Are you identifying key talent who could become future technical or people leaders, and helping them find their path? Are you getting them the visibility they need from your leadership team?</li><li>Are you encouraging and enabling your senior and staff engineers to contribute to broader groups or organizations outside your scrum team? 
Are you actively finding opportunities for them to pursue that go beyond your team’s boundaries?</li><li>If someone on your team wants to explore management, are you setting them up to mentor interns or junior ICs, giving them the chance to pseudo-manage a project or team, and helping them transition from an IC to an EM role?</li><li>In short, are you advocating ruthlessly for your team?</li></ul><p>So how do you know if you are excelling in this area?<br>Some signals you can look for include your team’s employee satisfaction scores, pulse survey results focusing on the manager section, and upward feedback from your direct reports.</p><h3><strong>2. Execution &amp; Delivering Results</strong></h3><p>This is the most important area because even if you meet every other expectation, if your team isn’t delivering consistent and impactful results, there is no path to the next level. Your primary role as a manager is to <strong>deliver value for the business</strong>. How effectively you do this depends on your skills in people management, technical leadership, and product strategy. The only way for you to deliver measurable value to your company is to execute flawlessly and ensure your team delivers results.</p><blockquote><em>Given infinite time and resources, all features can, and will, be delivered, but the nuance here is how you </em><strong><em>deliver these features faster, with high quality, and with fewer resources.</em></strong></blockquote><p>To excel as an Engineering Manager, consider:</p><ul><li>How can you continuously increase the velocity and quality of your team’s work?</li><li>Is everyone on your team working at their fullest potential? If not, how can you unlock it?</li><li>Are you pushing your team enough? 
Are you holding them accountable?</li><li>How can your team do the most impactful work, while delegating or having a process to scale for the more routine tasks?</li><li>Are you building the most important things for the business that align with your team’s charter?</li><li>What projects should you be investing in to provide direct impact for your company’s priorities?</li><li>What metrics do you have to measure your team’s success in execution?</li></ul><p><strong>🌟 Your path to Senior EM🌟</strong></p><ul><li>What cross-team partnerships do you need to forge to ensure smooth execution for your team?</li><li>How can you share the right level of detail with your manager to gain their support for investing in or resourcing your team? [managing up]</li><li>What is the perception of your team and your team’s execution across the organization? Do leaders feel like this is a critical team to invest in? If not, what level of context should you share, or what actions do you need to actively take to change the priority or perception of your team?</li></ul><h3><strong>3. Technical Leadership</strong></h3><p>Your main focus here should be developing a technical strategy and ensuring you deliver on it. This could mean building a brand new 0-to-1 product, or taking on platform migrations to improve scalability or performance.</p><blockquote><em>Incremental day-to-day improvements are almost a given, but looking at a 1/2/3 year roadmap and investing in the right technical solution is the key.</em></blockquote><p>To excel as an Engineering Manager, consider the following:</p><ul><li>Are you regularly reviewing the critical technical plans for your team? Do you have a forum for reviewing architectural decisions?</li><li>What technical trade-offs should you consider between long-term and short-term investments?</li><li>How much should you slow down to invest in tech debt so you can move faster on overall deliverables?</li><li>Do you have the right operational rigor? 
Are you ensuring operational excellence in reliability (including availability and latency), performance, data integrity, and proactive detection and monitoring to maintain the overall technical health of your team?</li></ul><p><strong>🌟 Your path to Senior EM🌟</strong></p><ul><li>Are you doing thorough research to understand where your company is headed and anticipating the need to scale? Are you looking for cues in All Hands meetings and having conversations with your Head of Strategy or your VP/CTO?</li><li>Are you ensuring that your technical investments align with your company’s multi-year technology strategy?</li><li>Are you identifying gaps in technical strategy across the organization and building alignment for your proposals with key decision makers? Are you partnering with and deploying your trusted staff, senior staff, or architects to help make your vision a reality?</li><li>Where should you leverage AI, and which AI tools should you invest in as a team or organization?</li></ul><h3><strong>4. Strategy &amp; Partnerships</strong></h3><p>In most companies, Product Managers (PMs) and Engineering Managers (EMs) are two peas in a pod. While you each have distinct roles, building the right product features — in the right order — to unlock customer value is a shared responsibility. This includes:</p><ul><li>Partnering with your PM to define goals that address critical product gaps.</li><li>Creating a structured feedback loop with customers to inform your roadmap.</li><li>Co-owning the product backlog and collaborating on prioritization.</li><li>Developing innovative technical solutions to execute the roadmap.</li></ul><p>Healthy debate and occasional tension over priorities are normal and even necessary. 
For platform and tooling teams, investing in developer experience for internal customers remains equally critical.</p><p>To excel as an Engineering Manager, consider:</p><ul><li>What features should we build next year to align with the company’s key focus areas?</li><li>Will this investment drive new revenue, improve customer adoption, or accelerate engineering velocity?</li><li>How do we measure the impact of our developer experience improvements?</li><li>What are the top unresolved pain points for our customers, and how can we address them?</li></ul><p><strong>🌟 Your path to Senior EM🌟</strong></p><ul><li>What are the top three problems for your customers, and should you solve them? If so, can you pick one unsolved pain point and get your team to chase after it?</li><li>What else should your team focus on to help your company succeed? Should you invest in turning your product into a platform and building for leverage?</li><li>What should you NOT do?</li></ul><blockquote><em>What will be the impact if we do NOT invest in your team’s charter at all for the next six months? What else should your team work on to drive revenue for your company?</em></blockquote><p>It can be difficult to ask yourself this kind of existential question, but doing so will help you assess the importance of your team and enable you to shape your team’s charter more effectively.</p><h3><strong>5. Scaling the organization</strong></h3><p>This entails building the right engineering culture for your team and your company.</p><blockquote>Being satisfied with the status quo is likely one of the failure modes here.</blockquote><p>To excel as an Engineering Manager, consider:</p><ul><li>Is what we are doing good enough? And what would make it great?</li><li>Are you providing a psychologically safe environment for your team? 
Do they feel comfortable suggesting alternative and innovative ideas that challenge the status quo and lead to new processes or solutions?</li><li>Are you participating in working groups and helping to push for a better culture?</li></ul><p><strong>🌟 Your path to Senior EM🌟</strong></p><ul><li>Are you looking beyond your own team to improve processes across your company? Every organization faces challenges in areas like hiring, branding, onboarding, fostering innovation (such as through hackathons), setting up effective mentorship for engineers who feel stagnated, improving interview questions, maintaining social connections across teams, enhancing technical operations, investing in documentation, and refining SDLC processes, among many others.</li><li>To truly excel and move toward the M2/Senior EM role, are you stepping up to drive these changes from the front, making a measurable impact across the organization?</li></ul><p>That’s it! As you reflect on your journey from Engineering Manager to Senior Engineering Manager, remember that growth in leadership is both a personal and organizational pursuit. By focusing on people development, technical strategy, operational excellence, and cross-functional collaboration, you can position yourself — and your team — for lasting impact.</p><p>The landscape of engineering leadership is evolving rapidly, especially with the rise of AI and emerging technologies. How do you envision the EM or Senior EM role changing as AI becomes an even more integral part of our organizations? 
I’d love to hear your thoughts — share your perspective in the comments below!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=593545039822" width="1" height="1" alt=""><hr><p><a href="https://benchling.engineering/leveling-up-your-roadmap-to-senior-engineering-management-593545039822">Leveling Up: Your Roadmap to Senior Engineering Manager</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Multi-Model Playbook]]></title>
            <link>https://benchling.engineering/the-multi-model-playbook-20d5fba48562?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/20d5fba48562</guid>
            <category><![CDATA[agentic-engineering]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[biotechnology]]></category>
            <category><![CDATA[benchling]]></category>
            <dc:creator><![CDATA[Sumedh Bhattacharya]]></dc:creator>
            <pubDate>Fri, 16 Jan 2026 16:02:06 GMT</pubDate>
            <atom:updated>2026-01-16T16:02:05.725Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Multi-Model Playbook: Patterns in Agentic Engineering</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ci6KUXDLMB-9m3CFJmoGmw.png" /></figure><p>Building production AI systems that work reliably across multiple model providers requires more than just swapping API keys. Over the past year, working on AI agents like the Data Entry Agent and Compose Agent at Benchling, I’ve learned that successful multi-provider strategies come down to understanding what’s universal versus what’s provider-specific, and designing around those constraints. <strong>The clearest revelation here was that the architectural principles underlying reliable software — modularity, separation of concerns, clear interfaces — apply just as fundamentally to AI systems as they do to traditional code.</strong></p><p>The Data Entry Agent (DEA) extracts structured data from PDFs and images, while Compose is an agent that helps scientists write electronic lab notebooks (ELNs) by extracting content from attached files, connecting that with data in Benchling’s Registry, and outputting structured scientific protocols, analysis, and more. These systems currently support five different model families (OpenAI GPT, Anthropic Claude, Google Gemini, Meta Llama, and Amazon Nova), typically using four in any given run. This experience has revealed patterns that hold true across providers — patterns around task decomposition, prompt structure, caching strategies, and data presentation. 
While each provider has its quirks, these foundational strategies have proven consistently effective.</p><p>In this post, I’ll cover:</p><ul><li>How to break down problems for optimal LLM performance</li><li>Why the system versus user prompt distinction matters for caching</li><li>Best practices for presenting structured data as context</li><li>Practical comparisons between model providers</li><li>How to apply these principles when using AI coding assistants.</li></ul><h3>Breaking Down Problems: Small &amp; Complex versus Large &amp; Simple</h3><p>LLMs lose accuracy when handling multiple separate tasks simultaneously or when operating on large input contexts. The sweet spot is to give them either a small, complex task or a large, simple one.</p><p>Even for complex tasks, it’s often better to identify modular, independent portions and run them in parallel on lighter models. For example, when building a PDF data import tool, we found that asking Sonnet 4.5 to transcribe an entire large file in one completion produced inconsistent results — it would summarize or gloss over certain sections. Instead, we used Haiku 4.5 to transcribe small chunks independently and in parallel, then stitched them together at the end using a MapReduce approach. This proved more accurate, faster, and cheaper, despite using a lighter model.</p><h3>System Prompt versus User Prompt: It’s About Caching</h3><p>The distinction between system and user prompts doesn’t significantly affect accuracy, but it matters enormously for prompt caching. Since the system prompt always precedes the user prompt in the final message to the LLM, any change to the system prompt causes a cache miss even if the user prompt remains unchanged. 
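</p><p>To make the cache-prefix rule concrete, here is a minimal sketch of a request layout that keeps the cacheable prefix byte-identical across tasks. The <em>cache_control</em> marker follows Anthropic’s prompt-caching API; the helper name and exact field layout are illustrative, and other providers cache matching prefixes implicitly:</p>

```python
def build_request(system_prompt, document, task_instruction):
    """Build a request whose prefix (system prompt + shared document)
    stays identical across tasks, so only the task-specific suffix
    falls outside the cached span."""
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Anthropic-style cache marker: content up to and including
                # this block is eligible for prompt caching.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": [
                    # Shared base content first, so it is part of the prefix.
                    {"type": "text", "text": f"<documents>{document}</documents>"},
                    # Task-specific instruction last, so changing it does not
                    # invalidate the cached prefix.
                    {"type": "text", "text": task_instruction},
                ],
            }
        ],
    }

# Two tasks over the same base content share an identical prefix:
a = build_request("You extract data.", "doc body", "Extract the table.")
b = build_request("You extract data.", "doc body", "Summarize section 2.")
```

<p>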
Therefore, when an operation requires multiple completions over the same base content, keep the system prompt constant and vary only the user prompt for different tasks.</p><p>The system prompt is reintroduced at the beginning of each conversation turn and can be cached across all turns. Subsequent turns are appended after it, with the latest user or assistant message appearing last.</p><p>For effective prompt caching, every string up to the cache point must match exactly. This means the system prompt must remain identical since it’s the first item in the message blocks, and user prompt blocks must also match up to the cache point.</p><h3>Presenting Structured Data as Context</h3><p>There’s no single best approach, but I’ve found the following patterns generally improve accuracy:</p><p><strong>Tabular data: </strong>Raw CSV works fine for smaller datasets. However, when column or row counts exceed 20, include indexes alongside the data itself. Without them, the model loses track of its position and can mix values across rows or columns.</p><p><strong>Non-tabular data:</strong> JSON works excellently. The nested structure and close semantic proximity between keys and values means key/value pairs are rarely mixed up.</p><p><strong>Source files: </strong>When including source files you want the model to reference, place them at the very beginning of the user prompt within specially formatted <em>&lt;documents&gt;&lt;/documents&gt;</em> XML tags, with individual <em>&lt;document&gt;&lt;/document&gt;</em> tags for each file. While this is specifically recommended for Anthropic’s long context usage, the general strategy of separating distinct prompt areas with XML tags works well across providers.</p><h3>Model Provider Comparisons</h3><p>The above Sonnet vs Haiku guidance refers to different-sized language models provided by Anthropic; the same could be said about Google’s Gemini 2.5 Pro vs Gemini 2.5 Flash models. However, each provider has slight quirks. 
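</p><p>Circling back to the tabular-data guidance above, here is one sketch of adding row and column indexes to CSV-style context; the helper name is ours, not a library API:</p>

```python
def render_table_with_indexes(header, rows):
    """Render CSV-style context with explicit row and column indexes so a
    model can keep track of its position in large tables (>20 rows/columns)."""
    # Number each column and add a leading "row" column.
    numbered_header = ["row"] + [f"c{i}:{name}" for i, name in enumerate(header)]
    lines = [",".join(numbered_header)]
    for r, row in enumerate(rows):
        # Prefix each data row with its index.
        lines.append(",".join([str(r)] + [str(v) for v in row]))
    return "\n".join(lines)

table = render_table_with_indexes(["sample", "volume"], [["s1", 10], ["s2", 12]])
```

<p>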
These comparisons are between equivalent model tiers across providers:</p><p><strong>Anthropic</strong> delivers the best combination of accuracy, speed, and cost. Sonnet 4.5, with long context and extended thinking enabled, was the first model to pass our hardest evaluation test cases for Data Entry Agent across both image and text inputs. Even Haiku 4.5 now offers nearly Sonnet 4 levels of accuracy at much faster speeds and lower costs.</p><p><strong>Gemini</strong> is typically accurate with structured data but quite slow, especially with 2.5 Pro. This is likely better now with Gemini 3 but we are waiting for that to come out of Preview prior to using it in production.</p><p><strong>OpenAI</strong> is usually fast and consistently produces well-structured JSON, but sometimes struggles with accuracy or understanding nuanced instructions. This has improved greatly with GPT 5 and 5.1 but still lags behind Sonnet 4.5 for our use-cases without custom prompt tuning.</p><p><strong>Llama</strong> performs adequately in terms of accuracy and speed but has difficulty generating well-structured JSON for large responses.</p><p><strong>Nova</strong> showed the lowest scores across all of our model providers for our use cases. Nova Premier is very slow, and Nova Pro lacks accuracy. Both struggle to produce well-formed JSON responses.</p><h3>Applying These Principles to AI Coding Assistants</h3><p>While the strategies above focus on production systems and API usage, the same principles apply when using AI coding assistants like Cursor or Antigravity for development work. The key insight — breaking down complexity and being strategic about context — translates directly to these tools.</p><p>I’ve found two largely different styles of using AI coding assistants for different types of scenarios:</p><p><strong>For new features where I don’t have much existing context: </strong>First, I ask the assistant to plan and investigate the approach. 
I iterate on the plan until I’m satisfied, digging into any areas of the code it’s identifying that I don’t know about, then ask it to execute pieces modularly. At each modular step, I either test manually or use Antigravity or Cursor’s browser integration to have it test itself for full-stack changes. For these problems, I let it work largely autonomously on each modular step, checking in only at intervals, since each step requires a fair bit of thinking and writing. This approach is fantastic for quickly prototyping and exploring solution spaces. However, the code generated through this exploratory process isn’t production-ready as-is — it needs to be broken down into modular chunks, with each chunk then refined through the second style below to create well-scoped, tested PRs for production.</p><p><strong>For tasks with existing context:</strong> This is where I take a more hands-on, deliberate approach, whether refining prototyped code or building well-understood features from scratch. I provide very detailed instructions and include all files I know it will need for context. The more specific and accurate my instructions, the better the response and the more it feels like it is doing exactly what I would expect it to do. While it’s working, I watch the output as it’s being generated to ensure it stays on track and to understand exactly what it’s writing. At any moment it goes off track, I stop it and correct it. Then, at each step, I also @-mention the Git diff from the main branch to ask it to write unit tests specifically for the portion of code it has just written to guarantee its validity. This tighter feedback loop produces well-scoped, readable, and tested code that only requires minimal manual intervention prior to being ready for production.</p><p><strong>Remote vs. Local Agents:</strong> I’ve found it quite useful to do the above practices in parallel for different tasks. 
For example, I will first work with a local agent on a new feature to come up with the plan. Then, I’ll hand the plan off to a remote agent such that it can work on the cloud on a new branch. While it’s working, I’ll switch my local branch so that I can also work on a well-scoped problem that I have existing context on. Here, I’ll iterate with the agent, make my manual edits, and get a PR ready. Once the PR is set up and CI is running, I can now pull in the new feature branch my remote agent was working on to see how it is doing.</p><h3>Key Takeaways</h3><p>When you break problems down sufficiently, lighter, cheaper models like Haiku can often replace Sonnet. The architectural discipline of modular task decomposition improves reliability regardless of which model you choose. While the cost difference per run might seem small, it compounds at scale. More importantly, the improved reliability and maintainability make this approach worthwhile even beyond cost considerations.</p><p>The deeper lesson here is that good engineering practices for AI systems mirror good engineering practices in general. <strong>The same principles that make traditional software maintainable — modularity, separation of concerns, clear interfaces — also make AI systems more reliable and debuggable. </strong>When an LLM fails on a monolithic task, it’s often unclear where things went wrong. When it fails on a well-scoped subtask, the problem is isolated and fixable.</p><p>This has implications for how we should think about building with AI going forward. As models continue to improve, the temptation will be to throw increasingly complex, multi-step problems at them as single prompts. But the systems that will scale and remain maintainable are those that treat LLMs as components in a larger architecture, not as magical black boxes that can handle anything. 
<strong>The future of production AI isn’t about finding the one perfect model — it’s about building systems that work reliably across models, degrade gracefully when models fail, and remain comprehensible to the humans maintaining them.</strong></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=20d5fba48562" width="1" height="1" alt=""><hr><p><a href="https://benchling.engineering/the-multi-model-playbook-20d5fba48562">The Multi-Model Playbook</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How We Run Terraform At Scale]]></title>
            <link>https://benchling.engineering/how-we-run-terraform-at-scale-da7bb75dc394?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/da7bb75dc394</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[terraform]]></category>
            <category><![CDATA[cloud-infrastructure]]></category>
            <category><![CDATA[infrastructure-as-code]]></category>
            <dc:creator><![CDATA[Christian Monaghan]]></dc:creator>
            <pubDate>Tue, 04 Mar 2025 16:32:39 GMT</pubDate>
            <atom:updated>2025-03-06T20:24:25.507Z</atom:updated>
            <content:encoded><![CDATA[<p>Managing over 165k cloud resources across hundreds of workspaces could seem daunting. But for us, it’s just another day at Benchling. Here’s how we do it.</p><p>We currently have:</p><ul><li>165k cloud resources under management</li><li>625 Terraform workspaces</li><li>38 AWS accounts</li><li>170 engineers (40 of whom are infra specialists)</li></ul><p>We perform:</p><ul><li>225 infrastructure releases daily (terraform apply operations)</li><li>723 plans daily (terraform plan operations)</li></ul><p>We’ve been successfully operating Benchling’s infrastructure release system for the past two years (spoiler, it’s Terraform Cloud), over which time we’ve doubled our infrastructure footprint with minimal additional release overhead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SAXPar4lfLKCbMmTDJeT8g.png" /></figure><h3>Before Terraform Cloud: The Chaos</h3><p>Our infra release process wasn’t always this smooth. Let me rewind and take you back to how it was before.</p><p>As is common guidance for small Terraform projects, our team would previously apply all infrastructure changes via laptop. Also in line with common guidance, our team used S3 to store state files, with DynamoDB state locks, which prevented any apply-time collisions. This is a great strategy for a small team working on up to a dozen workspaces. However, this slowly starts to break down as the team’s workspace footprint grows. It’s like the proverbial frog in the pot of water, slowly heated to a boil. By the time we made the switch, Benchling was managing 350 workspaces. 
We were approaching the boiling point.</p><h4>Pain Points: Developer Toil and Inefficiency</h4><p>Managing 350 workspaces with this approach had several downsides:</p><ol><li><strong>Necessitated elevated AWS access </strong>permissions for the infrastructure team.</li><li>It was <strong>time-consuming</strong> as the engineer had to navigate to each directory, run terraform apply, review and approve the run, then verify it succeeded. Very commonly a single change could affect over 120 workspaces, which would mean repeating this process 120 times. (We had developed a custom python script which helped parallelize this somewhat.)</li><li><strong>Accumulated infra drift</strong>. Often an engineer would go to apply their change and find numerous unrelated pending infrastructure changes. This situation could arise for many reasons — a previous engineer had missed this workspace while rolling out a change, did not realize an apply step was required, or missed that their apply had failed. The unlucky engineer who encountered this would then need to track down the author of the change which caused this drift, confirm whether this change was intended and safe to apply, and roll out the change. Then they’d have to repeat this process for each of the impacted workspaces.</li></ol><p>Applying a single change could easily take a full day, particularly if you encountered unexpected drift (pain point #3). Because of the release overhead associated with an additional workspace, the team was incentivized towards several anti-patterns.</p><p>The first anti-pattern was to put as many resources as possible into a single directory/workspace to minimize the number of workspaces that required an apply (thus minimizing pain point #2). 
This meant some workspaces were managing upwards of 4k resources, which made plan times painfully long (30+ min) and increased the blast radius for any change that went poorly.</p><p>These excessive plan times for our monster workspaces (pain point #2) and accumulated drift (pain point #3) pushed our team towards a second anti-pattern — using the Terraform -target feature. This feature allows a developer to limit changes to a subset of the full infrastructure configuration. While this can be useful in limited circumstances, it functions by only applying changes to a subset of terraform’s acyclic graph (which maps all resource dependencies), so it can cause all sorts of unintended chaos if used indiscriminately. Hashicorp themselves, the authors of Terraform, <a href="https://developer.hashicorp.com/terraform/cli/commands/plan#resource-targeting">strongly discourage</a> use of the -target feature for routine operations due to the possible side effects.</p><p>Overall, this tooling gap was a source of developer toil and risk. It was clear that, for an organization at our scale, we needed to automate our infrastructure release process.</p><h4>Our Solution: Automate Terraform with Terraform Cloud</h4><p>We evaluated several infrastructure automation tools — specifically Spacelift, Terraform Cloud, and Atlantis. We ended up deciding on Terraform Cloud, mostly for the perceived benefits of working with Hashicorp, who were larger, more established, and authored and owned Terraform.</p><p>Successfully rolling out Terraform Cloud required two big changes to our developer workflow. In particular:</p><ol><li>Move from an “apply then merge” workflow to a “merge then apply” workflow. 
This was a big source of uncertainty as we rolled out since there is really no way to test for apply-time errors on <em>all</em> workspaces before merging a PR to our main branch.</li><li>Move to untargeted applies.</li></ol><p>We helped ease the pain of this transition with several training sessions, a detailed FAQ, a dedicated Slack channel for questions, and by carefully watching Terraform Cloud for the first few months to ensure no backlogs of releases, erroring runs, etc.</p><p>We used an incremental rollout strategy to limit the blast radius, to give our engineering teams time to build familiarity in lower-risk workspaces first, and to learn and adapt our resource capacity planning for Terraform Cloud agents.</p><h4>The Impact: Efficiency, Reliability, and Developer Happiness</h4><p>The resulting impact of this change:</p><ul><li>Eliminated drift (problem #3 above), a huge source of risk and developer toil.</li><li>Roughly <strong>8000 developer hours saved annually</strong> (40 infra specialists × 4 hrs/wk × 50 weeks/yr = 8000 hrs) — that’s equivalent to getting 4 developers back!</li><li>Audit log of every change for a given workspace linked to commit and author. We can’t emphasize enough just how helpful this is for debugging issues.</li><li>Speculative plans — a prospective change can be automatically tested across dozens of impacted workspaces and the results displayed directly in GitHub CI.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Xq4fbdkJfDgAhlEI" /></figure><p><em>This screenshot shows how we limit speculative plans to a small number of canary workspaces (in this case just one).</em></p><p>In the time since we initially rolled out Terraform Cloud two years ago, we’ve continued to refine and improve this system in many ways both big and small.</p><h3>How We Run Terraform Cloud Today</h3><p>We host our own installation of Terraform Cloud with our TFC agents running in our own AWS account. 
(Hashicorp calls this product Terraform Cloud for Business.) We prefer to keep all admin access to production infrastructure in-house without conferring any production permissions to Hashicorp. We run these Terraform Cloud agents in our own ECS cluster. Our contract allows us to run up to 200 concurrent agents, though we typically run 120 across two agent pools (40 in our dev pool and 80 in our prod pool). This allows us to release changes to our 625 workspaces with high concurrency. For example, if a single change impacts 80 workspaces, it can be applied to all 80 workspaces simultaneously.</p><h4>The Things We Monitor Obsessively</h4><ol><li><strong>Agent exhaustion / concurrency limits</strong>: If there are no available agents for a sustained period, we page our on-call (we intend to implement autoscaling one day).</li><li><strong>Plan time</strong>: If plan time exceeds 4 min in dev, we notify our team. We care most about ensuring quick plan times in dev workspaces since this reduces developer feedback loops during infra development.</li><li><strong>Infra drift</strong>: After a year of measuring minimal drift, we eventually stopped measuring this because drift doesn’t meaningfully exist in our infrastructure anymore, since 1) all applies are untargeted, 2) zero engineers have prod write access by default, and 3) releases are so frequent that any drift is quickly addressed by the next release.</li></ol><h4>Quality of Life Optimizations</h4><p>Although Terraform Cloud has been a great tool for us, as a large-footprint organization and power-user, we’ve found it lacks some features we need. Here are the custom features we’ve built around it.</p><h4>TFC CLI</h4><p>Some of our Terraform modules are used across many workspaces. For example, we have 261 workspaces affected by changes to our “deploy” module. Any change that impacts this module requires 261 reviews and approvals, even though the actual changes are substantively the same. 
Clicking through the Terraform Cloud UI is tedious, so we wrote a CLI. Our tool lets us run <em>tfc apply --commit abcd1234</em> to review plans and apply changes. A more sophisticated invocation might look like <em>tfc review --commit abcd1234 --wildcard update:module.stack.*.access_controls --include-tag type:deploy</em>. This command auto-approves changes that match the specific commit SHA, the wildcard resource address, and the provided label/tag.</p><p>We’ve added several other features over time, but the commands that get the heaviest use are <em>tfc run</em> (trigger new plans) and <em>tfc review</em> (review and approve pending applies).</p><h4>Notifications</h4><p>Because we require manual review and approval for each production change, a developer can easily merge their change and then forget to apply it. We built a Slack notifier service to solve this problem. It runs every 10 minutes and notifies the commit author of any pending Terraform Cloud applies. It only runs during business hours and backs off exponentially so as not to be too annoying.</p><h4>Workspace Managers</h4><p>We have 625 workspaces, so of course we manage our Terraform workspaces with Terraform! We make heavy use of the <a href="https://registry.terraform.io/providers/hashicorp/tfe/latest/docs">tfe provider</a>. We built a tfc-workspace module which we use to provision each workspace.</p><h4>Ownership Delegation</h4><p>Our team owns Terraform Cloud as a service provided to our infra, security, and developer teams. We try to keep these workspaces in a non-errored state, with providers updated monthly and any deprecation warnings addressed promptly. However, some workspaces manage resources outside our team’s expertise, in which case we delegate to the appropriate team to address these issues. To solve this, we’ve developed a convention of applying tags like owner:{github_team_name} to each workspace, for example owner:infra-monolith or owner:security-eng. 
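</p><p>A small sketch of how tooling might resolve the owning team from these tags (hypothetical helper; our real notifier does more):</p>

```python
def owner_team(tags):
    """Resolve the owning GitHub team from workspace tags such as
    'owner:infra-monolith'; returns None when no owner tag is present."""
    for tag in tags:
        if tag.startswith("owner:"):
            # Split only on the first colon, keeping the full team name.
            return tag.split(":", 1)[1]
    return None

team = owner_team(["type:deploy", "owner:security-eng"])  # "security-eng"
```

<p>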
This allows us to notify the appropriate team when an issue with the workspace arises.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/504/0*TBzm4eenjYzQyIIl" /></figure><p><em>Here’s an example of how we tag workspaces today. We apply these via our </em><em>tfc-workspace terraform module to keep naming consistent.</em></p><h4>TFC Usage Reporting</h4><p>Lately, our Terraform Cloud contract has come up for renewal, which means we need to predict future growth and usage. Unfortunately, Terraform Cloud only tells you total Resources Under Management at the current point in time, but nothing else.</p><p>To this end we built a script that uses the TFC API to query each state version for each workspace, going back a year, and tabulates this data into a CSV, after which we build some charts. These charts allow us to track growth by provider resource type (e.g. aws_s3_bucket, aws_ec2_instance), workspace type (e.g. type:region), or AWS account. Ick, but it works.</p><h4>TFC State Backup</h4><p>We need a disaster recovery strategy in case Terraform Cloud is down. During a disaster recovery incident we can revert to local mode, if approved to do so, utilizing break-glass functionality and processes. However, the one gap here is we need access to the state file, which is stored in Terraform Cloud. To protect against a loss of the state file, we’ve implemented a flavor of <a href="https://dev.to/mnsanfilippo/how-to-backup-the-terraform-states-from-terraform-cloud-workspaces-1km4">this post</a> to back them up to an S3 bucket after each apply.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*c5EuQlGpUp7XZKT8" /></figure><p><em>Here’s an example of our state backup webhook. 
Upon completion of a terraform apply, it triggers a lambda which copies the terraform state file from Terraform Cloud to a secondary location in S3 which we can use in a disaster recovery event.</em></p><h4>Workspace Dependency Map</h4><p>One great thing about Terraform Cloud is it allows you to have a given workspace watch certain repository directories for changes. For example, a workspace can watch for changes to tf-modules/* and trigger plans if anything in that directory changes. However, this doesn’t work very well at our size because we use a monorepo and have 180+ modules with 625 workspaces each using some subset of those modules. (For example, if all 625 workspaces tracked tf-modules/* and a single file was changed in that directory, then it would trigger 625 runs, quickly exhausting our agent pool of 120 agents, even if most workspaces resulted in a no-op.) Thus, we built a custom tool that maps out the module dependency tree for each workspace and generates a yaml configuration which is read by our tfc-workspace module to determine which directories to watch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fiRua0_w7LS9M_Ir" /></figure><p><em>This shows the directories we track for one example workspace.</em></p><h4>Provider Upgrades With Dependabot</h4><p>With 625 workspaces, each using an average of 3 providers, that’s 1875 unique provider-workspace upgrades to perform. We use Dependabot to help with this, upgrading all providers on a monthly cadence.</p><p>Even managing Dependabot at this scale takes some work, so we’ve built automation that allows fine-grained manipulation of the dependabot.yml file. This enables us to allow-list certain providers for upgrade, deny-list other providers, isolate upgrades to just dev or prod workspaces, or treat individual workspaces with special conditions. 
Here’s a <a href="https://gist.github.com/cmonaghan/7e714071a1c4be1026f01ab3efb1b8dd">gist that shows how our dependabot.yml is structured</a>. Fully generated for all workspaces, this file runs to 2000+ lines of yaml.</p><h3>Ongoing Improvements: Optimizing for Scale</h3><p>Our infra release system is still a work in progress. It’s not perfect, but we continue to make improvements every day. Here’s where we’d love to take it next:</p><h4>Staged Rollouts</h4><p>We currently release from a single main branch. Once a change is merged to main, it gets released to most workspaces (our validated customers, or GxP customers, are an exception as they only receive quarterly releases). We’d like to move to staged rollouts with more release tiers (e.g. dev, staging, prod, gxp). We would verify success across each tier before promoting to the next.</p><h4>Decompose Large Workspaces Into Many Smaller Workspaces</h4><p>Although we’ve grown from 350 to 625 workspaces, we still have numerous workspaces managing thousands of Terraform resources. This makes plan and apply operations slow to complete. Since rolling out changes to all workspaces is now fully automated, we should lean into decomposing these workspaces further, breaking these 625 workspaces into 1500+ workspaces to reduce plan and apply time and to minimize blast radius.</p><h4>Enhanced Notifications</h4><p>The ability to reassign Slack notifications to other users has been a popular feature request. Also, it’d be nice if the Slack bot prepared a <em>tfc review</em> command scoped to just the impacted workspaces.</p><h4>Agent Autoscaling</h4><p>Autoscaling this kind of workload is complicated, and Hashicorp supports an EKS Operator to do just that. We hope to migrate our agent pool to EKS to employ the supported pattern.</p><h4>Open-Source</h4><p>We’ve built a lot of custom tooling to support our infrastructure automation. 
Most of it solves use cases we imagine many other teams have, so we’d love to open-source this work.</p><h3>It Takes a Village</h3><p>This system was built by many hands, thanks to the collaboration and insights of many engineers across Benchling. Our deep gratitude to all those Benchlings who have been partners in helping us get to where our infrastructure story is today!</p><p>We hope this post proves helpful to you and your organization in designing your cloud infrastructure for scale.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=da7bb75dc394" width="1" height="1" alt=""><hr><p><a href="https://benchling.engineering/how-we-run-terraform-at-scale-da7bb75dc394">How We Run Terraform At Scale</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building an LLM-Powered Slackbot]]></title>
            <link>https://benchling.engineering/building-an-llm-powered-slackbot-557a6241e993?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/557a6241e993</guid>
            <category><![CDATA[amazon-bedrock]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[knowledge-base]]></category>
            <category><![CDATA[slackbot]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <dc:creator><![CDATA[Christian Monaghan]]></dc:creator>
            <pubDate>Fri, 13 Dec 2024 17:32:12 GMT</pubDate>
            <atom:updated>2024-12-13T17:32:12.728Z</atom:updated>
<content:encoded><![CDATA[<h3>Background</h3><p>At Benchling we run cloud infrastructure across several regions and environments. To coordinate and manage this complexity, our team operates a self-hosted implementation of Terraform Cloud, managing around 160,000 Terraform resources across five data centers. About 50 engineers from across the engineering org release some form of infrastructure change within a given month — some are infrastructure specialists, and others are application engineers who are completely new to Terraform Cloud.</p><p>Understandably, we get a lot of questions about how to use Terraform Cloud or how to debug a specific issue, and that forum is usually Slack. We have a glorious 20-page FAQ in Confluence that answers most questions, supplemented by numerous Slack threads documenting previous problems and their eventual solutions.</p><p>So we <em>have</em> good documentation, but <em>finding</em> it is a pain. Who wants to read through a 20-page FAQ? Or go Slack spelunking to find that answer 40 messages deep into a thread?</p><p>We set out to solve this problem by building a Slackbot that could dynamically answer any user question without doing any tedious searching. To accomplish this, we implemented Retrieval-Augmented Generation (RAG) with a Large Language Model (LLM). Here’s the story of how we did it and what we learned along the way.</p><h3>What we built</h3><p>We built an internal Slackbot that enables Benchling engineers to interact with a knowledge base to answer common Terraform Cloud questions. It also serves as a reference implementation for future LLM-powered tools at Benchling. It demonstrates how we can combine disparate information sources, both internal and public (web, Slack, Confluence), with the latest Large Language Models to expose this to the user through a familiar Slack interface.
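</p><p>Before acting on any incoming message, the Lambda behind that interface first has to confirm the request really came from Slack. Here is a minimal sketch of Slack’s standard v0 signing-secret check (the function and variable names are ours, not the production code):</p><pre>
```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           received_signature: str, max_age_s: int = 300) -> bool:
    """Validate a request against Slack's v0 request-signing scheme."""
    # Reject stale timestamps to limit replay attacks.
    if abs(time.time() - int(timestamp)) > max_age_s:
        return False
    basestring = f"v0:{timestamp}:{body}"
    expected = "v0=" + hmac.new(
        signing_secret.encode(), basestring.encode(), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(expected, received_signature)
```
</pre><p>Slack sends the timestamp and signature in the <em>X-Slack-Request-Timestamp</em> and <em>X-Slack-Signature</em> headers, and the signing secret is the value we keep in AWS Secrets Manager.</p><p>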
This pattern can be reused to develop Slack assistants for other specialized knowledge areas such as answering HR questions, surfacing past solutions to customer issues, or explaining software error codes.</p><p>Here’s what the interface looks like:</p><figure><img alt="An example showing the Slackbot interface" src="https://cdn-images-1.medium.com/max/699/0*mzdPzC4t6l9T20X_" /></figure><h3>How does it work?</h3><p>We built the RAG LLM portion of our tool using Amazon Bedrock. Read more about how this works in <a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/">this AWS post</a>. The TL;DR is:</p><p><em>Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response.</em></p><p>For simplicity we’ll just use the term “knowledge base” throughout the rest of this post. The core concept behind it is:</p><ol><li>Search a database for content relevant to the user’s query</li><li>Feed this content into an LLM prompt, along with instructions for how to use this content and generate a response</li></ol><p>You can visualize it like this:</p><figure><img alt="A diagram describing how the Knowledge Base and User Query are combined to submit information to the LLM prompt." src="https://cdn-images-1.medium.com/max/1024/0*gjc6JKiwYplJCqHi" /></figure><p>To see how this works in practice, take a look at Bedrock’s default knowledge base LLM prompt:</p><figure><img alt="The Amazon Bedrock LLM prompt template." src="https://cdn-images-1.medium.com/max/708/0*510Rldbrlz2fYAmf" /></figure><p>This prompt comprises three key components:</p><ol><li>Instructions</li><li>Search results</li><li>User query</li></ol><p>To set up our knowledge base, we used the Amazon Bedrock knowledge base setup wizard, which walks you through the steps in a few minutes.
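</p><p>The two numbered steps above collapse into a single Bedrock API call, RetrieveAndGenerate, which performs the vector search and prompt assembly server-side. A sketch of how a Lambda might invoke it with boto3 (the helper names and IDs here are illustrative, not our production code):</p><pre>
```python
def build_rag_request(query: str, knowledge_base_id: str, model_arn: str) -> dict:
    """Assemble the parameters for Bedrock's RetrieveAndGenerate API."""
    return {
        "input": {"text": query},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }


def answer_question(query: str, knowledge_base_id: str, model_arn: str) -> str:
    """Run the full retrieval + generation round trip (needs AWS credentials)."""
    import boto3  # deferred so build_rag_request stays dependency-free

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(
        **build_rag_request(query, knowledge_base_id, model_arn)
    )
    return response["output"]["text"]
```
</pre><p>The knowledge base ID and model ARN here correspond to the KNOWLEDGE_BASE_ID and MODEL_ARN environment variables our Terraform passes to the Lambda.</p><p>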
Behind the scenes it creates an OpenSearch Serverless database (a specific type of vector database within the Amazon OpenSearch service, used to source content related to the user query). It also sets up all the necessary IAM roles and policies, creates the Bedrock resources, and establishes data sources (the reference data that will be embedded and stored in the vector database). These data sources are then processed by async jobs and saved in the OpenSearch Serverless database.</p><h3>What data powers our knowledge base?</h3><p>We’ve implemented our knowledge base so that four different data sources are ingested and stored in the vector database. When a user query is received, the system runs a search against the vector database to find the most relevant sections of text across all the ingested data sources. Those query results are then fed into an LLM prompt (we use Claude 3.5 Sonnet v2) to synthesize a helpful response based on the retrieved answers.</p><p>The data sources we configured are:</p><ul><li><strong>Confluence</strong>: Terraform Cloud FAQ (this page was exported to PDF, then stored in S3)</li><li><strong>Web</strong>: Selected Terraform Cloud documentation on HashiCorp’s public documentation site</li><li><strong>Web</strong>: Selected Terraform language documentation on HashiCorp’s public documentation site</li><li><strong>Slack</strong>: Selected Slack threads where a Terraform Cloud issue was raised and eventually solved (for the proof of concept these were hand-copied from a few Slack threads, pasted into a .txt file, and stored in S3)</li></ul><p>This is a minimal set of data to prove out these concepts, but we can expand and enrich each of these or add new data sources in the future.</p><p>Here’s what the currently supported data sources look like in Amazon Bedrock:</p><figure><img alt="Screenshot of the Amazon Bedrock setup wizard."
src="https://cdn-images-1.medium.com/max/1024/0*BJDFfKZkWlrDnBnI" /></figure><p>And these are the data sources we have configured:</p><figure><img alt="A screenshot of Amazon Bedrock data sources" src="https://cdn-images-1.medium.com/max/1024/0*C4GWuWEcj-uSg1vC" /></figure><p>After going through the process of building out our knowledge base and integrating it with Slack, here’s what we learned:</p><h3>Limitations</h3><p><strong>No images.</strong> The knowledge base cannot process images submitted as part of a query, nor does it include any images from our documentation in its responses. This is unfortunate as our help documents include numerous images in the form of architecture diagrams, UI screenshots, and error messages.</p><p><strong>No Terraform support, yet. </strong>The Terraform AWS provider’s current support for Amazon Bedrock is a bit paltry. None of the resources we used here are supported by the provider yet, though support will likely be added soon. We’ll keep checking back on the <a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/bedrock_custom_model">Terraform Bedrock resources page</a> until the latest knowledge base resources are supported.</p><h3>Potential future enhancements</h3><p><strong>Present answer citation links to the user. </strong>Currently this is available in the Bedrock UI when testing a model. However, the answer we send to Slack does not include any citations or links to the source documents.</p><p><strong>Make it easy to save relevant Slack threads to the knowledge base. </strong>For example, it would be nice to allow the user to trigger a webhook from Slack with something like “<em>@help-terraform-cloud remember this thread.</em>”</p><p><strong>Automatic updates for each data source.</strong> Currently a manual data sync is required.
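</p><p>Under the hood, that manual sync is one StartIngestionJob call per data source, so a scheduled job only needs to repeat the same loop. A sketch with the client injected for testability (the names are illustrative; in production the client would be boto3’s bedrock-agent client):</p><pre>
```python
def sync_data_sources(bedrock_agent, knowledge_base_id, data_source_ids):
    """Kick off an ingestion job for every data source and return the job ids."""
    job_ids = []
    for data_source_id in data_source_ids:
        response = bedrock_agent.start_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
        )
        job_ids.append(response["ingestionJob"]["ingestionJobId"])
    return job_ids
```
</pre><p>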
We plan to set up a CloudWatch Events cron to trigger a data sync at least weekly.</p><p><strong>Use the Confluence API.</strong> Currently we are exporting our FAQ page from Confluence to PDF and saving this to S3. In the future we plan to connect to Confluence via API.</p><p><strong>Multi-turn conversation.</strong> Currently our Lambda is a stateless function, and only the Slack message that explicitly tags our @help-terraform-cloud user is made available. One enhancement could be to preserve conversation context so the user can have a multi-turn conversation and build on a previous answer.</p><h3>Learnings</h3><p><strong>Chunking strategies</strong>. In our initial prototype we used the default Bedrock chunking strategy of 300 tokens. This returns about one paragraph of text. This led to substandard results since many of our FAQ answers include several ordered steps and can stretch into several paragraphs. This meant our search results were often cut off midway, providing incomplete documentation to the LLM prompt. There are several alternative chunking strategies to choose from, and after trying a few, we found that Hierarchical chunking worked best, with a parent token size of 1500 tokens (about 5 paragraphs). The goal is to select a token size near the upper limit of your longest answers. However, you also don’t want your token size any larger than necessary, as this feeds more (possibly irrelevant) data to the LLM, which could confuse its answers. For our FAQ, our longest answers were around 1500 tokens, so this was a good fit. You’ll want to try out a few different chunking strategies and test how each performs to find the best fit.</p><figure><img alt="Screenshot showing the chunking strategy selection in Amazon Bedrock" src="https://cdn-images-1.medium.com/max/1024/0*QbEuwC6m0fpUAbM0" /></figure><p><strong>Parsing PDFs is quite robust</strong>. Although it loses all the images, Bedrock parses text reliably.
Pointing Bedrock at a PDF in S3 worked on the first try.</p><p><strong>Setting up a knowledge base is easy! </strong>Previously, setting up all the necessary plumbing for a knowledge base yourself would have been a multi-day project. However, Bedrock’s knowledge base feature automates this process into something that takes minutes instead of days.</p><p><strong>More targeted help bots?</strong> Perhaps the ease of deployment paves the way for numerous targeted help bots in the future. Using a more tightly-scoped dataset also reduces the chances of hallucination or the potential for non-relevant data to be returned from the vector database.</p><h3>Architecture</h3><p>Our architecture is quite simple. It comprises:</p><ul><li>A Slack App</li><li>AWS API Gateway</li><li>AWS Lambda (runs a stateless Python function)</li><li>AWS Bedrock</li><li>AWS OpenSearch Serverless (vector database)</li></ul><figure><img alt="Architecture diagram that displays Slack, AWS API Gateway, Lambda, Bedrock, and a Vector Database."
src="https://cdn-images-1.medium.com/max/1024/0*WP86JxJLN5HIfR5L" /></figure><p>We’re using two different models:</p><ul><li>Amazon Titan Text Embeddings v2 (for embedding)</li><li>Claude 3.5 Sonnet v2 (for inference)</li></ul><p>Since the Terraform AWS provider doesn’t yet support the Bedrock resources we use, our implementation was created manually via the Bedrock Knowledge Base setup wizard in the UI.</p><p>The infrastructure components we use for the API Gateway and Lambda were built using open-source community modules, and we can share our implementation with you here:</p><pre>##<br># variables.tf<br>##<br>variable &quot;environment&quot; {<br>  description = &quot;Deployment environment&quot;<br>  type        = string<br>  validation {<br>    condition     = contains([&quot;dev&quot;, &quot;prod&quot;, &quot;sandbox&quot;], var.environment)<br>    error_message = &quot;Environment must be a valid value&quot;<br>  }<br>}<br><br>variable &quot;knowledge_base_id&quot; {<br>  description = &quot;Bedrock knowledge base id&quot;<br>  type        = string<br>}<br><br>variable &quot;account_name&quot; {<br>  description = &quot;Name of the AWS account&quot;<br>  type        = string<br>}<br><br>##<br># main.tf<br>##<br>locals {<br>  service_name      = &quot;tfc-help-slackbot-${var.environment}&quot;<br>  bedrock_model_arn = &quot;arn:aws:bedrock:${data.aws_region.current.name}::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0&quot;<br>  account_id        = data.aws_caller_identity.current.account_id<br>}<br><br>data &quot;aws_region&quot; &quot;current&quot; {}<br>data &quot;aws_caller_identity&quot; &quot;current&quot; {}<br><br>data &quot;aws_secretsmanager_secret&quot; &quot;slack_token&quot; {<br>  name = &quot;${var.account_name}/tfc_help_slackbot/slack_token&quot;<br>}<br><br>data &quot;aws_secretsmanager_secret&quot; &quot;slack_signing_secret&quot; {<br>  name = 
&quot;${var.account_name}/tfc_help_slackbot/slack_signing_secret&quot;<br>}<br><br>module &quot;api_gateway&quot; {<br>  source  = &quot;terraform-aws-modules/apigateway-v2/aws&quot;<br>  version = &quot;5.2.0&quot;<br><br>  name               = &quot;http-${local.service_name}&quot;<br>  description        = &quot;API Gateway for ${local.service_name}&quot;<br>  protocol_type      = &quot;HTTP&quot;<br>  create_domain_name = false<br><br>  cors_configuration = {<br>    allow_headers  = []<br>    allow_methods  = [&quot;*&quot;]<br>    allow_origins  = [&quot;*&quot;]<br>    expose_headers = []<br>  }<br><br>  routes = {<br>    &quot;$default&quot; = {<br>      integration = {<br>        uri                    = module.lambda.lambda_function_arn<br>        payload_format_version = &quot;2.0&quot;<br>        timeout_milliseconds   = 30000<br>      }<br>    }<br>  }<br>}<br><br>module &quot;lambda&quot; {<br>  source  = &quot;terraform-aws-modules/lambda/aws&quot;<br>  version = &quot;7.4.0&quot;<br><br>  function_name = local.service_name<br>  description   = &quot;@help-terraform-cloud slackbot&quot;<br>  handler       = &quot;index.lambda_handler&quot;<br>  runtime       = &quot;python3.12&quot;<br>  source_path = [<br>    {<br>      path             = &quot;${path.module}/files&quot;,<br>      pip_requirements = &quot;${path.module}/files/requirements.txt&quot;<br>    }<br>  ]<br>  trigger_on_package_timestamp      = false # only rebuild if files have changed<br>  create_role                       = true<br>  role_name                         = local.service_name<br>  policies                          = [aws_iam_policy.lambda.arn]<br>  attach_policies                   = true<br>  number_of_policies                = 1<br>  memory_size                       = 128 # MB<br>  timeout                           = 60  # seconds<br>  architectures                     = [&quot;arm64&quot;]<br>  publish                           = true # required otherwise get error 
&quot;We currently do not support adding policies for $LATEST.&quot;<br>  cloudwatch_logs_retention_in_days = 90<br>  environment_variables = {<br>    SLACK_TOKEN_ARN          = data.aws_secretsmanager_secret.slack_token.arn<br>    SLACK_SIGNING_SECRET_ARN = data.aws_secretsmanager_secret.slack_signing_secret.arn<br>    REGION_NAME              = data.aws_region.current.name<br>    KNOWLEDGE_BASE_ID        = var.knowledge_base_id<br>    MODEL_ARN                = local.bedrock_model_arn<br>  }<br>  allowed_triggers = {<br>    APIGatewayAny = {<br>      service    = &quot;apigateway&quot;<br>      source_arn = &quot;${module.api_gateway.api_execution_arn}/*&quot;<br>    }<br>  }<br>}<br><br><br>##<br># iam.tf<br>##<br>resource &quot;aws_iam_policy&quot; &quot;lambda&quot; {<br>  name   = &quot;tfc-help-slackbot-${var.environment}&quot;<br>  policy = data.aws_iam_policy_document.lambda.json<br>}<br><br>data &quot;aws_iam_policy_document&quot; &quot;lambda&quot; {<br>  statement {<br>    sid    = &quot;CloudWatchCreateLogGroupAccess&quot;<br>    effect = &quot;Allow&quot;<br>    actions = [<br>      &quot;logs:CreateLogGroup&quot;,<br>    ]<br>    resources = [<br>      &quot;arn:aws:logs:${data.aws_region.current.name}:${local.account_id}:*&quot;,<br>    ]<br>  }<br>  statement {<br>    sid    = &quot;CloudWatchWriteLogsAccess&quot;<br>    effect = &quot;Allow&quot;<br>    actions = [<br>      &quot;logs:CreateLogStream&quot;,<br>      &quot;logs:PutLogEvents&quot;,<br>    ]<br>    resources = [<br>      &quot;arn:aws:logs:${data.aws_region.current.name}:${local.account_id}:log-group:/aws/lambda/${local.service_name}:*&quot;,<br>    ]<br>  }<br>  statement {<br>    sid    = &quot;BedrockAccess&quot;<br>    effect = &quot;Allow&quot;<br>    actions = [<br>      &quot;bedrock:InvokeModel&quot;,<br>      &quot;bedrock:RetrieveAndGenerate&quot;,<br>      &quot;bedrock:Retrieve&quot;,<br>    ]<br>    resources = [<br>      
&quot;arn:aws:bedrock:${data.aws_region.current.name}:${local.account_id}:knowledge-base/${var.knowledge_base_id}&quot;,<br>      local.bedrock_model_arn,<br>    ]<br>  }<br>  statement {<br>    sid    = &quot;SecretsManagerAccess&quot;<br>    effect = &quot;Allow&quot;<br>    actions = [<br>      &quot;secretsmanager:GetSecretValue&quot;,<br>    ]<br>    resources = [<br>      data.aws_secretsmanager_secret.slack_token.arn,<br>      data.aws_secretsmanager_secret.slack_signing_secret.arn,<br>    ]<br>  }<br>  statement {<br>    effect  = &quot;Allow&quot;<br>    actions = [&quot;kms:Decrypt&quot;]<br>    resources = [<br>      &quot;arn:aws:kms:*:1234567890:key/mrk-abcd123456789abcd123&quot;,<br>    ]<br>  }<br>}<br><br><br>##<br># outputs.tf<br>##<br>output &quot;api_endpoint&quot; {<br>  value       = module.api_gateway.api_endpoint<br>  description = &quot;This is the API endpoint to save in the slack app configuration&quot;<br>}</pre><p>Note that this Terraform code presupposes that the following sensitive values were previously set in AWS Secrets Manager:</p><ul><li>{account_name}/tfc_help_slackbot/slack_token</li><li>{account_name}/tfc_help_slackbot/slack_signing_secret</li></ul><h3>Where can you use knowledge bases in your work?</h3><p>Are there situations where you wish you had access to an LLM that also had knowledge specific to your team or company? Think of scenarios such as:</p><ul><li>Information lookup (e.g. error codes)</li><li>Answering common questions</li></ul><p>Do you have a high-quality text-based dataset?</p><ul><li>FAQ docs</li><li>Public web documentation</li><li>Conversation histories (fact-checked)</li></ul><p>If you can answer yes to both questions above, then you might want to consider using a knowledge base.</p><p>You will also want to assess the security and privacy risks to your company.
Some questions we asked before starting development:</p><ul><li>Is this data sensitive / proprietary?</li><li>What is the downside risk of an incorrect result or hallucination?</li><li>Which models are already approved for use at Benchling? Can we use one of these models, or do we need to get a new model approved?</li></ul><p>Overall it was relatively quick to get this prototype up and running. We advocate for experimenting with new tools and technologies as soon as they become available, and this is one technology that seems mature enough for broader use. We hope this was a helpful guide that can support you in building your own LLM-based tools!</p><hr><p><a href="https://benchling.engineering/building-an-llm-powered-slackbot-557a6241e993">Building an LLM-Powered Slackbot</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling Scientific Data: Migrating Benchling’s Schema Model for Performance at Scale]]></title>
            <link>https://benchling.engineering/scaling-scientific-data-migrating-benchlings-schema-model-for-performance-at-scale-2a91cf971040?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/2a91cf971040</guid>
            <category><![CDATA[data-modeling]]></category>
            <category><![CDATA[scalable-architecture]]></category>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[database-optimization]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Melody Ding]]></dc:creator>
            <pubDate>Wed, 04 Dec 2024 15:01:06 GMT</pubDate>
            <atom:updated>2024-12-04T15:10:06.356Z</atom:updated>
            <content:encoded><![CDATA[<p>Benchling is a unified platform for scientific data. It allows scientists to collaborate on complex science, automate work, and power AI. Customers store large volumes of data on our platform, leveraging it across many applications both within Benchling and in their own infrastructure. It’s critical that customer data is accessible in a performant and scalable way.</p><p>In this article, we’ll explore a recent shift in how we store and retrieve customer data. By migrating to a more compact structure, we’ve tackled key performance challenges associated with increased data volumes. This transition has required a careful balance between speed and flexibility, as well as a phased approach that minimized disruption for users.</p><h3>Benchling Schemas</h3><p>At the core of Benchling’s system is <em>Schemas</em>, a product that allows both Benchling internal teams and customers to configure the various shapes of data, defining fields, attributes, and constraints that entities must follow. These data structures represent entities like equipment, storage, biological molecules, workflows, tasks, lab notes, and recorded results from scientific tests. Schemas reside in what we refer to as the <strong>definition layer</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/379/0*6kTsr8z-zKVheVb4" /><figcaption>An example schema for defining the data structure of a molecule</figcaption></figure><p>Each instance of a schema, referred to as a <strong>schematizable item</strong>, represents the actual data input by scientists. We call this the <strong>instance layer</strong>. These items are populated with field values conforming to the schema’s defined fields. 
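</p><p>To make the two layers concrete, here is a deliberately simplified sketch (not Benchling’s actual data model): the definition layer declares the fields an item must carry, and each item in the instance layer is checked against it:</p><pre>
```python
# Definition layer: a schema declares the fields an item must carry.
molecule_schema = {
    "name": str,
    "formula": str,
    "molecular_weight": float,
}


def validate_item(schema: dict, item: dict) -> list:
    """Return a list of violations; an empty list means the item conforms."""
    errors = []
    for field, expected_type in schema.items():
        if field not in item:
            errors.append(f"missing field: {field}")
        elif not isinstance(item[field], expected_type):
            errors.append(f"wrong type for {field}")
    # Reject values the definition layer doesn't know about.
    errors.extend(f"unknown field: {f}" for f in item if f not in schema)
    return errors
```
</pre><p>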
As Benchling’s user base grows and the amount of schematized data ingested into the platform increases every year, optimizing the storage of field values has become crucial.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*e-N2ohX2XPDnNXk8" /><figcaption>Relationship between actual instances of a molecule and its defined schema</figcaption></figure><h3>The Challenge: Scale and Performance</h3><p>Historically, Benchling saw a shift from manual data upload by scientists to integrations with lab equipment, leading to automated data collection. This significantly increased the speed and volume of data ingestion.</p><p>Assay results, capturing experimental data, are the most common schematized items in Benchling. Assays are laboratory procedures used to measure the presence, amount, or activity of a specific target (such as a molecule or biological entity) in a sample. By 2021, declining assay results ingestion performance made it apparent that we needed a more scalable approach to storing field values to (1) <strong>improve data ingestion speed</strong> and (2) <strong>avoid database scaling limitations</strong>, particularly to stay within PostgreSQL size limits without resorting to sharding.</p><h3>The Old World: Entity-Attribute-Value Model (EAV)</h3><p>Benchling initially adopted an <strong>Entity-Attribute-Value (EAV)</strong> model, a flexible data model that is well-suited for storing sparse data. This model was effective in Benchling’s early years when data volumes were manageable and access patterns were less defined.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*erZlkVXqMn0G6tX3" /><figcaption>Attribute values used to be stored using an EAV model</figcaption></figure><p>However, as the volume of data increased, several shortcomings of the EAV model emerged:</p><p><strong>Sparse Rows:</strong> In the EAV table, each data type was represented as a separate column, but only one column per row was typically populated. 
This resulted in sparse rows with unused columns.</p><p><strong>Metadata Overhead:</strong> Postgres metadata for each row created inefficiencies at scale. Each attribute required its own row, leading to O(<em>n*k</em>) rows for <em>n</em> entities and <em>k</em> attributes. Postgres’s 23-byte overhead per row exacerbated the space inefficiency.</p><p><strong>Inefficient Access Patterns: </strong>Most reads require fetching all the attributes for an entity at once. Spreading these attributes across many rows required us to query for and return many rows to read a single entity.</p><p><strong>Extraneous Joins: </strong>Since each entity’s attributes are stored in separate rows, querying for matches on multiple attributes required multiple joins against the EAV table. For example, finding entity matches on “name”, “formula”, and “weight” would require that the EAV table join against itself two times.</p><p>These limitations made the EAV model less practical for Benchling’s needs.</p><h3>The New World: Using PostgreSQL’s JSONB</h3><p>To address EAV shortcomings, we adopted PostgreSQL’s JSONB data type, which supports efficient key-value querying and indexing. This allowed us to condense all the field values for an entity into a single JSON blob stored in one row.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*4I8NhTgmmOaX947l" /><figcaption>We now use a JSONB column to store attribute values</figcaption></figure><p>This new model had several advantages:</p><ul><li><strong>Compact Data Storage:</strong> The JSONB format significantly reduced the number of rows by storing all attributes of an entity in one row.</li><li><strong>Improved Read and Write Performance:</strong> Since most reads involve retrieving all the attributes of an entity, querying a single row proved much faster than querying for attributes across several rows. 
Similarly, when uploading an entity, writing one large row to the database proved to be faster than writing many sparse rows.</li></ul><p>However, this model also introduced new challenges:</p><ul><li><strong>Querying Field Information:</strong> Some questions are less efficient to answer. For example, checking if a field has non-empty values requires more complex queries that dive into JSON structures. A key-value index could mitigate this issue.</li><li><strong>Lock Contention:</strong> Since all the fields of an entity are stored in one JSONB document, if one process updates the “formula” field and another process updates the “weight” field, they are both modifying the same row. This can create a bottleneck, as only one process can lock the row at a time. In contrast, in the old EAV model, different fields were stored in different rows, so there was less contention. This trade-off was considered acceptable because such simultaneous writes are rare in Benchling’s current usage patterns.</li></ul><p>To mitigate some of these challenges, we also considered alternatives like <a href="https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-GIN">GIN indexes</a> for faster key-value lookups within JSONB fields. Specifically, we need to quickly traverse entity linkages, and answer questions like, “What other entities link to a given entity?”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*_qSp7CqyX6keKXPg" /><figcaption>The new JSONB structure makes answering questions like above less straightforward</figcaption></figure><p>However, while GIN indexes speed up reads, they slow down inserts because PostgreSQL needs to update the index every time a new JSONB document is inserted. Because we are storing millions of records, constantly updating a GIN index would become a bottleneck. 
Since we were already bottlenecked on data ingestion speed, we opted for a simpler approach: adding a table to track linkages between entities.</p><p>Inserting a row for every linkage does cause high overhead and is less space efficient than a GIN index, but this is no worse than our previous model where we also stored linkages in a separate table outside of the EAV table. The linkage table allows us to make minor updates to keys without triggering a re-index of the entire JSONB structure, reducing unnecessary overhead. Additionally, by recording the attribute name in the linkage table, we can efficiently query not only which items are linked together but also the specific attribute that establishes the link.</p><h3>Performance Improvements</h3><p>After rolling out the new model, as expected, we saw improved performance in bulk reads and writes. This improved performance across areas such as our data warehouse and results ingestion in notebook tables.</p><p>Here is a sample of our resulting metrics from internal tests and aggregate data from production environments:</p><ul><li>Up to <strong>7x faster ingestion</strong> of assay results</li><li><strong>33% faster</strong> to map items from our internal database to our data warehouse models</li><li><strong>60% faster </strong>to find items that need to be updated in the data warehouse when a sequence is updated</li><li>Querying entities in our Analysis product is about <strong>2x faster</strong></li></ul><p>Performance results may vary depending on data volume, workload patterns, and system configuration. The improvements highlighted here are most significant in environments with higher data volumes.</p><h3>Granular Rollouts for Data Integrity</h3><p>Schema field values are pervasive across Benchling, necessitating data parity between the old and new systems. We first applied the rollout phases to our results product, since that’s where Benchling was scaling the most and starting to encounter ingestion slowness. 
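</p><p>Mechanically, the migration boils down to collapsing the many (entity, attribute, value) rows of the EAV table into one JSONB-style document per entity. A minimal illustrative sketch (not our migration code):</p><pre>
```python
from collections import defaultdict


def collapse_eav_rows(eav_rows):
    """Fold (entity_id, attribute, value) EAV rows into one
    JSONB-style document per entity."""
    documents = defaultdict(dict)
    for entity_id, attribute, value in eav_rows:
        documents[entity_id][attribute] = value
    return dict(documents)
```
</pre><p>With <em>n</em> entities and <em>k</em> attributes, this turns O(<em>n*k</em>) EAV rows into <em>n</em> documents, which is the row-count saving described above.</p><p>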
To overcome the limitations of written tests — such as the inability to account for all possible edge cases encountered in production — we rolled out this refactor over the course of three years in several phases.</p><p><strong>Stage 1: Dual writes and integrity checks:</strong> Every value update wrote to <em>both</em> the old tables and the new tables, while reads still came from the old tables. At this stage, we ran nightly backfills from the old table to the new table, logging any discovered inconsistencies along the way. Using the nightly integrity check, we patched code paths that either missed writes to the new table or caused inconsistent writes.</p><p><strong>Stage 2: Switching to JSONB reads (still with integrity checks):</strong> Once we were confident that the new tables contained the correct field values, we switched to reading from them. We continued writing to both tables at this stage, along with the nightly integrity check between the two.</p><p><strong>Stage 3: Deprecating the old EAV tables: </strong>This was the point of no return. We deprecated the old field values table and stopped writing to it. Any attempt to access the old field value tables raised an error; we preferred a hard failure over potentially missing a write to the new tables or reading corrupt data from the old tables.</p><h3>Challenges and Solutions</h3><p>Throughout the migration, we encountered various challenges, particularly with maintaining <strong>data integrity</strong> and ensuring <strong>system boundaries</strong> were clear. A few key issues included:</p><ul><li><strong>Transaction Isolation</strong>: Using a READ-COMMITTED isolation level helped prevent most race conditions without causing process blocks.
However, some race conditions required additional <a href="https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS">advisory locks</a> on top of Postgres’s row-level locks.</li><li><strong>Deadlock Management</strong>: With the new system, we had to manage potential deadlocks that arose from lock contention when multiple processes updated fields and related data on an overlapping range of entities.</li><li><strong>Type System Flexibility</strong>: The less-strict type system in the JSONB model introduced new complexities in input validation and data coercion. Ensuring that different field types were correctly represented required meticulous attention to detail, more thorough testing, and close collaboration between product and engineering teams to align on expected behaviors.</li><li><strong>Scope Creep: </strong>With such a significant refactor, there was a temptation to enhance existing architectures beyond a one-to-one migration. To avoid endlessly expanding the project scope, we focused on a few select improvements that offered major benefits. One example was restructuring how we check for unique constraint violations between schematizable items. This restructuring reduced the time for<em> </em>duplicate checking from <em>several minutes to just 10–20 milliseconds</em>, significantly improving performance for customers with large datasets.</li></ul><p>To overcome most of these challenges, we emphasized the importance of <strong>thorough testing</strong> and <strong>clear error logging</strong> to identify and fix issues in a timely manner.</p><h3>Looking Ahead: Building for Resilience and Scale</h3><p>To keep pace with the growing demands of our platform, we’re focusing on several enhancements that will make Benchling even more efficient, resilient, and ready for the future. 
Here’s how we are preparing:</p><p><strong>Clearer Performance Monitoring: </strong>Transparency about how well your product performs is the first step toward a great user experience. One of our immediate priorities is to instrument richer metrics that give us deeper insights into the system’s performance, both at a granular level (such as individual field value access times) and across the broader application. This will help us quickly identify bottlenecks and address them proactively.</p><p><strong>Refactoring Field Access Patterns:</strong> Despite the separation between our data layer and our application logic, changing the implementation of the data layer <em>does</em> affect the application layer. With the shift to storing schematized field values as JSON blobs, we have the opportunity to optimize how different teams across Benchling access these values to take advantage of the new representation, which could reduce lock contention and improve overall system performance.</p><p><strong>Strengthening System Boundaries:</strong> Modular code is key to fast development in a growing team and a rapidly expanding product. We plan to build cleaner interfaces and enforce stricter contracts across modules. This will reduce the complexity of future migrations and keep the code maintainable.</p><h3>Key Takeaways</h3><p>From this migration project, we’ve learned several valuable lessons:</p><ol><li><strong>Trade-offs in Data Modeling:</strong> While the new JSONB-based model solved many performance and storage issues, it introduced challenges around querying and lock contention. It’s important to carefully evaluate these trade-offs to ensure the benefits outweigh the drawbacks.</li><li><strong>Granular Rollouts Minimize Risk:</strong> Rolling out the new data model in multiple phases allowed us to address edge cases and ensure data integrity.
This staged approach is essential for large-scale migrations that touch critical parts of the system.</li><li><strong>Importance of Robust Testing and Monitoring:</strong> A combination of unit testing, integration testing, and live integrity checks was key to identifying and addressing issues in the system.</li><li><strong>Cross-team Collaboration is Crucial:</strong> Since field values are accessed and updated across various teams and workflows, collaborating with product managers and other engineering teams was essential. Ensuring that the new model integrates smoothly into all parts of the system required clear communication and coordination.</li></ol><h3>Building for Scale</h3><p>Migrating from the entity-attribute-value model to a more compact JSONB-based representation ensured that Benchling can scale with its growing data ingestion demands. While the new model has significantly improved read and write performance, it also required careful planning to manage the trade-offs involved, such as lock contention and querying complexity. Our phased approach to the migration enabled us to catch and resolve issues as they arose, ensuring a smooth transition with minimal impact on users.</p><p>Moving forward, we will continue refining the system by improving performance metrics, optimizing data access patterns, and ensuring that our infrastructure is capable of supporting Benchling’s continued growth. The lessons learned during this migration process have not only strengthened our engineering practices but have also set the foundation for more scalable and efficient data models in the future. 
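The staged dual-write rollout described above can be sketched in miniature with in-memory stand-ins for the two table families (all names here are hypothetical illustrations, not Benchling’s actual code):

```python
# Toy stand-ins for the two representations: OLD_EAV holds one row per
# (entity_id, field_name); NEW_JSONB holds one document per entity.
OLD_EAV: dict = {}
NEW_JSONB: dict = {}

def write_field(entity_id, field, value):
    """Stage 1 dual write: every update lands in both stores."""
    OLD_EAV[(entity_id, field)] = value
    NEW_JSONB.setdefault(entity_id, {})[field] = value

def integrity_check():
    """Nightly check: report (entity_id, field) pairs that disagree."""
    return [
        (entity_id, field)
        for (entity_id, field), value in OLD_EAV.items()
        if NEW_JSONB.get(entity_id, {}).get(field) != value
    ]

write_field("plasmid-1", "backbone", "pUC19")
write_field("plasmid-1", "resistance", "ampicillin")
assert integrity_check() == []  # dual writes agree

# A code path that forgets the new store shows up in the nightly check.
OLD_EAV[("plasmid-2", "backbone")] = "pBR322"
assert integrity_check() == [("plasmid-2", "backbone")]
```

In the real system the stores are Postgres tables and the check is a nightly backfill job, but the invariant is the same: every write lands in both places, and any disagreement surfaces before reads are switched over.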
With a focus on collaboration and continuous improvement, we’re well-positioned to tackle the challenges ahead.</p><hr><p><a href="https://benchling.engineering/scaling-scientific-data-migrating-benchlings-schema-model-for-performance-at-scale-2a91cf971040">Scaling Scientific Data: Migrating Benchling’s Schema Model for Performance at Scale</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A behind-the-scenes look at building interactive analysis capabilities in Benchling]]></title>
            <link>https://benchling.engineering/a-behind-the-scenes-look-at-building-interactive-analysis-capabilities-in-benchling-fa6ec1bab1e5?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/fa6ec1bab1e5</guid>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[apache-arrow]]></category>
            <category><![CDATA[duckdb]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[apache-parquet]]></category>
            <dc:creator><![CDATA[Wonja Fairbrother]]></dc:creator>
            <pubDate>Tue, 11 Jun 2024 13:01:25 GMT</pubDate>
            <atom:updated>2024-06-11T13:01:24.580Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>Authors:</strong> <a href="https://medium.com/u/659f0bd47f1e">Wonja Fairbrother</a> and <a href="https://medium.com/u/14ad0a4d36fc">Eli Levine</a></p><p>Science is iterative. To design the next experiment, scientists need to analyze the results of previous ones. Interactive Analysis in Benchling allows scientists to perform real-time data transformation, visualization, and analysis without having to transfer it into other systems. In this post we will describe the architecture behind interactive analysis capabilities in Benchling and give a peek into the decision journey we took along the way¹.</p><p>Interactive Analysis allows scientists to:</p><p>1. Select data from many sources:</p><ul><li>Benchling entity and results data</li><li><a href="https://www.benchling.com/blog/open-source-data-standards-allotrope">Instrument data</a></li><li>Notebook tables</li><li>Data upload via both API and UI</li></ul><p>2. Transform, visualize, and analyze data in real time, without leaving Benchling:</p><ul><li>Data transformations: filtering, aggregations, window functions, etc.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EFswzasV35LBugAf" /></figure><ul><li>Visualizations: line chart, bar chart, scatter plot, etc.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-44FTMiBPbvZ8Zeb" /></figure><ul><li>Scientific analysis methods: <a href="https://en.wikipedia.org/wiki/IC50">IC50</a> and various curve fitting functions</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FqJD1L_KWRjcPpGd" /></figure><h3>Overall architecture</h3><p>The architecture backing Interactive Analysis consists of:</p><ol><li>The Benchling web application</li><li>An auto-scaling stateless internal service running on EKS that performs the transformations</li><li>Temporary S3 storage locations for input and output data, shared between the web app and the service</li></ol><p>The 
frontend of the application is responsible for taking in input datasets and transformation configurations from users. The backend of the web application collects all the input data from the appropriate sources, serializes and uploads the data to S3, and sends a synchronous transformation request to the service.</p><p>The service’s API consists of one main endpoint that takes in a JSON payload of transformation parameters. The service can accept a single transformation, or a list of many transformations to perform. In this endpoint, the service downloads and deserializes the input data, performs the transformation with an analysis engine, and serializes and uploads the resulting data to S3. Each request spins up its own self-contained in-memory analysis engine, so that no data is stored outside of memory or between requests. As such, the service can serve requests for multiple customers concurrently.</p><p>When the request completes, the web app downloads and deserializes the output data and displays it to the user. The frontend enables charting and visualization on the output data from any transform step. The user can then apply more transformations in an iterative manner, and the output then becomes the input to the next transformation.</p><p>All of these input and output data files can pile up very quickly in S3. However, since analysis is an exploratory and iterative process, they are ephemeral — therefore they are stored in a “temporary” S3 bucket with a lifecycle configuration to keep storage costs down.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SPqJYII01jTjEprt" /></figure><h3>Service internals</h3><h3>Data storage and representation</h3><p>Because the implementation details of the transformations are not exposed to the user, we have the freedom to choose the data format that works best for us and our architecture. 
However, there are some considerations to take into account:</p><ul><li>Type information for each column should be preserved as data is transformed.</li><li>Primitive data types determine valid operations and visualizations on specific columns.</li><li>Benchling types (references to a first class object in Benchling like an Entity, DNASequence, or Protocol) are used to allow object linkage in the application.</li><li>Type awareness is critical for building successful machine learning models, and vastly simplifies feature engineering and preparation of AI-ready data.</li><li>The data format should be optimized for read-heavy analytical queries.</li><li>De/serialization latency should be as low as possible to support the interactive experience.</li></ul><p>Given these factors, we chose <a href="https://parquet.apache.org/">Parquet</a> as our preferred data storage format, and <a href="https://arrow.apache.org/docs/index.html">Arrow</a> as our in-memory data representation. Parquet is a compact and efficient columnar file format, and Arrow is a language-agnostic columnar data structure platform. Using Parquet and Arrow together unlocks many advantages. The columnar in-memory layout allows for O(1) random access, efficient column pruning, predicate pushdown, and improved data compression in analytical workloads. Arrow’s Parquet serializer also has a lightweight schema encoding mechanism to attach type information while keeping data interchange fast.</p><p>We support other file formats (CSV, JSON, Avro) for analytical data at Benchling, but using Parquet for Interactive Analysis queries gives us the speed we are looking for. For the flexibility to switch between formats, we have a BenchlingDataFrame wrapping interface that is shared between services. This interface handles serialization and type schema bookkeeping. 
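As a rough, standard-library-only illustration of that kind of bookkeeping (a toy, not the actual BenchlingDataFrame interface):

```python
import json

class TypedTable:
    """Toy typed table: columns plus an explicit type schema, so type
    information survives serialization round trips."""

    def __init__(self, columns, schema):
        assert columns.keys() == schema.keys(), "every column needs a type"
        self.columns = columns
        self.schema = schema  # e.g. {"od600": "float", "sample": "EntityRef"}

    def to_json(self):
        # The schema travels with the data, much as Arrow attaches
        # schema metadata when serializing to Parquet.
        return json.dumps({"schema": self.schema, "columns": self.columns})

    @classmethod
    def from_json(cls, payload):
        data = json.loads(payload)
        return cls(data["columns"], data["schema"])

table = TypedTable(
    {"sample": ["s1", "s2"], "od600": [0.42, 0.57]},
    {"sample": "EntityRef", "od600": "float"},
)
round_tripped = TypedTable.from_json(table.to_json())
assert round_tripped.schema["sample"] == "EntityRef"  # type info preserved
```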
The service has an I/O module that talks to S3 and de/serializes the data using the BenchlingDataFrame interface.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*c9UTG8ApowSFK4OI" /></figure><h3>Analysis kernel</h3><p>For the analysis internals of the service, our aim is to keep things as lightweight as possible, with our low latency goals in mind. Our users are accustomed to performing these types of analyses locally on their machines, so the hosted experience needs to feel just as responsive.</p><p>Transformations fall into two categories: basic and advanced. Think of basic operations as things you might want to do in Excel like filter, pivot, or create new columns. Advanced operations are those that involve more complex calculations, statistics, or machine learning.</p><p>The vast majority of transformations that we anticipate our users will perform are basic transformations. In addition, we are primarily working with tabular data. This makes the choice to use <a href="https://duckdb.org/">DuckDB</a> as a base case for transforms straightforward. DuckDB is an in-process SQL OLAP engine with support for larger-than-memory processing, APIs in multiple languages, direct querying of different file formats (Parquet, CSV, JSON), an active community, and many useful extensions and integrations being added with each new version. It is optimized for fast analytical queries on tabular data, fitting our use case exactly. DuckDB can also directly operate on Arrow and Pandas objects.</p><p>While DuckDB supports a wide range of SQL functions and complex SQL queries that mostly cover basic transformations, we want to be able to use different libraries and tools for transforms that involve advanced statistical functions. To accomplish this, we have an ExecutionEngine wrapper class that can choose the right underlying methods to use for a given operation.
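One possible shape for such a dispatcher, with both engines stubbed out as plain functions (hypothetical names and a toy row format, not the real implementation):

```python
# Toy dispatcher: route each transform to the engine best suited to it.
BASIC_OPS = {"filter", "pivot", "aggregate"}  # SQL-friendly operations
ADVANCED_OPS = {"four_pl", "ic50"}            # stats-heavy operations

def run_sql_engine(op, rows):
    """Stand-in for the SQL engine (DuckDB in the real service)."""
    if op == "filter":
        return [r for r in rows if r["od600"] > 0.5]
    raise NotImplementedError(op)

def run_stats_engine(op, rows):
    """Stand-in for the stats engine (Pandas/Numpy in the real service)."""
    if op == "four_pl":  # placeholder for a real 4PL curve fit
        return {"fit": "4PL", "n": len(rows)}
    raise NotImplementedError(op)

class ExecutionEngine:
    def execute(self, op, rows):
        if op in BASIC_OPS:
            return run_sql_engine(op, rows)
        if op in ADVANCED_OPS:
            return run_stats_engine(op, rows)
        raise ValueError(f"unknown operation: {op}")

engine = ExecutionEngine()
rows = [{"od600": 0.42}, {"od600": 0.57}]
assert engine.execute("filter", rows) == [{"od600": 0.57}]
assert engine.execute("four_pl", rows) == {"fit": "4PL", "n": 2}
```

The real dispatcher hands Arrow tables to DuckDB or Pandas/Numpy; this stubbed version only shows the routing decision.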
Here, our use of Arrow gives us an advantage too — Arrow has zero-copy interoperability between systems (<a href="https://arrow.apache.org/overview/">ref</a> — see image). This allows us to use DuckDB for SQL-like operations, and libraries such as Pandas for others, selecting the best underlying tool to perform each analytical action on any input Arrow Table. For example, when the service receives a request for a four-parameter logistic regression (4PL) transformation, the ExecutionEngine can simply zero-copy convert the Arrow Table to a Pandas DataFrame, and call the right function that uses Pandas or Numpy rather than DuckDB.</p><h3>Summary</h3><p>We use Apache Parquet in tandem with Apache Arrow for fast serialization and optimized read-heavy analytical queries, and use our BenchlingDataFrame interface to preserve data type information. The service’s ExecutionEngine wrapper and Arrow’s zero-copy interop allow us to use DuckDB for common operations while having the option to bring other packages into play when needed.</p><h3>Key architectural decisions</h3><p>When designing a new system there are invariably decisions that must be made along the way. Here we highlight some notable decision points that our team encountered, and our thought process for resolving them. We started the project with a proof of concept (POC) that used the simplest architecture possible for Benchling’s environment: a stateless service that synchronously performed data transformations on datasets stored in S3. We started with a <em>service</em> because Benchling’s infrastructure team has built excellent support for quickly standing up Kubernetes services. This was a paved path with few unknowns.
We started with a <em>stateless service</em> because state requires management, which introduces complexity, which stood in the way of getting an end-to-end prototype into customers’ hands quickly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Nc54-vjDL3fBLPcO" /></figure><h3>Stateless vs. stateful service</h3><p>A user interactively analyzing their data usually performs a number of data operations in sequence, such as loading a dataset, applying a filter, applying another filter and performing an aggregation. One decision we had to make is whether we should implement caching of intermediate data during a user’s interactive analysis session. Basically: save the results of an operation, such that the next user operation is performed faster.</p><p>We considered a number of possible solutions to caching intermediate data. One approach was to employ an architecture in some ways similar to <a href="https://jupyter.org/hub">JupyterHub</a>, with a backend kernel dedicated to each user session. All user interactions within a user session would be routed to the same dedicated node that would cache data in memory for subsequent calls to access with low latency. Another approach was to keep a cluster of nodes with routing logic either based on user session or dataset hash. The idea would be the same: subsequent calls would likely encounter datasets already in memory. Before embarking on the journey of building one of these approaches, we spent time questioning whether we needed to go this route at all.</p><p>The life cycle of a single data operation, such as a filter, without caching is something like this: (1) read data from S3 (mostly network overhead), (2) load data into DuckDB, (3) perform actual data operation, (4) write out results to S3. Steps (3) and (4) are unavoidable: they would have to be performed even with caching. (3) is the actual data operation, and writing out data to S3 in (4) is required for the UI to render charts and tables.
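That uncached lifecycle can be sketched with S3 and the engine stubbed out in plain Python (hypothetical names; a toy, not the actual service code):

```python
# Toy sketch of one uncached operation; FAKE_S3 stands in for S3 and a
# plain dict stands in for the in-memory engine table.
FAKE_S3 = {"input": [0.2, 0.8, 1.5]}

def fetch(key):                 # (1) read input data (network overhead)
    return FAKE_S3[key]

def load(raw):                  # (2) load into the analysis engine
    return {"rows": raw}

def filter_op(table, min_val):  # (3) the actual data operation
    return {"rows": [r for r in table["rows"] if r >= min_val]}

def store(key, table):          # (4) write results for the UI to render
    FAKE_S3[key] = table["rows"]

store("output", filter_op(load(fetch("input")), 0.5))
assert FAKE_S3["output"] == [0.8, 1.5]
```

Only steps (3) and (4) would remain under caching, so the decision reduces to how expensive steps (1) and (2) really are.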
The main question to answer was whether the S3 network transfer + load into Arrow times were significant enough to invest in some sort of caching of intermediate data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/734/0*pxAeZGcTf7AZs4iq" /></figure><p>A quick test in a representative environment showed that, even for some of the largest datasets we expect, the additional overhead of always loading from S3 is on the order of 350 ms, which is within product requirements for latency. In this case, the simpler solution was deemed sufficient, at least until Benchling has a need to support significantly larger datasets.</p><h3>Service vs on-demand container</h3><p>On the topic of big data, there is the possibility that our service will not be able to handle very large datasets and/or a large number of transformations without running out of the memory allocated to a single replica. In order to scale for these scenarios, one could argue that we should spin up a container with enough memory to handle the request, rather than sending all requests to the memory-constrained service.</p><p>Then, why not just use on-demand containers for all requests to begin with? The primary reason is latency — we cannot viably add the time cost of spinning up a container (on the order of a few seconds) to the round trip of a transformation, especially for trivial operations like filtering. And, as mentioned above, currently the vast majority of analyses are done on data that fit comfortably into service worker nodes. Therefore, using the service is acceptable and preferable in most cases.</p><p>Nonetheless, having the option of running transforms in a container would enable Benchling’s larger customers, with correspondingly larger datasets, to use Interactive Analysis. In addition, we plan to allow customers to run a saved set of analytical transformations on incoming data in an automated manner.
This truly asynchronous, non-interactive use case is a good fit for an on-demand container: there is no user sitting in front of a screen waiting on the result, so we can afford the extra spin-up time. Our infrastructure team’s internal compute framework allows us to package up our service image and use it to run one-off jobs, making this option easy to integrate in the future.</p><h3>Result data in payload vs S3</h3><p>Realistically, a user will only need to inspect a preview or summary of a transformation result, especially if the result is large. That being the case, the service could simply return the displayable result preview in the response payload, cutting out the need for the web application to download and deserialize the full output from S3. It might seem as though this option would be the most straightforward for the POC, but it would have required a large refactor in the backend of the web app. The de facto way that the backend represents this tabular data requires that the full object be downloaded and deserialized into memory. So, we decided to use the existing functionality for the initial version, and show the result in the UI by reading from S3. Now that we are in the stage of improving and optimizing the architecture, we are refactoring the web app’s dataset representation to allow partial loading. From there, the service can pass back only what is needed for display, and the web app can use it directly, further reducing end-to-end transformation latency.</p><p>Interactive Analysis is a powerful addition to Benchling, empowering scientists and researchers to perform complex data analysis with ease, without switching between applications. Hopefully we have been able to show you some of the internal mechanics that make the product tick, and the architectural decisions that went into building it.
We are excited to see the impact it will have on advancing research and look forward to the innovative ways in which our users will use it to power their R&amp;D.</p><p>[1]: This post may describe features in testing that are subject to change.</p><hr><p><a href="https://benchling.engineering/a-behind-the-scenes-look-at-building-interactive-analysis-capabilities-in-benchling-fa6ec1bab1e5">A behind-the-scenes look at building interactive analysis capabilities in Benchling</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Signals, shells, and docker: an onion of footguns]]></title>
            <link>https://benchling.engineering/signals-shells-and-docker-an-onion-of-footguns-ee592e2b587b?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/ee592e2b587b</guid>
            <category><![CDATA[linux]]></category>
            <category><![CDATA[docker]]></category>
            <category><![CDATA[bash]]></category>
            <dc:creator><![CDATA[raylu]]></dc:creator>
            <pubDate>Wed, 22 May 2024 16:01:32 GMT</pubDate>
            <atom:updated>2024-05-22T16:01:31.522Z</atom:updated>
            <content:encoded><![CDATA[<p>On a few occasions, we’ve needed to debug POSIX signals (SIGINT, SIGTERM, etc.). Inevitably, there’s a shell involved too. One day, we were debugging some weird interaction between signals, shells, and containers and found ourselves bamboozled by some behaviors. People who consider themselves knowledgeable about Linux have found some of the details of our investigation surprising, so read on if this sort of thing doesn’t make you want to defenestrate your laptop and become an alpaca-farming hermit.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kHXBgOZPjmxyt91C.png" /></figure><h3>The scene of the crime</h3><p>At Benchling, we have a pretty standard testing/continuous integration (CI) setup: when you push code to a pull request branch, we run tests for you. A few years back, we added a little optimization: if you push again and tests are still running on the previous commit, we cancel the previous test run. You probably don’t care about that run anyway and we save some money… or do we?</p><p>The code that runs our tests is basically</p><pre>def test_pipeline() -&gt; int:<br>    test_result = subprocess.run([&quot;pytest&quot;, …])<br>    report_test_metrics()<br>    upload_artifacts()<br>    return test_result.returncode</pre><p>So our process tree is</p><pre>test_pipeline<br>└──pytest</pre><p>subprocess.run blocks until the child process exits, so it should take almost all the time. We see in our CI logs that the tests get interrupted halfway through and then we see no more logs, so it sure looks like it’s working. But we’re able to get metrics and artifacts for our canceled runs, which makes no sense. We’ll later discover that while we reported that the run was canceled and stopped forwarding logs, pytest just kept running.</p><h3>Back to basics</h3><p>Thinking that perhaps the problem was not forwarding a signal from test_pipeline to pytest, we thought about basic signal handling first. 
In a terminal running zsh, we can get the pid of zsh with</p><pre>$ echo $$<br>20147</pre><p>Then, we can run bash inside zsh and sleep infinity (like our tests, a very slow command) inside bash.</p><pre>$ bash<br>$ sleep infinity</pre><p>From another shell, we can see the process tree.</p><pre>$ pstree -p 20147<br>zsh(20147)───bash(65453)───sleep(65904)</pre><p>(pstree is in the psmisc package on Debian/Ubuntu and the pstree formula in brew.) This shows zsh running bash running sleep, as expected. If we now send a SIGINT with ctrl+c, sleep stops.</p><p>Why does that happen? The terminal interprets ctrl+c as <a href="https://en.wikipedia.org/wiki/Line_discipline">“send SIGINT”</a>. zsh receives SIGINT and forwards it to the foreground process which happens to be bash. bash receives the signal and forwards it to sleep. sleep didn’t set up its own signal handler for SIGINT and the default signal handler exits (<a href="https://manpages.org/signal/7">SIGINT has the “term” disposition</a>).</p><p>At the start of the investigation, this was our mental model for shell signal handling.</p><h3>Non-interactive shells</h3><p>The actual problem manifested when a shell script was run with bash (we run the python code above in a bash script).</p><pre>bash<br>  └─test_pipeline<br>      └─pytest</pre><p>Thinking that perhaps interactive shells (which read stdin, among other differences) behaved differently than non-interactive ones or “scripts”, we wrote 2 lines to a file</p><pre>sleep infinity<br>echo done</pre><p>and ran</p><pre>$ ./test.sh</pre><p>In another shell, we could see the same process tree</p><pre>$ pstree -p 20147<br>zsh(20147)───bash(65910)───sleep(65911)</pre><p>Then, we tried directly signaling bash</p><pre>$ kill -s INT 65910</pre><p>but nothing happened. 
Buried in the bash docs (man bash) is a <a href="https://www.gnu.org/software/bash/manual/html_node/Signals.html">“signals” section</a> that says</p><blockquote>When job control is not enabled, […] the shell and the command are in the same process group as the terminal, and ‘^C’ sends SIGINT to all processes in that process group. […]</blockquote><blockquote>When Bash is running without job control enabled and receives SIGINT […], it waits until that foreground command terminates and then [exits itself]</blockquote><p>Job control is on by default for interactive shells and off for scripts (see the docs about “monitor mode”). So that explains why nothing happened: bash was waiting for sleep (the foreground command) to terminate.</p><p>But there’s also a hint in there about <a href="https://biriukov.dev/docs/fd-pipe-session-terminal/3-process-groups-jobs-and-sessions/">process groups</a>. pstree can show us those too (unless you’re on macOS):</p><pre>$ pstree -pg 20147<br>zsh(20147,20147)───bash(65910,65910)───sleep(65911,65910)</pre><p>So here, we see that bash, which we ran in an interactive zsh, got its own process group. But sleep, which we ran in a non-interactive bash, shares a pgid with bash. We can signal both processes in the group by negating the pid:</p><pre>$ kill -s INT -65910</pre><p>This causes sleep to receive a SIGINT and exit. bash also received a SIGINT and, like the docs say, exits itself. Back in our interactive zsh, we can run</p><pre>$ sleep infinity</pre><p>and see that sleep gets its own pgid, as expected.</p><pre>$ pstree -p 20147<br>zsh(20147,20147)───sleep(65916,65916)</pre><h3>Last command in a non-interactive shell</h3><p>So now we know that sometimes, the shell won’t forward signals to its child process. At one point, someone tried to reproduce this by running bash -c &#39;sleep infinity&#39;. They were able to ctrl+c and stop sleep. But that’s a non-interactive shell, so bash shouldn’t be forwarding SIGINT! 
What gives?</p><pre>$ bash -c &#39;sleep infinity&#39;</pre><p>As usual, in another shell:</p><pre>$ pstree -p 20147<br>zsh(20147)───sleep(65920)</pre><p>Wait, where did bash go? We ran bash! Why does pstree say that zsh is running sleep?</p><p>When we “run” a program, what we generally mean is we <a href="https://lisper.in/fork-exec-python">fork and then exec</a> it. fork sets the new process’ parent pid so that tools like pstree can come along after the fact and draw a pretty tree. exec sets the new process’ command so that tools like pstree can show you something meaningful about what that pid is running.</p><p>But what happened here is that bash simply didn’t fork before exec-ing sleep. We couldn’t find any documentation about this behavior, so instead we offer you some <a href="https://github.com/mirror/busybox/blob/1_36_0/shell/ash.c#L10566-L10568">ash source code</a>:</p><pre>/* Can we avoid forking? For example, very last command<br>* in a script or a subshell does not need forking,<br>* we can just exec it.<br>*/</pre><p>So bash replaced itself with sleep and pstree shows that the parent of the thing that is now running sleep is zsh. We can get the previous behavior by instead running bash -c &#39;sleep infinity &amp;&amp; echo done&#39;.</p><p>This was particularly exciting because we actually run our bash script with sh -c, so our mental model was</p><pre>sh<br>└─bash<br>    └─test_pipeline<br>        └─pytest</pre><p>for a bit until we realized the sh wasn’t its own pid in the tree.</p><h3>A brief interlude about sh, bash, dash, and ash</h3><p>Wait, what is ash? Did you just link me to some unrelated code? (Yes, sort of; the behavior is the same as bash but the source code is less… abstracted.)</p><p>sh is the Bourne shell (but usually referred to as “POSIX sh”). Bash is the Bourne Again shell. Historically, many systems linked sh to bash, which would check argv[0] and run in sh compatibility mode.
On modern Linux systems, <a href="https://en.wikipedia.org/wiki/Almquist_shell#Adoption_in_Debian_and_Ubuntu">sh is now usually dash</a>, but on macOS, it is still bash in sh mode.</p><p>The original ash was the Almquist shell from 1989 written for NetBSD. It was ported to Linux and renamed to dash (Debian Almquist shell). Nowadays, “ash” generally refers to busybox ash, <a href="https://github.com/mirror/busybox/blob/1_36_0/shell/ash.c#L29-L31">which is a derivative of dash</a>. Yes, you read that correctly: the lineage is ash → dash → ash. Shell programmers are not the best at naming things.</p><p>By the way, bash in sh compatibility mode and ash both implement the exec-without-fork behavior described in the previous section, but dash does not. Also, if you try to run sh in the official bash image on Docker Hub (docker run -it --rm bash sh), instead of bash in sh compatibility mode like you’d expect, you get ash (not to be confused with ash).</p><h3>Flowchart</h3><p>Here’s the flowchart that we wish had existed before we started peeling the onion of shell signal handling.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6xW6hToI3k5NiWsM" /></figure><h3>Back to the crime scene</h3><p>Armed with our handy flowchart, we went to read our ci-agent’s code and found that when a build is canceled, it sends SIGTERM to the running job.</p><pre>ci-agent<br>    └─bash<br>        └─test_pipeline<br>            └─pytest</pre><p>bash was run non-interactively, test_pipeline was not the last command, so no signals are forwarded anyway. Does that explain what happened?</p><p>We tried to cut the bash out of the tree by making it exec test_pipeline.py, but that didn’t fix the problem. 
That must mean our process tree is still wrong.</p><h3>Containers</h3><p>The ci-agent actually just tells docker to run our script.</p><pre>ci-agent<br>    └─docker<br>        └─bash<br>            └─test_pipeline<br>                └─pytest</pre><p>Are signals being forwarded by docker to bash? Docker creates a new <a href="https://www.redhat.com/sysadmin/pid-namespace">pid namespace</a> for each container, so the command it runs becomes pid 1. 1 is a very special pid (it’s normally the init process) and <a href="https://petermalmgren.com/signal-handling-docker/">doesn’t get default signal handlers</a>. A common trick is to use tini or dumb-init to be pid 1 to solve this problem.</p><p>After investigating our image, it turned out we were already using dumb-init, leaving us with this tree</p><pre>ci-agent<br>    └─docker<br>        └─dumb-init<br>            └─bash<br>                └─test_pipeline<br>                    └─pytest</pre><p>and no explanation for the problem.</p><h3>This is the last tree, I swear</h3><p>Actually, we don’t run the docker container directly; we use docker compose run.</p><pre>ci-agent<br>    └─docker compose<br>        └─docker<br>            └─dumb-init<br>                └─bash<br>                    └─test_pipeline<br>                        └─pytest</pre><p>After finally building this tree, we were able to reproduce the problem. It only occurs on docker compose versions between v2.0.0 and v2.19.0, where docker compose run fails to forward signals. This was fixed <a href="https://github.com/docker/compose/commit/fed8ef6b791813f8b1479cf095a37cb4352fa02d">here</a> after we reported <a href="https://github.com/docker/compose/issues/10586">the issue</a>.</p><p>The bug manifested when we upgraded from docker-compose (v1; note the hyphen) to docker compose (v2). 
Noticing the missing hyphen was necessary to understanding this problem, but it was tough to notice because both versions take nearly identical arguments and have nearly identical behavior. One takeaway from reading this story should be that naming things, despite being hard, is important. If you ever find yourself writing docs like “<a href="https://docs.docker.com/compose/migrate/#docker-compose-vs-docker-compose">update scripts to use Compose V2 by replacing the hyphen (-) with a space</a>”, you’ve probably made a critical naming error.</p><p>Another thing that made debugging this hairy was needing to understand the full chain of custody. Signals need to be forwarded by each process to their children. Understanding why pytest didn’t receive a signal required constructing the tree up to the point that the forwarding chain was broken, which in this case was quite far.</p><p>We considered downgrading back to docker compose v1, but we instead chose to track containers run by our CI step and docker kill them at the end. Later, after upstream fixed the issue, our mitigation simply never kicked in. With the problem fixed, our CI runs now actually stop when we tell them to again. When someone pushes multiple times in quick succession to a PR branch, we don’t waste cycles running on old commits, resulting in faster runs overall! (We also no longer report metrics about these canceled runs, which helps us greatly in identifying flaky or failing tests.)</p><h3>Bonus about foreground processes</h3><p>Back in the “non-interactive shells” section, we had a process tree of</p><pre>zsh(20147)───bash(65910)───sleep(65911)</pre><p>and directly signaled bash with</p><pre>$ kill -s INT -65910</pre><p>Why didn’t we just signal zsh instead? zsh is running interactively, so shouldn’t it forward SIGINT to bash? We can try</p><pre>$ kill -s INT -20147</pre><p>but nothing happens.</p><p>It turns out when you hit ctrl+c in this situation, the terminal sends the SIGINT to bash, not zsh. 
This is because zsh is no longer in the foreground process group. We can see this by running</p><pre>$ ps -xO stat<br>   PID STAT S TTY          TIME COMMAND<br> 20147 Ss   S pts/0    00:00:00 zsh<br> 65910 S+   S pts/0    00:00:00 bash<br> 65911 S+   S pts/0    00:00:00 sleep</pre><p>The <a href="https://www.man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES">“process state codes”</a> section of man ps says</p><blockquote>+ is in the foreground process group</blockquote><p>And we can see that bash and sleep are, but zsh isn’t. They can’t both be at the same time anyway, since there can only be one foreground process group and zsh gave bash its own process group (because zsh is running interactively). So when we said “zsh receives SIGINT and forwards it to the foreground process which happens to be bash”, it turns out that was a lie.</p><p>But wherefore is bash’s process group the foreground one? <a href="https://www.man7.org/linux/man-pages/man3/tcsetpgrp.3.html">tcsetpgrp</a>. We can see it being called with ltrace:</p><pre>$ ltrace -e tcsetpgrp bash<br>bash-&gt;tcsetpgrp(255, 0xa9850, 0, 0x7f290bdb2fe4) = 0</pre><p>and when bash exits, the parent shell (zsh, in my case) reclaims foreground status with the same call.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ee592e2b587b" width="1" height="1" alt=""><hr><p><a href="https://benchling.engineering/signals-shells-and-docker-an-onion-of-footguns-ee592e2b587b">Signals, shells, and docker: an onion of footguns</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[10x faster python test iteration via fork(2)]]></title>
            <link>https://benchling.engineering/10x-faster-python-test-iteration-via-fork-2-3aae52d2f6?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/3aae52d2f6</guid>
            <category><![CDATA[developer-productivity]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[fork]]></category>
            <category><![CDATA[benchling]]></category>
            <dc:creator><![CDATA[raylu]]></dc:creator>
            <pubDate>Thu, 20 Jul 2023 16:01:45 GMT</pubDate>
            <atom:updated>2023-07-20T23:02:38.923Z</atom:updated>
            <content:encoded><![CDATA[<p>It’s ideal to get feedback on your code faster — to make a code change and see the result instantly. But, as projects get larger, reload times get longer. Each incremental dependency or bootstrap code block that adds 200ms feels worth it, but 50 of them later and it takes 10 seconds to see the result of a code change.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kXnodzozvuBFupVH2kYfmA.png" /></figure><p>On the Build team at Benchling, that’s where we found ourselves one day. We use 146 packages, which pull in 128 transitive dependencies, for a total of 274 packages. We also spent a lot of time waiting for SQLAlchemy models to initialize. The result was that our test harness took 10 seconds to set up. After making a code change, you’d start the test runner, wait a few seconds, alt+tab to your browser, get distracted for a few minutes, and then find out you had a typo in your code.</p><p>This is a common challenge for a growing codebase, but it’s something we knew we needed to fix. Here’s the process we arrived at, which allowed the second run of tests to start 10x faster — 90% less waiting. While it’ll work a little differently for your codebase depending on the language, dependencies, etc. 
you’re using, hopefully this can inspire you on your journey to faster feedback and testing.</p><h3>importlib.reload()</h3><p>Since the problem is that we spend so long setting up a bunch of modules just right and then want to see the change in a single file we’re editing, the most obvious solution is to use <a href="https://docs.python.org/3/library/importlib.html#importlib.reload">importlib.reload</a> from the standard library.</p><pre>import importlib<br>import sys<br>import test_harness_stuff  # takes 10 seconds<br>import tests<br>def rerun_tests(changed_path):<br>    for mod in sys.modules.values():<br>        if mod.__file__ == changed_path:<br>            importlib.reload(mod)<br>            tests.run_tests()<br>            break<br>if __name__ == &#39;__main__&#39;:<br>    setup_file_watcher(rerun_tests)<br>    tests.run_tests()</pre><p>This (with some special handling for built-in modules, relative path resolution, and batching to handle editors that perform multiple filesystem operations per save) works alright when the file being changed is a test file (or any other leaf node in the dependency tree).</p><p>However, as you’ve probably guessed from the very long documentation for reload(), this doesn’t work in many other cases. A very common one is if you have animal.py:</p><pre>cow = &quot;woof&quot;</pre><p>and then cow_say.py:</p><pre>from animal import cow</pre><p>If you change cow = &quot;moo&quot;, reloading animal.py is not enough because cow_say.py has its own global bound to the old str. After reloading animal.py, you must then reload all reverse dependencies in topological order. You must also ensure that if a class definition is changed, all instantiations of that class are reinitialized. For projects of almost any complexity, this is not feasible.</p><h3>Not importing</h3><p>Despite reload() not solving our problems, thinking about its issues is helpful in building a more useful solution. 
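Concretely, the stale-binding caveat can be reproduced in a few lines; a minimal sketch that writes the animal/cow_say modules from the example above into a temporary directory:

```python
import importlib
import os
import sys
import tempfile

# Write the two example modules somewhere importable.
tmp = tempfile.mkdtemp()
sys.path.insert(0, tmp)
with open(os.path.join(tmp, "animal.py"), "w") as f:
    f.write('cow = "woof"\n')
with open(os.path.join(tmp, "cow_say.py"), "w") as f:
    f.write("from animal import cow\n")
importlib.invalidate_caches()

import animal
import cow_say

# Edit animal.py and reload it: the reloaded module sees the change...
with open(os.path.join(tmp, "animal.py"), "w") as f:
    f.write('cow = "moo"\n')
importlib.reload(animal)
assert animal.cow == "moo"
# ...but cow_say's global is still bound to the old str.
assert cow_say.cow == "woof"
```

Fixing this requires reloading cow_say too — and, in general, every reverse dependency in topological order, which is exactly what makes this approach unworkable at scale.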
The giant list of caveats with reload() means you need to do surgery on the already-loaded modules.</p><p>What if we just didn’t load the code you were going to change until after you changed it? Then we wouldn’t need to do surgery! It’s not too hard to guess what code might be changed. Roughly speaking, our codebase has 3 kinds of modules: 3rd-party dependencies, SQLAlchemy models, and actual app code/tests. More than 90% of the time, we’re working in that last category, so we can just import the 3rd-party dependencies and SQLAlchemy models and not load the app/tests until we’re ready to run a test.</p><h3>zeus, fork()</h3><p>That leaves one problem: after we run a test, the test is loaded. How do we reset back to the state where dependencies and models were loaded but not app/tests? <a href="https://github.com/burke/zeus">zeus</a> actually solved this for Rails: load Rails, fork(), then load app code.</p><blockquote>fork() creates a new process by duplicating the calling process. […] The child process and the parent process run in separate memory spaces. At the time of <em>fork()</em> both memory spaces have the same content. Memory writes […] performed by one of the processes do not affect the other.</blockquote><p>So we can use fork() to snapshot the parent, import some code that is going to change (app/tests), and then rewind back to the snapshot later. 
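The “rewind” is just process isolation; a toy demonstration (no test harness here, just a dict standing in for loaded modules) that whatever a child process loads or mutates vanishes when it exits:

```python
import os

state = {"loaded_modules": []}  # stands in for sys.modules and friends

pid = os.fork()
if pid == 0:
    # Child: "import" app/tests code by mutating interpreter state, then exit.
    state["loaded_modules"].append("tests")
    os._exit(0)
os.waitpid(pid, 0)

# Parent: untouched, exactly as it was at the moment of the snapshot.
assert state["loaded_modules"] == []
```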
Rather than doing surgery on in-memory modules, we can just let the child process exit, re-fork, and re-import any changed code.</p><pre>import os<br>import sys<br>import test_harness_stuff  # takes 10 seconds<br>def run_tests():<br>    pid = os.fork()<br>    if pid == 0:  # child<br>        import tests<br>        tests.run_tests()<br>        sys.exit()<br>    else:  # parent<br>        os.waitpid(pid, 0)<br>if __name__ == &#39;__main__&#39;:<br>    setup_file_watcher(run_tests)<br>    run_tests()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/493/1*gw3y9mouurXRIBgsxyGoig.png" /></figure><p>Something like this sped up our test iteration time from 10 seconds to 1 second, which is a workflow-altering speed improvement (someone told me “I wouldn’t have bothered writing this tricky test if it weren’t for the fast reloader”).</p><p>zeus actually has a multi-level process tree and, when a file changes, it identifies which level imported it and terminates that process and all its ancestors. We do this too at Benchling: we divide up our modules into tiers based on how often developers work on them and where they fall in our dependency tree and then import each tier after forking. This allows us to discard as little import work as possible when a file closer to the root of our dependency tree changes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/444/1*hN4X4y6xyNsPZ_fTRTXxJA.png" /><figcaption>Our process tree</figcaption></figure><p>We actually ended up with some other components for ergonomics (a terminal forwarder that uses libreadline) and performance (file watcher that can’t fork because it’s threaded).</p><h3>Bonus: memory savings by not garbage collecting</h3><p>Once you start running python code after os.fork(), you start running into the same <a href="https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172">memory usage problems Instagram faced</a>. 
They run a Django web server and load up all their dependencies before forking the web workers. At first, they tried to solve their runaway memory usage by disabling garbage collection entirely. Later, they came up with a <a href="https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf">more elegant solution</a> and upstreamed it into CPython 3.7.</p><p>But what caused the memory usage? In short, copy-on-write pages and reference counting.</p><h4>Copy-on-write</h4><p>The fork() docs say “At the time of fork() both memory spaces have the same content. Memory writes […] performed by one of the processes do not affect the other”. The simplest way to implement this is to copy all the memory from the parent into the child.</p><p>The Linux kernel doesn’t do that. Instead, it makes new <a href="https://wiki.osdev.org/Memory_management#Paging">page tables</a> for the child process that point back at the parent’s memory and marks them both as read-only. When the child tries to write to any memory, it triggers a page fault. The kernel’s page fault handler looks at the page, sees that it was a copy-on-write page, makes an actual copy of the page, and lets the child retry the write operation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/397/1*pTr64TonnUJkIA6tzjM6nQ.png" /><figcaption>Parent and child processes sharing the same physical memory</figcaption></figure><p>As you can imagine, this saves a lot of memory (and makes fork quite a bit faster). So what’s the problem? The child rarely writes to any modules imported by the parent (the app/tests code rarely makes any changes to SQLAlchemy models or 3rd-party dependencies); it only reads them and calls functions defined in them.</p><h4>gc_refs</h4><p>Python’s garbage collector needs to know which objects are safe to free. 
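That liveness tracking is observable from Python itself; a small sketch using sys.getrefcount (which reports CPython’s per-object reference count, including the temporary reference the call itself makes):

```python
import sys

# A runtime-built string, like one a module defines at import time.
cow = "".join(["w", "oof"])

before = sys.getrefcount(cow)
aliases = [cow, cow]  # merely referring to the object...
after = sys.getrefcount(cow)

# ...bumped the counter in the object's header by two: a memory *write*
# caused by a logical *read*, which is what dirties copy-on-write pages.
assert after == before + 2
```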
To do this, every object has a <a href="https://github.com/python/cpython/blob/8d999cbf4adea053be6dbb612b9844635c4dfb8e/Include/objimpl.h#L256">gc_refs field</a> stored in its header that is incremented whenever it is referred to (for example, added to a list).</p><p>This means that if a module imported by our parent process defines a str and we later <em>read</em> that str in the child (which we do all the time), we will modify its object header to increment the ref count and trigger the kernel’s copy-on-write behavior.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/1*3fm2cmM-3WogymwrZDLAVQ.png" /><figcaption>Child process with its own memory after incrementing gc_refs</figcaption></figure><h4>gc.freeze()</h4><p>Instagram’s solution to this problem is to (rewrite all the CPython code that looks at gc_refs, introduce, and then) call <a href="https://docs.python.org/3/library/gc.html#gc.freeze">gc.freeze()</a>. This tells the interpreter that all existing objects should be considered ineligible for garbage collection and future accesses shouldn’t increment the ref counter. (The new object header layout, after Instagram’s changes in 3.7 and after another change in 3.12, is documented <a href="https://github.com/python/cpython/blob/main/Objects/object_layout.md">here</a>.)</p><p>Implementing this is very easy: just call gc.freeze() right before you fork()! Running a typical test, we saw a 160 MiB reduction in <a href="https://en.wikipedia.org/wiki/Unique_set_size">unique set size</a>.</p><h4>Don’t gc.collect()!</h4><p>Now that you’re thinking about the garbage collector, you might be tempted to call gc.collect() right before freezing and forking. It sounds like it would save memory — otherwise, objects with no refs in the parent will stick around forever in both the parent and the child. 
Unfortunately, that’s a bad idea.</p><p>When the garbage collector actually “collects” something, the <a href="https://github.com/python/cpython/blob/main/Objects/obmalloc.c">object allocator</a> “frees” that object’s memory. This doesn’t return any memory back to the system; it simply marks that memory as unused. It also creates a “hole” in the memory. A later allocation can fill that hole by using that freed memory.</p><p>If we think about what happens in the child after GC has created “holes” in the memory, we realize that the child will fill those holes in copy-on-write pages. In your development environment, your pages are likely <a href="https://dengking.github.io/Linux-OS/Kernel/%E4%B8%BB%E8%A6%81%E5%8A%9F%E8%83%BD/Memory-management/Virtual-memory/Paging/Unix-system-page-size/">4 KiB</a>. If you free a 1 KiB object, in the absolute best case it resides entirely within a page boundary and you replace it with another 1 KiB worth of objects. When the child tries to allocate 1 KiB, the kernel copies the entire 4 KiB page: you spent 3 KiB to save 1 KiB.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*40CzyrfGXQa1A8AAcSjNZg.png" /><figcaption>3 KiB used to save 1 KiB</figcaption></figure><p>This is why Instagram actually <a href="https://bugs.python.org/issue31558#msg302780">disables GC entirely in the parent</a>. In their words, “we’re wasting a bit of memory in the shared pages to save a lot of memory later (that would otherwise be wasted on copying entire pages after forking).”</p><h3>General applicability</h3><p>The approach we’ve described here solves a problem that we think a lot of others face — if you rack up enough dependencies, you probably have slow startup/reload times. It works on any system with fork (everything but Windows sans WSL). There are a few caveats, though:</p><ul><li>You need to be able to fork and then continue executing your code. 
Some languages’ standard libraries, such as nodejs, don’t offer this out of the box, so you may need platform-specific C extensions.</li><li>Your language needs to be able to dynamically load modules at runtime. This is pretty tricky for most compiled languages.</li><li>If you want this to work for your webserver (like zeus), it’s a bit more work. You need to integrate with your <a href="https://peps.python.org/pep-3333/">WSGI</a>/<a href="https://github.com/rack/rack/blob/main/SPEC.rdoc">rack</a>/etc. server to handle requests in a properly setup child process. Each server is different, so we don’t have any general advice for how to do this.</li></ul><p>Also, the benefits are only realized after you separate out modules based on their position in the dependency tree and frequency of edit. Because this is going to be different for everyone, we don’t have much code to share. We undertook this project because we noticed that SQLAlchemy models were close to the root of our dependency tree and took up the majority of startup time, but your mileage may vary.</p><h3>We’re hiring!</h3><p>If you’re interested in working with us to solve complex engineering problems, check out our <a href="https://www.benchling.com/careers/">careers page</a> or <a href="mailto:jobs@benchling.com">contact us</a>!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UneW9cQzCOETv6sgn-JqHw.png" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3aae52d2f6" width="1" height="1" alt=""><hr><p><a href="https://benchling.engineering/10x-faster-python-test-iteration-via-fork-2-3aae52d2f6">10x faster python test iteration via fork(2)</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Exposing AWS KMS Asymmetric Keys as a JWKS]]></title>
            <link>https://benchling.engineering/exposing-aws-kms-asymmetric-keys-as-a-jwks-7f183657f0d9?source=rss----3d4aa8fb07ea---4</link>
            <guid isPermaLink="false">https://medium.com/p/7f183657f0d9</guid>
            <category><![CDATA[oauth2]]></category>
            <category><![CDATA[benchling]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[security]]></category>
            <dc:creator><![CDATA[Brian Maloney]]></dc:creator>
            <pubDate>Thu, 02 Feb 2023 20:53:31 GMT</pubDate>
            <atom:updated>2023-02-02T20:53:31.900Z</atom:updated>
            <content:encoded><![CDATA[<p>Here at Benchling, interaction with services is a large part of our business, from employees interacting with the software-as-a-service products with which we conduct our daily business, all the way down to interactions between the services that make up the Benchling application platform itself. Secure authentication and authorization to services is a long-standing issue in the industry, but one that has been improving in recent years due to the widespread adoption of modern standards such as OAuth 2.0 and OpenID Connect (OIDC).</p><p>One specific use case for service-to-service authentication that is important to Benchling Security is connecting our Threat Detection Pipeline to our enterprise identity services vendor. We use this connection to feed log and other data provided by the vendor into our centralized Threat Detection Platform, where we correlate this with other sources of intelligence to detect risky or suspicious user activity in near real-time.</p><h4>Modern Authentication with OIDC</h4><p>Our specific identity services vendor offers two options for authenticating to its API: either an API token that an administrator can generate, or interaction by acting as an Application. API tokens, while very easy to use, are a poor choice for two reasons:</p><ol><li>First, they are a static secret that must be handled carefully and rotated frequently to mitigate the risk of a leaked key, which causes significant management overhead.</li><li>Second, the identity services vendor links the privileges and identity of an API token inextricably to the administrator who generated it. 
This causes actions using the key to be attributed to the administrator and also makes it impossible to implement the principle of least privilege.</li></ol><p>Client authentication when acting as an Application allows the use of OIDC, and this vendor specifically requires the use of the <a href="https://openid.net/specs/openid-connect-core-1_0.html#ClientAuthentication">private_key_jwt</a> Client Authentication method. Enforcing this requirement is a good choice on the part of the vendor — by using public-key encryption, no secrets need to be shared, only public keys need to be exchanged, and neither party can impersonate the other.</p><p>At this point you may be thinking, “Even though secrets don’t need to be exchanged, isn’t there still overhead for rotating the public key? And what is the best way to manage and safeguard the private keys in a modern cloud environment?” These are legitimate concerns and both have relatively simple solutions.</p><p>For managing private keys in a cloud environment, AWS was kind enough to solve this for us when they <a href="https://aws.amazon.com/blogs/security/digital-signing-asymmetric-keys-aws-kms/">added asymmetric key functionality to KMS</a> in 2019. The KMS asymmetric key functionality allows you to provision public/private keypairs using the same cloud infrastructure tooling you’re already using. The private key never leaves AWS infrastructure, but the public key can be exported and shared. 
Signing and verification operations must therefore be done using AWS APIs rather than the traditional SSL toolkits, however this functionality is readily available via existing SDKs and command-line tools.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q2nYtFSQDScp3_ITFPYurA.png" /><figcaption>Client token flow using private_key_jwt with KMS Asymmetric Keys</figcaption></figure><h4>The Challenge: Public Key Rotation</h4><p>The first issue described above is a bit more complex, and isn’t solved for us by our cloud provider. The KMS Asymmetric key must still be rotated for maximum security, but that means that the new public key needs to be shared with the identity provider. The <a href="https://openid.net/specs/openid-connect-discovery-1_0.html">OIDC Discovery</a> standard provides a common method for discovery of information about OpenID Providers using standard web technologies, including providing a URL to a JSON Web Key Set (JWKS defined in <a href="https://tools.ietf.org/html/rfc7517">RFC7517</a>) containing the public keys for the OpenID Provider. Our identity services vendor supports this approach — when creating a new API Services Application, public keys can either be uploaded into the provider <em>or</em> a JWKS URL can be provided. Because our vendor caches the JWKS, it is only loaded infrequently to update the cache, or when a new key is used.</p><p>Given the flexibility gained from dynamic JWKS sharing, we strongly preferred this approach for our integration. 
With the design selected, implementation can be broken into two major steps:</p><ul><li>Enable signing of JWTs using AWS KMS Asymmetric Keys in the client</li><li>Dynamically generate and serve the JWKS for the relevant KMS Keys</li></ul><p>Signing of JWTs using KMS Asymmetric Keys is a common use case — there are already multiple examples of how to do this available (<a href="https://github.com/sufiyanghori/Python-Asymmetric-JWT-Signing-using-AWS-KMS">Python</a>, <a href="https://www.altostra.com/blog/asymmetric-jwt-signing-using-aws-kms">Node.js</a>). Integrating signing into your client workflow is fairly straightforward following one of these examples, so we won’t dig deeper into that in this article.</p><h4>JWKS Construction and Serving</h4><p>While this isn’t that complex of a task, there are numerous ways to accomplish it with different capabilities and performance characteristics. This article will cover the way we tackled the problem to meet Benchling Security’s specific needs without being prescriptive. Although for this post we are using a very simple design, there are still some challenges to overcome on the way to a functioning solution.</p><p>The basic design of our JWKS service is a Lambda function, fronted by the relatively new <a href="https://aws.amazon.com/blogs/aws/announcing-aws-lambda-function-urls-built-in-https-endpoints-for-single-function-microservices/">Function URLs</a> feature of AWS Lambda. Function URLs allow a single-function microservice (like our JWKS exporter) to be served without the additional infrastructure of an API Gateway. This greatly reduces the amount of infrastructure we have to build to provide this service.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2NIViEGEukftp5wIvD5HQw.png" /><figcaption>JKWS Lambda Flow</figcaption></figure><h4>Key Selection</h4><p>In any AWS region, there may be multiple KMS Asymmetric Keys defined, but we don’t need to export every key in the region in this JWKS. 
AWS has some great ways to group, select, and filter cloud resources, tags being the best example. Our initial expectation was to tag each of the keys for each use case and construct the JWKS from only those keys. Unfortunately, the response from the AWS API’s <a href="https://docs.aws.amazon.com/kms/latest/APIReference/API_ListKeys.html">ListKeys</a> function response does not include the tags defined on the keys. The only way to filter the list of keys based on tags would be to iterate over each key in the region, calling <a href="https://docs.aws.amazon.com/kms/latest/APIReference/API_ListResourceTags.html">ListResourceTags</a> on each key, which isn’t scalable for a function that needs to return within a few seconds.</p><p>Because of these limitations, the only reasonable approach is to use key aliases to identify the keys to be exported in the JWKS. Using aliases allows the use of the <a href="https://docs.aws.amazon.com/kms/latest/APIReference/API_ListAliases.html">ListAliases</a> API function and filter the results to just the keys which should be exported.</p><h4>Rendering Public Keys as JWKs</h4><p>Now that we have a list of Asymmetric key resources within KMS that we want to export, the public keys need to be converted into JWK structures which can then be combined into a JWKS. Because simplicity is a design goal, we want to use the fewest possible steps to convert the public key returned by the AWS <a href="https://docs.aws.amazon.com/kms/latest/APIReference/API_GetPublicKey.html">GetPublicKey</a> API into a JWK. There are numerous Python libraries available that implement the JOSE (Javascript Object Signing and Encryption) standards, including JWK. Unfortunately, many of the popular options on PyPI have limited JWK functionality. For example, some libraries can only handle JWKs which are already in JWK format but cannot convert an existing public key or certificate to a JWK. 
Fortunately, the Python ecosystem is large and <a href="https://pypi.org/project/jwcrypto/">JWCrypto</a> has a full suite of JWK-handling functions, including conversion from PEM.</p><p>The only remaining piece of the puzzle is conversion from the DER key delivered by <a href="https://docs.aws.amazon.com/kms/latest/APIReference/API_GetPublicKey.html">GetPublicKey</a> into a PEM formatted public key. While this is a simple operation to do by hand, this functionality is provided by the very popular python <a href="https://pypi.org/project/cryptography/">Cryptography</a> library, which is already used by JWCrypto.</p><p>Finally, all that’s needed is to gather the keys into a JWKS structure and render that as JSON, and write that as the body of your Lambda response.</p><h4>Putting it all together</h4><p>Now that we know how to do all the individual steps, we can assemble them into a pleasingly simple Python Lambda with only 3 functions:</p><pre>import os<br>import logging</pre><pre>import boto3<br>from cryptography.hazmat.primitives import serialization</pre><pre>from jwcrypto import jwk</pre><pre>kms = boto3.client(&#39;kms&#39;)</pre><pre>for var in [&quot;ALIAS_START&quot;]:<br>    if not var in os.environ:<br>        logging.critical(f&#39;{var} environment variable not set&#39;)<br>        exit()</pre><pre># Search for and return the list of enabled keys with aliases<br># that start with our desired string<br>def find_enabled_keys(starts_with):<br>    alias_paginator = kms.get_paginator(&#39;list_aliases&#39;)<br>    alias_iterator = alias_paginator.paginate().search(f&#39;Aliases[?TargetKeyId != null &amp;&amp; starts_with(AliasName, `{starts_with}`)].TargetKeyId&#39;)</pre><pre>    key_ids = [page for page in alias_iterator]</pre><pre>    enabled_keys = [<br>        key_id<br>        for key_id in key_ids<br>        if kms.describe_key(KeyId=key_id)[&#39;KeyMetadata&#39;][&#39;KeyState&#39;]<br>        == &#39;Enabled&#39;<br>    ]</pre><pre>    return enabled_keys</pre>
<pre># Return a single key, formatted as a JWCrypto JWK object<br>def get_jwk(key_id):<br>    response = kms.get_public_key(KeyId=key_id)</pre><pre>    pubkey = serialization.\<br>        load_der_public_key(response[&#39;PublicKey&#39;])</pre><pre>    pub_pem = pubkey.public_bytes(<br>         encoding=serialization.Encoding.PEM,<br>         format=serialization.PublicFormat.SubjectPublicKeyInfo<br>    )</pre><pre>    key = jwk.JWK.from_pem(pub_pem)<br>    key.update(use=response[&#39;KeyUsage&#39;][0:3].lower(), kid=key_id)</pre><pre>    return key</pre><pre>def lambda_handler(event, context):<br>    # Construct JWKS structure from keys that match ALIAS_START<br>    jwks = {<br>         &quot;keys&quot;: [<br>             get_jwk(key_id)<br>             for key_id<br>             in find_enabled_keys(os.environ[&#39;ALIAS_START&#39;])<br>         ]<br>    }</pre><pre>    # Construct Lambda return structure<br>    return({<br>        &#39;statusCode&#39;: 200,<br>        &#39;headers&#39;: {<br>            &quot;Content-Type&quot;: &quot;application/json&quot;,<br>        },<br>        &#39;body&#39;: jwks<br>    })</pre><h4>Enhancements for Production Use</h4><p>The code presented in this article should be considered a proof-of-concept only — if you intend to use this pattern in production, you should consider your use case. If your needs require frequent reloading of this JWKS, it would be more scalable to write this out to an object store when changes occur, and serve the JWKS directly from the object store or a CDN. 
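The DER-to-PEM step at the heart of get_jwk can be sanity-checked locally, without any AWS calls; a sketch that substitutes a freshly generated RSA key for the bytes KMS GetPublicKey would return:

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Stand-in for response['PublicKey']: a DER-encoded SubjectPublicKeyInfo blob.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
der = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)

# The same two calls the Lambda makes: parse the DER, re-serialize as PEM.
pem = serialization.load_der_public_key(der).public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)
assert pem.startswith(b"-----BEGIN PUBLIC KEY-----")
```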
Even if you do not have production-level requirements, we still recommend building additional resilience into the function by expanding error handling.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7f183657f0d9" width="1" height="1" alt=""><hr><p><a href="https://benchling.engineering/exposing-aws-kms-asymmetric-keys-as-a-jwks-7f183657f0d9">Exposing AWS KMS Asymmetric Keys as a JWKS</a> was originally published in <a href="https://benchling.engineering">Benchling Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>