<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>ral-arturo.org</title>
    <description>ral-arturo blog, about free software, debian, networks, systems, or whatever</description>
    <link>https://ral-arturo.org/</link>
    <atom:link href="https://ral-arturo.org/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Fri, 27 Mar 2026 11:10:06 +0000</pubDate>
    <lastBuildDate>Fri, 27 Mar 2026 11:10:06 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>New job at Chainguard</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/chainguard-logo.png&quot; alt=&quot;Chainguard logo&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A few months ago, in June 2025, I joined &lt;a href=&quot;https://www.chainguard.dev&quot;&gt;Chainguard&lt;/a&gt;, a company focused on software supply chain security.
This post is a reflection on how I got here, what I’ve been doing, and why this role feels like a natural
fit for my interests in Linux and open source technology.&lt;/p&gt;

&lt;!--more--&gt;

&lt;h2 id=&quot;the-company-and-its-mission&quot;&gt;The company and its mission&lt;/h2&gt;

&lt;p&gt;Chainguard’s mission is to make the software supply chain secure by default. The company is built around
the idea that the software we all depend on — from operating system packages to container base images — carries
hidden risk in the form of vulnerabilities, unverified provenance, and untrusted build processes.&lt;/p&gt;

&lt;p&gt;The company is perhaps best known for &lt;a href=&quot;https://www.chainguard.dev/chainguard-images&quot;&gt;Chainguard Images&lt;/a&gt;: a catalog of minimal, hardened container
base images that are continuously rebuilt and kept free of known CVEs. Each image is accompanied by a signed
&lt;a href=&quot;https://www.cisa.gov/sbom&quot;&gt;SBOM&lt;/a&gt; (Software Bill of Materials) and a verifiable &lt;a href=&quot;https://slsa.dev&quot;&gt;provenance attestation&lt;/a&gt;, making it possible
to cryptographically verify what went into a given image and how it was built.&lt;/p&gt;
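As an illustration of what consuming such an SBOM can look like, here is a minimal Python sketch that lists the packages declared in a small, hypothetical SPDX-style document (inlined for the example; real image SBOMs are much larger, and the cryptographic verification itself is done with signing tooling such as cosign):

```python
import json

# Hypothetical, minimal SPDX-style SBOM document, inlined for illustration.
sbom_json = """
{
  "spdxVersion": "SPDX-2.3",
  "name": "example-image-sbom",
  "packages": [
    {"name": "busybox", "versionInfo": "1.36.1"},
    {"name": "ca-certificates", "versionInfo": "20240203"}
  ]
}
"""

def list_packages(document: str) -> list[tuple[str, str]]:
    """Return (name, version) pairs for every package declared in the SBOM."""
    sbom = json.loads(document)
    return [(p["name"], p["versionInfo"]) for p in sbom.get("packages", [])]

for name, version in list_packages(sbom_json):
    print(name, version)
```

In practice you would fetch the SBOM attached to an image and feed its contents to a check like this, for example to match package versions against a vulnerability database.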

&lt;p&gt;Chainguard has an extensive catalog of software, and keeping it up to date and CVE-free is a significant
engineering challenge.&lt;/p&gt;

&lt;h2 id=&quot;what-i-do&quot;&gt;What I do&lt;/h2&gt;

&lt;p&gt;I joined the Chainguard Sustaining Engineering team as a Senior Software Engineer. We are responsible
for keeping the packages and images in the software catalog up to date and CVE-free. The core of the business, basically.&lt;/p&gt;

&lt;p&gt;We focus on the horizontal dimension of the catalog (pretty much all packages and images).&lt;/p&gt;

&lt;p&gt;With more than 30,000 packages and 2,000 images, this is indeed an interesting task.&lt;/p&gt;

&lt;p&gt;My role as a Debian Developer, and my experience in the &lt;a href=&quot;https://ral-arturo.org/2025/04/17/lts.html&quot;&gt;Debian LTS project&lt;/a&gt;, were extremely valuable when joining this
new team.&lt;/p&gt;

&lt;h2 id=&quot;looking-ahead&quot;&gt;Looking ahead&lt;/h2&gt;

&lt;p&gt;The software supply chain is a truly deep topic, gaining more and more relevance every day, especially as new technologies emerge
and get adopted everywhere.&lt;/p&gt;

&lt;p&gt;Since early in my career, I have seen a recurring problem in how companies, enterprises, and even governments relate to and consume
open source software in a reliable, secure way. I believe Chainguard is doing the right things in the ecosystem,
and I’m happy to be participating in the effort.&lt;/p&gt;

</description>
        <pubDate>Fri, 27 Mar 2026 08:00:00 +0000</pubDate>
        <link>https://ral-arturo.org/2026/03/27/chainguard.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2026/03/27/chainguard.html</guid>
        
        <category>job</category>
        
        <category>chainguard</category>
        
        <category>security</category>
        
        <category>linux</category>
        
        
      </item>
    
      <item>
        <title>Wikimedia Cloud VPS: IPv6 support</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20250520-cape-town-za-sea-point-nachtansicht.png&quot; alt=&quot;Cape Town (ZA), Sea Point, Nachtansicht&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/User:XRay&quot;&gt;Dietmar Rabich&lt;/a&gt;,
&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Cape_Town_(ZA),_Sea_Point,_Nachtansicht_--_2024_--_1867-70_-_2.jpg&quot;&gt;Cape Town (ZA), Sea Point, Nachtansicht — 2024 — 1867-70 –
2&lt;/a&gt;,
&lt;a href=&quot;https://creativecommons.org/licenses/by-sa/4.0/legalcode&quot;&gt;CC BY-SA 4.0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally published in the &lt;a href=&quot;https://techblog.wikimedia.org/2025/05/06/wikimedia-cloud-vps-ipv6-support/&quot;&gt;Wikimedia Tech blog&lt;/a&gt;, authored by Arturo Borrero Gonzalez.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS&quot;&gt;Wikimedia Cloud VPS&lt;/a&gt; is a service offered by the Wikimedia
Foundation, built using &lt;a href=&quot;https://en.wikipedia.org/wiki/OpenStack&quot;&gt;OpenStack&lt;/a&gt; and managed by the Wikimedia Cloud Services
team. It provides cloud computing resources for projects related to the
&lt;a href=&quot;https://meta.wikimedia.org/wiki/Wikimedia_movement&quot;&gt;Wikimedia movement&lt;/a&gt;, including virtual machines, databases, storage,
Kubernetes, and DNS.&lt;/p&gt;

&lt;p&gt;A few weeks ago, in April 2025,
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/News/2025_Cloud_VPS_VXLAN_IPv6_migration&quot;&gt;we were finally able to introduce IPv6&lt;/a&gt; to
the cloud virtual network, enhancing the platform’s scalability, security, and future-readiness. This major
milestone was many years in the making, and it serves as an excellent moment to reflect on the road that got us here.
A number of challenges had to be addressed before we could get to IPv6. This post covers the journey to this
implementation.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The Wikimedia Foundation was an early adopter of the OpenStack technology, and the original OpenStack deployment in the
organization dates back to 2011. At that time, IPv6 support was still nascent and had limited implementation across
various OpenStack components.
&lt;a href=&quot;https://phabricator.wikimedia.org/T37947&quot;&gt;In 2012, the Wikimedia cloud users formally requested IPv6 support&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When Cloud VPS was originally deployed, we had set up the network following some of the upstream-recommended patterns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;nova-networks as the engine in charge of the software-defined virtual network&lt;/li&gt;
  &lt;li&gt;using a flat network topology – all virtual machines would share the same network&lt;/li&gt;
  &lt;li&gt;using a physical VLAN in the datacenter&lt;/li&gt;
  &lt;li&gt;using Linux bridges to make this physical datacenter VLAN available to virtual machines&lt;/li&gt;
  &lt;li&gt;using a single virtual router as the edge network gateway, also executing a global egress NAT – barring some
exceptions, using what was called “dmz_cidr” mechanism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order for us to be able to implement IPv6 in a way that aligned with our architectural goals and operational
requirements, pretty much all the elements in this list would need to change. First of all, we needed to migrate from
nova-networks into Neutron,
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Neutron_SDN&quot;&gt;a migration effort that started in 2017&lt;/a&gt;.
Neutron was the more modern component to implement software-defined networks in OpenStack. To facilitate this
transition, we made the strategic decision to backport certain functionalities from nova-networks into Neutron,
specifically &lt;a href=&quot;https://phabricator.wikimedia.org/T167357&quot;&gt;the “dmz_cidr” mechanism and some egress NAT capabilities&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once in Neutron, we started to think about IPv6. In 2018 there was an initial attempt to decide on the network CIDR
allocations that Wikimedia Cloud Services would have. This initiative encountered unforeseen challenges
&lt;a href=&quot;https://phabricator.wikimedia.org/T187929#7315518&quot;&gt;and was subsequently put on hold&lt;/a&gt;. We focused on removing the previously
backported nova-networks patches from Neutron.&lt;/p&gt;

&lt;p&gt;Between 2020 and 2021, we initiated another
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh&quot;&gt;significant network refresh&lt;/a&gt;.
We were able to introduce the cloudgw project, as part of a larger effort to rework the Cloud VPS edge network. The new
edge routers allowed us to drop all the custom backported patches we had in Neutron from the nova-networks era,
unblocking further progress. It is worth mentioning that the cloudgw router uses nftables as its firewalling and NAT engine.&lt;/p&gt;
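As a purely illustrative sketch (the interface name and addresses below are invented, not the actual Wikimedia configuration), an nftables egress NAT ruleset in the spirit of cloudgw could look like this:

```
# Hypothetical cloudgw-style egress NAT ruleset; names and addresses are invented.
table ip cloudgw {
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        # exception: do not NAT traffic towards internal service ranges
        ip daddr 10.0.0.0/8 accept
        # NAT everything else leaving the cloud realm through the wan interface
        ip saddr 172.16.0.0/21 oifname "wan0" snat to 198.51.100.1
    }
}
```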

&lt;p&gt;A pivotal decision in 2022 was to
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Help:Using_OpenStack_APIs&quot;&gt;expose the OpenStack APIs to the internet&lt;/a&gt;, which
crucially enabled infrastructure management via OpenTofu. This was key in the IPv6 rollout as will be explained later.
Before this, management was limited to Horizon – the OpenStack graphical interface – or the command-line interface
accessible only from internal control servers.&lt;/p&gt;

&lt;p&gt;Later, in 2023, following the OpenStack project’s announcement of the deprecation of the neutron-linuxbridge-agent, we
began to &lt;a href=&quot;https://phabricator.wikimedia.org/T326373&quot;&gt;seriously consider migrating to the neutron-openvswitch-agent&lt;/a&gt;.
This transition would, in turn, simplify the enablement of “tenant networks” – a feature allowing each OpenStack project
to define its own isolated network, rather than all virtual machines sharing a single flat network.&lt;/p&gt;

&lt;p&gt;Once we replaced neutron-linuxbridge-agent with neutron-openvswitch-agent, we were ready to migrate virtual machines to
VXLAN. Demonstrating perseverance, we decided to execute the VXLAN migration in conjunction with the IPv6 rollout.&lt;/p&gt;

&lt;p&gt;We &lt;a href=&quot;https://phabricator.wikimedia.org/T389958&quot;&gt;prepared&lt;/a&gt; and tested several things, including the rework of the edge
routing to be based on BGP/OSPF instead of static routing. In 2024 we were ready for the initial attempt to deploy
IPv6, &lt;a href=&quot;https://phabricator.wikimedia.org/T380728&quot;&gt;which failed for unknown reasons&lt;/a&gt;. There was a full network outage and
we immediately reverted the changes. This quick rollback was feasible due to
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/tofu-infra&quot;&gt;our adoption of OpenTofu&lt;/a&gt;: deploying IPv6 had
been reduced to a single code change within our repository.&lt;/p&gt;

&lt;p&gt;We started an investigation, corrected a few issues, and
&lt;a href=&quot;https://phabricator.wikimedia.org/T391325&quot;&gt;increased our network functional testing coverage&lt;/a&gt; before trying again. One
of the problems we discovered was that Neutron would enable the “enable_snat” configuration flag for our main router
when adding the new external IPv6 address.&lt;/p&gt;

&lt;p&gt;Finally, in April 2025,
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/IPv6/initial_deploy#Log&quot;&gt;after many years in the making&lt;/a&gt;,
IPv6 was successfully deployed.&lt;/p&gt;

&lt;p&gt;Compared to the network from 2011, we now have:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Neutron as the engine in charge of the software-defined virtual network&lt;/li&gt;
  &lt;li&gt;tenant networks ready to use&lt;/li&gt;
  &lt;li&gt;a VXLAN-based overlay network&lt;/li&gt;
  &lt;li&gt;neutron-openvswitch-agent providing networking to virtual machines&lt;/li&gt;
  &lt;li&gt;a modern and robust edge network setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, the WMCS team has skillfully navigated numerous challenges to ensure our service offerings consistently meet
high standards of quality and operational efficiency. By often engaging in multi-year planning, we have been able to set
and achieve significant milestones.&lt;/p&gt;

&lt;p&gt;The successful IPv6 deployment stands as further testament to the team’s dedication and hard work over the years. I
believe we can confidently say that the 2025 Cloud VPS represents its most advanced and capable iteration to date.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally published in the &lt;a href=&quot;https://techblog.wikimedia.org/2025/05/06/wikimedia-cloud-vps-ipv6-support/&quot;&gt;Wikimedia Tech blog&lt;/a&gt;, authored by Arturo Borrero Gonzalez.&lt;/em&gt;&lt;/p&gt;

</description>
        <pubDate>Tue, 20 May 2025 13:00:00 +0000</pubDate>
        <link>https://ral-arturo.org/2025/05/20/wmcs-ipv6.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2025/05/20/wmcs-ipv6.html</guid>
        
        
      </item>
    
      <item>
        <title>My experience in the Debian LTS and ELTS projects</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/debian-logo.jpg&quot; alt=&quot;Debian&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Last year, I decided to start participating in the Debian LTS and ELTS projects. It was a great opportunity to engage in
something new within the Debian community. I had been following these projects for many years, observing their evolution
and how they gained traction both within the ecosystem and across the industry.&lt;/p&gt;

&lt;p&gt;I was curious to explore how contributors were working internally — especially how they managed security patching and
remediation for older software. I’ve always felt this was a particularly challenging area, and I was fortunate to
experience it firsthand.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;As of April 2025, the &lt;a href=&quot;https://wiki.debian.org/LTS&quot;&gt;Debian LTS project&lt;/a&gt; was primarily focused on providing security
maintenance for Debian 11 &lt;em&gt;Bullseye&lt;/em&gt;. Meanwhile, the &lt;a href=&quot;https://wiki.debian.org/LTS/Extended&quot;&gt;Debian ELTS project&lt;/a&gt; was
targeting Debian 8 &lt;em&gt;Jessie&lt;/em&gt;, Debian 9 &lt;em&gt;Stretch&lt;/em&gt;, and Debian 10 &lt;em&gt;Buster&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;During my time with the projects, I worked on a variety of packages and CVEs. Some of the most notable ones include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/bluez&quot;&gt;bluez&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/nss&quot;&gt;nss&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/libmojolicious-perl&quot;&gt;libmojolicious-perl&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/uw-imap&quot;&gt;uw-imap&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/dnsmasq&quot;&gt;dnsmasq&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/firmware-nonfree&quot;&gt;firmware-nonfree&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/activemq&quot;&gt;activemq&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/frr&quot;&gt;frr&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://security-tracker.debian.org/tracker/source-package/libmodbus&quot;&gt;libmodbus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are several technical highlights I’d like to share — things I learned or had to apply while participating:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;&lt;strong&gt;CI/CD pipelines&lt;/strong&gt;: We consistently used CI/CD pipelines on &lt;a href=&quot;https://salsa.debian.org&quot;&gt;salsa.debian.org&lt;/a&gt; to automate
tasks such as building, linting, and testing packages. For any package I worked on that lacked CI/CD integration,
setting it up became my first step.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;autopkgtest&lt;/strong&gt;: There’s a strong emphasis on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autopkgtest&lt;/code&gt; as the mechanism for running functional tests and ensuring
that security patches don’t introduce regressions. I contributed by both extending existing test suites and writing
new ones from scratch.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Toolchain complexity for older releases&lt;/strong&gt;: Working with older Debian versions like &lt;em&gt;Jessie&lt;/em&gt; brought some unique
challenges. Getting a development environment up and running often meant troubleshooting issues with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sbuild&lt;/code&gt; chroots,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qemu&lt;/code&gt; images, and other tools that don’t “just work” like they tend to on Debian stable.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Community collaboration&lt;/strong&gt;: The people involved in LTS and ELTS are extremely helpful and collaborative. Requests for
help, code reviews, and general feedback were usually answered quickly.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Shared ownership&lt;/strong&gt;: This collaborative culture also meant that contributors would regularly pick up work left by
others or hand off their own tasks when needed. That mutual support made a big difference.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Backporting security fixes&lt;/strong&gt;: This is probably the most intense —and most rewarding— activity. It involves manually
adapting patches to work on older codebases when upstream patches don’t apply cleanly. This requires deep code
understanding and thorough testing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Upstream collaboration&lt;/strong&gt;: Reaching out to upstream developers was a key part of my workflow. I often asked if they
could provide patches for older versions or at least review my backports. Sometimes they were available, but most of
the time, the responsibility remained on us.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Diverse tech stack&lt;/strong&gt;: The work exposed me to a wide range of programming languages and frameworks—Python, Java, C,
Perl, and more. Unsurprisingly, some modern languages (like Go) are less prevalent in older releases like &lt;em&gt;Jessie&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Security team interaction&lt;/strong&gt;: I had frequent contact with the core Debian Security Team—the folks responsible for
security in Debian stable. This gave me a broader perspective on how Debian handles security holistically.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
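The autopkgtest work mentioned above revolves around the debian/tests/control file of a package; a minimal sketch (with a hypothetical test name) looks like this:

```
# debian/tests/control: minimal, hypothetical autopkgtest declaration.
# The test itself lives in an executable at debian/tests/smoke.
Tests: smoke
Depends: @
Restrictions: allow-stderr
```

Here, Depends: @ pulls in the binary packages built from the source package under test, and the smoke executable exercises the installed software.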

&lt;p&gt;In March 2025, I decided to scale back my involvement in the projects due to some changes in my personal
life. Still, this experience has been one of the highlights of my career, and I would definitely recommend it to others.&lt;/p&gt;

&lt;p&gt;I’m very grateful for the warm welcome I received from the LTS/ELTS community, and I don’t rule out the possibility of
rejoining the LTS/ELTS efforts in the future.&lt;/p&gt;

&lt;p&gt;The Debian LTS/ELTS projects are currently coordinated by folks at &lt;a href=&quot;https://www.freexian.com/&quot;&gt;Freexian&lt;/a&gt;.
Many thanks to Freexian and &lt;a href=&quot;https://www.freexian.com/lts/debian/#sponsors&quot;&gt;sponsors&lt;/a&gt; for providing this opportunity!&lt;/p&gt;

</description>
        <pubDate>Thu, 17 Apr 2025 09:00:00 +0000</pubDate>
        <link>https://ral-arturo.org/2025/04/17/lts.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2025/04/17/lts.html</guid>
        
        
      </item>
    
      <item>
        <title>Wikimedia Toolforge: migrating Kubernetes from PodSecurityPolicy to Kyverno</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20240704-wikimedia-commons-chateau-de-valere.png&quot; alt=&quot;Le château de Valère et le Haut de Cry en juillet 2022&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Ch%C3%A2teau_de_Val%C3%A8re_et_Haut_de_Cry_-_juillet_2022.jpg&quot;&gt;Christian
David&lt;/a&gt;,
&lt;a href=&quot;https://creativecommons.org/licenses/by-sa/4.0&quot;&gt;CC BY-SA 4.0&lt;/a&gt;, via Wikimedia Commons&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally published in the &lt;a href=&quot;https://techblog.wikimedia.org/2024/07/03/wikimedia-toolforge-migrating-kubernetes-from-podsecuritypolicy-to-kyverno/&quot;&gt;Wikimedia Tech blog&lt;/a&gt;, authored by Arturo Borrero Gonzalez.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Summary: this article shares the experience and learnings of migrating from Kubernetes PodSecurityPolicy to
Kyverno in the Wikimedia Toolforge platform.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Toolforge&quot;&gt;Wikimedia Toolforge&lt;/a&gt; is a Platform-as-a-Service, built with
Kubernetes, and maintained by the Wikimedia Cloud Services team (WMCS). It is completely free and open, and we welcome
anyone to use it to build and host tools (bots, webservices, scheduled jobs, etc) in support of Wikimedia projects.&lt;/p&gt;

&lt;p&gt;We provide a set of platform-specific services, command line interfaces, and shortcuts to help in the task of setting up
webservices, jobs, and stuff like building container images, or using databases. Using these interfaces makes the
underlying Kubernetes system pretty much invisible to users. We also allow direct access to the Kubernetes API, and some
advanced users do directly interact with it.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Each account has a Kubernetes namespace where they can freely deploy their workloads. We have a number of controls in
place to ensure performance, stability, and fairness of the system, including quotas, RBAC permissions, and up until
recently PodSecurityPolicies (PSP). At the time of this writing, we had around 3,500 Toolforge tool accounts in the
system. We adopted PSP early, in 2019, as a way to make sure Pods had the correct runtime configuration, staying
within the safe boundaries of a set of pre-defined parameters. Back when we adopted PSP there was already the
option to use third-party agents, like &lt;a href=&quot;https://open-policy-agent.github.io/gatekeeper/website/&quot;&gt;OpenPolicyAgent Gatekeeper&lt;/a&gt;,
but we decided not to invest in them, and went with a native, built-in mechanism instead.&lt;/p&gt;

&lt;p&gt;In 2021 it was &lt;a href=&quot;https://kubernetes.io/blog/2021/04/06/podsecuritypolicy-deprecation-past-present-and-future/&quot;&gt;announced&lt;/a&gt;
that the PSP mechanism would be deprecated and removed in Kubernetes 1.25. Even though we had been warned years in
advance, we did not prioritize the PSP migration until we were on Kubernetes 1.24 and blocked, unable to upgrade
further without taking action.&lt;/p&gt;

&lt;p&gt;The WMCS team explored different alternatives for this migration, but eventually we &lt;a href=&quot;https://phabricator.wikimedia.org/T362233&quot;&gt;decided to go with
Kyverno&lt;/a&gt; as a replacement for PSP. With that decision, the
journey described in this blog post began.&lt;/p&gt;

&lt;p&gt;First, we needed a source code refactor for one of the key components of our Toolforge Kubernetes:
&lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers&quot;&gt;maintain-kubeusers&lt;/a&gt;. This custom piece of
software, built in-house, contains the logic to fetch accounts from LDAP and do the necessary instrumentation on
Kubernetes to accommodate each one: create the namespace, RBAC, quota, a kubeconfig file, etc. With the refactor, we
introduced a proper reconciliation loop, so that the software has a notion of what needs to be done for
each account: what is missing, what to delete, what to upgrade, and so on. This allows us to easily deploy new
resources for each account, or iterate on their definitions.&lt;/p&gt;
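The reconciliation idea can be sketched in a few lines of Python (a simplified, hypothetical model, not the actual maintain-kubeusers code):

```python
# Simplified, hypothetical reconciliation step in the spirit of the
# maintain-kubeusers refactor: compare the desired resources for an account
# against what currently exists, then create what is missing and delete
# what is stale.

DESIRED = ["namespace", "rbac", "quota", "kubeconfig", "kyverno-policy"]

def reconcile(account: str, current: set[str]) -> dict[str, list[str]]:
    """Return the actions needed to bring one account up to date."""
    desired = set(DESIRED)
    return {
        "create": sorted(desired - current),
        "delete": sorted(current - desired),
    }

# Example: an account that still carries a stale resource from a previous era.
actions = reconcile("tool-example", {"namespace", "rbac", "old-psp"})
print(actions)
```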

&lt;p&gt;The initial version of the refactor had a number of problems, though. For one, the new version of maintain-kubeusers was
doing more filesystem interaction than the previous version, resulting in a slow reconciliation loop over all the
accounts. We used NFS as the underlying storage system for Toolforge, and it could be very slow for reasons
beyond the scope of this blog post. This was corrected in the days following the initial refactor rollout. A side note with an
implementation detail: we stored a configmap in each account namespace with the state of each resource. Storing more
state in this configmap was our solution to avoid additional NFS latency.&lt;/p&gt;

&lt;p&gt;I initially estimated this refactor would take me a week to complete, but unfortunately it took around three weeks
instead. Prior to the refactor, several manual steps and cleanups were required when updating the
definition of a resource. The process is now automated, more robust, performant, efficient, and clean. So in my opinion
it was worth it, even if it took more time than expected.&lt;/p&gt;

&lt;p&gt;Then, we worked on the Kyverno policies themselves. Because we had a very particular PSP setup, we tried to
replicate its semantics on a 1:1 basis as much as possible to ease the transition. This involved things like
transparent mutation of Pod resources, followed by validation. Additionally, we had a different PSP definition for each
account, so we decided to create a different Kyverno namespaced policy resource for each account namespace — remember,
we had 3.5k accounts.&lt;/p&gt;

&lt;p&gt;We created a &lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/main/maintain_kubeusers/resources/kyverno_pod_policy.yaml.tpl?ref_type=heads&quot;&gt;Kyverno policy
template&lt;/a&gt;
that we would then render and inject for each account.&lt;/p&gt;
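The per-account policy idea can be sketched roughly like this (a simplified, hypothetical example, not the actual Toolforge template linked above):

```yaml
# Hypothetical namespaced Kyverno Policy, rendered once per tool account.
apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: toolforge-pod-policy
  namespace: tool-example      # one policy per account namespace
spec:
  validationFailureAction: Audit   # later flipped to Enforce
  rules:
    - name: require-run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must not run as root."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```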

&lt;p&gt;For developing and testing all this, maintain-kubeusers and the Kyverno bits, we had a project called
&lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo&quot;&gt;lima-kilo&lt;/a&gt;, which was a local Kubernetes setup
replicating production Toolforge. This was used by each engineer in their laptop as a common development environment.&lt;/p&gt;

&lt;p&gt;We had planned the migration from PSP to Kyverno policies in stages, like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;update our internal template generators to make Pod security settings explicit&lt;/li&gt;
  &lt;li&gt;introduce Kyverno policies in Audit mode&lt;/li&gt;
  &lt;li&gt;see how the cluster would behave with them, and if we had any offending resources reported by the new policies, and
correct them&lt;/li&gt;
  &lt;li&gt;modify Kyverno policies and set them in Enforce mode&lt;/li&gt;
  &lt;li&gt;drop PSP&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In stage 1, we &lt;a href=&quot;https://phabricator.wikimedia.org/T362050&quot;&gt;updated things&lt;/a&gt; like the toolforge-jobs-framework and tools-webservice.&lt;/p&gt;

&lt;p&gt;In stage 2, when we deployed the 3.5k Kyverno policy resources, our production cluster died almost immediately.
Surprise. All the monitoring went red, the Kubernetes apiserver became unresponsive, and we were unable to perform any
administrative actions in the Kubernetes control plane, or even on the underlying virtual machines. All Toolforge users
were impacted. This was a &lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Incidents/2024-06-12_WMCS_toolforge_k8s_control_plane&quot;&gt;full-scale
outage&lt;/a&gt; that required the
energy of the whole WMCS team to recover from. We temporarily disabled Kyverno until we could learn what had occurred.&lt;/p&gt;

&lt;p&gt;This incident happened despite having tested before in lima-kilo and in another pre-production cluster we had, called
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolsbeta&quot;&gt;Toolsbeta&lt;/a&gt;. But we had not tested that many
policy resources. Clearly, this was something scale-related. After the incident, I went on and created 3.5k Kyverno
policy resources on lima-kilo, and indeed I was able to reproduce the outage. We took a number of measures, corrected a
&lt;a href=&quot;https://phabricator.wikimedia.org/T367389&quot;&gt;few errors&lt;/a&gt; in our infrastructure, reached out to the Kyverno upstream
developers &lt;a href=&quot;https://github.com/kyverno/kyverno/issues/10458&quot;&gt;asking for advice&lt;/a&gt;, and in the end we did the following to
accommodate the setup to our needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;corrected the external HAProxy health checks for the Kubernetes apiservers, from checking just for open TCP ports to actually
checking the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/healthz&lt;/code&gt; HTTP endpoint, which more accurately reflects the health of each k8s apiserver.&lt;/li&gt;
  &lt;li&gt;having a more realistic development environment. In lima-kilo, we created a couple of &lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/blob/main/helpers/toolforge_create_many_accounts.sh?ref_type=heads&quot;&gt;helper scripts&lt;/a&gt; to create/delete
4000 policy resources, each on a different namespace.&lt;/li&gt;
&lt;li&gt;greatly over-provisioned memory in the Kubernetes control plane servers, that is, more memory in the base virtual
machines hosting the control plane. Scaling the memory headroom of the apiserver prevents it from running out of
memory and crashing the whole system. We went from 8GB RAM per virtual machine to 32GB. In our cluster, a
single apiserver pod could eat 7GB of memory on a normal day, so having 8GB on the base virtual machine was clearly
not enough. I also sent a &lt;a href=&quot;https://github.com/kyverno/website/pull/1295&quot;&gt;patch proposal&lt;/a&gt; to the Kyverno upstream documentation suggesting they clarify the additional
memory pressure on the apiserver.&lt;/li&gt;
  &lt;li&gt;corrected &lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/kyverno/values/common/settings.yaml?ref_type=heads&quot;&gt;resource requests and limits&lt;/a&gt; of Kyverno, to more accurately describe our actual usage.&lt;/li&gt;
&lt;li&gt;increased the number of replicas of the Kyverno admission controller to 7, so admission requests could be handled
in a more timely fashion by Kyverno.&lt;/li&gt;
&lt;/ul&gt;
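The health-check measure above can be sketched as an HAProxy backend along these lines (a hedged example; server names and addresses are hypothetical):

```
# Hypothetical HAProxy backend for the k8s apiservers: probe the /healthz
# endpoint instead of only testing that the TCP port is open.
backend k8s-apiserver
    option httpchk GET /healthz
    http-check expect status 200
    server control-1 192.0.2.10:6443 check check-ssl verify none
    server control-2 192.0.2.11:6443 check check-ssl verify none
```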

&lt;p&gt;I have to admit, I was &lt;a href=&quot;https://phabricator.wikimedia.org/T367950&quot;&gt;briefly tempted to drop Kyverno&lt;/a&gt;, stop pursuing an external policy agent entirely,
and write our own custom admission controller, out of concerns over the performance of this architecture. However, after
applying all the measures listed above, the system became very stable, so we decided to move forward. The second attempt
at deploying it all went through just fine. No outage this time 🙂&lt;/p&gt;

&lt;p&gt;When we were in stage 4 we detected another bug. We had been following the Kubernetes upstream documentation for setting
securityContext to the right values. In particular, we were enforcing procMount to be set to the default value,
which per the docs was
‘&lt;a href=&quot;https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#securitycontext-v1-core&quot;&gt;DefaultProcMount&lt;/a&gt;’.
However, that string is the name of the internal variable in the source code, whereas the actual default value is the
string ‘&lt;a href=&quot;https://github.com/kubernetes/api/blob/release-1.24/core/v1/types.go#L6422&quot;&gt;Default&lt;/a&gt;’. This caused pods to be
rightfully rejected by Kyverno while we figured out the problem. I sent a &lt;a href=&quot;https://github.com/kubernetes/kubernetes/pull/125782&quot;&gt;patch
upstream&lt;/a&gt; to fix this problem.&lt;/p&gt;
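
&lt;p&gt;For reference, this is the kind of pod manifest fragment involved. The value that needs to appear in the manifest is
the API string, not the Go constant name:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;securityContext:
  procMount: Default             # correct: the actual API value
  # procMount: DefaultProcMount  # wrong: the internal Go constant name
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;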

&lt;p&gt;We finally had everything in place, reached stage 5, and were able to disable PSP. We unloaded the PSP controller
from the Kubernetes apiserver, and deleted every individual PSP definition. Everything was very smooth in this last step
of the migration.&lt;/p&gt;

&lt;p&gt;This whole PSP project, including the maintain-kubeusers refactor, the outage, and all the different migration stages
took roughly three months to complete.&lt;/p&gt;

&lt;p&gt;For me, there are a number of valuable lessons to learn from this project. For one, scale is something to consider,
and test, when evaluating a new architecture or software component. Not doing so can lead to service outages or
unexpectedly poor performance. This is in the first chapter of the SRE handbook, but we got a reminder the hard way 🙂&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally published in the &lt;a href=&quot;https://techblog.wikimedia.org/2024/07/03/wikimedia-toolforge-migrating-kubernetes-from-podsecuritypolicy-to-kyverno/&quot;&gt;Wikimedia Tech blog&lt;/a&gt;, authored by Arturo Borrero Gonzalez.&lt;/em&gt;&lt;/p&gt;

</description>
        <pubDate>Thu, 04 Jul 2024 09:00:00 +0000</pubDate>
        <link>https://ral-arturo.org/2024/07/04/kyverno.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2024/07/04/kyverno.html</guid>
        
        
      </item>
    
      <item>
        <title>Kubecon and CloudNativeCon 2024 Europe summary</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20240401-kubecon-logo.png&quot; alt=&quot;Kubecon EU 2024 Paris logo&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This blog post shares my thoughts on attending Kubecon and CloudNativeCon 2024 Europe in Paris. It was my third time at
this conference, and it felt bigger than last year’s edition in Amsterdam. Apparently it had an impact on public transport: I
missed part of the opening keynote because of the extremely busy rush-hour tram in Paris.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;&lt;strong&gt;On Artificial Intelligence, Machine Learning and GPUs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Talks about AI, ML, and GPUs were everywhere this year. While it wasn’t my main interest, I did learn about GPU resource
sharing and power usage on Kubernetes. There were also ideas about offering Models-as-a-Service, which could be cool for
Wikimedia Toolforge in the future.&lt;/p&gt;

&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://phabricator.wikimedia.org/T336905&quot;&gt;Phabricator T336905: Supporting AI, LLM, and data models on WMCS&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ollama.com/&quot;&gt;ollama.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On security, policy and authentication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was probably my main interest in the event, given that Wikimedia Toolforge was about to migrate away from Pod
Security Policy, and we were evaluating different alternatives at the time.&lt;/p&gt;

&lt;p&gt;In contrast to my previous attendances at Kubecon, where three policy agents had a presence in the program
schedule (Kyverno, Kubewarden and OpenPolicyAgent, or OPA), this time only OPA had relevant sessions.&lt;/p&gt;

&lt;p&gt;One surprising bit I got from one of the OPA sessions was that it can be used to authorize Linux PAM sessions. Could this
be useful for Wikimedia Toolforge?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20240401-kubecon-opa.png&quot; alt=&quot;OPA talk&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I attended several sessions related to authentication topics. I discovered the Keycloak software, which looks very
promising. I also attended an OAuth2 session which I had a hard time following, because I clearly lacked some additional
knowledge about how OAuth2 works internally.&lt;/p&gt;

&lt;p&gt;I also attended a couple of sessions that ended up being a vendor sales talk.&lt;/p&gt;

&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://phabricator.wikimedia.org/T279110&quot;&gt;Phabricator T279110: Replace PodSecurityPolicy in Toolforge Kubernetes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.keycloak.org/&quot;&gt;keycloak.org&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On container image builds, harbor registry, etc&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This topic was also of interest to me because, again, it is a core part of Wikimedia Toolforge.&lt;/p&gt;

&lt;p&gt;I attended a couple of sessions regarding container image builds, including topics like general best practices, image
minimization, and buildpacks. I learned about kpack, which at first sight felt like a nice simplification of how the
Toolforge build service was implemented.&lt;/p&gt;

&lt;p&gt;I also attended a session by the Harbor project maintainers where they shared some valuable information on things
happening soon or in the future, for example:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;a new Harbor command line interface is coming soon, though only a first iteration.&lt;/li&gt;
  &lt;li&gt;the Harbor operator, to install and manage Harbor, is looking for new maintainers, otherwise it will be archived.&lt;/li&gt;
  &lt;li&gt;the project is now experimenting with adding support for hosting more artifact types: Maven, npm, PyPI. I wonder if they will
consider hosting Debian .deb packages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On networking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I attended a couple of sessions regarding networking.&lt;/p&gt;

&lt;p&gt;I paid special attention to one session in particular, regarding network policies. It discussed new semantics being
added to the Kubernetes API.&lt;/p&gt;

&lt;p&gt;The different layers of abstraction being added to the API, the different hook points, and the override layers clearly
resembled (to me at least) the network packet filtering stack of the Linux kernel (netfilter), but without the 20-plus
years of experience building the right semantics and user interfaces.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20240401-kubecon-net.png&quot; alt=&quot;Network talk&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I very recently missed some semantics for limiting the number of open connections per namespace; see &lt;a href=&quot;https://phabricator.wikimedia.org/T356164&quot;&gt;Phabricator
T356164: [toolforge] several tools get periods of connection refused (104) when connecting to
wikis&lt;/a&gt;. This functionality is available in the lower-level tooling, I
mean Netfilter. I may submit a proposal upstream at some point, so they consider adding this to the Kubernetes API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final notes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In general, I believe I learned many things, and perhaps even more importantly I re-learned some stuff I had forgotten
because of lack of daily exposure. I’m really happy that the cloud native way of thinking was reinforced in me, which I
still need because most of my muscle memory for approaching systems architecture and engineering is from the old pre-cloud
days. That being said, I felt less engaged with the content of the conference schedule compared to last year. I don’t
know if the schedule itself was less interesting, or if I’m losing interest.&lt;/p&gt;

&lt;p&gt;Finally, not an official track in the conference, but we met a bunch of folks from
&lt;a href=&quot;https://www.wikimedia.de/&quot;&gt;Wikimedia Deutschland&lt;/a&gt;. We had a really nice time talking about how
&lt;a href=&quot;https://wikibase.cloud&quot;&gt;wikibase.cloud&lt;/a&gt; uses Kubernetes, whether they could run in Wikimedia Cloud Services, and why
structured data is so nice.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20240401-kubecon-group.png&quot; alt=&quot;Group photo&quot; /&gt;&lt;/p&gt;

</description>
        <pubDate>Mon, 01 Apr 2024 09:00:00 +0000</pubDate>
        <link>https://ral-arturo.org/2024/04/01/kubecon.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2024/04/01/kubecon.html</guid>
        
        
      </item>
    
      <item>
        <title>Back to the Wikimedia Foundation!</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/wmf-logo-black.png&quot; alt=&quot;Wikimedia Foundation logo&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In October 2023, I departed from the &lt;a href=&quot;https://wikimediafoundation.org&quot;&gt;Wikimedia Foundation&lt;/a&gt;, the non-profit organization
behind well-known projects like Wikipedia and others, to &lt;a href=&quot;https://ral-arturo.org/2023/11/20/spryker.html&quot;&gt;join Spryker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, in January 2024 Spryker conducted a round of layoffs reportedly due to budget and business reasons.
I was among those affected, being let go just three months after joining the company.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;Fortunately, the &lt;a href=&quot;https://www.mediawiki.org/wiki/Wikimedia_Cloud_Services_team&quot;&gt;Wikimedia Cloud Services team&lt;/a&gt;, where I previously worked, was still seeking to backfill my
position, so I reached out to them. They graciously welcomed me back as a Senior Site Reliability Engineer,
in the same team and position as before.&lt;/p&gt;

&lt;p&gt;Although this three-month career “detour” wasn’t the outcome I initially envisioned, I found it to be a valuable experience.
During this time, I gained knowledge in a new tech stack, based on AWS, and discovered new engineering methodologies.
Additionally, I had the opportunity to meet some wonderful individuals. I believe I have emerged stronger from this experience.&lt;/p&gt;

&lt;p&gt;Returning to the Wikimedia Foundation is truly motivating. It feels like a privilege to be part of this mature organization,
its community, and movement, with its inspiring &lt;a href=&quot;https://meta.wikimedia.org/wiki/Mission&quot;&gt;mission&lt;/a&gt; and &lt;a href=&quot;https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Values&quot;&gt;values&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In addition, I’m hoping that this also means I can once again dedicate a bit more attention to my FLOSS activities,
such as my duties within the Debian project.&lt;/p&gt;

&lt;p&gt;My email address is back online: &lt;a href=&quot;mailto:aborrero@wikimedia.org&quot;&gt;aborrero@wikimedia.org&lt;/a&gt;.
You can find me again in the IRC libera.chat server, in the usual wikimedia channels, nick &lt;em&gt;arturo&lt;/em&gt;.&lt;/p&gt;

</description>
        <pubDate>Tue, 13 Feb 2024 09:30:00 +0000</pubDate>
        <link>https://ral-arturo.org/2024/02/13/wmf.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2024/02/13/wmf.html</guid>
        
        <category>job</category>
        
        <category>wikimedia foundation</category>
        
        
      </item>
    
      <item>
        <title>OpenTofu: handcrafted include-file mechanism with YAML</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/opentofu-logo.png&quot; alt=&quot;Post logo&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I recently started playing with Terraform/&lt;a href=&quot;https://opentofu.org/&quot;&gt;OpenTofu&lt;/a&gt; almost on a daily basis.&lt;/p&gt;

&lt;p&gt;The other day I was working with Amazon Managed Prometheus (or AMP), and wanted to define Prometheus alert rules in YAML files.&lt;/p&gt;

&lt;p&gt;I decided that I needed a way to put the alerts in a bunch of files, and then have the declarative code load them into the correct
AMP workspace.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;I came up with this code pattern that I’m sharing here, for my future reference, and in case it is interesting to someone else.&lt;/p&gt;

&lt;p&gt;The YAML file where I specify the AMP workspace, and where the alert rule files live:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nn&quot;&gt;---&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;alert_files&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;my_alerts_production&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;workspace&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;production&quot;&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;alert_rules/production/*.yaml&quot;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;my_alerts_staging&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;workspace&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;staging&quot;&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;alert_rules/staging/*.yaml&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;files&lt;/code&gt; entry contains a file pattern. I will later expand the pattern using the &lt;a href=&quot;https://opentofu.org/docs/language/functions/fileset&quot;&gt;fileset()&lt;/a&gt; function.&lt;/p&gt;
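
&lt;p&gt;As a quick illustration (with hypothetical file names), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fileset()&lt;/code&gt; expands a glob pattern, relative to a base path, into the set of matching file names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ tofu console
&amp;gt; fileset(&quot;&quot;, &quot;alert_rules/production/*.yaml&quot;)
toset([
  &quot;alert_rules/production/cpu.yaml&quot;,
  &quot;alert_rules/production/memory.yaml&quot;,
])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;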

&lt;p&gt;Each rule file would be something like this:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nn&quot;&gt;---&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;my_rule_namespace&quot;&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;rule_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;# this is prometheus-specific config&lt;/span&gt;
  &lt;span class=&quot;s&quot;&gt;groups:&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;- name: &quot;example_alert_group&quot;&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;rules:&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;- alert: Example_Alert_Cpu&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;# just arbitrary values, to produce an example alert&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;expr: avg(rate(ecs_cpu_seconds_total{container=~&quot;something&quot;}[2m])) &amp;gt; 1&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;for: 10s&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;annotations:&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;summary: &quot;CPU usage is too high&quot;&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;description: &quot;The container average CPU usage is too high.&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I’m interested in the data structure mutating into something similar to this:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nn&quot;&gt;---&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;alert_files&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;my_alerts_production&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;workspace&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;production&quot;&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;alerts_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;rule_namespace_1&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;rule_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;# actual alert definition here&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;[..]&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;rule_namespace_2&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;rule_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;# actual alert definition here&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;[..]&lt;/span&gt;

  &lt;span class=&quot;na&quot;&gt;my_alerts_staging&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;workspace&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;staging&quot;&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;alerts_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;rule_namespace_1&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;rule_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;# actual alert definition here&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;[..]&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;rule_namespace_2&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;rule_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;|&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;# actual alert definition here&lt;/span&gt;
          &lt;span class=&quot;s&quot;&gt;[..]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is the algorithm that does the trick:&lt;/p&gt;

&lt;div class=&quot;language-terraform highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;locals&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;alerts_config&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;alert_files&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;files&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;fileset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;alertmanager_data&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;yamldecode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because of the declarative nature of the Terraform/OpenTofu language, I needed to implement 3 different for loops. Each loop
reads the map and transforms it in some way, passing the result into the next loop. A bit convoluted, if you ask me.&lt;/p&gt;

&lt;p&gt;To explain the logic, I think it makes more sense to read it from inside out.&lt;/p&gt;

&lt;p&gt;First loop:&lt;/p&gt;
&lt;div class=&quot;language-terraform highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;local&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;alert_files&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;files&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;fileset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This loop iterates the input YAML map in key-value pairs, remapping each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;amp&lt;/code&gt; entry, and expanding the file globs with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fileset()&lt;/code&gt; into a temporary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;files&lt;/code&gt; entry.&lt;/p&gt;

&lt;p&gt;Second loop:&lt;/p&gt;
&lt;div class=&quot;language-terraform highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# previous fileset() loop&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;amp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;alertmanager_data&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# yamldecode() loop&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This intermediate loop is responsible for building the final data structure. It iterates the previous &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fileset()&lt;/code&gt; loop
and remaps it by calling the next loop, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;yamldecode()&lt;/code&gt; one. Note how the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;amp&lt;/code&gt; entry is being “rebuilt” in each remap (the first loop
and this one); otherwise we would lose it!&lt;/p&gt;

&lt;p&gt;Third loop:&lt;/p&gt;
&lt;div class=&quot;language-terraform highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;alertmanager_data&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;in&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;err&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;nx&quot;&gt;yamldecode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And finally, this is maybe the easiest loop of the 3: we iterate the temporary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;files&lt;/code&gt; entry that was created in the first loop,
calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;yamldecode()&lt;/code&gt; for each of the file names generated by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fileset()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The resulting data structure should allow you to easily create resources later in a &lt;a href=&quot;https://opentofu.org/docs/language/meta-arguments/for_each/&quot;&gt;for_each&lt;/a&gt; loop.&lt;/p&gt;
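
&lt;p&gt;For example, here is a sketch of how the structure could be consumed. The resource type is from the Terraform AWS
provider; the flattening helper and the workspace ID lookup are assumptions for illustration, not part of the original
setup:&lt;/p&gt;

&lt;div class=&quot;language-terraform highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# flatten into one entry per (workspace, rule namespace) pair
locals {
  rule_namespaces = merge([
    for k, v in local.alerts_config : {
      for rule in v.alertmanager_data :
      &quot;${k}.${rule.name}&quot; =&amp;gt; {
        workspace = v.amp.workspace
        name      = rule.name
        data      = rule.rule_data
      }
    }
  ]...)
}

resource &quot;aws_prometheus_rule_group_namespace&quot; &quot;this&quot; {
  for_each     = local.rule_namespaces
  # hypothetical lookup of the AMP workspace ID by name
  workspace_id = local.amp_workspace_ids[each.value.workspace]
  name         = each.value.name
  data         = each.value.data
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;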

</description>
        <pubDate>Thu, 14 Dec 2023 15:00:00 +0000</pubDate>
        <link>https://ral-arturo.org/2023/12/14/opentofu.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2023/12/14/opentofu.html</guid>
        
        
      </item>
    
      <item>
        <title>New job at Spryker</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/spryker_logo.png&quot; alt=&quot;Post logo&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Last month, in October 2023, I started a new job as a Senior Site Reliability Engineer at the Germany-headquartered technology company
&lt;a href=&quot;https://spryker.com/&quot;&gt;Spryker&lt;/a&gt;. They are primarily focused on e-commerce and infrastructure businesses.&lt;/p&gt;

&lt;p&gt;I joined this company with excitement and curiosity, as I would be working with a new technology stack: AWS and Terraform. I had not been
directly exposed to them in the past, but I was aware of the vast industry impact of both.&lt;/p&gt;

&lt;p&gt;Suddenly, things like PXE boot, NIC firmware, and Linux kernel upgrades are mostly AWS problems.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;This is a full-time, 100% remote job position. I’m writing this post one month into the new position, and all I can say is that the
onboarding was smooth, and I found some good people in my team.&lt;/p&gt;

&lt;p&gt;Prior to joining Spryker, I was a Senior SRE at the Wikimedia Cloud Services team at the Wikimedia Foundation. I had been there
&lt;a href=&quot;https://ral-arturo.org/2017/10/23/first-day-wmf.html&quot;&gt;since 2017, for 6 years&lt;/a&gt;. It was a privilege working there.&lt;/p&gt;

</description>
        <pubDate>Mon, 20 Nov 2023 07:44:00 +0000</pubDate>
        <link>https://ral-arturo.org/2023/11/20/spryker.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2023/11/20/spryker.html</guid>
        
        
      </item>
    
      <item>
        <title>Wikimedia Hackathon 2023 Athens summary</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20230530-hackathon-logo.png&quot; alt=&quot;Post logo&quot; /&gt;&lt;/p&gt;

&lt;p&gt;During the weekend of 19-23 May 2023 I attended the &lt;a href=&quot;https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2023&quot;&gt;Wikimedia hackathon 2023&lt;/a&gt; in Athens,
Greece. The event reunited folks interested in the more technological aspects of the Wikimedia movement in person for the
first time &lt;a href=&quot;https://ral-arturo.org/2019/08/28/wikimania2019.html&quot;&gt;since 2019&lt;/a&gt;. The scope of the hacking projects included (but was not limited to)
tools, Wikipedia bots, gadgets, server and network infrastructure, data, and other technical systems.&lt;/p&gt;

&lt;p&gt;My role in the event was twofold: on one hand, I was there because of my role as an SRE in the Wikimedia Cloud Services team, which provides
very valuable services to the community, and I was expected to support the technical contributors of the movement who were around. Additionally, and
because of that same role, I did some hacking myself too, which was especially fruitful given that I generally collaborate on a daily basis with some
community members who were present in the hacking room.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The hackathon also had a conference-style track, and I ran a session with my coworker Bryan called
&lt;a href=&quot;https://phabricator.wikimedia.org/T333939&quot;&gt;Past, Present and Future of Wikimedia Cloud Services (Toolforge and friends)&lt;/a&gt;
&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Past_Present_and_Future_of_WMCS.pdf&quot;&gt;(slides)&lt;/a&gt;, which was very satisfying to deliver given how friendly the space was.
I attended a bunch of other sessions, and all of them were interesting and well presented. The number of ML themes present in
the program schedule was exciting. I definitely learned a lot from attending those sessions, from how LLMs work and some
fascinating applications for them in the Wikimedia space, to current industry trends for training and hosting ML models.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20230531-hackathon-session.png&quot; alt=&quot;Session&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Sessions aside, the main purpose of the hackathon was, well, hacking. While I was in the hacking space for more than 12 hours each day, my
ability to get things done was greatly reduced by the constant conversations, help requests, and other social interactions with folks. Don’t get
me wrong, I embraced that reality with joy, because the social bonding aspect is perhaps the main reason why we gathered in person instead of
virtually.&lt;/p&gt;

&lt;p&gt;That being said, this is a rough list of what I did:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Helped review the status of &lt;a href=&quot;https://phabricator.wikimedia.org/T319593&quot;&gt;Migrate bldrwnsch from Toolforge GridEngine to Toolforge Kubernetes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Helped the maintainer of the &lt;a href=&quot;https://phabricator.wikimedia.org/T337190&quot;&gt;bodh Toolforge tool get it working with a reverse proxy&lt;/a&gt;
and started conversations about &lt;a href=&quot;https://phabricator.wikimedia.org/T337191&quot;&gt;how to facilitate the use case&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Discussed persistent storage options in Toolforge, which resulted in &lt;a href=&quot;https://phabricator.wikimedia.org/T337192&quot;&gt;some conversations within the WMCS team&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Debated with several folks what computing abstractions we should provide, including Toolforge as a PaaS, raw Kubernetes access, or even whether
we should continue offering virtual machines to the community.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/commit/5ce8bb56d64d490009a39fada0293e3ed4e3ef53&quot;&gt;Patched toolforge-weld&lt;/a&gt; to
support some stuff required by &lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Jobs_framework&quot;&gt;jobs-framework-cli&lt;/a&gt;.
Also made a release of it.&lt;/li&gt;
  &lt;li&gt;Started &lt;a href=&quot;https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-cli/+/921412&quot;&gt;a patch for jobs-framework-cli&lt;/a&gt; to
use &lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld&quot;&gt;toolforge-weld&lt;/a&gt;, which is still unfinished as of this writing.&lt;/li&gt;
  &lt;li&gt;Debated about how to host ML models, what to do with GPUs etc.&lt;/li&gt;
  &lt;li&gt;Supported fellow SRE Riccardo in working on &lt;a href=&quot;https://phabricator.wikimedia.org/T335759&quot;&gt;cloud-private subnet: introduce new domain&lt;/a&gt; to
integrate with &lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Netbox&quot;&gt;Netbox&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Uncovered &lt;a href=&quot;https://phabricator.wikimedia.org/T337010&quot;&gt;a flavor definition problem in our OpenStack&lt;/a&gt; deployment.&lt;/li&gt;
  &lt;li&gt;Reviewed many Toolforge account requests (most of them not related to the hackathon though), some quota requests and similar things.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hackathon also marked the final days of Technical Engagement as an umbrella group for the WMCS and Developer Advocacy teams within the Technology
department of the Wikimedia Foundation, because of an internal reorg. We used the chance to reflect on the pleasant time we had had together since 2019
and to take a final picture of the few of us who were at the event in person.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20230531-hackathon-te.png&quot; alt=&quot;Technical Engagement&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It wasn’t the first Wikimedia Hackathon for me, and I felt the same as in previous iterations: it was a welcoming space, and I was surrounded by
friends and nice human beings. I ended the event with a profound feeling of being privileged, because I was part of the Wikimedia movement, and
because I was invited to participate in it.&lt;/p&gt;

</description>
        <pubDate>Wed, 31 May 2023 12:11:00 +0000</pubDate>
        <link>https://ral-arturo.org/2023/05/31/hackathon.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2023/05/31/hackathon.html</guid>
        
        <category>wikimedia</category>
        
        
      </item>
    
      <item>
        <title>Kubecon and CloudNativeCon 2023 Europe summary</title>
        <description>&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20230427-kubecon-logo.png&quot; alt=&quot;Post logo&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This post serves as a report on my attendance at Kubecon and CloudNativeCon 2023 Europe, which took place in
Amsterdam in April 2023. It was my second time attending this conference in person; the first was in
Austin, Texas (USA) in 2017. I also attended once virtually.&lt;/p&gt;

&lt;p&gt;The content here is mostly written for the sake of my own recollection and learning, based on
the notes I took during the event.&lt;/p&gt;

&lt;!--more--&gt;

&lt;p&gt;The very first session was the opening keynote, which reunited the whole crowd to bootstrap the event and
share the excitement about the days ahead. Some astonishing numbers were announced: there were more than
10,000 people attending, and it could apparently be said with confidence that it was the largest open source
technology conference held in Europe in recent times.&lt;/p&gt;

&lt;p&gt;It was also communicated that the next couple of iterations of the event would be held in China in September 2023
and in Paris in March 2024.&lt;/p&gt;

&lt;p&gt;More numbers: the CNCF was hosting about 159 projects, involving 1,300 maintainers and about 200,000
contributors. The cloud-native community is ever-growing, and there seems to be a strong trend in the
industry toward cloud-native technology adoption and all things related to PaaS and IaaS.&lt;/p&gt;

&lt;p&gt;The event program had different tracks, and each one offered an interesting mix of low-level and higher-level
talks for a variety of audiences. On many occasions I found that reading the talk title alone was not
enough to know in advance whether a talk was a 101 kind of thing or aimed at experienced engineers. But unlike in
previous editions, I didn’t have the feeling that the purpose of the conference was to try to sell me
anything. Obviously, speakers would make sure to mention, or highlight in a subtle way, the involvement of a
given company in a given solution or piece of the ecosystem. But it was non-invasive and fair enough for me.&lt;/p&gt;

&lt;p&gt;On a different note, I found the breakout rooms often too small. I think there were only a couple of rooms
that could accommodate more than 500 people, which is a fairly small allowance for 10k attendees. I realized
with frustration that the more interesting talks filled up immediately, with people waiting in line
some 45 minutes before the session time. Because of this, I missed a few important sessions that I’ll
hopefully watch online later.&lt;/p&gt;

&lt;p&gt;Finally, on the more technical side, I learned many things, which I’ll group by topic rather than by session,
given how some subjects were mentioned in several talks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On gitops and CI/CD pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most of the mentions went to &lt;a href=&quot;https://fluxcd.io/&quot;&gt;FluxCD&lt;/a&gt; and &lt;a href=&quot;https://argoproj.github.io/cd/&quot;&gt;ArgoCD&lt;/a&gt;. At
that point there was no doubt that gitops was a mature approach and that both Flux and ArgoCD could do an
excellent job. ArgoCD seemed engineered to be a more general-purpose CD pipeline, while Flux
felt more tailored to simpler gitops setups. I discovered that both have nice web user interfaces that
I wasn’t previously familiar with.&lt;/p&gt;

&lt;p&gt;However, in two different talks I got the impression that while the initial setup is simple, migrating
your current workflow to gitops can be a bumpy ride. That is, the challenge is not deploying
flux/argo itself, but moving everything into a state that both humans and flux/argo can understand. I also
saw some curious mentions of the config drift that can happen in some cases, even though the goal of gitops is
precisely for that never to happen. Such mentions were usually accompanied by some hints on how to handle
the situation by hand.&lt;/p&gt;

&lt;p&gt;Worth mentioning: I missed practical information about one of the key pieces of this whole gitops story:
building container images. Most of the showcased scenarios used pre-built container images, so in that
sense they were simple. Building and pushing to an image registry is one of the two key points we would need
to solve in Toolforge Kubernetes if we adopted gitops.&lt;/p&gt;

&lt;p&gt;In general, even if gitops was already on our radar for
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_Kubernetes_component_workflow_improvements&quot;&gt;Toolforge Kubernetes&lt;/a&gt;, 
I think it climbed a few steps up my priority list after the conference.&lt;/p&gt;

&lt;p&gt;Another learning was this site: &lt;a href=&quot;https://opengitops.dev/&quot;&gt;https://opengitops.dev/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20230427-kubecon-group.png&quot; alt=&quot;Group&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On etcd, performance and resource management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I attended a talk focused on etcd performance tuning that was very encouraging. They were basically talking 
about the &lt;a href=&quot;https://phabricator.wikimedia.org/T333929&quot;&gt;exact&lt;/a&gt; 
&lt;a href=&quot;https://phabricator.wikimedia.org/T267966&quot;&gt;same&lt;/a&gt; &lt;a href=&quot;https://phabricator.wikimedia.org/T333931&quot;&gt;problems&lt;/a&gt; we 
have had in Toolforge Kubernetes, like api-server and etcd failure modes, and how sensitive etcd is to &lt;a href=&quot;https://phabricator.wikimedia.org/T283296&quot;&gt;disk 
latency&lt;/a&gt;, IO pressure, and network throughput. Even though 
the Toolforge Kubernetes scale is small compared to other Kubernetes deployments out there, I found it very 
interesting to see others’ approaches to the same set of challenges.&lt;/p&gt;

&lt;p&gt;I learned how most Kubernetes components and apps can overload the api-server, because even the api-server 
talks to itself. Simple things like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; may have a completely different impact on the API depending on 
usage, for example when listing all objects (very expensive) versus a single object.&lt;/p&gt;

&lt;p&gt;The conclusion was to avoid hitting the api-server with LIST calls and to use ResourceVersion, which 
avoids full dumps from etcd (full dumps are, by the way, what you get with bare &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl get&lt;/code&gt; calls). I 
already knew some of this; for example, the &lt;em&gt;jobs-framework-emailer&lt;/em&gt; was already making use of this 
&lt;a href=&quot;https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/toolforge/jobs-framework-emailer/+/refs/heads/main/emailer/events.py#428&quot;&gt;ResourceVersion functionality&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There have been a lot of improvements on the performance side of Kubernetes in recent times, or more 
specifically, in how resources are managed and used by the system. I saw a review of resource management from 
the perspective of the container runtime and kubelet, and plans to support fancy things like topology-aware 
scheduling decisions and dynamic resource claims (changing a pod’s resource claims without 
re-defining/restarting the pod).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On cluster management, bootstrapping and multi-tenancy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I attended a couple of talks that mentioned kubeadm, and one in particular was from the maintainers 
themselves. This was of interest to me because as of today &lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying&quot;&gt;we use it for 
Toolforge&lt;/a&gt;. They shared all 
the latest developments and improvements, and the plans and roadmap for the future, with a special mention to 
something they called “kubeadm operator”, apparently capable of auto-upgrading the cluster, auto-renewing 
certificates and such.&lt;/p&gt;

&lt;p&gt;I also saw a comparison between the different cluster bootstrappers, which to me confirmed that kubeadm was 
the best, given its well-established and well-known workflow and its very 
active contributor base. The kubeadm developers invited the audience to submit feature requests,
&lt;a href=&quot;https://github.com/kubernetes/kubeadm/issues/2866&quot;&gt;so I did&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The different talks confirmed that the basic unit for multi-tenancy in Kubernetes is the namespace, and any 
serious multi-tenant usage should leverage it. There were some ongoing conversations, in official sessions 
and in the hallway, about the right tool to implement K8s-within-K8s, and &lt;a href=&quot;https://www.vcluster.com/&quot;&gt;vcluster&lt;/a&gt;
was mentioned enough times for me to be convinced it was the right candidate. This was despite my impression
that multiclusters / multicloud are regarded as hard topics in the general community. I definitely would like to play
with it sometime down the road.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On networking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I attended a couple of basic sessions that served really well to help me understand how Kubernetes instruments the 
network to achieve its goals. The conference program had sessions covering topics ranging from network 
debugging recommendations and CNI implementations to IPv6 support. Also, one of the keynote sessions 
mentioned that kube-proxy is not able to perform NAT for SIP connections, which is interesting because I 
believe Netfilter Conntrack could do it if properly configured. One of the conclusions on the CNI front was 
that Calico has massive community adoption (in Netfilter mode), which is reassuring, especially considering 
it is &lt;a href=&quot;https://gitlab.wikimedia.org/repos/cloud/toolforge/calico&quot;&gt;the one we use for Toolforge Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://ral-arturo.org/assets/20230427-kubecon-slide.png&quot; alt=&quot;Slide&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On jobs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I attended a couple of talks related to HPC/grid-like usages of Kubernetes. I was truly impressed 
by some folks out there who were using Kubernetes Jobs on massive scales, for tasks such as training machine learning 
models and other fancy AI projects.&lt;/p&gt;

&lt;p&gt;It is acknowledged in the community that the early implementations of things like Jobs and CronJobs had some 
limitations that are now gone, or at least greatly reduced, and some new functionality has been added as 
well. Indexed Jobs, for example, give each Pod in a Job a number (index) so it can process a chunk of a larger 
batch of data based on that index. This allows for full grid-like features such as sequential (or again, 
indexed) processing, coordination between the Job’s pods, and more graceful Job restarts. My first reaction was: is that 
something we would like to enable in the &lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework&quot;&gt;Toolforge Jobs Framework&lt;/a&gt;?&lt;/p&gt;
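
&lt;p&gt;As a minimal sketch of what an Indexed Job manifest looks like (the name, image, and command here are hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-example
spec:
  completionMode: Indexed  # each completion gets its own index
  completions: 5           # indexes 0..4
  parallelism: 2           # at most 2 pods running at once
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          # the index is exposed to each pod via the JOB_COMPLETION_INDEX
          # env var, so each pod can pick its own chunk of the batch
          command: [&quot;sh&quot;, &quot;-c&quot;, &quot;echo processing chunk $JOB_COMPLETION_INDEX&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;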

&lt;p&gt;&lt;strong&gt;On policy and security&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A surprisingly good number of sessions covered interesting topics related to policy and security. It was nice 
to learn two realities:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Kubernetes is capable of doing pretty much anything security-wise and of creating 
highly secure environments.&lt;/li&gt;
  &lt;li&gt;It does not do so by default; the defaults are not security-strict on purpose.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It kind of made sense to me: Kubernetes is used for a wide range of use cases, and its developers couldn’t know 
beforehand which particular setup the default security levels should accommodate.&lt;/p&gt;

&lt;p&gt;One session in particular covered the most basic security features that should be enabled for any Kubernetes 
system that would get exposed to random end users. In my opinion, the Toolforge Kubernetes setup was already 
doing a good job in that regard. To my joy, some sessions referred to the Pod Security Admission mechanism, 
which is one of the key security features we’re about to adopt (when migrating away from
&lt;a href=&quot;https://phabricator.wikimedia.org/T279110&quot;&gt;Pod Security Policy&lt;/a&gt;).&lt;/p&gt;
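
&lt;p&gt;For reference, Pod Security Admission is driven by namespace labels; a minimal sketch (the namespace name is hypothetical):&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: tool-example
  labels:
    # reject pods that do not meet the &quot;restricted&quot; policy...
    pod-security.kubernetes.io/enforce: restricted
    # ...and warn about violations of the latest version of it
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;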

&lt;p&gt;I also learned a bit more about Secret resources, their current implementation and how to leverage a 
combo of CSI and RBAC for a more secure usage of external secrets.&lt;/p&gt;

&lt;p&gt;Finally, one of the major takeaways from the conference was learning about &lt;a href=&quot;https://kyverno.io/&quot;&gt;kyverno&lt;/a&gt; and
&lt;a href=&quot;https://github.com/Shopify/kubeaudit&quot;&gt;kubeaudit&lt;/a&gt;. I was previously aware of
&lt;a href=&quot;https://open-policy-agent.github.io/gatekeeper/website/docs/&quot;&gt;OPA Gatekeeper&lt;/a&gt;. From the several demos I saw, it
seemed to me that kyverno could help us make Toolforge Kubernetes more sustainable by replacing all of our
&lt;a href=&quot;https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Custom_components&quot;&gt;custom admission controllers&lt;/a&gt;
with it. I already opened a ticket to &lt;a href=&quot;https://phabricator.wikimedia.org/T335131&quot;&gt;track this idea&lt;/a&gt;, which I’ll be
proposing to my team soon.&lt;/p&gt;
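
&lt;p&gt;To illustrate the kind of rule kyverno can express declaratively, here is a minimal sketch of a validation policy (the policy name and rule are hypothetical, not one of our actual admission controls):&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root
spec:
  validationFailureAction: Enforce  # reject non-compliant pods
  rules:
    - name: check-run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: &quot;Pods must set runAsNonRoot to true.&quot;
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;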

&lt;p&gt;&lt;strong&gt;Final notes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In general, I believe I learned many things, and perhaps even more importantly I re-learned some stuff I had 
forgotten due to a lack of daily exposure. I’m really happy that the cloud-native way of thinking was 
reinforced in me, which I still need because most of my muscle memory for approaching systems architecture and 
engineering comes from the old pre-cloud days.&lt;/p&gt;

&lt;p&gt;List of sessions I attended on the first day:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Keynote&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kccnceu2023.sched.com/event/1HyVB&quot;&gt;Node Resource Management: The Big Picture - Sascha Grunert &amp;amp; Swati Sehgal, Red Hat; Alexander Kanevskiy, Intel; Evan Lezar, NVIDIA; David Porter, Google.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyYu&quot;&gt;How We Securely Scaled Multi-Tenancy with VCluster, Crossplane, and Argo CD - Ilia Medvedev &amp;amp; Kostis Kapelonis, Codefresh.&lt;/a&gt; &lt;em&gt;(Couldn’t really attend, room full)&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HySr&quot;&gt;Flux Beyond Git: Harnessing the Power of OCI - Stefan Prodan &amp;amp; Hidde Beydals, Weaveworks.&lt;/a&gt; &lt;em&gt;(Couldn’t really attend, room full)&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyWa&quot;&gt;Tutorial: Measure Twice, Cut Once: Dive Into Network Foundations the Right Way! - Marino Wijay &amp;amp; Jason Skrzypek, Solo.io&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HySi&quot;&gt;Argo CD Core - A Pure GitOps Agent for Kubernetes - Alexander Matyushentsev, Akuity &amp;amp; Leonardo Luz Almeida, Intuit&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1Iki0&quot;&gt;Kubeadm Deep Dive - Rohit Anand, NEC &amp;amp; Paco Xu, Dao&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1Hyd1&quot;&gt;Cloud Operate Multi-Tenancy Service Mesh with ArgoCD in Production - Lin Sun, Solo.io &amp;amp; Faseela K, Ericsson Software Technology&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;List of sessions I attended on the second day:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Keynote&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1Hycg&quot;&gt;Setting up Etcd with Kubernetes to Host Clusters with Thousands of Nodes - Marcel Zięba, Isovalent &amp;amp; Laurent Bernaille, Datadog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HybZ&quot;&gt;Container Is the New VM: The Paradigm Change No One Explained to You - Marga  Manterola, Isovalent &amp;amp; Rodrigo Campos Catelin, Microsoft&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyXe&quot;&gt;Ephemeral Clusters as a Service with ClusterAPI and GitOps - Alessandro Vozza, Solo.io &amp;amp; Joaquin Rodriguez, Microsoft&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HydS&quot;&gt;Automating Configuration and Permissions Testing for GitOps with OPA Conftest - Eve Ben Ezra &amp;amp; Michael Hume, The New York Times&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyVT&quot;&gt;Across Kubernetes Namespace Boundaries: Your Volumes Can Be Shared Now! - Masaki Kimura &amp;amp; Takafumi Takahashi, Hitachi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;List of sessions I attended on the third day:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Keynote&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1Hyb5&quot;&gt;Prevent Embarrassing Cluster Takeovers with This One Simple Trick! - Daniele de Araujo dos Santos &amp;amp; Shane Lawrence, Shopify&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1M6nq&quot;&gt;Hacking and Defending Kubernetes Clusters: We’ll Do It LIVE!!! - Fabian Kammel &amp;amp; James Cleverley-Prance, ControlPlane&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyUn&quot;&gt;Painless Multi-Cloud to the Edge Powered by NATS &amp;amp; Kubernetes - Tomasz Pietrek &amp;amp; David Gee, Synadia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyWI&quot;&gt;Demystifing IPv6 Kubernetes - Antonio Jose Ojea Garcia, Google &amp;amp; Fernando Gont, Yalo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyTa&quot;&gt;Open Policy Agent. (OPA) Intro &amp;amp; Deep Dive - Charlie Egan, Styra, Inc.&lt;/a&gt; &lt;em&gt;(Couldn’t really attend, room full)&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1Hydn&quot;&gt;Practical Challenges with Pod Security Admission - V Körbes &amp;amp; Christian Schlotter, VMware&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyaG&quot;&gt;Enabling HPC and ML Workloads with the Latest Kubernetes Job Features - Michał Woźniak, Google &amp;amp; Vanessa Sochat, Lawrence Livermore National Laboratory&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sched.co/1HyY5&quot;&gt;Can You Keep a Secret? on Secret Management in Kubernetes - Liav Yona &amp;amp; Gal Cohen, Firefly&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The videos have been &lt;a href=&quot;https://youtube.com/playlist?list=PLj6h78yzYM2PyrvCoOii4rAopBswfz1p7&quot;&gt;published on YouTube&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Thu, 27 Apr 2023 10:47:00 +0000</pubDate>
        <link>https://ral-arturo.org/2023/04/27/kubecon.html</link>
        <guid isPermaLink="true">https://ral-arturo.org/2023/04/27/kubecon.html</guid>
        
        <category>kubecon</category>
        
        
      </item>
    
  </channel>
</rss>
