Running Kubernetes in Production:
A Million Ways to Crash Your Cluster
HENNING JACOBS
@try_except_
2018-12-05
Image
Image
4
ZALANDO AT A GLANCE
~ 4.5billion EUR
revenue 2017
> 200
million
visits
per
month
> 15.000
employees in
Europe
> 70%
of visits via
mobile devices
> 24
million
active customers
> 300.000
product choices
~ 2.000
brands
17
countries
Black
Friday
2018
> 4,200
orders per minute
6
SCALE
100Clusters
373Accounts
7
DEVELOPERS USING KUBERNETES
8
46+ cluster
components
INCIDENTS ARE FINE
10
INCIDENT #1: CUSTOMER IMPACT
11
INCIDENT #1: IAM RETURNING 404
12
INCIDENT #1: NUMBER OF PODS
13
LIFE OF A REQUEST (INGRESS)
Node Node
MyApp MyApp MyApp
EC2 network
K8s network
TLS
HTTP
Skipper Skipper
ALB
14
ROUTES FROM API SERVER
Node Node
MyApp MyApp MyApp
Skipper
ALBAPI Server
Skipper
15
API SERVER DOWN
Node Node
MyApp MyApp MyApp
Skipper
ALBAPI Server
Skipper
OOMKill
16
INCIDENT #1: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
spec:
schedule: "*/15 9-19 * * Mon-Fri"
jobTemplate:
spec:
template:
spec:
restartPolicy: Never
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
containers:
...
17
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
name: "foobar"
spec:
schedule: "7 8-18 * * Mon-Fri"
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
jobTemplate:
spec:
activeDeadlineSeconds: 120
template:
spec:
restartPolicy: Never
containers:
18
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Ingress to stay “healthy” during API server problems
• Fix Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
spec:
hard:
pods: "1500"
19
INCIDENT #2: CLUSTER DOWN
20
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
21
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
22
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
24
INCIDENT #2: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots
25
INCIDENT #3: API LATENCY SPIKES
26
INCIDENT #3: CONNECTION ISSUES
...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...
Master Node
API Server
etcd
etcd-member
27
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
SLEEPTIME=60
while true; do
echo "sleep for $SLEEPTIME seconds"
sleep $SLEEPTIME
timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
if [ $? -eq 0 ]; then
echo "all fine, no need to restart etcd member"
continue
else
echo "restarting etcd-member"
systemctl restart etcd-member
fi
done
28
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
29
INCIDENT #3: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
30
INCIDENT #4: IMPACT
Ingress
5XXs
31
INCIDENT #4: CLUSTER DOWN?
32
INCIDENT #4: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
34
CLUSTER UPGRADE
FLOW
35
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
36
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel Description Clusters
dev Development and playground clusters. 3
alpha Main infrastructure cluster (important to us). 1
beta
Product clusters for the rest of the
organization (prod/test). 90+
37
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
38
RUNNING E2E TESTS (BEFORE)
Control plane
nodenode
branch: dev
Create Cluster Run e2e tests Delete Cluster
Testing dev to alpha upgrade
Control plane Control plane
39
RUNNING E2E TESTS (NOW)
Control plane
nodenode
Control plane
nodenode
branch: alpha (base) branch: dev (head)
Create Cluster Update Cluster Run e2e tests Delete Cluster
Testing dev to alpha upgrade
Control plane Control plane
40
INCIDENT #4: LESSONS LEARNED
• Automated e2e tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
41
INCIDENT #5: IMPACT
[4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied
one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 ..
[5:01 PM] Alice: Now it does not start the build step at all
[5:02 PM] John: +1
[5:02 PM] John: Failed to create builder pod: …
[5:02 PM] Pedro: +1
[5:04 PM] Damien: +1
[5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which
results in many problems…
...
42
INCIDENT #5: IMPACT
43
INCIDENT #5: A VERY INNOCENT PULL REQUEST
44
INCIDENT #5: WHAT HAPPENED
• Deployment caused rebuild with latest stable Go version
• Library for signature verification was incompatible with Go 1.10,
causing all verification checks to fail during runtime.
• Lack of unit/smoke tests and alerting for one component
• "Near miss": outage could have had large impact
45
INCIDENT #6: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume
"outfit-delivery-api-credentials" :
secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
46
INCIDENT #6: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
47
INCIDENT #6: CPU THROTTLING
48
INCIDENT #6: WHAT HAPPENED
Scaled down IAM provider
to reduce Slack
+ Number of deployments increased
⇒ Process could not process credentials fast enough
49
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
SLACK
CPU
Memory
Node
"Slack"
50
DISABLING CPU THROTTLING
[Announcement] CPU limits will be disabled
⇒ Ingress Latency Improvements
kubelet … --cpu-cfs-quota=false
51
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s),
switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have
issues with timeouts
• 502's during cluster updates: race condition during network setup
52
MORE TOPICS
• Graceful Pod shutdown and
race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker
53
RACE CONDITIONS..
• Switch to the latest Docker version available to fix the issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition
•
github.com/zalando-incubator/kubernetes-on-aws
54
TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
55
DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab0
39cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
WELCOME TO
CLOUD NATIVE!
57
58
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
59
KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
https://github.com/hjacobs/kube-ops-view
61
OTHER TALKS
• Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017
• Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018
• Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency
and Latency - HighLoad++ 2018
We need more failure talks!
QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k

Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevOpsCon Munich 2018

  • 1.
    Running Kubernetes inProduction: A Million Ways to Crash Your Cluster HENNING JACOBS @try_except_ 2018-12-05
  • 4.
    4 ZALANDO AT AGLANCE ~ 4.5billion EUR revenue 2017 > 200 million visits per month > 15.000 employees in Europe > 70% of visits via mobile devices > 24 million active customers > 300.000 product choices ~ 2.000 brands 17 countries
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
    11 INCIDENT #1: IAMRETURNING 404
  • 12.
  • 13.
    13 LIFE OF AREQUEST (INGRESS) Node Node MyApp MyApp MyApp EC2 network K8s network TLS HTTP Skipper Skipper ALB
  • 14.
    14 ROUTES FROM APISERVER Node Node MyApp MyApp MyApp Skipper ALBAPI Server Skipper
  • 15.
    15 API SERVER DOWN NodeNode MyApp MyApp MyApp Skipper ALBAPI Server Skipper OOMKill
  • 16.
    16 INCIDENT #1: INNOCENTMANIFEST apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" spec: schedule: "*/15 9-19 * * Mon-Fri" jobTemplate: spec: template: spec: restartPolicy: Never concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 containers: ...
  • 17.
    17 INCIDENT #1: FIXEDCRON JOB apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" spec: schedule: "7 8-18 * * Mon-Fri" concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 jobTemplate: spec: activeDeadlineSeconds: 120 template: spec: restartPolicy: Never containers:
  • 18.
    18 INCIDENT #1: LESSONSLEARNED • ALB routes traffic to ALL hosts if all hosts report “unhealthy” • Fix Ingress to stay “healthy” during API server problems • Fix Ingress to retain last known set of routes • Use quota for number of pods apiVersion: v1 kind: ResourceQuota metadata: name: compute-resources spec: hard: pods: "1500"
  • 19.
  • 20.
    20 INCIDENT #2: MANUALOPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  • 21.
    21 INCIDENT #2: RTFM %etcdctl del -r /registry-kube-1/certificatesigningrequest prefix help: etcdctl del [options] <key> [range_end]
  • 22.
    22 Junior Engineers areFeatures, not Bugs https://www.youtube.com/watch?v=cQta4G3ge44
  • 23.
  • 24.
    24 INCIDENT #2: LESSONSLEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots
  • 25.
    25 INCIDENT #3: APILATENCY SPIKES
  • 26.
    26 INCIDENT #3: CONNECTIONISSUES ... Kubernetes worker and master nodes sporadically fail to connect to etcd causing timeouts in the APIserver and disconnects in the pod network. ... Master Node API Server etcd etcd-member
  • 27.
    27 INCIDENT #3: STOPTHE BLEEDING #!/bin/bash SLEEPTIME=60 while true; do echo "sleep for $SLEEPTIME seconds" sleep $SLEEPTIME timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null if [ $? -eq 0 ]; then echo "all fine, no need to restart etcd member" continue else echo "restarting etcd-member" systemctl restart etcd-member fi done
  • 28.
    28 INCIDENT #3: CONFIRMATIONFROM AWS [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  • 29.
    29 INCIDENT #3: LESSONSLEARNED • It's never the AWS infrastructure until it is • Treat t2 instances with care • Kubernetes components are not necessarily "cloud native" Cloud Native? Declarative, dynamic, resilient, and scalable
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35.
    35 CLUSTER LIFECYCLE MANAGER(CLM) github.com/zalando-incubator/cluster-lifecycle-manager
  • 36.
    36 CLUSTER CHANNELS github.com/zalando-incubator/kubernetes-on-aws Channel DescriptionClusters dev Development and playground clusters. 3 alpha Main infrastructure cluster (important to us). 1 beta Product clusters for the rest of the organization (prod/test). 90+
  • 37.
    37 E2E TESTS ONEVERY PR github.com/zalando-incubator/kubernetes-on-aws
  • 38.
    38 RUNNING E2E TESTS(BEFORE) Control plane nodenode branch: dev Create Cluster Run e2e tests Delete Cluster Testing dev to alpha upgrade Control plane Control plane
  • 39.
    39 RUNNING E2E TESTS(NOW) Control plane nodenode Control plane nodenode branch: alpha (base) branch: dev (head) Create Cluster Update Cluster Run e2e tests Delete Cluster Testing dev to alpha upgrade Control plane Control plane
  • 40.
    40 INCIDENT #4: LESSONSLEARNED • Automated e2e tests are pretty good, but not enough • Test the diff/migration automatically • Bootstrap new cluster with previous configuration • Apply new configuration • Run end-to-end & conformance tests
  • 41.
    41 INCIDENT #5: IMPACT [4:59PM] Marc: There is a error during build - forbidden: image policy webhook backend denied one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 .. [5:01 PM] Alice: Now it does not start the build step at all [5:02 PM] John: +1 [5:02 PM] John: Failed to create builder pod: … [5:02 PM] Pedro: +1 [5:04 PM] Damien: +1 [5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which results in many problems… ...
  • 42.
  • 43.
    43 INCIDENT #5: AVERY INNOCENT PULL REQUEST
  • 44.
    44 INCIDENT #5: WHATHAPPENED • Deployment caused rebuild with latest stable Go version • Library for signature verification was incompatible with Go 1.10, causing all verification checks to fail during runtime. • Lack of unit/smoke tests and alerting for one component • "Near miss": outage could have had large impact
  • 45.
    45 INCIDENT #6: IMPACT Errorduring Pod creation: MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found ⇒ All new Kubernetes deployments fail
  • 46.
    46 INCIDENT #6: CREDENTIALSQUEUE 17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20 17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20 17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20 .. 17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20 .. 17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20 .. 19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
  • 47.
  • 48.
    48 INCIDENT #6: WHATHAPPENED Scaled down IAM provider to reduce Slack + Number of deployments increased ⇒ Process could not process credentials fast enough
  • 49.
    49 CPU/memory requests "block"resources on nodes. Difference between actual usage and requests → Slack SLACK CPU Memory Node "Slack"
  • 50.
    50 DISABLING CPU THROTTLING [Announcement]CPU limits will be disabled ⇒ Ingress Latency Improvements kubelet … --cpu-cfs-quota=false
  • 51.
    51 A MILLION WAYSTO CRASH YOUR CLUSTER? • Switch to latest Docker to fix issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts • 502's during cluster updates: race condition during network setup
  • 52.
    52 MORE TOPICS • GracefulPod shutdown and race conditions (endpoints, Ingress) • Incompatible Kubernetes changes • CoreOS ContainerLinux "stable" won't boot • Kubernetes EBS volume handling • Docker
  • 53.
    53 RACE CONDITIONS.. • Switchto the latest Docker version available to fix the issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts • 502's during cluster updates: race condition • github.com/zalando-incubator/kubernetes-on-aws
  • 54.
    54 TIMEOUTS TO APISERVER.. github.com/zalando-incubator/kubernetes-on-aws
  • 55.
  • 56.
  • 57.
  • 58.
    58 OPEN SOURCE Kubernetes onAWS github.com/zalando-incubator/kubernetes-on-aws AWS ALB Ingress controller github.com/zalando-incubator/kube-ingress-aws-controller Skipper HTTP Router & Ingress controller github.com/zalando/skipper External DNS github.com/kubernetes-incubator/external-dns Postgres Operator github.com/zalando-incubator/postgres-operator Kubernetes Resource Report github.com/hjacobs/kube-resource-report Kubernetes Downscaler github.com/hjacobs/kube-downscaler
  • 59.
  • 60.
  • 61.
    61 OTHER TALKS • Nordstrom:101 Ways to Crash Your Cluster - KubeCon 2017 • Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018 • Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - HighLoad++ 2018 We need more failure talks!
  • 62.
    QUESTIONS? HENNING JACOBS HEAD OF DEVELOPERPRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k