FALLACIES OF DISTRIBUTED COMPUTING WITH KUBERNETES ON AWS
Raffaele Di Fazio
05.10.2017
WHOAMI
ZALANDO
15 markets
6 fulfillment centers
21 million active customers
3.6 billion € net sales 2016
200 million visits per month
13,000 employees in Europe
ZALANDO TECHNOLOGY
HOME-BREWED, CUTTING-EDGE & SCALABLE technology solutions
>1,800 employees from 77 nations
6 tech locations + HQs in Berlin
HISTORIA MAGISTRA VITAE
Photo by Jace Grandinetti on Unsplash
KUBERNETES ARCHITECTURE
[Diagram: kubectl -> Master (API Server, Scheduler, Controller Manager, etcd); Nodes (Kubelet, Skipper, Kube2IAM, Logging agent, ….)]
WHAT HAPPENS WHEN THE API SERVER IS NOT RUNNING?
Photo by Thomas Kvistholt on Unsplash
“A FEW MORE” 404s THAN USUAL...
Thanks to Ashley McNamara for the picture
WHY?
Photo by Ricardo Gomez Angel on Unsplash
“A FEW” MORE PODS…
KILLING KUBERNETES’ API SERVER
Lots of pods => Too much memory => API server OOMKilled
SYSTEM VIEW - TRAFFIC FLOW
https://github.com/zalando/skipper
[Diagram: ALB -> Skipper (one per Node) -> Service -> MyApp pods; labels: TLS, HTTP, EC2 network, K8S network]
WHAT REALLY HAPPENED
• All routes removed:
• No routes to the applications deployed inside the cluster
• Healthcheck reported “unhealthy” because there was no connection to the API server
• => All nodes were unhealthy in the ELBv2
WHAT HAPPENS WHEN ALL THE TARGETS IN AN ELBv2 ARE UNHEALTHY?
Photo by Sandro Katalina on Unsplash
WHAT ABOUT THE ELBv2?
If no Availability Zone contains a healthy target, the load
balancer nodes route requests to all targets.
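The quoted behaviour is often described as "failing open". A minimal Python sketch of that routing decision follows. It is a simplified model for illustration, not AWS's implementation: the `Target` type, the `routable_targets` function and the node and zone names are hypothetical, and real per-zone routing is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A registered load balancer target (hypothetical model)."""
    name: str
    zone: str
    healthy: bool


def routable_targets(targets):
    """Return the targets that may receive traffic in this simplified model."""
    healthy = [t for t in targets if t.healthy]
    # Fail open: if no zone has a healthy target, every target gets traffic.
    return healthy if healthy else list(targets)


# Every node fails its health check, as in the incident described above.
nodes = [
    Target("node-a", "zone-1", False),
    Target("node-b", "zone-2", False),
]
print([t.name for t in routable_targets(nodes)])
```

Under this model, marking every node unhealthy does not stop traffic. Requests still reach the nodes, which is consistent with the 404s seen once the routes were gone.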
KUBERNETES API SERVER AVAILABILITY AND CONTROL LOOPS
Photo by chuttersnap on Unsplash
MISTAKES WERE MADE
WE ARE NOT ALONE
• … a test that simulated the failure of a single apiserver node
disrupted the cluster in a way that negatively impacted the
availability of running workloads
• ... helped us identify that the disruption was likely related to an
interaction between the various clients that connect to the
Kubernetes apiserver (like calico-agent, kubelet, kube-proxy, and
kube-controller-manager) and our internal load balancer’s
behavior during an apiserver node failure.
• Source: Kubernetes at GitHub
HOW WE FIXED IT
• Do not change the healthcheck in case of API server failures
• Do not drop the routes in case of API server failures
• => Delete when you are really sure you want to delete!
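The idea behind these points can be sketched in Python as a route table that keeps its last known good state. This is an illustrative sketch, not Skipper's actual code: `RouteTable`, `fetch_routes`, the host name and the backend address are all hypothetical.

```python
class RouteTable:
    """Keeps the last successfully fetched routes (hypothetical sketch)."""

    def __init__(self, fetch_routes):
        # fetch_routes: callable returning {host: backend}; raises on API failure.
        self._fetch_routes = fetch_routes
        self.routes = {}
        self.api_reachable = False
        self._loaded_once = False

    def refresh(self):
        try:
            fresh = self._fetch_routes()
        except Exception:  # real code would catch the client's specific errors
            # API server failure: keep the last known routes, delete nothing.
            self.api_reachable = False
            return self.routes
        # Only a successful answer from the API server may add or delete routes.
        self.api_reachable = True
        self._loaded_once = True
        self.routes = dict(fresh)
        return self.routes

    def healthy(self):
        # Health means "can serve traffic", not "can reach the API server".
        return self._loaded_once


def make_flaky_fetcher():
    """First call succeeds, every later call simulates an API server outage."""
    calls = {"count": 0}

    def fetch():
        calls["count"] += 1
        if calls["count"] > 1:
            raise ConnectionError("API server unreachable")
        return {"myapp.example.org": "10.2.0.12:8080"}

    return fetch


table = RouteTable(make_flaky_fetcher())
table.refresh()  # succeeds and loads one route
table.refresh()  # fails; the route and the health status are kept
print(table.routes, table.healthy(), table.api_reachable)
```

The design choice is that an error and an empty answer are treated differently: an error changes nothing, while routes are only deleted when the API server explicitly reports that they are gone.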
FALLACIES OF DISTRIBUTED COMPUTING
Photo by chuttersnap on Unsplash
8 FALLACIES OF DISTRIBUTED COMPUTING
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.
THE FALLACIES OF CLOUD COMPUTING
• The API call you will make will succeed.
• The next API call you will make will succeed.
• Deleting resources is the same as adding new ones.
• Your cloud provider will have no outages.
• The dependencies between your services are clear.
MAKING YOUR SYSTEM RESILIENT
Photo by Aaron Barnaby on Unsplash
WHEN MAKING API CALLS
• Every API call can fail
• Retry (with backoff)
• Circuit breakers
• Fallbacks
• Don’t scale down / delete resources fast!
• Deal with rate limiting
• Deal with “weird” values due to a broken cloud provider feature
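Two of the points above, retry with backoff and fallbacks, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `call_with_retry` and its parameters are hypothetical names, the backoff is exponential with random jitter, and circuit breaking and rate-limit handling are not shown.

```python
import random
import time


def call_with_retry(call, attempts=5, base_delay=0.1, max_delay=5.0,
                    fallback=None, sleep=time.sleep):
    """Run call(); on failure retry with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as error:  # narrow this to the client's error types
            last_error = error
            if attempt < attempts - 1:
                # The delay cap doubles per attempt; jitter spreads out retries.
                delay = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, delay))
    if fallback is not None:
        # All attempts failed: use a fallback, e.g. a cached value.
        return fallback()
    raise last_error


# Simulated API that fails twice before answering.
responses = iter([TimeoutError("throttled"), TimeoutError("throttled"), "ok"])


def flaky_call():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


# sleep is replaced by a no-op so the example runs instantly.
result = call_with_retry(flaky_call, sleep=lambda seconds: None)
print(result)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. The jitter matters when many clients retry at once against an API that is already struggling.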
TEST ALL THE THINGS
• Continuous integration tests
• Continuous deployment of cluster updates
• Load tests
• Chaos tests
CONTINUOUS INTEGRATION TESTS
• Test the interactions between components
• For every configuration change we run extensive e2e tests
CONTINUOUS DEPLOYMENT OF CLUSTER UPDATES
LOAD TESTING
• Lots of requests to the API server
• Lots of pods running
• Writes/reads to the data storage (etcd)
• => what matters: observe the impact on running applications
CHAOS TESTING
• Random shutdown of Kubernetes components
• https://github.com/linki/chaoskube
• http://chaostoolkit.org/
• https://github.com/asobti/kube-monkey
• http://principlesofchaos.org
• Random shutdown of nodes (EC2 Instances)
• https://github.com/Netflix/chaosmonkey
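The core of a random-shutdown test is small: pick a victim at random, optionally spare some namespaces, and terminate it. The Python sketch below only illustrates that selection step as a dry run. It is not how any of the tools listed above are implemented, and `pick_victim`, the pod names and the namespaces are hypothetical.

```python
import random


def pick_victim(pods, excluded_namespaces=frozenset(), rng=None):
    """Pick one random (namespace, name) pod, skipping excluded namespaces."""
    rng = rng or random.Random()
    candidates = [pod for pod in pods if pod[0] not in excluded_namespaces]
    return rng.choice(candidates) if candidates else None


# Hypothetical pod list; a real tool would read it from the API server.
pods = [
    ("kube-system", "kube-apiserver-0"),
    ("kube-system", "skipper-ingress-x1"),
    ("default", "myapp-7d9f"),
]

# Target only cluster components here by sparing the "default" namespace.
victim = pick_victim(pods, excluded_namespaces={"default"}, rng=random.Random(7))
if victim is not None:
    # Dry run: report the choice instead of deleting anything.
    print(f"dry run: would delete pod {victim[1]} in namespace {victim[0]}")
```

Accepting the random generator as a parameter makes a run reproducible, which helps when a chaos test uncovers a failure that needs to be replayed.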
MORE ON CHAOS TESTING
• Netflix’s principles of Chaos Engineering
• http://principlesofchaos.org
• Chaos Engineering free ebook ->
THAT’S NOT ALL
• You think Kubernetes the hard way is hard
• The hard part was never only the setup
• Sometimes you will have to break things to learn
• …set up a healthy post-mortem culture and learn from mistakes!
THAT’S IT
Photo by Dhruva Reddy on Unsplash
QUESTIONS?
Raffaele Di Fazio
@x0rg
