24 Jul 2017, 10:05

Deploying a real persistent/durable etcd cluster inside kubernetes

EDIT (26 July 2017) Updated the post with a better solution (I was wrong: etcd 3.2 also accepts peerURLs containing domain names).

When using an etcd cluster to store important key value data you’ll probably prefer data persistence over availability. If more than half of your etcd cluster members go down, you’ll prefer to wait for them to come back, accepting a loss of availability, instead of recreating a new etcd cluster from a backup of your data that will probably be older than the data at the moment of the disaster.

As an example, in stolon the cluster data is saved inside a store like etcd or consul. Restoring the stolon cluster data from a backup could lead to bad behaviors, since the restored stolon cluster state (which contains various information, the primary one being which postgres instance is the master/primary) won’t be in sync with the real stolon cluster state.

Stolon was architected to be seamlessly deployed inside a k8s cluster, so it’s logical to also deploy the store (etcd or consul) inside k8s.

Today there are different ways to deploy an etcd cluster inside k8s but, as I’m going to explain in this post, they don’t meet the above requirement:

  • the etcd-operator
  • a statefulset managing the etcd cluster, like the one provided by the etcd helm chart

Why do these options fail to meet our requirements?

Etcd operator

Let’s start with etcd-operator. Don’t get me wrong, I think the operator concept is really good and etcd-operator can be a good way to manage an etcd cluster inside k8s; it’s just that, in its current shape, etcd-operator doesn’t meet the above requirement.

As of today (24 July 2017) the etcd-operator manages a set of etcd pods, acting as a sort of custom controller for an etcd cluster. But if a pod executing an etcd member dies, the etcd-operator will delete it, remove the related etcd peer from the etcd cluster and create a new one.

Why doesn’t it just wait for the pod to come back?

One of the reasons is that it doesn’t use persistent volumes, so the data won’t be available if the pod is rescheduled on another k8s node.

BUT, even when using persistent volumes, this won’t work for another important reason: the pod names generated by etcd-operator contain an increasing number. A headless service resolves pod names to their IP addresses. So if a pod dies, etcd-operator will delete it and generate a replacement pod with a different name. But the etcd cluster keeps an internal list of the peerURLs and, since the new pod’s hostname has changed, the other members cannot communicate with it.

As a possible solution the etcd-operator could issue a member update command to change the peer URL. This could seem to work, but what happens if more than half of the members are stopped at the same time? They’ll be replaced with new pods with different hostnames that won’t be able to join the cluster, leaving it without quorum. And you cannot issue a member update command against a cluster that has lost quorum, so you end up in a probably unfixable state (I haven’t found a way to force a peerURLs update, since the data appears to be stored inside the etcd store itself, so there’s no file to manually edit like with consul).
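
For reference, on a healthy (quorate) cluster updating a member’s peer URL would look roughly like this (the endpoint and member ID below are made up, just to illustrate the command); the point is that it only works while the cluster still has quorum:

    # list the members to find the ID of the peer whose hostname changed
    ETCDCTL_API=3 etcdctl --endpoints=http://etcd-client:2379 member list

    # update its peer URL (1a2b3c4d5e6f7a8b is a hypothetical member ID)
    ETCDCTL_API=3 etcdctl --endpoints=http://etcd-client:2379 \
        member update 1a2b3c4d5e6f7a8b --peer-urls=http://new-pod-name:2380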

So when the etcd-operator detects that a cluster has lost more than half of its members, it cannot just wait for them to come back alive (since they have neither persistent data nor a stable peer address); the only thing it can do is drop the current cluster and create a new one, restoring its data from a backup (if configured to do so). If this is enabled you’ll end up with an etcd cluster that, under the hood, has been restored with old data. So you have to be wise and decide whether this is the correct approach depending on what you’re storing inside your etcd kv store.
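
Just to make concrete what restoring from a backup implies (regardless of the exact mechanism the etcd-operator uses internally), with plain etcd 3.x tooling a backup/restore cycle looks roughly like this, and everything written after the snapshot was taken is lost:

    # take a snapshot while the cluster is still healthy
    ETCDCTL_API=3 etcdctl --endpoints=http://etcd-client:2379 snapshot save backup.db

    # later, rebuild a new member's data dir from that snapshot
    # (names and URLs are illustrative)
    ETCDCTL_API=3 etcdctl snapshot restore backup.db \
        --name etcd-0 \
        --initial-cluster etcd-0=http://etcd-0.etcd:2380 \
        --initial-advertise-peer-urls http://etcd-0.etcd:2380 \
        --data-dir /var/lib/etcd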

Using this post as a starting point, I’d like to discuss with the etcd-operator maintainers whether the etcd-operator logic could be changed in the future to achieve cleaner etcd persistence.

EDIT (25 July 2017) Opened etcd-operator issue #1323

Statefulset

So another solution could be a statefulset with N replicas managing an etcd cluster, like the one provided by the above helm chart. What are the problems with it?

  • Every time a pod fails (or is deleted, for example when manually doing a statefulset upgrade or automatically when using the k8s 1.7 statefulset RollingUpdate updateStrategy) a preStop hook is executed to remove the member from the cluster, and when the pod is restarted it’ll rejoin the cluster (a rough sketch of what such a hook does is shown after this list). This is probably done to handle cluster resizes (for example from 5 to 3 members), but it caused me different problems while testing it: for example, deleting more than one pod at a time makes the replacement pods fail to start, waiting forever for the initial cluster size of members to be ready (which is impossible).

  • This statefulset won’t work with etcd 3.2, which doesn’t accept a hostname in a listen URL.
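
About the preStop hook mentioned in the first point: this isn’t the chart’s actual hook, only a minimal sketch of what it boils down to, assuming etcdctl v3, a reachable client endpoint (etcd-client:2379 here is made up) and a member name equal to the pod hostname:

    # hypothetical preStop script: the member removes itself from the cluster
    # before the pod stops (assumes the member name matches the pod hostname)
    MEMBER_ID=$(ETCDCTL_API=3 etcdctl --endpoints=http://etcd-client:2379 member list \
        | grep "$(hostname)" | cut -d',' -f1)
    ETCDCTL_API=3 etcdctl --endpoints=http://etcd-client:2379 member remove "$MEMBER_ID"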

I’ll open a PR to fix this last problem, but I don’t like the idea of a preStop hook that removes the etcd member from the cluster and all the problems it’ll cause.
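
For the etcd 3.2 problem, the distinction that matters is between listen URLs and advertise URLs. A rough sketch of the relevant etcd flags (names and ports are just illustrative, not taken from the chart):

    # listen URLs must be IP addresses (etcd 3.2 rejects domain names here),
    # while advertise URLs can be domain names such as the stable pod DNS name
    etcd --name etcd-0 \
        --listen-peer-urls http://0.0.0.0:2380 \
        --listen-client-urls http://0.0.0.0:2379 \
        --initial-advertise-peer-urls http://etcd-0.etcd:2380 \
        --advertise-client-urls http://etcd-0.etcd:2379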

A possible solution

While I’d really like to have the etcd-operator features together with a persistent etcd cluster, currently I have to rely on other solutions.

Updated k8s stateful set

Based on the stateful set provided by the above helm chart, I changed it to avoid etcd members leaving/rejoining the cluster. This blocks automatic cluster scaling from N to M members when you change the statefulset replicas value; since I’m not interested in this feature I’m fine with it. Anyway, you can still scale the cluster by changing the replicas number and running some etcdctl commands to remove/add members (before changing the replicas when decreasing, after when increasing).
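
For example, to grow the cluster from 3 to 4 members (and shrink it back) the manual steps could look roughly like this, following the ordering described above; the statefulset name, pod names and endpoints are assumptions based on the example resources:

    # growing: bump the replicas, then register the new member
    kubectl scale statefulset etcd --replicas=4
    ETCDCTL_API=3 etcdctl --endpoints=http://etcd-0.etcd:2379 \
        member add etcd-3 --peer-urls=http://etcd-3.etcd:2380

    # shrinking: remove the member first (look up its ID with "member list"),
    # then lower the replicas
    ETCDCTL_API=3 etcdctl --endpoints=http://etcd-0.etcd:2379 member remove <member-id>
    kubectl scale statefulset etcd --replicas=3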

Try it

You can find an example here: https://github.com/sgotti/k8s-persistent-etcd

You can use the resources/etcd.yaml file of the above repo to create a 3 node etcd cluster inside a minikube node (it contains 3 services and 3 stateful sets). For a production deployment you should edit the resource, adapting it to your needs (for example changing the persistent volume claims template definition to match your architecture).
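
For a quick test on minikube this could be, assuming the repo is checked out locally:

    # create the etcd cluster resources and watch the pods come up
    kubectl create -f resources/etcd.yaml
    kubectl get pods -w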

Your clients should use the DNS names ${SET_NAME}-${i}.${SET_NAME} (with i from 0 to replicas - 1), for example etcd-0.etcd, etcd-1.etcd, etcd-2.etcd, or their equivalent FQDNs (based on your cluster global domain and namespace) as the etcd endpoints. You could also create a dedicated client service with a label selector matching all the pods in the statefulset and use its cluster IP.
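
For example, a client running in the same namespace could check the cluster like this (service name and ports follow the example above):

    ETCDCTL_API=3 etcdctl \
        --endpoints=http://etcd-0.etcd:2379,http://etcd-1.etcd:2379,http://etcd-2.etcd:2379 \
        endpoint health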

Helm chart

Coming soon. I’d also like to try adding a pre/post hook to automatically scale the cluster instead of adding this logic to the pod template command.


Old solution

NOTE: this isn’t the best solution; it was born from the (wrong) assumption that you cannot specify a domain name in the etcd peerURLs with etcd 3.2. Since this isn’t true, the single statefulset solution above is cleaner.

This solution uses stable IP addresses instead of stable pod names to achieve persistent etcd peerURLs, using a service per pod and 3 statefulsets with replicas = 1.

You can find it here: https://github.com/sgotti/k8s-persistent-etcd/blob/master/old