December 11, 2017

Day 11 - Scaling your on-duty team

By: Damien Pacaud (@damienpacaud)

Edited By: (@bmarsteau)

Our tech team at teads.tv is mostly based in France, where labour law and legislation provide a fairly strict set of rules and boundaries for working outside office hours.

For this reason we’ve had to adapt and give some thought to our on-duty team organization as we grew from a start-up to a scale-up.

Intro

Scaling your on-duty team is crucial for most fast-growing startups that operate at a global level. The internet never sleeps, and even with the best design for resilience, one day your system will go down. At teads, we deliver outstream video advertising for the biggest content publishers in the world, so any downtime has significant repercussions on our revenue but also on our publishers’ revenue. We decided to think carefully about scaling our on-duty team in order to minimize downtime when a system goes down. That story is below.

Our problem

In a few years, we’ve scaled from a growing startup operating with a few pizza teams into a company where more than 100 developers across 3 different locations deliver new features on a daily basis. We’ve been able to do so by implementing our own version of the “Spotify model”, and it has given us the ability to stay agile while growing the tech team. Applying the same recipe to the on-duty team was a challenge, to say the least. Initially, the on-duty team was composed of a few developers who had been with teads since the very beginning and who were very knowledgeable about every part of the platform. We relied on their knowledge, their availability and the fact that they had helped build most of the system. As we grew, the system became larger and more complex. The handful of developers keeping the revenue safe overnight could no longer keep up with the knowledge needed to solve every problem.

First step: Growing the on-duty team

We started looking for people to add to the on-duty team, with the goal of having someone from each of our feature teams be part of the rotation. This was our way of implementing “you built it, you run it” in a country with strict labour laws. It meant growing that team to 12 people, and that’s when we hit the first wall. We tried growing the team while a few highly visible production incidents were still fresh in everyone’s mind (the S3 Service Disruption in us-east-1, anyone?) and, of course, no one was voluntarily applying to be on duty.


Besides, being on duty only once every twelve weeks seemed counterproductive, as the shifts were spread over too long a timespan. By the time you are back on duty, a lot of systems have changed and it is difficult to remember good practices.

Lost battle: trying to be ready

One of the main reasons nobody was applying to the on-duty rotation was the lack of documentation on how to react when an incident arises. We tried to tackle this problem: for a few months we set up meetings, put knowledgeable people in a room and asked them to kindly document the steps to take when incidents happen.


This was too large of a mission, even for a highly motivated team. Soon, meetings were skipped, and documentation was not improving.

At this point, we started thinking about the problem in a different way.

Enter on-duty pairing

The first decision we took was to have two people on duty at the same time for a week-long shift. We try to choose pairs with complementary, non-overlapping skill sets and experience; for example, we will pair a back-end developer with a data-oriented developer. This allows us to cover most of the systems on the critical chain.
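Purely as an illustration, here is a minimal sketch of how such a week-long pairing rotation could be generated. The names and the itertools-based cycling are assumptions, not our actual scheduling tool.

```python
from itertools import cycle, islice

# Hypothetical pools of on-duty volunteers, one per skill area.
BACKEND_DEVS = ["alice", "bruno", "chloe", "david", "emma", "farid"]
DATA_DEVS = ["gaelle", "hugo", "ines", "jules", "karim", "lea"]

def weekly_pairs(weeks: int):
    """Yield (week_number, backend_dev, data_dev) tuples.

    Each week-long shift pairs one back-end developer with one
    data-oriented developer, so together the pair covers most of
    the systems on the critical chain.
    """
    backend = cycle(BACKEND_DEVS)
    data = cycle(DATA_DEVS)
    for week, (b, d) in enumerate(islice(zip(backend, data), weeks), start=1):
        yield week, b, d

if __name__ == "__main__":
    for week, b, d in weekly_pairs(12):
        print(f"Week {week:2d}: {b} (back-end) + {d} (data)")
```

With equal-sized pools the same pairs recur every cycle; rotating or shuffling one of the lists between cycles is enough to vary the pairings.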

The benefits that we see with on-duty pairing are:

- It’s much easier to bounce ideas off someone when a problem is impacting production and you (or your pair) do not know how to fix it.
- Sometimes the incident runs so deep that a critical business decision must be taken. It’s much easier to share the responsibility of such a decision in the middle of the night. We accept that this may slow down the decision process, as there will be some back-and-forth within the pair.
- In the rare event of someone not waking up to the PagerDuty calls, there is a backup. Interestingly enough, we had never experienced someone not waking up until we started pairing. This raised the question of whether pairing lowers each individual’s sense of alert because there is a backup, but in the end we feel it has more benefits than downsides.


We implemented this change in a few weeks and so far we are quite happy with it. The team has scaled to 12 developers, coming from all feature teams, and the rotation goes smoothly.

Escalation?

The traditional way of dealing with increasing complexity is to have an escalation policy. We chose not to implement one and instead have PagerDuty automatically wake up both paired developers at the same time. This automates the decision of waking up another human being and makes PagerDuty responsible for it. We don’t want to be responsible for this hard decision, so we let the robot do it.
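As an illustration, a single escalation rule with two targets is enough to get this “wake both at once” behaviour. The sketch below creates such a policy through what we understand to be PagerDuty’s REST API v2; the token, user IDs, delay and policy name are placeholders, and the exact payload shape should be checked against the current API documentation rather than taken from here.

```python
import requests

PAGERDUTY_API = "https://api.pagerduty.com/escalation_policies"
API_TOKEN = "REPLACE_ME"      # placeholder, not a real token
BACKEND_DEV_ID = "PXXXXXX"    # placeholder PagerDuty user IDs
DATA_DEV_ID = "PYYYYYY"

def create_pairing_policy():
    """Create an escalation policy whose single rule targets both
    on-duty developers, so PagerDuty pages them simultaneously
    instead of escalating from one to the other."""
    payload = {
        "escalation_policy": {
            "type": "escalation_policy",
            "name": "On-duty pairing",
            "num_loops": 2,  # re-page both if nobody acknowledges
            "escalation_rules": [
                {
                    "escalation_delay_in_minutes": 10,
                    "targets": [
                        {"type": "user_reference", "id": BACKEND_DEV_ID},
                        {"type": "user_reference", "id": DATA_DEV_ID},
                    ],
                }
            ],
        }
    }
    response = requests.post(
        PAGERDUTY_API,
        json=payload,
        headers={
            "Authorization": f"Token token={API_TOKEN}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )
    response.raise_for_status()
    return response.json()
```

Because both developers sit on the same rule, there is no human decision left about whether to wake the second person.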


Escalation usually also solves the “I need an expert on [insert any well known distributed system here] and I need her right now” problem. Putting experts on escalation policies is great if you have a big enough pool of them for each of the systems that you use. For us, this meant that a few people would have been on call every other week. We thought this was not acceptable and decided that we could solve the problem by:

- Telling the on-duty team members that we know they will do their best to recover from the issue
- Giving them the confidence that, as engineers, they will find a solution
- Automating as much as we can of the routine maintenance operations (taking a bad Cassandra node out of the ring, decommissioning and replacing a Kafka broker…); the Cassandra case is sketched below
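For example, the “bad Cassandra node” case can be reduced to a small, well-tested wrapper that the on-duty pair runs without having to remember the nodetool incantation. This is only a hedged sketch, not our actual script; it assumes nodetool is on the PATH of a live node in the ring and that the dead node’s host ID is passed on the command line.

```python
import subprocess
import sys

def remove_dead_cassandra_node(host_id: str) -> None:
    """Take a dead node out of the Cassandra ring.

    `nodetool removenode <host-id>` re-replicates the dead node's data
    to the remaining replicas; it must be run from a live node.
    """
    if input(f"Remove node {host_id} from the ring? [y/N] ").lower() != "y":
        sys.exit("Aborted.")
    subprocess.run(["nodetool", "removenode", host_id], check=True)
    # Show the ring afterwards so the on-duty pair can confirm the
    # node is gone and the remaining nodes are Up/Normal.
    subprocess.run(["nodetool", "status"], check=True)

if __name__ == "__main__":
    remove_dead_cassandra_node(sys.argv[1])
```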

Post-incident & Playbook

Soon after an incident, we gather everyone from the on-duty team in a room for a blameless, fact-oriented post-mortem. We aim to leave the room after one hour having filled in our very simple post-incident template:

- Summary of the issue
- How to respond to such an issue (should it arise again)
- Action plan

This process allows us to document our interventions and ensure that, should the same incident happen again, we have a solution to mitigate its effects in a timely manner.
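A hedged illustration of how such a template entry could be kept alongside the code: the dataclass fields simply mirror the three sections above, and both the structure and the example values (loosely based on the Kafka broker scenario mentioned earlier) are assumptions, not our actual playbook.

```python
from dataclasses import dataclass, field

@dataclass
class PostIncidentEntry:
    """One playbook entry, filled in during the one-hour post-mortem."""
    summary: str                  # what happened, in a few sentences
    response: str                 # how to respond if it happens again
    action_plan: list[str] = field(default_factory=list)  # follow-up tasks

entry = PostIncidentEntry(
    summary="Delivery latency spiked after a Kafka broker lost its disk.",
    response="Reassign the broker's partitions, then replace the instance.",
    action_plan=["Automate the partition reassignment step"],
)
```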

Feedback

After a few months, we are quite happy with this new on-duty rotation. It has proven useful many times and we now have more documentation than ever on how to react to our alerts. The post-incident ritual also acts as a team bonding meeting and we are thinking of creating more rituals specifically for the on-duty team (on top of each individual’s feature team rituals).

The biggest difficulty we have encountered since launching has been organizing the rotation with pairs over the Christmas period. It’s always a challenge to find one person available during those holidays, so trying to find two is double the fun.
