How to scale your DevOps from 100 servers to 1,000+

Codemotion Amsterdam 2019 had a huge variety of talks across more than a dozen tracks. One that stood out for me was this talk on DevOps, given by Pat Hermens of Coolblue. Coolblue is one of the biggest online retailers in the Netherlands, with revenues of €1.3bn in 2018. Since it was founded 20 years ago, the company has seen exponential growth. This is reflected both in its revenues and in its development team, which doubles in size every 18 months and currently numbers 240 developers.

The challenge of scale

Many people might naively wonder: what is it with scale? Why are 1,000 servers so much harder to handle than 10 servers? Pat shared a very pertinent quote from Edsger Dijkstra:

Four Stories of DevOps scaling

Pat shared four stories with us to illustrate the other requirements to ensure successful scale-up: responsibility, autonomy, ownership and failure.

Responsibility

In the past, Coolblue used a hub-and-spoke model for deploying code. The Hosting and Deployment team (effectively DevOps) sat in the centre, and each development team went through them for any decision or knowledge about deployment. This model became a bottleneck, since every request had to go through a single team. As a result, informal knowledge sharing began to happen.

Autonomy

Within bounds, autonomy is essential for scaling. All of Coolblue’s systems are called Vanessa-X. Their modern systems, such as Vanessa-de-Prix and Vanessa-Longstocking, are web applications that use Serilog, Splunk and DataDog to provide real-time data logging, dashboards and audit trails in an easy-to-integrate fashion. However, the company is still heavily reliant on Vanessa-Optimus-Prime. This is the original system, a monolithic desktop application based on Delphi (which shows how old it is!). It runs on thousands of machines across the company and is still central to how the rest of the system works.

Ownership

Ownership is sometimes scary. People feel exposed if they have to take ownership of important decisions. At Coolblue, the build environment is based on TeamCity. But who actually owns the environment? In practice, each team owns its own unique build environment. The first thing that happens in any build is that the build.ps script is called. Each team can configure this script as it chooses. As a result, pretty much any build configuration is feasible. And no one else even needs to know what you are trying!
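To make the idea concrete, here is a minimal sketch of a shared build runner that hands control to a team-owned script. It is not Coolblue's actual setup: the function name run_team_build and the Python stand-in build.py (in place of the build.ps script mentioned above) are assumptions made purely for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_team_build(repo_dir: Path, script_name: str = "build.py") -> int:
    """Run the team-owned build script in repo_dir and return its exit code.

    Hypothetical helper: "build.py" stands in for a team's own build script.
    """
    script = (repo_dir / script_name).resolve()
    if not script.is_file():
        raise FileNotFoundError(f"no team build script at {script}")
    # The shared runner knows nothing about the build steps themselves.
    # It only invokes the script and reports whether it succeeded.
    result = subprocess.run([sys.executable, str(script)], cwd=repo_dir)
    return result.returncode
```

The design point is that the central environment only needs to agree on one entry point and an exit code; everything inside the script belongs to the team.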

Failure

One of Pat’s favourite books is “Failing Forward”, by John C. Maxwell. Core to Maxwell’s view is that what matters is how failure is accepted and what changes it triggers. Coolblue owns a fleet of delivery vehicles. Recently they added electric bicycles to the fleet. During the trial phase for the bicycles, everyone in the team suddenly received a Slack notification late one afternoon. One of the key aspects of Coolblue’s infrastructure is the dashboard that monitors their services. If anything goes wrong, it sends out a Slack notification. On this occasion, the on-call team was able to quickly spot that two processes were hanging. They terminated and restarted these and within minutes all was happy again.
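The monitor-and-notify loop described here can be sketched roughly as follows. The talk did not describe how Coolblue detects hanging processes, so the heartbeat approach, the function names and the threshold parameter are all assumptions for illustration. The only real-world detail relied on is that a Slack incoming webhook accepts a JSON body with a "text" field.

```python
import json
import urllib.request
from typing import Callable, Dict, List


def post_to_slack(webhook_url: str, text: str) -> None:
    """Send a message through a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10).close()


def find_hanging(
    last_heartbeat: Dict[str, float], now: float, max_silence: float
) -> List[str]:
    """Return names of processes silent for longer than max_silence seconds."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > max_silence
    )


def check_and_alert(
    last_heartbeat: Dict[str, float],
    now: float,
    max_silence: float,
    notify: Callable[[str], None],
) -> List[str]:
    """Notify once if any process looks hung, and return the suspects."""
    hanging = find_hanging(last_heartbeat, now, max_silence)
    if hanging:
        notify("Hanging processes: " + ", ".join(hanging))
    return hanging
```

In production, notify would be something like lambda text: post_to_slack(webhook_url, text), with the webhook URL supplied by your own Slack workspace. Passing the notifier in as a parameter keeps the detection logic testable without touching the network.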

Conclusions

Coolblue has been able to scale up pretty effectively. In part, this is down to embracing the “faster to master” checklist. But it is also down to how their company culture embraces the four key concepts of responsibility, autonomy, ownership and failure. Get these right, and scaling becomes much easier.
