Let me start with the TL;DR:

If production breaks, it is not the fault of a single person but a faulty process.

The story, all names, characters, and incidents portrayed in this blog post are fictitious. No identification with actual persons (living or deceased), places, buildings, and products is intended or should be inferred.

We are professionals; nothing like this would ever happen to us.

Question 1

One of our customers running Shopware called and said they would like a trailing / on every URL. I thought that was no problem, opened an SSH session and added a few lines to the .htaccess on production. Tested it, worked. Two hours later the customer called and told us that it was no longer possible to put anything into the cart.

What happened?

Shopware uses its API to add items to the cart, and the API doesn't like a trailing /.
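Just to make the mistake concrete, the change was roughly of this shape. This is an illustrative sketch, not the exact lines: the file path and the excluded prefixes (/api, /checkout) are assumptions and depend on the Shopware version.

```
# Illustrative sketch: appending a trailing-slash rule to .htaccess over SSH.
# Path and excluded prefixes are assumptions, not the exact production change.
cat >> /var/www/shop/.htaccess <<'EOF'
RewriteEngine On
# Redirect every URL that is not a real file and has no trailing slash ...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !/$
# ... but leave the cart/API routes alone; this is the condition that was missing:
RewriteCond %{REQUEST_URI} !^/(api|checkout)/
RewriteRule ^(.*)$ /$1/ [R=301,L]
EOF
```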

How could that have been mitigated?

On our “safety” checklist we have a few things:

  • git
  • deployment pipeline

We don't have tests, though.

But tests wouldn't have helped. Why, you ask? Because some dumb idiot (me – this is a fictional story!) accessed the production system, ignored all processes and broke the system.

So the simple answer to "How could this have been mitigated?" is: don't allow access to the production system. (But it is important to state the downside of this rule: you can't fix anything fast.)

Question 2

Customer calls: "No secure connection to the server is possible." OK, obviously something with the SSL certificate is not working. Yesterday I read an article about the expired R3 certificate from Let's Encrypt, which breaks the occasional connection. And checking the certificate shows: yes, it uses the expired R3 intermediate certificate.
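If you want to check this yourself, plain openssl is enough; shop.example.com stands in for the real domain:

```
# Which certificate does the server send, and when does it expire?
echo | openssl s_client -connect shop.example.com:443 -servername shop.example.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer -enddate

# -showcerts prints the whole chain the server presents, so an expired R3 intermediate shows up here:
echo | openssl s_client -connect shop.example.com:443 -servername shop.example.com -showcerts 2>/dev/null
```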

Easy fix: create a new certificate. Connect to the server via SSH (no, that was not the problem! :D) and run certbot for a new certificate. Test it, done.

A little while later the telephone rings again, this time it is the admin, asking why all the websites are down. I told him I had updated the certificate and tested the site afterwards; it was online – and it still is.

What happened?

The certificate contained 12 domains before I updated it. After the update, only 1 was left in it.
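In certbot terms the mistake probably looked something like the first command below. The domains are placeholders and the --apache plugin is an assumption; the point is only the difference between re-issuing with a single -d and renewing or listing every name.

```
# The trap: re-issuing with a single -d can replace the existing certificate,
# and the other 11 domains drop out of it.
certbot certonly --apache -d shop.example.com

# Safer: look first, then renew or list every name explicitly.
certbot certificates                 # shows which domains each certificate covers
certbot renew                        # renews the existing certificate with all its domains
certbot certonly --apache --cert-name shop.example.com \
  -d shop.example.com -d www.shop.example.com   # ... plus the remaining domains
```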

How could that have been mitigated?

The bad thing: if I hadn't had access to production, I couldn't have tried to fix it in the first place.

The easiest ways to avoid this:

  1. Know what you are doing (I'm a developer, not an admin).
  2. Use a TLS monitoring service. The one we are using is Oh Dear.
  3. Automate your certificate renewals (ok, this was a special case).

Question 3

Customer calls, their e-commerce site is down. Investigation starts. Logging in via SSH already greets us with a wall of errors: hard disk full.

We had already discussed with the admin that the old Shopware version we use has a cache leak and that we need to update(!!), but it is the end of November and the customer prefers to do it next year. OK. The cache writes about 90 GB a week, we have a 160 GB disk with about 120 GB free, so it is no problem if we clean the cache once a week.

But now it happened: the cache flooded the disk, and we didn't know why. First thing: clean the cache, and the store is online again. Now check what actually happened. We only have 40 GB of free disk. Why? Good question. With each deployment we create a new cache, which means we "archive" the old cache and don't delete it. So with each deployment our free disk space shrank, and we didn't notice.

So we installed a cronjob which deletes the cache not only once a week but once every night – just to be sure – and not only the current cache but the caches of all releases. And then warms the current cache up again.
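As a sketch, the nightly job looks roughly like this. The paths and the warmup command are assumptions; they depend on the deployment layout and the Shopware version.

```
# crontab entry: every night at 03:00, wipe the caches of ALL releases,
# then warm the cache of the current release up again.
0 3 * * * find /var/www/shop/releases/*/var/cache/ -mindepth 1 -delete 2>/dev/null; php /var/www/shop/current/bin/console cache:warmup
```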

How could that have been mitigated?

Simple answer: hard disk monitoring. You can monitor a lot of things, but free disk space and used inodes* are a good starting point. Used memory and CPU utilisation can be next – but I'm no expert on this topic.

  * Inodes are the entries for files in the file system; they are how the system finds each file on the disk. It can happen (e.g. if you write TONS of small session or cache files) that you still have free disk space but are out of inodes – then you can't create new files anymore, because the file system has no slot left to record them. Like a database when you run out of AUTO_INCREMENT ids.
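Whatever monitoring you end up with, the two numbers it should watch are the same ones you can check by hand:

```
df -h        # free disk space per filesystem
df -i        # inode usage: you can be out of inodes while df -h still shows plenty of free space
```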

Things you want to have (and that are not always easy to sell – I know)

Software, tips and services we use(d) to mitigate our risk:

And some closing notes:

  1. Think hard about what you want to offer your customers for free, because great services cost money month after month, and customers should pay for that, not you. Yeah, I know, obvious, isn't it?
  2. Developers want to play. If they can't, they play on production. Talk with your developers about ways to harden your system: with logging, monitoring, tests, better processes.

And the most important thing you always need to keep in mind when shit hits the fan:

If production breaks, it is not the fault of a single person but a faulty process.

It is not the responsibility of a developer to make up for bad business decisions. But it is the responsibility of all of us to point out the weak spots in our processes when we see them.