The history of the pets vs cattle terminology is muddy; most trace it back to a presentation Bill Baker of Microsoft gave in 2006 about scaling SQL Server.
Way back then, in the before times, we called ourselves sysadmins and treated our servers like pets
For example, Bob the mail server. If Bob goes down, it's all hands on deck. The CEO can't get his email and it's the end of the world.
In the new world, servers are numbered, or maybe given UUIDs, like cattle in a herd.
For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.
Why am I telling you this rather gruesome story? Kubernetes deals with all that, right? And saves us from the tyranny
And you're right, it does. All your computers are abstracted into nodes and given arbitrary names; autoscaling groups and the like will automatically detect the sick in your herd, take them out, and bring in a replacement, all while seamlessly (ish) rescheduling the workload that was on the failed machine
And Kubernetes takes that a step further: your workloads also get unique names
Just like the physical servers, failures in your workloads can be detected, and the workloads replaced seamlessly
So where's the pet?
well..
What's the first thing we do with a brand new Kubernetes cluster?
Hint: it's not deploying your application
Look familiar? Yeah, we had to do a load of 'things' just to make this cluster able to start running our workloads
And it's worth noting that the trend towards more and more features moving 'out of tree', that is to say optional add-ons that don't ship with core Kubernetes (think FlexVolumes, and basically all the Kubernetes SIG projects that many find essential), only exacerbates this issue
<click> That might work when you've got a single cluster <click> But what about when you've got dev <click> integration <click> staging <click> QA environments that your app needs to run on?
Or worse, when you need separation between your teams or products
Maybe you've automated that: bash, Ansible, Terraform, whatever you like. Cool, good on you
However, you'll find it won't be long before there's an updated version, perhaps patching a vulnerability you care about, and you may be stuck trying to test every single app across your estate
This is what we're used to calling day 2 operations; we used to call it BAU, or business as usual, and it's where reality catches up with our idealistic good intentions
You'll quickly find clusters running various versions; given the rate of change in the community, it's unrealistic to run :latest everywhere with any confidence that you won't break production and disrupt your operational teams.
Permutations of seemingly common tool choices appear too: some teams might use Kong, others NGINX, another Apache, all for good reasons I'm sure
Seemingly infinite possibilities appear across the estate
Sad times
Congratulations, you're now the proud owner of a pet shop. Or, if you managed to automate the creation,
you can call it a pet factory. Either way, it's a headache
But so what? How does this hurt, you might ask?
Maybe you like pets?
Well, presuming of course you're in the cloud, your world can roughly be summarized into tiers. Apps: these are the things your board room knows about and can probably name, so think your public website, shopping cart system, customer service apps, online chat interfaces, email system and so on. These all implicitly provide some value, in and of themselves, to your end customers.
Infrastructure: with cloud this is all commodity, thankfully. The days when anyone in your business cared about the challenges of physically racking hardware, not overloading the weight limit of the cabinet, or taking pride in how well they'd routed the cables have hopefully passed. You're consuming infrastructure; hopefully you've codified it, but even if you're into ClickOps, making sure it's running is not your problem. No one in your business is concerned with hardware failures, patching routers every time there's a critical vulnerability, testing the UPS and the generators regularly, or upgrading the HVAC when you add more servers. "YAWN-o-rama", as my 15-year-old would say and then curse me for repeating. Your interaction with any of this is a few clicks or a few lines of code, and some infra is available to you with an SLA
If only the story ended there
But sandwiched between those is a grey layer of all the operational enablers; it's where your 'devops' or 'SRE' team lives. So think log aggregation, certificate issuance, security policies, monitoring, service mesh and others. These are things you do for all sorts of reasons, ranging from risk mitigation to emotion, technically unqualified opinion, or simply a lack of foresight about what was around the corner in six months. All of this, while technically fascinating for people like me to stand and stroke my beard at, delivers absolutely zero business value, unless of course your business is building those products.
And who'd want to get into that business!
And that's not all! Recruitment...
You might think you want a devops, right? Oh no, wait: a devops with Kubernetes experience, maybe a CKA? Oh yeah, it's on AWS, and we use Linkerd, and in some places Istio; no, not the current version, or even the same version everywhere. A mix of Pod Security Policies, Kyverno and OPA for policy; some Terraform, Helm, Jenkins, GitHub Actions soup going on, all in a mono-repo, apart from all the stuff that isn't.
We're well outside the remit of commodity skills and back to hunting unicorns.
Sure, you'll find some victims. Sorry...
I mean candidates, that you'll hire. Well, now you've got one hell of an onboarding issue before they can do anything useful and help your business move forward faster than it did without them.
And if you hired smart people, they'll come with experience and their own opinions about what worked for them before, so your landscape gets bigger and bigger and more complex
I did some googling; this is what the CNCF landscape looked like way back in 2017.
Choices, choices as far as the eye can see.
Have you seen it recently?
This has got a bit out of hand. I'd say someone ought to have a word, but I suspect that would just make things worse by adding yet another thing
I can't possibly think of a faster way to go from enthusiastic engineers playing with new exciting tech
To deeply unhappy ones trying to fix something at 4am
And before they can do anything meaningful, they've got an orienteering exercise: switching mental context to whatever permutation of things it is they're looking at.
Meanwhile your business-value-delivering apps are offline, or worse, breached
Rewind a minute: we didn't want any of these things. How did we get here? What can we do about it?
So with that, I'd like to reintroduce Graeme, who is surely going to fix all this mess