The DevOps Engineer's Handbook
167 points 23 comments 3 days ago
readthenotes1
Sad to see test coming only after build
popalchemist
Perhaps that is just the most logical way to organize the material and not an accurate indicator of the role of testing in day-to-day practice.
drewcoo
And only before packaging.
It's as if when cooking there is only one good time to taste the soup.
Glancing through, this looks like a good starting point. Their reading list seems solid:
https://octopus.com/devops/reading-list/
Google's SRE book changed my career. I know it's a little out of date now, but it's well worth reading for the concepts involved, IMO.
This though:
> With DevOps, if you can automate it, you should automate it.
Everyone at some level in this field understands this. I would even go further and say that almost everything is automatable, depending on how much effort you're willing to put into it. However, lots of bad or overwhelmed devops shops I've consulted for seem to be stuck in this insane hell-loop of manual processes, never getting the "time" or "priority" to automate some of them and get off the treadmill. Usually it takes a fair amount of heroics to get out of that, but I have a specific approach to such situations that I've been using successfully for a few years now.
It's always important to remember that "devops" is a completely loaded term that can mean drastically different things depending on the organization.
> With DevOps, if you can automate it, you should automate it.
I couldn't agree less with this. At this point the whole "DevOps" industry is fueled by consultancies who make a great living from convincing business leaders that this is true. Focusing on defining clear processes for recurring events and building the fundamental building blocks that allow you to automate when it's absolutely needed should be the method, not spending more time writing Terraform.
> Focusing on defining clear processes for recurring events and building the fundamental building blocks that allow you to automate when it's absolutely needed should be the method, not spending more time writing Terraform.
For one, I don't really consider Terraform "automation" so much as IaC, but I digress. This is all well and good in mature organizations with robust processes and aligned leadership. In practice, however, what I find most often is very small "devops" shops in companies of 50-300 people that aren't necessarily "tech", with a devops team of 3-5 people (if they're lucky) that the organization, or sometimes even the team itself, sees as glorified IT sysadmins. They're always seen as an expense and are usually critically understaffed. If you leave teams like this to decide on their own what is "necessary" to automate, you're going to get weird/misaligned/dysfunctional results, and even more so if you let the business decide, which is what usually happens. The business doesn't really give a crap if some poor former sysadmin has to spend 12 hours a day clicking buttons in the AWS console as long as they get what they needed (I have actually seen a guy making 150k to basically do just this).
So what happens, like I said, is teams get into this hell-loop of manual task after manual task. Not only does that require large amounts of mental bandwidth to keep track of everything and keep the documentation or playbooks for these manual tasks up to date (if you're lucky enough to get even that), you also have to deal with the inevitable mistakes and errors that come with doing things strictly manually, which eats up a ton of unnecessary time and thus $$.
I agree, though, that most devops consultants are terrible and that the industry is driven by this. However, this is the specific niche I've carved out for myself: coming in after a big, terrible consultant that basically just pitches a brittle Jenkins CI setup and some basic Terraform and charges you $250k for their time. I actually really enjoy doing it too, and the challenges and issues are almost always unique to the org, even if the patterns are similar, so it's always interesting.
So, long story short, unless you have a super robust process and mature system, it's usually just a lot easier to default to "automate" and come up with reasonable exceptions when it doesn't make sense to do so, rather than the other way around.
GitOps (declarative infra and config) and containers make automation really, really easy and completely eliminate whole classes of problems. We push code to prod dozens of times a day without any issues, for months and months at a time, typically backed by only a single devops engineer to keep everything humming along or to build new automation. The automation is the clear process: it regularly spits out messages to email or group chat somewhere, and it provides the audit trail.
I worked at a traditional finance company and we had a team of 8 people in traditional operations and another 30 people doing manual testing around the clock to support about 20 developers, 10 network staff, plus another 20-30 managers or leads and security. We could only deploy once a week and there were always issues with "final check out" on Sunday morning when hotfixes had to go in or config was modified.
Sorry to laugh, but I've worked a lot in fintech and this is so painfully on point. I've also experienced the nirvana of a mature GitOps system. Getting there is really painful though (IMO).
The funny thing is, the fintech company in the example you gave likely sees nothing wrong with this. I've seen cases where the release cycle is once a month or longer, with similar team sizes, and they don't think they have an issue and would probably laugh at you or look at you weird if you mentioned CI/CD.
True. That's why IT services companies have such massive practices dedicated to DevOps. It's a great annuity business for them.
> not spending more time writing Terraform.
Also, there's another bit of nuance to that, as well as your overarching point about "automation isn't free," in that writing Terraform/Tofu isn't usually the long pole in that tent: debugging the raging PoS most certainly is (along with its associated https://xkcd.com/303/ of waiting for the "plan, attempt apply, puke, goto 1" loop)
And, in almost the exact same vein: writing any automation carries with it two downstream bits of work, namely monitoring the automation and having enough context to debug it when (WHEN) it falls over.
People don’t know about https://dagger.io/, which solves this.
My experience has been that it's not a lack of knowledge; it's a combination of inertia, cargo culting, and give-a-shit.
There are so many great tools that solve so many problems but life is filled with trade-offs and many people don't value the same trade-offs that I do, so they just bash their head against Terraform (or $other_legacy_tool) because "it's what we use"
I was really hoping that Earthly or Dagger were going to catch on due to the enormous number of folks that complain about not being able to run GitHub Actions (or GLCI) locally, on top of bitching about yaml alllllllllll the fucking time. But, same problem, IMHO: inertia is so strong
The fundamental issue is that devops guys don't have a budget with which to buy tools like Dagger (or Earthly), so the market is limited to companies that have tech-literate management, which is a very small market.
It's somewhat this, but a lot of it is the fact that a huge, unbelievable chunk of "devops" guys are former sysadmins pigeonholed into devops because every organization thought that was a natural progression, so a devops engineer who is very good at writing Go or JavaScript is kind of a unicorn, at least in my experience (I have to hire sometimes). They're usually fairly proficient with scripting languages, but sometimes not even that. Since Terraform/HCL/YAML are more configuration languages with a lot less "logic" in them, they're more comfortable for a lot of people with that background, especially when they're already used to tools like Ansible/Helm/etc.
uh-huh: https://github.com/earthly/earthly/blob/v0.8.15/LICENSE (MPLv2, just like TF used to be) https://github.com/dagger/dagger/blob/v0.15.2/LICENSE (Apache 2)
IIRC they went fully open source because they couldn't make it as a for-profit company.
Selling dev tools is ferociously hard, as partially evidenced by this thread talking about how changing anyone's development flow/tooling/process is also ferociously hard
I would guess dev tooling usually also falls into the "nice to have," or as my former CEO used to say "vitamins vs painkillers"
Ha you got me there. Well in that case: cursed be the devops guys and their devilish inertia & groupthink.
I completely agree, not every task deserves to be automated
I'm really curious, can you share even the broad strokes of how you approach that? I feel stuck in that loop all the time.
I probably made it sound fancier than it is. In a consulting situation, by the time someone is willing to listen to what I say they've likely been experiencing considerable pain and are willing to take advice, so it's a little different than when you are stuck there yourself. The approach is basically biting off the smallest pieces you can that give the largest value (I measure value roughly by estimated time spent per week). So if it takes me 20 mins to automate a task that eats 3 hours a week of devops time, that's of course a super easy target. Not all tasks are like this, and people are bad at estimating how much time they spend on stuff like this, so it requires a diligent approach and honest introspection (one of the hardest parts to me; people are wildly delusional sometimes). This involves a lot of meetings, alignment, etc.
The times I've been stuck in these situations, it's mostly been about inflicting or bringing notice to enough pain that the business backs off on some tighter deadlines to give more time to automate the tasks that will free up the most bandwidth or time, and it's a lot of small bites. I typically will start (if I can) by insisting that anything new that makes sense to automate gets automated (making a basic estimate of time per week and effort to automate, and factoring in long-term maintenance, to use as a pitch to sell to management) and stick rigidly to that until it starts taking over other legacy processes in sort of a slow strangler pattern. This can look wildly different depending on the infrastructure setup and needs and how much firefighting you're doing, which is usually a lot. So sometimes that first step is just putting out all the fires or gaining enough visibility/monitoring to know which fires need to be jumped on immediately and which ones don't, and, most importantly, proving that to the business.
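For anyone who wants that estimate written down, here's a rough sketch in Python. It's only an illustration of the arithmetic, not a tool I'm claiming to use: the Task class, its field names, and the example tasks are all made up, and the only number taken from the comment above is the "20 mins to automate vs. 3 hours a week" case. It ranks tasks by how many weeks it takes for the automation to pay for itself, with ongoing maintenance subtracted from the weekly saving.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical record of one manual task (all fields are estimates)."""
    name: str
    manual_hours_per_week: float         # time currently spent doing it by hand
    automate_hours: float                # one-off effort to automate it
    upkeep_hours_per_week: float = 0.0   # long-term maintenance of the automation

    def weekly_saving(self) -> float:
        # Hours freed up per week once the automation exists.
        return self.manual_hours_per_week - self.upkeep_hours_per_week

    def break_even_weeks(self) -> float:
        # Weeks until the one-off effort is paid back; never, if upkeep
        # costs as much as (or more than) the manual work did.
        saving = self.weekly_saving()
        if saving <= 0:
            return float("inf")
        return self.automate_hours / saving


def rank(tasks):
    """Return tasks ordered from fastest to slowest payback."""
    return sorted(tasks, key=lambda t: t.break_even_weeks())


# Made-up examples; only the second mirrors the 20 min / 3 h per week case.
tasks = [
    Task("weekly report export", 0.5, 16.0, 0.25),
    Task("console click-through deploy", 3.0, 20 / 60),
    Task("quarterly audit dump", 0.1, 8.0, 0.2),
]

for t in rank(tasks):
    print(f"{t.name}: pays back in {t.break_even_weeks():.2f} weeks")
```

The numbers going in are the weak point, since people are bad at estimating their own time, so treat the ranking as a conversation starter for the pitch rather than a verdict.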
Unfortunately though, leadership buy-in is (to me) almost always the hardest part, which is why I say "inflicting pain" the way I do. It sounds bad, but if they do not feel pain you will never, ever get priority to do anything because, IME and like I said, "devops" guys are, to most businesses (and even a lot of technical people), glorified sysadmins that cost and demand way too much.