Fess up. You know it was you.
Pretty run of the mill for me, so not that bad: Pushed a long-running migration during peak load hours that locked an important table for an extended period of time, effectively taking our site offline.
Also consider !ask_experienced_devs@programming.dev :)
Accidentally announced a /12 of IPv6 on a bad copy-paste of a /127.
Started appending a verification line after interface configs to make sure I never missed a trailing character again.
Took 3 months for anyone to notice (circa 2015).
I fixed a bug and gave everyone administrator access once. I didn’t know that bug was… in use (is that the right way to put it?) by the authentication library. So every successful login request, instead of being returned the user who just logged in, was returned the first user in the DB, “admin”.
Had to take down prod for that one. In my four years there, that was the only time we ever took down prod without an announcement.
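Purely as a hypothetical illustration of that bug class (not the actual library's code; table and column names are invented): a user lookup that loses its filter will happily hand back the first row in the table, which in many schemas is the original "admin" account.

```sql
-- Hypothetical sketch only; not the real authentication library's query.
-- Buggy lookup: with no WHERE clause, every successful login gets the
-- first user in the table, which here happens to be "admin".
SELECT id, username, is_admin
FROM users
ORDER BY id
LIMIT 1;

-- Intended lookup: return exactly the user who just authenticated.
SELECT id, username, is_admin
FROM users
WHERE id = 42;  -- 42 stands in for the authenticated user's id
```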
UPDATE articles SET status = 0 WHERE body LIKE '%…%';
On the master production server, running MyISAM, against a text column, millions of rows.
This causes queries to stack up, because MyISAM takes table-level locks.
Rather than waiting for the query to finish, a slave was promoted to master.
Lesson: don’t trust mysqladmin to not do something bad.
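One generic way to keep a statement like that from holding a MyISAM table lock for its entire runtime is to batch it by primary key range; a rough sketch, assuming articles has an auto-increment id (not what was actually run):

```sql
-- Rough sketch, assuming an auto-increment primary key `id` on articles.
-- Each statement only scans a slice of the table, so the table lock is
-- held briefly and other queries can interleave between batches.
UPDATE articles
SET status = 0
WHERE body LIKE '%…%'
  AND id BETWEEN 1 AND 10000;

-- Repeat for the next range (10001-20000, and so on), ideally from a
-- script that pauses between batches.
```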
Table locks can be a real pain. You know you need to make the change, but the system is constantly running queries against the table. Nowadays it’s a bit easier with algorithm=inplace and lock=none, but in the good old days you were on your own. Your only friend was luck. Large migrations like that still give me shivers.
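A minimal sketch of that kind of online change on a modern MySQL/MariaDB InnoDB table (the table and column here are just examples; whether INPLACE/NONE is permitted depends on the specific operation):

```sql
-- Minimal sketch; column name invented for illustration.
-- If the requested algorithm or lock level isn't supported for this
-- operation, the server refuses the ALTER instead of silently locking.
ALTER TABLE articles
  ADD COLUMN reviewed_at DATETIME NULL,
  ALGORITHM=INPLACE,
  LOCK=NONE;
```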
Early in my career as a cloud sysadmin, I accidentally shut down the production database server of a public website for a couple of minutes. Not that bad, and most users probably just got a little annoyed, but it didn’t go unnoticed by management 😬 I had to come up with a BS excuse that it was a false alarm.
Because of the server’s legacy OS image, simply changing the disk size in the cloud management portal wasn’t enough; it was necessary to change the partition table via the command line. I did my research, planned the procedure and fallback process, then spun up a new VM to test it out before trying it on prod. Everything went smoothly, except that at the moment I had to shut down and delete the newly created VM, I shut down the original prod VM instead, because they had similar names.
Put everything back in place, and eventually resized the original prod VM, but not without almost suffering a heart attack. At least I didn’t go as far as deleting the actual database server :D
I did my research, planned the procedure and fallback process, then spun up a new VM to test it out before trying it on prod
Went through a similar process when I was resizing some partitions on my media server. On the test run I forgot to specify G for the new size, so it defaulted to MB when I resized, shrinking a 450 GB partition down to 400 MB. I was real glad I tested that out first.
I tried to change ONE record in the production db, but I forgot the WHERE clause and ended up changing over 2 MILLION records instead. Three-hour production shutdown. Fun times.
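The habit that mistake usually teaches, sketched with made-up table and column names (and assuming a transactional engine like InnoDB): run the one-off change inside a transaction, check the affected row count, and only then commit.

```sql
-- Defensive pattern for hand-run production updates; names are invented.
START TRANSACTION;

UPDATE customers
SET email = 'new@example.com'
WHERE id = 12345;   -- the clause that was forgotten in the story above

-- Check the client's "Rows matched / Changed" output (or re-SELECT the row)
-- before deciding which of these to run:
COMMIT;       -- exactly one row changed, as intended
-- ROLLBACK;  -- anything else
```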
Forgive me, but that’s a figure of speech I’ve never heard before. What does it mean?
By breaking production, I’m referring to a situation where someone, most likely in a technical job, broke a system that was responsible for the operation of some kind of service. Most of the responses here, which have been great to read, are about messing up things like software, databases, servers and other hardware.
Stuff happens and we all make mistakes. It’s what you take away from the experience that matters.
I don’t work with a lot of systems, but one thing instantly comes to mind.
Was doing two deployments at the same time. On the first one, I got to the point where I had to clear the cache. I was typing out the command to remove the temp folder, looked down at the other deployment’s instructions I had in front of me, typed the folder for the prod deployments, and hit enter, deleting all of the currently installed code. It was a clustered machine, and the other machine removed its files within milliseconds.

When I realized what I had done, I just jumped up from my desk and said out loud “I’m fired!!” over and over. Once I calmed down, I had to get back on the call and ask everyone to check their apps. Sure enough, they were all failing. I told them what I had done, and we immediately went to the clustered machine; the files were gone there too.

It took about 8 hours for the backup team to restore everything. They kept having to go find tapes to put in the machine, and it took way longer than anyone expected. Once we got the files restored, we determined that we were all back to the previous day, and everyone’s work from that night was gone, so we had to start the night’s deployments over.

I got grilled about it, and had to write a script to clear the cache from that point on. No more manually removing files. The other good thing that came out of this was no more doing two deployments at the same time. I told them exactly what happened and that when you push people like this, mistakes get made.
Broke teller machines at a bank by accidentally renaming the server all the machines were pointed to. Took an hour to bring back up.
I accidentally destroyed the production system completely through an improper partition resize. We had a database snapshot, but it was on that server as well. After scrambling around for half a day, I managed to recover some of the older data dumps.
So I spun up a new server from scratch, restored the database from a slightly outdated dump, installed the code (which was thankfully managed through git), and configured everything to run, all in an hour or two.
The best part: everybody else knows this as some trivial misconfiguration. This happened in 2021.
Was wondering if anybody here had made the news.
- Create a database,
- Have the organisation manually populate it with lots of records using a web app,
- Accidentally delete the database.
All in between backups.
Plugged a serial cable into a UPS that was not expecting RS232. Took down the entire server room. Beyoop.
That’s a common one I have seen on r/sysadmin.
I think APC is the company with the stupid issue.
Took down the entire server room
ow, goddamn…
You don’t have two unrelated power inputs? (UPS and regular power)
This was 2001 at a shoestring dialup ISP that also did consulting and had a couple small software products. So no.
This is nowhere near the worst on a technical level, but it was my first big fuck up. Some 12+ years ago, I was pretty junior at a very big company that you’ve all heard of. We had a feature coming out that I had entirely developed almost by myself, from conception to prototype to production, and it was getting coverage in some relatively well-known trade magazine or blog or something (I don’t remember) that was coming out the next Monday. But that week, I introduced a bug in the data pipeline code such that, while I don’t remember the details, instead of adding the day’s data, it removed some small amount of data. No one noticed that the feature was losing all its data all week because it still worked (mostly) fine, but by Monday, when the article came out, it looked like it would work, but when you pressed the thing, nothing happened. It was thankfully pretty easy to fix but I went from being congratulated to yelled at so fast.
Accidentally deleted an entire column in a police department’s evidence database early in my career 😬
Thankfully, it only contained filepaths that could be reconstructed via a script. But I was sweating 12+1 bullets. Spent two days rebuilding that.
And if you couldn’t reconstruct, you still had backups, right? … right?!
Oh sweet summer child
What the fuck is a “backups”?
He’s the guy that sits next to fuckups
deleted an entire column in a police department’s evidence database
Based and ACAB-pilled
Advertised an OS deployment to the ‘All Workstations’ collection by mistake. I only realized after 30 minutes when people’s workstations started rebooting. Worked right through the night recovering and restoring about 200 machines.