I know that for data storage the best bet is a NAS and RAID1 or something in that vein, but what about all the Docker containers you’re running, the carefully configured services on your RPi, the *arr services installed on your PC, and so on?

Do you have a simple way to automate backups and re-installs of these as well or are you just resigned to having to eventually reconfigure them all when the SD card fails, your OS needs a reinstall or the disk dies?

  • Eskuero

    My docker containers are all configured via docker compose, so I just tar the .yml files and the outside data volumes and back that up to an external drive.
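    A rough sketch of that kind of backup, assuming a hypothetical layout where each stack has its own folder under ~/docker containing the compose .yml and its bind-mounted data (paths and layout are illustrative, not necessarily the poster's setup):

    ```bash
    #!/usr/bin/env bash
    SRC="$HOME/docker"            # one sub-folder per stack: compose file + data dirs
    DEST="/mnt/external/backups"  # mount point of the external drive
    STAMP="$(date +%F)"

    # Stop the stacks so databases and configs are consistent on disk.
    for project in "$SRC"/*/; do
        docker compose --project-directory "$project" stop
    done

    # One tarball with all compose files and outside data volumes.
    tar czf "$DEST/docker-$STAMP.tar.gz" -C "$SRC" .

    for project in "$SRC"/*/; do
        docker compose --project-directory "$project" start
    done
    ```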

    For configs living in /etc you can also back them all up, but I guess it’s harder to remember what you modified and where, which is why you should document your setup step by step.

    Something nice and easy I use for personal documentation is mdBook.
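    For reference, mdBook is driven by a small CLI; something like this gets a personal docs site going (the directory name is just an example):

    ```bash
    cargo install mdbook        # or grab a release binary
    mdbook init homelab-docs    # scaffolds book.toml and a src/ folder of Markdown files
    mdbook serve homelab-docs   # live preview at http://localhost:3000
    mdbook build homelab-docs   # renders static HTML into book/
    ```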

    • KaldoOP

      Ahh, so the best Docker practice is to always just use outside data volumes and back those up separately; seems kinda obvious in retrospect. What about mounting them directly from the NAS (or even running Docker on the NAS)? For local networks the performance is probably good enough. That way I wouldn’t have to schedule regular syncs and transfers between “local” device storage and the NAS. Dunno if it would have a negative effect on drive longevity compared to just running a daily backup.

      • Adam

        If you’ve got a good network path NFS mounts work great. Don’t forget to also back up your compose files. Then bringing a machine back up is just a case of running them.
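        One way to do that from the Docker side, assuming an NFSv4 export on the NAS (the address and export path below are placeholders):

        ```bash
        # Named volume backed by an NFS share; containers mount "media_nfs" like any other volume.
        docker volume create \
          --driver local \
          --opt type=nfs \
          --opt o=addr=192.168.1.10,rw,nfsvers=4 \
          --opt device=:/export/media \
          media_nfs
        # The same type/o/device options can also be declared under driver_opts in a compose file.
        ```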

  • @RegalPotoo@lemmy.world

    Infrastructure as code/config as code.

    The configuration of all the actual machines is managed by Puppet, with all its configs in a git repo. All the actual applications are deployed on top of Kubernetes, with their configuration managed by helmfile and also tracked in git. I don’t set anything up by hand - I describe how I want things configured, and the tools do the actual work.

    There is a “cold start” issue in my scheme: Puppet requires a server component that runs on Kubernetes, but I can’t deploy onto Kubernetes until the host machines have had their Puppet manifests applied. At that point, though, I can just read the code and do enough of the config by hand to bootstrap everything from scratch if I have to.
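    A hedged sketch of what that cold start could look like on the command line; the repo URLs, paths and manifest names are invented for illustration:

    ```bash
    # 1. Hand-apply the host configuration straight from the git checkout,
    #    since the Puppet server isn't up yet.
    git clone https://git.example.com/infra/puppet.git
    sudo puppet apply --modulepath puppet/modules puppet/manifests/site.pp

    # 2. Once Kubernetes is reachable, let helmfile reconcile every declared release.
    git clone https://git.example.com/infra/helmfiles.git
    cd helmfiles && helmfile sync
    ```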

  • HeartyBeast

    “carefully configured services on your RPi”

    I have a backup on an SD card waiting for the day the SD card fails. Slot it in and reboot.
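    For anyone wanting the same safety net, the spare card is typically just a raw copy; device names vary (check lsblk), and it’s safest to image the card from another machine while the Pi is powered off:

    ```bash
    # Read the Pi's card into an image file.
    sudo dd if=/dev/mmcblk0 of=pi-backup.img bs=4M status=progress conv=fsync

    # Write that image onto the spare card kept in the drawer.
    sudo dd if=pi-backup.img of=/dev/sdX bs=4M status=progress conv=fsync
    ```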

    • @desentizised@lemm.ee

      I recently “upgraded” one of my Raspberry Pi’s SD cards to an industrial-grade one. Those seem to be a lot slower, but for that particular use case it doesn’t matter to me. What matters is that the card doesn’t die. It runs noticeably cooler when lots of data is being written to it, so I feel like I must be onto something there.

  • @ikidd@lemmy.world

    I run everything on a 2-node Proxmox cluster with ZFS mirror volumes and replication of the VMs and CTs between them, run PBS with hourly snapshots, and sync that to multiple USB drives I swap off-site.

    The docker VM can be ZFS-snapshotted before major updates so I can roll back.
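    A minimal sketch of that snapshot/rollback cycle, assuming the VM disk is a ZFS zvol named the way Proxmox usually names them (check yours with zfs list):

    ```bash
    # Before the update (ideally with the VM shut down or quiesced):
    zfs snapshot rpool/data/vm-100-disk-0@pre-update

    # If the update goes sideways, put the disk back exactly as it was:
    zfs rollback rpool/data/vm-100-disk-0@pre-update
    ```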

    • @twei@feddit.de

      You should get another node; otherwise, when node1 fails, node2 will reboot itself and then do nothing because it has no quorum.

        • @twei@feddit.de

          I know, but every time I had to do that it felt like a janky solution. If you have a Raspberry Pi or something like that, you can also set it up as a qdevice.

          …and if you’re completely fine with how it is you can also just leave it like it is

          • @ikidd@lemmy.world

            So since I now had a “quorate” cluster again, I thought I’d try out HA. I’d always been under the impression that unless you had a shared storage LUN, you couldn’t HA anything. But I thought I’d trigger a replication and then down the 2nd node just as a test. And lo and behold, the first node brought up my OPNsense VM from the replicated image about 2 minutes after the second node lost contact, and the internet started working again.

            I’m really excited about having that feature working now. This was a good night, thank you.

            • @twei@feddit.de

              If you need another thing to do, you could try to make your opnsense HA and never have your internet stop working while rebooting a node. It’s pretty simple to set up, you might finish it in 1-2 evenings. Happy clustering!

              • @ikidd@lemmy.world

                I’ll look into that. I did see the option in opnsense once upon a time but never investigated it.

          • @ikidd@lemmy.world

            So I started to write a reply saying basically that I was OK doing that manually, but then thought, “hell, I have a PBS box on the network that would do that fine”. It took about 3 minutes to install the corosync-qdevice packages on all three and enable it. Good to go.
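            For anyone following along, adding a QDevice to a two-node Proxmox cluster is roughly this (the IP is a placeholder for the third box):

            ```bash
            # On the external machine that only casts a vote (PBS box, Pi, ...):
            apt install corosync-qnetd

            # On both cluster nodes:
            apt install corosync-qdevice

            # From one cluster node, register the QDevice and check quorum:
            pvecm qdevice setup 192.168.1.50
            pvecm status
            ```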

            Thanks for the kick in the ass.

  • @CarbonatedPastaSauce@lemmy.world

    I actually run everything in VMs and have two hypervisors that sync everything to each other constantly, so I have hot failover capability. They also back up their live VMs to each other every day or week depending on the criticality of the VM. That way I also have some protection against OS issues or a wonky update.

    Probably overkill for a self hosted setup but I’d rather spend money than time fixing shit because I’m lazy.

    • @surewhynotlem@lemmy.world

      HA is not a backup. It may protect you from a drive failure, but it does nothing about data corruption.

      I learned this the hard way when Cryptomator corrupted some of my files; I noticed, but didn’t have backups.

      • rentar42

        Yeah, there’s a bunch of lessons that tend to only be learned the hard way, despite most guides mentioning them.

        Similar to how RAID should not be treated as a backup.

  • @simpleslipeagle@lemmynsfw.com

    My server has a RAID1 mdadm boot drive and an 8-drive RAID6 with ZFS. It’s been running for 14 years now. The only thing that I haven’t replaced over its lifetime is the chassis. In fact, the proc let out the magic smoke a few weeks ago; after some new parts it’s still going strong.
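    For illustration, the general shape of such a layout (device names are examples; raidz2 is ZFS’s double-parity equivalent of RAID6):

    ```bash
    # Mirrored boot/OS array:
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

    # 8-drive double-parity data pool:
    zpool create tank raidz2 /dev/sdc /dev/sdd /dev/sde /dev/sdf \
                             /dev/sdg /dev/sdh /dev/sdi /dev/sdj
    ```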

  • @drkt@feddit.dk

    Configs are backed up and I can spin up a new container in minutes, so I just accept the manual labor. It’s probably a good thing to clean out the spiders and skeletons every now and then.

  • @friend_of_satan@lemmy.world

    I’ve had a complete drive failure twice within the last year (really old hardware), and my ansible + docker + backup setup made it really easy to recover from. I got new hardware and was back up and running within a few hours.

    All of your services’ setup should be automated (through docker-compose, ansible or whatever) and all your configuration data should be backed up. This should make it easy to migrate services from one machine to another, and also to recover from a disaster.
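    A hedged sketch of what that recovery flow can look like; the repo, inventory and paths below are placeholders rather than the poster’s actual setup:

    ```bash
    # On the replacement machine:
    git clone https://git.example.com/me/homelab.git && cd homelab
    ansible-playbook -i inventory.ini site.yml        # packages, docker, compose files, configs

    # Put the persistent service data back from the latest backup, then start the stacks.
    rsync -a /mnt/backups/latest/appdata/ /srv/appdata/
    docker compose -f /srv/appdata/compose.yml up -d
    ```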

  • @Decronym@lemmy.decronym.xyzB

    Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

    Git: Popular version control system, primarily for code
    HA: Home Assistant automation software; also High Availability
    LXC: Linux Containers
    NAS: Network-Attached Storage
    Plex: Brand of media server package
    RAID: Redundant Array of Independent Disks for mass storage
    RPi: Raspberry Pi brand of SBC
    SBC: Single-Board Computer
    SSD: Solid State Drive mass storage


  • @ehrenschwan@feddit.de

    I use duplicati for docker containers. You just host it in docker and attach all the persistent volumes from the other containers to it, then you can set up backup jobs for each.
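    A sketch of that arrangement using the widely used LinuxServer.io image (image tag, port and paths are illustrative):

    ```bash
    # Mount the other containers' persistent volumes read-only under /source,
    # plus a target path for the backups themselves.
    docker run -d --name duplicati \
      -p 8200:8200 \
      -v /srv/duplicati/config:/config \
      -v /srv/appdata:/source/appdata:ro \
      -v /mnt/backups:/backups \
      lscr.io/linuxserver/duplicati:latest
    # Backup jobs for each mounted path are then defined in the web UI on port 8200.
    ```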

  • lemmyvore
    • Install Debian stable with the ssh server included.
    • Keep a list of the packages that were installed after (there aren’t many but still).
    • All docker containers have their compose files and persistent user data on a RAID1 array.
    • Have a backup job that rsyncs /etc, /home/user and /mnt/array1/docker once a day to daily/ on another RAID1; once a week rsync daily/ to weekly/; once a month make a timestamped tarball of weekly/ into monthly/. Once a month I also bring out an HDD from the drawer and do a backup of monthly/ with Borg (sketched below).

    For recovery:

    • Reinstall Debian + extra packages.
    • Restore the docker compose and persistent files.
    • Run docker compose on containers.

    Note that some data may need additional handling; for example, databases should be dumped, not rsynced.
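    A condensed sketch of the rotation above; array paths, the Borg repo location and the database example are illustrative:

    ```bash
    BK=/mnt/array2/backups

    # Daily: rsync the live data into daily/
    rsync -a --delete /etc /home/user /mnt/array1/docker "$BK/daily/"

    # Weekly: roll daily/ into weekly/
    rsync -a --delete "$BK/daily/" "$BK/weekly/"

    # Monthly: timestamped tarball of weekly/ into monthly/, then an offline Borg
    # copy of monthly/ onto the HDD from the drawer (repo created once with borg init).
    tar czf "$BK/monthly/backup-$(date +%Y-%m).tar.gz" -C "$BK" weekly
    borg create /mnt/offline-hdd/repo::monthly-$(date +%Y-%m) "$BK/monthly"

    # Databases get dumped first rather than rsynced live, e.g.:
    docker exec postgres pg_dump -U app appdb > /mnt/array1/docker/postgres/dump.sql
    ```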

  • dr_robot

    My configuration and deployment is managed entirely via an Ansible playbook repository. In case of absolute disaster, I just have to redeploy the playbook. I do run all my stuff on top of mirrored drives so a single failure isn’t disastrous if I replace the drive quickly enough.

    For when that’s not enough, the data itself is backed up hourly (via ZFS snapshots) to a spare pair of drives and nightly to S3 buckets in the cloud (via restic). Everything automated with systemd timers and some scripts. The configuration for these backups is part of the playbooks of course. I test the backups every 6 months by trying to reproduce all the services in a test VM. This has identified issues with my restoration procedure (mostly due to potential UID mismatches).
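    A rough sketch of the two backup legs described (dataset, pool and bucket names are invented; credentials and the systemd timers are left out):

    ```bash
    # Hourly leg: snapshot the dataset and replicate it to the spare pair of drives.
    # In practice later runs would use incremental sends (zfs send -i) to stay cheap.
    SNAP="tank/services@$(date +%Y%m%d%H%M)"
    zfs snapshot "$SNAP"
    zfs send "$SNAP" | zfs recv -F backup/services

    # Nightly leg: push the data to an S3 bucket with restic
    # (repo initialised once with `restic init`; AWS keys come from the environment).
    restic -r s3:s3.amazonaws.com/my-backup-bucket backup /tank/services
    ```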

    And yes, I have once been forced to reinstall from scratch and I managed to do that rather quickly through a combination of playbooks and well tested backups.

    • @subtext@lemmy.world

      Dang I really like your idea of testing the backup in a VM… I was worried about how I’d test mine since I only have the one machine, but a VM on my desktop or something should do just fine.

  • @CameronDev@programming.dev

    I rsync my root and everything under it to a NAS, which will hopefully save my data. I wrote some scripts manually to do that.
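    Such a script often boils down to a single rsync invocation along these lines (the NAS hostname and target path are placeholders):

    ```bash
    #!/usr/bin/env bash
    # Mirror the root filesystem to the NAS, skipping pseudo-filesystems and mounts.
    rsync -aAXH --delete \
      --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} \
      / nas:/backups/"$(hostname)"/
    ```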

    I think the next best thing to do is to document your setup as much as possible, either with typed-up notes or with ansible/packer/whatever. Any documentation is better than nothing if you have to rebuild.

    • @foggy@lemmy.world

      I have a 16tb USB HDD that syncs to my NAS whenever my workstation is idle for 20 minutes.

    • @darvocet@infosec.pub

      I run history and then clean it up so I have a guide to follow for the next setup. It’s not even so much for drive failure as for moving to newer OS versions when they become available.

      The ‘data’ is backed up by scripts that tar folders up and scp them off to another server.
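      The shape of such a script, with the folder and target host made up for illustration:

      ```bash
      STAMP="$(date +%F)"
      tar czf "/tmp/appdata-$STAMP.tar.gz" /srv/appdata
      scp "/tmp/appdata-$STAMP.tar.gz" backup@otherserver:/backups/
      rm "/tmp/appdata-$STAMP.tar.gz"
      ```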

  • @emax_gomax@lemmy.world

    I use docker, so I don’t really have to worry about reproducibility of the services or configurations; Docker will fetch the right services and versions. I’ve documented the core configurations so I can set them back up relatively easily. Anything custom I haven’t documented I’ll just have to remember or set up again from scratch.