or... how I stopped worrying and learned to love containers...
So I have been meaning to write about my hobby "datacenter" for a couple of years now. I finally have it running in a state I'm happy with, and I'm ready to back up the write-up with some code. I'll also go into the tools I chose and why I chose them. I did a lot of research and much trial and error to get to this point, and I'd like to pay it forward to all the great open source software and documentation that's gotten me this far. Thanks to everyone who has contributed to every piece of software I've used and love. I can only hope that some of my experience will help others who are heading down the same path I have chosen.
I've been through many iterations of my "at home" computing stacks since the late 1990s: the gradual progression of tower PCs running Linux on bare metal, adding more storage, moving into virtual machines, purchasing a NAS, moving into managed switches, and so on.
Here's where I'm at today with my hardware:
- Router: 1U SuperMicro, 4x Intel ethernet + IPMI - pfSense
- Compute: 3x 1U SuperMicros with 12 Core CPU, 24GB RAM
- Storage: 5 Bay Synology (one hot spare)
- Network: Cisco SG300-24 Managed Switch
- Form factor: 19" Server rack with UPS
- Wifi: 3x Eero mesh
I run Proxmox as my virtualization solution on each of the three compute-dedicated Supermicros. I have toyed with the idea of PXE booting something like CoreOS to the bare metal, but I'm not ready to go there yet. We'll get to that later.
I am so glad I found Proxmox. VMware licensing got difficult for us cheapskates doing stupid small things at home... and I wanted to find a light management interface that didn't consume 25% of my total resource footprint. Citrix Xen was close, but it still required Windows for a management client. Proxmox is light and clean, and has great HA features. It's basically a wonderful GUI wrapped around KVM on top of the Debian OS that we're all familiar with... System and service upgrades via apt?? #yespls!
Infrastructure Node Provisioning: iPXE, tftpboot and http
I set up PXE boot and tftpboot to run from my Synology NAS. This part is a bit of a hack. There actually are a couple of Terraform plugins out there for automating provisioning of VMs in Proxmox. I've tried them. I'm probably using them wrong, because they want to work off clones or templates and usually time out while that process is happening. In lieu of learning Go and working to change those plugins to meet my needs, I hacked together some shell scripts that work with the Proxmox API to provision VMs.
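As a rough illustration of what a provisioning call against the Proxmox API can look like, here is a minimal Python sketch (my actual scripts are shell). It only builds the request and does not send it. The host name, node name, memory and core counts, and the token value are all made up for the example, and API-token authentication is an assumption that only applies to Proxmox releases that support tokens.

```python
import urllib.parse
import urllib.request

# Hypothetical value: swap in your own Proxmox host.
API_BASE = "https://proxmox.example.lan:8006/api2/json"


def build_create_vm_request(node, vmid, mac, api_token):
    """Build (without sending) the POST that creates an empty VM shell."""
    params = {
        "vmid": vmid,
        "name": f"node{vmid}",  # node101, node102, ...
        "memory": 2048,         # MiB, arbitrary for this sketch
        "cores": 2,             # arbitrary for this sketch
        # Pin the MAC address so DHCP and PXE can recognise the VM.
        "net0": f"virtio={mac},bridge=vmbr0",
    }
    return urllib.request.Request(
        url=f"{API_BASE}/nodes/{node}/qemu",
        data=urllib.parse.urlencode(params).encode("ascii"),
        # Assumed auth scheme: an API token in "user@realm!tokenid=secret" form.
        headers={"Authorization": f"PVEAPIToken={api_token}"},
        method="POST",
    )


req = build_create_vm_request(
    "pve1", 101, "02:00:00:00:00:65", "root@pam!lab=SECRET"
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen` would actually submit it. With a self-signed certificate on the Proxmox host you would also need to supply a suitable SSL context.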
My hack: I apply my own (sequentially incrementing) MAC address pattern to newly provisioned Proxmox VMs, with the suffix based on the next available VM ID. I have a range of DHCP addresses assigned to this MAC address block on my pfSense DHCP server. With PXE/tftpboot you can force certain IP addresses to use a particular boot image. This gives me a known block of 10-20 possible nodes that can serve as application backends when configuring the HAProxy server on my pfSense router, as well as a common block for an SSH config with the public keys already set for this known set of hosts. I named mine "node" followed by the sequential VM IDs from Proxmox (i.e., node101, node102, node103). This greatly simplifies SSH connections to nodes for troubleshooting issues.
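To make the naming scheme concrete, here is a small Python sketch of one way to derive a MAC address and host name from a VM ID, then emit matching SSH config stanzas. The `02:00:00:00` prefix, the choice to encode the VM ID in the last two octets, and the `rancher` login user are assumptions for illustration, not necessarily the exact pattern I use.

```python
# Hypothetical prefix: 02:... marks a locally administered unicast address.
MAC_PREFIX = "02:00:00:00"


def mac_for_vmid(vmid):
    """Encode the VM ID in the last two octets of the MAC address."""
    if not 0 <= vmid <= 0xFFFF:
        raise ValueError("VM ID does not fit in two octets")
    high, low = divmod(vmid, 256)
    return f"{MAC_PREFIX}:{high:02x}:{low:02x}"


def node_name(vmid):
    """Host name convention: 'node' plus the Proxmox VM ID."""
    return f"node{vmid}"


def ssh_config(vmids, user="rancher"):
    """Render one SSH config stanza per node in the known block."""
    stanzas = []
    for vmid in vmids:
        name = node_name(vmid)
        stanzas.append(f"Host {name}\n    HostName {name}\n    User {user}\n")
    return "\n".join(stanzas)


for vmid in range(101, 104):
    print(node_name(vmid), mac_for_vmid(vmid))
print(ssh_config(range(101, 104)))
```

Because both the MAC and the name are pure functions of the VM ID, the DHCP reservations, HAProxy backends and SSH config can all be generated ahead of time for the whole block.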
I'm currently using the very Docker-friendly RancherOS as my base image, as it is the first (sorry CoreOS) to which I've been able to consistently apply a working cloud-config from a web server (again, the Synology, with its internal web server hosting cloud-config.yml files).
As RancherOS is already taking care of the Docker side of things, I just need the cloud-config to start containers for Consul, the NFS persistent storage mounts and the Nomad clients (yes, I know... I'm getting ahead of myself, but the Nomad and Consul stuff is up next).
Services and Orchestration: Nomad, Terraform
I tried to get into Docker Swarm... I really did. Kubernetes? All of those moving parts! I'd just done a slew of professional work around Chef, and dev tools like test-kitchen had me very fond of the simplicity of Hashicorp tools like Vagrant. Then along comes Nomad. I'll have to be honest, it wasn't great when I started working with it, but it has matured quickly. At version 0.7.1 it fits most of my needs, and its integration with other Hashicorp tools like Consul and Vault is amazing. Simplification is key here. Like Hashi's other offerings, Nomad runs as a single executable and acts as a server, client or both. Because of this, automated client node deployments are a snap: one artifact, one config file. Who needs configuration management tools these days?
I currently handle provisioning of my entire service stack via Terraform. It's trivial to launch a directory full of Nomad HCL definitions with a small Terraform script and a single terraform apply. As I work towards more CI/CD across my entire application stack, this part will go away - as the build pipeline will handle the Nomad deployment.
Service Discovery, keys and secrets: Consul and Vault
Yeah, I might be ringing Hashicorp's bell a bit much here, but you can't beat the value of integrating these tools. When you run a Consul client on Nomad nodes, services and service checks are automatically registered with the Consul cluster. Nomad will also handily apply Consul key/value entries and Vault secrets to your Nomad job/group/task environment variables or configurations via the templating engine.
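For a feel of what "key/value entries into environment variables" means, here is a conceptual Python sketch. It is not how Nomad's templating engine is implemented. It only mimics the idea: take a response in the shape returned by Consul's KV HTTP API (a JSON list whose `Value` fields are base64-encoded) and flatten it into environment-style names. The key names and values are invented.

```python
import base64
import json


def kv_to_env(kv_response_json, prefix):
    """Turn a Consul KV listing into a dict of environment variables."""
    env = {}
    for entry in json.loads(kv_response_json):
        if entry["Value"] is None:  # folder-style keys carry no value
            continue
        # couchpotato/config/port -> CONFIG_PORT (relative to the prefix)
        relative = entry["Key"][len(prefix):].strip("/")
        name = relative.replace("/", "_").replace("-", "_").upper()
        env[name] = base64.b64decode(entry["Value"]).decode("utf-8")
    return env


def encode(value):
    """Helper for the sample data: base64-encode the way Consul returns it."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


# Invented sample data shaped like a recursive KV read under "couchpotato/".
sample = json.dumps([
    {"Key": "couchpotato/", "Value": None},
    {"Key": "couchpotato/config/port", "Value": encode("5050")},
    {"Key": "couchpotato/config/data-dir", "Value": encode("/mnt/nfs/cp")},
])

print(kv_to_env(sample, "couchpotato/"))
```

In a real job the equivalent work is done declaratively in the Nomad job file, so nothing like this script needs to run on the node.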
Metrics, Monitoring and Alerting: Prometheus, Grafana and Slack
Once we have all of our stuff up and running, we want to make sure it stays that way. Right? .... Right? I'd rather do something else with my weekends, like play video games all day... I mean, going outside and enjoying nature. So instead of staring at a screen all day, I make robots check stuff and let me know when things go bad.
I use a combination of node-exporter and cAdvisor to expose node and container metrics that Prometheus will scrape. Grafana visualizes them, and its alerting feature chirps to Slack when certain metrics cross thresholds that I'd like to know about. Consul-Alerts also hits up Slack via a webhook when anything goes wrong with nodes and services from Consul's perspective.
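The "chirp to Slack" step boils down to an HTTP POST of a small JSON document to an incoming webhook URL. Here is a minimal Python sketch of that idea, assuming a placeholder webhook URL and invented metric names. Grafana and Consul-Alerts do this for me, so this is only to show the shape of the call. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder: a real incoming-webhook URL is issued by Slack per channel.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"


def build_slack_alert(metric, value, threshold, webhook_url=WEBHOOK_URL):
    """Return a ready-to-send request if the metric is over its threshold."""
    if value <= threshold:
        return None  # nothing to report
    payload = {
        "text": f"{metric} is at {value:.1f} (threshold {threshold:.1f})"
    }
    return urllib.request.Request(
        url=webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


alert = build_slack_alert("node101 cpu_percent", 93.4, 90.0)
if alert is not None:
    print(alert.data.decode("utf-8"))
```

Handing `alert` to `urllib.request.urlopen` would deliver the message to the channel tied to the webhook.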
The nuts and bolts
Here's where we get into the details of my lab stack, top to bottom. Two of the cornerstone components in my lab are my pfSense router and my Synology NAS. I treat my pfSense router as a source-of-truth-for-all-things-Networking and the Synology as the source-of-truth-for-all-things-Persistent-Storage. Therefore, through Synology's Docker service, I run persistent data services like MySQL, Consul, Vault (with Consul storage backend) and the like.
pfSense handles the DHCP, DNS Forwarder and HAProxy services for the entire network. DHCP directs netboot clients to the Synology NAS for PXE/tftpboot, because it's the thing that stores the kernel and boot images. All known service backends are registered with pfSense's HAProxy package through a virtual IP that is in the same subnet as the internal LAN. pfSense's DNS Forwarder resolves all *.service DNS lookups to the HAProxy virtual IP (e.g., http://couchpotato.service). One of the Nomad nodes carries the wonderful Traefik proxy, which reads the service information from Consul and automatically connects the matching backends to the name-translated frontend. Makes sense to me, but yeah, I get it... I'll put up a diagram as soon as I can.
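Until the diagram exists, here is a toy Python sketch of the name translation. It is not Traefik's implementation, and the addresses and round-robin choice are assumptions. A request for `couchpotato.service` is mapped to the Consul service named `couchpotato`, and one of its registered backends is picked.

```python
def service_name_from_host(host, suffix=".service"):
    """Strip the port and the .service suffix from a Host header."""
    host = host.lower().split(":")[0]
    if not host.endswith(suffix):
        return None
    return host[: -len(suffix)]


def pick_backend(host, catalog, counters):
    """Round-robin over the backends registered for the requested service."""
    name = service_name_from_host(host)
    backends = catalog.get(name, [])
    if not backends:
        return None  # unknown service: the proxy would answer with an error
    index = counters.get(name, 0)
    counters[name] = index + 1
    return backends[index % len(backends)]


# Invented stand-in for what the proxy learns from the Consul catalog.
catalog = {"couchpotato": ["10.0.0.101:5050", "10.0.0.102:5050"]}
counters = {}

for _ in range(3):
    print(pick_backend("couchpotato.service", catalog, counters))
```

The DNS side is simpler still: every name ending in `.service` resolves to the same virtual IP, so the Host header is the only thing that distinguishes one service from another.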
Unless things have changed recently, Traefik only handles HTTP connections... so any TCP specific rules need their own custom front and backend rules on pfSense's HAProxy service.
I know that was a quick fly-by of what I've done to make my lab serve my needs. It certainly doesn't fall into the realm of "best practice" with all the DevOps-ness going around these days. At the end of the day, it's the perfect learning tool at a known one-time expense, enabling a wide range of experimental iterations inhibited only by one's available time.
Oh yeah, if you made it this far... or if you TL;DR'd your way here... here's the source code for my implementation of my little piece of the cloud: [coming soon]
Hope it helps. If not, hit me up for questions in the comments. I'd love to help.
Keep on hackin' on...