Waku Staging Deployment
Status: Review
Date: Jan 25, 2022
Background
We have a single node running Waku in our DigitalOcean account. We would like a more robust system for deploying multiple nodes and all of the associated infrastructure, one that does not rely on manual configuration.
Problem
Requirements
- Run multiple nodes, each bootstrapped with the go-waku binary
- Each node must have a public IP address and DNS record for discoverability by clients.
- Nodes must be set up for observability: go-waku’s built-in Prometheus metrics must be pushed to a managed service such as Datadog, and logs should be collected and indexed.
- All infrastructure should be declarative and stored in a git repository for easy, reproducible deployments.
Non-Goals
- Full CI/CD support for changes to go-waku if/when we decide to fork it
- High uptime guarantees
- Creating a framework to deploy many other services
Possible Solutions
Pure Terraform
We can do everything we need with just Terraform and its provisioners. We could start with local state (deploying from a developer’s machine and keeping state on disk), then add remote state (with the state stored in DigitalOcean’s S3-compatible object store, Spaces), possibly move the deployment to CI, and eventually run everything from Terraform Enterprise straight from a Git repository.
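For the remote state step specifically, Terraform’s standard s3 backend can point at a Spaces endpoint. A minimal sketch, where the endpoint region, bucket, and key are assumptions, and the Spaces access keys would be supplied through the usual AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables:

```hcl
terraform {
  backend "s3" {
    # DO Spaces is S3-compatible; "region" is required by the backend but ignored by Spaces.
    endpoint                    = "nyc3.digitaloceanspaces.com" # assumed Spaces region
    bucket                      = "xmtp-terraform-state"        # assumed bucket name
    key                         = "waku-staging/terraform.tfstate"
    region                      = "us-east-1"
    skip_credentials_validation = true
    skip_metadata_api_check     = true
  }
}
```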
The basic Terraform plan would look something like this:
```hcl
variable "digitalocean_token" {
  description = "API Token for DigitalOcean"
}

variable "digitalocean_region" {
  description = "Region to deploy to"
  default     = "nyc1"
}

variable "tld" {
  description = "Top level domain"
  default     = "xmtp.dev" # Do we own this? Can we?
}

variable "num_instances" {
  description = "Number of instances to deploy"
  type        = number
  default     = 2
}

variable "digitalocean_volume_size" {
  description = "Size (in GB) of the permanent volume attached for persistent data"
  default     = "100"
}

variable "initial_ssh_key" {
  description = "Public SSH key for the server admin"
}

variable "private_key" {
  description = "Path to the private key used to connect to the droplet for bootstrapping"
}

variable "datadog_api_key" {
  description = "API key for the Datadog agent"
}

terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

# Configure the DigitalOcean Provider
provider "digitalocean" {
  token = var.digitalocean_token
}

resource "digitalocean_ssh_key" "xmtp_staging" {
  name       = "Staging SSH Key"
  public_key = var.initial_ssh_key
}

# Add volume for persistent data
resource "digitalocean_volume" "waku" {
  count       = var.num_instances
  region      = var.digitalocean_region
  name        = "waku_${count.index}"
  size        = var.digitalocean_volume_size
  description = "Persistent data for node ${count.index}"
}

# Create a set of Waku node instances
resource "digitalocean_droplet" "waku_node" {
  count    = var.num_instances
  image    = "ubuntu-20-04-x64" # 16.04 is EOL; use a currently supported LTS image
  name     = "waku_node_${count.index}"
  region   = var.digitalocean_region
  size     = "s-4vcpu-8gb"
  ssh_keys = [digitalocean_ssh_key.xmtp_staging.id]
  ipv6     = true

  # Attach persistent volume (the bootstrap script below still needs to format and mount it)
  volume_ids = [digitalocean_volume.waku[count.index].id]

  connection {
    host        = self.ipv4_address
    user        = "root"
    type        = "ssh"
    private_key = file(var.private_key)
    timeout     = "2m"
  }

  provisioner "remote-exec" {
    inline = [
      # Install the Datadog agent
      "DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=\"${var.datadog_api_key}\" bash -c \"$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)\"",
      # Build go-waku from source. NOTE: the distro Go package may be too old
      # to build go-waku; a newer toolchain may need to be installed here.
      "apt-get update",
      "apt-get install -y build-essential git golang-go",
      "git clone https://github.com/status-im/go-waku",
      "cd go-waku",
      "make",
      # Should probably do some sort of log rotation.
      # Process monitoring would be nice too (e.g. a systemd unit); if this thing crashes, it won't restart.
      # Run in the background so the provisioner doesn't block on the foreground process.
      "nohup ./build/waku > /var/log/waku.log 2>&1 &",
    ]
  }
}

# Add floating IP for safe system updates
resource "digitalocean_floating_ip" "waku_node" {
  count      = var.num_instances
  droplet_id = digitalocean_droplet.waku_node[count.index].id
  region     = digitalocean_droplet.waku_node[count.index].region
}

# Enable IPv4 (with floating IP address)
resource "digitalocean_record" "waku_node_ipv4" {
  count  = var.num_instances
  domain = var.tld
  type   = "A"
  name   = "waku-node-${count.index}"
  value  = digitalocean_floating_ip.waku_node[count.index].ip_address
}

# Enable IPv6 (non-floating IP address)
resource "digitalocean_record" "waku_node_ipv6" {
  count  = var.num_instances
  domain = var.tld
  type   = "AAAA"
  name   = "waku-node-${count.index}"
  value  = digitalocean_droplet.waku_node[count.index].ipv6_address
}
```
One thing that is notably missing above is a way for the nodes to discover one another and connect. We will have to look into the best way to enable private networking in DO and bootstrap that peer information to the nodes.
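One plausible direction, sketched below, is to place the droplets in a DigitalOcean VPC and template each node’s private address into its peers’ configuration at bootstrap time. The resource names and CIDR range here are illustrative assumptions.

```hcl
# Private network for node-to-node traffic
resource "digitalocean_vpc" "waku" {
  name     = "waku-staging"
  region   = var.digitalocean_region
  ip_range = "10.10.10.0/24" # assumed range
}

# In the digitalocean_droplet.waku_node resource above, each node would join the VPC:
#   vpc_uuid = digitalocean_vpc.waku.id
# Each droplet then exposes an ipv4_address_private attribute, which could be fed
# into the other nodes' static peer configuration during bootstrap.
```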
If desired, we could instead have the nodes run Docker locally and run go-waku as a container rather than running the binary directly on the VM (a rough sketch follows).
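Sketching that variant under the assumption that a published go-waku container image is available (the waku_docker_image variable is hypothetical and would need to point at a real image), the remote-exec provisioner inside the digitalocean_droplet.waku_node resource above would become something like:

```hcl
variable "waku_docker_image" {
  description = "Hypothetical go-waku container image reference"
}

# Inside resource "digitalocean_droplet" "waku_node" { ... }:
provisioner "remote-exec" {
  inline = [
    # Install Docker via Docker's convenience script
    "curl -fsSL https://get.docker.com | sh",
    # Run go-waku as a container; restart automatically if it crashes
    "docker run -d --name waku --restart unless-stopped --network host ${var.waku_docker_image}",
  ]
}
```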
| Pros | Cons |
|---|---|
| Simplest option. That code above is pretty close to all we need | Remote-exec provisioner has the potential to be a little flaky |
| No infrastructure required. Can be run locally | Even with the floating IP, there will be some downtime when upgrading a node. Probably not a big deal, since that won't happen often. |
| Running everything on DigitalOcean droplets is probably the cheapest reasonable hosting option | Cumbersome if we were to need to deploy many more backend services this way |
| Straightforward upgrade path to remote state and Terraform Enterprise, which would be a production-ready solution | Automating interactions with running instances is more complicated than with Kubernetes; kubectl and the wider suite of Kubernetes APIs are very powerful tools we would not have access to. |
Hashicorp Packer
Hashicorp Packer is a tool for building and provisioning VM images in a declarative way. We would define a source image, add the required steps to bootstrap go-waku, and then push that image up to DigitalOcean. The image ID could then be used as an input to the DigitalOcean droplet Terraform resource to actually provision machines based on the image.
This would remove the need for the not-so-pretty provisioner configuration in the above Terraform config, but would still use most of the other Terraform configuration.
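A hedged sketch of what the Packer template might look like; the plugin version, base image, and provisioning steps mirror the Terraform above and are assumptions rather than a tested build:

```hcl
packer {
  required_plugins {
    digitalocean = {
      source  = "github.com/digitalocean/digitalocean"
      version = ">= 1.0.0"
    }
  }
}

variable "digitalocean_token" {
  type      = string
  sensitive = true
}

source "digitalocean" "waku" {
  api_token     = var.digitalocean_token
  image         = "ubuntu-20-04-x64"
  region        = "nyc1"
  size          = "s-1vcpu-1gb"
  ssh_username  = "root"
  snapshot_name = "go-waku-staging"
}

build {
  sources = ["source.digitalocean.waku"]

  # Bake go-waku into the image so droplets boot ready to run
  provisioner "shell" {
    inline = [
      "apt-get update",
      "apt-get install -y build-essential git golang-go",
      "git clone https://github.com/status-im/go-waku /opt/go-waku",
      "cd /opt/go-waku && make",
    ]
  }
}
```

The resulting snapshot could then be looked up from Terraform with the digitalocean_droplet_snapshot data source (most_recent = true) and passed as the droplet image, removing the remote-exec provisioner entirely.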
| Pros | Cons |
|---|---|
| Hashicorp top to bottom, so Terraform integration should be easy | Would likely wind up as throwaway code, once we move to a full Kubernetes cluster setup |
| Fewer moving parts than many of the other alternatives. Just VMs, no Docker or container orchestration | More complicated than the pure Terraform setup |
| Faster and more reliable node starts than remote-exec provisioner | |
| Portable configuration should we choose to move off of Digital Ocean | |
Kubernetes
One day, we are going to have to create a high-quality reference setup, with publicly available templates, for deploying our nodes on Kubernetes, since we know this will be a popular type of deployment for node operators. If we start on Kubernetes, we can do some of this work now and dogfood the setup in staging.
We would provision a Kubernetes cluster using Terraform and then define the application via Helm v3 templates. We could also investigate managing the Helm releases with Flux if we really wanted to go all the way.
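For the cluster itself, a minimal sketch using the same DigitalOcean Terraform provider as above; the DOKS version string and node sizing are assumptions:

```hcl
resource "digitalocean_kubernetes_cluster" "waku" {
  name    = "waku-staging"
  region  = var.digitalocean_region
  version = "1.21.9-do.0" # assumed; pick a currently supported DOKS version

  node_pool {
    name       = "default"
    size       = "s-4vcpu-8gb"
    node_count = var.num_instances
  }
}
```

The go-waku Helm chart would then be installed against this cluster's kube_config output, either by hand or via Terraform's Helm provider.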
| Pros | Cons |
|---|---|
| We are going to have to do this one day | By far the most complicated and hardest to manage |
| Possible to set up a very high-quality production deployment, with zero-downtime deploys, liveness and readiness checks, autoscaling, etc. | More expensive, since we would have to pay for master nodes and a cluster that would likely have considerable spare capacity to handle deployments |
| Easy to expand to heterogeneous workloads, where we can run many types of applications on the same infrastructure | Most of the benefits come when you are running more than one app |
| Most portable. Easy to switch to Google Cloud or AWS | Requires a more complicated deployment process to make updates to the application or its configuration |
| Strong security framework, where we can restrict access to the nodes and set up access control policies | Difficult to expose a static IP from the cluster |
| | Persistent volumes can be difficult to manage, and will be required for our nodes to operate |
| | External-DNS can be a pain to configure and manage |
Recommendation
I think we should go for the simplest option to start, which would be pure Terraform. It would take a little work to pretty up the Terraform above (perhaps breaking it into multiple files and refactoring into modules/plans), possibly add a load balancer (see the sketch below), and improve the bootstrap script to run the application as a proper service. Anything beyond that is likely not a great use of time at this point.
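If we do add a load balancer, a rough sketch using the DigitalOcean provider; the ports are assumptions and would need to match whichever go-waku endpoints we actually want to expose:

```hcl
resource "digitalocean_loadbalancer" "waku" {
  name        = "waku-staging-lb"
  region      = var.digitalocean_region
  droplet_ids = digitalocean_droplet.waku_node[*].id

  forwarding_rule {
    entry_port      = 60000 # assumed go-waku libp2p TCP port
    entry_protocol  = "tcp"
    target_port     = 60000
    target_protocol = "tcp"
  }

  healthcheck {
    port     = 60000
    protocol = "tcp"
  }
}
```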
Questions
- Is Datadog the right vendor for observability? I mostly suggested it because it is the vendor I am most familiar with, and I know it supports the Prometheus metrics exposed by go-waku.
- How much do we care about the reliability of the system in the alpha?
- How much work do we want to put into the infrastructure layer for the alpha? How good is "good enough" at this stage?
- How much do we care about security at this stage? The pure Terraform configuration exposes the nodes to the public internet, which opens them up to all kinds of attack. Do we care enough to put them in a private subnet behind a load balancer?