
Waku Staging Deployment

Status: Review
Date: Jan 25, 2022

Background

We have a single node running Waku in our Digital Ocean account. We would like a more robust system to deploy multiple nodes and all associated infrastructure that is not reliant on manual configuration.

Problem

Requirements

  1. Run multiple nodes, each bootstrapped with go-waku binary
  2. Each node must have a public IP address and DNS record for discoverability by clients.
  3. Nodes must be set up for observability through go-waku's built-in Prometheus metrics, and logs must be collected and indexed. Metrics must be pushed to a managed service such as DataDog.
  4. All infrastructure should be declarative and stored in a git repository for easy, reproducible, deployments.

Non-Goals

  1. Full CI/CD support for changes to go-waku if/when we decide to fork it
  2. High uptime guarantees
  3. Creating a framework to deploy many other services

Possible Solutions

Pure Terraform

We can do everything we need with just Terraform provisioners. We could start with local state (deploying from a developer's machine and keeping state local). Then we can add things like remote state (with the state stored in Digital Ocean's S3-compatible object store), maybe move the deployment to CI, and eventually do everything from Terraform Enterprise straight from a Git repository.
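As an illustration of the remote-state step, the backend stanza might look something like the following. The bucket name and region endpoint are assumptions, and Spaces credentials would be supplied through the usual `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` environment variables:

```hcl
terraform {
  backend "s3" {
    # DigitalOcean Spaces speaks the S3 API but is not AWS, so we skip the
    # AWS-specific validation the s3 backend performs by default.
    endpoint                    = "https://nyc3.digitaloceanspaces.com" # assumption
    region                      = "us-east-1" # ignored by Spaces, required by the backend
    bucket                      = "xmtp-terraform-state" # assumption
    key                         = "waku-staging/terraform.tfstate"
    skip_credentials_validation = true
    skip_metadata_api_check     = true
    skip_region_validation      = true
  }
}
```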

The basic Terraform plan would look something like this:

```hcl
variable "digitalocean_token" {
  description = "API Token for DigitalOcean"
}

variable "digitalocean_region" {
  description = "Region to deploy to"
  default     = "nyc1"
}

variable "tld" {
  description = "Top level domain"
  default     = "xmtp.dev" # Do we own this? Can we?
}

variable "num_instances" {
  description = "Number of instances to deploy"
  type        = number
  default     = 2
}

variable "digitalocean_volume_size" {
  description = "Size (in GB) of the permanent volume attached for persistent data"
  type        = number
  default     = 100
}

variable "initial_ssh_key" {
  description = "Public SSH Key for server admin"
}

variable "private_key" {
  description = "Path to the private key used to connect to the droplet for bootstrapping"
}

variable "datadog_api_key" {
  description = "API Key for DataDog agent"
}

terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

# Configure the DigitalOcean Provider
provider "digitalocean" {
  token = var.digitalocean_token
}

resource "digitalocean_ssh_key" "xmtp_staging" {
  name       = "Staging SSH Key"
  public_key = var.initial_ssh_key
}

# Add volume for persistent data (DO volume names only allow lowercase
# alphanumerics and dashes, so no underscores here)
resource "digitalocean_volume" "waku" {
  count       = var.num_instances
  region      = var.digitalocean_region
  name        = "waku-${count.index}"
  size        = var.digitalocean_volume_size
  description = "Persistent data for node ${count.index}"
}

# Create a set of Waku node instances
resource "digitalocean_droplet" "waku_node" {
  count    = var.num_instances
  image    = "ubuntu-20-04-x64" # 16.04 is EOL and its packaged Go is too old to build go-waku
  name     = "waku-node-${count.index}"
  region   = var.digitalocean_region
  size     = "s-4vcpu-8gb"
  ssh_keys = [digitalocean_ssh_key.xmtp_staging.id]
  ipv6     = true

  # Attach persistent volume
  volume_ids = [digitalocean_volume.waku[count.index].id]

  connection {
    host        = self.ipv4_address
    user        = "root"
    type        = "ssh"
    private_key = file(var.private_key)
    timeout     = "2m"
  }

  provisioner "remote-exec" {
    inline = [
      # Install datadog agent
      "DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=\"${var.datadog_api_key}\" bash -c \"$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)\"",
      "apt-get update",
      "apt-get install -y golang-go git make",
      "git clone https://github.com/status-im/go-waku",
      "cd go-waku",
      "make",
      # Should probably do some sort of log rotation.
      # Process monitoring would be nice too. If this thing crashes, it won't restart, etc.
      # Backgrounded with nohup so the provisioner doesn't hang waiting on the process.
      "nohup ./build/waku > /var/log/waku.log 2>&1 &",
    ]
  }
}

# Add floating ip for safe system updates
resource "digitalocean_floating_ip" "waku_node" {
  count      = var.num_instances
  droplet_id = digitalocean_droplet.waku_node[count.index].id
  region     = digitalocean_droplet.waku_node[count.index].region
}

# Enable IPv4 (with floating ip-address)
resource "digitalocean_record" "waku_node_ipv4" {
  count  = var.num_instances
  domain = var.tld
  type   = "A"
  name   = "waku-node-${count.index}"
  value  = digitalocean_floating_ip.waku_node[count.index].ip_address
}

# Enable IPv6 (non floating ip-address)
resource "digitalocean_record" "waku_node_ipv6" {
  count  = var.num_instances
  domain = var.tld
  type   = "AAAA"
  name   = "waku-node-${count.index}"
  value  = digitalocean_droplet.waku_node[count.index].ipv6_address
}
```

One thing that is notably missing above is a way for the nodes to discover one another and connect. We will have to look into the best way to enable private networking in DO and bootstrap that data to the nodes.
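One possible direction, sketched here as an assumption rather than a settled design, is DigitalOcean's VPC support in the 2.x provider:

```hcl
# Droplets placed in a VPC get a private IP reachable only from other
# resources in the same VPC.
resource "digitalocean_vpc" "waku" {
  name   = "waku-staging" # name is an assumption
  region = var.digitalocean_region
}

# On the droplet resource we would then set:
#   vpc_uuid = digitalocean_vpc.waku.id
# and the resulting private addresses could be rendered into each node's
# static peer list, e.g. via templatefile() in the bootstrap step.
```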

If desired, we could have the nodes run Docker locally and expose a Docker service of go-waku rather than running the application directly on the VM.
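If we went the Docker route, the provisioner could be reduced to something like the following sketch. The image name is a placeholder for whatever go-waku image we build and publish, and the listening port is an assumption:

```hcl
provisioner "remote-exec" {
  inline = [
    "apt-get update && apt-get install -y docker.io",
    # --restart unless-stopped gives us basic process supervision for free,
    # which the bare-VM setup above lacks.
    "docker run -d --restart unless-stopped --name waku -p 60000:60000 our-registry/go-waku:latest",
  ]
}
```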

| Pros | Cons |
| --- | --- |
| Simplest option. The code above is pretty close to all we need | Remote-exec provisioner has the potential to be a little flaky |
| No infrastructure required. Can be run locally | Even with a floating IP, there will be some downtime when upgrading a node. Probably not a big deal, since that won't happen often. |
| Running everything on Digital Ocean droplets is the cheapest reasonable hosting option | Cumbersome if we were to need to deploy many more backend services this way |
| Straightforward upgrade path to remote state and Terraform Enterprise, which would be a production-ready solution | Automating interactions with running instances is more complicated than with k8s. Kubectl, and the whole suite of K8s APIs, are very powerful tools we would not have access to. |

Hashicorp Packer

Hashicorp Packer is a tool for building and provisioning VM images in a declarative way. We would define a source image, add the required steps to bootstrap go-waku, and then push that image up to Digital Ocean. The image ID could then be used as an input to the Digital Ocean Droplet Terraform resource to actually provision machines based on the image.

This would remove the need for the not-so-pretty provisioner configuration in the above Terraform config, but would still use most of the other Terraform configuration.
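A minimal Packer (HCL2) template might look like the sketch below. The plugin version, snapshot name, and builder sizes are assumptions, and the API token is read from the `DIGITALOCEAN_API_TOKEN` environment variable rather than declared inline:

```hcl
packer {
  required_plugins {
    digitalocean = {
      source  = "github.com/digitalocean/digitalocean"
      version = ">= 1.0.0" # assumption; check current plugin releases
    }
  }
}

source "digitalocean" "waku" {
  image         = "ubuntu-20-04-x64"
  region        = "nyc1"
  size          = "s-1vcpu-1gb" # build-time size only; droplets can be larger
  ssh_username  = "root"
  snapshot_name = "waku-node-{{timestamp}}"
}

build {
  sources = ["source.digitalocean.waku"]

  # Bake go-waku into the image instead of bootstrapping at boot
  provisioner "shell" {
    inline = [
      "apt-get update && apt-get install -y golang-go git make",
      "git clone https://github.com/status-im/go-waku && cd go-waku && make",
    ]
  }
}
```

On the Terraform side, the resulting snapshot could be looked up with the `digitalocean_droplet_snapshot` data source and fed into the droplet's `image` argument.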

| Pros | Cons |
| --- | --- |
| Hashicorp top to bottom, so Terraform integration should be easy | Would likely wind up as throwaway code once we move to a full Kubernetes cluster setup |
| Fewer moving parts than many other alternatives. Just VMs, no Docker or container orchestration | More complicated than the pure Terraform setup |
| Faster and more reliable node starts than the remote-exec provisioner | |
| Portable configuration should we choose to move off of Digital Ocean | |

Kubernetes

One day, we are going to have to create a high quality reference setup with publicly available templates for deploying our nodes on Kubernetes, since we know this will be a popular type of deployment for node operators. If we start on Kubernetes we can do some of this work now and dog-food this setup in staging.

We would provision a Kubernetes cluster using Terraform, and then define the application via Helm V3 templates. We could also investigate managing the Helm templates with Flux if we really wanted to go all the way.
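The Terraform side of this could be sketched as follows; the cluster name, node sizes, and version slug are assumptions:

```hcl
resource "digitalocean_kubernetes_cluster" "waku" {
  name    = "waku-staging"
  region  = var.digitalocean_region
  version = "1.21.5-do.0" # assumption; check `doctl kubernetes options versions`

  node_pool {
    name       = "default"
    size       = "s-2vcpu-4gb"
    node_count = 2
  }
}

# The cluster's kube_config output would then be wired into the helm
# provider, with the go-waku chart deployed as a helm_release resource.
```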

| Pros | Cons |
| --- | --- |
| We are going to have to do this one day | By far the most complicated and hardest to manage |
| Possible to set up a very high quality production deployment, with zero-downtime deploys, liveness and readiness checks, autoscaling, etc. | More expensive, since we would have to pay for master nodes, and a cluster would likely have considerable spare capacity to handle deployments |
| Easy to expand to heterogeneous workloads, where we can run many types of application on the same infrastructure | Most of the benefits come when you are running more than one app |
| Most portable. Easy to switch to Google Cloud or AWS | Requires a more complicated deployment process to make updates to the application or its configuration |
| Strong security framework, where we can restrict access to the nodes and set up access control policies | Difficult to expose a static IP from the cluster |
| | Persistent volumes can be difficult to manage, and will be required for our nodes to operate |
| | External-DNS can be a pain to configure and manage |

Recommendation

I think we should go with the simplest option to start: pure Terraform. It would be a little bit of work prettying up the Terraform above (maybe breaking up some files and refactoring into modules/plans), possibly adding a load balancer, and improving the bootstrap script to run the application as a service. Anything beyond that is likely not a great use of time at this point.
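Running the application as a service could be sketched as a provisioner step like the one below. The binary path assumes the clone from the bootstrap above landed in root's home directory:

```hcl
provisioner "remote-exec" {
  inline = [
    # A minimal systemd unit so waku restarts on crash and logs go to journald,
    # replacing the nohup-based launch above.
    "printf '%s\\n' '[Unit]' 'Description=go-waku node' '[Service]' 'ExecStart=/root/go-waku/build/waku' 'Restart=always' '[Install]' 'WantedBy=multi-user.target' > /etc/systemd/system/waku.service",
    "systemctl daemon-reload",
    "systemctl enable --now waku",
  ]
}
```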

Questions

  1. Is DataDog the right vendor for observability? I mostly suggested it because it is what I am most familiar with, and I know it supports the Prometheus metrics exposed by go-waku.
  2. How much do we care about the reliability of the system in the alpha?
  3. How much work do we want to put into the infrastructure layer for the alpha? How good is "good enough" at this stage?
  4. How much do we care about security at this stage? The pure Terraform configuration exposes the nodes to the public internet, which opens them up to all kinds of attacks. Do we care enough to put them in a private subnet behind a load balancer?