Waku Staging Deployment
Status: Review
Date: Jan 25, 2022
Background
We have a single node running Waku in our DigitalOcean account. We would like a more robust system for deploying multiple nodes and all of the associated infrastructure, one that does not rely on manual configuration.
Problem
Requirements
- Run multiple nodes, each bootstrapped with the go-waku binary
- Each node must have a public IP address and DNS record for discoverability by clients.
- Nodes must be set up for observability: go-waku’s built-in Prometheus metrics must be pushed to a managed service such as Datadog, and logs should be collected and indexed.
- All infrastructure should be declarative and stored in a git repository for easy, reproducible deployments.
Non-Goals
- Full CI/CD support for changes to go-waku if/when we decide to fork it
- High uptime guarantees
- Creating a framework to deploy many other services
Possible Solutions
Pure Terraform
We can do everything we need with just Terraform and its provisioners. We could start with local state (deploying from a developer’s machine and keeping state on disk), then add remote state (with the state stored in DigitalOcean’s S3-compatible object store, Spaces), possibly move the deployment to CI, and eventually run everything from Terraform Enterprise straight from a Git repository.
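For the remote state step specifically, Terraform’s standard s3 backend can point at a Spaces endpoint. A minimal sketch, where the endpoint region, bucket, and key are assumptions, and the Spaces access keys would be supplied through the usual AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables:

```hcl
terraform {
  backend "s3" {
    # DO Spaces is S3-compatible; "region" is required by the backend but ignored by Spaces.
    endpoint                    = "nyc3.digitaloceanspaces.com" # assumed Spaces region
    bucket                      = "xmtp-terraform-state"        # assumed bucket name
    key                         = "waku-staging/terraform.tfstate"
    region                      = "us-east-1"
    skip_credentials_validation = true
    skip_metadata_api_check     = true
  }
}
```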
The basic Terraform plan would look something like this:
```hcl
variable "digitalocean_token" {
  description = "API Token for DigitalOcean"
}

variable "digitalocean_region" {
  description = "Region to deploy to"
  default     = "nyc1"
}

variable "tld" {
  description = "Top level domain"
  default     = "xmtp.dev" # Do we own this? Can we?
}

variable "num_instances" {
  description = "Number of instances to deploy"
  type        = number
  default     = 2
}

variable "digitalocean_volume_size" {
  description = "Size (in GB) of the permanent volume attached for persistent data"
  default     = "100"
}

variable "initial_ssh_key" {
  description = "Public SSH key for the server admin"
}

variable "private_key" {
  description = "Path to the private key used to connect to the droplet for bootstrapping"
}

variable "datadog_api_key" {
  description = "API key for the Datadog agent"
}

terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

# Configure the DigitalOcean Provider
provider "digitalocean" {
  token = var.digitalocean_token
}

resource "digitalocean_ssh_key" "xmtp_staging" {
  name       = "Staging SSH Key"
  public_key = var.initial_ssh_key
}

# Add volume for persistent data
resource "digitalocean_volume" "waku" {
  count       = var.num_instances
  region      = var.digitalocean_region
  name        = "waku_${count.index}"
  size        = var.digitalocean_volume_size
  description = "Persistent data for node ${count.index}"
}

# Create a set of Waku node instances
resource "digitalocean_droplet" "waku_node" {
  count    = var.num_instances
  image    = "ubuntu-20-04-x64" # 16.04 is EOL; use a currently supported LTS image
  name     = "waku_node_${count.index}"
  region   = var.digitalocean_region
  size     = "s-4vcpu-8gb"
  ssh_keys = [digitalocean_ssh_key.xmtp_staging.id]
  ipv6     = true

  # Attach persistent volume (the bootstrap script below still needs to format and mount it)
  volume_ids = [digitalocean_volume.waku[count.index].id]

  connection {
    host        = self.ipv4_address
    user        = "root"
    type        = "ssh"
    private_key = file(var.private_key)
    timeout     = "2m"
  }

  provisioner "remote-exec" {
    inline = [
      # Install the Datadog agent
      "DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=\"${var.datadog_api_key}\" bash -c \"$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)\"",
      # Build go-waku from source. NOTE: the distro Go package may be too old
      # to build go-waku; a newer toolchain may need to be installed here.
      "apt-get update",
      "apt-get install -y build-essential git golang-go",
      "git clone https://github.com/status-im/go-waku",
      "cd go-waku",
      "make",
      # Should probably do some sort of log rotation.
      # Process monitoring would be nice too (e.g. a systemd unit); if this thing crashes, it won't restart.
      # Run in the background so the provisioner doesn't block on the foreground process.
      "nohup ./build/waku > /var/log/waku.log 2>&1 &",
    ]
  }
}

# Add floating IP for safe system updates
resource "digitalocean_floating_ip" "waku_node" {
  count      = var.num_instances
  droplet_id = digitalocean_droplet.waku_node[count.index].id
  region     = digitalocean_droplet.waku_node[count.index].region
}

# Enable IPv4 (with floating IP address)
resource "digitalocean_record" "waku_node_ipv4" {
  count  = var.num_instances
  domain = var.tld
  type   = "A"
  name   = "waku-node-${count.index}"
  value  = digitalocean_floating_ip.waku_node[count.index].ip_address
}

# Enable IPv6 (non-floating IP address)
resource "digitalocean_record" "waku_node_ipv6" {
  count  = var.num_instances
  domain = var.tld
  type   = "AAAA"
  name   = "waku-node-${count.index}"
  value  = digitalocean_droplet.waku_node[count.index].ipv6_address
}
```
One thing that is notably missing above is a way for the nodes to discover one another and connect. We will have to look into the best way to enable private networking in DO and bootstrap that peer information to the nodes.
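One plausible direction, sketched below, is to place the droplets in a DigitalOcean VPC and template each node’s private address into its peers’ configuration at bootstrap time. The resource names and CIDR range here are illustrative assumptions.

```hcl
# Private network for node-to-node traffic
resource "digitalocean_vpc" "waku" {
  name     = "waku-staging"
  region   = var.digitalocean_region
  ip_range = "10.10.10.0/24" # assumed range
}

# In the digitalocean_droplet.waku_node resource above, each node would join the VPC:
#   vpc_uuid = digitalocean_vpc.waku.id
# Each droplet then exposes an ipv4_address_private attribute, which could be fed
# into the other nodes' static peer configuration during bootstrap.
```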
If desired, we could instead have the nodes run Docker locally and run go-waku as a container rather than running the binary directly on the VM (a rough sketch follows).
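Sketching that variant under the assumption that a published go-waku container image is available (the waku_docker_image variable is hypothetical and would need to point at a real image), the remote-exec provisioner inside the digitalocean_droplet.waku_node resource above would become something like:

```hcl
variable "waku_docker_image" {
  description = "Hypothetical go-waku container image reference"
}

# Inside resource "digitalocean_droplet" "waku_node" { ... }:
provisioner "remote-exec" {
  inline = [
    # Install Docker via Docker's convenience script
    "curl -fsSL https://get.docker.com | sh",
    # Run go-waku as a container; restart automatically if it crashes
    "docker run -d --name waku --restart unless-stopped --network host ${var.waku_docker_image}",
  ]
}
```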
| Pros | Cons |
|---|---|
| Simplest option. That code above is pretty close to all we need | Remote-exec provisioner has the potential to be a little flaky |
| No infrastructure required. Can be run locally | Even with the floating IP, there will be some downtime when upgrading a node. Probably not a big deal, since that won't happen often. |
| Running everything on DigitalOcean droplets is probably the cheapest reasonable hosting option | Cumbersome if we were to need to deploy many more backend services this way |
| Straightforward upgrade path to remote state and Terraform Enterprise, which would be a production-ready solution | Automating interactions with running instances is more complicated than with Kubernetes; kubectl and the wider suite of Kubernetes APIs are very powerful tools we would not have access to. |
Hashicorp Packer
Hashicorp Packer is a tool for building and provisioning VM images in a declarative way. We would define a source image, add the required steps to bootstrap go-waku, and then push that image up to DigitalOcean. The image ID could then be used as an input to the DigitalOcean droplet Terraform resource to actually provision machines based on the image.
This would remove the need for the not-so-pretty provisioner configuration in the above Terraform config, but would still use most of the other Terraform configuration.
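A hedged sketch of what the Packer template might look like; the plugin version, base image, and provisioning steps mirror the Terraform above and are assumptions rather than a tested build:

```hcl
packer {
  required_plugins {
    digitalocean = {
      source  = "github.com/digitalocean/digitalocean"
      version = ">= 1.0.0"
    }
  }
}

variable "digitalocean_token" {
  type      = string
  sensitive = true
}

source "digitalocean" "waku" {
  api_token     = var.digitalocean_token
  image         = "ubuntu-20-04-x64"
  region        = "nyc1"
  size          = "s-1vcpu-1gb"
  ssh_username  = "root"
  snapshot_name = "go-waku-staging"
}

build {
  sources = ["source.digitalocean.waku"]

  # Bake go-waku into the image so droplets boot ready to run
  provisioner "shell" {
    inline = [
      "apt-get update",
      "apt-get install -y build-essential git golang-go",
      "git clone https://github.com/status-im/go-waku /opt/go-waku",
      "cd /opt/go-waku && make",
    ]
  }
}
```

The resulting snapshot could then be looked up from Terraform with the digitalocean_droplet_snapshot data source (most_recent = true) and passed as the droplet image, removing the remote-exec provisioner entirely.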
| Pros | Cons |
|---|---|
| Hashicorp top to bottom, so Terraform integration should be easy | Would likely wind up as throwaway code, once we move to a full Kubernetes cluster setup |
| Fewer moving parts than many of the other alternatives. Just VMs, no Docker or container orchestration | More complicated than the pure Terraform setup |
| Faster and more reliable node starts than remote-exec provisioner | |
| Portable configuration should we choose to move off of Digital Ocean | |
Kubernetes
One day, we are going to have to create a high-quality reference setup, with publicly available templates, for deploying our nodes on Kubernetes, since we know this will be a popular type of deployment for node operators. If we start on Kubernetes, we can do some of this work now and dogfood the setup in staging.
We would provision a Kubernetes cluster using Terraform and then define the application via Helm v3 templates. We could also investigate managing the Helm releases with Flux if we really wanted to go all the way.
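For the cluster itself, a minimal sketch using the same DigitalOcean Terraform provider as above; the DOKS version string and node sizing are assumptions:

```hcl
resource "digitalocean_kubernetes_cluster" "waku" {
  name    = "waku-staging"
  region  = var.digitalocean_region
  version = "1.21.9-do.0" # assumed; pick a currently supported DOKS version

  node_pool {
    name       = "default"
    size       = "s-4vcpu-8gb"
    node_count = var.num_instances
  }
}
```

The go-waku Helm chart would then be installed against this cluster's kube_config output, either by hand or via Terraform's Helm provider.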
| Pros | Cons |
|---|---|
| We are going to have to do this one day | By far the most complicated and hardest to manage |
| Possible to set up a very high-quality production deployment, with zero-downtime deploys, liveness and readiness checks, autoscaling, etc. | More expensive, since we would have to pay for master nodes and a cluster that would likely have considerable spare capacity to handle deployments |
| Easy to expand to heterogeneous workloads, where we can run many types of applications on the same infrastructure | Most of the benefits come when you are running more than one app |
| Most portable. Easy to switch to Google Cloud or AWS | Requires a more complicated deployment process to make updates to the application or its configuration |
| Strong security framework, where we can restrict access to the nodes and set up access control policies | Difficult to expose a static IP from the cluster |
| | Persistent volumes can be difficult to manage, and will be required for our nodes to operate |
| | External-DNS can be a pain to configure and manage |
Recommendation
I think we should go for the simplest option to start, which would be pure Terraform. It would take a little work to pretty up the Terraform above (perhaps breaking it into multiple files and refactoring into modules/plans), possibly add a load balancer (see the sketch below), and improve the bootstrap script to run the application as a proper service. Anything beyond that is likely not a great use of time at this point.
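If we do add a load balancer, a rough sketch using the DigitalOcean provider; the ports are assumptions and would need to match whichever go-waku endpoints we actually want to expose:

```hcl
resource "digitalocean_loadbalancer" "waku" {
  name        = "waku-staging-lb"
  region      = var.digitalocean_region
  droplet_ids = digitalocean_droplet.waku_node[*].id

  forwarding_rule {
    entry_port      = 60000 # assumed go-waku libp2p TCP port
    entry_protocol  = "tcp"
    target_port     = 60000
    target_protocol = "tcp"
  }

  healthcheck {
    port     = 60000
    protocol = "tcp"
  }
}
```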
Questions
- Is Datadog the right vendor for observability? I mostly suggested it because it is the vendor I am most familiar with, and I know it supports the Prometheus metrics exposed by go-waku.
- How much do we care about the reliability of the system in the alpha?
- How much work do we want to put into the infrastructure layer for the alpha? How good is "good enough" at this stage?
- How much do we care about security at this stage? The pure Terraform configuration exposes the nodes to the public internet, which opens them up to all kinds of attack. Do we care enough to put them in a private subnet behind a load balancer?