# Observability
| Created | Author(s) | Status |
|---|---|---|
| 2022-03-22 | @neekolas | Draft |
## Overview
Before we can achieve our availability and performance goals, we need to be able to measure our progress against them. In addition to tracking our goals, we need to be able to swiftly respond to incidents in our network and fix them. Observability gives us the insights needed to do that.
## Goals / Non-goals

- Goals
  - Define the types of SLOs and SLIs we will want to track in the network
  - Propose specific measurement solutions that will allow us to track those SLIs
  - Propose specific metrics that warrant monitoring and alerting
- Non-goals
  - Lock us into particular numbers for SLOs (I don't know if we want 99% or 99.9% availability yet)
  - Define all steps of our on-call process
  - Define specific runbooks for what we should do when a monitor fires
## Proposed Solution

### Availability

#### Availability SLOs
The first thing we need to do is establish a set of SLOs for what availability guarantees clients should expect. These are specific goals tied to metrics.
Settling on the exact numbers for each SLO is out of scope for this document, but it will be important to define them clearly to be able to complete the following sections.
I will tentatively propose the following SLOs, without real numbers attached:
- X% success rate for LibP2P requests of protocol `store` (this would be for read requests to the store)
- X% success rate for LibP2P requests of `relay_messages` (this would be for successfully processing a message)
- X% success rate for `store_attempts`
- Number of `gowaku_connected_peers` should equal the intended network size X% of the time. This one may need to wait until we implement the `filter` protocol on the client, so that we can stop clients from connecting to nodes as full peers.
#### SLIs/Metrics
We should establish a set of SLIs that can be used to track whether the service is meeting each SLO.
The current metrics exposed by go-waku are:
- `gowaku_connected_peers`
- `gowaku_peers_dials`
- `gowaku_node_messages` (the number of messages received)
- `gowaku_store_messages` (the number of messages stored; currently broken)
- `gowaku_filter_subscriptions` (the number of content filter subscriptions)
- `gowaku_store_errors` (the number of store protocol errors)
- `gowaku_lightpush_errors` (the number of lightpush errors)
This is a good start, but feels inadequate for what we really want to track.
- When a message is published, was it successfully forwarded to peers?
- When a message is published, did it make it into the store?
- When a user requests messages from the store, did they successfully receive a response?
So, in addition, we should add the following metrics. Each metric would include additional tags that could be used to filter results and compute success rates. All metrics would also include the default tags from DataDog (including the host name, Docker container name, etc.); those come for free.
The statuses listed below are borrowed from the Go LibP2P GRPC. We could consider using another set of status codes, since this will all be custom code. LibP2P itself does not offer standard error codes, and Waku has its own set of bespoke error types returned from each RPC call.
| Type | Metric Name | Description | Tags |
|---|---|---|---|
| Count | libp2p_requests | Number of LibP2P requests handled, with status and protocol | protocol (eg. store/filter), status (success/server_error/client_error/authorization_error) |
| Count | relay_messages | The number of relay messages received by the node and their status | status (success/server_error/client_error/authorization_error), error_code (eg. none/failed_signature_verification/any_other_error_code_we_want_to_define) |
| Count | store_attempts | Attempts to write messages to the store, which is handled asynchronously from the relay_messages | status (success/error) |
| Distribution | store_rows_returned | The number of rows returned in a store query response | |
| Distribution | store_payload_size | The byte size of the payload of every message in the store | |
Relay requests are handled differently from other LibP2P requests: because they go over pubsub, they do not return a response or traditional error codes, so we will have to instrument them in a slightly different way.
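As a rough illustration of how these counters might be emitted, here is a minimal sketch that assumes we report them through the DogStatsD Go client (`github.com/DataDog/datadog-go/statsd`). The wrapper type and method names are hypothetical; the real instrumentation would hook into the request handlers in our go-waku fork.

```go
package metrics

import "github.com/DataDog/datadog-go/statsd"

// Metrics wraps a DogStatsD client so node code can emit the counters
// proposed in the table above without repeating tagging conventions.
type Metrics struct {
	client *statsd.Client
}

// NewMetrics connects to the local DogStatsD agent (eg. "127.0.0.1:8125").
func NewMetrics(addr string) (*Metrics, error) {
	c, err := statsd.New(addr)
	if err != nil {
		return nil, err
	}
	return &Metrics{client: c}, nil
}

// RecordLibP2PRequest increments libp2p_requests, tagged with the protocol
// (eg. store/filter) and the request status.
func (m *Metrics) RecordLibP2PRequest(protocol, status string) {
	_ = m.client.Incr("libp2p_requests", []string{"protocol:" + protocol, "status:" + status}, 1)
}

// RecordRelayMessage increments relay_messages, tagged with the status and
// an error_code (eg. none/failed_signature_verification).
func (m *Metrics) RecordRelayMessage(status, errorCode string) {
	_ = m.client.Incr("relay_messages", []string{"status:" + status, "error_code:" + errorCode}, 1)
}
```

Success rates for each SLI can then be computed in DataDog by dividing the `status:success` count by the total count of the same metric.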
#### Unmeasurable Things Right Now
Digital Ocean load balancers do not emit metrics to third parties, as far as I can tell. It would be really nice for us to be able to track load balancer metrics for overall connection availability. Right now we have multiple minutes of downtime when we deploy the application. This would not be tracked anywhere other than the Digital Ocean admin panel, because only the load balancer knows that the nodes are down, and the load balancer doesn't talk to DataDog.
If it makes sense with some of our other RFCs to migrate to AWS or Google Cloud, this problem would go away, but that is probably not a reason to do the migration by itself.
### Monitoring and On-Call
Metrics alone are helpful for looking back and tracking our progress, but if we want to establish an SLO and maintain it, we are going to need monitors that will page a member of the engineering team when a key metric is trending below our error budget.
We currently have no monitors established in DataDog and do not have a PagerDuty account or similar.
Any SLO that is defined should have a monitor attached, and when the SLI is not meeting the SLO for a given period of time an alert should page the current engineer on-call for the network.
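Purely as an illustration, a DataDog metric monitor over the `relay_messages` SLI might use a query along these lines (the threshold and evaluation window are placeholders, not proposed SLO values):

```
sum(last_5m):sum:relay_messages{status:server_error}.as_count() / sum:relay_messages{*}.as_count() > 0.01
```

A monitor like this would page when more than 1% of relay messages in the last five minutes failed with a server error.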
### Logging and debuggability
Once we are in production, we are going to have to handle a much larger volume of issues from developers. "Why did this message not go through?", "Why did this request fail?", etc. The current logging in go-waku and xmtp-node-go is inadequate to easily find answers to these types of questions. In particular, successfully processed relay messages emit no log lines at all. We are going to want to add significantly more logging to the node software so that these questions can be quickly answered by querying DataDog. Otherwise we will have to painstakingly reproduce these issues in a local environment to figure out what went wrong. Whenever available, we should include the request_id in the log lines.
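As a sketch of the kind of log line this implies, assuming we continue to use zap as the structured logger (the handler signature and field names below are illustrative, not existing code):

```go
package relay

import "go.uber.org/zap"

// logRelayMessage emits a structured log line for every successfully
// processed relay message, not just failures, so that a question like
// "why did this message not go through?" can be answered by querying
// DataDog logs instead of reproducing the issue locally.
func logRelayMessage(logger *zap.Logger, requestID, peerID, contentTopic string, payloadBytes int) {
	logger.Info("relay message processed",
		zap.String("request_id", requestID), // included whenever available
		zap.String("peer_id", peerID),
		zap.String("content_topic", contentTopic),
		zap.Int("payload_size_bytes", payloadBytes),
	)
}
```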
We may want to increase the logging from the Client SDK as well, to better detect failures that originate in the client. Even if errors are only logged to the developer's console they can still be helpful in debugging with a cooperative bug reporter.
### Performance

#### Performance SLOs
We need to establish an acceptable latency for key operations in the protocol, which we can codify as SLOs for the platform.
For example:
- A request to read messages from the store should have a p95 latency of Xms
- A lightpush request should have a p95 latency of Xms
- A lightpush request should have a p99 latency of Xms
- A relay message should be processed with a p95 latency of Xms
#### Metrics
go-waku currently tracks no performance metrics. We will need to inject new metrics tracking code into the library in order to even know how fast/slow these operations are. These can then be used as the basis for additional monitors and alerts.
We should add the following metrics to our fork of go-waku.
| Metric Type | Metric Name | Description | Tags |
|---|---|---|---|
| Distribution | libp2p_request_latency | The speed at which a LibP2P request is processed | protocol (store/filter/lightpush) |
| Distribution | relay_request_latency | The speed at which a Relay message is processed and enqueued for storage | |
| Distribution | store_put_latency | The speed at which a message is inserted into the database from the store | |
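To give a feel for the instrumentation, here is a minimal sketch that assumes the same DogStatsD client as above; the timing wrapper is hypothetical and would need to be adapted to how go-waku's stream handlers are actually structured.

```go
package metrics

import (
	"time"

	"github.com/DataDog/datadog-go/statsd"
)

// timeLibP2PRequest runs a request handler, measures how long it takes, and
// records the elapsed time (in milliseconds) as the libp2p_request_latency
// distribution, tagged with the protocol (store/filter/lightpush).
func timeLibP2PRequest(c *statsd.Client, protocol string, handler func() error) error {
	start := time.Now()
	err := handler()
	elapsedMs := time.Since(start).Seconds() * 1000
	_ = c.Distribution("libp2p_request_latency", elapsedMs, []string{"protocol:" + protocol}, 1)
	return err
}
```

The same pattern would cover `relay_request_latency` and `store_put_latency` with their respective tags.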
## Plan & Timeline
| Task | Estimate |
|---|---|
| Enable availability metrics in go-waku and xmtp-node-go | 1 day |
| Enable performance metrics in go-waku and xmtp-node-go | 1 day |
| Add additional logging to go-waku and xmtp-node-go | 0.5 days |
## Dependencies
The only real dependencies would be any new features that we want to develop bespoke metrics for. Otherwise, this work can be started immediately.
## Alternatives Considered / Prior Art?
One thing I did not include in this plan was distributed tracing using OpenTracing. While it would be very valuable, it seems complicated to implement in something like LibP2P and probably something that we can add after launch. Distributed tracing also breaks down as soon as we decentralize the network, since we only have part of the picture. I do think it will make debuggability much better if we are able to figure out how to do it inside a LibP2P network.
## Risks?
Due to the quirks of the LibP2P protocol, we may find that some of these metrics are less reliable than we expected (for example, it may be difficult to distinguish between client and server errors). Overall, this seems quite low risk.
## Questions
- Did I miss any important metrics?
- How do we go about deciding acceptable values for SLOs?
- Should we use GRPC or HTTP status codes instead of DIYing something for LibP2P?
## Appendix

### SLO
Service Level Objective. An agreement about a specific metric (SLI). For example, the uptime of a service or the response latency of a class of request.
### SLI

Service Level Indicator. The individual metric that is used to measure compliance with an SLO. SLIs are the actual measurements, whereas SLOs are the expectations of what those measurements should be for success.