At Luno, our production system is composed of several services which talk to one another via RPCs. Our backend services are written in Go and are hosted on Amazon EC2.
Once we had grown beyond a couple of servers and our user numbers began growing substantially, it rapidly became more difficult to keep track of system health. Manually checking log files was impossible. We needed a better way.
We implemented two things: error tracking and monitoring.
We integrated Bugsnag for error tracking. Whenever one of our services encounters an error (e.g. serves an internal server error), we send the error details to Bugsnag. From there, we can see how many times the error has occurred and how many users are affected, which we then use to file bugs and prioritize fixes. This has helped us uncover many issues and bugs that would otherwise have stayed hidden, and has had a direct positive impact on our product quality.
In order to understand and quantify the system health, we needed a monitoring solution. By monitoring system metrics, we can detect when services are performing badly or have broken.
First we tried Amazon CloudWatch. While it worked, we felt it wasn't ideal because there's a lot of friction in creating new metrics and it has only very limited visualization and data analysis functionality. For example, there is no easy way to compute latency histograms.
We looked for alternatives and found Prometheus. Prometheus works by periodically scraping data from servers and storing it for analysis. You can then write queries to aggregate and compute the metrics you need, such as histograms or rates. Most importantly, it's very easy to create a new metric: just define it in your code and Prometheus will pick it up automatically.
Here are some things that we're now monitoring with Prometheus:
- exchange matching engine latency and throughput,
- web and API request latency,
- concurrent exchange connections,
- Bitcoin transaction send and receive latency.
We use Grafana to display graphs from Prometheus on dashboards. Grafana makes it very easy to iterate on queries and to produce nice-looking graphs and dashboards.
With error tracking and monitoring set up, we've been able to gain more visibility into the system and better prioritize bug fixes and improvements. Bugsnag, Prometheus and Grafana are useful tools for this purpose.
Interested in site reliability engineering and devops? Do you know a better way to implement monitoring at scale? Join the Luno engineering team!