SRE Golden Signals

As a team, we have spent many years troubleshooting performance problems in production systems. Applications have become so complex that you need a standard methodology to understand their performance. Fortunately, there are several common frameworks we can borrow from:

  • The Google SRE book, whose chapter on Monitoring Distributed Systems defines the Four Golden Signals: latency, traffic, errors, and saturation
  • The USE Method popularized by Brendan Gregg: utilization, saturation, and errors
  • The RED Method coined by Tom Wilkie of Grafana Labs: rate, errors, and duration

Despite the different acronyms and terms, these frameworks largely describe the same things:

  • Response Time (latency or duration) – how long does a transaction take to complete?
  • Throughput (traffic or rate; utilization in USE is the closest analogue, measured per resource) – how many transactions happen per unit of time (e.g., transactions per second or per minute)?
  • Error Rate (errors) – where possible, calculate the rate or percentage of errors rather than only the count, since a raw count says little without the request volume behind it
  • Infrastructure (saturation) – an entire category of data that includes CPU, memory, I/O, queue depth, and so on
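
To make the first three measurements concrete, here is a minimal Python sketch that derives them from a list of request records. The `Request` type, the `summarize` function, and the sample values are hypothetical and exist only for illustration; this is not how Speedscale or any of the frameworks above computes them. Saturation is left out because it typically comes from host or container metrics rather than from request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one completed transaction."""
    duration_ms: float  # how long the transaction took
    ok: bool            # False if the transaction ended in an error


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(requests, window_seconds):
    """Summarize one observation window of request records."""
    if not requests:
        raise ValueError("need at least one request to summarize")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        # Response time: report percentiles, not only an average
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        # Throughput: transactions per second over the window
        "throughput_per_s": len(requests) / window_seconds,
        # Error rate: a fraction of all requests, not a raw count
        "error_rate": errors / len(requests),
    }


# Made-up sample: four requests observed over a two-second window
sample = [
    Request(100, True),
    Request(200, True),
    Request(300, False),
    Request(400, True),
]
print(summarize(sample, window_seconds=2))
```

Reporting the error rate as a fraction keeps it comparable across windows with different traffic levels, which is the point of the Error Rate bullet above.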

Reports Dashboard

We have taken these signals to heart at Speedscale. When we redesigned our Dashboard UI, the first thing we wanted to overhaul was the Report page. How do you know your app will perform quickly, efficiently, correctly, and at scale? The Speedscale Report gives you immediate feedback on exactly those questions, so you can make better release decisions.

Here are some of the key aspects of this report:

  • There is a summary of your goals relative to the actual measurements. In this example, the replay missed its goals for Throughput and Success Rate.
  • You can see each of the Golden Signal values on a single screen, with no need to click around and pull data from a variety of sources.
  • These Golden Signals are always available before you release, so you can decide whether the code is ready for production.
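
The idea of comparing goals against actual measurements can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `evaluate` function, the signal names, and all of the numbers are hypothetical, and this is not the Speedscale implementation. The values are chosen so that Throughput and Success Rate miss their goals, mirroring the example described above.

```python
def evaluate(goals, actual):
    """Return the names of the signals that miss their goal.

    Each goal is a (kind, target) pair: "max" means the measured
    value must not exceed the target, "min" means it must reach it.
    """
    failed = []
    for name, (kind, target) in goals.items():
        value = actual[name]
        passed = value <= target if kind == "max" else value >= target
        if not passed:
            failed.append(name)
    return failed


# Hypothetical goals for a release candidate
goals = {
    "p95_ms": ("max", 500),
    "throughput_per_s": ("min", 100),
    "success_rate": ("min", 0.99),
}

# Hypothetical measurements from one replay
actual = {
    "p95_ms": 320,
    "throughput_per_s": 80,
    "success_rate": 0.95,
}

failed = evaluate(goals, actual)
print("ready for production" if not failed else f"missed goals: {failed}")
```

A check like this could run in a pipeline before release, failing the build whenever the list of missed goals is non-empty.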

————–

Many businesses struggle to discover problems with their cloud services before they impact customers. For developers, writing tests is manual and time-intensive. Speedscale allows you to stress test your cloud services with real-world scenarios. Get confidence in your releases without testing slowing you down. If you would like more information, schedule a demo today!

If you are interested in seeing how Speedscale works for your environment, please drop us an email at: [email protected]
