System Performance Metrics Every Engineer Should Know: QPS, TPS, Concurrency, Response Time
A practical guide to the four metrics that explain why your API feels slow, plus a tiny demo API implemented across popular stacks with real measurement queries.
"It's slow" is not a bug report
- A customer might say checkout "spins" for 6 seconds.
- Someone might say "add more servers."
- Someone else might say "the database is fine."
- Someone else might say "works on my machine."
It's like when the replicators on the Enterprise stop working and everyone's guessing—power conduits, plasma relays, maybe the Heisenberg compensators. Turns out they're just full of tribbles. You can't fix performance by guessing; you need to check the diagnostic sensors. None of that matters until you can answer:
- What's the p95 response time under load?
- What's the current concurrency at peak?
- How many requests are we actually receiving (QPS)?
- How many real business operations are we completing (TPS)?
You can fix performance with feelings, but it's expensive and it doesn't scale. It's like trying to navigate by asking strangers on the street instead of checking your GPS. The GPS might not always be right, but at least it's giving you actual coordinates.

What these metrics are (student-friendly)
Response Time (RT)
RT is how long a request takes from start to finish. Think of it like the time between when you press the elevator button and when the doors open. Track percentiles because "average" hides pain. Averages can look fine while a small chunk of requests are awful—like saying the average temperature in a room is comfortable when half the room is freezing and the other half is on fire.
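To see how an average hides pain, here's a tiny Python sketch with made-up numbers: 90 requests at 50 ms and 10 requests at 4 seconds. The mean looks tolerable; the p95 tells the truth.

# Illustrative latencies in milliseconds: most requests fast, a few terrible.
latencies_ms = [50] * 90 + [4000] * 10

def percentile(values, p):
    # Nearest-rank percentile: the smallest value that covers p% of the sample.
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.0f} ms")                    # 445 ms, looks "fine"
print(f"p95:  {percentile(latencies_ms, 95)} ms")   # 4000 ms, 1 in 10 users waits 4 seconds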

Concurrency
Concurrency is how many requests are being processed at the same time. It's like the number of people currently in the elevator shaft—not waiting in line, but actually in transit. If requests take longer, concurrency rises even if QPS stays flat. It's the difference between a fast elevator that clears quickly and a slow one where people stack up inside.

Queries Per Second (QPS)
QPS is incoming pressure. It's the rate your system is being asked to do things. Think of it as how many people are pressing the elevator button per second. It doesn't tell you if they're getting where they need to go, just how many are trying.
Transactions Per Second (TPS)
TPS is completed work. One "transaction" should represent something meaningful: "checkout completed," "order created," "profile updated." It's the difference between button presses and actual elevator rides. QPS can be high while TPS is low if you're timing out, failing, or stuck—like an elevator that accepts button presses but never actually moves.

The one relationship you should tattoo on your monitoring dashboard
Little's Law (queueing theory) says:
L = λ × W
Where:
- L = average number of things in the system (concurrency / in-flight)
- λ = average arrival or completion rate (throughput, often QPS or TPS depending on your boundary)
- W = average time in the system (response time)
This isn't some theoretical abstraction. It's the mathematical relationship that explains why your system falls over. If you understand this formula, you understand performance. If you don't, you're flying blind.

Practical translation:
- If your latency (W) doubles while QPS (λ) holds steady, in-flight work (L) roughly doubles, so you need about 2× the concurrency capacity to cope.
- If your concurrency limit is fixed (threads, connections, CPU), latency "runs away." It's like a traffic jam that gets worse the longer it lasts.
The coffee shop metaphor helps: if people arrive at rate λ (arrivals per minute), spend W minutes inside, then L = λ × W people are in the shop at any moment. If the barista slows down (W increases), more people stack up inside (L increases) even if the arrival rate stays the same.
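Here's the same arithmetic as a small Python sketch with illustrative numbers (200 requests/second, 250 ms average response time):

# Little's Law: L = lambda * W
qps = 200          # lambda: arrival/throughput rate, requests per second
avg_rt_s = 0.250   # W: average time in the system, seconds

print(qps * avg_rt_s)        # L = 50.0 -> ~50 requests in flight at any moment
print(qps * (avg_rt_s * 2))  # 100.0 -> latency doubles, same QPS, twice the concurrency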

Alternatives (and why they're not enough alone)
Before we dive deeper, let's address the elephant in the room: why not just use what you already have?
- CPU, memory, disk, network (resource metrics)
  Useful, but they don't tell you what users feel. Your CPU might be at 50%, but if your p95 latency is 5 seconds, users are still having a bad time. Resource metrics are like checking the engine temperature when the car won't start—relevant, but not the whole story.
- APM traces only
  Traces are great for "why," but you still need the top-line numbers to know "how bad" and "is it getting worse." Traces tell you which function call is slow. Metrics tell you if the entire system is on fire.
- Logs
  Logs are forensic evidence. Metrics are the smoke alarm. By the time you're reading logs to understand performance, the building is already on fire.
This article is about the smoke alarm.

The demo API we'll measure in multiple languages
We'll implement the same tiny API everywhere. This isn't a production-ready service—it's a teaching tool. Think of it as a skeleton that shows you where the bones go. You can add muscle later.
- GET /work?ms=50
  Simulates "doing work" by waiting ms milliseconds, then returning JSON. It's intentionally simple so we can focus on the metrics, not the business logic.
- GET /metrics
  Exposes Prometheus-style metrics so we can compute QPS/TPS/Concurrency/RT. This is the endpoint that makes everything else possible.

What we measure
Use a minimal, universal metric set:
- http_requests_total{route,method,status} (counter)
- http_request_duration_seconds_bucket{route,method,...} (histogram)
- http_in_flight_requests{route} (gauge)
Histograms are the standard way to measure latency distributions in Prometheus-land. They're like a histogram in statistics class, except these buckets actually matter for your on-call rotation.
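For orientation, a scrape of /metrics with this metric set looks roughly like the snippet below. The names match the list above; the label values and counts are made up.

http_requests_total{route="/work",method="GET",status="200"} 10482
http_in_flight_requests{route="/work"} 12
http_request_duration_seconds_bucket{route="/work",method="GET",le="0.05"} 161
http_request_duration_seconds_bucket{route="/work",method="GET",le="0.1"} 10233
http_request_duration_seconds_bucket{route="/work",method="GET",le="+Inf"} 10482
http_request_duration_seconds_sum{route="/work",method="GET"} 584.3
http_request_duration_seconds_count{route="/work",method="GET"} 10482

Note that histogram buckets are cumulative: the le="0.1" bucket counts every request that finished in 0.1 seconds or less, including everything already counted under le="0.05".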

If you use OpenTelemetry, align with semantic conventions for HTTP metrics (names/attributes), so tools agree on what your numbers mean. It's like speaking the same language—you can still communicate if everyone uses different words, but it's a lot harder.
Load testing: measure from the outside first (k6)
Install k6, then run this script. k6 thinks in "virtual users" and scenarios; it's intentionally simple so you can reproduce results. No more "it works on my machine" when it comes to load testing.

import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  scenarios: {
    steady: {
      executor: 'constant-vus',
      vus: 50,
      duration: '30s',
    },
  },
  thresholds: {
    http_req_failed: ['rate<0.01'],    // <1% errors
    http_req_duration: ['p(95)<500'],  // p95 < 500ms (example target)
  },
};

export default function () {
  http.get('http://localhost:8080/work?ms=50');
  sleep(0.1);
}
This script maintains 50 virtual users, each making a request, waiting for the response, then pausing 100ms before the next one. It's like having 50 people constantly pressing the elevator button. The thresholds tell k6 to fail the test if the error rate exceeds 1% or if p95 latency exceeds 500ms. It's your automated quality gate.
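To run it, save the script to a file (the name is up to you, say load.js) and start it with k6 run load.js while the demo API is listening on port 8080.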
Prometheus queries that turn raw metrics into the four numbers
Assume you scrape /metrics into Prometheus. These queries transform raw counters and histograms into the four numbers that matter.

QPS (requests in)
sum(rate(http_requests_total[1m]))
This calculates the per-second rate of increase of the counter, averaged over a 1-minute window. It's like counting how many cars pass a checkpoint each second, averaged over the last minute.
TPS (completed successful "transactions")
If your "transaction" is "HTTP 2xx response from /work":
sum(rate(http_requests_total{route="/work",status=~"2.."}[1m]))
The status=~"2.." regex matches any 2xx status code. This filters out errors and timeouts, giving you only successful completions.
(If your business transaction spans multiple HTTP requests, emit a separate transactions_total counter from your app. Don't try to derive business logic from HTTP requests—it's like trying to understand a conversation by counting words.)
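If you do emit your own counter, it's one line at the point where the business operation actually completes. A minimal prometheus_client sketch; the transactions_total name, the type label, and the checkout flow are assumptions for illustration, not part of the demo API:

from prometheus_client import Counter

transactions_total = Counter(
    "transactions_total",
    "Completed business transactions",
    ["type"],
)

def complete_checkout(order):
    # ... charge the card, persist the order, send the receipt ...
    # Increment only after the whole business flow has succeeded.
    transactions_total.labels(type="checkout").inc()

TPS then becomes sum(rate(transactions_total[1m])), independent of how many HTTP requests each transaction needed.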
Concurrency (in-flight right now)
sum(http_in_flight_requests)
This is a gauge, so no rate() needed. It's the current value, right now. Like checking how many people are currently in the elevator.
Response Time (p95)
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket{route="/work"}[5m])) by (le)
)
This is the magic query. histogram_quantile takes your histogram buckets and calculates the 95th percentile. The by (le) groups by the "less than or equal" bucket boundaries. It's like finding the point where 95% of your requests are faster.
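If that still feels opaque, here is a small Python sketch of the underlying idea: buckets store cumulative counts, and the quantile is estimated by finding the bucket that crosses the target rank and interpolating linearly inside it, which is roughly what histogram_quantile does. The bucket counts are made up.

# (upper bound "le" in seconds, cumulative count of requests <= that bound)
buckets = [(0.05, 161), (0.1, 10233), (0.25, 10418), (0.5, 10482), (float("inf"), 10482)]

def estimate_quantile(q, buckets):
    total = buckets[-1][1]
    target = q * total                  # the rank we want, e.g. 95% of all requests
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound       # no finite bound crossed: fall back
            # Interpolate linearly inside the bucket that contains the target rank.
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

print(estimate_quantile(0.95, buckets))  # ~0.099 s with these made-up counts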
Implementations in popular API stacks
These are intentionally "small." You can harden them later (timeouts, cancellations, graceful shutdown, etc.). Think of them as proof-of-concept code that shows you where the metrics go. Production code needs more, but this is where you start.

Go (net/http + Prometheus)
package main

import (
    "fmt"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    inFlight = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "http_in_flight_requests", Help: "In-flight requests"},
        []string{"route"},
    )
    reqs = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "http_requests_total", Help: "Total HTTP requests"},
        []string{"route", "method", "status"},
    )
    dur = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration (seconds)",
            Buckets: prometheus.DefBuckets,
        },
        []string{"route", "method"},
    )
)

func main() {
    prometheus.MustRegister(inFlight, reqs, dur)
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
        route := "/work"
        inFlight.WithLabelValues(route).Inc()
        start := time.Now()

        ms, _ := strconv.Atoi(r.URL.Query().Get("ms"))
        if ms <= 0 {
            ms = 50
        }
        time.Sleep(time.Duration(ms) * time.Millisecond)

        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, `{"ok":true,"ms":%d}`, ms)

        dur.WithLabelValues(route, r.Method).Observe(time.Since(start).Seconds())
        reqs.WithLabelValues(route, r.Method, "200").Inc()
        inFlight.WithLabelValues(route).Dec()
    })
    http.ListenAndServe(":8080", nil)
}
Go's Prometheus client is straightforward: create your metrics, register them, then increment/observe/decrement as requests flow through. The WithLabelValues calls add dimensions to your metrics—route, method, status. These labels let you slice and dice your data later.
Node.js (Express + prom-client)
import express from 'express';
import client from 'prom-client';

const app = express();

const inFlight = new client.Gauge({
  name: 'http_in_flight_requests',
  help: 'In-flight requests',
  labelNames: ['route'],
});
const reqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['route', 'method', 'status'],
});
const dur = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration (seconds)',
  labelNames: ['route', 'method'],
  buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});

app.get('/work', async (req, res) => {
  const route = '/work';
  const method = req.method;

  inFlight.labels(route).inc();
  const end = dur.labels(route, method).startTimer();

  const ms = Math.max(parseInt(req.query.ms || '50', 10), 1);
  await new Promise(r => setTimeout(r, ms));

  res.json({ ok: true, ms });

  end();
  reqs.labels(route, method, '200').inc();
  inFlight.labels(route).dec();
});

app.listen(8080);
The startTimer() method is convenient—it returns a function that, when called, records the duration. It's like a stopwatch that automatically logs to your histogram when you stop it.
Python (FastAPI + prometheus_client)
from fastapi import FastAPI, Response
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = FastAPI()

in_flight = Gauge("http_in_flight_requests", "In-flight requests", ["route"])
reqs = Counter("http_requests_total", "Total HTTP requests", ["route", "method", "status"])
dur = Histogram("http_request_duration_seconds", "HTTP request duration (seconds)", ["route", "method"])

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/work")
def work(ms: int = 50):
    route = "/work"
    method = "GET"

    in_flight.labels(route).inc()
    start = time.time()

    if ms <= 0:
        ms = 50
    time.sleep(ms / 1000.0)

    dur.labels(route, method).observe(time.time() - start)
    reqs.labels(route, method, "200").inc()
    in_flight.labels(route).dec()

    return {"ok": True, "ms": ms}
FastAPI's type hints make the endpoint parameters clean, and prometheus_client follows the same patterns as the other languages. time.time() returns seconds, so subtracting the start time yields a duration in seconds, the base unit Prometheus expects for this histogram.
Java (Spring Boot + Actuator + Micrometer Prometheus)
Spring Boot makes this almost boring (good). You wire Prometheus once and get timers/counters. It's like having a car that tells you your speed, RPM, and fuel level without you having to install separate gauges.
Gradle deps (conceptual):
- spring-boot-starter-web
- spring-boot-starter-actuator
- micrometer-registry-prometheus
Expose metrics at /actuator/prometheus, then add a filter for in-flight if you want that explicit gauge. Spring Boot's auto-configuration handles most of the heavy lifting—you just need to enable the actuator endpoint and add the Prometheus registry.
C# (ASP.NET Core + prometheus-net)
Same story: middleware + /metrics.
- Package: prometheus-net.AspNetCore
- Add middleware to capture durations and an in-flight gauge per route.
ASP.NET Core's middleware pipeline makes it straightforward to add metrics at the framework level, so every request automatically gets measured. It's like having a speed camera on every road—you don't have to remember to measure, it just happens.
(If you want, Part 2 will include full "real" code for Java + C# with routing labels done safely, because cardinality will happily ruin your day. Too many unique label combinations, and Prometheus will eat your memory like a black hole.)
How to improve each metric (without lying to yourself)
Improve Response Time (RT)
RT is usually killed by:
- slow downstream calls (DB, cache, third-party APIs)
- lock contention
- GC pressure / allocations
- serialization overhead / payload size
- missing timeouts (requests pile up forever)
Fixes that work:
- add strict timeouts + cancellation (see the sketch after this list)
- use connection pooling properly (DB + HTTP)
- cache what's expensive and stable
- reduce payload size (and stop shipping entire objects "just in case")
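As a sketch of the first item, here's an outbound call with an explicit deadline using the requests library. The URL, payload, and limits are placeholders:

import requests

PAYMENT_URL = "http://payments.internal/charge"  # placeholder downstream dependency

def charge(order_id: str, amount_cents: int):
    try:
        # (connect timeout, read timeout) in seconds: fail fast instead of piling up.
        resp = requests.post(
            PAYMENT_URL,
            json={"order_id": order_id, "amount_cents": amount_cents},
            timeout=(1.0, 2.0),
        )
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        # A hung request holds a thread, a connection, and your concurrency budget hostage.
        raise RuntimeError(f"payment call exceeded its deadline for order {order_id}")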

Improve Concurrency (handle more simultaneous work)
Concurrency ceilings are often:
- thread pools
- DB connections
- CPU cores
- open file descriptors
- queue depth
Fixes that work:
- async IO (where it matters)
- right-size pools (and cap them)
- backpressure: reject early instead of timing out late
- bulkheads: isolate critical routes
Backpressure is the difference between a system that gracefully degrades and one that falls over. It's like a restaurant that stops taking reservations when full, instead of letting everyone in and then having them wait forever for a table that never opens.
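A minimal sketch of reject-early backpressure in FastAPI, assuming a hard cap of 100 in-flight requests; the cap, the route-agnostic middleware, and the 503 response are illustrative choices, and a real service would size the cap from measured concurrency:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
MAX_IN_FLIGHT = 100   # assumed cap; derive yours from Little's Law and load tests
in_flight = 0

@app.middleware("http")
async def shed_load(request: Request, call_next):
    global in_flight
    if in_flight >= MAX_IN_FLIGHT:
        # Reject immediately instead of queueing work we can't finish in time.
        return JSONResponse({"error": "overloaded, try again later"}, status_code=503)
    in_flight += 1
    try:
        return await call_next(request)
    finally:
        in_flight -= 1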

Improve QPS (accept more incoming pressure)
You don't "improve QPS" directly. You either:
- scale out (more instances)
- reduce per-request cost (RT down)
- shed load (rate limit, queue, degrade)
QPS is a measure of demand, not a knob you turn. You can't make more people press the elevator button—you can only make the elevator faster or add more elevators.
Improve TPS (complete more real work)
TPS goes up when:
- you eliminate retries/timeouts
- you reduce transaction cost (fewer DB round trips, better queries)
- you batch and pipeline work (see the sketch below)
- you make failure fast and explicit
TPS is the metric that matters for business outcomes. QPS tells you how busy you are. TPS tells you how much work you're actually getting done. It's the difference between looking busy and being productive.
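To make the batching point concrete, here's a minimal Python sketch using the standard library's sqlite3 as a stand-in for a real database. The schema and row count are invented; the payoff is even bigger when each round trip crosses a network:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)")

orders = [(i, 1999) for i in range(1000)]

# One executemany inside one transaction, instead of 1000 separate INSERT round trips.
with conn:
    conn.executemany("INSERT INTO orders (id, total_cents) VALUES (?, ?)", orders)

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1000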

Common trap: you can't optimize what you didn't define
Decide what boundary you're measuring:
- If your "system" is the API server only, QPS = incoming HTTP rate.
- If your "system" is "API + DB," your throughput is limited by DB concurrency and query time.
- If your "system" includes clients, you must measure from the client too.
Little's Law works when you define the box you're measuring. It's like the Heisenberg uncertainty principle, but for performance—you can't measure everything precisely unless you define what "everything" means.
References
[1] Little, John D.C. "A Proof for the Queuing Formula: L = λW." Operations Research, Vol. 9, No. 3, 1961
[2] Google SRE Book. "Site Reliability Engineering: How Google Runs Production Systems." O'Reilly Media, 2016
[3] OpenTelemetry. "HTTP Semantic Conventions." OpenTelemetry Specification
[4] Prometheus. "Histograms and Summaries." Prometheus Documentation
[5] k6. "k6 Documentation: Virtual Users and Scenarios." Grafana Labs
[6] Beyer, Betsy, et al. "The Site Reliability Workbook: Practical Ways to Implement SRE." O'Reilly Media, 2018
[7] Humble, Jez, and Gene Kim. "The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win." IT Revolution, 2013
[8] Kleppmann, Martin. "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems." O'Reilly Media, 2017
[9] Prometheus. "PromQL (Prometheus Query Language)." Prometheus Documentation
About Joshua Morris
Joshua is a software engineer focused on building practical systems and explaining complex ideas clearly.

