Database Monitoring & Observability

Version: 1.0 Last Updated: 2025-10-15 Owner: Database Operations Status: Production Monitoring Standard


Table of Contents

  1. Overview
  2. DataDog Integration
  3. Key Database Metrics
  4. Critical Alerts
  5. Performance Queries
  6. Troubleshooting Playbooks
  7. Capacity Planning

Overview

Purpose

This document defines production-grade database monitoring for Zixly’s PostgreSQL database (Supabase), covering:

Monitoring Philosophy

Per Global Development Standards:

Fail-fast is superior to fallbacks because it exposes problems immediately for proper fixing rather than hiding them behind degraded functionality.

Applied to Monitoring:


DataDog Integration

Setup

DataDog Agent: Installed on Supabase infrastructure (automatic for Supabase customers).

Metrics Collection:

Dashboard: Zixly - Database Health

Retention: 15 months of metrics (DataDog Pro plan)


Key Database Metrics

Connection Pool Metrics

Metric 1: Active Connections

Query (DataDog):

postgres.connections.active{env:production}

Thresholds:

Why It Matters: Connection exhaustion causes FATAL: remaining connection slots are reserved errors, preventing new API requests.

Mitigation:


Metric 2: Idle Connections

Query (PostgreSQL):

SELECT COUNT(*) AS idle_connections
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';

Threshold:

Why It Matters: Idle connections waste resources. May indicate connection leak in application code.

Investigation:

-- Find long-running idle connections
SELECT pid, usename, application_name, state, state_change, query
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes'
ORDER BY state_change;

Query Performance Metrics

Metric 3: Slow Query Count

Query (DataDog via Prisma middleware):

// lib/prisma-middleware.ts
prisma.$use(async (params, next) => {
  const start = Date.now()
  const result = await next(params)
  const duration = Date.now() - start

  // Log to DataDog if query >100ms
  if (duration > 100) {
    datadogClient.increment('prisma.slow_query', 1, {
      model: params.model,
      action: params.action,
      duration: duration.toString(),
    })
  }

  return result
})

Threshold:

Why It Matters: Slow queries degrade user experience (dashboard load time >2.5s). May indicate missing indexes or inefficient queries.


Metric 4: Query Latency (p95)

Query (DataDog):

avg:postgres.query.time.p95{env:production}

Target: < 100ms (performance requirement)

Thresholds:

Investigation:

-- Find slowest queries (requires pg_stat_statements extension)
SELECT
  query,
  calls,
  mean_exec_time AS avg_ms,
  max_exec_time AS max_ms,
  stddev_exec_time AS stddev_ms
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;

Table Growth Metrics

Metric 5: Table Size Growth Rate

Query (PostgreSQL):

SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
  pg_total_relation_size(schemaname||'.'||tablename) AS size_bytes
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY size_bytes DESC;

DataDog Query:

sum:postgres.table.size{table:financials,env:production}

Thresholds (for time-series tables):

Why It Matters: Large tables slow down queries (sequential scans). Partitioning required at 500+ customers (per System Architecture).


Metric 6: Row Count Growth

Query (PostgreSQL):

SELECT
  schemaname,
  tablename,
  n_live_tup AS estimated_rows,
  n_dead_tup AS dead_rows,
  ROUND(100 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_row_percent
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_live_tup DESC;

Threshold:


Index Efficiency Metrics

Metric 7: Index Usage Rate

Query (PostgreSQL):

SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan AS index_scans,
  idx_tup_read AS tuples_read,
  idx_tup_fetch AS tuples_fetched,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan ASC;  -- Least used indexes first

Action: Identify unused indexes (idx_scan = 0) and drop to save disk space.

Example:

-- If index never used after 30 days, consider dropping
DROP INDEX CONCURRENTLY idx_unused_example;

Metric 8: Index Bloat

Query (PostgreSQL):

SELECT
  schemaname,
  tablename,
  indexname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  100 * (pg_relation_size(indexrelid) - pg_relation_size(indexrelid, 'main')) / NULLIF(pg_relation_size(indexrelid), 0) AS bloat_percent
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

Threshold:

Remediation:

REINDEX INDEX CONCURRENTLY idx_bloated_example;

Replication Lag (High Availability)

Metric 9: Replication Lag (if read replica configured)

Query (PostgreSQL on primary):

SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  sync_state,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;

Threshold:

Why It Matters: High replication lag means read replica serves stale data. Dashboard queries return outdated metrics.


Disk Space Metrics

Metric 10: Disk Space Utilization

Query (Supabase Dashboard or PostgreSQL):

SELECT pg_size_pretty(pg_database_size('postgres')) AS database_size;

DataDog Query:

avg:system.disk.used{env:production} / avg:system.disk.total{env:production} * 100

Thresholds:

Why It Matters: PostgreSQL requires ~20% free space for VACUUM operations. Disk full = database writes fail.

Mitigation:


Critical Alerts

Alert Configuration (DataDog)

Alert 1: Connection Pool Exhaustion

name: '[P1] PostgreSQL Connection Pool Exhausted'
query: 'avg(last_5m):max:postgres.connections.active{env:production} / max:postgres.max_connections{env:production} > 0.9'
message: |
  🚨 **Connection pool at 90% capacity!**

  Current: % of max connections
  Impact: New API requests will fail with connection errors

  **Actions**:
  1. Check DataDog APM for connection leak (dashboard: Zixly - Database Health)
  2. Identify slow queries holding connections: `SELECT * FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start`
  3. Consider upgrading Supabase plan for more connections

priority: P1
notify:
  - pagerduty
  - slack:#alerts-production

Alert 2: Slow Query Spike

name: '[P2] Database Query Performance Degradation'
query: 'avg(last_15m):avg:postgres.query.time.p95{env:production} > 500'
message: |
  ⚠️ **Query latency exceeded 500ms (p95)**

  Current p95: ms
  Target: <100ms
  Impact: Dashboard load time >2.5s (violates NFR-1.1)

  **Actions**:
  1. Identify slow queries: `SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10`
  2. Check for missing indexes (investigate EXPLAIN ANALYZE output)
  3. Review application code for N+1 query patterns

priority: P2
notify:
  - slack:#alerts-production

Alert 3: Disk Space Warning

name: '[P2] Database Disk Space >80%'
query: 'avg(last_1h):avg:system.disk.used{env:production} / avg:system.disk.total{env:production} * 100 > 80'
message: |
  ⚠️ **Database disk usage exceeded 80%**

  Current: %
  Recommended action: Archive old data or upgrade storage

  **Actions**:
  1. Check table sizes: `SELECT tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) FROM pg_tables WHERE schemaname = 'public' ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC`
  2. Archive financial data >3 years old to S3
  3. Upgrade Supabase plan if needed

priority: P2
notify:
  - slack:#alerts-production

Performance Queries

Query 1: Current Active Queries

Purpose: Identify currently running queries (useful during incidents).

SELECT
  pid,
  now() - query_start AS duration,
  state,
  query,
  wait_event_type,
  wait_event
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid()  -- Exclude this query itself
ORDER BY query_start;

Interpretation:


Query 2: Top 10 Slowest Queries (Historical)

Purpose: Identify queries to optimize.

Requires: pg_stat_statements extension (enabled by default in Supabase).

SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time,
  stddev_exec_time,
  rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Action: For each slow query, run EXPLAIN ANALYZE to identify missing indexes.


Query 3: Table Bloat (Dead Tuples)

Purpose: Identify tables needing VACUUM.

SELECT
  schemaname,
  tablename,
  n_live_tup AS live_rows,
  n_dead_tup AS dead_rows,
  ROUND(100 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_percent,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000  -- Only tables with >1k dead rows
ORDER BY dead_percent DESC;

Action:

VACUUM ANALYZE financials;  -- Manual vacuum if dead_percent >20%

Query 4: Index Hit Rate

Purpose: Ensure queries use indexes (not sequential scans).

SELECT
  'index hit rate' AS metric,
  ROUND(100 * sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit + idx_blks_read), 0), 2) AS percent
FROM pg_statio_user_indexes
UNION ALL
SELECT
  'table hit rate' AS metric,
  ROUND(100 * sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit + heap_blks_read), 0), 2) AS percent
FROM pg_statio_user_tables;

Target: > 99% index/table hit rate (data served from cache, not disk).

If < 95%: Increase shared_buffers (requires Supabase plan upgrade).


Troubleshooting Playbooks

Playbook 1: API Timeout Errors

Symptoms:

Investigation:

  1. Check active queries:
    SELECT * FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;
    
  2. Identify long-running query (duration >5s)
  3. Run EXPLAIN ANALYZE on query to find bottleneck

Resolution:


Playbook 2: Connection Pool Exhausted

Symptoms:

Investigation:

  1. Identify queries holding connections:
    SELECT pid, usename, state, query, state_change
    FROM pg_stat_activity
    ORDER BY state_change
    LIMIT 20;
    
  2. Check for connection leaks in application code (Prisma client not closed)

Resolution:


Playbook 3: Slow Dashboard Load Time

Symptoms:

Investigation:

  1. Profile slow API route:
    EXPLAIN ANALYZE
    SELECT * FROM financials
    WHERE client_kpi_id = 'client_123'
      AND record_date BETWEEN '2025-01-01' AND '2025-12-31';
    
  2. Check if index used (should see “Index Scan” not “Seq Scan”)

Resolution:


Capacity Planning

Growth Projections

Metric Current (MVP) 100 Customers 500 Customers 1,000 Customers
Financials rows 10K 1M 5M 10M
Database size 100 MB 10 GB 50 GB 100 GB
Daily queries 10K 1M 5M 10M
Connections (avg) 10 50 200 500

Upgrade Triggers

Threshold Action
Disk > 70% full Archive old data + upgrade Supabase plan
Connections > 80% Enable PgBouncer + upgrade connection limit
Financials > 10M rows Implement monthly partitioning (see Schema Evolution)
Query p95 > 500ms Add read replica for dashboard queries


Document End

Review Cycle: Monthly (review metrics dashboard, adjust thresholds) Next Review: 2025-11-15 Change History:

Approval: