Monitoring

“It is a capital mistake to theorize before one has data; insensibly one begins to twist facts to suit theories, instead of ...” --Sherlock Holmes

Monitoring?

What?

For Whom?

How?

What is Monitoring?

Tools and processes to measure and manage systems

Translation between value and the metrics generated by the systems

What fails, and why?

For Whom?

Business

IT

Types of Monitoring | Maturity Model

  1. Manual (executed by the user)
  2. Reactive
  3. Proactive

Manual Monitoring

Checklists

Simple scripts

Covers only what has failed before, and it gets fixed the same way it was fixed before

Focus only on minimizing downtime

Reactive Monitoring

Automated, with some manual remnants

Alerts with simple thresholds

Consoles showing status

Focus on availability (infrastructure)

Reactive updates

New measurements are added as the last step of a deployment

Proactive

The core of operations

Automatic, generated by configuration management

Applications ship with instrumentation built in

Metrics on application and business behavior (as opposed to CPU and disk)

Quality of Service and User Experience

Products are not considered complete if they have no monitoring

Focus: State and Performance

through: Events, Metrics, and Logs

What to Monitor

Monitoring of:

  • Business
  • Machine
  • Environment
  • Black Box

Business (Whitebox)

  • Traceability
  • Where it has been
  • Where it is now

How?

Log analysis (ELK, CloudWatch metrics, mtail)

Instrumentation (Java, Netflix Spectator, JMX, Heartbeat, Heartbeat2)
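
As an illustrative aside (not from the original slides), application instrumentation can be as small as a request counter plus a latency histogram exposed over HTTP. This sketch uses the Python prometheus_client library as a stand-in for the Java-oriented tools listed above; the metric names, port, and simulated workload are made up:

    # Minimal instrumentation sketch (assumes the prometheus_client package is installed).
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled")
    ERRORS = Counter("app_errors_total", "Requests that failed")
    LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    def handle_request():
        with LATENCY.time():                       # records how long the block takes
            REQUESTS.inc()
            time.sleep(random.uniform(0.01, 0.1))  # simulated work
            if random.random() < 0.05:             # simulated 5% error rate
                ERRORS.inc()

    if __name__ == "__main__":
        start_http_server(8000)  # metrics scrapeable at http://localhost:8000/metrics
        while True:
            handle_request()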

Machine

  • Application instrumentation (code | binary)
    • Throughput
    • MTTR (Mean Time To Respond)
    • Errors (logs)
  • OS
    • Resource usage (see the sketch after this list)
    • Errors
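
A minimal sketch of OS-level resource checks, assuming the psutil package; what to do with the numbers (thresholds, alerting) is deliberately left out:

    # OS resource snapshot sketch (assumes the psutil package is installed).
    import psutil

    def os_snapshot():
        return {
            "cpu_percent": psutil.cpu_percent(interval=1),    # CPU utilization over 1s
            "mem_percent": psutil.virtual_memory().percent,   # RAM in use
            "disk_percent": psutil.disk_usage("/").percent,   # root filesystem usage
            "load_avg": psutil.getloadavg(),                  # 1, 5, 15 minute load averages
        }

    if __name__ == "__main__":
        for name, value in os_snapshot().items():
            print(f"{name}: {value}")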

Environment

  • Network
  • Services used
  • Interactions between services

Black Box

  • Users
    • Origin
    • Volume
    • "Type"
    • Entry medium (browser, OS, referer ...)
  • Application (as seen by the user)
    • Latency
    • Availability
    • Apdex (see the sketch after this list)
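
Apdex condenses user-visible latency into one score: (satisfied + tolerating / 2) / total, where requests at or below a target threshold T count as satisfied and requests at or below 4T count as tolerating. A small sketch with a made-up threshold and sample data:

    # Apdex sketch: score = (satisfied + tolerating / 2) / total samples.
    def apdex(latencies_ms, t_ms=500):
        satisfied = sum(1 for l in latencies_ms if l <= t_ms)
        tolerating = sum(1 for l in latencies_ms if t_ms < l <= 4 * t_ms)
        return (satisfied + tolerating / 2) / len(latencies_ms)

    samples = [120, 300, 450, 700, 1900, 2500, 90, 5000]  # illustrative latencies in ms
    print("Apdex:", apdex(samples))                        # (4 + 2/2) / 8 = 0.625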

From Google SRE, the Golden Signals:

  • Latency
  • Traffic
  • Errors
  • Saturation

Metric definitions (USE / RED):

  • Utilization (percentage)
  • Saturation (queue depth)
  • Errors (errors/s)
  • Rate (requests/s)
  • Latency (response time, queue/wait time)

Worry about the tail: don't use averages; use a histogram with exponential buckets (see the sketch below).

Resolution of measurements:

  • Should match the SLO
  • If too frequent, use sampling and aggregates

The monitoring system should be simple, and its rules should be simple.
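
A sketch of the "histogram with exponential buckets" idea: bucket boundaries grow geometrically, and tail percentiles are read off the bucket counts instead of averaging raw samples. The boundaries and data below are illustrative:

    # Latency histogram with exponential bucket boundaries (illustrative values).
    import bisect

    BOUNDS_MS = [2 ** i for i in range(1, 12)]  # 2, 4, 8, ..., 2048 ms

    def build_histogram(latencies_ms):
        counts = [0] * (len(BOUNDS_MS) + 1)     # final bucket is the +Inf overflow
        for l in latencies_ms:
            counts[bisect.bisect_left(BOUNDS_MS, l)] += 1
        return counts

    def percentile_upper_bound(counts, p):
        """Return the bucket upper bound that covers the p-th percentile."""
        target = p / 100 * sum(counts)
        running = 0
        for i, count in enumerate(counts):
            running += count
            if running >= target:
                return BOUNDS_MS[i] if i < len(BOUNDS_MS) else float("inf")

    latencies = [3, 5, 7, 9, 12, 15, 40, 45, 80, 900]  # note the long tail
    counts = build_histogram(latencies)
    print("p50 <=", percentile_upper_bound(counts, 50), "ms")  # p50 <= 16 ms
    print("p99 <=", percentile_upper_bound(counts, 99), "ms")  # p99 <= 1024 ms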

Operations

SLA, SLO, SLI

Monitoring and alarms:

  • Every alarm should be actionable
  • Start with static thresholds (Nagios, Zabbix, DataDog)
  • Alarm when something is definitely wrong
  • Alarms with lower bounds, not just upper ones
  • Don't use averages; use a median over a short measuring window, or percentiles (see the sketch below)
  • Anomaly detection: affected by seasonality

Newer tools:

  • Cloud: DataDog, SignalFx
  • On premise: Prometheus, InfluxDB
  • Visualization: Weave Works, Splunk, Netsil

Testing

Postmortems: blameless, learn from failure
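
A sketch of the "percentiles over a short window instead of averages" idea; the window size and threshold are made up:

    # Sliding-window percentile alert sketch (window size and limit are illustrative).
    from collections import deque

    WINDOW = deque(maxlen=300)   # roughly the last 5 minutes at 1 sample/second
    P95_LIMIT_MS = 800           # alert when p95 latency exceeds this

    def percentile(values, p):
        ordered = sorted(values)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

    def record_and_check(latency_ms):
        WINDOW.append(latency_ms)
        p95 = percentile(WINDOW, 95)
        if p95 > P95_LIMIT_MS:
            print(f"ALERT: p95 latency {p95} ms over the last {len(WINDOW)} samples")
        return p95

    for sample in [100, 120, 130, 140, 950, 1200, 1500, 2000]:  # simulated samples
        record_and_check(sample)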

Cascading failures:

"If at first you don't succeed, back off exponentially." --Dan Sandler, Google Software Engineer
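
A minimal retry sketch with exponential backoff and full jitter, which avoids the synchronized retry storms that feed cascading failures; the base delay, cap, and attempt count are illustrative:

    # Exponential backoff with full jitter (all parameters are illustrative).
    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base_s=0.1, cap_s=10.0):
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise                             # give up after the last attempt
                delay = min(cap_s, base_s * 2 ** attempt)
                time.sleep(random.uniform(0, delay))  # full jitter spreads retries out

    # Usage (flaky_service is hypothetical):
    # call_with_backoff(lambda: flaky_service.ping())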

Deployments:

  • Canary (see the sketch below)
  • A/B
  • Blue/Green
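
A toy sketch of the canary idea: route a small fraction of traffic to the new version and only promote it while its observed error rate stays low. The weights, version names, and error limit are made up:

    # Canary traffic-split sketch (weights and version names are illustrative).
    import random

    CANARY_WEIGHT = 0.05  # send 5% of requests to the canary

    def pick_version():
        return "v2-canary" if random.random() < CANARY_WEIGHT else "v1-stable"

    def canary_healthy(canary_errors, canary_requests, max_error_rate=0.01):
        """Promote only while the canary's observed error rate stays under the limit."""
        if canary_requests == 0:
            return False
        return canary_errors / canary_requests <= max_error_rate

    counts = {"v1-stable": 0, "v2-canary": 0}
    for _ in range(10_000):
        counts[pick_version()] += 1
    print(counts)  # roughly a 95 / 5 split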

Problem Resolution Strategies (http://www.brendangregg.com/methodology.html):

Anti-methodologies:

  • Blame Someone Else:
    1. Find a system or environment component you are not responsible for
    2. Hypothesize that the issue is with that component
    3. Redirect the issue to the responsible team
    4. When proven wrong, go to 1
  • Streetlight:
    1. Pick observability tools that are familiar, found on the Internet, or found at random
    2. Run the tools
    3. Look for obvious issues
  • Drunk Man:
    1. Change things at random until the problem goes away
  • Random Change:
    1. Measure a performance baseline
    2. Pick a random attribute to change (e.g., a tunable)
    3. Change it in one direction
    4. Measure performance
    5. Change it in the other direction
    6. Measure performance
    7. Were the step 4 or 6 results better than the baseline? If so, keep the change; if not, revert
    8. Go to step 1
  • Passive Benchmarking:
    1. Pick a benchmark tool
    2. Run it with a variety of options
    3. Make a slide deck of the results
    4. Hand the slides to management
  • Traffic Light:
    1. Open the dashboard
    2. All green? Assume everything is good
    3. Something red? Assume that is the problem

Methodologies:

  • USE (Utilization, Saturation, Errors):
    • List the resources
    • For every resource, check utilization, saturation, and errors (see the sketch after this list)
  • RED (Rate, Errors, Duration)
  • TSA: for each thread, measure the time spent in each state; investigate from most to least frequent
  • Ad hoc checklist
  • Problem statement:
    • What makes you think there is a performance problem?
    • Has this system ever performed well?
    • What has changed recently? (Software? Hardware? Load?)
    • Can the performance degradation be expressed in terms of latency or run time?
    • Does the problem affect other people or applications (or is it just you)?
    • What is the environment? What software and hardware is used? Versions? Configuration?
  • RTFM
  • Scientific Method
  • OODA: Observe, Orient, Decide, Act
  • Workload characterization
  • Drill-down
  • Elimination (binary search)
  • Tools Method:
    • List the tools (or add more)
    • For each tool, list its useful metrics
    • For each metric, list how to interpret it
    • Run the selected tools and interpret the results
  • CPU profile with a flame graph
  • Performance Evaluation Steps:
    1. State the goals of the study and define the system boundaries
    2. List system services and possible outcomes
    3. Select performance metrics
    4. List system and workload parameters
    5. Select factors and their values
    6. Select the workload
    7. Design the experiments
    8. Analyze and interpret the data
    9. Present the results
    10. If necessary, start over
  • Capacity Planning Process:
    1. Instrument the system
    2. Monitor system usage
    3. Characterize the workload
    4. Predict performance under different alternatives
    5. Select the lowest cost, highest performance alternative
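
A small sketch of the USE method as an automated checklist: for each resource, evaluate utilization, saturation, and errors, and flag anything over a limit. The resources, probe functions, and limits are all illustrative placeholders:

    # USE method sketch: per-resource Utilization / Saturation / Errors checklist.
    # The probes return hard-coded numbers; real ones would read /proc, APIs, etc.
    RESOURCES = {
        "cpu":  {"utilization": lambda: 0.93, "saturation": lambda: 5, "errors": lambda: 0},
        "disk": {"utilization": lambda: 0.40, "saturation": lambda: 0, "errors": lambda: 2},
        "net":  {"utilization": lambda: 0.10, "saturation": lambda: 0, "errors": lambda: 0},
    }

    LIMITS = {"utilization": 0.80, "saturation": 1, "errors": 0}

    def use_checklist():
        findings = []
        for resource, probes in RESOURCES.items():
            for signal, probe in probes.items():
                value = probe()
                if value > LIMITS[signal]:
                    findings.append(f"{resource}: {signal} = {value} (limit {LIMITS[signal]})")
        return findings

    for finding in use_checklist():
        print("CHECK:", finding)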

Summary:

  • Preparedness and disaster testing
  • Postmortem culture
  • Automation and reduced operational overhead
  • Structured and rational decision making

Thank You