Data-cat

November 1 2019

#python #datadog #monitoring #cloud

Deploying DataDog for a large-scale infrastructure

Definitions

  • Geographic Regions
  • Stages
  • Applications

Geographic Regions

A geographic region matches the definition of an AWS Region. The same concept can be used for GCP or on-prem datacenters as well.

Stages

Different stages of application deployments, usually: dev, qa, prod.

Applications

A service that provides a distinct business functionality.

Goals

  • having all monitors and dashboards in version control
  • having all monitors templated
  • being able to address smaller parts of the infrastructure

Implementation

Four files represent the DataDog configuration for the whole infrastructure.

  • infrastructure.yaml

It contains the logical grouping of applications into stages and regions. The relations are always N:M. One region can contain many stages, and each stage can contain many applications.

  • region.yaml

Defaults for a certain region (region).

  • stage.yaml

Defaults for a certain stage (region, stage).

  • application.yaml

Configuration that is specific for a certain application (region, stage, application).

Generating infrastructure.yaml

I recently discovered Dhall, which seems like the perfect fit for writing the infrastructure definition and then generating the YAML files from it.

The type-safe definitions look like the following:

-- Assumption: the Prelude is imported from its canonical URL
let Prelude = https://prelude.dhall-lang.org/package.dhall

-- Builds a single map entry in the mapKey/mapValue shape
let keyValue =
        λ(k : Type)
      → λ(v : Type)
      → λ(mapKey : k)
      → λ(mapValue : v)
      → { mapKey = mapKey, mapValue = mapValue }

let ApplicationConfig : Type = { created_at : Text }
let Application = < etcd | postgresql | hadoop >
let Applications = Prelude.Map.Type Application ApplicationConfig
let application = keyValue Application ApplicationConfig

let Stage = < dev | qa | prod >
let Stages = Prelude.Map.Type Stage Applications
let stage = keyValue Stage Applications

let AwsRegion = < us-east-1 | eu-central-1 | eu-west-1 >
let AwsRegions = Prelude.Map.Type AwsRegion Stages
let awsRegion = keyValue AwsRegion Stages

After having these definitions we can create the infrastructure:

in  [ awsRegion AwsRegion.us-east-1
        [ stage Stage.dev
            [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
            , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
            ]
        , stage Stage.qa
            [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
            , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
            ]
        ]
    , awsRegion AwsRegion.eu-west-1
        [ stage Stage.dev
            [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
            , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
            ]
        ]
    , awsRegion AwsRegion.eu-central-1
        [ stage Stage.dev
            [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
            , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
            ]
        ]
    ]

Generating the YAML:

dhall-to-yaml --file infrastructure.dhall > infrastructure.yaml

Generating the folder structure

python3 gen.py
region: eu-central-1, stage: dev
region: eu-central-1, stage: dev, app: etcd
region: eu-central-1, stage: dev, app: hadoop
region: eu-west-1, stage: dev
region: eu-west-1, stage: dev, app: etcd
region: eu-west-1, stage: dev, app: hadoop
region: eu-west-1, stage: prod
region: eu-west-1, stage: prod, app: etcd
region: eu-west-1, stage: prod, app: hadoop
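
gen.py itself is not shown here, so the following is only a minimal sketch of what such a script could look like. It assumes that the parsed infrastructure.yaml is a nested mapping of region, stage and application, and that each level becomes a folder. The INFRASTRUCTURE dict and the generate_folders function are hypothetical names used for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the parsed infrastructure.yaml:
# region -> stage -> application -> config.
INFRASTRUCTURE = {
    "eu-central-1": {"dev": {"etcd": {}, "hadoop": {}}},
    "eu-west-1": {"dev": {"etcd": {}, "hadoop": {}}},
}


def generate_folders(infrastructure: dict, root: Path) -> list:
    """Create root/region/stage/application and return the application folders."""
    created = []
    for region, stages in sorted(infrastructure.items()):
        for stage, applications in sorted(stages.items()):
            stage_dir = root / region / stage
            stage_dir.mkdir(parents=True, exist_ok=True)
            print(f"region: {region}, stage: {stage}")
            for app in sorted(applications):
                app_dir = stage_dir / app
                app_dir.mkdir(exist_ok=True)
                created.append(app_dir)
                print(f"region: {region}, stage: {stage}, app: {app}")
    return created


if __name__ == "__main__":
    # Write into a throwaway directory so the sketch has no side effects.
    with tempfile.TemporaryDirectory() as tmp:
        generate_folders(INFRASTRUCTURE, Path(tmp))
```

Sorting the keys keeps the output stable between runs, which makes diffs in version control easier to read.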

Templates

The templates folder contains the monitor templates.

Example template:

---
name: High CPU load on application_name:{application_name} stage:{stage} {{{{host.name}}}} / {{{{host.ip}}}}
tags:
  - application_name:{application_name}
  - stage:{stage}
  - region:{region}
type: metric alert
query: avg(last_5m):avg:system.load.norm.5{{application_name:{application_name},stage:{stage}}} by {{host}} > {critical_threshold}
message: >-2
  High CPU load on application_name:{application_name} stage:{stage} {{{{host.name}}}} / {{{{host.ip}}}} for 5 consecutive minutes on this node.
  Url: https://wd-global-prod.datadoghq.com/monitors/{monitor_id}
  {slack_notification_channel}
monitor_options:
  notify_audit: False
  locked: False
  timeout_h: 0
  silenced: {{}}
  include_tags: True
  require_full_window: True
  new_host_delay: 300
  notify_no_data: False
  renotify_interval: 0
  escalation_message: >-2
    CPU load is still damn high.
  thresholds:
    critical: {critical_threshold}
    warning: {warning_threshold}

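The template is rendered with Python's str.format and then converted to a dict that is used to talk to the DataDog API. Because str.format treats single braces as placeholders, literal braces have to be doubled: {{host}} renders as {host}, and {{{{host.name}}}} renders as {{host.name}}, which DataDog then expands itself.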
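
To make the brace handling concrete, here is a small sketch of the rendering step. The template is a trimmed-down, hypothetical version of the one above, and the render function and the values passed to it are assumptions for illustration. Parsing the rendered YAML into a dict and calling the API are left out.

```python
# Hypothetical, shortened template; the full one lives in the templates folder.
TEMPLATE = (
    "name: High CPU load on application_name:{application_name} "
    "stage:{stage} {{{{host.name}}}}\n"
    "query: avg(last_5m):avg:system.load.norm.5"
    "{{application_name:{application_name},stage:{stage}}} by {{host}} "
    "> {critical_threshold}\n"
)


def render(template: str, **values) -> str:
    """Fill single-brace placeholders; doubled braces become literal braces."""
    return template.format(**values)


rendered = render(
    TEMPLATE,
    application_name="etcd",
    stage="qa",
    critical_threshold=0.9,
)
print(rendered)
```

A missing value raises a KeyError during rendering, so a template that references an undefined setting fails before anything is sent to DataDog.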

Defaults and specifics

Defaults are stage-wide settings; specifics apply to a single application (in a given region and stage).
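
One way to combine the layers is a recursive dict merge where the more specific file wins. This is a sketch under that assumption, not necessarily how data-cat does it. The merge function and all the setting values are hypothetical.

```python
def merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override merged in; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested mappings instead of replacing them wholesale.
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of region.yaml, stage.yaml and application.yaml.
region_defaults = {"slack_notification_channel": "@slack-ops", "warning_threshold": 0.7}
stage_defaults = {"critical_threshold": 0.9}
application_specifics = {"critical_threshold": 0.95}

# Least specific first, most specific last.
settings = merge(merge(region_defaults, stage_defaults), application_specifics)
print(settings)
```

The resulting dict can then be passed as the values for rendering a template.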

Tags alignment

For all of the above to work together nicely, the tags must be deployed on every node, ELB, etc., so that we can reference them in monitors and dashboards.

Deployment

I gave up on Conda and now just use Python's built-in venv.

/usr/local/opt/python3/bin/python3 -m venv venv
. venv/bin/activate.fish #or the shell you are using
pip install --upgrade pip
pip install --upgrade toml pyyaml

Deploying monitors

Deploying a whole stage:

./data-cat/data-cat.py deploy-monitors -r eu-west-1 -s qa

Deploying a single application:

./data-cat/data-cat.py deploy-monitors -r eu-west-1 -s qa -a etcd

Deploying dashboards

Deploying a whole stage:

./data-cat/data-cat.py deploy-dashboards -r eu-west-1 -s qa

Deploying a single application:

./data-cat/data-cat.py deploy-dashboards -r eu-west-1 -s qa -a etcd
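
The command-line interface used above could be built with argparse along these lines. This is only a sketch: the short flags and subcommand names come from the examples, while the long option names and the build_parser function are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per deployable resource type."""
    parser = argparse.ArgumentParser(prog="data-cat.py")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for command in ("deploy-monitors", "deploy-dashboards"):
        sub = subparsers.add_parser(command)
        sub.add_argument("-r", "--region", required=True)
        sub.add_argument("-s", "--stage", required=True)
        # Omitting the application means "deploy the whole stage".
        sub.add_argument("-a", "--application", default=None)
    return parser


args = build_parser().parse_args(
    ["deploy-monitors", "-r", "eu-west-1", "-s", "qa", "-a", "etcd"]
)
print(args.command, args.region, args.stage, args.application)
```

Making the application optional is what lets the same subcommand address either a whole stage or a single application.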

What to monitor

Following Brendan Gregg's USE method (utilization, saturation, errors), these are the suggested resources to monitor:

  • CPUs: sockets, cores, hardware threads (virtual CPUs)
  • Memory: capacity
  • Network interfaces
  • Storage devices: I/O, capacity
  • Controllers: storage, network cards
  • Interconnects: CPUs, memory, I/O

How to monitor it (examples):

  • utilization: as a percent over a time interval, e.g. "one disk is running at 90% utilization"
  • saturation: as a queue length, e.g. "the CPUs have an average run queue length of four"
  • errors: scalar counts, e.g. "this network interface has had fifty late collisions"