Data-cat

2019/11/01

#python #datadog #monitoring #cloud

Table of contents

Data-cat

Deploying DataDog for a large scale infrastructure

Definitions

  • Geographic Regions
  • Stages
  • Applications

Geographic Regions

Matches the definitions of AWS Regions. It can be used for GCP or on-prem datacenter as well.

Stages

Different stages of application deployments, usually: dev, qa, prod.

Applications

A service that provides a distinct business functionality.

Goals

  • having all monitors and dashboards in version control
  • having all monitors templated
  • being able to address smaller parts of the infrastructure

Implementation

4 files represent the DataDog configuration for the whole infrastructure.

  • infrastructure.yaml

It contains the logical grouping of applications into stages and regions. The relations are always N:M. 1 region can contain many stages and many applications in each stage.

  • region.yaml

Defaults for a certain region (region).

  • stage.yaml

Defaults for a certain stage (region, stage).

  • application.yaml

Configuration that is specific for a certain application (region, stage, application).

Generating infrastructure.yaml

I recently discovered Dhall that seems like the perfect fit to write the infrastructure in and than generate the YAML files.

The type safe definitions looks like the following:

let keyValue =
        λ(k : Type)
       λ(v : Type)
       λ(mapKey : k)
       λ(mapValue : v)
       { mapKey = mapKey, mapValue = mapValue }

let ApplicationConfig : Type = { created_at : Text }

let Application = < etcd | postgresql | hadoop >
let Applications = Prelude.Map.Type Application ApplicationConfig
let application = keyValue Application ApplicationConfig

let Stage = < dev | qa | prod >
let Stages = Prelude.Map.Type Stage Applications
let stage = keyValue Stage Applications

let AwsRegion = < us-east-1 | eu-central-1 | eu-west-1 >
let AwsRegions = Prelude.Map.Type AwsRegion Stages
let awsRegion = keyValue AwsRegion Stages

After having these definitions we can create the infrastructure:

in  [ awsRegion AwsRegion.us-east-1
        [ stage Stage.dev
             [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
             , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
             ]
        , stage Stage.qa
             [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
             , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
             ]
        ]

    , awsRegion AwsRegion.eu-west-1
        [ stage Stage.dev
             [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
             , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
             ]
        ]
    , awsRegion AwsRegion.eu-central-1
        [ stage Stage.dev
            [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" }
            , application Application.etcd { created_at = "2019-11-04T09:00:00Z" }
            ]
        ]
    ]

Generating the YAML:

dhall-to-yaml --file infrastructure.dhall > infrastructure.yaml

Generating the folder structure

python3 gen.py
region: eu-central-1, stage: dev
region: eu-central-1, stage: dev, app: etcd
region: eu-central-1, stage: dev, app: hadoop
region: eu-west-1, stage: dev
region: eu-west-1, stage: dev, app: etcd
region: eu-west-1, stage: dev, app: hadoop
region: eu-west-1, stage: prod
region: eu-west-1, stage: prod, app: etcd
region: eu-west-1, stage: prod, app: hadoop

Templates

Templates folder has the monitor templates.

Example template:

---
name: High CPU load on application_name:{application_name} stage:{stage} {{{{host.name}}}} / {{{{host.ip}}}}
tags:
  - application_name:{application_name}
  - stage:{stage}
  - region:{region}
type: metric alert
query: avg(last_5m):avg:system.load.norm.5{{application_name:{application_name},stage:{stage}}} by {{host}} > {critical_threshold}
message: >-2
  High CPU load on application_name:{application_name} stage:{stage} {{{{host.name}}}} / {{{{host.ip}}}} for 5 consecutive minutes on this node.
  Url: https://wd-global-prod.datadoghq.com/monitors/{monitor_id}
  {slack_notification_channel}
monitor_options:
  notify_audit: False
  locked: False
  timeout_h: 0
  silenced: {{}}
  include_tags: True
  require_full_window: True
  new_host_delay: 300
  notify_no_data: False
  renotify_interval: 0
  escalation_message: >-2
    CPU load is still damn high.
  thresholds:
    critical: {critical_threshold}
    warning: {warning_threshold}

This gets rendered using Python format and converted to a dict that used to talk to the DataDog API.

Defaults and specifics

Defaults are stage wide settings specifics are specific to a single application (in a region & stage).

Tags alignment

For all of these above to work together nicely there is a dependency on tags being deployed every node, ELB, etc., so that we can reference those in monitors and dashboards.

Deployment

I gave up on Conda and now just using venv from Python.

/usr/local/opt/python3/bin/python3 -m venv venv
. venv/bin/activate.fish #or the shell you are using
pip install --upgrade pip
pip install --upgrade toml pyyaml

Deploying monitors

Deploying a whole stage:

./data-cat/data-cat.py deploy-monitors -r eu-west-1 -s qa

Deploying a single application:

./data-cat/data-cat.py deploy-monitors -r eu-west-1 -s qa -a etcd

Deploying dashboards

Deploying a whole stage:

./data-cat/data-cat.py deploy-dashboards -r eu-west-1 -s qa

Deploying a single application:

./data-cat/data-cat.py deploy-dashboards -r eu-west-1 -s qa -a etcd

What to monitor

Following Brendan Gregg’s use method and the suggested things to monitor:

  • CPUs: sockets, cores, hardware threads (virtual CPUs)
  • Memory: capacity
  • Network interfaces
  • Storage devices: I/O, capacity
  • Controllers: storage, network cards
  • Interconnects: CPUs, memory, I/O

How to monitor it (examples):

  • utilization: as a percent over a time interval. eg, “one disk is running at 90% utilization”
  • saturation: as a queue length. eg, “the CPUs have an average run queue length of four”
  • errors: scalar counts. eg, “this network interface has had fifty late collisions”