Note: this project is discontinued. Lessons learned from it went into its successor Polscan!
Why is so much knowledge about your IT architecture implicit? Why do we need to check what is running during an incident to know about the state of the system? Which components are affected by this Nagios alert? Why does no one ever update the system documentation?
If you care about the questions above, try the "DevOps Trouble Map" (DOTM for short), which
- doesn't reinvent monitoring, but integrates with Nagios, Icinga & Co.
- provides automatic layer 4 system architecture charts.
- maps alerts live into system architecture charts.
Note that the project is pre-alpha right now. Here are some impressions of what the code does so far:
Mapping of Nagios alerts to detected services (note the 2nd column in the alert table):
Those Nagios "service check" to "service" mappings are fuzzy logic regular expressions. DOTM brings presets and allows the user to refine them as needed. The fact that those mappings are actually necessary indicates the intrinsic problem of the missing service relation in Nagios, which mixes the concepts of "services" and "service checks". Only with "services" (which we detect based on open TCP ports) we can auto-detect impact.
In addition to the Nagios node and service states, DOTM aggregates the current connection details from the nodes. It remembers old connections so it can show service usage transitions and raise alarms for long-unused or suddenly disconnected services. This helps answer typical questions like "do we actually still need this X?" and can uncover a wrong firewall configuration.
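A minimal sketch of the "long unused" detection on remembered connection records; the two-week threshold below is an assumption for illustration, not a DOTM default:

    # Hypothetical sketch: flag remembered connections that have been
    # silent for longer than a threshold ("long unused" detection).
    import time

    STALE_AFTER = 14 * 24 * 3600  # assumed: two weeks without traffic

    def find_stale_connections(connections, now=None):
        """connections: iterable of dicts with a 'last_seen' UNIX timestamp."""
        now = now if now is not None else time.time()
        return [c for c in connections if now - c['last_seen'] > STALE_AFTER]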
Finally, those two sources of information are combined into a very simple "graphical" representation:
In this "node graph", color coding indicates Nagios alerts as well as service states discovered only by DOTM.
The DOTM server has the following dependencies:
- netcat
- redis-server
- MaxMind GeoIP Lite (optional)
- Python2 modules:
  - redis
  - bottle
  - requests
  - GeoIP
To automatically install the server including its dependencies on Debian/Ubuntu, simply run:
    scripts/install-server.sh
The DOTM agent has the following dependencies:
- glib-2.0
- libevent2
To automatically install the dotm_node agent, simply run:
    scripts/install-agent.sh
Of course, as the agent is meant to run on all monitored systems, its single binary should be distributed to all nodes using your favourite automation tool.
DOTM will use the following technologies (a sketch of how they fit together follows the list):
- Simple remote agent "dotm_node" (in C using libevent and glib)
- Redis as backend store
- Python2 bottle with Jinja templating
- JSON backend data access
- any jQuery library for rendering
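As an illustration of how these pieces could fit together, here is a minimal sketch of a bottle route serving node details from Redis as JSON; the route is hypothetical, while the key name follows the Redis schema listed below:

    # Minimal sketch of the stack above: bottle serving JSON from Redis.
    # The /nodes/<name> route is an assumption, not DOTM's real API.
    import json
    import redis
    from bottle import Bottle, response

    app = Bottle()
    db = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)

    @app.route('/nodes/<name>')
    def node_details(name):
        response.content_type = 'application/json'
        return json.dumps(db.hgetall('dotm::nodes::' + name))

    if __name__ == '__main__':
        app.run(host='localhost', port=8080)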
Right now the following relation namespaces are used in Redis (an example walking them follows the list):
- dotm::nodes (list of node names, resolvable via the local resolver and identical to the remote hostname)
- dotm::nodes::<node name>
  - 'last_fetch' => <timestamp>
  - 'fetch_status' => <'OK' or error message>
  - 'ips' => <comma separated list of IPs>
  - 'service_alerts' => <hash of service name/status pairs>
- dotm::connections::<node name>::<port>::<remote node/IP> (hash):
  - 'process' => <string>
  - 'connections' => <int>
  - 'last_connection' => <timestamp>
  - 'last_seen' => <timestamp>
  - 'direction' => <in/out>
  - 'remote_host' => <IP or node name>
  - 'remote_port' => <port number or 'high'>
  - 'local_port' => <port number>
- dotm::services::<node name>::<port> (hash with the following keys):
  - 'process' => <string>
  - 'last_seen' => <timestamp>
- dotm::resolver::ip_to_node::<IP> (string, <node name>)
- dotm::checks::nodes::<node name> (key with set expire):
  - JSON containing basic status information: { "node": "hostname01", "status": "UP", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "hostname01 status information" }
- dotm::checks::services::<node name> (list of service JSONs with set expire):
  - List containing all associated service checks: [ { "node": "hostname01", "service": "Service01 name", "status": "OK", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service01 status information" }, { "node": "hostname01", "service": "Service02 name", "status": "CRITICAL", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service02 status information" } ]
- dotm::config::* (all preferences, for descriptions check the 'Settings' page)
- dotm::state (state info, usually update locks + timestamps)
  - last_updated (timestamp)
  - update_running (0 or 1)
  - last_snapshot (timestamp)
- dotm::queue (list of queued backend tasks in JSON)
  - {"id": <task key>, "fn": <function name/action>, "args": <function arguments>, "kwargs": <function keywords>}
- dotm::queue::result::<uuid4 name> (status and result of the queued task in JSON)
  - {"status": <pending/processing/ready>, "result": <result in JSON>}
- dotm::history (list of history <timestamps>)
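To make the schema concrete, here is a read-only sketch walking it with the Python redis module; it assumes a local Redis with default settings and uses KEYS for brevity where SCAN would scale better:

    # Read-only sketch walking the schema above. Assumes a local Redis
    # with default settings; KEYS is used for brevity, SCAN scales better.
    import redis

    r = redis.StrictRedis(decode_responses=True)

    for node in r.lrange('dotm::nodes', 0, -1):
        info = r.hgetall('dotm::nodes::' + node)
        print('%s: fetch_status=%s ips=%s' %
              (node, info.get('fetch_status'), info.get('ips')))
        # Enumerate the detected services of this node by key pattern
        for key in r.keys('dotm::services::%s::*' % node):
            port = key.rsplit('::', 1)[-1]
            svc = r.hgetall(key)
            print('  port %s: process=%s last_seen=%s' %
                  (port, svc.get('process'), svc.get('last_seen')))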