Note: this project is discontinued. Lessons learned from it went into its successor Polscan!
Why is so much knowledge about your IT architecture implicit? Why do we need to check what is running during an incident to know about the state of the system? Which components are affected by this Nagios alert? Why does no one ever update the system documentation?
If you care about the questions above, try the "DevOps Trouble Map" (DOTM for short), which
- doesn't reinvent monitoring, but integrates with Nagios, Icinga & Co.
- provides automatic layer 4 system architecture charts.
- maps alerts live into system architecture charts.
Note that the project is pre-alpha right now. Here are some impressions of what the code does so far:
Mapping of Nagios alerts to detected services (note the 2nd column in the alert table):
Those Nagios "service check" to "service" mappings are fuzzy logic regular expressions. DOTM brings presets and allows the user to refine them as needed. The fact that those mappings are actually necessary indicates the intrinsic problem of the missing service relation in Nagios, which mixes the concepts of "services" and "service checks". Only with "services" (which we detect based on open TCP ports) we can auto-detect impact.
In addition to the Nagios node and service states, DOTM aggregates the current connection details from the nodes. It remembers old connections so it can show service usage transitions and raise alarms for long-unused or suddenly disconnected services. This helps answer typical questions like "do we actually still need this X?" and can uncover a wrong firewall configuration.
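A minimal sketch of the "long unused" detection on remembered connection records; the two-week threshold below is an assumption for illustration, not a DOTM default:

    # Hypothetical sketch: flag remembered connections that have been
    # silent for longer than a threshold ("long unused" detection).
    import time

    STALE_AFTER = 14 * 24 * 3600  # assumed: two weeks without traffic

    def find_stale_connections(connections, now=None):
        """connections: iterable of dicts with a 'last_seen' UNIX timestamp."""
        now = now if now is not None else time.time()
        return [c for c in connections if now - c['last_seen'] > STALE_AFTER]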
Finally, those two sources of information are combined into a very simple "graphical" representation:
In this "node graph", color coding indicates Nagios alerts as well as service states discovered only by DOTM.
The DOTM server has the following dependencies:
- netcat
- redis-server
- MaxMind GeoIP Lite (optional)
- Python2 modules:
  - redis
  - bottle
  - requests
  - GeoIP
To automatically install the server including its dependencies on Debian/Ubuntu, simply run:
    scripts/install-server.sh
The DOTM agent has the following dependencies:
- glib-2.0
- libevent2
To automatically install the dotm_node agent, simply run:
    scripts/install-agent.sh
Of course, as the agent is meant to run on all monitored systems, its single binary should be distributed to all nodes using your favourite automation tool.
DOTM will use the following technologies (a sketch of how they fit together follows the list):
- Simple remote agent "dotm_node" (in C using libevent and glib)
- Redis as backend store
- Python2 bottle with Jinja templating
- JSON backend data access
- any jQuery library for rendering
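As an illustration of how these pieces could fit together, here is a minimal sketch of a bottle route serving node details from Redis as JSON; the route is hypothetical, while the key name follows the Redis schema listed below:

    # Minimal sketch of the stack above: bottle serving JSON from Redis.
    # The /nodes/<name> route is an assumption, not DOTM's real API.
    import json
    import redis
    from bottle import Bottle, response

    app = Bottle()
    db = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)

    @app.route('/nodes/<name>')
    def node_details(name):
        response.content_type = 'application/json'
        return json.dumps(db.hgetall('dotm::nodes::' + name))

    if __name__ == '__main__':
        app.run(host='localhost', port=8080)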
Right now the following relation namespaces are used in Redis (an example walking them follows the list):
- dotm::nodes (list of node names, resolvable via the local resolver and identical to the remote hostname)
- dotm::nodes::<node name>
  - 'last_fetch' => <timestamp>
  - 'fetch_status' => <'OK' or error message>
  - 'ips' => <comma separated list of IPs>
  - 'service_alerts' => <hash of service name/status pairs>
- dotm::connections::<node name>::<port>::<remote node/IP> (hash):
  - 'process' => <string>
  - 'connections' => <int>
  - 'last_connection' => <timestamp>
  - 'last_seen' => <timestamp>
  - 'direction' => <in/out>
  - 'remote_host' => <IP or node name>
  - 'remote_port' => <port number or 'high'>
  - 'local_port' => <port number>
- dotm::services::<node name>::<port> (hash with the following keys):
  - 'process' => <string>
  - 'last_seen' => <timestamp>
- dotm::resolver::ip_to_node::<IP> (string, <node name>)
- dotm::checks::nodes::<node name> (key with set expire):
  - JSON containing basic status information: { "node": "hostname01", "status": "UP", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "hostname01 status information" }
- dotm::checks::services::<node name> (list of service JSONs with set expire):
  - List containing all associated service checks: [ { "node": "hostname01", "service": "Service01 name", "status": "OK", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service01 status information" }, { "node": "hostname01", "service": "Service02 name", "status": "CRITICAL", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service02 status information" } ]
- dotm::config::* (all preferences, for descriptions check the 'Settings' page)
- dotm::state (state info, usually update locks + timestamps)
  - last_updated (timestamp)
  - update_running (0 or 1)
  - last_snapshot (timestamp)
- dotm::queue (list of queued backend tasks in JSON)
  - {"id": <task key>, "fn": <function name/action>, "args": <function arguments>, "kwargs": <function keywords>}
- dotm::queue::result::<uuid4 name> (status and result of the queued task in JSON)
  - {"status": <pending/processing/ready>, "result": <result in JSON>}
- dotm::history (list of history <timestamps>)
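To make the schema concrete, here is a read-only sketch walking it with the Python redis module; it assumes a local Redis with default settings and uses KEYS for brevity where SCAN would scale better:

    # Read-only sketch walking the schema above. Assumes a local Redis
    # with default settings; KEYS is used for brevity, SCAN scales better.
    import redis

    r = redis.StrictRedis(decode_responses=True)

    for node in r.lrange('dotm::nodes', 0, -1):
        info = r.hgetall('dotm::nodes::' + node)
        print('%s: fetch_status=%s ips=%s' %
              (node, info.get('fetch_status'), info.get('ips')))
        # Enumerate the detected services of this node by key pattern
        for key in r.keys('dotm::services::%s::*' % node):
            port = key.rsplit('::', 1)[-1]
            svc = r.hgetall(key)
            print('  port %s: process=%s last_seen=%s' %
                  (port, svc.get('process'), svc.get('last_seen')))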