DevOps-Trouble-Map


Commits

List of commits on branch master.
Verified
285f4464bf91acae77bedca106af8d11c1b3009b

Update README.md

lwindolf committed 6 years ago
Unverified
460d64ec676283eaded9b8516626a8b51121e9d1

Allow configuring SSH connection details as user and SSH key

lwindolf committed 10 years ago
Unverified
5a8f58539869e2281675cc7b5b8893f424129b53

Merge branch 'master' of https://github.com/lwindolf/DevOps-Trouble-Map

committed 10 years ago
Unverified
086c10c72caa21548a83380905e1ed15a0445327

Remove incorrect $()

lwindolf committed 10 years ago
Unverified
3b695d7e571febf46baf6fda3e3fd7de5b502a87

Moving nc definition where it is really needed.

lwindolf committed 10 years ago
Unverified
2d60f1e834a71a1910226fc4f6f52bb0fb49b71e

Fixes setting single scalar config value.

lwindolf committed 10 years ago

README

Note: this project is discontinued. Experiences from it went into its successor Polscan!

DevOps Trouble Map

Why is so much knowledge about your IT architecture implicit? Why do we need to check what is running during an incident to know about the state of the system? Which components are affected by this Nagios alert? Why does no one ever update the system documentation?

If these questions matter to you, try "DevOps Trouble Map" (DOTM for short), which

  • doesn't reinvent monitoring, but integrates with Nagios, Icinga & Co.
  • provides automatic layer 4 system architecture charts.
  • maps alerts live into system architecture charts.

Note that the project is pre-alpha right now. Here are some impressions of what the code does so far:

Mapping of Nagios alerts to detected services (note the 2nd column in the alert table):

Alert Mapping

These Nagios "service check" to "service" mappings are fuzzy-matching regular expressions. DOTM ships with presets and lets the user refine them as needed. That these mappings are necessary at all points to an intrinsic problem: Nagios lacks a service relation and mixes the concepts of "services" and "service checks". Only with "services" (which DOTM detects based on open TCP ports) can impact be detected automatically.
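The mapping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the preset patterns, the service names and the function name `map_check_to_service` are hypothetical and are not taken from the DOTM code base.

```python
import re

# Hypothetical presets (not DOTM's real ones): each entry maps a regular
# expression on the Nagios service check name to a detected service name.
CHECK_TO_SERVICE_PRESETS = [
    (re.compile(r"http|apache|nginx", re.IGNORECASE), "http"),
    (re.compile(r"mysql|mariadb", re.IGNORECASE), "mysql"),
    (re.compile(r"redis", re.IGNORECASE), "redis"),
]


def map_check_to_service(check_name, presets=CHECK_TO_SERVICE_PRESETS):
    """Return the service of the first preset matching check_name, else None."""
    for pattern, service in presets:
        if pattern.search(check_name):
            return service
    return None
```

A check named "MySQL Replication Lag" would map to the "mysql" service here, while a check such as "Disk /var" stays unmapped and cannot contribute to impact detection until the user adds a pattern for it.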

Service Mapping

In addition to the Nagios node and service states, DOTM aggregates the current connection details from the nodes. It remembers old connections so that it can show service usage transitions and raise alarms for services that have been unused for a long time or were suddenly disconnected. This helps with typical questions like "do we actually still need this X?" and can uncover a wrong firewall configuration.
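As a rough sketch of how "long unused" connections could be detected, the following Python function compares a 'last_seen' timestamp per connection against a maximum age. The function name, the threshold parameter and the plain dict input are assumptions made for illustration; how DOTM itself evaluates connection age is not described here.

```python
import time


def find_stale_connections(connections, max_age_seconds, now=None):
    """Return the sorted keys of connections not seen within max_age_seconds.

    connections maps a connection key to a dict that holds a 'last_seen'
    Unix timestamp (as a number or a numeric string).
    """
    now = time.time() if now is None else now
    stale = []
    for key, details in connections.items():
        if now - float(details["last_seen"]) > max_age_seconds:
            stale.append(key)
    return sorted(stale)
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to check against recorded connection data.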

Connection and Service Tracking

Finally, those two sources of information are combined into a very simple "graphical" representation:

Node Graph

In this "node graph", color coding indicates Nagios alerts as well as service states that only DOTM discovered.

Server Installation

The DOTM server has the following dependencies:

  • netcat
  • redis-server
  • MaxMind GeoIP Lite (optional)
  • Python2 modules
    • redis
    • bottle
    • requests
    • GeoIP

To automatically install the server including its dependencies on Debian/Ubuntu, simply run:

scripts/install-server.sh

Agent Installation

The DOTM agent has the following dependencies:

  • glib-2.0
  • libevent2

To automatically install the dotm_node agent, simply run:

scripts/install-agent.sh

Because the agent has to run on all monitored systems, its single binary should be distributed to all nodes using your favourite automation tool.

Software Stack

DOTM will use the following technologies:

  • Simple remote agent "dotm_node" (in C using libevent and glib)
  • Redis as backend store
  • Python2 bottle with Jinja templating
  • JSON backend data access
  • any jQuery library for rendering

architecture overview

Redis Data Schema

So far the following relations are probably needed:

entity overview

Right now the following relation namespaces are used in Redis:

  • dotm::nodes (list of node names, resolvable via the local resolver and required to be identical to the remote hostname)
  • dotm::nodes::<node name>
    • 'last_fetch' => <timestamp>
    • 'fetch_status' => <'OK' or error message>
    • 'ips' => <comma separated list of IPs>
    • 'service_alerts' => <hash of service name - status tuples>
  • dotm::connections::<node name>::<port>::<remote node/IP> (hash):
    • 'process' => <string>
    • 'connections' => <int>
    • 'last_connection' => <timestamp>
    • 'last_seen' => <timestamp>
    • 'direction' => <in/out>
    • 'remote_host' => <IP or node name>
    • 'remote_port' => <port number or 'high'>
    • 'local_port' => <port number>
  • dotm::services::<node name>::<port> (hash with the following key values):
    • 'process' => <string>
    • 'last_seen' => <timestamp>
  • dotm::resolver::ip_to_node::<IP> (string, <node name>)
  • dotm::checks::nodes::<node name> (key with set expire):
    • JSON containing basic status information: { "node": "hostname01", "status": "UP", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "hostname01 status information" }
  • dotm::checks::services::<node name> (list of service JSONs with set expire):
    • List containing all associated node checks: [ { "node": "hostname01", "service": "Service01 name", "status": "OK", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service01 status information" }, { "node": "hostname01", "service": "Service02 name", "status": "CRITICAL", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service02 status information" } ]
  • dotm::config::* (all preferences, for descriptions check the 'Settings' page)
  • dotm::state (state info, usually update locks + timestamps)
    • last_updated (timestamp)
    • update_running (0 or 1)
    • last_snapshot (timestamp)
  • dotm::queue (list of queued backend tasks in JSON)
    • {"id": <task key>, "fn": <function name/action>, "args": <function arguments>, "kwargs": <function keywords>}
  • dotm::queue::result::<uuid4 name> (status and result of the queued task in JSON)
    • {"status": <pending/processing/ready>, "result": <result in JSON>}
  • dotm::history (list of history <timestamps>)
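To make the key layout above more concrete, here is a minimal Python sketch that builds key names and a queue task in the documented shape. The helper names (`connection_key`, `service_key`, `make_task`) are hypothetical, and using a uuid4 string both as the task "id" and as the suffix of the result key is an assumption; the sketch only produces strings and does not talk to Redis.

```python
import json
import uuid

NAMESPACE = "dotm"


def connection_key(node, port, remote):
    """Build a key of the form dotm::connections::<node name>::<port>::<remote node/IP>."""
    return "::".join([NAMESPACE, "connections", node, str(port), remote])


def service_key(node, port):
    """Build a key of the form dotm::services::<node name>::<port>."""
    return "::".join([NAMESPACE, "services", node, str(port)])


def make_task(fn, *args, **kwargs):
    """Return (JSON payload for dotm::queue, key under which the result is expected).

    Assumption: the task id is a uuid4 string that is reused as the suffix
    of the dotm::queue::result::<uuid4 name> key.
    """
    task_id = str(uuid.uuid4())
    task = {"id": task_id, "fn": fn, "args": list(args), "kwargs": kwargs}
    result_key = "::".join([NAMESPACE, "queue", "result", task_id])
    return json.dumps(task), result_key
```

With a Redis client, the payload would typically be pushed onto the dotm::queue list and the returned result key polled until its "status" field reads "ready".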