A modern approach to analyzing webserver access logs!
- Keep on reading for a detailed, top-down description of analog's features.
- Peruse this notebook for a hands-on introduction using my website's logs as an example.
- Consult this grammar for a concise summary of analog's fluent interface.
Analog builds on two technologies that have become ubiquitous when it comes to data processing:
- Notebooks, which provide an effective graphical read-eval-print-loop (REPL);
- Pandas, which handles the low-level aspects of data wrangling with its dataframe abstraction.
Analog then adds:
- Parsing and enriching the raw, textual access logs;
- File management to automatically ingest monthly logs and combine them into a single dataframe;
- A convenient, fluent interface that makes common analysis tasks easy, while seamlessly falling back onto Pandas for more complex tasks.
Many websites have switched to client analytics as a service. While certainly convenient and often free, these services also have a terrible track record when it comes to privacy and hence are entirely exploitative of website visitors. Even when they are self-hosted, the necessary client code adds unnecessary bloat to webpages. It also is far from guaranteed to produce meaningful results: by the time the code might run, users may have already moved on, or they may have blocked the client code or JavaScript altogether.
Being more respectful of website visitors and hence removing invasive client analytics is easy enough. But we'd still like to have some insight into how visitors use our websites. Well, there are server access logs! Alas, in most enterprises, those logs feed into larger log analytics and monitoring solutions, which are overkill for an individual or small business using shared hosting. Then there are the ancient AWStats and Webalizer, typically included with the equally ancient cPanel. Finally, there is the actively maintained GoAccess. While pretty nifty, even that tool shows its age: It's written in C and not exactly designed for extensibility or answering ad-hoc queries.
Analog relies on notebooks for a graphical REPL and Pandas for low-level data wrangling. It then adds a convenient, fluent interface that makes common analysis tasks easy. It also manages monthly log files, parsing and enriching the raw access logs as needed and automatically combining them into a single dataframe.
Analog stores all data for a website in a dedicated directory. It uses three subdirectories:
- `access-logs` stores monthly access logs in files named like `apparebit.com-Aug-2022.gz`.
- `enriched-logs` stores parsed and enriched monthly logs as Parquet files named like `apparebit.com-2022-08.parquet`.
- `location-db` stores IP location databases in GeoLite2 format named like `city-2022-07-26.mmdb`. Analog uses the most recent one.
Analog creates three files in its data directory:
- The combined dataframe, again in Parquet format, is named like `apparebit.com-2018-07-2022-07.parquet`.
- The metadata sidecar file in JSON format has the same name but with a `.json` file extension.
- `hostnames.json` caches previous DNS lookups of IP addresses, which are by far the slowest part of ingesting raw access logs.
When running analog from the command line or invoking `analog.latest()`, analog first ingests raw monthly logs that have no corresponding enriched log files. Then, if there is no combined log covering all monthly log files or one of those files was just updated, analog creates a new combined log and its metadata sidecar file.
When using the `--clean` command line option or invoking `analog.latest()` with a truthy `clean` keyword argument, analog starts by deleting all monthly log files stored in `enriched-logs`, which causes both monthly and combined log files to be re-generated. You can also delete these files manually. But please, do not delete `access-logs` or `hostnames.json`.
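For instance, a notebook cell might force a full re-ingestion like this. This is a minimal sketch, assuming that `analog.latest()` returns the combined dataframe:

```python
import analog

# Discard the enriched monthly logs first, then re-ingest everything.
# Assumes analog.latest() returns the combined Pandas dataframe.
frame = analog.latest(clean=True)
```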
Analog combines properties parsed from the raw access logs, derived from the original data, and derived from external databases for domain names, IP locations, and user agents. The `SCHEMA` mapping in the `analog.schema` module defines the Pandas schema for the resulting dataframes. It makes use of several enumerations defined in the `analog.label` module.
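To get a quick overview of the available columns, you might iterate over that mapping. This is a sketch, assuming `SCHEMA` maps column names to their Pandas dtypes:

```python
from analog.schema import SCHEMA

# Print every column of the combined dataframe with its dtype.
# Assumes SCHEMA maps column names to Pandas dtypes.
for column, dtype in SCHEMA.items():
    print(f"{column}: {dtype}")
```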
Note that analog uses two independent databases of user agents to detect bots, matomo and ua-parser. Each project detects a good number of bots not detected by the other. Hence, analog's `only.bots()` and `only.humans()` filters take both into account. Analog also fixes a minor misclassification made by ua-parser.
As of July 13, 2023, the latest version of the `ua-parser` package is 0.18.0. It was released five days before, on July 8, 2023. Since that package saw only two updates between 2018 and 2022, I used a forked version, `ua-parser-up2date`. Its latest version is 0.16.1, released on December 16, 2022. Looking at the two packages' update histories over the last couple of years, the original `ua-parser` seems preferable again.
Analog's fluent interface makes use of computed properties as well as methods. Properties typically distinguish between different types of clauses, whereas methods terminate the clauses. In the grammar below, property and method names are double quoted. The attribute selector's period is written as `<dot>`, and methods are followed by `()`, with parameters listed in between.
The following grammar summarizes the fluent interface. At the top level, a sentence consists of terms to specify (1) selection, (2) grouping and aggregation, as well as (3) display:

```
sentence -> selection grouping-and-aggregation display
```
The selection extracts rows that meet certain criteria. It distinguishes between three kinds of criteria, namely (1) terms that start with the `.only` property and filter based on attributes of the HTTP protocol, (2) terms that start with the `.over` property and filter based on datetime, and (3) terms that invoke `.select()` or `.map()` and thus serve as extension points. You can track the impact of these filters with the `.count_rows()` method, which appends the number of rows to the context's list inside a `with analog.fresh_counts()` block. It is an error to call this method outside such a block. Square brackets containing a slice select rows by their numbers.
```
selection ->
    | <dot> "only" <dot> protocol selection
    | <dot> "over" <dot> datetime selection
    | <dot> "select" (predicate) selection
    | <dot> "map" (mapper) selection
    | <dot> "count_rows" () selection
    | [ <slice> ] selection
    | ε
```
The protocol criterion contains several convenience methods that filter common protocol values. The `.has()` method is more general and can filter on the `content_type`, `method`, `protocol`, and `status_class` columns. Since the various enumeration constants defined in the `label` module uniquely identify the column, there is no need for also specifying the column name. In contrast, the `.equals()` method generalizes `.has()` to columns that do not have a categorical type and therefore requires the column name. Finally, the `.contains()` method implements a common operation on string-valued data.
```
protocol ->
    | "bots" ()
    | "humans" ()
    | "GET" ()
    | "POST" ()
    | ...
    | "markup" ()
    | ...
    | "successful" ()
    | "redirection" ()
    | "client_error" ()
    | "server_error" ()
    | "not_found" ()
    | "equals" (column, value)
    | "one_of" (column, value, value, ...)
    | "contains" (column, value)
```
The `.bots()` and `.humans()` methods categorize requests based on the `is_bot` and `is_bot2` properties. They concisely capture two different third-party classifications of the user agent header. Also see the hands-on notebook.
In contrast to Pandas' expressive and complex operations on times and dates, analog's datetime criterion is much simpler and more limited. It selects either the day, week, or year ending with the last entry in the log or an arbitrary range specified by two Python datetimes or Pandas timestamps. If your analysis focuses on calendar months, you may find that the `monthly_slice()` and `monthly_range()` functions in the `month_in_year` module come in handy. Note that all datetimes and timestamps must have a valid timezone. It defaults to UTC in analog's own code.
```
datetime ->
    | "last_day" ()
    | "last_week" ()
    | "last_year" ()
    | "range" (begin, end_inclusive)
```
About extensibility: Analog is designed to make common log analysis steps simple and thereby reduce the barrier to entry when using Pandas for log analysis. But for implementing uncommon analysis steps, you still need to use Pandas. In particular, you access the wrapped Pandas dataframe or series through the `.data` property.

Since unwrapping a dataframe, invoking a Pandas method, and then rewrapping the result is a bit tedious, analog has two extension methods that apply an arbitrary callable on the wrapped dataframe while also wrapping the result. The `.select()` method takes a predicate producing a boolean series, and the `.map()` method takes a transformation producing another dataframe.
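For example, an uncommon analysis step might look like this. This is a sketch in which the `path` column name is an assumption for illustration:

```python
# Keep only requests for blog pages, then deduplicate by path.
# The "path" column name is an assumption for illustration.
(
    analyze(frame)
    .select(lambda df: df["path"].str.startswith("/blog"))  # predicate: boolean series
    .map(lambda df: df.drop_duplicates("path"))             # mapper: new dataframe
    .requests()
)
```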
There are three options for grouping and aggregation: a rate and metric, just a metric by itself, or an explicit bypass of metrics with the `.just` property. Requiring an explicit bypass arguably is less elegant than just omitting unnecessary clauses. But it also keeps the implementation simpler and hence won out.
```
grouping-and-aggregation ->
    | rate <dot> metric
    | <dot> metric
    | <dot> "just"
```
A rate is indicated by the `.monthly` property. So far, I haven't seen the need to add more options.

```
rate -> <dot> "monthly"
```
Currently supported metrics are (1) the number of requests, (2) the value counts for a given column, and (3) the unique values for a given column. The `.status_classes()` and `.content_types()` methods are convenient aliases for specific value counts. The `.unique_values()` method makes little sense as a rate and hence is only supported without a preceding `.monthly` property.
```
metric ->
    | <dot> "requests" ()
    | <dot> "content_types" ()
    | <dot> "status_classes" ()
    | <dot> "value_counts" (column)
    | <dot> "unique_values" (column)
```
About result types: The result of a selection always is another wrapped Pandas dataframe. However, if the grouping and aggregation is just a metric without rate, the result of `.requests()` is an integer value that terminates the fluent expression. Other metrics without rate, such as `.value_counts()` and `.unique_values()`, produce a wrapped Pandas series. If the grouping and aggregation includes the rate, the result of `.requests()` is a wrapped Pandas series. Other metrics with rate produce a wrapped Pandas dataframe.
The display formats, prints, or plots the data. The `.format()` method converts the wrapped series or dataframe into lines of text. It terminates the fluent sentence to return the result. `.count_rows()` appends the number of rows to the context inside a `with analog.fresh_counts()` block, whereas square brackets containing a slice pick rows by their numbers. `.show()` displays the data as text and `.plot()` as a graph.
```
display ->
    | <dot> "format" ()
    | <dot> "count_rows" () display
    | [ <slice> ] display
    | <dot> "show" (rows = None) display
    | <dot> "plot" (**kwargs) display
    | <dot> "also" () sentence
    | <dot> "done" ()
    | ε
```
Finally, `.also()` starts another sentence, as long as the wrapped data is a dataframe, and `.done()` terminates the sentence. Since it returns `None`, the latter method suppresses the display of a series or dataframe in Jupyter notebooks.
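A full sentence might thus chain several displays. This sketch follows the grammar above; since `.monthly.status_classes()` yields a wrapped dataframe, `.also()` is permissible afterwards:

```python
# Show monthly status classes as a table, then plot the same data.
(
    analyze(frame)
    .only.humans()
    .monthly.status_classes()
    .show(rows=12)
    .also()
    .just
    .plot()
    .done()
)
```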
The implementation generally follows the grammar. A class implementing a clause typically has the same name as the corresponding nonterminal, though the name is CamelCased and prefixed with `Fluent`. All classes representing nonterminals inherit from the same abstract base class `FluentTerm`, which holds the wrapped state and provides convenient, private methods for creating new subclass instances. Since, as described above, some statistics result in series instead of dataframes, that base class and `FluentDisplay` are generic.
Three features of the implementation stand out, especially in a notebook:
- Wrapped series and dataframes display as HTML tables in Jupyter, when invoking `.show()` and when becoming a cell's value.
- When the fluent grammar generates new wrapped series, it makes sure that the series have meaningful index and data names.
- Wrapped series and dataframes support slicing by row numbers, so you can throttle the amount of data displayed in a notebook or interactive shell, even when relying on the notebook for doing the displaying.
The main entry point for fluent analysis is:
```python
def analyze(frame: pd.DataFrame) -> FluentSentence: ...
```
It returns an instance of `FluentSentence`. A second function recombines several wrapped or unwrapped series into a dataframe again, notably for plotting:

```python
def merge(
    *series: FluentTerm[pd.Series] | pd.Series,
    **named_series: FluentTerm[pd.Series] | pd.Series,
) -> FluentSentence:
    ...
```
The function returns a wrapped dataframe that combines all series given as arguments. For series passed with keyword arguments, it also renames the series to the keywords.
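For instance, plotting bot and human traffic side by side might look like this. This is a sketch assuming `.just` bypasses grouping and aggregation as described above:

```python
# Two monthly request series, renamed via keywords and plotted together.
humans = analyze(frame).only.humans().monthly.requests()
bots = analyze(frame).only.bots().monthly.requests()

merge(humans=humans, bots=bots).just.plot().done()
```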
The `count_rows()` method supported by `FluentSentence` and `FluentDisplay` requires a context that provides a list for those counts. You create the context through a `with fresh_counts() as counts` statement.
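A minimal sketch of tracking filter impact, assuming `count_rows()` and `fresh_counts()` behave as described:

```python
# Record the row count before and after filtering for bots.
with analog.fresh_counts() as counts:
    analyze(frame).count_rows().only.bots().count_rows().just.done()

print(counts)  # two entries: rows before and after the bot filter
```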
Happy, happy, joy, joy!
© 2022 Robert Grimm. Apache 2.0 license. GitHub.