html-parse

html-parse is an efficient, reasonably robust HTML tokenizer based on the HTML5 tokenization specification. The parser is written using the fast attoparsec parsing library and exposes both a native attoparsec Parser and convenience functions for lazily parsing token streams out of strict and lazy Text values.

For instance,

>>> parseTokens "<div><h1>Hello World</h1><br/><p class=widget>Example!</p></div>"
[TagOpen "div" [],TagOpen "h1" [],ContentText "Hello World",TagClose "h1",TagSelfClose "br" [],TagOpen "p" [Attr "class" "widget"],ContentText "Example!",TagClose "p",TagClose "div"]
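Because the result is an ordinary list of tokens, it can be post-processed with plain list functions. The following is a minimal sketch, not part of the library: it assumes that `parseTokens`, `Token(..)` and `Attr(..)` are imported from the `Text.HTML.Parser` module and that tag names, attribute names and attribute values are `Text`. The helper names `textContent` and `classValues` are hypothetical and exist only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
-- Assumption: the tokenizer lives in Text.HTML.Parser.
import           Text.HTML.Parser (Attr (..), Token (..), parseTokens)

-- | Hypothetical helper: concatenate the text of every ContentText token.
-- Tokens built with any other constructor are skipped.
textContent :: [Token] -> Text
textContent toks = T.concat [t | ContentText t <- toks]

-- | Hypothetical helper: collect the value of each @class@ attribute
-- found on an opening tag.
classValues :: [Token] -> [Text]
classValues toks =
  [v | TagOpen _ attrs <- toks, Attr "class" v <- attrs]

main :: IO ()
main = do
  let toks = parseTokens
        "<div><h1>Hello World</h1><br/><p class=widget>Example!</p></div>"
  TIO.putStrLn (textContent toks)
  print (classValues toks)
```

The pattern matches in the list comprehensions silently drop tokens that do not fit, so the helpers stay total even for constructors not shown in the example output above.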

Performance

Here are some typical performance numbers from parsing a fairly long Wikipedia article:

benchmarking Forced/tagsoup fast Text
time                 171.2 ms   (166.4 ms .. 177.3 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 171.9 ms   (169.4 ms .. 173.2 ms)
std dev              2.516 ms   (1.104 ms .. 3.558 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Forced/tagsoup normal Text
time                 176.9 ms   (167.3 ms .. 188.5 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 180.7 ms   (177.5 ms .. 183.7 ms)
std dev              4.246 ms   (2.316 ms .. 5.803 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/html-parser
time                 20.88 ms   (20.60 ms .. 21.25 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 20.99 ms   (20.81 ms .. 21.20 ms)
std dev              446.1 μs   (336.4 μs .. 596.2 μs)
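
As a rough reading of these figures, dividing the reported mean times gives the relative speed on this one input. This is plain arithmetic on the numbers shown above, not an additional measurement:

```latex
\frac{171.9\ \text{ms}}{20.99\ \text{ms}} \approx 8.2
\qquad
\frac{180.7\ \text{ms}}{20.99\ \text{ms}} \approx 8.6
```

On this article, html-parse is therefore roughly 8 to 9 times faster than either tagsoup configuration; other inputs may behave differently.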