f34nk/elixir_html_tools

This project is not maintained anymore

I do not intend to give a complete analysis here. If something is missing or plain wrong, send me a message, join the forum discussion, create an issue, or submit a PR.

Html tools in Elixir

The landscape of available Elixir packages for HTML tooling is small enough to survey at a glance, and also very focused: each library serves a distinct use case.

| | Floki | Meeseeks | Myhtmlex | ModestEx | TidyEx | HtmlSanitizeEx |
|---|---|---|---|---|---|---|
| First commit | Nov 2014 | Feb 2017 | Aug 2017 | Feb 2018 | April 2018 | July 2015 |
| HTML5 compliant | no with default parser; yes with html5ever parser (*) | yes with meeseeks_html5ever (*) | yes, as a binding to myhtml | yes, as a binding to the Modest library | yes, as a binding to the tidy-html5 library | no |
| Can parse XML | | yes | | | | |
| Supports XPath selectors | | yes | | | | |
| Supports common CSS selectors | yes (22) | yes (27) | | yes (36) | | |
| Supports custom CSS selectors | non-standard selector implemented | yes, flexible API for custom selectors | | non-standard selector implemented | | |
| Can manipulate nodes | yes, but limited | | | yes | | |
| Parser return type | {tag_name, attributes, children_nodes} | Meeseeks.Document | {tag_name, attributes, children_nodes} | String | String | String |
| Use case | parse and select | supports HTML and XML; custom selectors; CSS and XPath | fast HTML decode/encode | pipeable string transformations; provides 16 functions to manipulate HTML | corrects and cleans up HTML content by fixing markup errors | sanitize user input |

(*) There is also a separate benchmark available for Meeseeks vs. Floki performance.
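
To make the `{tag_name, attributes, children_nodes}` return type more concrete, here is a minimal sketch that walks such a tree in plain Elixir, without any of the libraries. It assumes attributes are a list of `{name, value}` tuples and text nodes are plain binaries; the module name `TreeSketch` and the sample tree are hypothetical and only serve as illustration.

```elixir
defmodule TreeSketch do
  # Collects all text below a node (or list of nodes), depth-first.
  def text(nodes) when is_list(nodes), do: Enum.map_join(nodes, "", &text/1)
  def text({_tag, _attrs, children}), do: text(children)
  def text(text) when is_binary(text), do: text
  # Anything else (for example a comment node) contributes no text.
  def text(_other), do: ""

  # Returns every node with the given tag name, in document order.
  def find_tag(nodes, tag) when is_list(nodes) do
    Enum.flat_map(nodes, &find_tag(&1, tag))
  end

  def find_tag({name, _attrs, children} = node, tag) do
    rest = find_tag(children, tag)
    if name == tag, do: [node | rest], else: rest
  end

  # Text nodes and other non-element values never match a tag.
  def find_tag(_other, _tag), do: []
end

# Hand-written tree in the assumed shape (not produced by a real parser).
tree =
  {"html", [],
   [
     {"body", [{"class", "main"}],
      [
        {"p", [], ["Hello "]},
        {"p", [{"id", "second"}], ["World"]}
      ]}
   ]}

IO.inspect(TreeSketch.find_tag(tree, "p"))
IO.puts(TreeSketch.text(tree))
```

Because the tree is ordinary tuples and lists, pattern matching is enough for simple traversals; the libraries add selector engines on top of a structure like this.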

Test

git clone
mix deps.get

The test folder contains examples of the library features side by side.

mix test

Benchmark

Tested versions:

{:floki, "~> 0.20.0"}
{:meeseeks, "0.7.6"}
{:myhtmlex, "~> 0.2.0"}
{:modest_ex, "~> 1.0.3"}
{:tidy_ex, "~> 1.0.0"}
{:html_sanitize_ex, "~> 1.3.0-rc3"}

Run benchmarks with:

MIX_ENV=prod mix bench

and

MIX_ENV=prod mix benchee

On my AMD FX-8300 Eight-Core Processor, 15 GB RAM, Ubuntu 14.04, the benchmarks look something like this:

## FlokiParseBench
bench iterations   average time 
0.2k       50000   50.18 µs/op
0.5k       20000   86.37 µs/op
1k          5000   304.72 µs/op
2k          5000   654.28 µs/op
5k          1000   1585.65 µs/op
10k          500   3843.19 µs/op
50k          100   16846.18 µs/op
100k          50   31044.22 µs/op
200k          20   80808.60 µs/op
350k          10   209489.90 µs/op

## MeeseeksParseBench
bench iterations   average time 
0.2k       20000   74.05 µs/op
0.5k       20000   78.40 µs/op
1k          5000   722.47 µs/op
2k          1000   1525.72 µs/op
5k          1000   2733.66 µs/op
10k          500   4770.79 µs/op
50k          100   11930.73 µs/op
100k         100   18903.71 µs/op
200k          50   31757.00 µs/op
350k          50   60043.98 µs/op

## MyhtmlexParseBench
bench iterations   average time 
0.5k        5000   401.32 µs/op
0.2k        5000   412.80 µs/op
1k          5000   515.46 µs/op
2k          5000   737.43 µs/op
5k          1000   1021.32 µs/op
10k         1000   1644.85 µs/op
50k         1000   2944.80 µs/op
100k         500   4749.36 µs/op
200k         200   7786.63 µs/op
350k         100   18435.59 µs/op

## ModestExParseBench
bench iterations   average time 
1k         10000   181.77 µs/op
0.2k       10000   216.83 µs/op
0.5k       10000   221.71 µs/op
2k          5000   319.47 µs/op
5k          5000   353.81 µs/op
10k         5000   731.99 µs/op
50k         1000   1599.91 µs/op
100k        1000   2951.25 µs/op
200k         500   5285.43 µs/op
350k         100   11944.52 µs/op

## TidyExParseBench
bench iterations   average time 
0.2k       10000   173.74 µs/op
0.5k       10000   201.40 µs/op
1k          5000   307.77 µs/op
2k          5000   442.71 µs/op
5k          1000   1452.07 µs/op
10k         1000   2687.98 µs/op
50k          200   8373.23 µs/op
100k         100   10168.21 µs/op
200k         100   19607.18 µs/op

## HtmlSanitizeExParseBench
bench iterations   average time 
0.2k       10000   173.68 µs/op
0.5k       10000   227.71 µs/op
1k          2000   765.60 µs/op
2k          1000   1791.06 µs/op
5k           500   3970.00 µs/op
10k          200   9017.30 µs/op
50k           50   39859.24 µs/op
100k          20   75973.80 µs/op
200k          10   178685.10 µs/op
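
To compare the libraries across input sizes, the averages above can be put into a map and the fastest entry picked per size. This is only a sketch over a subset of the numbers from the tables (0.2k, 1k, 10k, 100k); the variable names are made up for illustration, and the figures come from a single machine.

```elixir
# Average parse times in µs/op, copied from the tables above.
results = %{
  "0.2k" => %{floki: 50.18, meeseeks: 74.05, myhtmlex: 412.80,
              modest_ex: 216.83, tidy_ex: 173.74, html_sanitize_ex: 173.68},
  "1k" => %{floki: 304.72, meeseeks: 722.47, myhtmlex: 515.46,
            modest_ex: 181.77, tidy_ex: 307.77, html_sanitize_ex: 765.60},
  "10k" => %{floki: 3843.19, meeseeks: 4770.79, myhtmlex: 1644.85,
             modest_ex: 731.99, tidy_ex: 2687.98, html_sanitize_ex: 9017.30},
  "100k" => %{floki: 31044.22, meeseeks: 18903.71, myhtmlex: 4749.36,
              modest_ex: 2951.25, tidy_ex: 10168.21, html_sanitize_ex: 75973.80}
}

# For each size, pick the library with the lowest average time.
fastest =
  for size <- ["0.2k", "1k", "10k", "100k"] do
    {lib, time} = Enum.min_by(results[size], fn {_lib, time} -> time end)
    {size, lib, time}
  end

Enum.each(fastest, fn {size, lib, time} ->
  IO.puts("#{size}: #{lib} (#{time} µs/op)")
end)
```

Keep in mind that the libraries do different amounts of work per call (parsing, tidying, sanitizing), so the lowest number is not automatically the best fit for a given use case.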

Conclusions

The ecosystem of tools is still quite young. There is more to come.

As mentioned in the forum: in this test, Floki does not use the HTML5-compliant parser, since it is not supported by the latest Erlang version.

Nonetheless, a very rough user guideline could be:

If you are looking for parsing speed on small HTML strings (up to about 0.5 kB in the results above), Floki and Meeseeks are the fastest.

Floki offers all common CSS selectors and some limited features to manipulate nodes.

Meeseeks provides a flexible API for custom selectors. It can also parse XML and supports XPath selectors.

If you are looking for consistent performance across many file sizes, you can use Myhtmlex. With it you can encode and decode HTML very quickly.

However, if you need to do complex manipulations on the HTML string, you can use ModestEx. With it you get 36 CSS selectors and 16 functions to transform HTML strings.

For HTML5 spec accuracy or sanitizing user input, there are TidyEx and HtmlSanitizeEx.

All in all, I would say, the focused nature of the tools makes it easy for the user to pick the right tool for the job.

Best, f34nk