emphf

emphf is a minimal perfect hashing library for large-scale key sets focused on speed and low memory usage. A minimal perfect hash function (MPHF) is a data structure that maps injectively a static set of strings of size n to the integer set [0, n-1]. The overall space usage of the MPHFs generated with emphf is 2.61 n bits plus a small constant factor.

The algorithms used in this library are described in the paper Cache-Oblivious Peeling of Random Hypergraphs by Djamal Belazzougui, Paolo Boldi, Giuseppe Ottaviano, Rossano Venturini, and Sebastiano Vigna.

All the algorithms implemented here construct a MPHF for a given key set using a standard scheme based on the Majewski–Wormald–Havas–Czech (MWHC) construction, that is based on finding a peeling order of a random 3-hypergraph.

compute_mphf_seq computes the peeling order using the standard in-memory linear-time peeling algorithm, implemented with the xor-trick described in the paper. This is the fastest MPHF construction implementation that we know of.
compute_mphf_scan_mmap computes the peeling order using the new algorithm proposed in the paper, using an mmapped temporary file for its data structure, thus in an external-memory setting.
compute_mphf_hem uses a technique described in F. C. Botelho, R. Pagh, and N. Ziviani, “Practical perfect hashing in nearly optimal space”, that splits the key sets in bucket that are small enough that the perfect hash function is easy to compute in memory for each bucket. While the construction is faster than by using compute_mphf_scan_mmap, the resulting data structure takes slightly more space and it is slightly slower.

compute_mphf_seq and compute_mphf_scan_mmap construct the same data structure, but the second should be used when the space needed to construct the MPHF does not fit in main memory. The obtained MPHF data structures can perform lookups significantly faster than other implementations using the same algorithm (see the paper for details).

The script test_all.py executes all the algorithms on a given file (or the standard UNIX dictionary if none is provided) and checks that the generated hash function is indeed minimal and perfect.

The project uses CMake. To build it on Unix systems it should be sufficient to do the following:

$ cmake .
$ make

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
base_hash.hpp		base_hash.hpp
bitpair_vector.hpp		bitpair_vector.hpp
bitstream.hpp		bitstream.hpp
common.hpp		common.hpp
compute_mphf_generic.hpp		compute_mphf_generic.hpp
compute_mphf_hem.cpp		compute_mphf_hem.cpp
compute_mphf_scan.cpp		compute_mphf_scan.cpp
compute_mphf_scan_mmap.cpp		compute_mphf_scan_mmap.cpp
compute_mphf_seq.cpp		compute_mphf_seq.cpp
emphf_config.hpp.in		emphf_config.hpp.in
gen_synthetic_data.cpp		gen_synthetic_data.cpp
hypergraph.hpp		hypergraph.hpp
hypergraph_sorter_scan.hpp		hypergraph_sorter_scan.hpp
hypergraph_sorter_seq.hpp		hypergraph_sorter_seq.hpp
internal_memory_model.hpp		internal_memory_model.hpp
mmap_memory_model.hpp		mmap_memory_model.hpp
mphf.hpp		mphf.hpp
mphf_hem.hpp		mphf_hem.hpp
packed_edge_list.hpp		packed_edge_list.hpp
packed_vector.hpp		packed_vector.hpp
perfutils.hpp		perfutils.hpp
ranked_bitpair_vector.hpp		ranked_bitpair_vector.hpp
test_all.py		test_all.py
test_mphf.cpp		test_mphf.cpp
test_mphf_generic.hpp		test_mphf_generic.hpp
test_mphf_hem.cpp		test_mphf_hem.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

emphf

About

Releases 1

Packages

Languages

License

ot/emphf

Folders and files

Latest commit

History

Repository files navigation

emphf

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages