set-similarity-search-benchmarks

Commits

List of commits on branch master.
a567dee9c677d846065d183bded9fa73bbce7dc0  Update README.md (eekzhu committed 6 years ago)
f3b6ede6a3d7def79d3edfd8168d3b57c0030c90  Add new benchmark data sets (eekzhu committed 6 years ago)
a87712ca49ee53c66dc3474345b827015b06b92e  Update README.md (eekzhu committed 6 years ago)
66ae5ccf1672ef82cb2dccfcf0c270e1c8f71882  Update README.md (eekzhu committed 6 years ago)
1cdc838e2240e8f799e4c778ac682ec6264929f9  Update README.md (eekzhu committed 6 years ago)
1918b30d1200136ae11f75a5147eaa3c9a1f9620  Update README.md (eekzhu committed 6 years ago)

README

The README file for this repository.

Set Similarity Search Benchmarks

Benchmark data sets for set similarity search algorithms.

| Data set | Note | Number of sets | Number of tokens | File size | Papers |
|---|---|---|---|---|---|
| BMS-POS (Source) | A set is a purchase in a shop; a token is a product category in that purchase | 515,597 | 1,657 | 3.8 MB | 1 |
| Kosarak (Source) | A set is a user; a token is a link clicked by the user | 990,002 | 41,270 | 13 MB | 1 |
| Flickr | A set is a photo; a token is a tag or a word from the title | 1,680,490 | 810,660 | 29 MB | 1, 4 |
| Netflix (Source) | A set is a user; a token is a movie rated by the user | 480,189 | 17,770 | 166 MB | 1 |
| Orkut (Source) | A set is a user; a token is a group membership of the user | 1,853,285 | 15,293,693 | 378 MB | 1 |
| Canada-US-UK Open Data (Query Benchmark 1k, Query Benchmark 10k, Query Benchmark 100k) | A set is a table column; a token is a data value | 745,414 | 562,320,456 | 2.52 GB | 2 |
| WDC Web Table 2015, English Relational-Only (Query Benchmark 100, Query Benchmark 1k, Query Benchmark 10k) | A set is a table column; a token is a data value | 163,510,917 | 184,644,583 | 4.32 GB | 2, 3 |

All data sets follow the same format:

  • Compressed using gzip.
  • The first line of the main file is <number of sets> <number of tokens>, optionally followed by a third number, <sum of all set sizes>.
  • Every other line is <set size>\t<1>,<2>,<3>,..., where \t is a tab separator and <1>, <2>, and so on are the tokens of that set.
  • All tokens are integers, transformed from the original strings using a global ascending frequency order.
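
The format above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the function names (`parse_header`, `parse_set_line`, `iter_sets`) are hypothetical, and it assumes the header fields are separated by whitespace and that the declared set size should match the number of tokens on the line.

```python
import gzip
from typing import Iterator, List, Optional, Tuple


def parse_header(line: str) -> Tuple[int, int, Optional[int]]:
    """Parse the first line: number of sets, number of tokens,
    and (optionally) the sum of all set sizes."""
    fields = [int(x) for x in line.split()]  # assumption: whitespace-separated
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected header line: {line!r}")
    total_size = fields[2] if len(fields) == 3 else None
    return fields[0], fields[1], total_size


def parse_set_line(line: str) -> List[int]:
    """Parse one set line: the set size, a tab, then comma-separated tokens."""
    size_field, _, token_field = line.strip().partition("\t")
    tokens = [int(t) for t in token_field.split(",") if t]
    # Assumption: the declared size equals the number of tokens listed.
    if int(size_field) != len(tokens):
        raise ValueError(f"declared size {size_field} != {len(tokens)} tokens")
    return tokens


def iter_sets(path: str) -> Iterator[List[int]]:
    """Yield each set of a gzip-compressed data file as a list of integers."""
    with gzip.open(path, "rt") as f:
        parse_header(next(f))  # validate the header, then skip it
        for line in f:
            if line.strip():  # ignore blank lines, if any
                yield parse_set_line(line)
```

For example, `for tokens in iter_sets("path/to/dataset.gz"): ...` streams the sets one at a time (the path is a placeholder). Streaming matters for the larger files, which are several gigabytes and may not fit in memory once decompressed.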

Papers on set similarity search that use the above data sets:

  1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
  2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
  3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
  4. Spatio-textual similarity joins, VLDB 2012