Set Similarity Search Benchmarks

Benchmark data sets for set similarity search algorithms.

Data set	Note	Number of sets	Number of tokens	File size	Papers
BMS-POS (Source)	A set is a purchase in a shop; a token is a product category in that purchase	515,597	1,657	3.8 MB	1
Kosarak (Source)	A set is a user; a token is a link clicked by the user	990,002	41,270	13 MB	1
Flickr	A set is a photo; a token is a tag or a word from the title	1,680,490	810,660	29 MB	1,4
Netflix (Source)	A set is a user; a token is a movie rated by the user	480,189	17,770	166 MB	1
Orkut (Source)	A set is a user; a token is a group membership of the user	1,853,285	15,293,693	378 MB	1
Canada-US-UK Open Data (query benchmarks: 1k, 10k, 100k)	A set is a table column; a token is a data value	745,414	562,320,456	2.52 GB	2
WDC Web Table 2015, English Relational-Only (query benchmarks: 100, 1k, 10k)	A set is a table column; a token is a data value	163,510,917	184,644,583	4.32 GB	2,3

All data sets follow the same format:

Compressed using gzip.
The first line of the main file is <number of sets> <number of tokens>, optionally followed by a third number, <sum of all set sizes>.
Every other line is <set size>\t<1>,<2>,<3>,..., where \t is a tab separator and <1>, <2>, and so on are the comma-separated tokens of the set.
All tokens are integers, transformed from the original strings using a global ascending frequency order.
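
The format above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with this repository: the function names `read_header`, `iter_sets`, and `encode_by_frequency` are hypothetical. The reader assumes that the header fields are separated by whitespace and that `<set size>` equals the number of tokens on the line. The encoder illustrates one possible reading of "global ascending frequency order"; it assumes that frequency means the number of sets containing a token, that IDs start at 0, and that ties are broken by the original string. None of these details is stated above.

```python
import gzip
from collections import Counter
from typing import Dict, Iterator, List, Sequence, Tuple


def read_header(path: str) -> List[int]:
    """Return the integers on the first line of a gzip-compressed data set.

    The list has two entries (number of sets, number of tokens) or three
    when the optional sum of all set sizes is present.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [int(x) for x in f.readline().split()]


def iter_sets(path: str) -> Iterator[List[int]]:
    """Yield each set as a list of integer tokens, one set at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        f.readline()  # skip the header line
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue
            size_str, _, tokens_str = line.partition("\t")
            tokens = [int(t) for t in tokens_str.split(",")] if tokens_str else []
            # Sanity check (assumption): the declared size matches the token count.
            if len(tokens) != int(size_str):
                raise ValueError(f"expected {size_str} tokens, found {len(tokens)}")
            yield tokens


def encode_by_frequency(
    raw_sets: Sequence[Sequence[str]],
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Map string tokens to integers in ascending order of frequency.

    Assumptions: frequency is the number of sets containing the token,
    IDs start at 0, and ties are broken by the token string.
    """
    counts = Counter(tok for s in raw_sets for tok in set(s))
    order = sorted(counts, key=lambda tok: (counts[tok], tok))
    ids = {tok: i for i, tok in enumerate(order)}
    encoded = [sorted(ids[tok] for tok in set(s)) for s in raw_sets]
    return encoded, ids
```

Because `iter_sets` is a generator, it streams one set at a time instead of loading the whole file, which matters for the multi-gigabyte data sets in the table. With this encoding, rare tokens receive the smallest integers, so sorting a set's tokens in ascending order puts its rarest tokens first.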

Papers in set similarity search using the above data sets:

1. An Empirical Evaluation of Set Similarity Join Techniques, VLDB 2016
2. LSH Ensemble: Internet Scale Domain Search, VLDB 2016
3. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, SIGMOD 2019 (To Appear)
4. Spatio-textual similarity joins, VLDB 2012