DensitySampler

Commits

List of commits on branch main.

  • b2cf9b0bac4649a6038be3fc1662d93e514640e0 "vertex" (pperegilk, 7 months ago, unverified)
  • 347d2616251f83277f43af34c847de02b7103612 "test" (pperegilk, 7 months ago, unverified)
  • 5ed879a5dd6bfd2b67e5d1e56eccbe6e14c45dd0 "small changes" (pperegilk, 7 months ago, unverified)
  • 7c53a919bd6c7d632246c2d78509c123650ee205 "askllm" (pperegilk, 7 months ago, unverified)
  • e6f2d5e1228102ba0528732c91ac2b7378d5aaf2 "negative" (pperegilk, 8 months ago, unverified)
  • 2474533cda3db64a598640567e6c1fc19e334e97 "Update experiment.md" (pperegilk, 8 months ago, verified)

README

DensitySampler

This is an experimental implementation of density sampling for performing semantic deduplication of large corpora. The main idea is to remove semi-duplicates from large datasets to allow for faster and more accurate training.
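
As a toy illustration of what that means in practice (the inverse-density keep rule below is an assumption about the general technique, not a description of this repository's exact sampling step):

    import numpy as np

    # Keep each document with probability inversely proportional to its
    # estimated density: dense clusters of semi-duplicates are thinned out,
    # while rare documents are almost always kept.
    def density_sample(scores, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        keep_prob = scores.min() / scores   # rarest document gets probability 1.0
        return rng.random(len(scores)) < keep_prob

    scores = np.array([0.90, 0.88, 0.10, 0.20])  # two near-duplicates, two rare docs
    keep_mask = density_sample(scores)           # duplicates are likely dropped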

The script tries to implement a very efficient way of calculating this, based on the ideas in the following papers; a toy sketch of the underlying idea follows the list:

1. Coleman, Benjamin, and Anshumali Shrivastava. "Sub-linear race sketches for approximate kernel density estimation on streaming data." Proceedings of The Web Conference 2020. 2020.
2. Coleman, Benjamin, Richard Baraniuk, and Anshumali Shrivastava. "Sub-linear memory sketches for near neighbor search on streaming data." International Conference on Machine Learning. PMLR, 2020.
3. Coleman, Benjamin, and Anshumali Shrivastava. "A one-pass distributed and private sketch for kernel sums with applications to machine learning at scale." Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.
4. Coleman, Benjamin, et al. "One-pass diversified sampling with application to terabyte-scale genomic sequence streams." International Conference on Machine Learning. PMLR, 2022.
5. Liu, Zichang, et al. "One-Pass Distribution Sketch for Measuring Data Heterogeneity in Federated Learning." Advances in Neural Information Processing Systems 36 (2024).
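
The common thread in these papers is the RACE sketch: a small two-dimensional array of counters indexed by locality-sensitive hashes, whose counts approximate a kernel density estimate in a single pass over the stream. A toy NumPy version with signed-random-projection hashing (the hash family, sizes, and class name here are illustrative assumptions, not the script's actual choices) could look like this:

    import numpy as np

    class RaceSketch:
        """Toy RACE sketch: a (reps x hash_range) grid of counters, with one
        independent signed-random-projection LSH per repetition."""

        def __init__(self, dim, reps=1000, hash_range=20000, n_planes=15, seed=0):
            rng = np.random.default_rng(seed)
            # n_planes hyperplanes per repetition; the sign pattern of the
            # projections, folded modulo hash_range, picks a counter column.
            self.planes = rng.standard_normal((reps, n_planes, dim))
            self.counts = np.zeros((reps, hash_range), dtype=np.int32)
            self.hash_range = hash_range
            self.n = 0

        def _buckets(self, x):
            bits = (np.einsum('rpd,d->rp', self.planes, x) > 0).astype(np.int64)
            codes = bits @ (1 << np.arange(bits.shape[1]))
            return codes % self.hash_range

        def add(self, x):
            self.counts[np.arange(len(self.counts)), self._buckets(x)] += 1
            self.n += 1

        def density(self, x):
            # Average collision count across repetitions, normalised by the
            # number of inserted points: an estimate of the density at x.
            rows = np.arange(len(self.counts))
            return self.counts[rows, self._buckets(x)].mean() / self.n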

The script assumes the following directory structure:

main/
|-- original_corpus/
|-- paths/
|-- embeddings/
|-- normalised_embeddings/
|-- scratch/
|-- density_scores/
|-- final/
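
The layout can be created up front, for example with this convenience snippet (not part of the repository):

    from pathlib import Path

    # Create the expected directory layout under main/
    for name in ["original_corpus", "paths", "embeddings", "normalised_embeddings",
                 "scratch", "density_scores", "final"]:
        Path("main", name).mkdir(parents=True, exist_ok=True)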

The first part creates embeddings. Currently it uses sentence-transformers/all-MiniLM-L6-v2, which creates 384-dimensional embeddings. This can be replaced with any other encoder model from HuggingFace. The default model already applies L2 normalisation to its embeddings, so these can be saved directly to normalised_embeddings/. The script reads the text field of the jsonlines file. If your corpus is in Parquet, please use utils/convert_parquet_to_jsonlines.py first.

Note that this script takes quite a long time to run, even on fast computers. It works on single files, so it can easily be parallelised.

python create_embeddings.py --input_file myfile.jsonl --paths_dir paths --embeddings_dir embeddings --emb_size 384
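
In essence, the embedding step does something like the following (a simplified sketch using the sentence-transformers library; the file names and batch size are assumptions, and the real script also handles the paths bookkeeping):

    import json
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    # Read the text field from each jsonlines record
    with open("myfile.jsonl") as f:
        texts = [json.loads(line)["text"] for line in f]

    # The default model returns unit-length vectors when asked to normalise
    embeddings = model.encode(texts, batch_size=256, normalize_embeddings=True)
    np.save("embeddings/myfile.npy", embeddings.astype(np.float32))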

Note that for the default model the embeddings are already normalised. If you use another model that does not normalise its output, please use the script create_normalised_embeddings.py.
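
The normalisation itself is just row-wise L2 scaling; conceptually (a sketch, not the script's actual code):

    import numpy as np

    # Scale every embedding to unit L2 norm, guarding against zero rows
    emb = np.load("embeddings/myfile.npy")
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    np.save("normalised_embeddings/myfile.npy", emb / np.clip(norms, 1e-12, None))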

The next step is to create the density scores. This script takes roughly an hour per GB of data.

python create_density_scores.py --input_folder normalised_embeddings --output_folder density_scores --kernel_bandwidth 0.035 --sketch_reps 1000 --sketch_range 20000

There is also an experimental variant that works directly on the raw, non-normalised embeddings:

python create_density_scores.py --embedding_input_folder embeddings --json_output_folder nonormalised_density_scores --nonormalise

TODO:

  • Does not work
  • Batching is weird
  • Not merged with jsonlines file