
spacy-extreme

public
36 stars
2 forks
0 issues

Commits

List of commits on branch master.
  • aa17b9a07f5764d999be3df17cb62f9d6435b837 (Verified): Update main.py, committed 5 years ago
  • 0dae78bbf8e0eca3bae936875efaac497696a282 (Verified): Update README.md, committed 5 years ago
  • f162cd1dbf42ea3534e31fa8dfee78bbb53cdfbf (Verified): Update main.py, committed 5 years ago
  • 2a8a8ba71f46bf0c0e89ead1cfd1cb90704c6c06 (Verified): Update README.md, committed 5 years ago
  • 1f5191d138e6e6d0b95ac9e6d678dc50fc1b1052 (Unverified): Improved chunker, committed 5 years ago
  • e06accfa2c5cbbcb56aed5769d2abf8fbd4a4efa (Verified): Update README.md, committed 6 years ago

README


spacy-extreme

An example of how to use spaCy for extremely large files without running into memory issues

Memory issues with spaCy

EDIT: the memory issues with running the spaCy pipeline were fixed in #4486. I will keep this repo online, though, as an educational code snippet showing how to efficiently chunk your data. The rest of this section can be ignored.

spaCy is a popular, powerful NLP tool that can process a text and extract almost any information from it that you could need. Unfortunately, I started running into issues when multiprocessing a single file of 30GB+: memory usage kept growing, even in the simplest base case. A bug fix was not available, because it was not clear where the memory was leaking. One would expect the issue to lie in spaCy itself, but that would imply that reloading a spaCy instance should free the memory, and it does not. That makes a fix hard to find, because it is unclear where to start looking.
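
For reference, that base case looked roughly like the following sketch (the file name is illustrative, and the input is assumed to contain one sentence per line):

import multiprocessing as mp

import spacy

nlp = spacy.load("en_core_web_sm")

def parse(line):
    # Parse one line; in the problematic setup, worker memory
    # kept growing even for this trivial usage.
    return len(nlp(line))

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as fhin, mp.Pool(4) as pool:
        for _ in pool.imap_unordered(parse, fhin, chunksize=100):
            pass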

Because of that, I figured there must be another way. The solution lies in the multiprocessing library, more specifically in one of the parameters of Pool: maxtasksperchild ensures that a single child process will execute at most n tasks. After that, the process is killed, its memory is freed, and a new process takes its place. That is exactly what we need! Memory grows because a process keeps reading more and more data, so by limiting the number of batches a process may handle, we keep its memory usage in check.
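
A minimal sketch of that parameter in isolation (the work function and batches here are placeholders):

from multiprocessing import Pool

def work(batch):
    # Stand-in for the real per-batch processing.
    return len(batch)

if __name__ == "__main__":
    batches = [list(range(1000)) for _ in range(100)]
    # maxtasksperchild=5: after five batches a worker is killed,
    # its memory is released, and a fresh worker takes its place.
    with Pool(processes=4, maxtasksperchild=5) as pool:
        for _ in pool.imap_unordered(work, batches):
            pass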

Parsing huge files: how to go easy on memory?

Another issue you may face is processing an enormous file and distributing it over child processes without running into memory issues. We want to process such large files in batches, which makes processing more efficient. These batches cannot be too small, because then the workers would consume them too quickly, leaving only a few workers actively processing batches at any given time.

In the example code, you will find a Chunker class. This chunker retrieves file pointers from a file: integers representing a position in the file, which you can think of as a cursor position, in bytes. In every step, the cursor moves forward batch_size bytes and returns its new position. When a child process receives a cursor position, it looks it up in the file and reads a chunk of batch_size bytes, which it can then process. As may be clear, the actual file contents are never retrieved in the first step by the reader process: we do not want to share these huge chunks of data between processes, whereas a file pointer is just an integer, easily and quickly shared.
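
A simplified sketch of the idea (not the repo's exact Chunker; it assumes chunk boundaries are rounded up to the next newline so no line is split in two):

import os
from functools import partial
from multiprocessing import Pool

def chunk_positions(path, batch_size):
    # Yield (offset, length) pairs without reading any file contents.
    file_size = os.path.getsize(path)
    with open(path, "rb") as fhin:
        offset = 0
        while offset < file_size:
            # Move the cursor batch_size bytes ahead, then on to the
            # end of the current line.
            fhin.seek(offset + batch_size)
            fhin.readline()
            end = min(fhin.tell(), file_size)
            yield offset, end - offset
            offset = end

def process_chunk(path, position):
    # Only here, inside the child process, is the actual data read.
    offset, length = position
    with open(path, "rb") as fhin:
        fhin.seek(offset)
        text = fhin.read(length).decode("utf-8")
    return len(text.splitlines())

if __name__ == "__main__":
    path = "corpus.txt"  # illustrative input file
    with Pool(processes=4, maxtasksperchild=5) as pool:
        jobs = chunk_positions(path, batch_size=1_048_576)
        for _ in pool.imap_unordered(partial(process_chunk, path), jobs):
            pass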

Usage

usage: main.py [-h] [-b BATCH_SIZE] [--max-length MAX_LENGTH]
               [-m MAX_TASKS_PER_CHILD] [--min-length MIN_LENGTH]
               [-n N_WORKERS] [--spacy-model SPACY_MODEL]
               fin

Parse HUGE text files with spaCy in parallel without running into memory
issues.

positional arguments:
  fin                   input file.

optional arguments:
  -h, --help            show this help message and exit
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch size (in bytes). (default: 1048576)
  --max-length MAX_LENGTH
                        sentences with more than 'max_length' will not be
                        included in the output. (default: None)
  -m MAX_TASKS_PER_CHILD, --max-tasks-per-child MAX_TASKS_PER_CHILD
                        max number of batches that a child process can process
                        before it is killed and replaced. Use this when
                        running into memory issues. (default: 5)
  --min-length MIN_LENGTH
                        sentences with less than 'min_length' will not be
                        included in the output. (default: None)
  -n N_WORKERS, --n-workers N_WORKERS
                        number of workers to use (default depends on your
                        current system).
  --spacy-model SPACY_MODEL
                        spaCy model to use (must be installed). (default:
                        en_core_web_sm)

Best settings

It is hard to tell what the best settings are for a given combination of hardware and data. On a machine with 384GB of memory and 48 cores, I ran the script with the following settings; memory consumption never exceeded 78%.

  • -n 24: use 24 cores.
  • --spacy-model en_core_web_lg: the largest English spaCy model.
  • -b 50000000: a batch size of 50 MB (50,000,000 bytes). With my data, one such batch was roughly equivalent to 400k sentences.
  • -m 5: replace a process after it has processed 5 batches. In total, each process handles some 2M sentences before being replaced.
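
Put together, that run corresponds to an invocation along these lines (the input path is illustrative):

python main.py -n 24 --spacy-model en_core_web_lg -b 50000000 -m 5 corpus.txt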

If you do not have a lot of memory available, you will want to set --max-tasks-per-child (-m) to 1, so that an active process is replaced after every batch. In that case, make sure your batch size is not too small (e.g. no less than 100kB), to keep processing efficient.
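
On such a constrained machine, the invocation could look like this instead (again with an illustrative input path, and a 1MB batch size as one reasonable choice above that floor):

python main.py -m 1 -b 1000000 corpus.txt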