This repository contains recipes for using Prodigy and Hugging Face to annotate and review document layout datasets and to train models on them. We'll be fine-tuning a LayoutLMv3 model using FUNSD, a dataset of noisy scanned documents.
It also serves as an illustration of how to design document processing solutions. I attempted to generalize this approach into a framework, which you can read more about on my blog.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.
The following commands are defined by the project. They can be executed using `spacy project run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `install` | Install dependencies |
| `hydrate-db` | Hydrate the Prodigy database with annotated data from FUNSD |
| `review` | Review hydrated annotations |
| `train` | Train FUNSD model |
| `qa` | Perform QA for the test dataset using a trained model |
| `clean-db` | Drop all generated Prodigy datasets |
| `clean-files` | Clean all intermediary files |
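
If you want to drive these commands from a script rather than typing them by hand, something like the following may help. It is only a sketch: `build_run_command` and `run_step` are hypothetical helper names, not part of this project, and it assumes spaCy is installed in the same Python environment and that the script runs from the project directory. The loop only prints the command lines; nothing is executed until you call `run_step`.

```python
import shlex
import subprocess
import sys


def build_run_command(name, project_dir="."):
    """Return the argv list for `python -m spacy project run <name> <project_dir>`."""
    return [sys.executable, "-m", "spacy", "project", "run", name, project_dir]


def run_step(name, project_dir="."):
    """Run one project command or workflow; raises CalledProcessError on failure."""
    subprocess.run(build_run_command(name, project_dir), check=True)


# Print (but do not execute) the command line for a few of the commands above.
for step in ["install", "hydrate-db", "train"]:
    print(shlex.join(build_run_command(step)))
```

Calling `run_step("review")`, for example, would then start the review command as a subprocess.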
The following workflows are defined by the project. They can be executed using `spacy project run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `install` → `hydrate-db` → `train` |
| `clean-all` | `clean-db` → `clean-files` |
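
To get a feel for what "only re-run if their inputs have changed" means, here is a simplified illustration of checksum-based skipping. This is not spaCy's implementation: the helper names, the JSON lock file, and the demo file names are all made up for the example, and it only tracks inputs, not outputs.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def checksum(path):
    """Return the SHA-256 hex digest of a file, or None if it does not exist."""
    path = Path(path)
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None


def needs_rerun(name, inputs, lock_file):
    """True if the input checksums differ from those recorded for `name`."""
    lock_file = Path(lock_file)
    recorded = json.loads(lock_file.read_text()) if lock_file.is_file() else {}
    current = {str(p): checksum(p) for p in inputs}
    return recorded.get(name) != current


def record_run(name, inputs, lock_file):
    """Store the current input checksums for `name` in the (hypothetical) lock file."""
    lock_file = Path(lock_file)
    recorded = json.loads(lock_file.read_text()) if lock_file.is_file() else {}
    recorded[name] = {str(p): checksum(p) for p in inputs}
    lock_file.write_text(json.dumps(recorded, indent=2))


# Demo in a throwaway directory: first run, unchanged inputs, then an edited input.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.json"
    lock = Path(tmp) / "demo.lock.json"
    data.write_text("v1")
    before_first_run = needs_rerun("train", [data], lock)  # nothing recorded yet
    record_run("train", [data], lock)
    after_first_run = needs_rerun("train", [data], lock)   # inputs unchanged
    data.write_text("v2")
    after_edit = needs_rerun("train", [data], lock)        # input content changed

print(before_first_run, after_first_run, after_edit)
```

In the demo, the step needs to run the first time, is skipped while the input is unchanged, and needs to run again once the input file is edited.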
The following assets are defined by the project. They can be fetched by running `spacy project assets` in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/funsd.zip` | URL | FUNSD dataset - noisy scanned documents for layout understanding |