prodigy-pdf-custom-recipe

Commits

List of commits on branch master.
7a9025a5b1949755477ae6f193f4c1ab4037c3b7 (Verified) Update README and docs (#7), lljvmiranda921 committed 3 years ago
65ffefc4bcbb39fcfc64626219c12e9ca6d22a47 (Unverified) Remove unnecessary imports, lljvmiranda921 committed 3 years ago
8793b3dbb818b8fff7388d96c7f7fba84f4e4d8b (Unverified) Run isort to codebase, lljvmiranda921 committed 3 years ago
50651040a25f51a83f9c2d81f925827b100fae60 (Unverified) Add thresholding value for image.qa, lljvmiranda921 committed 3 years ago
b3ea9c1c96767bfdecf5b09fe60185f184eaec87 (Unverified) Update README with new commands, lljvmiranda921 committed 3 years ago
3765f6cfc9a76472d1cc51dce72f62f8b04bc949 (Unverified) Implement majority of the QA step, lljvmiranda921 committed 3 years ago

🪐 spaCy Project: Prodigy recipes for document processing and layout understanding

This repository contains recipes on how to use Prodigy and Hugging Face for annotating, training, and reviewing document layout datasets. We'll be finetuning a LayoutLMv3 model using FUNSD, a dataset of noisy scanned documents.

This also serves as an illustration of how to design document processing solutions. I attempted to generalize this approach into a framework, which you can read more about on my blog.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| install | Install dependencies |
| hydrate-db | Hydrate the Prodigy database with annotated data from FUNSD |
| review | Review hydrated annotations |
| train | Train FUNSD model |
| qa | Perform QA for the test dataset using a trained model |
| clean-db | Drop all generated Prodigy datasets |
| clean-files | Clean all intermediary files |
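
The "only re-run if their inputs have changed" behaviour can be pictured with a small sketch. This is a simplified illustration of the idea, not spaCy's actual implementation: the helper names (`file_digest`, `needs_rerun`, `record_run`) and the in-memory `lock` dictionary are assumptions made up for this example.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a hex digest of the file's bytes (hypothetical helper)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(deps: list[Path]) -> dict[str, str]:
    """Map each dependency path to the digest of its current contents."""
    return {str(p): file_digest(p) for p in deps}


def needs_rerun(command: str, deps: list[Path], lock: dict[str, dict[str, str]]) -> bool:
    """True if the command was never recorded or any dependency changed."""
    return snapshot(deps) != lock.get(command)


def record_run(command: str, deps: list[Path], lock: dict[str, dict[str, str]]) -> None:
    """Store the dependency digests seen when the command last ran."""
    lock[command] = snapshot(deps)
```

Under these assumptions, a runner would call `needs_rerun` before each command, skip the command when it returns `False`, and call `record_run` after a successful run.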

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | install → hydrate-db → train |
| clean-all | clean-db → clean-files |
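
To make the relationship between workflows and commands concrete, here is a minimal sketch that expands a workflow name into the individual `spacy project run` invocations it stands for. The `WORKFLOWS` dictionary and the `expand` function are hypothetical names for this illustration; the step lists are read off the command names listed earlier. In practice `spacy project run all` performs this expansion itself, so the sketch only shows what happens conceptually.

```python
from __future__ import annotations

import sys

# Workflow steps as listed in this README (assumed split by command name).
WORKFLOWS = {
    "all": ["install", "hydrate-db", "train"],
    "clean-all": ["clean-db", "clean-files"],
}


def expand(name: str) -> list[list[str]]:
    """Return one argv list per step; a plain command expands to itself."""
    steps = WORKFLOWS.get(name, [name])
    return [
        [sys.executable, "-m", "spacy", "project", "run", step]
        for step in steps
    ]
```

Each returned list could be handed to `subprocess.run(argv, check=True)` from the project directory, so that a failing step stops the remaining ones.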

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/funsd.zip | URL | FUNSD dataset - noisy scanned documents for layout understanding |