community

public
26 stars
8 forks
0 issues

Commits

List of commits on branch main.
Verified
3a227950ec0cd04dcd2b88290686299c4c24a325

docs: add huggingface link to README (#34)

LLaverdeS committed 2 years ago
Verified
e2e838373a737b35512c2b8076ed3f7dcf5a51bf

docs: refinement of repo and README (#29)

LLaverdeS committed 2 years ago
Verified
4b7138bbb224d7064bf578b0bb64bb9cf87288b8

docs: update issue templates (#30)

LLaverdeS committed 2 years ago
Verified
e390504629bddfe4dfb14e78c908ce089fcad2c2

chore: Python 3.8.15 is the new 3.8.14 (#24)

ccragwolfe committed 2 years ago
Verified
e81228460f34931455980328d8734004f61aecd5

docs: add CONTRIBUTING.md and mention it from README.md (#22)

LLaverdeS committed 2 years ago
Verified
dfc2ef09185b889f39150a14c5f4a1d5630b6f9d

docs: Add output_format to processing pipelines spec (#21)

yyuming-long committed 2 years ago

README

The README file for this repository.

Open-Source Pre-Processing Tools for Unstructured Data

Welcome to the Unstructured Community! 😊

We are building an ecosystem of preprocessing pipeline tools for Data Scientists and Data Engineers, so they may quickly work through the challenge of extracting structured data from unstructured raw documents.

☕ Getting Started

Unstructured's open-source packages currently target Python 3.8. If you are using or contributing to Unstructured code, we encourage you to work with Python 3.8 in a virtual environment. You can use the following instructions to get up and running with a Python 3.8 virtual environment with pyenv-virtualenv:

Mac / Homebrew

  1. Install pyenv with brew install pyenv.
  2. Install pyenv-virtualenv with brew install pyenv-virtualenv.
  3. Follow the instructions here to add the pyenv-virtualenv startup code to your terminal profile.
  4. Install Python 3.8 by running pyenv install 3.8.15.
  5. Create and activate a virtual environment by running:
pyenv virtualenv 3.8.15 unstructured
pyenv activate unstructured

You can change the name of the virtual environment from unstructured to another name if you're creating a virtual environment for a specific pipeline. For example, if you're creating a virtual environment for the SEC preprocessing pipeline, you can run pyenv virtualenv 3.8.15 sec.

Linux

  1. Run git clone https://github.com/pyenv/pyenv.git ~/.pyenv to install pyenv.
  2. Run git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv to install pyenv-virtualenv as a pyenv plugin.
  3. Follow steps 3-5 from the Mac/Homebrew instructions.
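Once the environment is activated on either platform, a quick check can confirm that the active interpreter is on the 3.8 line the packages target. The following is a minimal sketch; the helper name matches_target is hypothetical and not part of any Unstructured package.

```python
import sys
from typing import Tuple

# Assumption: the (3, 8) target is taken from the Getting Started text;
# the helper name below is made up for this illustration.
TARGET: Tuple[int, int] = (3, 8)


def matches_target(version: Tuple[int, ...], target: Tuple[int, int] = TARGET) -> bool:
    """Return True when the major.minor part of `version` equals `target`."""
    return tuple(version[:2]) == target


if __name__ == "__main__":
    current = sys.version_info
    if matches_target(current):
        print(f"OK: running Python {current.major}.{current.minor}.{current.micro}")
    else:
        print(f"Warning: expected Python 3.8, found {current.major}.{current.minor}")
```

Running this inside the activated environment should report the 3.8.x version installed with pyenv; a warning usually means the virtual environment is not active in the current shell.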

👐 Contributions

We welcome contributions! See all open issues for bugs, features, and enhancement requests in the community.

When contributing, please follow our Contributing to Unstructured guidelines.

Don't hesitate to reach out to us on Slack with any questions. Thank you!

📗 Key Concepts

🧱 Bricks

Bricks are the "blocks", or Python functions, from which preprocessing pipelines are made. They are organized in the Unstructured library. Collectively, they form the Swiss Army knife that Python developers can use to extract structured data from raw documents into the format that they want. The library may be used independently of any other Unstructured repos under the terms of its license. Run pip install unstructured and you are good to go.
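To make the idea of small, composable functions concrete, here is a minimal sketch in plain Python. Every function name below is a hypothetical stand-in written for this illustration; none of them is taken from the unstructured library, whose actual bricks are described in its own documentation.

```python
import re
from typing import Callable, List

# Hypothetical stand-ins for bricks. These are NOT functions from the
# unstructured library; they only illustrate the composition pattern.


def split_paragraphs(text: str) -> List[str]:
    """Partitioning-style step: split raw text on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def collapse_whitespace(text: str) -> str:
    """Cleaning-style step: squeeze runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def straighten_quotes(text: str) -> str:
    """Cleaning-style step: replace curly quotes with ASCII quotes."""
    table = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
    return text.translate(table)


def apply_bricks(text: str, bricks: List[Callable[[str], str]]) -> str:
    """Run each single-purpose function over the text, in order."""
    for brick in bricks:
        text = brick(text)
    return text


raw = "ITEM 1.   BUSINESS\n\n\u201cWe\u201d   build\n tools."
cleaned = [
    apply_bricks(paragraph, [collapse_whitespace, straighten_quotes])
    for paragraph in split_paragraphs(raw)
]
print(cleaned)
```

Because each step takes text and returns text, steps can be reordered, dropped, or reused across pipelines without touching the others.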

🔹 Preprocessing pipeline APIs

A preprocessing pipeline API (or just "pipeline API") is a notebook that includes a Python function capable of transforming a raw document to structured data. By following the documented conventions, FastAPI APIs may be auto-generated from a pipeline notebook.

See pipeline-sec-filings for an example repo that includes a preprocessing pipeline API and an auto-generated FastAPI.
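As a rough illustration of "a Python function capable of transforming a raw document to structured data", the sketch below turns plain text into a list of typed elements. The function name example_pipeline, the element labels, and the output shape are all assumptions made for this example; the conventions required for FastAPI generation are the ones documented in unstructured-api-tools and are not reproduced here.

```python
import json
from typing import Dict, List


def example_pipeline(text: str) -> List[Dict[str, str]]:
    """Hypothetical pipeline function: raw text in, structured elements out.

    The "Title" / "NarrativeText" labels and the all-caps heuristic are
    illustrative assumptions, not the behavior of any Unstructured pipeline.
    """
    elements: List[Dict[str, str]] = []
    for block in text.split("\n\n"):
        block = " ".join(block.split())  # normalize internal whitespace
        if not block:
            continue
        kind = "Title" if block.isupper() else "NarrativeText"
        elements.append({"type": kind, "text": block})
    return elements


raw_document = "RISK FACTORS\n\nOur business depends on\nthird-party suppliers."
print(json.dumps(example_pipeline(raw_document), indent=2))
```

A function with a single clear input and a JSON-serializable output like this is the kind of unit that lends itself to being wrapped in a web endpoint.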

🔩 Developer tools for generating FastAPIs

The unstructured-api-tools library includes the tooling required to create FastAPIs from pipeline notebooks.

🤗 Hugging Face

Hugging Face Spaces offer a simple way to host ML demo apps, models and datasets directly on our organization’s profile. This allows us to showcase our projects and work collaboratively with other people in the ML ecosystem. Visit our space here!