community

public
26 stars
8 forks
0 issues

Commits

List of commits on branch main.
Verified
d564cda85fbb62a0f09dc1d2555d492972c597ee

doc: fix example

cragwolfe committed 2 years ago
Verified
10188af80d548513a8d740594d540cbe1b71a144

docs: update slack link (#45)

MthwRobinson committed 2 years ago
Verified
25b8cc79fa26cc7cb2d4e09fbf763b30d5671a59

doc: README update (#44)

cragwolfe committed 2 years ago
Verified
80023af5536f2e8225b2b6bff3c999cc83f1ee7b

fix: make first image appear in argilla notebook (#39)

MthwRobinson committed 2 years ago
Verified
6c1b53f5ffb1a8b225fe3a1c9744935ff00b8708

docs: summarization model training with `unstructured` + `argilla` + `transformers` (#38)

MthwRobinson committed 2 years ago
Verified
c11bfc5c14e2c448a73e24d50a30ebe60376010d

docs: Pipeline spec for compressed files, json responses (#35)

cragwolfe committed 2 years ago

README

The README file for this repository.

Open-Source Pre-Processing Tools for Unstructured Data

Welcome to the Unstructured Community! 😊

We are building an ecosystem of preprocessing pipeline tools for Data Scientists and Data Engineers, so they may quickly work through the challenge of extracting structured data from unstructured raw documents.

☕ Getting Started

Unstructured's open-source packages currently target Python 3.8. If you are using or contributing to Unstructured code, we encourage you to work with Python 3.8 in a virtual environment. You can use the following instructions to get up and running with a Python 3.8 virtual environment with pyenv-virtualenv:

Mac / Homebrew

  1. Install pyenv with brew install pyenv.
  2. Install pyenv-virtualenv with brew install pyenv-virtualenv.
  3. Follow the instructions here to add the pyenv-virtualenv startup code to your terminal profile.
  4. Install Python 3.8 by running pyenv install 3.8.15.
  5. Create and activate a virtual environment by running:
pyenv virtualenv 3.8.15 unstructured
pyenv activate unstructured

You can change the name of the virtual environment from unstructured to another name if you're creating a virtual environment for a pipeline. For example, if you're creating a virtual environment for the SEC preprocessing, you can run pyenv virtualenv 3.8.15 sec.

Linux

  1. Run git clone https://github.com/pyenv/pyenv.git ~/.pyenv to install pyenv.
  2. Run git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv to install pyenv-virtualenv as a pyenv plugin.
  3. Follow steps 3-5 from the Mac/Homebrew instructions.
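
Once the environment is active on either platform, a quick check like the one below can confirm that the interpreter matches the 3.8 target. This is only a sketch; the helper name meets_target is made up for illustration and is not part of any Unstructured package.

```python
import sys
from typing import Tuple


def meets_target(version: Tuple[int, ...], target: Tuple[int, int] = (3, 8)) -> bool:
    """Return True when the major.minor part of `version` equals `target`.

    Hypothetical helper for this README, not an Unstructured API.
    """
    return tuple(version[:2]) == target


if __name__ == "__main__":
    # sys.version_info behaves like a tuple: (major, minor, micro, ...)
    if meets_target(sys.version_info):
        print("Python 3.8 detected; environment looks right.")
    else:
        print("Expected Python 3.8, found", sys.version.split()[0])
```

Running the script inside the activated virtual environment prints which interpreter is in use.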

👐 Contributions

We welcome contributions! See all open issues for bugs, features, and enhancement requests in the community.

When contributing, please follow our Contributing to Unstructured guidelines.

Don't hesitate to reach out to us on Slack with any questions. Thank you!

📗 Key Concepts

🧱 Bricks

Bricks are the "blocks", or Python functions, from which preprocessing pipelines are made; they are organized in the Unstructured library. Collectively they form the Swiss Army knife that Python developers can use to extract structured data from raw documents into the format they want. The library may be used independently of any other Unstructured repos under the terms of its license. Run pip install unstructured and you are good to go.
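
To make the idea of composing small functions concrete, here is a minimal sketch that uses only the Python standard library. The three functions are hypothetical stand-ins written for this README; they are not the functions shipped in the unstructured package, whose actual names and signatures are documented in the library itself.

```python
import re
from typing import Dict, List


def partition_paragraphs(text: str) -> List[str]:
    """Stand-in 'partitioning' brick: split raw text on blank lines."""
    return [chunk for chunk in re.split(r"\n\s*\n", text) if chunk.strip()]


def clean_whitespace(chunk: str) -> str:
    """Stand-in 'cleaning' brick: collapse whitespace runs to one space."""
    return re.sub(r"\s+", " ", chunk).strip()


def stage_as_records(chunks: List[str]) -> List[Dict[str, str]]:
    """Stand-in 'staging' brick: shape chunks as dicts for a downstream tool."""
    return [{"id": str(i), "text": chunk} for i, chunk in enumerate(chunks)]


def run_pipeline(raw: str) -> List[Dict[str, str]]:
    """Chain the bricks: partition, then clean each chunk, then stage."""
    chunks = partition_paragraphs(raw)
    cleaned = [clean_whitespace(chunk) for chunk in chunks]
    return stage_as_records(cleaned)


raw_document = "Title  of the\nfiling\n\n\nFirst   paragraph of text.\n"
records = run_pipeline(raw_document)
```

Because each function does one thing, any of them can be swapped out or reused on its own, which is the property the "brick" metaphor is meant to capture.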

🔹 Preprocessing pipeline APIs

A preprocessing pipeline API (or just "pipeline API") is a notebook that includes a Python function capable of transforming a raw document to structured data. By following the documented conventions, FastAPI APIs may be auto-generated from a pipeline notebook.
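
As a rough illustration of "a Python function capable of transforming a raw document to structured data", the sketch below takes raw text and returns JSON-serializable elements. The function name, the element types, and the title heuristic are assumptions made for this example; the conventions a notebook must actually follow for API generation are the ones documented by the Unstructured tooling, and they are not reproduced here.

```python
import json
from typing import Dict, List


def pipeline_api(text: str) -> List[Dict[str, str]]:
    """Hypothetical pipeline function: raw text in, structured elements out."""
    elements = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Crude heuristic (an assumption for this sketch): short lines with
        # no terminal punctuation are treated as titles.
        is_title = len(line) < 60 and not line.endswith((".", "?", "!"))
        elements.append(
            {"type": "Title" if is_title else "NarrativeText", "text": line}
        )
    return elements


sample = "Risk Factors\n\nOur business depends on a small number of customers.\n"
# A web endpoint wrapping this function would return a body like this one.
response_body = json.dumps(pipeline_api(sample))
```

Keeping the function free of any web-framework code is what makes it plausible to generate the API layer around it automatically.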

See pipeline-sec-filings for an example repo that includes a preprocessing pipeline API and an auto-generated FastAPI.

🔩 Developer tools for generating FastAPIs

The unstructured-api-tools library includes the tooling required to create FastAPIs from pipeline notebooks.

🤗 Hugging Face

Hugging Face Spaces offer a simple way to host ML demo apps, models and datasets directly on our organization’s profile. This allows us to showcase our projects and work collaboratively with other people in the ML ecosystem. Visit our space here!