community

public
26 stars
8 forks
0 issues

Commits

List of commits on branch main.
Verified
e060301592150582929d368c9b0f726165756cf9

docs: Basic install instructions for pyenv-virtualenv (#20)

MthwRobinson committed 2 years ago
Verified
a1340d3582ab0fc13a632917eef72ec79f234585

fix typo (#19)

qued committed 2 years ago
Verified
c7c2f293673074970de8337dfb13bcd52a2fc606

chore: Move get_version to the fetch.py module

MthwRobinson committed 2 years ago
Verified
d5adf09d2161290433d47082541a233af8af8491

chore: minor notebook tweaks (#16)

cragwolfe committed 2 years ago
Verified
54f46b155de007a0aef646f0733fd2e0d7455800

docs: SEC Sentiment Analysis Example (#15)

MthwRobinson committed 2 years ago
Verified
50c9d2d60bd166fabeafd2cefee70ee006ecd8dd

docs: add Preprocessing Pipelines spec (#10)

cragwolfe committed 2 years ago

README

The README file for this repository.

Open-Source Pre-Processing Tools for Unstructured Data

Welcome to the Unstructured Community! 😊

We are building an ecosystem of preprocessing pipeline tools for Data Scientists and Data Engineers, so they may quickly work through the challenge of extracting structured data from unstructured raw documents.

☕ Getting Started

Unstructured's open-source packages currently target Python 3.8. If you are using or contributing to Unstructured code, we encourage you to work with Python 3.8 in a virtual environment. You can use the following instructions to set up a Python 3.8 virtual environment using pyenv-virtualenv:

Mac / Homebrew

  1. Install pyenv with brew install pyenv.
  2. Install pyenv-virtualenv with brew install pyenv-virtualenv.
  3. Follow the instructions here to add the pyenv-virtualenv startup code to your terminal profile.
  4. Install Python 3.8 by running pyenv install 3.8.15.
  5. Create and activate a virtual environment by running:
pyenv virtualenv 3.8.15 unstructured
pyenv activate unstructured

You can change the name of the virtual environment from unstructured to another name if you're creating a virtual environment for a pipeline. For example, if you're creating a virtual environment for the SEC preprocessing, you can run pyenv virtualenv 3.8.15 sec.

Linux

  1. Run git clone https://github.com/pyenv/pyenv.git ~/.pyenv to install pyenv.
  2. Run git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv to install pyenv-virtualenv as a pyenv plugin.
  3. Follow steps 3-5 from the Mac/Homebrew instructions.
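
Once the virtual environment is active on either platform, a quick check like the one below can confirm that the interpreter matches the 3.8 target. This is a minimal sketch and not part of any Unstructured package: the helper name matches_target and the choice to compare only the major and minor version are assumptions made for illustration.

```python
import sys

# Assumption: the README only says the packages "currently target Python 3.8",
# so this check compares (major, minor) and ignores the patch level.
TARGET_VERSION = (3, 8)


def matches_target(version_info=sys.version_info, target=TARGET_VERSION):
    """Return True when the interpreter's (major, minor) equals the target."""
    return tuple(version_info[:2]) == tuple(target)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if matches_target():
        print(f"Python {current} matches the 3.8 target.")
    else:
        print(f"Python {current} is active; the README recommends 3.8.")
```

Running the script with the environment activated should report a 3.8.x interpreter; any other version suggests the environment was not activated in the current shell.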

👐 Contributions

We welcome contributions! See all open issues for bugs, features, and enhancement requests in the community.

When contributing, please follow our Contributing to Unstructured guidelines.

Don't hesitate to reach out to us on Slack with any questions. Thank you!

📗 Key Concepts

🧱 Bricks

Bricks are the "blocks" or Python functions from which preprocessing pipelines are made, and are organized in the Unstructured library. Collectively, they form the Swiss Army knife that Python developers can use to extract structured data from raw documents into the format that they want. They may be used independently of any other Unstructured repos under the terms of the library's license. Run pip install unstructured and you are good to go.
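
To make the idea of composing small functions concrete, here is a minimal sketch of two brick-like helpers written in plain Python. The names collapse_whitespace and looks_like_title, and the title heuristic itself, are hypothetical and are not taken from the unstructured package; the library's own documentation lists the functions it actually ships.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical cleaning brick: collapse whitespace runs into one space."""
    return re.sub(r"\s+", " ", text).strip()


def looks_like_title(text: str, max_words: int = 12) -> bool:
    """Hypothetical classification brick.

    Treats a non-empty line as a title when it has at most `max_words`
    words and does not end with a period.
    """
    words = text.split()
    return 0 < len(words) <= max_words and not text.rstrip().endswith(".")


# Bricks are plain functions, so they compose with ordinary Python.
raw_lines = [
    "  ITEM 1A.   RISK FACTORS ",
    "Our  business depends on\n many factors.",
]
cleaned = [collapse_whitespace(line) for line in raw_lines]
titles = [line for line in cleaned if looks_like_title(line)]
print(titles)
```

Because each helper takes text in and returns a simple value, either one can be used on its own or swapped out without touching the rest of a pipeline.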

🔹 Preprocessing pipeline APIs

A preprocessing pipeline API (or just "pipeline API") is a notebook that includes a Python function capable of transforming a raw document to structured data. By following the documented conventions, FastAPI APIs may be auto-generated from a pipeline notebook.
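
As a rough illustration of "a Python function capable of transforming a raw document to structured data", the sketch below splits raw text into paragraphs and labels each one. The function name example_pipeline, the record shape, and the labeling rule are all assumptions for illustration; the naming and notebook conventions that the auto-generation relies on are defined by the Unstructured tooling and are not reproduced here.

```python
import json


def example_pipeline(text: str) -> list:
    """Turn raw text into a list of {"type", "text"} records.

    Paragraphs are split on blank lines. A paragraph is labeled "Title"
    when it is short and does not end with a period, and "NarrativeText"
    otherwise. Both labels are illustrative choices.
    """
    records = []
    for block in text.split("\n\n"):
        # Normalize internal whitespace, including single line breaks.
        paragraph = " ".join(block.split())
        if not paragraph:
            continue
        is_title = len(paragraph.split()) <= 12 and not paragraph.endswith(".")
        records.append(
            {"type": "Title" if is_title else "NarrativeText", "text": paragraph}
        )
    return records


raw_document = """RISK FACTORS

Our business depends on many factors.
Some of them are outside our control."""

print(json.dumps(example_pipeline(raw_document), indent=2))
```

Returning plain lists and dictionaries keeps the output JSON-serializable, which is convenient when a function like this sits behind a web API.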

See pipeline-sec-filings for an example repo that includes a preprocessing pipeline API and an auto-generated FastAPI.

🔩 Developer tools for generating FastAPIs

The unstructured-api-tools library includes the tooling required to create FastAPIs from pipeline notebooks.

🤗 Hugging Face

Hugging Face Spaces offer a simple way to host ML demo apps, models and datasets directly on our organization’s profile. This allows us to showcase our projects and work collaboratively with other people in the ML ecosystem. Visit our space here!