TianGong-AI-Unstructure

public
49 stars
26 forks
2 issues

Commits

List of commits on branch main.
Unverified
5fa66fbafd9a9376458340dac0ce31beafe5bcb8

Refactor .gitignore to include *.pkl files and update file paths for education documents

llinancn committed 6 days ago
Unverified
4ad0db9a6c112a7ea03dc1f7873a9cb1a834828b

Refactor code to fetch and process ESG records from PostgreSQL, update file paths for education documents, and add OpenSearch integration

llinancn committed 7 days ago
Unverified
2be9275c3fc04bb8b70866946084265133512506

Refactor .gitignore and requirements.txt, and update file paths for education documents

llinancn committed 13 days ago
Unverified
8463227c57762dfed0184f54449ce972e426cd99

chore: Update pip installation command

llinancn committed a month ago
Unverified
adeb4486d975fb047e441a8331eff7cab3a37839

Refactor code to fix typo in comment and remove unused import

llinancn committed 2 months ago
Unverified
3fcd9bf39d7d005223440d10c0b0ed6f4bdb3ed3

Refactor code to fetch and process ESG records from PostgreSQL, update file paths for education documents, and add OpenSearch integration

llinancn committed 2 months ago

README

The README file for this repository.

TianGong AI Unstructure

Environment Preparation

Using VSCode Dev Containers

Tutorial

Python 3 -> Additional Options -> 3.11-bullseye -> ZSH Plugins (Last One) -> Trust @devcontainers-contrib -> Keep Defaults

Setup venv:

python3.11 -m venv .venv
source .venv/bin/activate

Install requirements:

# Windows
python.exe -m pip install --upgrade pip

# Linux / macOS
pip install --upgrade pip

# optional: use the Tsinghua PyPI mirror
pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt --upgrade
sudo apt update
sudo apt install python3.11-dev
sudo apt install -y libmagic-dev
sudo apt install -y poppler-utils
sudo apt install -y libreoffice
sudo apt install -y pandoc
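
To confirm that the system packages above are installed, one option is to look for the executables they normally provide. The following is a minimal sketch, not part of the repository: `REQUIRED_TOOLS` and `missing_packages` are hypothetical names, and the executable-to-package mapping is an assumption based on what these Debian packages usually ship. Library-only packages such as libmagic-dev and python3.11-dev are not covered.

```python
import shutil

# Executable -> apt package. The executable names are assumptions about
# what each package installs on a typical Debian/Ubuntu system.
REQUIRED_TOOLS = {
    "pdftoppm": "poppler-utils",
    "soffice": "libreoffice",
    "pandoc": "pandoc",
}


def missing_packages(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the apt packages whose executable is not found on PATH."""
    return [pkg for exe, pkg in tools.items() if which(exe) is None]
```

Calling `missing_packages()` returns an empty list when every executable is found; any package names it returns can be passed to `sudo apt install -y`.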

Test CUDA (optional):

nvidia-smi

Auto Build

The auto build is triggered by pushing any tag named like release-v$version. For instance, pushing a tag named v0.0.1 builds a Docker image of version 0.0.1.

# list existing tags
git tag
# create a new tag
git tag v0.0.1
# push this tag to origin
git push origin v0.0.1
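
The tag-to-version mapping described above can be sketched as a small check. This is an illustration only, not the project's CI logic, which is not shown here. The name `version_from_tag` and the regular expression are assumptions. The pattern accepts both the `release-v$version` form and the `v0.0.1` form because both appear in this section.

```python
import re
from typing import Optional

# Assumed pattern: an optional "release-" prefix, then "v", then MAJOR.MINOR.PATCH.
TAG_PATTERN = re.compile(r"(?:release-)?v(\d+\.\d+\.\d+)")


def version_from_tag(tag: str) -> Optional[str]:
    """Return the bare version (e.g. "0.0.1") for a release tag, or None if the tag does not match."""
    match = TAG_PATTERN.fullmatch(tag)
    return match.group(1) if match else None
```

A check like this can be run before `git push origin <tag>` to catch a tag name that would not map to a version.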

sphinx

sphinx-apidoc --force -o sphinx/source/ src/
sphinx-autobuild sphinx/source docs/

Docker Manual Build

docker build -t linancn/tiangong-ai-unstructure:v0.0.1 .
docker push linancn/tiangong-ai-unstructure:v0.0.1

Nginx config

default file location: /etc/nginx/sites-enabled/default

sudo apt update
sudo apt install nginx
sudo nginx
sudo nginx -s reload
sudo nginx -s stop

Update the version of Tesseract in WSL Shell

remove the old version and add the necessary libraries

sudo apt-get remove tesseract-ocr
sudo apt-get install libpng-dev libjpeg-dev libtiff-dev libgif-dev libwebp-dev libopenjp2-7-dev zlib1g-dev

get the latest version of leptonica by running the following code in sequence

cd
wget https://github.com/DanBloomberg/leptonica/archive/refs/tags/1.84.1.tar.gz
tar -xzvf 1.84.1.tar.gz
cd leptonica-1.84.1
mkdir build
cd build
sudo snap install cmake --classic # installs cmake from snap (3.28.3 at the time of writing)
cmake ..
make -j"$(nproc)"
sudo make install

get the latest version of tesseract by running the following code in sequence

cd
wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.3.4.tar.gz
tar -xzvf 5.3.4.tar.gz
cd tesseract-5.3.4
mkdir build
cd build
cmake ..
make -j"$(nproc)"
sudo make install

set environment variables

cd
nano ~/.bashrc

add the following line at the end of the file, save the file (Ctrl-O), and exit (Ctrl-X)

export TESSDATA_PREFIX=/usr/local/share/tessdata

activate the setting

source ~/.bashrc

get language models

https://github.com/tesseract-ocr/tessdata/blob/main/chi_sim.traineddata
https://github.com/tesseract-ocr/tessdata/blob/main/chi_tra.traineddata
https://github.com/tesseract-ocr/tessdata/blob/main/eng.traineddata

put the downloaded .traineddata files in

/usr/local/share/tessdata/
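
The links above point to GitHub's blob pages, which are HTML views. Fetching them from a script needs the corresponding raw links. Below is a minimal sketch of that conversion and download, under stated assumptions: `raw_url` and `download_models` are hypothetical helper names, and the sketch relies on GitHub serving file contents when `/blob/` in the path is replaced with `/raw/`.

```python
import urllib.request
from pathlib import Path

# Language codes taken from the links above.
LANGS = ("chi_sim", "chi_tra", "eng")
TESSDATA_REPO = "https://github.com/tesseract-ocr/tessdata"


def raw_url(blob_url: str) -> str:
    """Turn a GitHub ".../blob/<ref>/<path>" page link into the ".../raw/<ref>/<path>" file link."""
    return blob_url.replace("/blob/", "/raw/", 1)


def download_models(dest_dir: str, langs=LANGS) -> list:
    """Download <lang>.traineddata for each language into dest_dir (needs network and write access)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for lang in langs:
        url = raw_url(f"{TESSDATA_REPO}/blob/main/{lang}.traineddata")
        target = dest / f"{lang}.traineddata"
        urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

Writing directly to /usr/local/share/tessdata/ usually requires root, so one option is to call `download_models` on a scratch directory and then copy the files over with `sudo cp`.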

check which language models are installed

tesseract --list-langs
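
The same check can be scripted so that a missing model is caught before a long job starts. This is a minimal sketch with hypothetical helper names. It assumes the command prints a header line beginning with "List of" followed by one language code per line, and that the text may arrive on either stdout or stderr depending on the Tesseract version.

```python
import subprocess


def installed_langs(list_langs_output: str) -> set:
    """Parse `tesseract --list-langs` text into a set of language codes."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # Drop blank lines and the assumed "List of available languages ..." header.
    return {line for line in lines if line and not line.startswith("List of")}


def missing_langs(list_langs_output: str, required=("chi_sim", "chi_tra", "eng")) -> list:
    """Return the required language codes that are absent from the output."""
    have = installed_langs(list_langs_output)
    return [lang for lang in required if lang not in have]


def query_tesseract() -> str:
    """Run `tesseract --list-langs` and return everything it printed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

`missing_langs(query_tesseract())` returns an empty list when all three models listed above are found.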

Run in Background

watch -n 1 nvidia-smi
find esg_txt/ -type f | wc -l 
ls -lt esg_txt/ | head -n 10

nohup .venv/bin/python3.11 src/journals/chunk_by_title_sci.py > log.txt 2>&1 &
pkill -f src/journals/chunk_by_title_sci.py

CUDA_VISIBLE_DEVICES=2 nohup .venv/bin/python3.11 src/esg/1_chunk_by_title.py > esg_unstructured.log 2>&1 &
CUDA_VISIBLE_DEVICES=2 nohup .venv/bin/python3.11 src/esg/3_chunk_by_title_pages.py > esg_meta_unstructured.log 2>&1 &

CUDA_VISIBLE_DEVICES=0 nohup .venv/bin/python3.11 src/esg/1_chunk_by_title_0.py > esg_unstructured_0.log 2>&1 &
CUDA_VISIBLE_DEVICES=1 nohup .venv/bin/python3.11 src/esg/1_chunk_by_title_1.py > esg_unstructured_1.log 2>&1 &
CUDA_VISIBLE_DEVICES=2 nohup .venv/bin/python3.11 src/esg/1_chunk_by_title_2.py > esg_unstructured_2.log 2>&1 &
CUDA_VISIBLE_DEVICES=3 nohup .venv/bin/python3.11 src/esg/1_chunk_by_title_3.py > esg_unstructured_3.log 2>&1 &

pkill -f src/esg/1_chunk_by_title_0.py
pkill -f src/esg/1_chunk_by_title_1.py
pkill -f src/esg/1_chunk_by_title_2.py
pkill -f src/esg/1_chunk_by_title_3.py
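
The per-GPU launch commands above follow one pattern, so they can also be generated rather than typed out. The sketch below is not part of the repository and its helper names are hypothetical. It assumes the script and log names keep the `1_chunk_by_title_<gpu>.py` and `esg_unstructured_<gpu>.log` scheme shown above, and it uses `start_new_session=True` as a rough stand-in for `nohup ... &`.

```python
import os
import subprocess

PYTHON = ".venv/bin/python3.11"


def shard_job(gpu: int):
    """Return (argv, log_path) for the shard script that is pinned to one GPU."""
    script = f"src/esg/1_chunk_by_title_{gpu}.py"
    log_path = f"esg_unstructured_{gpu}.log"
    return [PYTHON, script], log_path


def launch_shard(gpu: int) -> subprocess.Popen:
    """Start one shard in its own session, with stdout and stderr written to its log file."""
    argv, log_path = shard_job(gpu)
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    with open(log_path, "w") as log_file:
        # The child keeps its own copy of the log file descriptor after this block exits.
        return subprocess.Popen(
            argv,
            env=env,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # new session, so a terminal hang-up is not delivered to the job
        )
```

For example, `procs = [launch_shard(gpu) for gpu in range(4)]` starts all four shards, and calling `proc.terminate()` on each stops them, which plays the role of the `pkill` lines.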



nohup .venv/bin/python3.11 src/esg/2_embedding_init.py > esg_embedding_log.txt 2>&1 &

nohup .venv/bin/python3.11 src/standards/1_chunk_by_title.py > log.txt 2>&1 &

nohup .venv/bin/python3.11 src/reports/1_chunk_by_title.py > log.txt 2>&1 &
nohup .venv/bin/python3.11 src/reports/2_embedding_init.py > log.txt 2>&1 &