A smol blueprint for AI development, focusing on applied examples of RAG, information extraction, analysis and fine-tuning in the age of LLMs. It takes a practical approach, applying some of the theoretical lessons from the smol-course to an end-to-end, real-world problem.
🚀 Web apps and microservices included!
Each notebook shows how to deploy your AI as a web app on Hugging Face Spaces with Gradio, which you can then call as a microservice through the Gradio Python Client. All the code and demos can be used in a private or public setting. Deployed on the Hub!
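As a sketch of that microservice pattern, a deployed Space can be called directly from Python with the Gradio Client. The Space id and endpoint name below are hypothetical placeholders; substitute those of your own deployment.

```python
from gradio_client import Client

# Hypothetical Space id -- replace with the Space you deployed.
client = Client("smol-blueprint/rag-demo")

# The endpoint name and signature depend on your app; "/predict" is the Gradio default.
result = client.predict("What is RAG?", api_name="/predict")
print(result)
```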
We want to build a tool that can help us use AI on company documents. In our case, we will be working with the smol-blueprint/hf-blogs dataset, which is a dataset that contains the blogs from the Hugging Face website.
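To follow along, the dataset can be pulled from the Hub with the `datasets` library. This is a minimal sketch; it requires network access, and the available splits and columns are whatever the dataset itself defines.

```python
from datasets import load_dataset

# Download the HF blogs dataset referenced above (assumes a "train" split exists).
blogs = load_dataset("smol-blueprint/hf-blogs", split="train")

print(blogs.column_names)  # inspect which fields each blog record carries
print(blogs[0])            # look at a single record
```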
- Retrieval Augmented Generation (RAG)
- ✅ Indexing - Indexing documents into a vector search backend
- ✅ Building - Building a RAG pipeline
- 🚧 Monitoring - Monitoring and improving your RAG pipeline
- 🚧 Fine-tuning - Fine-tuning retrieval and reranking models
- Information extraction and labeling
- 🚧 Building - Structured information extraction with LLMs
- 🚧 Monitoring - Monitoring extraction quality
- 🚧 Fine-tuning - Fine-tuning extraction models
- Agents for orchestration
- 🚧 Orchestration - Building agents to coordinate components
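The indexing and building steps of the RAG track can be sketched at toy scale in pure Python: embed documents as bag-of-words vectors, retrieve the most similar one by cosine similarity, and stuff it into a prompt. This is only an illustrative stand-in for a real vector search backend and LLM call; the document texts and helper names are made up.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing: embed every document once (hypothetical blog snippets).
docs = [
    "Gradio lets you build web apps for machine learning models",
    "RAG combines retrieval with generation to ground LLM answers",
]
index = [(doc, embed(doc)) for doc in docs]

# Building: retrieve the most similar document and assemble a prompt for an LLM.
def rag_prompt(question: str) -> str:
    q = embed(question)
    best_doc, _ = max(index, key=lambda pair: cosine(q, pair[1]))
    return f"Context: {best_doc}\n\nQuestion: {question}\nAnswer:"

print(rag_prompt("What is RAG?"))
```

A production pipeline swaps the term-frequency vectors for neural embeddings and the prompt printout for a generation call, but the retrieve-then-generate shape stays the same.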
We will use uv to manage the project. First create a virtual environment:
```bash
uv venv --python 3.11
source .venv/bin/activate
```
Then you can install all the required dependencies:
```bash
uv sync --all-groups
```
Or you can sync only a specific dependency group:

```bash
uv sync --group scraping
uv sync --group rag
uv sync --group information-extraction
```
You will need a Hugging Face account to use the Hub API. You can create one on the Hugging Face website. After that, follow the huggingface-cli instructions and log in to configure your Hugging Face token:

```bash
huggingface-cli login
```