📚 Synthetic Instruction Dataset Generation

This repo will allow you to create multilingual synthetic instructions datasets using the MAGPIE method and ollama.

⚠️Important Note: The instruction datasets created here are for educational purposes. However, it is the users' duty to ensure that their use adheres to the terms of the relevant licensing agreements with Meta AI's Llama 3.

🔧 Prerequisites

Git
Python 3.8 or higher
Poetry
ollama

🛠️ Installation

Clone this repo

git clone https://mrm8488/synthetic-instructions-dataset-generation

Install the requirements

poetry install

Download the ollama model

ollama run llama3

Create a server with the ollama model

ollama server llama3

🚀 Example of usage

python src/dataset_gen.py \
--model llama3 \
--lang es \
--num_samples 1000 \
--push_to_hub \
--hf_token <YOUR_HUGGINGFACE_TOKEN>

🪣 Filtering the generated dataset

python src/services/filtering.py \
--filter_lang es \
--push_to_hub \
--hf_token <YOUR_HUGGINGFACE_TOKEN>

🔍 Observations

1. Language: Spanish

1.1 Model: llama3 (llama3-8b-instruct)

The examples generated are very Q&A-like.

1.2 Model: phi3 (phi3-mini and medium)

The examples generated are more instruction-like.

2. Language: Deutsch

2.1 Model: llama3 (llama3-8b-instruct)

The examples tend to be very repetitive.

License

MIT License

Contributions are welcome! 🎉

Acknowledgements

Sebastian Raschka, PhD for his post and base script:

https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

magpie-ollama-datagen

Commits

Update readme.md

Merge pull request #2 from mrm8488:add-german-support

Add suport for German and update readme

Merge pull request #1 from davanstrien/patch-1

typo

Enhance readme

README