This repo will allow you to create multilingual synthetic instructions datasets using the MAGPIE method and ollama
.
โ ๏ธImportant Note: The instruction datasets created here are for educational purposes. However, it is the users' duty to ensure that their use adheres to the terms of the relevant licensing agreements with Meta AI's Llama 3.
- Git
- Python 3.8 or higher
- Poetry
- ollama
- Clone this repo
git clone https://mrm8488/synthetic-instructions-dataset-generation
- Install the requirements
poetry install
- Download the
ollama
model
ollama run llama3
- Create a server with the
ollama
model
ollama server llama3
python src/dataset_gen.py \
--model llama3 \
--lang es \
--num_samples 1000 \
--push_to_hub \
--hf_token <YOUR_HUGGINGFACE_TOKEN>
python src/services/filtering.py \
--filter_lang es \
--push_to_hub \
--hf_token <YOUR_HUGGINGFACE_TOKEN>
- The examples generated are very Q&A-like.
- The examples generated are more instruction-like.
- The examples tend to be very repetitive.
MIT License
Contributions are welcome! ๐
Sebastian Raschka, PhD for his post and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/