GitXplorerGitXplorer
m

magpie-ollama-datagen

public
4 stars
3 forks
0 issues

Commits

List of commits on branch main.
Verified
7a7e144a53c9389801ac91c74fd7c3024e0d29ee

Update readme.md

mmrm8488 committed 6 months ago
Verified
b06284c19b0f4a36ed590ec3f15c27155866b693

Merge pull request #2 from mrm8488:add-german-support

mmrm8488 committed 6 months ago
Unverified
fd73bf7f0bfa580a31ec0b4e18a882670282d7d0

Add suport for German and update readme

mmrm8488 committed 6 months ago
Verified
bf942f3e3a9f95a5d0c88fc1a131f52fdda31a70

Merge pull request #1 from davanstrien/patch-1

mmrm8488 committed 7 months ago
Verified
37bf3ab4cad43e0aaefa257801294e0e38e4f740

typo

ddavanstrien committed 7 months ago
Unverified
72dfbc42d3a9af29031c429f3b6da98749673d54

Enhance readme

mmrm8488 committed 7 months ago

README

The README file for this repository.

๐Ÿ“š Synthetic Instruction Dataset Generation

This repo will allow you to create multilingual synthetic instructions datasets using the MAGPIE method and ollama.

โš ๏ธImportant Note: The instruction datasets created here are for educational purposes. However, it is the users' duty to ensure that their use adheres to the terms of the relevant licensing agreements with Meta AI's Llama 3.

๐Ÿ”ง Prerequisites

  • Git
  • Python 3.8 or higher
  • Poetry
  • ollama

๐Ÿ› ๏ธ Installation

  1. Clone this repo
git clone https://mrm8488/synthetic-instructions-dataset-generation
  1. Install the requirements
poetry install
  1. Download the ollama model
ollama run llama3
  1. Create a server with the ollama model
ollama server llama3

๐Ÿš€ Example of usage

python src/dataset_gen.py \
--model llama3 \
--lang es \
--num_samples 1000 \
--push_to_hub \
--hf_token <YOUR_HUGGINGFACE_TOKEN>

๐Ÿชฃ Filtering the generated dataset

python src/services/filtering.py \
--filter_lang es \
--push_to_hub \
--hf_token <YOUR_HUGGINGFACE_TOKEN>

๐Ÿ” Observations

1. Language: Spanish

1.1 Model: llama3 (llama3-8b-instruct)

  • The examples generated are very Q&A-like.

1.2 Model: phi3 (phi3-mini and medium)

  • The examples generated are more instruction-like.

2. Language: Deutsch

2.1 Model: llama3 (llama3-8b-instruct)

  • The examples tend to be very repetitive.

License

MIT License

Contributions are welcome! ๐ŸŽ‰

Acknowledgements

Sebastian Raschka, PhD for his post and base script:

https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/