
voicecraft-tools




VoiceCraft Tools

Tools for working with VoiceCraft. The first tool, inspect_dataset.py, inspects a dataset. Run it with the name of the dataset:

 python inspect_dataset.py --dataset_name pere/nst-voicecraft

This should print a few examples of tokenization:

Original text: Mytene om tusser og troll levde i beste velgående til langt inn i forrige århundre <PERIOD>
Tokenized text: ['m', '', 't', 'ɛ', 'n', 'a', '_', '', 'm', '_', 't', 'ʉ', 's', 's', 'ə', 'r', '_', '', 'ɡ', '_', 't', 'r', 'ɔ', 'l', '_', 'l', 'ɛ', 'v', 'd', 'a', '_', '', '_', 'b', 'ə', 's', 't', '', '_', 'v', 'ɛ', 'l', 'ɡ', '', 'a', 'n', 'n', 'a', '_', 't', '', 'l', '_', 'l', 'ɑ', 'ŋ', 't', '_', 'ɪ', 'n', '_', '', '_', 'f', 'ɔ', 'r', 'r', '', 'ɡ', 'a', '_', 'ɔ', 'r', 'h', 'ʉ', 'n', 'n', 'r', 'a', '_', 'p', '', 'r', '', '', 'd']
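
Judging only by the example output above, the `_` token appears to mark word boundaries: splitting the token list on `_` yields 16 groups, matching the 16 words of the original text. The sketch below regroups a token list on that assumption. The function name `group_tokens_by_word` and the `boundary` parameter are hypothetical and are not taken from inspect_dataset.py.

```python
def group_tokens_by_word(tokens, boundary="_"):
    """Split a flat token list into per-word groups.

    Assumption: `boundary` ("_" in the example output) separates words.
    Empty-string tokens are kept as-is, since they appear in the output.
    """
    words = [[]]
    for tok in tokens:
        if tok == boundary:
            # Boundary token: start a new word group.
            words.append([])
        else:
            words[-1].append(tok)
    return words


# First two words of the example ("Mytene om"), copied from the output above.
sample = ['m', '', 't', 'ɛ', 'n', 'a', '_', '', 'm']
for word in group_tokens_by_word(sample):
    print(word)
```

Grouping tokens this way makes it easier to compare each word of the original text with its tokens when checking the tokenizer by eye.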

It should also print summary statistics like these:

Total number of rows in the dataset: 227,240
Total word count in the dataset: 2,191,506
Total token count in the dataset: 12,400,734
Average words per row: 9.64
Average tokens per row: 54.57
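
The averages are the totals divided by the number of rows (for example, 2,191,506 / 227,240 ≈ 9.64). As a minimal sketch of how such a report could be produced, the code below counts words and tokens and formats the totals with thousands separators. It is not the code from inspect_dataset.py: the names `count_totals` and `format_summary`, the `(text, tokens)` row shape, and counting words by whitespace splitting are all assumptions.

```python
def count_totals(rows):
    """Return (num_rows, total_words, total_tokens).

    Assumption: each row is a (text, tokens) pair, and words are
    counted by splitting the text on whitespace.
    """
    num_rows = total_words = total_tokens = 0
    for text, tokens in rows:
        num_rows += 1
        total_words += len(text.split())
        total_tokens += len(tokens)
    return num_rows, total_words, total_tokens


def format_summary(num_rows, total_words, total_tokens):
    """Format the totals and per-row averages as report lines."""
    # Guard against an empty dataset to avoid division by zero.
    avg_words = total_words / num_rows if num_rows else 0.0
    avg_tokens = total_tokens / num_rows if num_rows else 0.0
    return [
        f"Total number of rows in the dataset: {num_rows:,}",
        f"Total word count in the dataset: {total_words:,}",
        f"Total token count in the dataset: {total_tokens:,}",
        f"Average words per row: {avg_words:.2f}",
        f"Average tokens per row: {avg_tokens:.2f}",
    ]


# Reproduce the report above from its own totals.
for line in format_summary(227240, 2191506, 12400734):
    print(line)
```

Feeding the totals from the report back into `format_summary` reproduces the five lines shown above, which is a quick consistency check on the averages.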