
hindi2vec

public
219 stars
27 forks
4 issues

Commits

List of commits on branch master.
Verified
0fa0b769c455b5c6601e1f63ae49cb322404ba79

Update TODO and Reorder Downloads

NirantK committed 6 years ago
Unverified
5f25174810ea3871e03ed11a09a21c2f514838eb

Set theme jekyll-theme-dinky

NirantK committed 6 years ago
Verified
22938843dfdde637889422840208157560e0db41

Add a separate Downloads with BBC Hindi Data

NirantK committed 7 years ago
Verified
545dd0f9bd8cbd3dfb82981cd6f7864455626612

Add Idea dump

NirantK committed 7 years ago
Verified
62001feb42d6808d544f25a314008271f5d41886

Clearer writing (hopefully)

NirantK committed 7 years ago
Verified
928f67323f38aa66ace568e3cfd793ebd2e46747

Extract Word Embedding from Language Model

NirantK committed 7 years ago

README


hindi2vec

State-of-the-Art Language Modeling and Text Classification in Hindi Language

Results

We achieved a state-of-the-art perplexity of 46.81 for Hindi, compared with 40.68 for English (lower is better).

  • To the best of my knowledge, as of September 18, 2018

Update: nlp-for-hindi uses sentencepiece instead of the word-based spaCy tokenizer that I use. On those tokens, the measured perplexity for that LM is ~35. I encourage you to check that work out as well.
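
For reference, the perplexity figures above follow the usual definition: the exponential of the mean per-token negative log-likelihood (cross-entropy in nats). The sketch below is illustrative only and is not taken from the notebooks in this repository; the function names `perplexity` and `perplexity_per_unit` are hypothetical. It also hints at why perplexities measured on different tokenizations (word-based vs. sentencepiece) are not directly comparable: the same total loss divided by a different number of units gives a different figure.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log).

    token_nlls: one NLL value per predicted token, i.e. the per-token
    cross-entropy loss a language model reports on held-out text.
    """
    if not token_nlls:
        raise ValueError("need at least one token loss")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


def perplexity_per_unit(token_nlls, n_units):
    """Re-normalise the same total loss over a different unit count.

    Hypothetical helper: spreads the total NLL over n_units (for example,
    the number of whitespace-separated words) instead of the number of
    tokens the model actually predicted.
    """
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return math.exp(sum(token_nlls) / n_units)


if __name__ == "__main__":
    # Made-up losses: six subword tokens that together cover four words.
    losses = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
    print("per-token perplexity:", perplexity(losses))
    print("per-word perplexity: ", perplexity_per_unit(losses, 4))
```

Under this definition, a reported perplexity of 46.81 corresponds to a mean validation loss of ln(46.81), roughly 3.85 nats per token.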

Downloads

TODO

  • [x] Language modeling based on wikipedia dump
  • [x] Release Language Models: Hindi Language Model
  • [x] Create Text classification Datasets: BBC Hindi
  • [ ] Benchmark text classification with FastText

Idea Dump

  • [ ] Change the custom head to perform transliteration instead of classification: Hindi script (Devanagari) to English script (Roman)
  • [ ] Multi-task learning (MTL) for training and inference using custom heads
  • [ ] Text to Speech - using datasets from news recordings or Hindi subtitles of dubbed movies

FastAI Installation

This version of the notebook uses v0.7 of the fastai library, which was used in their Part 2 v2 course in Summer 2018. The best way to install it is via conda, as mentioned here.

Special thanks to Jeremy, Rachel, and the other contributors to fastai. This work reproduces their English-language work for Hindi. Thanks also to @cstorm125 for thai2vec, which inspired this work.