hindi2vec

public
219 stars
27 forks
4 issues

Commits

List of commits on branch master.
Unverified
f4ec3c461cd668abf97a867c4c47a7b83cb244b6

Add Hi Wiki LM Experiment

invalid-email-address committed 6 years ago
Unverified
31960f4ce78fc32e95a73b95fb07c9d235563b4d

Setting up data

invalid-email-address committed 6 years ago
Verified
2d064d25d4e37e33b41b35c975018e01ef40244e

Add nlp-for-hindi

NirantK committed 6 years ago
Verified
698e364fd1a8d5d433ca07f4582d575723d2cf45

Add Installation Instructions

NirantK committed 6 years ago
Unverified
0103b9a1607851738921c76f5dfcbb8cd124dd7b

Add Hindi2Vec Logo

NirantK committed 6 years ago
Verified
b18171308fd96608f006ebffd4d97e9f5c187232

Fix formatting snafu

NirantK committed 6 years ago

README

hindi2vec

State-of-the-Art Language Modeling and Text Classification in Hindi Language

Results

We achieved a state-of-the-art perplexity of 46.81 for Hindi, compared to 40.68 for English (lower is better).

  • To the best of my knowledge, as of September 18, 2018

Update: nlp-for-hindi uses sentencepiece instead of the word-based spaCy tokenizer that I use. On those tokens, the measured perplexity for that LM is ~35. I encourage you to check out that work as well.
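
For anyone comparing these numbers: perplexity is the exponential of the average per-token cross-entropy loss, so its value depends on what counts as a token. Below is a minimal sketch in plain Python. It is not taken from the notebooks in this repository; the function names are hypothetical, and losses are assumed to be in nats (natural log).

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def loss_from_perplexity(ppl):
    """Average per-token cross-entropy (nats) implied by a perplexity."""
    return math.log(ppl)


def renormalize(ppl, n_tokens, n_units):
    """Re-express a per-token perplexity per another unit (e.g. per word).

    The total negative log-likelihood of the text is unchanged;
    only the count it is averaged over changes.
    """
    total_nll = math.log(ppl) * n_tokens
    return math.exp(total_nll / n_units)
```

Under these assumptions, a perplexity of 46.81 corresponds to roughly 3.85 nats per token, and 40.68 to roughly 3.71. A subword tokenizer usually splits the same text into more tokens than a word tokenizer, so a per-subword perplexity is not directly comparable with a per-word one until both are renormalized to the same unit, which is what `renormalize` illustrates.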

Downloads

TODO

  • [x] Language modeling based on wikipedia dump
  • [x] Release Language Models: Hindi Language Model
  • [x] Create Text classification Datasets: BBC Hindi
  • [ ] Benchmark text classification with FastText

Idea Dump

  • [ ] Repurpose the custom head for transliteration instead of classification: Hindi script (Devanagari) to English script (Roman)
  • [ ] Multi-task learning (MTL) for training and inference using custom heads
  • [ ] Text to Speech - using datasets from news recordings or Hindi subtitles of dubbed movies

FastAI Installation

This version of the notebook uses v0.7 of the fastai library, which was used in their Part 2 v2 course in Summer 2018. The best way to install it is via conda, as mentioned here
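
Since later fastai releases changed the API, it can help to confirm that the active environment really has a 0.7-series install before running the notebook. A small sketch, assuming the package is registered under the distribution name `fastai`; both helper names are hypothetical and not part of fastai itself.

```python
def installed_fastai_version():
    """Return the installed fastai version string, or None if unknown."""
    try:
        from importlib.metadata import version, PackageNotFoundError
    except ImportError:  # Python < 3.8 has no importlib.metadata
        return None
    try:
        return version("fastai")
    except PackageNotFoundError:
        return None


def is_fastai_v07(version_string):
    """True for version strings in the 0.7 series, e.g. '0.7.0'."""
    parts = version_string.split(".")
    return len(parts) >= 2 and parts[0] == "0" and parts[1] == "7"


found = installed_fastai_version()
if found is None:
    print("fastai version could not be determined")
elif not is_fastai_v07(found):
    print("fastai %s found; this notebook expects the 0.7 series" % found)
```

A result of `None` does not necessarily mean fastai is missing: an install that links the library from a source checkout may not register package metadata.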

Special thanks to Jeremy, Rachel, and the other contributors to fastai. This work reproduces their English-language work for Hindi. Thanks to @cstorm125 for thai2vec, which inspired this work.