
arabish

public
34 stars
11 forks
2 issues

Commits

List of commits on branch master.
9e883a6b5eac54be3c676858f048bd4d8d698c4e

different rule table for the start of the word

aamasad committed 11 years ago
7af11119a988861c2b35065370fb3cd3788735c7

manually edit the mapping file

aamasad committed 11 years ago
f1deded893c11b2e7f150798981890a0314c0964

Sort

aamasad committed 11 years ago
5849816e452417c619c897ab53b43cf4fda5705f

add arabish test

aamasad committed 11 years ago
59755e9176fa020ce235abb46e3f810fd5510845

ensure_ascii false

aamasad committed 11 years ago
4830df04269c9cdac2ba6f809b6f5c80d2713168

seperate out mapping gen

aamasad committed 11 years ago

README

The README file for this repository.

Arabish (beta)

Arabic transliteration in Python. Similar to Yamli.com, Google Ta3reeb, and Microsoft Maren.

Why

Because there isn't an open source transliteration project available. And it's not that hard!
I'm sure there are corner cases that make it harder and harder to reach 100% accuracy, but it seems fairly easy to get to 80%.

Approach

  1. Start with a list of simple mappings, each from one or two English letters to a single Arabic letter.
  2. For each English key in the mapping, add variants with a vowel appended, so that the harakaat (short vowels) are simply ignored.
  3. Take an English word that phonetically represents an Arabic word.
  4. Construct the set of all possible Arabic words (valid or not) using a recursive search algorithm.
  5. Use word frequency to pick the most likely word from that set.
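
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the tiny mapping table, the frequency counts, and the function names (`add_vowel_keys`, `candidates`, `transliterate`) are all hypothetical and chosen only to make the example self-contained.

```python
# Hypothetical toy mapping (step 1): Latin key -> single Arabic letter.
# The project's real mapping file is larger and may differ.
BASE_MAPPING = {
    "k": "ك",
    "t": "ت",
    "b": "ب",
    "s": "س",
    "sh": "ش",
    "h": "ه",
    "m": "م",
    "a": "ا",
}
VOWELS = "aeiou"


def add_vowel_keys(mapping):
    """Step 2: add key+vowel variants so short vowels are dropped."""
    extended = dict(mapping)
    for key, letter in mapping.items():
        for vowel in VOWELS:
            # "ki", "ka", ... all produce the same letter as "k".
            extended.setdefault(key + vowel, letter)
    return extended


def candidates(word, mapping):
    """Steps 3-4: recursively build every Arabic spelling of `word`."""
    if not word:
        return {""}
    results = set()
    longest_key = max(len(key) for key in mapping)
    for size in range(1, min(longest_key, len(word)) + 1):
        letter = mapping.get(word[:size])
        if letter is None:
            continue
        # Consume the matched prefix and recurse on the remainder.
        for rest in candidates(word[size:], mapping):
            results.add(letter + rest)
    return results


def transliterate(word, mapping, freq):
    """Step 5: pick the candidate with the highest corpus frequency."""
    known = [c for c in candidates(word.lower(), mapping) if c in freq]
    if not known:
        return None
    return max(known, key=freq.get)


MAPPING = add_vowel_keys(BASE_MAPPING)

# Hypothetical counts, standing in for a real frequency list.
FREQ = {"كتاب": 120, "كتب": 45, "شمس": 80}

print(sorted(candidates("kitab", MAPPING)))
print(transliterate("kitab", MAPPING, FREQ))
```

In this sketch "kitab" yields two candidates, one keeping the long vowel and one dropping it, and the frequency table decides between them. The number of candidates grows quickly with word length, so a real implementation would probably prune against the word list or memoize the recursion.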

Current state

I'm very pleased, even surprised, with the initial results. With a better training corpus and some simple tweaks to the rules, we can get to at least 80% of the accuracy of Yamli or similar services. The current training corpus is a frequency list based on words from opensubtitles.org, and it is mostly classical Arabic.

See TODO.txt