
Part-of-Speech-Tagger

A Part-of-Speech tagger implemented with a Support Vector Machine (SVM), a Hidden Markov Model (HMM) and a Bi-directional Long Short-Term Memory network (Bi-LSTM).

All results, a detailed error analysis, the strengths and weaknesses of the models, and references are provided in Report.pdf.

Support Vector Machine

  • Instructions for running the code

    SVM_PoS_tagger.ipynb contains the SVM implementation.

    Run the code from https://colab.research.google.com/drive/178a6M4J3lt-twzGr1Nv-Y2EliW77NEt9?usp=sharing

    It can also be run as a Python script via svm.py in the Support Vector Machine directory.

    No dependencies need to be installed when running from the Colab notebook. A simplified sketch of the approach is shown after the results below.

  • Results

    • Per-POS accuracy vs Relative Frequency

      (Plots: per-POS accuracy on the train set and per-POS accuracy on the test set.)

    • Accuracy

      | Model | Test Accuracy (%) | Train Accuracy (%) |
      |-------|-------------------|--------------------|
      | SVM   | 83.25             | 83.36              |
    • Feature Engineering

      | Features Selected                                                      | Accuracy (%) |
      |------------------------------------------------------------------------|--------------|
      | Word length, capitalisation, upper-case, lower-case, isNumeric         | 40           |
      | Prefix and suffix for nouns, verbs, adjectives and adverbs             | 55           |
      | Word stems using PorterStemmer                                         | 60           |
      | Tag of previous word                                                   | 65           |
      | Pre-trained GloVe word embeddings (100,000 50-dimensional vectors)     | 83           |
      | Pre-trained word2vec embeddings (100 million 300-dimensional vectors)  | 90           |
      | Including the features of the previous 3 words and following 3 words   | 95           |
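
  • Simplified sketch (illustrative)

    The snippet below is a minimal sketch of how a feature-based SVM tagger of this kind could be assembled with scikit-learn's DictVectorizer and LinearSVC. The feature functions, context window and data slice are simplified stand-ins for the features listed above; it is not the actual code in svm.py or SVM_PoS_tagger.ipynb.

    ```python
    # Illustrative sketch only (assumption): a feature-based SVM tagger built with
    # scikit-learn, not the actual code from svm.py / SVM_PoS_tagger.ipynb.
    import nltk
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    nltk.download('brown')
    nltk.download('universal_tagset')

    def word_features(words, i):
        """Hand-crafted features for the i-th word of a sentence (simplified)."""
        w = words[i]
        return {
            'word.lower': w.lower(),
            'word.istitle': w.istitle(),
            'word.isupper': w.isupper(),
            'word.isdigit': w.isdigit(),
            'word.len': len(w),
            'prefix3': w[:3],
            'suffix3': w[-3:],
            'prev.word': words[i - 1].lower() if i > 0 else '<s>',
            'next.word': words[i + 1].lower() if i < len(words) - 1 else '</s>',
        }

    # Build word-level training examples from a small slice of the Brown corpus.
    X, y = [], []
    for sent in nltk.corpus.brown.tagged_sents(tagset='universal')[:2000]:
        words = [w for w, _ in sent]
        for i, (_, tag) in enumerate(sent):
            X.append(word_features(words, i))
            y.append(tag)

    # DictVectorizer one-hot encodes the feature dicts; LinearSVC is the classifier.
    model = Pipeline([('vec', DictVectorizer(sparse=True)), ('svm', LinearSVC())])
    split = int(0.8 * len(X))
    model.fit(X[:split], y[:split])
    print('held-out accuracy:', model.score(X[split:], y[split:]))
    ```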

Hidden Markov Model

  • Instructions for running the code

    The file main.ipynb contains the HMM implementation. Open it with Jupyter Notebook.

    The following packages need to be installed: nltk, pandas, seaborn, matplotlib, scikit-learn, tqdm

    Also make sure to run the following commands in the notebook:

    import nltk
    nltk.download('brown')
    nltk.download('universal_tagset')


    The Hidden Markov Model directory contains a copy of all the images, plots and data produced by main.ipynb. A simplified sketch of the HMM tagger is shown after the results below.

  • Results

    • Per-POS accuracy vs Relative Frequency

      (Plots: per-POS accuracy on the train set and per-POS accuracy on the test set.)

    • Accuracy

      | Model | Test Accuracy (%) | Train Accuracy (%) |
      |-------|-------------------|--------------------|
      | HMM   | 96.01             | 97.35              |
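
  • Simplified sketch (illustrative)

    As a rough sketch of the approach, a supervised bigram HMM tagger can be trained on the Brown corpus with nltk's built-in HiddenMarkovModelTrainer, which estimates transition and emission probabilities and decodes with Viterbi. The smoothing choice and corpus split here are assumptions for illustration; main.ipynb may differ in these details.

    ```python
    # Illustrative sketch only (assumption): a supervised bigram HMM tagger using
    # nltk's built-in trainer, not the actual code from main.ipynb.
    import nltk
    from nltk.probability import LidstoneProbDist
    from nltk.tag import hmm

    nltk.download('brown')
    nltk.download('universal_tagset')

    tagged_sents = list(nltk.corpus.brown.tagged_sents(tagset='universal'))
    split = int(0.9 * len(tagged_sents))
    train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

    # Estimates transition P(tag_i | tag_{i-1}) and emission P(word | tag) from the
    # training data; Lidstone smoothing avoids zero probabilities for rare events.
    trainer = hmm.HiddenMarkovModelTrainer()
    tagger = trainer.train_supervised(
        train_sents,
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins),
    )

    # Tagging uses Viterbi decoding; evaluate on a small slice to keep the demo quick.
    print('test accuracy:', tagger.accuracy(test_sents[:500]))
    print(tagger.tag(['The', 'dog', 'barked', 'loudly']))
    ```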

Bi-LSTM

  • Baseline Model Architecture

    | Layer (type)          | Output Shape     | # Params   |
    |-----------------------|------------------|------------|
    | embedding (Embedding) | (None, 180, 300) | 14,944,800 |
    | bidirectional         | (None, 180, 64)  | 85,248     |
    | time_distributed      | (None, 180, 13)  | 845        |

    Total params: 15,030,893
    Trainable params: 86,093
    Non-trainable params: 14,944,800

  • Instructions for running the code

    BiLSTMBaseline.ipynb contains the implementation of the baseline Bi-LSTM model.

    Run the baseline code from https://colab.research.google.com/drive/1lhBd-gxsXNVeQJtBXoLZ7HABI5yYttrw?usp=sharing

    The CNN-based code is available in two formats:

    1. CNNBiLSTMCRF.ipynb, a Jupyter notebook, and
    2. main.py, which can be run as a Python script.

    The following packages need to be installed for the CNN-based code: pytorch, nltk, torchvision, numpy, seaborn, pandas, matplotlib, tqdm, scikit-learn

    Also download the GloVe embeddings from http://nlp.stanford.edu/data/wordvecs/glove.6B.zip and extract them in the source directory.

    Execute the following commands for the CNN-based code:

    import nltk
    nltk.download('brown')
    nltk.download('universal_tagset')


    The results of running the CNN-based code are stored in the Bi-LSTM folder. A simplified Keras sketch of the baseline architecture is shown after the results below.

  • Results

    • Per-POS accuracy vs Relative Frequency

      (Plots: per-POS accuracy on the train set and per-POS accuracy on the test set.)

    • Accuracy

      | Model            | Test Accuracy (%) | Train Accuracy (%) |
      |------------------|-------------------|--------------------|
      | Bi-LSTM Baseline | 79.38             | 78.39              |
      | Bi-LSTM-CNN      | 87.15             | 87.19              |
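
  • Simplified sketch (illustrative)

    The snippet below is a minimal Keras sketch that reproduces the baseline layer shapes from the architecture table above (sequence length 180, frozen 300-dimensional embeddings, a 32-unit Bi-LSTM and a 13-tag softmax). The vocabulary size is inferred from the parameter count, and the zero-filled embedding matrix is a placeholder for the GloVe vectors; it is not the actual code in BiLSTMBaseline.ipynb.

    ```python
    # Illustrative sketch only (assumption): a Keras model matching the baseline
    # layer shapes above, not the actual code from BiLSTMBaseline.ipynb. The real
    # embedding matrix would be filled with the GloVe vectors mentioned earlier.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import Input, Sequential
    from tensorflow.keras.layers import Bidirectional, Dense, Embedding, LSTM, TimeDistributed

    MAX_LEN = 180        # padded sentence length
    VOCAB_SIZE = 49816   # inferred from 14,944,800 / 300 embedding parameters
    EMBED_DIM = 300      # GloVe vector size
    NUM_TAGS = 13        # universal tagset plus a padding tag

    # Placeholder embedding matrix; in practice load the GloVe vectors into it.
    embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype='float32')

    model = Sequential([
        Input(shape=(MAX_LEN,)),
        Embedding(VOCAB_SIZE, EMBED_DIM,
                  embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                  trainable=False),                      # frozen -> non-trainable params
        Bidirectional(LSTM(32, return_sequences=True)),  # 2 x 32 units = 64-dim outputs
        TimeDistributed(Dense(NUM_TAGS, activation='softmax')),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()  # shapes and parameter counts should match the table above
    ```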

References

  1. Léon Bottou, "Une approche théorique de l'apprentissage connexionniste et applications à la reconnaissance de la parole" (A theoretical approach to connectionist learning with applications to speech recognition), PhD thesis (1991)
  2. Xuezhe Ma and Eduard Hovy, "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (2016)
  3. Jesús Giménez and Lluís Màrquez, "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited" (2003)
  4. Mathieu Blondel, Akinori Fujino and Naonori Ueda, "Large-scale Multiclass Support Vector Machine Training via Euclidean Projection onto the Simplex" (2014)