zh-word-freq

Commits

List of commits on branch master.

  • 62f670d2e99401906305524a37eead89d6db1deb: add -a for cut_all (committed 3 years ago)
  • 8e477f0601eda14a3f38291e97237a7f98a4812f: add simple argument parsing (committed 5 years ago)
  • c6efd33a0744523c5b54937bfba3e4deef7e37b0: update readme (committed 5 years ago)
  • eb3c53e0f3b2b37e9da230f1a0f84c7618d72d8d: update readme (committed 5 years ago)
  • 4f8b293f3a0b5c41c633f356df86406465a808f1: add README (committed 5 years ago)
  • 631030fcfb20efa3acf1b4cfda01317dd6ade6ec: initial commit (committed 5 years ago)

README

Chinese word frequency

This simple Python script uses jieba to count all the words in a file and display the most frequent ones.
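As a rough illustration of the idea (not the actual contents of seg.py), counting segmented words might look like the sketch below. The function names count_words, segment, and main are hypothetical; jieba.cut and its cut_all parameter are the real jieba API. jieba is imported lazily so the counting logic can be tried without it installed.

```python
from collections import Counter


def count_words(tokens):
    """Count tokens, skipping whitespace-only ones (spaces, newlines)."""
    return Counter(tok for tok in tokens if tok.strip())


def segment(text, cut_all=False):
    """Segment Chinese text with jieba (third-party, imported lazily)."""
    import jieba
    return jieba.cut(text, cut_all=cut_all)


def main(path, limit=10):
    """Hypothetical entry point: print the `limit` most frequent words in a file."""
    with open(path, encoding="utf-8") as f:
        counts = count_words(segment(f.read()))
    for word, freq in counts.most_common(limit):
        print(f"{word}\t{freq}")
```

Keeping the counting separate from the segmentation is only a design choice for this sketch; it lets count_words be exercised with any list of tokens.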

It is currently very basic and offers little configuration.

Requirements

Install jieba by running pip install jieba.

Usage

usage: seg.py [-h] [-a] file

positional arguments:
  file        the text file to process

optional arguments:
  -h, --help  show this help message and exit
  -a, --all   find all possible words
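A command line like the one above could be declared with argparse roughly as follows. This is a sketch reconstructed from the help text, not the script's actual code; build_parser is a hypothetical name. The commit history suggests -a is passed to jieba as cut_all, but how the script wires that up is not shown here.

```python
import argparse


def build_parser():
    """Build a parser matching the documented usage: seg.py [-h] [-a] file."""
    parser = argparse.ArgumentParser(prog="seg.py")
    parser.add_argument("file", help="the text file to process")
    parser.add_argument(
        "-a", "--all",
        action="store_true",  # flag is False unless -a/--all is given
        help="find all possible words",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["-a", "book.txt"])
```

After parsing, args.file holds the path and args.all is a boolean that could be forwarded as the cut_all argument.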

Output

(from 1984 by George Orwell)

温斯顿       	587
可能        	325
没有        	323
知道        	258
想         	242
党         	233
奥         	218
布兰        	210
已经        	206
这种        	187
...

To hide any of these words, or to show more than ten, you currently have to edit the code. A default stop list and support for your own additional stop lists are planned. See Current issues and goals for the corresponding issues.
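Until those options exist, the kind of edit involved might look like this sketch. The names top_words and format_row, the limit and stop_words parameters, and the column width of 10 (inferred from the sample output) are all assumptions, not taken from the script.

```python
from collections import Counter


def top_words(counts, limit=10, stop_words=frozenset()):
    """Return up to `limit` (word, count) pairs, most frequent first,
    leaving out any word found in `stop_words`."""
    kept = [(w, c) for w, c in counts.most_common() if w not in stop_words]
    return kept[:limit]


def format_row(word, count, width=10):
    """Pad the word to `width` characters, then a tab, then the count."""
    return f"{word.ljust(width)}\t{count}"


# Example with a few counts from the sample output.
counts = Counter({"温斯顿": 587, "可能": 325, "没有": 323})
for word, count in top_words(counts, limit=2, stop_words={"可能"}):
    print(format_row(word, count))
```

Filtering before truncating means the limit applies to the words that are actually shown, so a long stop list does not shorten the output.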

Current issues and goals

  • [x] missing argument just errs out
  • [ ] add option for (multiple) stop lists
  • [ ] add option to set delimiter
  • [ ] add minimum frequency
  • [ ] handle standard input
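The open items above are not implemented yet. Purely as a sketch of one possible shape, with hypothetical names (read_text, render) and the assumed convention that a path of "-" means standard input:

```python
import sys


def read_text(path):
    """Read the whole input, from standard input when path is "-"
    (an assumed convention), otherwise from the named UTF-8 file."""
    if path == "-":
        return sys.stdin.read()
    with open(path, encoding="utf-8") as f:
        return f.read()


def render(rows, delimiter="\t", min_freq=1):
    """Format (word, count) pairs as lines, dropping words that occur
    fewer than `min_freq` times."""
    return [
        f"{word}{delimiter}{count}"
        for word, count in rows
        if count >= min_freq
    ]
```

For instance, render(rows, delimiter=",", min_freq=5) would give comma-separated lines for words seen at least five times.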

Have fun Chinese word frequencying!

Sources

  • Chinese word list copy-pasted (with modification) from stopwords-zh