Corpora
- CWB Chinese word database (CWB中文词库): http://www.cwbbase.com/
- CC-CEDICT: an openly licensed Chinese-English dictionary.
- Japanese-English News Article Alignment Data (JENAAD)
- StarDict (星際譯王): bundles many Chinese-English dictionaries.
- http://www.dict.org/
- http://freedict.org/en/
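- Dictionary compendium (字典總匯)
- XDXF dictionary repository: http://xdxf.revdanica.com/down/
- Wikipedia:Database download (can serve as a source of parallel corpora).
- WordNet: the WordNet lexical database maintained by Princeton University.
- HowNet (知網): a machine-oriented dictionary based on sememe decomposition, created by Dong Zhendong (董振東).
- Traditional-Simplified Chinese mapping tables collected by the Mozilla project: http://moztw.org/docs/big5/
- CALPER corpus collection: http://calper.la.psu.edu/corpus_portal/chinese_corp_guide_trad.php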
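As an illustration of how a dictionary such as CC-CEDICT from the list above might be loaded, here is a minimal sketch. It assumes the commonly documented one-entry-per-line layout `Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/`, with comment lines starting with `#`. The names `CedictEntry` and `parse_cedict_line` are hypothetical, and the sample lines are hand-written in that layout rather than copied from the dictionary file.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed line layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
_ENTRY = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


class CedictEntry(NamedTuple):
    traditional: str
    simplified: str
    pinyin: str
    glosses: List[str]


def parse_cedict_line(line: str) -> Optional[CedictEntry]:
    """Parse one dictionary line; return None for blank, comment, or malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _ENTRY.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, gloss_field = match.groups()
    return CedictEntry(traditional, simplified, pinyin, gloss_field.split("/"))


# Hand-written sample line in the assumed layout.
entry = parse_cedict_line("語料庫 语料库 [yu3 liao4 ku4] /corpus/")
print(entry.simplified, entry.pinyin, entry.glosses)
```

Returning `None` for lines that do not match keeps the loader tolerant of header comments; a stricter loader might raise an error instead.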
The following list is adapted from Bob's Blog: http://blog.163.com/wanglaoji_uibe/blog/static/138471014201062343316179/
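- Beijing Language and Culture University, Institute of Language Information Processing: CCRL Chinese retrieval tool (漢語檢索通)
- Peking University annotated People's Daily corpus: http://www.icl.pku.edu.cn
- Beijing Language and Culture University corpora: http://www.blcu.edu.cn/kych/H.htm
- Tsinghua University balanced Chinese corpus TH-ACorpus: http://www.lits.tsinghua.edu.cn/ainlp/source.htm
- Shanxi University corpus: http://www.sxu.edu.cn/homepage/cslab/sxuc1.htm
- Academia Sinica (Taiwan) corpora:
  - Tagged corpus of Early Mandarin Chinese (近代漢語標記語料庫): http://www.sinica.edu.tw/Early_Mandarin/
  - Corpus of Old Chinese (古漢語語料庫): http://www.sinica.edu.tw/ftms-bin/ftmsw3
  - Formosan Language Archive (臺灣南島語典藏): http://www.ling.sinica.edu.tw/Formosan/
  - Southern Min Archives (閩南語典藏): http://southernmin.sinica.edu.tw/
  - Electronic texts of Chinese classics (漢籍電子文獻): http://www.sinica.edu.tw/~tdbproj/handy1/
- City University of Hong Kong LIVAC synchronous corpus: http://www.rcl.cityu.edu.hk/livac/
- Zhejiang Normal University corpus of historical documents: http://lib.zjnu.net.cn/xueke/hyywzx/xkjj.htm
- Chinese Academy of Sciences, Institute of Computing Technology bilingual corpus: http://mtgroup.ict.ac.cn/corpus/query_process.php
- Chinese Linguistic Data Consortium (中文語言資源聯盟): http://www.chineseldc.org/xyzy.htm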
The following corpora were recommended by Professor Hongyin Tao (陶虹印) of the University of California.
- The Singapore Corpus of Research in Education
- The International Corpus of Crosslinguistic Interlanguage
- The Singapore Corpus of Preschoolers' Spoken Mandarin
- A Corpus of Mandarin Textbooks in Singapore and Malaysia
- An Investigation in Peer Work and Peer Talk in Singapore Primary Classrooms
- A Chinese-English Parallel Corpus of Newspaper Advertisements
- Hongloumeng Chinese-English Parallel Corpus
- A Parallel Corpus of Chinese Legal Texts
- The Babel English-Chinese Parallel Corpus
- A Parallel Corpus and Web Concordances of Five Versions of Laozi
- A Corpus Database of Xuan Ying's Glossary of Buddhist Sutra
- The Lancaster Corpus of Mandarin Chinese
- The UCLA Corpus of Written Chinese
- A Web Concordancer for Modern Chinese Literature
- A Web Concordancer for Modern Chinese Literature (with Chinese segmentation and POS tagging)
Others
4. Roget Thesaurus http://www.thesaurus.com
Semantic networks
– add more semantic relations
5. WordNet http://www.cogsci.princeton.edu/~wn/
6. Dictionary files, source code
7. EuroWordNet http://www.illc.uva.nl/EuroWordNet/
Machine Learning Algorithms
- (Many implementations available online)
8. Weka: Java package of many learning algorithms http://www.cs.waikato.ac.nz/ml/weka/
9. Includes decision trees, decision lists, neural networks, naive Bayes, instance-based learning, etc.
10. C4.5: C implementation of decision trees, http://www.cse.unsw.edu.au/~quinlan/
11. Timbl: Fast, optimized implementation of instance-based learning algorithms, http://ilk.kub.nl/software.html
12. SVM Light: efficient implementation of Support Vector Machines, http://svmlight.joachims.org
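To make the link between these learners and word sense disambiguation concrete, here is a minimal sketch of one of the algorithm families named above, naive Bayes, applied to a bag of context words. It is a pure-Python illustration and not code from Weka or any other package listed; the function names and the tiny training set for the word "bank" are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (context_words, sense). Returns the counts needed for prediction."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for word in words:
            word_counts[sense][word] += 1
            vocabulary.add(word)
    return sense_counts, word_counts, vocabulary


def predict_sense(model, words):
    """Return the sense with the highest log posterior, using add-one smoothing."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[sense].values()) + len(vocabulary)
        for word in words:
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[sense][word] + 1) / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy, hand-made training data for the ambiguous word "bank" (not from a real corpus).
training = [
    (["money", "deposit", "loan"], "finance"),
    (["account", "money", "interest"], "finance"),
    (["river", "water", "fishing"], "shore"),
    (["river", "mud", "grass"], "shore"),
]
model = train_naive_bayes(training)
print(predict_sense(model, ["loan", "interest"]))
print(predict_sense(model, ["water", "grass"]))
```

With real sense-tagged data, the context words would typically come from a window around the target word, and a toolkit such as those listed above would replace this hand-rolled classifier.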
Sense Tagged Data
13. A lot of annotated data available through Senseval, http://www.senseval.org
14. Data for lexical sample, http://teach-computers.org
(1) English (with respect to Hector, WordNet, Wordsmyth)
(2) Basque, Catalan, Chinese, Czech, Romanian, Spanish, etc.
(3) Data produced within the Open Mind Word Expert project
15. Data for all words, http://www.cs.unt.edu/~rada/downloads.html
(1) English, Italian, Czech (Senseval-2 and Senseval-3)
(2) SemCor (200,000 running words)
16. Pointers to additional data, http://www.senseval.org/data.html
Raw Data
— For use with Bootstrapping algorithms, Word sense discrimination algorithms
17. British National Corpus, 100 million words covering a variety of genres and styles, http://www.natcorp.ox.ac.uk/
18. TREC (Text Retrieval Conference) data, including the Los Angeles Times, Wall Street Journal, and more, 5 gigabytes of text, http://trec.nist.gov/
19. The Web
WSD Software – Targeted Disambiguation
20. Duluth Senseval-2 systems, Lexical decision tree systems that participated in Senseval-2 and Senseval-3, http://www.d.umn.edu/~tpederse/senseval2.html
21. SyntaLex, Enhances the Duluth Senseval-2 systems with syntactic features, participated in Senseval-3, http://www.d.umn.edu/~tpederse/syntalex.html
22. WSDShell, Shell for running Weka experiments with wide range of options, http://www.d.umn.edu/~tpederse/wsdshell.html
23. SenseTools, For easy implementation of supervised WSD, used by the above 3 systems. Transforms Senseval-formatted data into the files required by Weka. http://www.d.umn.edu/~tpederse/sensetools.html
24. SenseRelate::TargetWord (Demo on Tuesday, July 12, 6:30-9:30pm)
(1) Identifies the sense of a word based on the semantic relation with its neighbors, http://search.cpan.org/dist/WordNet-SenseRelate-TargetWord
(2) Uses WordNet::Similarity, measures of similarity based on WordNet, http://search.cpan.org/dist/WordNet-Similarity
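The idea behind item 24, choosing the sense of a target word that is most related to the senses of its neighbors, can be sketched as follows. This is a simplified illustration and not the SenseRelate implementation: the sense inventory `SENSES` is a hypothetical toy (not WordNet data), and the relatedness measure is plain gloss-word overlap standing in for the WordNet::Similarity measures.

```python
# Hypothetical toy sense inventory: word -> {sense label: gloss}. Not WordNet data.
SENSES = {
    "bank": {
        "bank#finance": "an institution that accepts money deposits and makes loans",
        "bank#shore": "sloping land beside a river or lake",
    },
    "river": {"river#stream": "a large natural stream of water flowing over land"},
    "water": {"water#liquid": "a clear liquid that forms a river, lake or sea"},
    "loan": {"loan#debt": "money that an institution lends and expects back"},
}

STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "over", "beside"}


def gloss_words(gloss):
    """Lower-cased content words of a gloss, with simple punctuation stripped."""
    return {word.strip(",.") for word in gloss.lower().split()} - STOPWORDS


def relatedness(gloss_a, gloss_b):
    """Stand-in relatedness measure: number of content words shared by two glosses."""
    return len(gloss_words(gloss_a) & gloss_words(gloss_b))


def disambiguate(target, neighbors):
    """Pick the target sense with the highest total relatedness to its neighbors."""
    best_sense, best_score = None, -1
    for sense, gloss in SENSES[target].items():
        score = 0
        for neighbor in neighbors:
            neighbor_glosses = SENSES.get(neighbor, {}).values()
            # Score each neighbor by its best-matching sense.
            score += max((relatedness(gloss, g) for g in neighbor_glosses), default=0)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


print(disambiguate("bank", ["river", "water"]))
print(disambiguate("bank", ["loan"]))
```

Swapping `relatedness` for a WordNet-based measure would bring the sketch closer to what the listed packages describe, at the cost of an external dependency.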
WSD Software – All Words
25. SenseLearner, A minimally supervised approach for all open class words, extension of a system that participated in Senseval-3, http://lit.csci.unt.edu/~senselearner
26. SenseRelate::AllWords, Identifies the sense of a word based on the semantic relation with its neighbors, http://search.cpan.org/dist/WordNet-SenseRelate-AllWords
WSD Software – Unsupervised
27. Clustering by Committee, http://www.cs.ualberta.ca/~lindek/demos/wordcluster.htm
28. InfoMap, Represents the meanings of words in vector space, http://infomap-nlp.sourceforge.net
29. SenseClusters, Finds clusters of words that occur in similar contexts, http://senseclusters.sourceforge.net (Demo on Tuesday, July 12, 6:30-9:30pm)
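The vector-space idea shared by the unsupervised tools above can be illustrated with a small sketch: each word is represented by counts of the words that occur near it, and words used in similar contexts end up with similar vectors. This is a generic illustration under simple assumptions (a fixed window, raw counts, cosine similarity), not the method of any specific package listed; the function names and sentences are made up.

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[start:i] + tokens[i + 1:end]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy sentences, made up for illustration.
sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases mice".split(),
    "the dog chases cats".split(),
    "stocks fell on monday".split(),
]
vectors = context_vectors(sentences)
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["stocks"]))
```

A clustering step, for example grouping words or contexts whose cosine similarity exceeds a threshold, would sit on top of these vectors.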
How to get your algorithms tested?
• Senseval, Evaluation of WSD systems http://www.senseval.org
• Senseval 1: 1999 – about 10 teams
• Senseval 2: 2001 – about 30 teams
• Senseval 3: 2004 – about 55 teams
• Senseval 4: 2007(?)
• Provides sense annotated data for many languages, for several tasks
– Languages: English, Romanian, Chinese, Basque, Spanish, etc.
– Tasks: Lexical Sample, All words, etc.
• Provides evaluation software
• Provides results of other participating systems
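As a rough picture of what evaluation in this setting measures, the sketch below computes precision (correct answers over attempted instances) and recall (correct answers over all gold instances). It is a simplified stand-in and not the Senseval scoring software, which should be used for official results; the instance ids, sense labels, and the function name `score` are hypothetical.

```python
def score(gold, predictions):
    """gold: instance id -> correct sense; predictions: instance id -> guessed sense.

    Instances missing from `predictions` count as not attempted.
    """
    attempted = [instance for instance in predictions if instance in gold]
    correct = sum(1 for instance in attempted if predictions[instance] == gold[instance])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical instance ids and sense labels.
gold = {"bank.1": "finance", "bank.2": "shore", "bank.3": "finance", "bank.4": "shore"}
predictions = {"bank.1": "finance", "bank.2": "finance", "bank.3": "finance"}
precision, recall = score(gold, predictions)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A system that answers every instance has equal precision and recall; one that abstains on hard cases can trade recall for precision.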