Natural Language Processing -- Corpora and Dictionaries

Corpora

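  1. CWB Chinese lexicon (CWB中文词库): http://www.cwbbase.com/
  2. CC-CEDICT: an openly licensed Chinese-English dictionary.
  3. Japanese-English News Article Alignment Data (JENAAD)
  4. StarDict (星際譯王): collects many Chinese-English dictionaries.
  5. http://www.dict.org/
  6. http://freedict.org/en/
  7. Dictionary collection (字典總匯)
  8. XDXF dictionary repository: http://xdxf.revdanica.com/down/
  9. Wikipedia database downloads: usable as a source of parallel corpora.
  10. WordNet: the WordNet lexical database maintained by Princeton University.
  11. HowNet (知網): a machine-oriented dictionary, created by Dong Zhendong (董振東), that decomposes word meanings into sememes.
  12. Traditional/Simplified Chinese mapping table collected by the Mozilla project: http://moztw.org/docs/big5/
  13. CALPER corpus collection: http://calper.la.psu.edu/corpus_portal/chinese_corp_guide_trad.php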
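
Dictionaries such as CC-CEDICT are distributed as plain text, so they are easy to load into a program. The sketch below assumes the usual CC-CEDICT line layout (traditional form, simplified form, pinyin in square brackets, then glosses separated by slashes). The names `CedictEntry` and `parse_cedict_line` are made up for this illustration, and the sample line is illustrative rather than copied from the dictionary file.

```python
import re
from typing import List, NamedTuple, Optional


class CedictEntry(NamedTuple):
    """One dictionary entry (hypothetical container for this sketch)."""
    traditional: str
    simplified: str
    pinyin: str
    glosses: List[str]


# Assumed line layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
_LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line: str) -> Optional[CedictEntry]:
    """Parse one line; return None for comments, blanks, or malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE_RE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, gloss_blob = match.groups()
    return CedictEntry(traditional, simplified, pinyin, gloss_blob.split("/"))


# Illustrative sample line, not taken from the actual file.
entry = parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/")
print(entry)
```

Returning `None` for comment and malformed lines lets a caller filter a whole file with a single comprehension instead of handling exceptions per line.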

The following list is adapted from Bob's Blog: http://blog.163.com/wanglaoji_uibe/blog/static/138471014201062343316179/

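  1. Beijing Language and Culture University, Institute of Language Information Processing: CCRL Chinese retrieval tool (漢語檢索通)
  2. Peking University annotated People's Daily corpus: http://www.icl.pku.edu.cn
  3. Beijing Language and Culture University corpus: http://www.blcu.edu.cn/kych/H.htm
  4. Tsinghua University balanced Chinese corpus TH-ACorpus: http://www.lits.tsinghua.edu.cn/ainlp/source.htm
  5. Shanxi University corpus: http://www.sxu.edu.cn/homepage/cslab/sxuc1.htm
  6. Academia Sinica (Taiwan) corpora:
  7. Tagged corpus of Early Mandarin Chinese: http://www.sinica.edu.tw/Early_Mandarin/
  8. Old Chinese corpus: http://www.sinica.edu.tw/ftms-bin/ftmsw3
  9. Formosan (Taiwan Austronesian) language archive: http://www.ling.sinica.edu.tw/Formosan/
  10. Southern Min archive: http://southernmin.sinica.edu.tw/
  11. Electronic archive of Chinese classical texts (漢籍電子文獻): http://www.sinica.edu.tw/~tdbproj/handy1/
  12. City University of Hong Kong LIVAC synchronous corpus: http://www.rcl.cityu.edu.hk/livac/
  13. Zhejiang Normal University historical document corpus: http://lib.zjnu.net.cn/xueke/hyywzx/xkjj.htm
  14. Institute of Computing Technology, Chinese Academy of Sciences, bilingual corpus: http://mtgroup.ict.ac.cn/corpus/query_process.php
  15. Chinese Linguistic Data Consortium: http://www.chineseldc.org/xyzy.htm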

Below are some corpora introduced by Professor Tao Hongyin (陶虹印) of the University of California.

  1. The Singapore Corpus of Research in Education
  2. The International Corpus of Crosslinguistic Interlanguage
  3. The Singapore Corpus of Preschoolers' Spoken Mandarin
  4. A Corpus of Mandarin Textbooks in Singapore and Malaysia
  5. An Investigation in Peer Work and Peer Talk in Singapore Primary Classrooms
  6. A Chinese-English Parallel Corpus of Newspaper Advertisements
  7. Hongloumeng Chinese-English Parallel Corpus
  8. A Parallel Corpus of Chinese Legal Texts
  9. The Babel English-Chinese Parallel Corpus
  10. A Parallel Corpus and Web Concordances of Five Versions of Laozi
  11. A Corpus Database of Xuan Ying's Glossary of Buddhist Sutra
  12. The Lancaster Corpus of Mandarin Chinese
  13. The UCLA Corpus of Written Chinese
  14. A Web Concordancer for Modern Chinese Literature
  15. A Web Concordancer for Modern Chinese Literature (with Chinese segmentation and POS tagging)

Others

4. Roget Thesaurus http://www.thesaurus.com

Semantic networks
– add more semantic relations
5. WordNet http://www.cogsci.princeton.edu/~wn/
6. Dictionary files, source code
7. EuroWordNet http://www.illc.uva.nl/EuroWordNet/

Machine Learning Algorithms
- (Many implementations available online)
8. Weka: Java package of many learning algorithms http://www.cs.waikato.ac.nz/ml/weka/
9. Includes decision trees, decision lists, neural networks, naïve Bayes, instance-based learning, etc.
10. C4.5: C implementation of decision trees, http://www.cse.unsw.edu.au/~quinlan/
11. Timbl: Fast optimized implementation of instance-based learning algorithms, http://ilk.kub.nl/software.html
12. SVM Light: efficient implementation of Support Vector Machines, http://svmlight.joachims.org

Sense Tagged Data
13. A lot of annotated data available through Senseval, http://www.senseval.org
14. Data for lexical sample, http://teach-computers.org
(1) English (with respect to Hector, WordNet, Wordsmyth)
(2) Basque, Catalan, Chinese, Czech, Romanian, Spanish, etc.
(3) Data produced within Open Mind Word Expert project
15. Data for all words, http://www.cs.unt.edu/~rada/downloads.html
(1) English, Italian, Czech (Senseval-2 and Senseval-3)
(2) SemCor (200,000 running words)
16. Pointers to additional data available from http://www.senseval.org/data.html

Raw Data
— For use with Bootstrapping algorithms, Word sense discrimination algorithms
17. British National Corpus, 100 million words covering a variety of genres and styles, http://www.natcorp.ox.ac.uk/
18. TREC (Text Retrieval Conference) data, Los Angeles Times, Wall Street Journal, and more, 5 gigabytes of text, http://trec.nist.gov/
19. The Web

WSD Software – Targeted Disambiguation
20. Duluth Senseval-2 systems, Lexical decision tree systems that participated in Senseval-2 and 3, http://www.d.umn.edu/~tpederse/senseval2.html
21. SyntaLex, Enhances Duluth Senseval-2 with syntactic features, participated in Senseval-3, http://www.d.umn.edu/~tpederse/syntalex.html
22. WSDShell, Shell for running Weka experiments with a wide range of options, http://www.d.umn.edu/~tpederse/wsdshell.html
23. SenseTools, For easy implementation of supervised WSD, used by the above 3 systems. Transforms Senseval-formatted data into the files required by Weka. http://www.d.umn.edu/~tpederse/sensetools.html
24. SenseRelate::TargetWord (Demo on Tuesday, July 12 (6:30-9:30pm))
(1) Identifies the sense of a word based on the semantic relation with its neighbors, http://search.cpan.org/dist/WordNet-SenseRelate-TargetWord
(2) Uses WordNet::Similarity, measures of similarity based on WordNet, http://search.cpan.org/dist/WordNet-Similarity
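
To give a feel for disambiguating a target word from its neighbors, here is a minimal sketch of the simplified Lesk idea: pick the sense whose gloss shares the most words with the context. This is not the SenseRelate implementation, which relies on WordNet::Similarity measures. The sense inventory `TOY_SENSES`, the sense labels, and the stopword list are all hypothetical toy data.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy sense inventory; a real system would read glosses from WordNet.
TOY_SENSES: Dict[str, Dict[str, str]] = {
    "bank": {
        "bank#finance": "an institution that accepts deposits and lends money",
        "bank#river": "sloping land beside a river or lake",
    }
}

# Tiny illustrative stopword list.
STOPWORDS: Set[str] = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "by", "at"}


def content_words(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(target: str, context: List[str],
                    senses: Dict[str, Dict[str, str]] = TOY_SENSES) -> Optional[str]:
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = content_words(" ".join(context)) - {target}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses[target].items():
        overlap = len(content_words(gloss) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


print(simplified_lesk("bank", ["she", "sat", "by", "the", "river", "bank"]))
print(simplified_lesk("bank", ["he", "lends", "money", "at", "the", "bank"]))
```

Plain word overlap is brittle (there is no stemming, so "deposit" would not match "deposits"), which is one reason similarity measures over a lexical network are used in practice.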

WSD Software – All Words
25. SenseLearner, A minimally supervised approach for all open class words, Extension of a system participating in Senseval-3, http://lit.csci.unt.edu/~senselearner
26. SenseRelate::AllWords, Identifies the sense of a word based on the semantic relation with its neighbors, http://search.cpan.org/dist/WordNet-SenseRelate-AllWords

WSD Software – Unsupervised
27. Clustering by Committee, http://www.cs.ualberta.ca/~lindek/demos/wordcluster.htm
28. InfoMap, Represent the meanings of words in vector space, http://infomap-nlp.sourceforge.net
29. SenseClusters, Finds clusters of words that occur in similar context, http://senseclusters.sourceforge.net, Demo on Tuesday, July 12 (6:30-9:30pm)
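
The unsupervised tools above share the idea that words occurring in similar contexts have similar meanings. The sketch below illustrates that idea with raw co-occurrence counts and cosine similarity. It is not how InfoMap or SenseClusters are implemented, and the function names, window size, and toy corpus are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(sentences: List[List[str]], window: int = 2) -> Dict[str, Counter]:
    """Count, for each word, the words seen within `window` positions of it."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration.
corpus = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "car", "needs", "fuel"],
]
vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["cat"], vectors["dog"]))  # shares "the" and "drinks"
print(cosine(vectors["cat"], vectors["car"]))  # shares only "the"
```

On this toy corpus "cat" comes out closer to "dog" than to "car". Real systems typically reweight the counts and reduce dimensionality before clustering.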

How to get your algorithms tested?
• Senseval, Evaluation of WSD systems http://www.senseval.org
• Senseval 1: 1999 – about 10 teams
• Senseval 2: 2001 – about 30 teams
• Senseval 3: 2004 – about 55 teams
• Senseval 4: 2007(?)
• Provides sense annotated data for many languages, for several tasks
– Languages: English, Romanian, Chinese, Basque, Spanish, etc.
– Tasks: Lexical Sample, All words, etc.
• Provides evaluation software
• Provides results of other participating systems
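
For orientation, WSD results in this kind of evaluation are usually reported as precision (correct answers over attempted instances) and recall (correct answers over all instances). The sketch below computes those figures for a simplified setting with one gold sense per instance. It is not the official Senseval scorer, and the function name, instance IDs, and sense labels are hypothetical.

```python
from typing import Dict, Tuple


def score_wsd(gold: Dict[str, str], predicted: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (precision, recall, coverage).

    `gold` maps instance IDs to the correct sense.
    `predicted` maps instance IDs to the system's answer; an instance the
    system skipped is simply absent from `predicted`.
    """
    attempted = [instance for instance in predicted if instance in gold]
    correct = sum(1 for instance in attempted if predicted[instance] == gold[instance])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    coverage = len(attempted) / len(gold) if gold else 0.0
    return precision, recall, coverage


# Hypothetical instance IDs and sense labels.
gold = {"d1.w1": "bank#river", "d1.w2": "bank#finance",
        "d1.w3": "plant#factory", "d1.w4": "plant#flora"}
predicted = {"d1.w1": "bank#river", "d1.w2": "bank#river", "d1.w3": "plant#factory"}

precision, recall, coverage = score_wsd(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} coverage={coverage:.3f}")
```

Keeping precision and recall separate rewards a system that abstains on hard instances with higher precision, while recall still reflects the instances it left unanswered.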


Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License