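The Development and Application of the Chinese and Cross-Lingual Term Extraction for Buddhist Digital Archives -- National Digital Archives Program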

The Development and Application of
the Chinese and Cross-Lingual Term Extraction for Buddhist Digital Archives

National Digital Archives Program, Taiwan
Research & Development of Technology Division

Introduction

This project is one of the technology sub-projects of the National Science Council's National Digital Archives Program, Taiwan. Its goal is to support an environment and research platform on which Buddhist scholars can conveniently build subject-specific knowledge structures.
The platform is conceived as a friendly and efficient interface through which Buddhist scholars can work on large digital Buddhist collections: statistical analysis, information retrieval and extraction, document classification and clustering, data mining, and similar tasks. This gives researchers methods that differ from those of traditional Buddhist studies, along with more varied reference resources and results.
Term extraction and the construction of cross-lingual term lists are the foundation for such a platform. For Buddhist scriptures that exist in different versions and languages, this work can also be applied to compiling reference works and to philological and text-critical research, and it opens up a number of statistical research questions.

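Events

◎ 2006.03.01	Project begins
◎ 2006.06	First round of term extraction from contemporary Buddhist studies literature completed
◎ 2006.06	Collection of multilingual versions for the parallel corpus completed; markup begins
◎ 2006.07.07	Paper submission deadline for the Digital Archives Technology Conference
◎ 2006.08.15~18	Attended the PNC conference
◎ 2006.09.01	Fifth Digital Archives Technology Conference
◎ 2006.10	Debugging of the Chinese term extraction program completed
◎ 2006.10	Term extraction from classical texts (CBETA) completed; second round of extraction from contemporary literature
◎ 2006.11.20	Concordance website goes online for testing
◎ 2006.11.22	Project website established (this page)
◎ 2007.01.31	Paper submitted to the Third International Conference on Literature and Information Technology
◎ 2007.02.05	Concordance and spatio-temporal geographic search platform goes online for testing
◎ 2007.03.19~20	Attended the Third International Conference on Literature and Information Technology
◎ 2007.04.15	Term extraction program modified for speed
◎ 2007.04.19	Baseline analysis of term extraction from classical texts completed
◎ 2007.05.02	Baseline analysis of term extraction from contemporary literature completed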

Project Documents

1. Proposal	2. Project Report I ; II
3. PowerPoint Slides (Chinese)	4. Poster
5. The Suffix Array Program File (Index)	6. The Algorithm Program File (Term Extraction)

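Results

◎ Term Extraction of the Tripitaka (classical texts, CBETA):
　Source: CBETA
　Total file size: 1.2 GB (UTF-8 files)
　Bytes occupied by all Chinese characters: 324,754,728 (UTF-16 file)
　Suffix Array Index bytes: 567,406,444 (4 bytes for each character)
　Total Chinese characters: 141,851,611
　Total punctuation characters: 20,525,753
　First-pass data analysis:
　　　　A. Comparison with dictionaries: Soothill-Hodous ｜　佛光大詞典 (Fo Guang Buddhist dictionary)｜　short terms removed (Soothill-Hodous base)
　　　　B. Random-sample comparison: first data set
　Second-pass data analysis:
　　　　A. Comparison with dictionaries: Soothill-Hodous ｜　佛光大詞典 (Fo Guang Buddhist dictionary)
　Optimized analysis conditions and extraction results:
　　* Conditions: left R = 67, right R = 71
　　* Terms selected: 109,681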
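The statistics above report a suffix array index that stores 4 bytes per character. The page does not describe how the project's index is built or queried, so the following is only a minimal sketch of the general technique: it builds a suffix array naively and counts how often a candidate term occurs by binary search. The function names and the sample sentence are illustrative assumptions, and the naive construction would not scale to a corpus of 141,851,611 characters.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only (assumption: not the project's method).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, term):
    """Count occurrences of `term` in `text` using binary search over the suffix array."""
    n = len(term)

    # Lower bound: first suffix whose n-character prefix is >= term.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] < term:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose n-character prefix is > term.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] <= term:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def index_size_bytes(num_chars, bytes_per_entry=4):
    """Size of an index that stores one fixed-width offset per character."""
    return num_chars * bytes_per_entry


sample = "如是我聞一時佛在舍衛國如是我聞"  # hypothetical sample text
sa = build_suffix_array(sample)
print(count_occurrences(sample, sa, "如是我聞"))  # 2
print(index_size_bytes(141_851_611))              # 567406444
```

The last line reproduces the reported index size: 141,851,611 characters × 4 bytes = 567,406,444 bytes. The UTF-16 figure is likewise consistent with the character counts, since 2 × (141,851,611 + 20,525,753) = 324,754,728, which suggests that it includes punctuation as well as Chinese characters.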

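◎ Term Extraction in Articles of Buddhist Modern Studies (contemporary literature, Buddhist studies journals):
　Source: 781 individual articles (中華佛學學報, 華岡學學報, 中華佛學研究, 台大佛學學報, 法鼓全集, and others)
　Total file size: 78 MB (UTF-8 files)
　Bytes occupied by all Chinese characters: 19,328,504 (UTF-16 file)
　Suffix Array Index bytes: 33,851,932 (4 bytes for each character)
　Total Chinese characters: 8,462,983
　Total punctuation characters: 1,201,269
　Data analysis: Soothill-Hodous ｜　佛光大詞典 (Fo Guang Buddhist dictionary)
　** A comparison against a general-purpose dictionary is still needed. Because the comparison uses specialized Buddhist dictionaries and the contemporary corpus itself is small, the parameters come out too large for the results to serve as a reference for now.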
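Both result sets above are evaluated by comparing extracted terms with dictionary headwords, in one case after removing short terms. The page does not say which measures were computed, so the sketch below is an assumption: it reports the overlap between an extracted term list and a headword list as precision and recall, with an optional minimum term length standing in for the "short terms removed" variant. The function name, the `min_len` parameter, and the sample data are hypothetical.

```python
def compare_with_dictionary(extracted, headwords, min_len=1):
    """Compare extracted terms with dictionary headwords.

    Terms and headwords shorter than `min_len` characters are ignored.
    Returns the match count plus precision and recall (assumed measures).
    """
    extracted = {t for t in extracted if len(t) >= min_len}
    headwords = {h for h in headwords if len(h) >= min_len}
    matched = extracted & headwords
    return {
        "matched": len(matched),
        "precision": len(matched) / len(extracted) if extracted else 0.0,
        "recall": len(matched) / len(headwords) if headwords else 0.0,
    }


# Hypothetical data, not taken from the project's results.
terms = ["菩薩", "般若", "如是我", "佛"]
dictionary = ["菩薩", "般若", "涅槃", "佛"]
print(compare_with_dictionary(terms, dictionary))
print(compare_with_dictionary(terms, dictionary, min_len=2))
```

A specialized dictionary lists only part of the vocabulary of modern scholarly prose, so a general-purpose dictionary would be expected to give a more informative comparison for the contemporary corpus, as the note above points out.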

◎ Web services:

　1. Concordance and spatio-temporal geographic search system

　2. CBETA online concordance service
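A concordance lists every occurrence of a keyword together with the text around it. The page does not document how the online services work, so the following is only a minimal keyword-in-context sketch under that caveat; the function name, the `width` parameter, and the sample sentence are illustrative assumptions.

```python
def concordance(text, keyword, width=5):
    """Return (left context, keyword, right context) for each occurrence of `keyword`."""
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, keyword, right))
        # Continue one character later so overlapping matches are also found.
        start = text.find(keyword, start + 1)
    return hits


sample = "如是我聞一時佛在舍衛國祇樹給孤獨園"  # hypothetical sample text
for left, kw, right in concordance(sample, "佛", width=3):
    print(left, "[" + kw + "]", right)  # 聞一時 [佛] 在舍衛
```

A linear scan like this is adequate for short texts; for a corpus the size of CBETA, an index such as a suffix array would normally take the place of the repeated `find` calls.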

Related Websites

　◎ Chinese Buddhist Electronic Text Association (CBETA)
　◎ Digital Archives Program: CKIP Chinese Word Segmentation System
　◎ Python Official Website (the open-source programming language used in this project)

Contact Us

Project Leader：	黃乾綱 / Huang, Chien-Kang [ ckhuang@ntu.edu.tw ]
Assistant：	釋法源 / Ven, Fa-yang [ ktang92@mail.chibs.edu.tw ]
Assistant：	李家名 / Lee, Chia-ming [ trueming@chibs.edu.tw ]
