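Term extraction steps
[R: the number of distinct characters that appear immediately to the left or to the right of a string]
Candidate Terms: R>1 [STRING] R>1 (number of strings whose left and right R values are both greater than 1): 15,694,556
(that is, the character immediately before or after a term has been a full stop, or has been any two or more distinct characters)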
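The candidate filter above can be sketched in Python as follows. This is only an illustrative sketch, not the code used to produce the counts in these notes. The names `neighbor_sets`, `side_is_open`, `candidate_terms`, the `max_len` cap on string length, and the choice of "。" as the full stop are all assumptions. It also assumes the full-stop rule is applied to each side separately.

```python
from collections import defaultdict

STOP = "。"  # assumed sentence-final mark; the notes only say "full stop"


def neighbor_sets(text, max_len=4):
    """Collect, for every substring up to max_len characters, the set of
    distinct characters seen immediately to its left and to its right."""
    left, right = defaultdict(set), defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            s = text[i:j]
            if STOP in s:
                break  # strings that span a full stop are not considered
            if i > 0:
                left[s].add(text[i - 1])
            if j < n:
                right[s].add(text[j])
    return left, right


def side_is_open(chars):
    """A side qualifies if it has seen a full stop or at least two
    distinct characters (R > 1)."""
    return STOP in chars or len(chars) >= 2


def candidate_terms(text, max_len=4):
    """Strings for which both the left and the right side qualify."""
    left, right = neighbor_sets(text, max_len)
    return {
        s
        for s in left.keys() & right.keys()
        if side_is_open(left[s]) and side_is_open(right[s])
    }


sample = "如來說法。如來說經。佛說法。"
cands = candidate_terms(sample)
```

In this toy sample, 如來 is always followed by 說, so its right side has R = 1 and it is rejected, while 如來說 is followed by both 法 and 經 and is kept.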
Muller: No. of Terms in Charles Muller's Dictionary: 289,838
Muller: single-character terms: 3,221
Muller: 178,652 terms occur in CBETA
Terms shared by Candidate Terms and Muller's dictionary: 139,486
Calculations and charts for varying |R|
candidate terms: 15,694,556
muller's terms: 178,652
candidate ^ muller's: 139,486 (when |R| >= 2)
Precision = 139,486 / 15,694,556 (candidate ^ muller's / candidate terms)
Recall = 139,486 / 178,652 (candidate ^ muller's / muller's terms)
F-measure = 2 / ((1 / Precision) + (1 / Recall))
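As a quick check of the three formulas, the short Python sketch below computes them from raw counts. The function name `precision_recall_f` and its parameter names are assumptions made for illustration. F-measure is taken as the harmonic mean of Precision and Recall, which is the reading that reproduces the F-measure column in the tables of these notes.

```python
def precision_recall_f(n_candidates, n_matched, n_reference):
    """Return (precision, recall, f_measure) from raw counts.

    n_candidates: number of candidate terms
    n_matched:    candidates that are also in Muller's dictionary
    n_reference:  Muller terms that occur in CBETA
    """
    if n_matched == 0:
        return 0.0, 0.0, 0.0  # avoid division by zero in the harmonic mean
    precision = n_matched / n_candidates
    recall = n_matched / n_reference
    f_measure = 2 / ((1 / precision) + (1 / recall))
    return precision, recall, f_measure


# Overall figures quoted above (|R| >= 2 on both sides).
overall = precision_recall_f(15_694_556, 139_486, 178_652)

# Counts from the X = 61 row of the left-side table.
row_61 = precision_recall_f(197_449, 39_034, 178_652)
```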
Y : Precision, Recall, F-measure
X : the |R| threshold, restricted to R > 2 and R < 100
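One way to produce the rows behind such a chart is sketched below in Python. It is a guess at the procedure, not the original script. The function name `sweep_r`, the input layout (a dict from term to its |R| on one side), and the use of ">=" when applying the threshold are assumptions; the notes do not say whether the comparison is strict.

```python
def sweep_r(r_values, reference_terms, thresholds):
    """For each threshold x, keep the terms whose |R| is >= x and score
    them against the reference term set.

    Returns rows of (x, all_candidates, matched, precision, recall, f).
    """
    rows = []
    n_ref = len(reference_terms)
    for x in thresholds:
        kept = {term for term, r in r_values.items() if r >= x}
        hits = len(kept & reference_terms)
        precision = hits / len(kept) if kept else 0.0
        recall = hits / n_ref if n_ref else 0.0
        # 2PR / (P + R) is the same value as 2 / ((1 / P) + (1 / R)).
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        rows.append((x, len(kept), hits, precision, recall, f_measure))
    return rows


# Toy data: hypothetical |R| values and a hypothetical reference set.
toy_r = {"a": 3, "b": 5, "c": 10, "d": 4}
toy_reference = {"a", "c", "e"}
rows = sweep_r(toy_r, toy_reference, range(3, 6))
```

Raising the threshold shrinks the candidate set, so Precision tends to rise while Recall falls, which is the pattern in the tables that follow.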
[Figure 1] Agreement with Muller as the left-side |R| of a term varies
Actual values behind Figure 1
X | All Candidates | ^ Muller's | Y (Precision) | Y (Recall) | Y (F-measure) |
... | ... | ... | ... | ... | ... |
61 | 197449 | 39034 | 0.197691555794 | 0.218491816492 | 0.207571902228 |
62 | 193683 | 38657 | 0.199589019171 | 0.216381568636 | 0.20764633999 |
63 | 190117 | 38269 | 0.20129183608 | 0.214209748561 | 0.207549983865 |
64 | 186713 | 37900 | 0.202985330427 | 0.212144280501 | 0.207463769108 |
65 | 183303 | 37545 | 0.204824798285 | 0.210157177082 | 0.207456728046 |
66 | 179886 | 37196 | 0.206775402199 | 0.208203658509 | 0.2074870725 |
67 | 176715 | 36871 | 0.208646691 | 0.206384479323 | 0.207509419839 |
68 | 173685 | 36526 | 0.210300256211 | 0.204453350648 | 0.207335590642 |
69 | 170749 | 36219 | 0.212118372582 | 0.202734926001 | 0.207320528562 |
70 | 167842 | 35887 | 0.213814182386 | 0.200876564494 | 0.207143558041 |
71 | 165099 | 35576 | 0.215482831513 | 0.19913574995 | 0.20698703422 |
72 | 162463 | 35268 | 0.217083274346 | 0.197411727828 | 0.206780704455 |
73 | 159897 | 34986 | 0.21880335466 | 0.195833240042 | 0.206682046026 |
74 | 157356 | 34669 | 0.220322072244 | 0.194058840651 | 0.206358181948 |
75 | 154938 | 34363 | 0.221785488389 | 0.192346013479 | 0.206019365089 |
76 | 152520 | 34054 | 0.223275635982 | 0.190616393883 | 0.205657483121 |
77 | 150203 | 33758 | 0.224749172786 | 0.188959541455 | 0.205306290006 |
78 | 147938 | 33476 | 0.226283983831 | 0.187381053669 | 0.20500321504 |
79 | 145791 | 33221 | 0.227867289476 | 0.185953697692 | 0.204787898028 |
... | ... | ... | ... | ... | ... |
[Figure 2] Agreement with Muller as the right-side |R| of a term varies
Actual values behind Figure 2
X | All Candidates | ^ Muller's | Y (Precision) | Y (Recall) | Y (F-measure) |
... | ... | ... | ... | ... | ... |
61 | 201549 | 43832 | 0.217475651082 | 0.245348498757 | 0.230572775979 |
62 | 197709 | 43434 | 0.21968650896 | 0.243120703938 | 0.230810312439 |
63 | 194058 | 43079 | 0.221990332787 | 0.241133600519 | 0.231166322342 |
64 | 190606 | 42705 | 0.224048560906 | 0.239040145087 | 0.2313016915 |
65 | 187197 | 42352 | 0.226242941927 | 0.237064236616 | 0.23152721478 |
66 | 183873 | 42003 | 0.228434843615 | 0.235110718044 | 0.231724708641 |
67 | 180708 | 41644 | 0.230449122341 | 0.233101224727 | 0.231767586821 |
68 | 177597 | 41301 | 0.232554603963 | 0.231181291002 | 0.23186591401 |
69 | 174644 | 40968 | 0.234580060008 | 0.22931733202 | 0.231918844255 |
70 | 171776 | 40638 | 0.236575540238 | 0.227470165461 | 0.231933521294 |
71 | 169091 | 40367 | 0.238729441543 | 0.225953249894 | 0.232165708584 |
72 | 166448 | 40053 | 0.240633711429 | 0.224195642926 | 0.232124022023 |
73 | 163778 | 39732 | 0.242596685758 | 0.222398853637 | 0.232059106971 |
74 | 161122 | 39425 | 0.244690358859 | 0.22068042899 | 0.232066020355 |
75 | 158626 | 39115 | 0.246586309937 | 0.21894521192 | 0.231945160965 |
76 | 156270 | 38847 | 0.24858898061 | 0.217445088776 | 0.231976400475 |
77 | 153905 | 38577 | 0.250654624606 | 0.215933770683 | 0.232002333435 |
78 | 151611 | 38306 | 0.252659767431 | 0.214416855115 | 0.231972700545 |
79 | 149400 | 38030 | 0.254551539491 | 0.212871952175 | 0.231853486642 |
... | ... | ... | ... | ... | ... |
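[Max: among the distinct characters (|R|) that appear to the left or right of a term, the number of occurrences of the most frequent one]
Normalization: use Max / fx, where fx is the total number of occurrences of the term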
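The Max / fx ratio can be illustrated with a small Python sketch. The function name `max_over_fx` is an assumption, and so is the decision to count overlapping occurrences and to ignore the missing neighbor at the very start or end of the text; the notes do not specify either point.

```python
from collections import Counter


def max_over_fx(text, term):
    """Return (left Max / fx, right Max / fx) for one term.

    Max is the count of the most frequent neighboring character on that
    side; fx is the total number of occurrences of the term.
    """
    left, right = Counter(), Counter()
    fx = 0
    start = text.find(term)
    while start != -1:
        fx += 1
        end = start + len(term)
        if start > 0:
            left[text[start - 1]] += 1
        if end < len(text):
            right[text[end]] += 1
        start = text.find(term, start + 1)  # allow overlapping matches
    if fx == 0:
        return 0.0, 0.0
    left_max = max(left.values(), default=0)
    right_max = max(right.values(), default=0)
    return left_max / fx, right_max / fx


sample = "如來說法。如來說經。佛說法。"
ratio_rulai = max_over_fx(sample, "如來")
ratio_shuo = max_over_fx(sample, "說")
```

A ratio close to 1 on one side means the term is almost always next to the same character there, which suggests it is a fragment of a longer term.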
Charts for varying left and right Max/fx.
Y : Precision, Recall, F-measure
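X : the Max/fx threshold, in steps of 0.1
[Figure 3] Agreement with Muller as the left-side Max/fx of a term varies
[Figure 4] Agreement with Muller as the right-side Max/fx of a term varies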
[Algorithm AEc]
AEc = fx / (fy + fz - fx)
Ex:
string: 中華佛學研究所
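fx = No. of occurrences of 中華佛學研究所
fy = No. of occurrences of 中華佛學研究 (the string without its last character)
fz = No. of occurrences of 華佛學研究所 (the string without its first character)
Take 100 AEc thresholds at intervals of 0.01 (0.01 to 1.0).
For each threshold, compare the candidate terms whose AEc value is >= that threshold against Muller.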
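A minimal Python sketch of the AEc calculation and the threshold list follows. The helper names `count_occurrences`, `aec`, and `kept_at`, the toy text, and the choice to count overlapping occurrences are assumptions for illustration only. The formula is read as fx / (fy + fz - fx), which keeps the value between 0 and 1 and so matches the 0.01 to 1.0 threshold range.

```python
def count_occurrences(text, s):
    """Count occurrences of s in text, overlapping matches included."""
    count = 0
    start = text.find(s)
    while start != -1:
        count += 1
        start = text.find(s, start + 1)
    return count


def aec(text, s):
    """AEc = fx / (fy + fz - fx) for a string s of two or more characters.

    fy counts s without its last character, fz counts s without its
    first character.
    """
    fx = count_occurrences(text, s)
    fy = count_occurrences(text, s[:-1])
    fz = count_occurrences(text, s[1:])
    denominator = fy + fz - fx
    return fx / denominator if denominator else 0.0


# The 100 thresholds 0.01, 0.02, ..., 1.00.
thresholds = [i / 100 for i in range(1, 101)]


def kept_at(scores, threshold):
    """Candidates whose AEc value is >= the threshold."""
    return {term for term, value in scores.items() if value >= threshold}


# Toy text: the full string once, each shorter string once more on its own.
toy = "中華佛學研究所。中華佛學研究。華佛學研究所。"
score = aec(toy, "中華佛學研究所")  # fx = 1, fy = 2, fz = 2
```

The kept set at each threshold can then be scored against Muller with the same Precision, Recall, and F-measure definitions used for |R|.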