Paper/AAAI-1998-p792

Learning to Classify Text from Labeled and Unlabeled Documents

@InProceedings{aaai:98:01,
 author =       "K. Nigam and A. McCallum and S. Thrun and T. Mitchell",
 title =        "Learning to Classify Text from Labeled and Unlabeled Documents",
 booktitle =    "Proc. of the 15th National Conf. on Artificial Intelligence",
 year =         1998,
 pages =        "792-799"
}

Keywords

document classification, semi-supervised learning, mixed labeled and unlabeled data, EM algorithm, generative model, semi-supervised classification

Notes

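The paper that opened up semi-supervised learning based on generative models. The idea is very simple: treat the labels of the unlabeled data as latent variables and model the data as a mixture over the classes. The likelihood of the labeled and unlabeled data taken together is then maximized.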

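Concretely, with labeled data \(\mathcal{D}^l=\{(d_i,y_i)\}\) and unlabeled data \(\mathcal{D}^u=\{d_i\}\), the log-likelihood is \[\log\mathcal{L}(\mathcal{D}|\theta)=\sum_{d_i\in\mathcal{D}^u}\log\Bigl[\sum_{j=1}^{|C|}\Pr(c_j|\theta)\Pr(d_i|c_j;\theta)\Bigr]+\sum_{(d_i,y_i)\in\mathcal{D}^l}\log\Bigl[\Pr(c_{y_i}|\theta)\Pr(d_i|c_{y_i};\theta)\Bigr]\] where \(\mathcal{D}=\mathcal{D}^l\cup\mathcal{D}^u\). The first term is the log-likelihood of the unlabeled data, in which the unknown class is marginalized out, and the second term is the log-likelihood of the labeled data.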
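The log-likelihood above can be sketched in code. The following is a minimal illustration, assuming a multinomial naive Bayes model for \(\Pr(d_i|c_j;\theta)\) with add-one smoothing and EM as the optimizer; the note itself only gives the objective, so the function names (`log_joint`, `m_step`, `fit_em`, `predict`), the smoothing constant `alpha`, and the bag-of-words count matrices are assumptions made for this example rather than details taken from the note.

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def log_joint(X, log_prior, log_word):
    # log Pr(c_j|theta) + log Pr(d_i|c_j;theta) for every document i and class j.
    # X holds word counts; terms that do not depend on theta are omitted.
    return log_prior[None, :] + X @ log_word.T

def m_step(X, R, alpha=1.0):
    # Re-estimate theta from (soft) class memberships R, with add-alpha smoothing.
    class_counts = R.sum(axis=0)
    log_prior = np.log((class_counts + alpha) / (class_counts.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def log_likelihood(X_l, y_l, X_u, log_prior, log_word):
    # First term: unlabeled documents, class marginalized out.
    ll_u = logsumexp(log_joint(X_u, log_prior, log_word), axis=1).sum()
    # Second term: labeled documents, class fixed to the given label.
    ll_l = log_joint(X_l, log_prior, log_word)[np.arange(len(y_l)), y_l].sum()
    return ll_u + ll_l

def fit_em(X_l, y_l, X_u, n_classes, n_iter=20):
    R_l = np.eye(n_classes)[y_l]              # labeled memberships are one-hot and fixed
    log_prior, log_word = m_step(X_l, R_l)    # initialize from labeled data only
    history = []
    for _ in range(n_iter):
        # E-step: posterior over the latent labels of the unlabeled documents.
        lj = log_joint(X_u, log_prior, log_word)
        R_u = np.exp(lj - logsumexp(lj, axis=1)[:, None])
        # M-step: update theta using labeled and unlabeled documents together.
        log_prior, log_word = m_step(np.vstack([X_l, X_u]), np.vstack([R_l, R_u]))
        history.append(log_likelihood(X_l, y_l, X_u, log_prior, log_word))
    return log_prior, log_word, history

def predict(X, log_prior, log_word):
    return log_joint(X, log_prior, log_word).argmax(axis=1)

# Toy data: 4-word vocabulary, class 0 favors words 0-1, class 1 favors words 2-3.
X_l = np.array([[3, 2, 0, 0], [0, 0, 2, 3]])
y_l = np.array([0, 1])
X_u = np.array([[4, 1, 0, 0], [0, 1, 0, 4], [2, 2, 1, 0], [0, 0, 3, 1]])
log_prior, log_word, history = fit_em(X_l, y_l, X_u, n_classes=2)
```

Because the M-step here is smoothed, it maximizes a posterior rather than the raw likelihood, so the values recorded in `history` are a diagnostic of the objective above and are not guaranteed to increase at every iteration.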