Language Modeling for Information Retrieval
Bruce Croft, John Lafferty
Springer Science & Business Media, May 31, 2003 - 246 pages

A statistical language model, or more simply a language model, is a probabilistic mechanism for generating text. Such a definition is general enough to include an endless variety of schemes. However, a distinction should be made between generative models, which can in principle be used to synthesize artificial text, and discriminative techniques to classify text into predefined categories.

The first statistical language modeler was Claude Shannon. In exploring the application of his newly founded theory of information to human language, Shannon considered language as a statistical source, and measured how well simple n-gram models predicted or, equivalently, compressed natural text. To do this, he estimated the entropy of English through experiments with human subjects, and also estimated the cross-entropy of the n-gram models on natural text. The ability of language models to be quantitatively evaluated in this way is one of their important virtues.

Of course, estimating the true entropy of language is an elusive goal, aiming at many moving targets, since language is so varied and evolves so quickly. Yet fifty years after Shannon's study, language models remain, by all measures, far from the Shannon entropy limit in terms of their predictive power. However, this has not kept them from being useful for a variety of text processing tasks, and it can moreover be viewed as encouragement that there is still great room for improvement in statistical language modeling.
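The quantitative evaluation the blurb describes can be sketched concretely. The toy Python snippet below (not from the book) estimates the cross-entropy, in bits per word, of a unigram model on held-out text; add-one smoothing is an assumed simplification so that unseen words get nonzero probability. Lower cross-entropy means better prediction, and equivalently better compression of the test text.

```python
import math
from collections import Counter

def unigram_cross_entropy(train_text, test_text):
    """Cross-entropy (bits/word) of an add-one-smoothed unigram model,
    trained on train_text and evaluated on test_text."""
    train = train_text.split()
    test = test_text.split()
    counts = Counter(train)
    vocab = set(train) | set(test)          # smoothing vocabulary
    total, V = len(train), len(vocab)
    # H(p_test, q) ~= -(1/N) * sum over test words of log2 q(w)
    return -sum(math.log2((counts[w] + 1) / (total + V)) for w in test) / len(test)

train = "the cat sat on the mat the dog sat on the rug"
test = "the cat sat on the rug"
print(round(unigram_cross_entropy(train, test), 3))  # -> 2.612
```

A better model (say, a bigram model with sensible smoothing) would typically achieve a lower value on the same held-out text, which is exactly the sense in which language models can be compared on the road toward the entropy limit.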