By Gerard Salton

Presents a theory of indexing capable of ranking index terms, or subject identifiers, in decreasing order of importance. This leads to the choice of good document representations, and also accounts for the role of phrases and of thesaurus classes in the indexing process.

This study is typical of theoretical work in automatic information organization and retrieval, in that concepts are used from mathematics, computer science, and linguistics. A complete theory of information retrieval may emerge from an appropriate combination of these three disciplines.

**Additional info for A Theory of Indexing**

**Sample text**

11 is in fact an accurate representation of the indexing value of the terms, it must be possible to improve the retrieval performance by breaking up terms with negative discrimination value in such a way that lower frequency terms are produced from higher frequency components, with correspondingly better discrimination values [28], [29]. Specifically, if the high frequency nondiscriminators are taken in groups, and "phrases" are formed for cooccurring sets of nondiscriminators, the phrases will obviously exhibit lower document frequencies than the original components.
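The effect of phrase formation on document frequency can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a "phrase" is simply a set of terms cooccurring in the same document; the toy collection and the helper name `document_frequency` are hypothetical and do not come from the text.

```python
from itertools import combinations


def document_frequency(docs, terms):
    """Count the documents that contain every term in `terms`."""
    wanted = set(terms)
    return sum(1 for doc in docs if wanted <= set(doc))


# Toy collection (an assumption): three terms that each occur in most documents.
docs = [
    ["information", "retrieval", "system"],
    ["information", "system", "design"],
    ["retrieval", "system", "evaluation"],
    ["information", "retrieval", "index"],
]

high_frequency_terms = ["information", "retrieval", "system"]

# Document frequency of each single term.
singles = {t: document_frequency(docs, [t]) for t in high_frequency_terms}

# Document frequency of each pair ("phrase") built from those terms.
pairs = {
    pair: document_frequency(docs, pair)
    for pair in combinations(sorted(high_frequency_terms), 2)
}
```

A pair can only occur where both of its components occur, so its document frequency never exceeds that of either component. In this toy collection each single term appears in three documents and each pair in two.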

For practical purposes, the average discriminators are terms that occur with a term frequency of 1 in relatively few documents in a collection. The poor discriminators, finally, have high document frequency, and collection frequencies two or three times the size of the document frequency. The number of documents in which these terms occur with low frequency is very large, which of course accounts for their low discrimination values. Whereas no clear correlation was found to exist between the S/N ratings and the document or collection frequencies of the corresponding terms, a direct relation appears to exist for the discrimination value rankings.
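The link between document frequency and discrimination value can be seen in a small Python sketch. The excerpt does not define the discrimination value, so the sketch assumes the usual space-density formulation: the value of a term is the average pairwise document similarity with the term removed, minus the same average with the term present. Cosine similarity, the toy documents, and the function names are all illustrative assumptions.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def density(docs):
    """Average similarity over all document pairs."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_value(docs, term):
    """Density without `term` minus density with it (assumed definition)."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)


# Toy collection (an assumption): "the" occurs everywhere, "cat" only once.
docs = [
    {"the": 1, "cat": 1},
    {"the": 1, "dog": 1},
    {"the": 1, "bird": 1},
]

dv_the = discrimination_value(docs, "the")  # high document frequency
dv_cat = discrimination_value(docs, "cat")  # low document frequency
```

Under these assumptions the term that occurs in every document receives a negative value, because removing it spreads the documents apart, while the term confined to one document receives a positive value.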

TF: Standard term frequency weighting (word stem run).

PT + SPT: Use pairs and triples derived from nondiscriminators plus singles, pairs and triples obtained from discriminators.

TF • IDF: Use a term weight consisting of term frequency multiplied by the inverse document frequency.

TABLE 22. Statistical significance output for selected runs of Table 21 (probability that run B is significantly better than run A, except where A > B indicates that test is made in reverse direction).
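The TF • IDF weighting described above, term frequency multiplied by the inverse document frequency, can be sketched in Python as follows. The excerpt does not give the exact form of the inverse document frequency, so the sketch assumes the common variant log(N / df), where N is the number of documents and df the document frequency of the term; the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document.

    weight = term frequency * log(N / document frequency),
    an assumed IDF variant, not necessarily the one used in the text.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Toy collection (an assumption).
docs = [["a", "a", "b"], ["b", "c"]]
weights = tf_idf(docs)
```

With this variant a term that occurs in every document, such as "b" here, receives weight zero, while a term that is frequent within one document but rare across the collection receives the largest weight.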