Abstract:
A computer program written for the UNIX time-sharing system reduces by several orders of magnitude the task of finding words in a document which contain typographical errors. The program is adaptive in the sense that it uses statistics from the document itself for its analysis. In a first pass through the document, a table of digram and trigram frequencies is prepared. The second pass through the document breaks out individual words and compares the digrams and trigrams in each word with the frequencies from the table. An index is given to each word which reflects the hypothesis that the trigrams in the given word were produced from the same source that produced the trigram table. The words are sorted in decreasing order of their indices and printed. Printing is suppressed for words appearing in a table of 2726 common technical English words. The table is attached as Appendix B. The author of a 108-page document needed less than ten minutes to scan the output and identify the misspelled words. There were a total of 30 misspelled words among the 386 words output and 23 of those occurred among the first 100 words output.
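The two-pass procedure described above can be sketched roughly as follows. This is an illustrative sketch only, not the program from the paper: the abstract does not give the index formula, so the scoring rule used here (a smoothed log ratio of the two overlapping digram counts to the trigram count, averaged over the word) is an assumption, as are the function names `ngrams`, `build_tables`, `peculiarity`, and `rank_words`, the space padding at word boundaries, and the lowercase-letters-only tokenizer.

```python
import math
import re
from collections import Counter


def ngrams(word, n):
    """Return the n-grams of a word padded with one space on each side."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_tables(words):
    """First pass: count digrams and trigrams over every word occurrence."""
    digrams, trigrams = Counter(), Counter()
    for word in words:
        digrams.update(ngrams(word, 2))
        trigrams.update(ngrams(word, 3))
    return digrams, trigrams


def peculiarity(word, digrams, trigrams):
    """Second pass: score one word against the document's own tables.

    ASSUMPTION: this formula is a stand-in, not taken from the abstract.
    A trigram xyz scores high when xy and yz are frequent in the document
    but xyz itself is rare. Adding 1 before the log avoids log(0).
    """
    scores = []
    for tri in ngrams(word, 3):
        xy, yz = tri[:2], tri[1:]
        scores.append(
            0.5 * (math.log(1 + digrams[xy]) + math.log(1 + digrams[yz]))
            - math.log(1 + trigrams[tri])
        )
    return sum(scores) / len(scores)


def rank_words(text, common_words=frozenset()):
    """Return the distinct words, most peculiar first, minus common words."""
    words = re.findall(r"[a-z]+", text.lower())
    digrams, trigrams = build_tables(words)
    candidates = set(words) - set(common_words)
    # Sort by decreasing index; break ties alphabetically for stable output.
    return sorted(
        candidates,
        key=lambda w: (-peculiarity(w, digrams, trigrams), w),
    )


if __name__ == "__main__":
    sample = "the theory then there these them " * 20 + "thqe"
    for word in rank_words(sample, common_words={"the"}):
        print(word)
```

The `common_words` argument plays the role of the suppression table mentioned in the abstract; the actual 2726-word list is in the paper's Appendix B and is not reproduced here. Because the tables are built from the document itself, a word that is repeated consistently (even if misspelled) will tend to score low under this kind of scheme.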
Published in: IEEE Transactions on Professional Communication (Volume: PC-18, Issue: 1, March 1975)