Protein sequences vary widely in length and in the content that is meaningful to researchers. They are rich in apparent “noise,” that is, features of lesser interest that obscure the core features needed to establish true relatedness and function. This paper is part of a larger study exploring the efficient use and storage of “fingers” for protein sequence analysis, i.e., matrices of uniform size and shape that can “stand for” protein sequences by making the essential aspects of their pattern information more explicit. In essence, the study concerns data compression. Compression suggests an interesting alternative idea of pattern: the concept of “primeness,” as in number theory, is used to create the notion of an irreducible and potentially recurrent pattern element, and this idea is then mapped onto number theory through the unique factorization theorem in order to define a novel measure of pattern difference. Other possible approaches are also discussed. Because compression and other approximations involve information loss, this is also a study of performance in the face of such loss. Given the effects of this loss, no claim is made that the approach should replace established sequence comparison methods, but the concept may have value in a number of applications within and outside molecular biology.
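The role of the unique factorization theorem can be made concrete with a small sketch. It is an illustration only, not the measure defined in the paper: it assumes that each irreducible pattern element is assigned its own prime, that a collection of elements is encoded as the product of those primes, and that two encodings are compared by the total difference in their prime exponents. The element labels `H`, `P`, `C` and the function names `first_primes`, `factorize`, `encode`, and `factor_difference` are hypothetical.

```python
from collections import Counter


def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def factorize(n):
    """Return the prime factorization of n >= 1 as a Counter {prime: exponent}."""
    factors = Counter()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] += 1
            n //= d
        d += 1
    if n > 1:
        factors[n] += 1
    return factors


def encode(elements, prime_of):
    """Encode a collection of pattern elements as a product of primes.

    Multiplication is commutative, so element order is discarded;
    only the count of each element survives in the code.
    """
    code = 1
    for element in elements:
        code *= prime_of[element]
    return code


def factor_difference(a, b):
    """Assumed difference measure: sum of absolute exponent differences."""
    fa, fb = factorize(a), factorize(b)
    return sum(abs(fa[p] - fb[p]) for p in set(fa) | set(fb))


# Hypothetical pattern elements, each mapped to its own prime.
labels = ["H", "P", "C"]
prime_of = dict(zip(labels, first_primes(len(labels))))  # H->2, P->3, C->5

code_a = encode(["H", "P", "H", "C"], prime_of)  # 2 * 3 * 2 * 5
code_b = encode(["H", "P", "P"], prime_of)       # 2 * 3 * 3
difference = factor_difference(code_a, code_b)
```

Because every integer greater than 1 has exactly one prime factorization, the element counts can be recovered from the single integer code. The order of the elements cannot, which is one simple instance of the information loss that the study sets out to examine.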
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.