Duplicate data cause many problems in all types of databases, especially distributed and replicated ones. Such data undermine consistency and increase redundancy, two important concerns in databases. Databases or replicas may contain records that look different but refer to the same real-world entity. Causes of such duplicates include entry errors, non-standard abbreviations, differences between the schemas of various databases, packet loss, and noisy environments. This paper proposes an approach for detecting duplicate or similar records across the replicas of a distributed or replicated database, where faults or noise cause such records to be treated as distinct data. A multi-criteria decision-making algorithm is employed for this purpose. To detect duplicate records, priorities are first defined for the fields, and then the percentage of similarity between records is evaluated. The algorithm's time overhead is reduced by evaluating fields in a particular priority order. The multi-criteria decision-making algorithm is also used to decide how to combine records and which record is the complete and correct one. An instance-based learning approach is employed to learn how to set priorities for the various fields, to create a uniform schema, and to find each field's appropriate match in the other replicas.
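The detection step described above (field priorities, a percentage of similarity, and a priority order that reduces time overhead) can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names `field_similarity` and `record_similarity`, the use of Python's `difflib.SequenceMatcher` as the per-field measure, the 0.8 threshold, and the early-exit rule are all assumptions made for illustration, and the priorities are given by hand rather than learned.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (assumed measure).

    Case and surrounding whitespace are ignored.
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(rec_a, rec_b, priorities, threshold=0.8):
    """Return (percent_similarity, is_duplicate) for two records.

    `priorities` maps field names to positive weights (hypothetical values).
    Fields are compared from highest to lowest priority. The loop stops as
    soon as perfect matches on all remaining fields could no longer lift the
    weighted score to the threshold; in that case the returned percentage is
    only the partial score accumulated so far.
    """
    total = sum(priorities.values())
    score = 0.0
    remaining = total
    ordered = sorted(priorities.items(), key=lambda kv: kv[1], reverse=True)
    for field, weight in ordered:
        sim = field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        score += weight * sim
        remaining -= weight
        if (score + remaining) / total < threshold:
            return 100.0 * score / total, False  # threshold is out of reach
    return 100.0 * score / total, score / total >= threshold


if __name__ == "__main__":
    priorities = {"name": 3, "phone": 2, "city": 1}  # illustrative weights
    a = {"name": "John Smith", "phone": "555-1234", "city": "New York"}
    b = {"name": "Jon Smith", "phone": "555-1234", "city": "New York"}
    print(record_similarity(a, b, priorities))
```

Visiting high-priority fields first means that clearly different records are usually rejected after one or two comparisons, which is one plausible reading of how a priority order can reduce the time overhead. Deciding which of two matched records is the complete and correct one, and learning the priorities from instances, are outside the scope of this sketch.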