
IEEE Transactions on Knowledge and Data Engineering

Issue 5 • Sept.-Oct. 2003

  • ClusterTree: integration of cluster representation and nearest-neighbor search for large data sets with high dimensions

    Page(s): 1316 - 1337

    We introduce the ClusterTree, a new indexing approach for representing clusters generated by any existing clustering approach. A cluster is decomposed into several subclusters and represented as the union of the subclusters. The subclusters can be further decomposed, which isolates the most related groups within the clusters. A ClusterTree is a hierarchy of clusters and subclusters which incorporates the cluster representation into the index structure to achieve effective and efficient retrieval. Our cluster representation is highly adaptive to any kind of cluster. It is well accepted that most existing indexing techniques degrade rapidly as the dimensions increase. The ClusterTree provides a practical solution to index clustered data sets and supports the retrieval of the nearest neighbors effectively without having to linearly scan the high-dimensional data set. We also discuss an approach to dynamically reconstruct the ClusterTree when new data is added. We present the detailed analysis of this approach and justify it extensively with experiments.
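    To make the hierarchy-plus-pruning idea concrete, the sketch below (a rough Python illustration, not the authors' actual ClusterTree structure or algorithms) summarizes each cluster or subcluster by an assumed centroid and radius and answers nearest-neighbor queries by best-first descent, skipping any subtree whose lower-bound distance already exceeds the best match found:

        # Illustrative hierarchical cluster index with nearest-neighbor search.
        # The node layout (centroid + bounding radius) and the pruning rule are
        # assumptions made for this sketch.
        import heapq, math

        class Node:
            def __init__(self, centroid, radius, children=None, points=None):
                self.centroid, self.radius = centroid, radius
                self.children = children or []     # subclusters
                self.points = points or []         # data points stored at the leaves

        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        def nearest(root, q):
            best, best_d = None, float("inf")
            heap = [(max(dist(q, root.centroid) - root.radius, 0.0), id(root), root)]
            while heap:
                lower, _, node = heapq.heappop(heap)
                if lower >= best_d:                # nothing closer can live in this subtree
                    break
                for p in node.points:
                    d = dist(q, p)
                    if d < best_d:
                        best, best_d = p, d
                for c in node.children:
                    bound = max(dist(q, c.centroid) - c.radius, 0.0)
                    heapq.heappush(heap, (bound, id(c), c))
            return best, best_d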

  • A scalable low-latency cache invalidation strategy for mobile environments

    Page(s): 1251 - 1265

    Caching frequently accessed data items on the client side is an effective technique for improving performance in a mobile environment. Classical cache invalidation strategies are not suitable for mobile environments due to frequent disconnections and mobility of the clients. One attractive cache invalidation technique is based on invalidation reports (IRs). However, the IR-based cache invalidation solution has two major drawbacks, which have not been addressed in previous research. First, there is a long query latency associated with this solution since a client cannot answer a query until the next IR interval. Second, when the server updates a hot data item, all clients have to query the server and get the data from the server separately, which wastes a large amount of bandwidth. In this paper, we propose an IR-based cache invalidation algorithm which can significantly reduce the query latency and efficiently utilize the broadcast bandwidth. Detailed analytical and simulation experiments are carried out to evaluate the proposed methodology. Compared to previous IR-based schemes, our scheme can significantly improve the throughput and reduce the query latency, the number of uplink requests, and the broadcast bandwidth requirements.
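    The following Python fragment is a hypothetical sketch of the client side of an IR scheme: items reported as updated are invalidated, and a client that has missed reports for longer than the window covered by an IR must drop its whole cache. The field names and the window parameter are illustrative assumptions, not the algorithm proposed in the paper.

        # Hypothetical client-side handling of periodic invalidation reports.
        class ClientCache:
            def __init__(self, window):
                self.window = window             # span of updates each IR covers
                self.items = {}                  # item_id -> cached value
                self.last_ir = 0.0

            def apply_ir(self, ir_time, updated_ids):
                if ir_time - self.last_ir > self.window:
                    self.items.clear()           # disconnected too long: cache unusable
                else:
                    for i in updated_ids:
                        self.items.pop(i, None)  # invalidate reported updates
                self.last_ir = ir_time

            def query(self, item_id):
                return self.items.get(item_id)   # None means an uplink request is needed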

  • A comparison of standard spell checking algorithms and a novel binary neural approach

    Page(s): 1073 - 1081

    In this paper, we propose a simple, flexible, and efficient hybrid spell-checking methodology based upon phonetic matching, supervised learning, and associative matching in the AURA neural system. We integrate Hamming Distance and n-gram algorithms that have high recall for typing errors and a phonetic spell-checking algorithm in a single novel architecture. Our approach is suitable for any spell-checking application, though it is aimed toward isolated word error correction, particularly spell checking of user queries in a search engine. We use a novel scoring scheme to integrate the retrieved words from each spelling approach and calculate an overall score for each matched word. From the overall scores, we can rank the possible matches. We evaluate our approach against several benchmark spell-checking algorithms for recall accuracy. Our proposed hybrid methodology has the highest recall rate of the techniques evaluated, while maintaining a low computational cost.
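    As a toy illustration of the n-gram side of such a hybrid checker, the sketch below ranks lexicon words by bigram overlap (Dice coefficient). It is plain Python rather than the binary neural (AURA) implementation described in the paper, and the padding and scoring choices are assumptions.

        # Bigram-overlap ranking of spelling candidates (illustrative only).
        def bigrams(word):
            w = "#" + word.lower() + "#"        # pad so first/last letters count
            return {w[i:i + 2] for i in range(len(w) - 1)}

        def score(query, candidate):
            q, c = bigrams(query), bigrams(candidate)
            return 2 * len(q & c) / (len(q) + len(c))   # Dice coefficient in [0, 1]

        def rank(query, lexicon, top=5):
            return sorted(lexicon, key=lambda w: score(query, w), reverse=True)[:top]

        # rank("grammer", ["grammar", "hammer", "gamma"]) puts "grammar" first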

  • Logical structure analysis and generation for structured documents: a syntactic approach

    Page(s): 1277 - 1294

    This paper presents a syntactic method for sophisticated logical structure analysis that transforms document images with multiple pages and hierarchical structure into an electronic document based on SGML/XML. To produce a logical structure more accurately and quickly than previous methods, whose basic units are text lines, the proposed parsing method takes text regions with hierarchical structure as input. Furthermore, we define a document model that is able to describe geometric characteristics and logical structure information of documents efficiently, and present an automated method for its creation. Experimental results with 372 images scanned from the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) show that the method performs logical structure analysis successfully and generates a document model automatically. In particular, the method generates SGML/XML documents as the result of structure analysis, which enhances document reusability and platform independence.

  • Detection and recovery techniques for database corruption

    Page(s): 1120 - 1136

    Increasingly, for extensibility and performance, special purpose application code is being integrated with database system code. Such application code has direct access to database system buffers, and as a result, the danger of data being corrupted due to inadvertent application writes is increased. Previously proposed hardware techniques to protect from corruption require system calls, and their performance depends on details of the hardware architecture. We investigate an alternative approach which uses codewords associated with regions of data to detect corruption and to prevent corrupted data from being used by subsequent transactions. We develop several such techniques which vary in the level of protection, space overhead, performance, and impact on concurrency. These techniques are implemented in the Dali main-memory storage manager, and the performance impact of each on normal processing is evaluated. Novel techniques are developed to recover when a transaction has read corrupted data caused by a bad write and gone on to write other data in the database. These techniques use limited and relatively low-cost logging of transaction reads to trace the corruption and may also prove useful when resolving problems caused by incorrect data entry and other logical errors.
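    The sketch below illustrates the general codeword idea: a checksum is kept for each region of the buffer, refreshed only through the official write path, and verified before data is handed to a transaction, so a stray write is caught at the next protected read. The region size and the use of CRC32 are assumptions for this illustration, not details of the Dali implementation.

        # Codeword-protected buffer regions (illustrative sketch).
        import zlib

        REGION = 4096

        class ProtectedBuffer:
            def __init__(self, nbytes):
                self.data = bytearray(nbytes)
                self.codes = [zlib.crc32(self.data[i:i + REGION])
                              for i in range(0, nbytes, REGION)]

            def write(self, offset, payload):          # the legitimate write path
                self.data[offset:offset + len(payload)] = payload
                first, last = offset // REGION, (offset + len(payload) - 1) // REGION
                for r in range(first, last + 1):       # refresh codewords we touched
                    self.codes[r] = zlib.crc32(self.data[r * REGION:(r + 1) * REGION])

            def read(self, offset, length):
                first, last = offset // REGION, (offset + length - 1) // REGION
                for r in range(first, last + 1):       # detect stray writes
                    if zlib.crc32(self.data[r * REGION:(r + 1) * REGION]) != self.codes[r]:
                        raise RuntimeError("corruption detected in region %d" % r)
                return bytes(self.data[offset:offset + length])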

  • Scalable semantic brokering over dynamic heterogeneous data sources in InfoSleuth™

    Page(s): 1082 - 1098

    InfoSleuth is an agent-based system for information discovery and retrieval in a dynamic, open environment. Brokering in InfoSleuth is a matchmaking process, recommending agents that provide services to agents requesting services. This paper discusses InfoSleuth's distributed multibroker design and implementation. InfoSleuth's brokering function combines reasoning over both the syntax and semantics of agents in the domain. This means the broker must reason over explicitly advertised information about agent capabilities to determine which agent can best provide the requested services. Robustness and scalability issues dictate that brokering must be distributable across collaborating agents. Our multibroker design is a peer-to-peer system that requires brokers to advertise to and receive advertisements from other brokers. Brokers collaborate during matchmaking to give a collective response to requests initiated by nonbroker agents. This results in a robust, scalable brokering system.

  • Reasoning about uniqueness constraints in object relational databases

    Page(s): 1295 - 1306

    Uniqueness constraints such as keys and functional dependencies in the relational model are a core concept in information systems technology. We consider uniqueness constraints suitable for object relational data models and identify a boundary between tractable and intractable varieties. The subclass that is tractable is still a strict generalization of both keys and relational functional dependencies. We present an efficient decision procedure for the logical implication problem of this subclass. The problem itself is formulated as an implication problem for a simple dialect of description logic (DL). DLs are a family of languages for knowledge representation that have many applications in information systems technology and for which model building procedures have been developed that can decide implication problems for dialects that are very expressive. Our own procedure complements this approach and can be integrated with these earlier procedures. Finally, to motivate our results, we review some applications of our procedure in query optimization.
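    For readers more familiar with the purely relational setting, the textbook attribute-closure test below decides implication for ordinary functional dependencies; it is only a relational analogue of the implication problem, not the description-logic procedure developed in the paper.

        # Attribute-closure test for functional-dependency implication
        # (textbook algorithm, shown as a relational analogue only).
        def closure(attrs, fds):
            """fds: iterable of (lhs, rhs) pairs, each a set of attributes."""
            result = set(attrs)
            changed = True
            while changed:
                changed = False
                for lhs, rhs in fds:
                    if lhs <= result and not rhs <= result:
                        result |= rhs
                        changed = True
            return result

        def implies(fds, lhs, rhs):
            return set(rhs) <= closure(lhs, fds)

        # implies([({"A"}, {"B"}), ({"B"}, {"C"})], {"A"}, {"C"})  -> True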

  • IFOOD: an intelligent fuzzy object-oriented database architecture

    Page(s): 1137 - 1154

    Next generation information system applications require powerful and intelligent information management that necessitates an efficient interaction between database and knowledge base technologies. It is also important for these applications to incorporate uncertainty in data objects, in integrity constraints, and/or in applications. In this study, we propose an intelligent fuzzy object-oriented database architecture, IFOOD, which permits the flexible modeling and querying of complex data and knowledge, including uncertainty, with powerful retrieval capability.

  • InterBase-KB: integrating a knowledge base system with a multidatabase system for data warehousing

    Page(s): 1188 - 1205

    This paper describes the integration of a multidatabase system and a knowledge-base system to support the data-integration component of a data warehouse. The multidatabase system integrates various component databases with a common query language; however, it does not provide capability for schema integration and other utilities necessary for data warehousing. The knowledge base system, in turn, offers a declarative logic language with second-order syntax but first-order semantics for integrating the schemas of the data sources into the warehouse and for defining complex, recursively defined materialized views. Deductive rules are also used for cleaning, integrity checking, and summarizing the data imported into the data warehouse. The knowledge base system features an efficient incremental view maintenance mechanism that is used for refreshing the data warehouse without querying the data sources.

  • Cascade of distributed and cooperating firewalls in a secure data network

    Page(s): 1307 - 1315

    Security issues are critical in networked information systems, e.g., with financial information, corporate proprietary information, contractual and legal information, human resource data, medical records, etc. The paper addresses such diversity of security needs among the different information and resources connected over a secure data network. Installation of firewalls across the data network is a popular approach to providing a secure data network. However, single, individual firewalls may not provide adequate security protection to meet users' needs. The cost of super firewalls, as well as design flaws and implementation shortcomings in such firewalls, may leave security loopholes. The idea proposed is to introduce a cascade of (potentially simpler and less expensive) firewalls in the secure data network, where, between the attacker node and the attacked node, multiple firewalls are expected to provide an added degree of protection. This approach, broadly following the theme of redundancy in engineering design, increases confidence in, and the completeness of, the security protection provided by the firewalls. The cascade of (i.e., multiple) firewalls can be placed across the secure data network in many ways, not all of which are equally attractive from cost and end-to-end delay perspectives. To this end, we present heuristics for placement of these firewalls across the different nodes and links of the network in a way that different users can have the level of security they individually need, without having to pay added hardware costs or excess network delay. Three metrics are proposed to evaluate these heuristics: cost, delay, and reduction of the attacker's traffic. Performance of these heuristics is presented using simulation, along with some early analytical results. Our research also extends firewall technology to exploit the well-known advantages of distributed firewalls. Furthermore, the distributed firewalls can be designed to cooperate and stop an attacker's traffic closest to the attack point, thereby reducing the amount of the attacker's traffic entering the network.

  • Using hybrid knowledge engineering and image processing in color virtual restoration of ancient murals

    Page(s): 1338 - 1343

    This paper proposes a novel scheme to virtually restore the colors of ancient murals. Our approach integrates artificial intelligence techniques with digital image processing methods. The knowledge related to the mural colors is first categorized into four types. A hybrid frame- and rule-based approach is then developed to represent knowledge and to infer colors. An algorithm that takes into account color similarity and spatial proximity is developed to segment mural images. A novel color transformation method based on color histograms is finally proposed to restore the colors of murals. A number of experiments based on real images have demonstrated the validity of the proposed scheme for color restoration.
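    The paper's transformation is histogram based; as a generic hint of how such a mapping can work, the sketch below applies standard per-channel histogram matching with NumPy. This is a common technique chosen for illustration, not the specific restoration method proposed by the authors.

        # Per-channel histogram matching (standard technique, for illustration).
        import numpy as np

        def match_channel(source, reference):
            s_values, s_counts = np.unique(source, return_counts=True)
            r_values, r_counts = np.unique(reference, return_counts=True)
            s_cdf = np.cumsum(s_counts) / source.size
            r_cdf = np.cumsum(r_counts) / reference.size
            mapped = np.interp(s_cdf, r_cdf, r_values)      # align the two CDFs
            idx = np.searchsorted(s_values, source.ravel())
            return mapped[idx].reshape(source.shape)

        # restored = np.dstack([match_channel(faded[..., c], reference[..., c])
        #                       for c in range(3)])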

  • Image content-based retrieval using chromaticity moments

    Page(s): 1069 - 1072

    A number of different approaches have been recently presented for image retrieval using color features. Most of these methods use the color histogram or some variation of it. If the extracted information is to be stored for each image, such methods may require a significant amount of space for storing the histogram, depending on a given image's size and content. In the method proposed, only a small number of features, called chromaticity moments, are required to capture the spectral content (chrominance) of an image. The proposed method is based on the concept of the chromaticity diagram and extracts a set of two-dimensional moments from it to characterize the shape and distribution of chromaticities of the given image. This representation is compact (only a few chromaticity moments per image are required) and constant (independent of image size and content), while its retrieval effectiveness is comparable to using the full chromaticity histogram.
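    As a rough sketch of the idea of a compact, size-independent color signature, the function below converts an RGB image to (r, g) chromaticities and returns a few low-order 2D moments of their distribution. The particular moment set, and any discretization of the chromaticity diagram used in the paper, are assumptions here.

        # Low-order moments of the (r, g) chromaticity distribution (sketch).
        import numpy as np

        def chromaticity_moments(rgb, orders=((1, 0), (0, 1), (1, 1), (2, 0), (0, 2))):
            px = rgb.astype(float).reshape(-1, 3)
            px = px[px.sum(axis=1) > 0]          # ignore pure-black pixels
            s = px.sum(axis=1)
            r, g = px[:, 0] / s, px[:, 1] / s    # chromaticities are size-independent
            return [float(np.mean(r ** p * g ** q)) for p, q in orders]

        # Images of different sizes but similar color content give similar moment
        # vectors, which can be compared with, e.g., Euclidean distance.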

  • Adaptive leases: a strong consistency mechanism for the World Wide Web

    Page(s): 1266 - 1276

    We argue that weak cache consistency mechanisms supported by existing Web proxy caches must be augmented by strong consistency mechanisms to support the growing diversity in application requirements. Existing strong consistency mechanisms are not appealing for Web environments due to their large state space or control message overhead. We focus on the lease approach that balances these trade-offs and present analytical models and policies for determining the optimal lease duration. We present extensions to the HTTP protocol to incorporate leases and then implement our techniques in the Squid proxy cache and the Apache Web server. Our experimental evaluation of the lease approach shows that: 1) our techniques impose modest overheads even for long leases (a lease duration of 1 hour requires state to be maintained for 1030 leases and imposes a per-object overhead of one control message every 33 minutes), 2) leases yield a 138-425 percent improvement over existing strong consistency mechanisms, and 3) the implementation overhead of leases is comparable to existing weak consistency mechanisms.
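    A minimal sketch of server-side lease bookkeeping is shown below: a reader obtains a time-bounded promise of notification, a write notifies only holders whose leases are still live, and expired leases cost the server nothing. The fixed duration and the notify callback are assumptions for the sketch, not the adaptive-duration policies or the Squid/Apache implementation described in the paper.

        # Server-side lease bookkeeping (illustrative sketch).
        import time

        class LeaseServer:
            def __init__(self, duration=60.0):
                self.duration = duration
                self.leases = {}                 # object_id -> {client: expiry time}

            def read(self, object_id, client):
                expiry = time.time() + self.duration
                self.leases.setdefault(object_id, {})[client] = expiry
                return expiry                    # client may cache until this time

            def write(self, object_id, notify):
                now = time.time()
                for client, expiry in self.leases.pop(object_id, {}).items():
                    if expiry > now:             # only live leases need invalidation
                        notify(client, object_id)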

  • A data mining algorithm for generalized Web prefetching

    Page(s): 1155 - 1169

    Predictive Web prefetching refers to the mechanism of deducing the forthcoming page accesses of a client based on its past accesses. In this paper, we present a new context for the interpretation of Web prefetching algorithms as Markov predictors. We identify the factors that affect the performance of Web prefetching algorithms. We propose a new algorithm, called WMo, which is based on data mining and is proven to be a generalization of existing ones. It was designed to address their specific limitations, and it takes all of the above factors into account. It compares favorably with previously proposed algorithms. Further, the algorithm efficiently addresses the increased number of candidates. We present a detailed performance evaluation of WMo with synthetic and real data. The experimental results show that WMo can provide significant improvements over previously proposed Web prefetching algorithms.
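    The baseline notion of a Markov predictor is easy to state in code: count observed page-to-page transitions and prefetch the most likely successors of the current page. The sketch below shows only this first-order baseline; WMo itself mines richer (ordered, not necessarily consecutive) access patterns than adjacent-pair transitions.

        # First-order Markov predictor over page-access sequences (baseline sketch).
        from collections import Counter, defaultdict

        class MarkovPrefetcher:
            def __init__(self):
                self.transitions = defaultdict(Counter)

            def train(self, sessions):
                for pages in sessions:
                    for cur, nxt in zip(pages, pages[1:]):
                        self.transitions[cur][nxt] += 1

            def predict(self, current_page, k=2):
                return [page for page, _ in self.transitions[current_page].most_common(k)]

        # p = MarkovPrefetcher(); p.train([["a", "b", "c"], ["a", "b", "d"]])
        # p.predict("a") -> ["b"]   (candidate pages worth prefetching)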

  • Epidemic algorithms for replicated databases

    Page(s): 1218 - 1238

    We present a family of epidemic algorithms for maintaining replicated database systems. The algorithms are based on the causal delivery of log records where each record corresponds to one transaction instead of one operation. The first algorithm in this family is a pessimistic protocol that ensures serializability and guarantees strict executions. Since we expect the epidemic algorithms to be used in environments with low probability of conflicts among transactions, we develop a variant of the pessimistic algorithm which is optimistic in that transactions commit as soon as they terminate locally and inconsistencies are detected asynchronously as the effects of committed transactions propagate through the system. The last member of the family of epidemic algorithms is pessimistic and uses voting with quorums to resolve conflicts and improve transaction response time. A simulation study evaluates the performance of the protocols.

  • Data management challenges and development for military information systems

    Page(s): 1059 - 1068

    This paper explores challenges facing information system professionals in the management of data and knowledge in the Department of Defense (DOD), particularly in the information systems utilized to support command, control, communications, computers, and intelligence (C4I). These information systems include operational tactical systems, decision-support systems, modeling and simulation systems, and nontactical business systems, all of which affect the design, operation, interoperation, and application of C4I systems. Specific topics include issues in integration and interoperability, joint standards, data access, data aggregation, information system component reuse, and legacy systems. Broad technological trends, as well as the use of specific developing technologies, are discussed in light of how they may enable the DOD to meet the present and future information-management challenges.

  • An approach for modeling and analysis of security system architectures

    Page(s): 1099 - 1119

    Security system architecture governs the composition of components in security systems and the interactions between them. It plays a central role in the design of software security systems that ensure secure access to distributed resources in a networked environment. In particular, the composition of such a system must consistently assure the security policies that it is supposed to enforce. However, there is currently no rigorous and systematic way to predict and assure such critical properties in security system design. A systematic approach is introduced to address the problem. We present a methodology for modeling security system architecture and for verifying whether required security constraints are assured by the composition of the components. We introduce the concept of security constraint patterns, which formally specify the generic form of security policies that all implementations of the system architecture must enforce. The analysis of the architecture is driven by the propagation of the global security constraints onto the components in an incremental process. We show that our methodology is both flexible and scalable. It is argued that such a methodology not only ensures the integrity of critical early design decisions, but also provides a framework to guide correct implementations of the design. We demonstrate the methodology through a case study in which we model and analyze the architecture of the Resource Access Decision (RAD) Facility, an OMG standard for application-level authorization services.

  • Efficient biased sampling for approximate clustering and outlier detection in large data sets

    Page(s): 1170 - 1187

    We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, we analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling on real and synthetic data sets, and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
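    A minimal sketch of the density-biased idea, under the assumptions of a simple grid binning and a single user-chosen exponent e: each point is kept with probability proportional to (density of its cell)**e, so e < 0 over-samples sparse regions (useful for outlier detection) and e > 0 over-samples dense ones (useful for clustering). The paper's actual technique is more general than this.

        # Grid-based density-biased sampling (illustrative sketch).
        import numpy as np

        def density_biased_sample(points, cell_size, e, target_fraction, rng=None):
            """points: (n, d) NumPy array; e < 0 favors sparse cells, e > 0 dense ones."""
            rng = rng or np.random.default_rng()
            cells = [tuple(c) for c in np.floor(points / cell_size).astype(int)]
            counts = {}
            for c in cells:
                counts[c] = counts.get(c, 0) + 1
            weight = np.array([counts[c] ** e for c in cells], dtype=float)
            prob = np.minimum(1.0, target_fraction * len(points) * weight / weight.sum())
            keep = rng.random(len(points)) < prob
            return points[keep]          # expected sample size ~ target_fraction * n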

  • Efficient causality-tracking timestamping

    Page(s): 1239 - 1250

    Vector clocks are the appropriate mechanism used to track causality among the events produced by a distributed computation. Traditional implementations of vector clocks require application messages to piggyback a vector of n integers (where n is the number of processes). This paper investigates the tracking of the causality relation on a subset of events (namely, the events that are defined as "relevant" from the application point of view) in a context where communication channels are not required to be FIFO, and where there is no a priori information on the connectivity of the communication graph or the communication pattern. More specifically, the paper proposes a suite of simple and efficient implementations of vector clocks that address the reduction of the size of message timestamps, i.e., they do their best to have message timestamps whose size is less than n. The relevance of such a suite of protocols is twofold. From a practical side, it constitutes the core of an adaptive timestamping software layer that can be used by underlying applications. From a theoretical side, it provides a comprehensive view that helps better understand distributed causality-tracking mechanisms.
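    For context, a textbook vector-clock implementation is sketched below; it always piggybacks the full vector of n entries, which is precisely the cost the paper's protocols try to reduce.

        # Classic vector clock for n processes (the full-size baseline).
        class VectorClock:
            def __init__(self, n, pid):
                self.v = [0] * n
                self.pid = pid

            def local_event(self):
                self.v[self.pid] += 1

            def send(self):
                self.local_event()
                return list(self.v)              # timestamp piggybacked on the message

            def receive(self, ts):
                self.v = [max(a, b) for a, b in zip(self.v, ts)]
                self.v[self.pid] += 1

        def happened_before(ts_a, ts_b):         # does a causally precede b?
            return all(x <= y for x, y in zip(ts_a, ts_b)) and ts_a != ts_b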

  • Atomic broadcast in asynchronous crash-recovery distributed systems and its use in quorum-based replication

    Page(s): 1206 - 1217

    Atomic broadcast is a fundamental problem of distributed systems: it requires that messages be delivered in the same order to their destination processes. This paper describes a solution to this problem in asynchronous distributed systems in which processes can crash and recover. A consensus-based solution to the atomic broadcast problem has been designed by Chandra and Toueg for asynchronous distributed systems where crashed processes do not recover. We extend this approach: our protocol transforms any consensus protocol suited to the crash-recovery model into an atomic broadcast protocol for the same model. We show that atomic broadcast can be implemented requiring few additional log operations in excess of those required by the consensus. The paper also discusses how additional log operations can improve the protocol in terms of faster recovery and better throughput. To illustrate the use of the protocol, the paper also describes a solution to the replica management problem in asynchronous distributed systems in which processes can crash and recover. The proposed technique makes a bridge between established results on weighted voting and recent results on the consensus problem.


Aims & Scope

IEEE Transactions on Knowledge and Data Engineering (TKDE) informs researchers, developers, managers, strategic planners, users, and others interested in state-of-the-art and state-of-the-practice activities in the knowledge and data engineering area.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Jian Pei
Simon Fraser University