
Visualizing the Intellectual Structure with Paper-Reference Matrices

Visualizing the intellectual structure of scientific domains using co-cited units such as references or authors has become a routine for domain analysis. In previous studies, paper-reference matrices are usually transformed into reference-reference matrices to obtain co-citation relationships, which are then visualized in different representations, typically as node-link networks, to represent the intellectual structures of scientific domains. Such network visualizations sometimes contain tightly knit components, which make visual analysis of the intellectual structure a challenging task. In this study, we propose a new approach to reveal co-citation relationships. Instead of using a reference-reference matrix, we directly use the original paper-reference matrix as the information source, and transform the paper-reference matrix into an FP-tree and visualize it in a Java-based prototype system. We demonstrate the usefulness of our approach through visual analyses of the intellectual structure of two domains: Information Visualization and Sloan Digital Sky Survey (SDSS). The results show that our visualization not only retains the major information of co-citation relationships, but also reveals more detailed sub-structures of tightly knit clusters than a conventional node-link network visualization.



Visualizing the intellectual structure of scientific domains using co-cited units like references or authors has become a routine for domain analysis. Since Henry Small first introduced the notion of co-citation and used the node-link network to visualize the co-citation relationship of 10 famous particle physics papers [26], an abundance of studies have applied the visualization of co-citation relationships to map the intellectual structures of science as a whole, or of particular domains, or sub-domains.

Previous visualization studies normally transform original paper-reference matrices into reference-reference correlation matrices as data sources for visualization and analysis. Then the correlation matrix is projected into 2-D or 3-D maps for further interpretation [21]. One typical 2-D visualization technique is node-link network visualization in which nodes represent cited entities, and links connecting nodes represent co-citation relations among them.

Node-link visualization of co-citation relationships, however, sometimes contains tightly knit components in which cited items are densely connected. Such a densely connected component is good for macro-level analysis since it can be viewed as one cluster. In micro-level analysis, however, interpreting and analyzing such co-citation networks is a challenging task. Analysts often need to deal with the prohibitive complexity of the structure in order to identify meaningful sub-structures or detailed information. Methods have been developed to reduce the complexity of networks. Normally, these methods are based on the network structure that comes from reference-reference matrices. However, the transformation from a paper-reference matrix into a reference-reference matrix results in information loss, hence leaving fewer clues for analysts.

In this paper we propose a solution to the challenge from a new perspective. Instead of visualizing the co-citation relationships based on reference-reference matrices, we directly use the original paper-reference matrix as the information source, without losing information, and visualize the intellectual structure in a tree structure called the FP-tree [14]. We developed a Java-based prototype and tested the new approach through visual analyses of the intellectual structure of two domains: Information Visualization (InfoVis) and Sloan Digital Sky Survey (SDSS). The results of the case studies show that, compared with network visualization, our visualization not only retains the major information of co-citation relationships, but also reveals salient sub-structures of tightly knit clusters. The major contributions of this work are as follows:

  • We use paper-reference matrices to derive the co-citation relationship for visualization without information loss.

  • We visualize the new co-citation relationship in a tree structure, called the FP-tree.

  • Results from two case studies, one from real-life application settings, demonstrate the usefulness of the new approach.

  • We provide a freely available prototype for visualizing the intellectual structure with paper-reference matrices.


Related Work

Mapping the intellectual structure of science domains has become a well-established visualization routine in the field of library and information science. In 1973, Henry Small first introduced the notion of co-citation networks and demonstrated its application by mapping the domain of particle physics in terms of a node-link network of 10 co-cited documents. In a series of subsequent co-citation studies [13, 27-31], Small and Griffith documented the principles of co-citation analysis and its applications to map the advance of science and identify the dynamic intellectual structure of science as a whole, or of particular domains. Researchers later extended the unit of analysis from papers to authors, leading to author co-citation analysis (ACA) [35]. McCain [21] summarized the process of visualizing the intellectual structure as four general steps, namely, collecting data, constructing a raw co-citation matrix and converting it to a correlation matrix, multivariate analysis of the correlation matrix and visualization, and interpretation and validation.

The basic process of co-citation analysis often transforms a paper-reference matrix into a correlation matrix [19], [22], such as a paper-paper matrix (bibliographical coupling), or a reference-reference matrix (co-citation). Figure 1 shows an example of the matrix transformation process. The two steps are marked as 1 and 2 in a circle. The figure also shows a node-link network visualization of co-citation relationships derived from the reference-reference matrix in step 3. Figure 1 contains a paper-reference list in the upper left corner, which has five papers (P1 to P5) and nine references (R1 to R9). Each row in the list represents a paper with its references. In step 1, the paper-reference list is transformed and stored as a paper-reference matrix, in which all references (R1 to R9) are presented in columns of the matrix, and each row represents a paper. If a paper cites a reference, the corresponding cell is marked as "1"; if not, as "0". In step 2, the co-citation relation is obtained by transforming the asymmetric paper-reference matrix into a symmetric reference-reference matrix in which all references are presented in both columns and rows, and the co-citation relation between two different references is marked in the corresponding cell. Various correlation measurements such as cosine similarity, Pearson's correlation coefficient, and the Jaccard coefficient could be used to calculate the co-citation relationship. For simplicity, this example uses the co-cited frequency of two different references as the correlation measurement.
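
The two transformation steps can be sketched in a few lines of Python. This is a minimal illustration and not the tooling used in this study: the paper-reference list below is hypothetical (it is not the list shown in Figure 1), and the function names `citation_counts` and `cocitation_counts` are invented for the example. It uses the raw co-cited frequency as the correlation measurement, as in the example above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical paper-reference list (NOT the data of Figure 1).
papers = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3"],
    "P3": ["R2", "R4"],
}

def citation_counts(paper_refs):
    """Column sums of the paper-reference matrix: how often each reference is cited."""
    counts = Counter()
    for refs in paper_refs.values():
        counts.update(set(refs))
    return counts

def cocitation_counts(paper_refs):
    """Off-diagonal cells of the reference-reference matrix:
    how often two different references are cited by the same paper."""
    counts = Counter()
    for refs in paper_refs.values():
        counts.update(combinations(sorted(set(refs)), 2))
    return counts

cites = citation_counts(papers)
cocit = cocitation_counts(papers)

# Apply a minimum citation threshold of 2 before drawing the network.
kept = sorted(ref for ref, n in cites.items() if n >= 2)
links = {pair: n for pair, n in cocit.items() if set(pair) <= set(kept)}
print(kept, links)
```

Only pairwise counts survive this step; which papers produced them is no longer recorded.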

Fig. 1. The process of matrix transformation in co-citation analysis and the node-link co-citation network with the minimum citation threshold as 2, which means only references that are cited twice or more are presented in the network, hence removing R1 and R6.

Various visualization techniques are applied to map the co-citation relationship based on reference-reference matrices. Node-link network visualization is one of the most widely used techniques. A simple node-link network is shown in the bottom right corner of Figure 1 with the minimum citation threshold of 2, which means only references that are cited twice or more are presented in the network, hence removing R1 and R6. Compared to other visual representation techniques, like the self-organizing map (SOM) and multidimensional scaling (MDS), the node-link visualization can be applied to both macro- and micro-level mapping and analyses. Node-link visualization tools, such as CiteSpace [5] and Pajek [3], have been widely used.

Network visualizations of co-citation relationships like the one in Figure 1 may challenge visual analysts. The concise network shows a densely connected reference cluster. In some circumstances, such clusters help to reveal intellectual structures of scientific advance in a macro-view [5], [29]. However, when analysts need to look into the cluster and untie the relationships among cluster members, it is hard for them to identify detailed co-citation relations and meaningful sub-structures.

To deal with the challenge, many methods have been developed. Some focus on revealing salient structures based on network properties, while others focus on optimizing the network graph layout. Network properties such as link weight, degree, centrality, and betweenness are useful quantitative measurements of network structure. Pruning techniques such as the minimum spanning tree (MST) [1] and pathfinder network (PFNET) [4] remove weak links and retain the "backbone" for visual analysis. Other pruning methods remove links with high edge betweenness so that one can detect communities [11]. New network properties such as probability flow of random walks in a weighted and directed network were developed to discover the salient community structure [25].

The tightly knit clusters commonly introduce clutter and occlusion problems in graph layouts, hence preventing analysts from seeing salient structures and patterns. Many methods have been proposed to reduce visual clutter, for example, by optimizing nodes' positions [33] or clustering edges to bundles [7], and to reduce occlusions, for example, by enhancing graph layout with clustering [6] and constraint-based graph layout algorithms in general [8].

In general, these methods are all based on the existing network structure, which originates from a correlation matrix such as the reference-reference matrix. Mathematically, however, the transformation from the asymmetric paper-reference matrix to the symmetric reference-reference matrix results in information loss, and the transformation process is irreversible. Once the reference-reference matrix is formed, it cannot be transformed back into the paper-reference matrix without additional information. Further pruning of the network may lose even more information. For finding a macroscopic structure such as the multiple research communities in a domain, these methods work well. For analyzing microscopic structures, however, these methods may require information that has been lost in the transformation. Therefore, we hypothesize that if we can faithfully retain co-citation relationships based on the original paper-reference matrix and visualize these relationships in full detail, we will give domain analysts more information and potentially help them examine the intellectual structure of the domains they target.
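
The irreversibility can be illustrated with a small, constructed example. The sketch below assumes two hypothetical paper-reference lists (invented for illustration, not taken from either case study) that yield exactly the same citation counts and the same pairwise co-citation counts, yet differ in how often all three references are cited together. The helper names are likewise invented.

```python
from collections import Counter
from itertools import combinations

def reference_reference_matrix(paper_refs):
    """Return (diagonal, off-diagonal) cells of the reference-reference matrix."""
    singles, pairs = Counter(), Counter()
    for refs in paper_refs:
        refs = sorted(set(refs))
        singles.update(refs)
        pairs.update(combinations(refs, 2))
    return singles, pairs

def joint_count(paper_refs, group):
    """Number of papers that cite every reference in `group`."""
    return sum(1 for refs in paper_refs if set(group) <= set(refs))

# Two hypothetical paper-reference lists.
list_a = [["R1", "R2"], ["R1", "R3"], ["R2", "R3"]]
list_b = [["R1", "R2", "R3"], ["R1"], ["R2"], ["R3"]]

same_matrix = reference_reference_matrix(list_a) == reference_reference_matrix(list_b)
triple_a = joint_count(list_a, ["R1", "R2", "R3"])
triple_b = joint_count(list_b, ["R1", "R2", "R3"])
print(same_matrix, triple_a, triple_b)
```

Because both lists map to the same reference-reference matrix, no method that starts from that matrix can tell them apart.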


New Design

In order to test our hypothesis and visualize the intellectual structure based on co-citation relationships derived from paper-reference matrices, we adopt an existing algorithm, called the FP-tree algorithm, to store the original paper-reference matrix, and visualize the resulting hierarchical structure in a Java-based interactive prototype developed with the prefuse toolkit [15].

3.1 Data Transformation and Representation

The goal of our design is to build a visualization system that domain analysts can use to view the intellectual structure represented by paper-reference matrices and to discover potential sub-structures when tightly knit clusters occur. Very few studies in the InfoVis domain have focused primarily on visualizing the paper-reference type of data. In the data mining field, however, this type of data has been well studied, especially with association rule (AR) mining [2]. One algorithm, called FP-tree (Frequent Pattern tree), is explicitly designed to handle and store this type of data. Therefore, we adopt this algorithm for the data transformation and representation of our design. The next section explains the construction of an FP-tree.

3.2 FP-tree Algorithm

The FP-tree algorithm was first introduced by Han [14] to store transaction data in a compact format so that items frequently purchased together can be found. In this study we adopt the idea of the tree structure to store and visualize the co-citation relationships from a paper-reference matrix.

Figure 2 shows the three steps of the FP-tree construction with the same paper-reference list used in Figure 1. In step 1, all references are collected and sorted by the total number of times they are cited in the five papers, and then stored in a header table. In this example the minimum citation threshold is set to two, the same as the citation threshold used in the co-citation network visualization in Figure 1, hence removing R1 and R6. In step 2, the references in each row are sorted in descending order of their ranking in the header table and stored in the sorted paper-reference list shown in the middle left of Figure 2.

The FP-tree is then constructed in step 3 as follows. First, create the root of the tree. Then scan each row in the sorted reference list and create a branch for it. For example, scanning the first row, "P1: R2, R4, R3, R5," leads to the construction of the branch from R2 to R5, where R2 is the child of the root, R4 is linked to R2, R3 to R4, and R5 to R3. The second reference list contains R2, R4, and R7, which would result in a branch where R2 is linked to the root, R4 is linked to R2, and R7 to R4. However, this branch shares a common prefix of two nodes, R2 and R4, with the existing branch for P1. Therefore the node frequency of R2 and R4 in this tree is increased to 2, and only a new node for R7 is created under R4. In general, when considering the branch to be added for a reference list row, the count of each node along a common prefix is incremented by 1, and nodes for the references following the prefix are created and linked accordingly. The FP-tree based on the example paper-reference list is at the bottom of Figure 2, where nodes are labelled with their reference ID and node frequency. Notice that one reference such as R7 could appear in different locations as multiple nodes in an FP-tree.
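
The three construction steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Java prototype: the class `Node` and the functions `build_fp_tree` and `show` are invented names, ties in citation frequency are broken by reference ID (the source does not say how ties are handled), and the paper-reference list is hypothetical. Only the sorted rows for P1 and P2 mirror the rows quoted above; P3 to P5 are made up.

```python
from collections import Counter

class Node:
    """One FP-tree node: a reference ID plus the number of rows passing through it."""
    def __init__(self, ref, parent=None):
        self.ref = ref
        self.count = 0
        self.parent = parent
        self.children = {}  # reference ID -> child Node

def build_fp_tree(paper_refs, min_citations=2):
    # Step 1: header table, references ranked by citation frequency.
    freq = Counter(ref for refs in paper_refs.values() for ref in set(refs))
    header = sorted((r for r in freq if freq[r] >= min_citations),
                    key=lambda r: (-freq[r], r))  # tie-break by ID: an assumption
    rank = {ref: i for i, ref in enumerate(header)}
    root = Node(None)
    for refs in paper_refs.values():
        # Step 2: drop rarely cited references, sort the row by header rank.
        row = sorted((r for r in set(refs) if r in rank), key=rank.get)
        # Step 3: follow the shared prefix, create nodes where the row diverges.
        node = root
        for ref in row:
            if ref not in node.children:
                node.children[ref] = Node(ref, node)
            node = node.children[ref]
            node.count += 1
    return root, header

def show(node, depth=0):
    """Print the tree as an indented outline of 'reference:frequency' labels."""
    for child in node.children.values():
        print("  " * depth + f"{child.ref}:{child.count}")
        show(child, depth + 1)

# Hypothetical data; only P1 and P2 follow the rows quoted in the text.
papers = {
    "P1": ["R1", "R2", "R3", "R4", "R5"],
    "P2": ["R2", "R4", "R7"],
    "P3": ["R2", "R3", "R4"],
    "P4": ["R2", "R4", "R5"],
    "P5": ["R2", "R3", "R7"],
}
root, header = build_fp_tree(papers)
show(root)
```

With this made-up list, R7 ends up in two different nodes (one under R4 and one under R3), which is the multiple-node effect noted above.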

Fig. 2. Three steps of constructing an FP-tree from a paper-reference list. In the FP-tree each node is labeled with its reference ID and its node frequency in the tree.

The FP-tree is a compact format of the original paper-reference list because the transformation from the sorted paper-reference list to the FP-tree is reversible. The sorted reference list can be re-built by repeating the following process: pick any leaf of the tree, traverse back to the root, record the nodes along the path, and reduce the frequency of each node on the path by one. If the frequency of a node is reduced to zero, remove the node. Each recorded path represents one reference row in the sorted paper-reference list. Therefore no information loss occurs during the transformation. By visualizing the FP-tree, we expect to see more information than in reference-reference matrix-based visualizations, and to allow further pruning techniques to detect salient sub-structures and patterns.
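
The rebuilding procedure described above can be sketched as follows. This is an illustrative Python sketch, not the prototype's code: `Node`, `build`, and `rebuild_rows` are invented names, and the sorted rows are hypothetical. Note that `rebuild_rows` consumes the tree while it recovers the rows.

```python
class Node:
    def __init__(self, ref, parent=None):
        self.ref, self.parent = ref, parent
        self.count = 0
        self.children = {}

def build(sorted_rows):
    """Build an FP-tree from rows that are already sorted by header-table rank."""
    root = Node(None)
    for row in sorted_rows:
        node = root
        for ref in row:
            if ref not in node.children:
                node.children[ref] = Node(ref, node)
            node = node.children[ref]
            node.count += 1
    return root

def rebuild_rows(root):
    """Recover the sorted rows by repeatedly peeling one leaf-to-root path."""
    rows = []
    while root.children:
        node = root
        while node.children:                     # walk down to any leaf
            node = next(iter(node.children.values()))
        path = []
        while node.parent is not None:           # walk back up to the root
            path.append(node.ref)
            node.count -= 1
            parent = node.parent
            if node.count == 0:                  # node is used up: remove it
                del parent.children[node.ref]
            node = parent
        rows.append(path[::-1])
    return rows

# Hypothetical sorted paper-reference list.
sorted_rows = [
    ["R2", "R4", "R3", "R5"],
    ["R2", "R4", "R7"],
    ["R2", "R4", "R3"],
    ["R2", "R4", "R5"],
    ["R2", "R3", "R7"],
]
recovered = rebuild_rows(build(sorted_rows))
print(sorted(recovered) == sorted(sorted_rows))
```

The rows come back in a different order and without their paper IDs, but as a collection of reference rows they match the input.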

3.3 Visual Representations

Several visual representations are available for tree visualizations, including classic 2-D [34] or 3-D [24] hierarchical node-link tree models, tree-map model [16], radial tree model [9], and radial space-filling hierarchical tree [32].

Among the various visual representation studies, two provide useful examples for our visual design. Keim et al. [18] applied the radial space-filling hierarchical model, in a system called FP-viz, to visualize the FP-tree for mining frequent patterns. They used a SAS example dataset that has seven grocery items in 1,500 transaction records to visualize frequently co-purchased items. The other study is the Word Tree [34] by Wattenberg and Viegas, which visualizes the suffix tree of sentences in documents based on the idea of keyword-in-context in a classic 2-D node-link hierarchical model.

FP-viz uses the same FP-tree algorithm as this study and provides a good example of tree visualization. The space-filling hierarchical model it uses fits the task that FP-viz targets, that is, finding the frequent patterns among a certain group of items. However, no validation or evaluation of this visual representation model is discussed in their study. The Word Tree is similar to the FP-tree structure in that it groups identical high-frequency tokens (words or terms in the Word Tree, references in our FP-tree) into one parent node and positions the tokens that follow them in subtrees. Wattenberg argued that the 2-D node-link hierarchical tree model has instant readability, and "[I]t immediately communicates to viewers that they are looking at a tree structure."([34] p.1222)

Given the instant readability of 2-D node-link hierarchical model, and given that this study would compare the FP-tree visualization with traditional node-link network visualization for presenting intellectual structures, this design adopts the 2-D node-link tree model as the tree structure visual representation. In addition, some encoding techniques, like setting the size of nodes proportional to their frequency, are adopted from the Word Tree visualization too.

3.4 Implementation Notes

A prototype of the FP-tree visualization of co-citation relationships was developed using the prefuse visualization toolkit, leveraging the toolkit's interactive features such as overview, zooming, focusing, filtering, and searching. The interface of the prototype is shown in Figure 3. The standard input bibliographic files were retrieved from Thomson Reuters' Web of Science (WoS). The bibliographical files from WoS were first imported into DIVA's [23] database tool, where the citing and cited papers were imported into a paper-reference matrix and assigned unique IDs. We also wrote utility classes for constructing the FP-tree and converting other input files. The final visualization of the FP-tree was created in the prototype.

3.5 Scalability

Compared to the node-link network visualization, the FP-tree structure increases the number of nodes and links by several orders of magnitude, especially when clusters are dense. Hence, scalability is an implementation issue that needs to be addressed. This study considers the scalability problem from two perspectives: application and technique. From the application perspective, the FP-tree visualization normally works in micro-level visual analyses of intellectual structure and focuses on untying tightly knit clusters, which in many circumstances contain a small number of nodes. For visual analyses of a very large number of nodes, other methods, such as clustering, should be considered. Hence in FP-tree visualization, the scalability problem may be slight. From the technique perspective, the major CPU consumption of the FP-tree algorithm occurs in building the header table and sorting the paper-reference lists. The computing complexity varies depending on the specific implementation. Our prototype has an O(N log N) complexity for this step, where N is the number of unique references. Once the two files are created and stored, constructing the tree is fast, with O(N) complexity. Both complexities are linear or nearly linear. A large number of nodes in the FP-tree does affect interactive features of our prototype such as animated transitions, but the prototype still gives the feeling of instant feedback.


Case Studies

To evaluate our approach and further guide the design of our visualization, we conducted two case studies by applying the FP-tree visualization to two sets of bibliographic data. The major purpose of the case studies is to test the usefulness of the new approach, particularly to verify whether the FP-tree visualization can retain the important information in the intellectual structure of a domain, and whether the FP-tree structure can help domain analysts reveal fine-grained sub-structures when tightly knit clusters occur in a co-citation network.

Fig. 3. The FP-tree Visualizations of co-citation relationships from the InfoVis 2004 Contest dataset, depicting three key document branches led by Robertson (1991), Furnas (1986), and Tufte (1986). The highlighted branch depicts the co-citation relation among the three key documents. Bottom right: Node-link network views of the core co-citation network created by CiteSpace.

In the first case study we chose the information visualization domain and used the InfoVis 2004 Contest dataset [10] to build an FP-tree visualization and compare it with a previous node-link network visualization submission. The second case study comes from a real domain analysis task, which involved visualizing the intellectual structure of the Sloan Digital Sky Survey (SDSS) to fulfill the requirements of an ongoing interdisciplinary project.

In the two case studies, we provide both a node-link co-citation network visualization and an FP-tree visualization from the same datasets. Network visualizations were created in CiteSpace [5]. Due to the limits of the paper layout, we shrank several FP-tree images, which makes their detailed contents illegible. Higher-resolution images, a short video, and the Java-based prototype are available online at:

4.1 Case Study 1: Information Visualization

The InfoVis 2004 Contest data contains 614 citing papers, which cited 6,323 references. The reference format in this dataset is different from the WoS bibliographic data. Our parser identified the majority of the references, but the resulting co-citation network may not truly represent the structure. Therefore we referred to another co-citation network, created by one of the first-place entries from Indiana University, for the comparison of insights. The core cluster of their co-citation network contains about 40 highly cited papers. In our visualization we chose references that have been cited at least 10 times, and obtained a total of 44 references, which is nearly identical to the core cluster of their visualization.

Figure 3 shows the node-link network visualization and the FP-tree visualization of the core cluster in the InfoVis 2004 contest data. Each node in the FP-tree contains three sections: the unique node ID, the bibliographic content, and the node frequency, as annotated in area B of Figure 3. The size of a node is proportional to the square root of the node's frequency, for readability and to highlight the highly cited documents. The visualization by CiteSpace in area A of Figure 3 sets the size of nodes proportional to their citation counts, and sets the thickness of links proportional to the number of times the two connected nodes are co-cited.
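
The square-root size encoding can be written out as a one-line rule. The sketch below is only an illustration: the function name `node_size` and the scaling constant `base` are hypothetical and are not taken from the prototype.

```python
import math

def node_size(frequency, base=10.0):
    """Node size proportional to the square root of the node frequency.
    `base` is a hypothetical scaling constant chosen for illustration."""
    return base * math.sqrt(frequency)

# Square-root scaling compresses the range: a node with 100 times the
# frequency of another is drawn only 10 times larger.
ratio = node_size(100) / node_size(1)
print(node_size(4), ratio)
```

The compression keeps low-frequency nodes legible next to highly cited ones, which matters in a tree where most nodes have small frequencies.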

The major insight derived from the Indiana version of the co-citation network is the three key documents and their co-citation relationships, which can also be identified from the node-link network in Figure 3. In the FP-tree visualization, the three key nodes, Robertson, Mackinlay & Card (1991), Cone trees: animated 3D visualization of hierarchical information, cited 70 times; Furnas (1986), Generalized fisheye views, cited 69 times; and Tufte (1986), Visualization of quantitative information, cited 40 times, can be easily identified as they all have a large node size. Also, in this FP-tree they are all in the first level and each leads a large subtree.

Among the three key documents, Robertson (1991) has a very strong co-citation relationship with Furnas (1986), a total of 27 times. Both of them have fewer co-citation links with Tufte (1986), a total of six and seven times respectively. But how many times were the three key documents co-cited together? The answer cannot be found in the network visualization, since the information needed to answer this question was lost during the matrix transformation. In the FP-tree visualization, there is only one branch connecting the three key documents, as highlighted with a dark background in area C of Figure 3, where the node on the right is Tufte (1986) with a node frequency of three. According to the principle of FP-tree construction, a child node's frequency gives the number of times it is co-cited with all of its ancestors in the branch. Since the frequency of Tufte (1986) in this branch is three, the three key documents were co-cited together three times.
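
The same reading can be done programmatically for any group of references. The sketch below is a hedged illustration, not the prototype's code: it generalizes the single-branch reading above by summing over every node of the lowest-ranked reference in the group whose ancestors include all the other references. The names `Node`, `build`, and `joint_cocitations`, the header order, and the sorted rows are all hypothetical, and every reference in the group is assumed to appear in the header table.

```python
class Node:
    def __init__(self, ref, parent=None):
        self.ref, self.parent = ref, parent
        self.count = 0
        self.children = {}

def build(sorted_rows):
    """Build an FP-tree and index its nodes by reference ID."""
    root, nodes_by_ref = Node(None), {}
    for row in sorted_rows:
        node = root
        for ref in row:
            if ref not in node.children:
                node.children[ref] = Node(ref, node)
                nodes_by_ref.setdefault(ref, []).append(node.children[ref])
            node = node.children[ref]
            node.count += 1
    return root, nodes_by_ref

def joint_cocitations(group, nodes_by_ref, header):
    """Number of papers citing every reference in `group`, read from the tree."""
    deepest = max(group, key=header.index)   # lowest-ranked reference in the group
    others = set(group) - {deepest}
    total = 0
    for node in nodes_by_ref.get(deepest, []):
        ancestors = set()
        parent = node.parent
        while parent is not None and parent.ref is not None:
            ancestors.add(parent.ref)
            parent = parent.parent
        if others <= ancestors:              # branch contains the whole group
            total += node.count
    return total

# Hypothetical header table (rank order) and sorted paper-reference list.
header = ["R2", "R4", "R3", "R5", "R7"]
sorted_rows = [
    ["R2", "R4", "R3", "R5"],
    ["R2", "R4", "R7"],
    ["R2", "R4", "R3"],
    ["R2", "R4", "R5"],
    ["R2", "R3", "R7"],
]
root, nodes_by_ref = build(sorted_rows)
print(joint_cocitations(["R2", "R4", "R5"], nodes_by_ref, header))
```

When a group sits on a single branch, as with the three key documents above, the sum reduces to the frequency of one node.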

4.2 Case Study 2: Sloan Digital Sky Survey (SDSS)

The second case study involved three astronomers in a weekly meeting environment. In order to map the intellectual structure of SDSS, the first author presented a series of visualizations to the domain experts and collected their comments for further improvement.

The SDSS bibliographic data has a total of 2,318 papers, citing 35,528 unique references. Some of the highly cited papers received several hundred citations, which easily makes the co-citation clusters dense. The first network visualization of the co-citation relations in SDSS turned out to be a very dense "hairball," as shown in area A of Figure 4. Nodes in this hairball had been cited at least 15 times. There are a total of 1,535 links among the 128 nodes. When we introduced this visualization to the three astronomers, they found it difficult to interpret the intellectual structure. All they could say was that there is a big cluster; they had no further comments.

We then increased the minimum citation count to 100, reducing the core network to a small scale with 34 nodes and 313 links among them, which is shown in area B of Figure 4. By viewing this network, the astronomers could tell that half of the documents in the co-citation network are technical report papers of the SDSS project. These technical report papers introduced the background, principles, methods, and data of the SDSS project. Therefore publications associated with the SDSS tend to cite them as their intellectual bases. But the structure of the co-citation network is still so complex that little meaningful information about sub-structures can be revealed.

Later, we applied the FP-tree visualization to the co-citation relationships of the 34 nodes and revealed a clear pattern of sub-structures, which is shown in area C of Figure 4. We turned the FP-tree 90 degrees counterclockwise to best fit the page layout. Each node contains the same three sections as in the InfoVis FP-tree. For simplicity, the rest of the discussion uses only the node ID, first author, and publication year to describe a node. For example, 306-York(2000) represents the reference "306, York DG, 2000, ASTRON J. V120, P1579, 753", which is the paper by York et al. published in 2000, titled Sloan Digital Sky Survey: The Technical Summary.

It is clear that the major branch in the SDSS co-citation FP-tree is led by 306-York(2000) (753 citations). Other nodes such as 163-Spergel(2003) and 302-Stoughton(2002) also lead large branches, but these two branches do not flourish as much as the one led by 306-York(2000). Following the branch of 306, there is a strong trunk containing 302-Stoughton(2002), 278-Fukugita(1996), and 280-Gunn(1998). Then the trunk splits into two subtrees, which form the two major branches of the tree.

The FP-tree visualization of the SDSS references confirms the domain experts' expectations, in that the major trunk and its two branches are composed almost entirely of the technical report papers. Therefore this branch should form the major intellectual structure of the SDSS domain.

The FP-tree visualization reveals that reference 163-Spergel(2003) forms another intellectual structure, distinct from the SDSS technical report papers. We enlarge the node in area C of Figure 4 for readability. Reference 163 has a total of 223 citations, but nearly half of them (114) fall in the branch it leads at the bottom left of the FP-tree in area C of Figure 4. The domain experts explained that the 163-Spergel(2003) paper introduced another large-scale sky survey, the Wilkinson Microwave Anisotropy Probe (WMAP). Thus it is expected that astronomers published papers discussing topics associated with WMAP or its relation to the SDSS project.

Besides the expected patterns, an unexpected pattern surprised the astronomers. In Figure 4, the major trunk is split into two subtrees led by 280-Gunn(1998). We enlarge the two subtrees, which are shown in area D of Figure 4. The major branches of the two subtrees share nearly identical references, except that the subtree on the right is led by reference 334-Schlegel(1998). The formation of the two subtrees does not result from publication time, given that reference 334-Schlegel(1998) was published in 1998, earlier than most of the SDSS technical report papers (normally published after 2000).

The astronomers then gave their interpretation of this phenomenon: the 334-Schlegel(1998) paper constructed a full-sky map of the Galactic dust and discussed its use for Cosmic Microwave Background Radiation (CMBR). Therefore the subtree on the right in area D of Figure 4 may focus on how the CMBR study affected the SDSS project and later astronomical studies.



Visualizing the intellectual structure with the paper-reference matrix provides a new approach to look at the co-citation relationships that may not be accessible in traditional node-link network visualization, especially when tightly knit clusters exist. As Greenberg and Buxton argued, good usability in successful products often happens after, not before, usefulness [12]. Thus Liu commented that "it might not make much sense to do usability evaluation for novel techniques or systems, rather the focus of evaluation should be usefulness instead."(p.1178 [20]) Building on this idea, the above two case studies did not focus on how domain analysts used the FP-tree visualization prototype; rather we endeavoured to demonstrate the usefulness of the new visualization.

5.1 Usefulness of FP-tree Visualization of Co-citation Relationships

As discussed in the purpose of our case studies, usefulness is defined as whether the FP-tree visualization can retain the important information in the intellectual structure of a domain, and whether the FP-tree structure can help domain analysts reveal fine-grained sub-structures when tightly knit clusters occur in a co-citation network.

As the FP-tree visualization is constructed from the original paper-reference matrix, it theoretically retains all co-citation information. In turn, the visualization should reveal at least the same insights into the intellectual structure as the traditional node-link co-citation network does. This expectation turned out to be true in the two case studies. In case study 1, the insights into the intellectual structure of InfoVis that we discussed came from the previous contest submission. By reading the FP-tree visualization, the same insights were obtained. In addition, new insights were revealed, such as how many times the three key documents were co-cited together. This is natural, since this kind of information is lost during the matrix transformation used to construct the co-citation network, but not in the construction of the FP-tree.

Another primary benefit of the FP-tree visualization is that it offers a mechanism for domain analysts to untie a tightly knit co-citation cluster, and offers additional information about the co-citation relations among cluster members. As shown in the SDSS case study, when the inter-connectivity within a co-citation cluster becomes denser, domain analysts find it hard to make sense of the intellectual structure as anything other than a "hairball." By viewing the FP-tree visualization of the SDSS intellectual structure, the astronomers detected the salient sub-structure, and they could connect it to the reasons behind the major trunk and its two subtrees in Figure 4, as this pattern was already known to them from their routine literature review and communications.

Besides identifying the expected patterns, the FP-tree visualization reveals unexpected but interesting patterns, which is an additional benefit of this visualization. For example, why did the Schlegel(1998) paper separate the major trunk of the SDSS FP-tree? Why did nearly 100 papers cite this one, and another 100 not, when they all cited the other technical report papers? After seeing that Schlegel(1998) distinguished the two nearly identical branches, the astronomers started to raise these questions.

In addition, one reference can have multiple nodes on an FP-tree, and the distribution of those nodes and their frequencies may indicate how the reference's contributions are distributed across the overall intellectual structure. For example, in the SDSS FP-tree, half of the citation counts of reference 163-Spergel(2003) lie in the branch that primarily focuses on WMAP studies. The other half (109 citations) is distributed over 45 nodes with node frequencies ranging from 1 to 25, most of them between 1 and 5. These findings prompted domain experts to investigate the reasons behind these phenomena.
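How one reference comes to occupy several nodes, and how its node frequencies can be tabulated, can be sketched as follows. This is a minimal illustration rather than the prototype's code: it assumes the usual FP-tree construction of Han et al. [14], in which each paper's references are sorted by descending citation count (ties broken by label, an assumption made here) and inserted as a path in a prefix tree. The labels r1 to r3 and the names `Node`, `build_fp_tree`, and `node_frequencies` are hypothetical.

```python
from collections import Counter, defaultdict


class Node:
    """One FP-tree node: a reference plus the number of papers passing through."""

    def __init__(self, ref, parent=None):
        self.ref = ref
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(papers):
    """Insert each paper's reference list as a path, most-cited reference first."""
    freq = Counter(ref for cited in papers for ref in set(cited))
    root = Node(None)
    for cited in papers:
        node = root
        for ref in sorted(set(cited), key=lambda r: (-freq[r], r)):
            child = node.children.get(ref)
            if child is None:
                child = Node(ref, node)
                node.children[ref] = child
            child.count += 1
            node = child
    return root


def node_frequencies(root):
    """Map each reference to the counts of all of its nodes, largest first."""
    dist = defaultdict(list)
    stack = list(root.children.values())
    while stack:
        node = stack.pop()
        dist[node.ref].append(node.count)
        stack.extend(node.children.values())
    return {ref: sorted(counts, reverse=True) for ref, counts in dist.items()}


# Hypothetical toy corpus: each set is the reference list of one paper.
papers = [{"r1", "r2", "r3"}, {"r1", "r2", "r3"}, {"r1", "r3"}, {"r2", "r3"}, {"r3"}]
tree = build_fp_tree(papers)
distribution = node_frequencies(tree)
# distribution == {"r3": [5], "r1": [3], "r2": [2, 1]}
```

Here r2 is cited three times but appears as two nodes, one with frequency 2 below r1 and one with frequency 1 directly below r3. The node frequencies of a reference always sum to its citation count, so the list shows how those citations are spread over different co-citation contexts.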

5.2 Limitations

The FP-tree is a compact representation of the paper-reference list. Compared with a correlation-matrix-based network structure, however, the FP-tree places an additional cognitive burden on viewers.

Fig. 4. The tightly knit co-citation clusters in the SDSS intellectual structure are shown in areas A and B, and the corresponding FP-tree is shown in area C. The FP-tree visualization is rotated 90 degrees counterclockwise to fit the page layout. The two major subtrees led by 280-Gunn(1998) are enlarged in area D on the left.

One primary limitation of the FP-tree results from the multiple occurrences of the same reference, which can make the tree structure orders of magnitude larger than the co-citation network. The 34 SDSS references, for example, turn into 2,333 nodes in the FP-tree, and the majority of these nodes have a node frequency below 10. Without additional assistance, such as displaying all nodes that represent the same reference at the same time, it is hard to extract the detailed tree structure and node distribution in a relatively short period.

Another limitation of the FP-tree visualization concerns human cognitive intuition. When we introduced the FP-tree visualization to astronomers, they perceived each node in the tree as a unique reference. It was not natural for them to think of the same reference as having multiple corresponding nodes in the tree. After we carefully explained the construction process of the FP-tree, the domain experts began to understand the meaning of the tree structure and the roles that nodes and references play in it. What mental model corresponds to this feature, however, is still unknown.

5.3 Design Improvements

The multiple appearances of a single reference are a fundamental feature of the FP-tree. This feature can benefit co-citation analysis, for example by supporting the study of the distributed contribution of references, but it can also have the negative effects discussed above. In the prototype, we therefore leveraged the interactive features of the prefuse toolkit to facilitate quick exploration of the FP-tree. The prototype still needs improvement, however, as the tasks discussed above still take a relatively long time. We consider several further improvements:

  • Supply a new mechanism to quickly show the co-citation of a group of key documents. Although in case study 1 we quickly identified the co-citation pattern among the three key documents, finding the same pattern among other groups of documents in this FP-tree structure is still time-consuming. The FP-tree algorithm can quickly reveal such patterns, as Keim et al. did in [17]. Our future design will therefore integrate this function into the prototype.

  • Since the FP-tree structure contains all co-citation information, further pruning techniques could be applied to facilitate visual analysis of the intellectual structure with less information loss than pruning a network structure. For example, Word Tree employs a "level of detail" method to highlight the major subsets of branches, which could improve our prototype by increasing readability and revealing salient sub-structures.

  • Besides technical improvements, it is worth studying how users perceive the FP-tree's unique feature, namely that the same reference can appear in various locations in the tree. What mental model corresponds to this feature? How can that mental model be adopted to facilitate domain analysis of intellectual structures? How can interactive techniques accommodate it? We believe answers to these questions will greatly improve our future designs.
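The first improvement, quickly reporting how often a group of key documents is co-cited, could work along the following lines. This is a sketch under stated assumptions, not the prototype's implementation: it assumes an FP-tree built with a fixed global order (descending citation count, ties broken by label) and a header table linking each reference to all of its nodes. The class and method names (`FPNode`, `FPTree`, `cocitation_count`) and the labels r1 to r3 are hypothetical. The query takes the group member that sits lowest in the global order, visits each of its nodes, and adds the node's count whenever all other group members occur on the path back to the root.

```python
from collections import Counter, defaultdict


class FPNode:
    def __init__(self, ref, parent):
        self.ref = ref
        self.parent = parent
        self.count = 0
        self.children = {}


class FPTree:
    def __init__(self, papers):
        freq = Counter(ref for cited in papers for ref in set(cited))
        # Global order: most-cited reference first, ties broken by label.
        ordered = sorted(freq, key=lambda r: (-freq[r], r))
        self.rank = {ref: i for i, ref in enumerate(ordered)}
        self.root = FPNode(None, None)
        self.header = defaultdict(list)  # reference -> all of its nodes
        for cited in papers:
            node = self.root
            for ref in sorted(set(cited), key=self.rank.__getitem__):
                child = node.children.get(ref)
                if child is None:
                    child = FPNode(ref, node)
                    node.children[ref] = child
                    self.header[ref].append(child)
                child.count += 1
                node = child

    def cocitation_count(self, group):
        """Number of papers citing every reference in `group` (0 if empty or unknown)."""
        group = set(group)
        if not group or any(ref not in self.rank for ref in group):
            return 0
        # The lowest-ranked member lies deepest on any path containing the group.
        deepest = max(group, key=self.rank.__getitem__)
        others = group - {deepest}
        total = 0
        for node in self.header[deepest]:
            seen = set()
            ancestor = node.parent
            while ancestor is not None and ancestor.ref is not None:
                seen.add(ancestor.ref)
                ancestor = ancestor.parent
            if others <= seen:
                total += node.count
        return total


# Hypothetical toy corpus: each set is the reference list of one paper.
papers = [{"r1", "r2", "r3"}, {"r1", "r2", "r3"}, {"r1", "r3"}, {"r2", "r3"}, {"r3"}]
tree = FPTree(papers)
all_three = tree.cocitation_count({"r1", "r2", "r3"})  # 2
pair = tree.cocitation_count({"r2", "r3"})             # 3
```

Because every paper that cites the deepest group member passes through exactly one of that member's nodes, summing the qualifying node counts gives the joint co-citation count without rescanning the paper list.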



We have proposed a new approach that visualizes the intellectual structure as FP-trees based on co-citation relationships obtained directly from the paper-reference matrix. We evaluated our approach with two case studies using our Java-based prototype. As the approach is new, our evaluation focused on usefulness rather than usability. The results of the case studies show that the FP-tree visualization not only retains the major information of co-citation relationships but also reveals salient sub-structures. In one case, our approach uncovered unexpected patterns that are of interest to domain experts.

Meanwhile, the FP-tree also places an additional cognitive burden on viewers. Interactive mechanisms such as zoom, search, and filter are applied in the prototype to reduce this burden. However, future studies are much needed on how users perceive the FP-tree structure and on how other encoding mechanisms, such as size and color, can help analysts interpret and validate the FP-tree-based intellectual structure.

In conclusion, the FP-tree visualization contributes a new method that helps domain experts visualize and analyze the intellectual structure based on co-citation relationships, especially at the microscopic level when tightly knit clusters exist. Because the FP-tree visualization can reveal local details of sub-structures, it is a strong candidate to complement macroscopic visualizations that focus on global structure. Combining the two visualization methods will help analysts improve their understanding of the intellectual structure of the scientific domains they target.


This work is supported by the National Science Foundation under Grant No. IIS-0612129. Thomson Reuters provides the bibliographic data for the analysis. We also thank Michael S. Vogeley, Danny Pan and John Parejko for their constructive suggestions on the SDSS case study. Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is


Jian Zhang is with Drexel University, E-Mail:

Chaomei Chen is with Drexel University, E-Mail:

Jiexun Li is with Drexel University, E-Mail:

Manuscript received 31 March 2009; accepted 27 July 2009; posted online 11 October 2009; mailed on 5 October 2009.



1. Graph Theory : Modeling, Applications, and Algorithms

G. Agnarsson and R. Greenlaw

Upper Saddle River, NJ: Pearson/Prentice Hall, 2007.

2. "Mining association rules between sets of items in large databases,"

R. Agrawal, T. Imielinski and A. Swami

in 1993 ACM-SIGMOD international conference on management of data (SIGMOD'93), Washington, DC, 1993, pp. 207-216.

3. "Efficient Algorithms for Citation Network Analysis,"

V. Batagelj

in arXiv:cs/0309023v1, 2003.

4. "Generalised Similarity Analysis and Pathfinder Network Scaling,"

C. Chen

Interacting with Computers, vol. 10, pp. 107-128, 1998.

5. "CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature,"

C. Chen

Journal of the American Society for Information Science and Technology, vol. 57, pp. 359-377, 2006.

6. "Visual analysis of scientific discoveries and knowledge diffusion,"

C. Chen, J. Zhang and M. S. Vogeley

in 12th International Conference on Scientometrics and Informetrics (ISSI 2009), Rio de Janeiro, Brazil, 2009.

7. "Geometry-Based Edge Clustering for Graph Visualization,"

W. Cui, H. Zhou, H. Qu, P. C. Wong and X. Li

IEEE Transactions on Visualization and Computer Graphics, vol. 14, pp. 1277-1283, 2008.

8. "Exploration of networks using overview+detail with constraint-based cooperative layout,"

T. Dwyer, K. Marriott, F. Schreiber, P. J. Stuckey, M. Woodward and M. Wybrow

IEEE Transactions on Visualization and Computer Graphics, vol. 14, pp. 1293-1300, 2008.

9. "Drawing Free Trees,"

P. Eades

Bull. Inst. Combinatorics Appl, vol. 5, pp. 10-36, 1992.

10. "IEEE InfoVis 2004 Contest, the history of InfoVis,"

J.-D. Fekete, G. Grinstein and C. Plaisant


11. "Community structure in social and biological networks,"

M. Girvan and M. E. J. Newman

Proceedings of the National Academy of Sciences of the United States of America, vol. 99, pp. 7821-7826, 2002.

12. "Usability Evaluation Considered Harmful (some of the time),"

S. Greenberg and B. Buxton

in ACM Conference on Human Factors in Computing Systems (SIGCHI 2008), 2008, pp. 111-120.

13. "The Structure of Scientific Literatures II: Toward a Macro- and Microstructure for Science,"

C. B. Griffith, H. Small, A. J. Stonehill and S. Dey

Science Studies, vol. 4, pp. 339-365, 1974.

14. "Mining Frequent Patterns without Candidate Generation,"

J. Han, J. Pei and Y. Yin

Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004.

15. "Prefuse: A Toolkit for Interactive Information Visualization,"

J. Heer, S. K. Card and J. A. Landay

in ACM Human Factors in Computing Systems (CHI), Portland, Oregon USA: ACM Press, 2005, pp. 421-430.

16. "Tree-maps: A Space-Filling Approach to Visualization of Hierarchical Information Structure,"

B. Johnson and B. Shneiderman

in IEEE Information Visualization 1991, Indianapolis, IN USA, 1991, pp. 275-282.

17. "Visual Analytics: Definition, Process, and Challenges,"

A. D. Keim, N. Andrienko, J.-D. Fekete and C. Gorg

in Information Visualization: Human-Centered Issues and Perspectives, vol. 4950, A. Kerren, T. J. Stasko, J.-D. Fekete and C. North, Eds. New York, USA: Springer, 2008, pp. 154-175.

18. "FP-Viz: Visual Frequent Pattern Mining,"

A. D. Keim, J. Schneidewind and M. Sips

in IEEE Symposium on Information Visualization (Infovis 2005). Poster, Minneapolis, MN USA, 2005.

19. "Co-occurrence Matrices and Their Applications in Information Science: Extending ACA to the Web Environment,"

L. Leydesdorff and L. Vaughan

Journal of the American Society for Information Science and Technology, vol. 57, pp. 1616-1628, 2006.

20. "Distributed Cognition as a Theoretical Framework for Information Visualization,"

Z. Liu, J. N. Nersessian and T. J. Stasko

IEEE Transactions on Visualization and Computer Graphics, vol. 14, pp. 1173-1180, 2008.

21. "Mapping Authors in Intellectual Space: A Technical Overview,"

K. W. McCain

Journal of the American Society for Information Science, vol. 41, pp. 433-443, 1990.

22. "Mapping research specialties,"

S. A. Morris and B. Van der Veer Martens

Annual Review of Information Science and Technology, vol. 42, pp. 213-295, 2008.

23. "Time Line Visualization of Research Fronts,"

S. A. Morris, G. Yen, Z. Wu and B. Asnake

Journal of the American Society for Information Science and Technology, vol. 54, pp. 413-422, 2003.

24. "Cone Tree: animated 3D visualization of hierarchical information,"

G. G. Robertson, J. D. Mackinlay and S. K. Card

in ACM Conference on Human Factors in Computing Systems 1991, New Orleans, LA USA, 1991, pp. 189-194.

25. "Maps of Random Walks on Complex Networks Reveal Community Structure,"

M. Rosvall and T. C. Bergstrom

Proceedings of the National Academy of Sciences of the United States of America, vol. 105, pp. 1118-1123, January 29 2008.

26. "Co-citation in the scientific literature: A new measure of the relationship between two documents,"

H. Small

Journal of the American Society for Information Science, vol. 24, pp. 265-269, 1973.

27. "Co-citation context analysis and the structure of paradigms,"

H. Small

Journal of documentation, vol. 36, pp. 183-196, 1980.

28. "Knowledge representation via co-citation-cluster,"

H. Small

in Proceedings of the American Society for Information Science, 1981.

29. "Tracking and predicting growth areas in science,"

H. Small

Scientometrics, vol. 68, pp. 595-610, 2006.

30. "Citation context analysis of co-citation clustering - Recombinant-DNA,"

H. Small and E. Greenlee

Scientometrics, vol. 2, pp. 277-301, 1980.

31. "The Structure of Scientific Literatures I: Identifying and Graphing Specialties,"

H. Small and C. B. Griffith

Science Studies, vol. 4, pp. 17-40, 1974.

32. "Focus+context display and navigation techniques for enhancing radial, space-filling hierarchy visualizations,"

J. Stasko and E. Zhang

in IEEE Symposium on Information Visualization 2000, Salt Lake City, UT USA, 2000, pp. 57-65.

33. "Interactive visualization of small world graphs,"

F. van Ham and J. J. van Wijk

in IEEE Symposium on Information Visualization 2004, Austin, TX USA, 2004.

34. "The Word Tree, an Interactive Visual Concordance,"

M. Wattenberg and F. B. Viegas

IEEE Transactions on Visualization and Computer Graphics, vol. 14, pp. 1221-1228, November/December 2008.

35. "Author cocitation - a literature measure of intellectual structure,"

H. White and B. Griffith

Journal of the American Society for Information Science, vol. 32, pp. 163-171, 1981.


Jian Zhang

Student Member, IEEE