To give an overview, our method works as follows. The data is scanned and the significant single activities (1-sequences) are identified; these constitute the patterns of order 1. With a 1-sequence selected as the starting point, significance scores are calculated for each activity connecting to it. Connecting activities are then drawn in a tree-like representation (figure 1(a)) in which the currently explored starting pattern is drawn in the middle and activities connecting to it are drawn as nodes linking into or out of it. Visual cues are given for the significance and frequency of each node. A user can steer the exploration by choosing a connecting node to add to the identified 1-sequence, creating a 2-sequence. This starts a new iteration of the algorithm, and the nodes connecting to the new sequence are drawn. Having the user steer the creation of the sequences means that no pre-processing is needed. Also, the algorithm does not need to have previously identified all (n−1)-sequences in order to find the n-sequences. Given any sequence, it immediately calculates the significance of the activities connecting to it. We will now describe the details of the process.
3.3.1 Algorithm
The algorithm for identifying significant n-sequences has three steps:
1. Creation of an adjacency matrix.
2. Repeated application of formula 5 to get similarity scores.
3. Weighting of the retrieved scores.
The initial step is to find all significant 1-sequences. This prepares the ground for the exploration to start.
Similarity scores are calculated between the nodes of the total activity graph, A, and the structure graph
in → centre → out
by first computing the adjacency matrix, B, of A. This is done by setting each element (i, j) of B equal to the number of transitions from activity i to activity j. The rows of B then represent in-links to, and columns out-links from, each activity. The mutual reinforcing iteration, formula 5, is then used to assign 'in-', 'central-' and 'out-scores' to each activity node. We use a convergence criterion to decide the appropriate number of iterations, that is, the magnitude of k in formula 5. For the data that we use, the iteration always converges within 10 iterations.
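The adjacency step and the iteration can be sketched as follows. Formula 5 is not reproduced in this section, so the update used here is an assumption: it is a hub/authority-style reinforcement over the structure graph in → centre → out, chosen to match the behaviour described for the three scores. The function names, the integer activity codes, the tolerance and the joint normalization are all hypothetical choices made for illustration.

```python
import numpy as np


def adjacency_matrix(sequences, n_codes):
    """B[i, j] counts the transitions from activity i to activity j."""
    B = np.zeros((n_codes, n_codes))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            B[a, b] += 1
    return B


def _step(B, x_in, x_c, x_out):
    # Assumed update (not taken from formula 5 itself): each score is
    # reinforced by the scores of its neighbours in in -> centre -> out.
    new_in = B @ x_c                  # points to central activities
    new_c = B.T @ x_in + B @ x_out    # pointed to by in, points to out
    new_out = B.T @ x_c               # pointed to by central activities
    norm = np.sqrt(new_in @ new_in + new_c @ new_c + new_out @ new_out)
    if norm == 0.0:
        return new_in, new_c, new_out
    return new_in / norm, new_c / norm, new_out / norm


def in_central_out_scores(B, tol=1e-9, max_rounds=500):
    """Iterate until the scores stop changing (hypothetical criterion)."""
    n = B.shape[0]
    x = (np.ones(n), np.ones(n), np.ones(n))
    for _ in range(max_rounds):
        # Two updates per round: iterations of this kind can alternate
        # between two limits, so we compare every second iterate.
        y = _step(B, *_step(B, *x))
        change = max(np.abs(a - b).max() for a, b in zip(x, y))
        x = y
        if change < tol:
            break
    return x
```

For three toy diaries `[[0, 1, 2], [0, 1, 2], [3, 1, 2]]`, activity 1 sits between the others in every sequence, and under this assumed update it receives the highest central-score.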
The similarity scores give an indication of the behaviour of each node within the graph. Activities with a high in-score point to many significant activities, activities with a high out-score are pointed to by many significant activities, and activities with a high central-score both point to, and are pointed to by, significant activities. In the initial stage, a reasonable assumption is that the activity nodes with a high central-score are the most appropriate as patterns of single activities (1-sequences), since these are the most prominent 'link' activities holding together the overall sequences of the individuals in the data. Hence we choose the central-score as the significance score for the single activities.
The computed similarity scores are a measure of significance that depends on the connectivity of the nodes in the graph. Weighting this measure by the frequency of occurrence of the nodes gives a more general sense of significance in the context of the activity dataset. For example, a node pointed to many times by a single low-scoring node can be as interesting as a node pointed to once by each of several significant nodes. However, using only the similarity scores, the former would not be considered a significant node. Weighting the scores by frequency of occurrence balances these factors. Furthermore, since we are interested in the relative values of the scores, we normalize them to the range [0, 1]. At this point the activity patterns of order 1 (1-sequences) have been found along with their significance scores, and the exploration of higher order patterns can begin.
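The weighting and normalization can be illustrated with a minimal sketch. The text does not state how the two quantities are combined or how the range [0, 1] is reached, so the element-wise product and the division by the maximum used here are assumptions, as is the function name.

```python
import numpy as np


def weighted_significance(scores, frequencies):
    """Weight similarity scores by frequency and rescale to [0, 1].

    Assumption: weighting is a plain product and normalization divides
    by the largest weighted score; the source does not specify either.
    """
    weighted = np.asarray(scores, dtype=float) * np.asarray(frequencies, dtype=float)
    top = weighted.max()
    return weighted / top if top > 0 else weighted


# A low-scoring but very frequent activity (index 1) ends up ranked
# above a high-scoring but rare one (index 0).
significance = weighted_significance([0.5, 0.1, 0.2], [2, 20, 5])
```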
In order to explore n-sequences, an (n-1)-sequence is used as a query sequence. A subgraph of the total activity graph, A, is found and the adjacency matrix, B, of this subgraph is computed. So to explore 2-sequences, for example, a 1-sequence is used as a query sequence.
The subgraph considered consists only of the currently explored query sequence itself, which can be seen as a supernode, q, and of the activities linking into and out of the query sequence (figure 2). It is found by scanning the activity dataset for matches of the query and considering the activities preceding and succeeding these matches. An adjacency matrix, B, is computed by setting each element (i, q_1) of B, where q_1 is the first activity of the query sequence, q, equal to the number of transitions from activity i to the query sequence, and each element (q_{n−1}, j) of B, where q_{n−1} is the last activity of the query sequence, equal to the number of transitions from the query sequence to activity j. The rows of B then represent in-links to, and the columns out-links from, the query supernode. Figure 3(b) shows an example of such a subgraph adjacency matrix.
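The scan for query matches can be sketched as below. This is only an illustration: the in- and out-link counts are kept in two counters rather than in a full matrix, which holds the same non-zero entries of the subgraph adjacency. The function name and the string activity labels are hypothetical.

```python
from collections import Counter


def query_neighbours(sequences, query):
    """Count the activities directly preceding and succeeding each match of the query."""
    q = list(query)
    m = len(q)
    in_counts, out_counts = Counter(), Counter()
    for seq in sequences:
        for start in range(len(seq) - m + 1):
            if list(seq[start:start + m]) != q:
                continue
            if start > 0:
                # Transition from the preceding activity into the query.
                in_counts[seq[start - 1]] += 1
            if start + m < len(seq):
                # Transition from the query to the succeeding activity.
                out_counts[seq[start + m]] += 1
    return in_counts, out_counts


diaries = [
    ["sleep", "eat", "work", "eat"],
    ["wake", "eat", "work", "rest"],
    ["eat", "work"],
]
in_counts, out_counts = query_neighbours(diaries, ["eat", "work"])
```

Matches at the very start or end of a diary contribute no in-link or out-link respectively, as in the third diary above.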
Having computed the adjacency matrix of the subgraph, the mutual reinforcing iteration of formula 5 is applied in the same manner as before, and 'in-', 'central-' and 'out-scores' are assigned to each activity node connecting to and from the query sequence. The objective now is, given an (n-1)-sequence query, to explore all possible n-sequences. We are therefore interested in finding the activities linking into and out of the query and their respective scores, which means that we consider the activity nodes twice: once as in-nodes, which are the activities with high in-scores linking into the query, and once as out-nodes, which are those with high out-scores linking out of the query.
The weighting of the scores also occurs in two steps. All in-nodes are weighted by the number of times each of them appears as an in-link into the subgraph (their frequency of 'in occurrence'), while the out-nodes are weighted by the frequency of their 'out occurrence'. The weighted scores are then normalized to the range [0, 1]. As a result we obtain all nodes linking into the explored query (the (n-1)-sequence) with their significance scores (in-scores) and all nodes linking out of the query with their significance scores (out-scores). We are then able to explore the combinations that can be made by adding these nodes to the query.
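A minimal sketch of this second weighting step, applied separately to the in-nodes and the out-nodes, might look as follows. As before, the product weighting and division by the maximum are assumptions, and the function name, score values and activity labels are hypothetical.

```python
def rank_nodes(scores, counts):
    """Weight each node's score by its in- or out-occurrence count,
    rescale to [0, 1] and return (activity, significance) pairs sorted
    from most to least significant.

    Assumption: product weighting and max-normalization.
    """
    weighted = {a: scores.get(a, 0.0) * c for a, c in counts.items()}
    top = max(weighted.values(), default=0.0)
    if top > 0:
        weighted = {a: w / top for a, w in weighted.items()}
    return sorted(weighted.items(), key=lambda item: item[1], reverse=True)


# In-nodes of a query: 'wake' has the lower raw in-score but precedes
# the query four times, so it ranks first after weighting.
ranked_in = rank_nodes({"sleep": 0.6, "wake": 0.3}, {"sleep": 1, "wake": 4})
```

The same function would be called a second time with the out-scores and out-occurrence counts to rank the out-nodes.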
Alternative weighting by features of the data can be combined into the procedure in two ways, depending on the nature of the weighting to be applied. The adjacency values can be modified for each activity transition to reflect its desirability; this is then carried into the calculation of the scores for each potential node. Alternative weightings can also be combined into the scores themselves at each iteration.
To summarize, the algorithm consists of three simple computational steps: an adjacency computation, a repeated matrix-vector multiplication and the application of weighting factors. The cost of computing the adjacency matrix, B, scales linearly with the number of transitions between activities, and so with the number of activities recorded by the study participants in their diaries. The dimensions of the matrix, M, and the scores vector, x_k, in formula 5 are determined by the number of distinct nodes in the activity graph. In the case of our social science activity diary data this is a fixed value of approximately 330 different activity codes. Hence this step takes constant time and M occupies approximately 8 MB. The matrix is very sparse so, in cases where the matrix is much larger, sparse matrix methods could be used. The simplicity of these operations is what makes the exploration process highly interactive. Furthermore, any rejection of identified sequences is performed by the user. There is no cut-off by the algorithm; the identified sequences are simply ordered by their significance and made available to the user for filtering and analysis.
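The quoted memory figure can be checked with a little arithmetic, under assumptions that are not stated in the text: that M is stored densely with 8-byte floating point entries, and that its side is three times the number of activity codes (one block each for the in-, central- and out-scores). The function name and its parameters are hypothetical.

```python
def dense_matrix_megabytes(n_codes, blocks=3, bytes_per_entry=8):
    """Memory of a dense square matrix of side blocks * n_codes, in MB (10**6 bytes)."""
    side = blocks * n_codes
    return side * side * bytes_per_entry / 1e6


# About 7.8 MB for 330 activity codes under the assumptions above.
size_mb = dense_matrix_megabytes(330)
```

Under these assumptions the estimate is close to the approximately 8 MB reported, and it grows quadratically with the number of activity codes, which is why a sparse representation becomes attractive for larger code sets.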
3.3.2 ActiviTree
The sequence exploration algorithm is steered using an interactive visual interface in which the identified activity sequences performed by individuals in their daily life are represented in a tree-like fashion, which we therefore call 'ActiviTree' (figure 1(a)). The ActiviTree is implemented as a feature within a visual exploration tool for activity diary data, details of which can be found in [7], [23]. To start exploring patterns in the data, the significance of the single activities is first computed and the results are presented to the user as an ordered list (figure 4). Clicking on an activity in the list opens the ActiviTree and the exploration process can start. In the ActiviTree, time goes upwards.
The currently explored query activity sequence is drawn in the middle (as the trunk of a tree), the activities preceding it are drawn as nodes pointing to the query sequence (as roots connecting to the trunk) and the activities succeeding it are drawn as nodes pointed to by the query sequence (as branches growing from the trunk). Both in- and out-nodes are ordered by their significance score from left to right, so the leftmost activity is the most significant one. The frequency of occurrence of each node is mapped onto the opacity of the edge connecting it to the query, so frequent sequences are opaque while infrequent ones appear more transparent. Each activity node has a label that includes its activity code and its frequency of occurrence as an in- or out-node of the query. Right-clicking on an activity node shows an explanation of the activity code. Nodes are colour coded in accordance with the activity classification colour scheme in [7].
Clicking on an in or out activity node in the interface will add it to the query sequence and identify the in- and out-nodes of the new query sequence in a fraction of a second. Clicking on the head or tail of the query sequence will remove the selected node from the query and update its in- and out-nodes.
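The effect of these interactions on the query can be sketched with a double-ended queue. This is a hypothetical model of the query state only, not of the interface; the class and method names are illustrative, and refusing to remove the last remaining activity is an assumption.

```python
from collections import deque


class QuerySequence:
    """Query sequence that grows or shrinks at either end."""

    def __init__(self, start_activity):
        self.activities = deque([start_activity])

    def add_in_node(self, activity):
        # A clicked in-node precedes the query, so it is prepended.
        self.activities.appendleft(activity)

    def add_out_node(self, activity):
        # A clicked out-node succeeds the query, so it is appended.
        self.activities.append(activity)

    def remove_first(self):
        if len(self.activities) > 1:  # assumption: keep at least one activity
            self.activities.popleft()

    def remove_last(self):
        if len(self.activities) > 1:
            self.activities.pop()


query = QuerySequence("eat")
query.add_out_node("work")
query.add_in_node("wake")
```

After each change the in- and out-nodes would be recomputed for the new contents of `query.activities`.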
If too many nodes connect to the explored query, the representation can become cluttered and the nodes overlap. We avoid this by implementing a decluttering mode which shortens the edges of the nodes and expands a few at a time on mouse-over (figure 5). Nodes remain clickable in this mode, so the user is still able to select a node to add to the query and continue the exploration.
Each time a node is added to the currently explored query sequence, the newly identified activity sequences and their locations in the dataset are saved. Removing a node from the query sequence and choosing a different exploration path replaces the saved sequences and their locations. The identified activity sequences can be highlighted in a separate linked view within the context of the individuals' days in order to study their distribution more closely. This can be seen in figure 1(b), where the query sequence of figure 1(a) is highlighted. In the separate linked view, time is represented on the y-axis going upwards and the individuals are ordered along the x-axis by sex and age. So, from left to right, one can see older men to young boys and older women to young girls. This view reveals the distribution of the explored sequence across ages and sexes, as well as the duration of the sequence, its repetitiveness, and its time of occurrence during the day.