A Similarity Measure in Formal Concept Analysis Containing General Semantic Information and Domain Information

Formal concept analysis (FCA) has gained increasing favor among big data scientists due to its unique advantages. Concept similarity measurement is the key to FCA-based applications. Most previous methods are based on set theory and pay little attention to semantic information, whereas those that focus on semantic information usually rely on ontologies or knowledge bases to obtain the relevant semantic knowledge. However, it is difficult for knowledge-based methods to obtain domain knowledge from formal contexts (datasets), so they are not well suited to domain text data. To tackle these problems, this paper proposes a novel formal concept similarity measure that synthesizes the Semantic information in knowledge bases and the Domain information in the formal context (the S&D measure). S&D uses word vectors as word representations to obtain the semantic information in general knowledge bases, while defining novel semantic relations of intent words to obtain the domain information contained in the data itself. It can measure the similarity relation of concepts more comprehensively and precisely, particularly in a domain textual formal context, and it can be implemented automatically and unsupervisedly without any knowledge base, ontology, or external corpus. Experiments show that this method correlates better with human judgment than other related works.


I. INTRODUCTION
Formal Concept Analysis (FCA) was introduced by R. Wille as a mathematical theory [1], [2] and became an effective technique that constructs a concept hierarchy (i.e., the concept lattice) from the binary relation between objects and attributes in the data set (i.e., the formal context, or the context for short), and then analyzes data using lattice-algebraic theory, ensuring the maximum decomposition of information while retaining all data details. Since its inception, FCA has been applied in software engineering, machine learning, knowledge discovery, information retrieval, and other fields [3]-[5]. Recent research results on FCA-related algorithms [6]-[8] have enabled a wider application of FCA in the large-scale data processing field, among which applications over text contexts (textual formal contexts, i.e., contexts that take text documents as objects and the words in the documents as attributes) have a promising future [9]. Just as the recognition of semantically similar concepts has become a fundamental technological component in many fields, such as cognitive science, artificial intelligence, and the semantic web, the similarity measurement of formal concepts serves as the basis of almost all FCA-based data processing applications and plays a vital role [10]-[15].
Many studies focus on the similarity measurement of formal concepts. However, in those studies, most formal contexts extract only a few features from objects to serve as attributes, which differs from text contexts that use most words in the documents (objects) as attributes. These techniques work on the basis of set theory [16]-[21], so their results lack semantic content and they are not suitable for processing domain text data. The methods in [22]-[24] and [12] take the semantic relation of attribute words into account: they combine the semantic information from an ontology (or WordNet) with the statistical information of words from an external corpus (such as the Brown Corpus in [25]), and then calculate the Information Content (IC) of words from these two sources to evaluate the similarity between attribute words. Their contexts are mainly limited to small-scale text data in general domains, such as geographic names, hotel facilities, etc. Paper [13] evaluates semantic relations by using Wikipedia as the source of semantic knowledge, and paper [26] utilizes the linked data of DBpedia and the synsets of WordNet to judge the semantic correlation of words. It can be seen that the existing semantics-related methods obtain semantic information mainly from knowledge bases. However, reliance on alternative knowledge bases inevitably increases the complexity of the system and simultaneously reduces its reliability. Furthermore, as knowledge structures, knowledge bases undoubtedly suffer from the problem of Vocabulary Deficiency (VD): it becomes intractable if some words in the data do not exist in the knowledge bases.
Recently, the performance of distributional methods (such as word embedding) for lexical semantic similarity has far exceeded those of knowledge base methods [27]. So, it is simple and effective to calculate the semantic similarity of words based on the semantic information represented by word vectors. Moreover, it is easy for word vector methods to solve the VD problem. On the other hand, domain data contain abundant domain-related information, so it is obviously imprecise to judge the relations of concepts without such information, which cannot be obtained through a general knowledge base.
To overcome these problems, this paper proposes a novel formal concept similarity measure that synthesizes the Semantic information in knowledge bases and Domain information in the formal context (S&D measure), but it is independent of any knowledge base, ontology, or corpus (except the data set). Besides considering the number of the common members (objects or attributes) from the set theory perspective, this method considers the importance of members from the concept lattice perspective, and the measurement of the importance is put forward. Moreover, the semantic similarity of concepts is also reflected in terms of text context, and the measurement of semantic similarity of intents is given. Therefore, the unsupervised S&D method can automatically measure the similarity relation of concepts more comprehensively and precisely, particularly in a domain text context. The rest of this paper is organized as follows. Section II introduces the basic theory of FCA. Section III gives a brief overview of previous relevant works. Section IV describes the proposed method for measuring formal concept similarity systematically, and then compares it with benchmarks and evaluates it by the real-world data in experiments. The last section makes a summary and proposes future work directions.

II. BASIC THEORY OF FCA
FCA is a mathematical formalism [15] that provides a conceptual framework for analyzing, visualizing and structuring data to get a better understanding of them. The formal concept of FCA is defined in a formal context [22]. A formal context F is a triple (O, A, R), in which O is a set of objects, A is a set of attributes, and R is a binary relation defined on the Cartesian product of O and A, that is, R ⊆ O × A. For o ∈ O and a ∈ A, (o, a) ∈ R or oRa denotes that object o possesses attribute a. Given two sets E ⊆ O and I ⊆ A, define E′ = {a ∈ A | ∀o ∈ E, oRa} and I′ = {o ∈ O | ∀a ∈ I, oRa}. A formal concept c of (O, A, R) is a pair (E, I) such that E′ = I and I′ = E. Here, E and I are called the extent and intent of the formal concept (E, I), respectively.
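The derivation operators and the concept condition can be sketched as follows. This is a minimal illustration over a toy context; the document names ("doc1", "fca", etc.) are invented for the example.

```python
# Toy formal context: each object mapped to the set of attributes it possesses.
CONTEXT = {
    "doc1": {"fca", "lattice"},
    "doc2": {"fca", "lattice", "mining"},
    "doc3": {"mining"},
}

def extent_prime(E):
    """E' : the attributes shared by every object in E."""
    if not E:  # by convention, the empty object set derives to all attributes
        return set().union(*CONTEXT.values())
    return set.intersection(*(CONTEXT[o] for o in E))

def intent_prime(I):
    """I' : the objects possessing every attribute in I."""
    return {o for o, attrs in CONTEXT.items() if I <= attrs}

def is_formal_concept(E, I):
    """(E, I) is a formal concept iff E' = I and I' = E."""
    return extent_prime(E) == I and intent_prime(I) == E

print(is_formal_concept({"doc1", "doc2"}, {"fca", "lattice"}))  # True
print(is_formal_concept({"doc1"}, {"fca"}))                     # False
```

The second call fails because {"doc1"}′ = {"fca", "lattice"} ≠ {"fca"}: a concept's intent must contain every attribute common to its extent.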
Given two concepts c1 = (E1, I1) and c2 = (E2, I2) within the context F, we call c2 a subconcept of c1 and c1 a super-concept of c2 (expressed as c2 ≤ c1) if E2 ⊆ E1 (or, equivalently, I2 ⊇ I1). The relation ≤ is a partial order, which denotes the double inclusion among concept components (extent and intent): the extent of a concept c is contained in the extent of each of its super-concepts, and in turn the intent of c contains the intent of each of its super-concepts [22]. For concepts p, c and s of F, we call p a parent-concept of s, and s a son-concept of p, if p > s and there is no concept c ∈ F with p > c > s. The son-concepts of all the parent-concepts of c and the parent-concepts of all the son-concepts of c belong to the same level as c in terms of inheritance, so these concepts are called general sibling-concepts.
All concepts within F and their partial order relations constitute the concept lattice, denoted as L(F) or L(O, A, R). The Hasse diagram is the best and most general representation of a concept lattice [2], in which nodes denote concepts and arcs denote their partial order relations. The two most special nodes in a concept lattice, the top node and the bottom node (also called the most general and the most specific concepts), contain the most objects and the most attributes, respectively. To construct a concept lattice, it is necessary first to identify the objects relevant to the application domain of the data and their relevant features. From this point of view, a concept in FCA is not only an abstraction, but also a clustering of objects and their common attributes based on careful observation of reality [22].

III. RELATED WORK
Similarity measurement of formal concepts is of great importance to FCA, and many works on it have been done. Paper [28] put forward two similarity measures: the local similarity S_l and the global similarity S_g. For any two formal concepts c1 = (E1, I1) and c2 = (E2, I2), S_l and S_g are defined as follows:

S_l(c1, c2) = (1/2) (|E1 ∩ E2| / |E1 ∪ E2| + |I1 ∩ I2| / |I1 ∪ I2|) (1)

S_g(c1, c2) = (1/2) ((|O| − |E1 △ E2|) / |O| + (|A| − |I1 △ I2|) / |A|) (2)

where |O| and |A| refer to the number of objects and attributes in the formal context, respectively, and △ denotes the symmetric difference of two sets. Both measures take into account the coincidence degree of the members in the extents and the intents of two concepts; local similarity takes the two concepts as the universe, whereas global similarity takes the whole concept lattice. Papers [29] and [30] discussed the distance of concepts. Similar to the terminology of [28], a local distance and a global distance were defined, and then combined in some way to obtain the distance between two concepts. These two methods are, in essence, an integration or development of [28]. Paper [29] also discussed the relation between the concept distance in terms of set theory (i.e., the similarity of concepts) and that in terms of lattice theory (i.e., the sibling relationships of concepts), and concluded that there is no necessary correlation between the similarity and the physical distance of concepts in the same lattice. Moreover, the experimental results of [30] showed that, for acquiring information, there is no guarantee that the concepts with the best similarity will be found in any of the sibling spaces.
Paper [16] introduced rough set theory into a measure model. First, it finds meet-irreducible and join-irreducible concepts by virtue of the suprema and infima structures in a lattice; then, using these concepts, it calculates Tversky similarities [31] between the extents and the intents of two concepts respectively, and finally obtains the similarity of the two concepts by combining the extent Tversky similarity with the intent Tversky similarity. This method preserves both the feature information and the structural information of the lattice, and can be regarded as a development of the Tversky model.
The first similarity measure proposed by [17] is called weighted concept similarity, which is also a combination of two similarity values calculated from the extents and the intents of two concepts respectively; the formula used to compute the similarity of the two extents or intents can be the Jaccard index, the Sorensen coefficient, or the symmetric difference, according to the user's needs. Paper [17] also proposed another measure, which exploits the fact that a formal concept essentially corresponds to a full (all-ones) submatrix of the formal context matrix. In other words, its author believes that the similarity between two concepts depends on the proportion of nonzero elements in the context submatrix spanned by the two concepts. Experiments show that the matrix method has certain advantages for sparse contexts.
For applications in Case-Based Reasoning (CBR), papers [19] and [21] gave their own approaches. Consistent with the techniques above, they compute the overlap ratio between concepts, but they also consider the frequency of the common members, which enables the approaches to distinguish the properties of different common members in detail and gives them better practical significance for larger contexts. The difference between them lies in their view of the extents and the intents: when measuring the similarity of two concepts, the former argues that the extents and the intents contribute equally, whereas the latter believes that the intents are the decisive factors and are thus the only things that need to be compared. Also, paper [21] holds that the attributes in the intent should have different weights based on their categories, and that the weight of each attribute should be assigned by counting the frequency proportion of that attribute in different categories.
Some of the above methods are variants of [28] or of the Tversky model, and the others set the weights of common members in different ways. They are all based upon set theory, and none of them considers the effect of the text context on the similarity. As a result, the semantic relation between concepts is ignored.
Papers [22] and [23] observed the influence of the semantic relation of words (as attributes) on concept similarity. For two concepts, they calculated the similarity of the extents (based on set theory) and that of the intents (based on the semantic relation of words) respectively, and finally integrated them in some proportion. Measuring the semantic relation of words is the atomic operation of measuring that of intents. The difference between the methods of [22] and [23] lies precisely in their atomic operations: the former is based on human domain knowledge (an ontology), whereas the latter is based on the IC [25], synthesized from the structure of WordNet and the statistical data of external corpora.
Focusing on semantic similarity measures for FCA, besides WordNet, paper [26] utilized the Linked Data provided by the network knowledge base DBpedia as its source of information. Combining this with the idea of possibility theory, it holds that the similarity of two attribute words depends on the number of similar links that they share, while their dissimilarity depends on the number of different links that each has. The dominant feature of this approach lies in the similarity degree it gives, which is not a definite value but an interval composed of the minimum and maximum of the similarity of two concepts. As the authors say, applying such interval similarity in the different fields where FCA is used will be pursued.
Aiming at the domain text context, this paper proposes a novel formal concept similarity measure, which does not need any knowledge base, ontology or other corpora, but synthesizes the semantic information in the knowledge base and domain information in the formal context, so it is called S&D measure for short. The specific implementation process will be discussed in detail in the next section.

IV. S&D SIMILARITY IN FCA
A concept in FCA consists of an extent and an intent, in which an object or attribute can be called a member of the concept or the lattice. For two concepts, the intuitive and naive similarity relies on the number of common members, as described in equation (1). However, with regard to text data, the importance of each element (term or word) in a bag-of-words model (a document or paragraph) may differ. Hence, for two pairs of documents that possess the same number of common members but different members themselves, there is a strong possibility that the similarity of one pair differs from that of the other. For instance, consider three sets: S1 = {a/0.7, b/0.3} (set S1 includes elements a and b, whose importance values are 0.7 and 0.3 respectively), S2 = {a/0.6, c/0.4} and S3 = {a/0.3, d/0.7}. Comparing the similarity between S1 and S2 with that between S1 and S3, although the two pairs of sets share the same number of elements, the two similarity values are clearly different, and the former is in fact higher than the latter. Therefore, to accurately measure the similarity between two concepts, especially in a text context, the importance degree, rather than only the number, of common members should be taken into account. Also, in a text context, particularly a domain text context, the semantic similarity of the two concepts' intents, which depends on the number of common words and the semantic relationship between non-common words, should be taken into account.
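The intuition of the S1/S2/S3 example can be checked with a small sketch. The scoring rule below (sum of the smaller importance over shared elements) is only an illustrative weighting, not the measure proposed later in this paper:

```python
# Weighted sets from the example above: element -> importance.
S1 = {"a": 0.7, "b": 0.3}
S2 = {"a": 0.6, "c": 0.4}
S3 = {"a": 0.3, "d": 0.7}

def weighted_overlap(x, y):
    """Illustrative score: sum of the smaller importance value
    over the elements the two sets share."""
    return sum(min(x[e], y[e]) for e in x.keys() & y.keys())

print(weighted_overlap(S1, S2))  # 0.6
print(weighted_overlap(S1, S3))  # 0.3 -- lower, despite the equal overlap count
```

Both pairs share exactly one element, yet the weighted scores differ, which is exactly why member importance must enter the measure.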

A. MEMBER IMPORTANCE
Inspired by the idea of the well-known TF-IDF, we believe that if a member is contained in many concepts, the member is relatively ordinary; conversely, if it is contained in few concepts, its peculiarity is stronger. So, if two concepts share a certain number of peculiar members, their similarity is higher than that of two concepts sharing the same number of ordinary members. In view of this, we propose the terminology of Inverse Concept Frequency (ICF). If we refer to the proportion of concepts in a lattice that include a certain member as its Concept Frequency (CF), then ICF measures the reciprocal of CF, which shows the peculiarity degree of a member in a lattice (rather than in a concept).
Definition 1: Given a concept lattice L, the ICF of a member e_i, denoted as ICF_i, can be calculated as follows:

ICF_i = N / n_i (3)

where N is the number of all concepts in the lattice L and n_i denotes the number of concepts including the member e_i. It follows from the above statements that a member of a lattice belongs to at least one concept, so the value of n_i will not be zero. Note that the right-hand side of (3) was used in [19] and [21]; here we just give it a name. For a member (no matter whether an object or an attribute), the higher its ICF, the stronger its peculiarity. And, for a pair of concepts, the stronger their common members' peculiarities and the larger the number of such members, the larger the similarity value of the two concepts. Different from an object, which appears once in a concept, an attribute appears frequently, since it is shared by all objects of the concept. Moreover, the number of times the attribute appears in each object varies. Viewing the sum of the occurrence counts of an attribute over the objects of one concept as the total occurrence count of the attribute in the concept, we can conclude that the importance of attributes with different total occurrence counts should differ. Considering the different lengths of objects, we define the Attribute Frequency (AF) to measure the importance of an attribute in a certain concept.
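Taking ICF as the reciprocal of the concept frequency, as Definition 1 states, a minimal sketch looks as follows (the four-concept lattice is a toy example):

```python
# Each concept represented as (extent, intent); a toy 4-concept lattice.
LATTICE = [
    ({"o1", "o2", "o3"}, {"a1"}),
    ({"o1", "o2"}, {"a1", "a2"}),
    ({"o1"}, {"a1", "a2", "a3"}),
    ({"o3"}, {"a1", "a4"}),
]

def icf(member):
    """ICF = N / n_i: total number of concepts over the number
    of concepts whose extent or intent contains the member."""
    N = len(LATTICE)
    n = sum(1 for E, I in LATTICE if member in E or member in I)
    return N / n

print(icf("a1"))  # in every concept: an ordinary member (low ICF)
print(icf("a3"))  # in only one concept: a peculiar member (high ICF)
```

Here a1 appears in all four concepts (ICF = 1), while a3 appears in one (ICF = 4), matching the intended ordinary/peculiar distinction.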
Definition 2: Given a concept c_k = (E, I), the AF of the attribute a_i ∈ I in c_k, denoted as AF^k_i, is defined as follows:

AF^k_i = (1 / |E|) Σ_{o_j ∈ E} AF_{i,j} (4)

where |E| represents the number of objects in c_k and AF_{i,j} indicates the frequency of a_i in the object o_j ∈ E, which can be calculated by:

AF_{i,j} = af_{i,j} / l_j (5)

where af_{i,j} denotes the number of times that a_i occurs in o_j and l_j denotes the length of o_j. It is obvious that AF^k_i, in essence, is the mean value of the relative occurrence frequencies of a_i over the objects in c_k. Such a definition prevents the AF of an attribute from becoming oversized because of a large extent, and expresses the importance degree of every attribute in a certain concept fairly.
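Definition 2 can be sketched directly: average the length-normalized counts of the attribute over the objects of the concept's extent (documents and counts below are invented for illustration):

```python
# Objects as token lists; a concept c_k has extent E over these objects.
DOCS = {
    "o1": ["fca", "lattice", "fca", "mining"],  # length l = 4
    "o2": ["fca", "lattice"],                   # length l = 2
}
E = {"o1", "o2"}  # extent of c_k

def af(attr, extent):
    """AF of an attribute in a concept: mean over the extent of the
    per-object relative frequency af_{i,j} / l_j (Definition 2)."""
    freqs = [DOCS[o].count(attr) / len(DOCS[o]) for o in extent]
    return sum(freqs) / len(extent)

print(af("fca", E))  # (2/4 + 1/2) / 2 = 0.5
```

Averaging per-object frequencies (rather than summing raw counts) keeps AF comparable across concepts with small and large extents, as the definition intends.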
In a text context, the intents of two concepts are two different sets of terms (words), between which the similarity depends on, in textual view, not only the number of their common terms but also their own semantic meaning. In the next section, we will discuss the semantic relations of intents.

B. SEMANTIC RELATIONS OF INTENTS
The intent of a concept is a set of words, so the semantic similarity of two words is an element of the semantic similarity of two intents. The measurement of lexical semantic similarity is a long-standing topic. At present, computational linguists adopt methods based on either a general knowledge base (such as WordNet, Wikipedia, etc.) [27] or distributional word representations (like word embeddings) [32], and, of course, combinations of these two kinds of methods [33]-[35]. In recent years, by using neural networks trained on very large corpora, word embedding methods have outperformed knowledge-based methods in predicting human judgments of lexical similarity. Compared with previous WordNet-based methods, Word2set, a recently published WordNet-based method [27], has improved significantly in performance, which is, however, only close to that of word embedding methods. In terms of the idea of sense in WordNet, each word has multiple senses, each of which is usually bound to a particular semantic context, and a context-free word (that is, a word without a context, which can be regarded as a dictionary word) is a synthesis of all its senses in WordNet. When a word is in a context, it becomes the specific expression of one of its senses. So it can be said that the semantic relation between words is embodied in the relationship between their senses. The word vector obtained by word embedding under ideal conditions (trained on an ideal large-scale corpus) is an integrated expression of all the senses of the word, so it is not perfect for the meaning of words in a specific context. The model of [36] overcomes this problem, and its word vectors are context-sensitive, which can be employed for tasks such as Word in Context (WiC) similarity [35], [37].
Although the FCA application with a text context needs to consider the semantics of words in a specific document, it is more important to consider the semantics of them in the entire dataset (the formal context). So it is not a good choice to use the BERT vector in such FCA programs.
The set of words forming an intent derives from a set of documents (the extent) and consists of their common words. For a group of words, typical differences exist in the expression of senses between words shared across multiple documents and those within one document. Multiple terms in a single document belong to the same semantic environment, and, probabilistically, it is more likely for them to show a single meaning (assuming that the document has a single semantic environment). However, for a group of documents, because the semantic environment of each document is likely to differ, each common term across multiple documents may have a different sense in each document. Therefore, when the document set is large enough, the semantic meanings of common words correspond to those of the word vectors trained on the set. In other words, a word vector is essentially the expression of the meaning of a word as a common word in the documents of the large-scale training corpus. Because of this, word vectors [32], [38]-[42] trained on a large general corpus are a proper representation of the words in an intent. As mentioned in the introduction, any word representation model faces the challenge of VD. Models usually lack phrase words (such as ''New York'') rather than individual words (namely single words, such as ''New'' or ''York'' in ''New York''). One advantage of the vector representation of words is that the word vectors of individual words can be combined into the vector of a phrase word, which easily solves the VD problem.
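Combining individual word vectors into a phrase vector can be as simple as averaging; the toy vectors below are stand-ins for pretrained embeddings, and averaging is one common composition choice (not prescribed by the paper):

```python
import numpy as np

# Toy 3-dimensional word vectors (stand-ins for pretrained embeddings).
VEC = {
    "new":  np.array([0.2, 0.8, 0.1]),
    "york": np.array([0.6, 0.1, 0.7]),
}

def phrase_vector(words):
    """Approximate the vector of a phrase word missing from the model
    ("New York") by averaging the vectors of its individual words."""
    return np.mean([VEC[w] for w in words], axis=0)

v = phrase_vector(["new", "york"])
print(v)  # approximately [0.4, 0.45, 0.4]
```

This gives every phrase attribute a usable vector even when the embedding model's vocabulary lacks the multi-word entry, addressing the VD problem.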
One obvious problem is that a formal context normally cannot be infinite or as large as a general training corpus; that is, an intent word is not exactly a dictionary word in an ideal (large enough) corpus. Furthermore, the senses of intent words will tend toward the overall semantics of the formal context due to the influence of its domain properties. In summary, to measure the relation of a pair of intent words, we have to consider not only their comprehensive semantic meanings as dictionary words but also their domain-dependent semantic meanings as members of a context.
For instance, two intent words may have different senses in each document in the context. The closer the semantic relation between the two sets of senses (comprising all senses of the two words, respectively), the more similar the two words are. Similarly, for two words in different concepts, the bigger the intersection of their respective sense sets, the more similar the two concepts are. The size of this intersection does not depend on the sizes of the sense sets of the two words, but on the formal context (that is, it depends on the global semantic properties of the dataset). This implies that the intersection of the two sense sets is a subset of the intersection of all senses of the two words as dictionary words. Therefore, we perform semantic clustering on the context data. Two words in the same semantic category indicate that their senses have a strong correlation, so their semantic relation is closer. Since one word has multiple senses, each word after clustering should be allowed to belong to different categories. Thereupon, we adopt topic clustering, which is, to a certain degree, a clustering of each word according to its senses in the formal context. Each topic is a semantic category, and each word may belong to different categories, which depends completely on the semantic properties of the corpus itself.
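Topic clustering of the kind described can be sketched with scikit-learn's Latent Dirichlet Allocation; the paper does not prescribe a specific topic model, so LDA here is an assumed choice, and the four mini-documents are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Mini-documents standing in for the objects of a textual formal context.
docs = [
    "lattice concept lattice order",
    "concept lattice order theory",
    "neural embedding vector training",
    "vector embedding corpus training",
]

# Topic clustering over the context itself: each word may load on
# several topics, mirroring a word's multiple senses in the domain.
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Topic-word matrix: rows are topics, columns are vocabulary words.
print(lda.components_.shape)  # (2, 9) for this 9-word vocabulary
```

Thresholding each column of `lda.components_` yields, for every word, the set of topic categories it belongs to, which is the input needed for the term similarity defined next.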
Based on it, we define the semantic similarity of two terms in an attribute set.
Definition 3: For a concept lattice L = (O, A, R), given t1, t2 ∈ A, the semantic similarity of terms t1 and t2 is defined as follows:

Sim(t1, t2) = V1 · V2 / (‖V1‖ ‖V2‖), if n = 0; Sim(t1, t2) = 1 − (1 − V1 · V2 / (‖V1‖ ‖V2‖)) / (n + 1), otherwise (6)

where V_i is the word vector of term t_i, ‖V_i‖ represents the 2-norm of the vector V_i, and n is the number of times that t1 and t2 belong to the same topic category. Obviously, when the topics that they belong to have no intersection, the semantic similarity of t1 and t2 is the cosine similarity of their respective semantic meanings in the dictionary; and as the number of their common topics increases, they become more similar.
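One plausible reading of Definition 3 can be sketched as follows: cosine similarity of the word vectors when the two terms share no topic, pulled toward 1 with each shared topic. The vectors, topic memberships, and the exact blending formula are illustrative assumptions:

```python
import numpy as np

VEC = {  # toy word vectors (illustrative)
    "ontology": np.array([0.9, 0.1]),
    "logic":    np.array([0.7, 0.5]),
}
# Topic memberships obtained from clustering the formal context (illustrative).
TOPICS = {"ontology": {0, 2}, "logic": {0}}

def term_sim(t1, t2):
    """Term similarity: cosine of the word vectors, moved toward 1
    as the number n of shared topic categories grows."""
    if t1 == t2:
        return 1.0
    v1, v2 = VEC[t1], VEC[t2]
    cos = float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    n = len(TOPICS[t1] & TOPICS[t2])
    return 1.0 - (1.0 - cos) / (n + 1)

print(term_sim("ontology", "logic"))  # above plain cosine: one shared topic
```

With n = 0 the score reduces to the plain cosine, so the general-corpus meaning is the fallback when the context supplies no domain evidence.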
Following the thought of the maximum weighted matching problem in bipartite graphs [43], the semantic similarity between two intents (term groups) is the maximum similarity value matched between them. A non-formal definition is given here.
Definition 4: Sample the two sets of terms (that is, the two intents) without replacement to form sets of term-pairs; many such pair sets will be obtained. For each pair set, calculate and accumulate the semantic similarity values of the pairs within it, and then take the maximum accumulated value over all pair sets as the semantic similarity of the two intents.
For example, consider two intents I1 = {a1, a2, . . . , a_m} and I2 = {b1, b2, . . . , b_n} (let m < n, and let the attributes in the two intents exist in the form of terms). Without replacement, take terms out one by one from I1 and I2 respectively to form m term-pairs, which constitute one sample. Repeat the sampling process until all combinations have been enumerated. Each of these combinations (i.e., each sample) is a subset of I1 × I2, and within each sample, no two term-pairs share the same term. Find the sum of the similarity values Tse(a_i, b_j) of the m pairs within each sample; one value is obtained per sample, each corresponding to a set of term-pairs. The maximum of these values is the semantic similarity between intents I1 and I2, expressed as STse(I1, I2).
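The enumeration above can be sketched with a brute-force search over all one-to-one matchings; the pairwise similarities below are invented values standing in for Definition 3:

```python
from itertools import permutations

# Toy pairwise term similarities Tse(a_i, b_j) (illustrative values).
SIM = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.2, ("a1", "b3"): 0.4,
    ("a2", "b1"): 0.3, ("a2", "b2"): 0.8, ("a2", "b3"): 0.1,
}

def stse(I1, I2):
    """STse(I1, I2): best total similarity over all one-to-one
    matchings of the smaller intent into the larger one."""
    small, large = sorted([list(I1), list(I2)], key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        # Keys are stored as (I1-term, I2-term); accept either order.
        total = sum(SIM.get((s, t), SIM.get((t, s), 0.0))
                    for s, t in zip(small, perm))
        best = max(best, total)
    return best

print(stse(["a1", "a2"], ["b1", "b2", "b3"]))  # best matching: a1-b1 and a2-b2
```

Brute force keeps the sketch self-contained; in practice this maximum-weight matching is solved efficiently, e.g. with `scipy.optimize.linear_sum_assignment` on a negated similarity matrix.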

C. S&D SIMILARITY
Based on the discussion above, the importance of each object or attribute in a lattice varies with the frequency of its appearance in the lattice, whereas, for a certain attribute, its importance also varies with the concept in which it occurs, because the frequency with which it occurs usually differs from concept to concept. Therefore, the similarity of two concepts is closely related to the importance of their members, especially those they share. In addition, for a text context, since the intent is represented as a word set, there must be a semantic relationship between two intents, which inevitably affects the similarity of the two concepts in which they respectively exist. So, the proposed S&D similarity, which measures the overall similarity of two formal concepts within a text context, is defined as follows.
Definition 5: Given a concept lattice L, the S&D similarity of any two concepts c1 = (E1, I1) and c2 = (E2, I2) in L is defined as:

Sim(c1, c2) = w1 · Σ_{o_i ∈ E1∩E2} ICF_{o_i} / (Σ_{o_j ∈ E1} ICF_{o_j} · Σ_{o_k ∈ E2} ICF_{o_k})^{1/2} + w2 · Σ_{a_l ∈ I1∩I2} (AF^{c1}_{a_l} · ICF_{a_l} · AF^{c2}_{a_l} · ICF_{a_l})^{1/2} / (Σ_{a_m ∈ I1} AF^{c1}_{a_m} · ICF_{a_m} · Σ_{a_n ∈ I2} AF^{c2}_{a_n} · ICF_{a_n})^{1/2} + (1 − w1 − w2) · STse(I1, I2) / max(|I1|, |I2|) (7)

where o_i, o_j and o_k are objects, a_l, a_m and a_n are attributes, |I1| and |I2| represent the numbers of attributes in I1 and I2 respectively, and 0 < w1, w2 < 1 are weight factors. The first two terms of (7) express the set-theory-based similarity between the two concepts, obtained by using our AF-ICF, and the third term expresses the semantic similarity of the two concepts when they are treated as two sets of words. Generally, set-theory-based concept similarity measures consider that the extents and intents of two concepts make equal contributions to the similarity of the concepts, that is, w1 and w2 should usually be equal. So w1 = w2 = 1/3 becomes an ideal default for (7), in which the semantic component (the third term) of the equation is assumed to be ideal, i.e., able to fully and accurately measure the semantic correlation between the two word sets (i.e., concepts). Semantic information is contained in human language, so statistical computation on an ideal (large-scale) corpus is a typical way to obtain the exact semantics of words. Obviously, the size of the corpus affects the quality of the semantic information obtained. Many other factors affect this quality, which is beyond the scope of this paper and will not be discussed in detail. Here, we just want to show that, in general, the available corpora can hardly reach this ideal, so we suggest that the weight of the third term of (7) should not exceed 0.3.
When two terms (words) are the same, they obviously have the largest similarity value of 1. Hence, when two concepts share attributes, the semantic similarity between the two intents can be simply calculated by:

STse(I1, I2) = |I1 ∩ I2| + STse(I1 \ I2, I2 \ I1) (8)

Therefore, another expression of equation (7) for simplified calculation is as follows:

Sim(c1, c2) = w1 · Σ_{o_i ∈ E1∩E2} ICF_{o_i} / (Σ_{o_j ∈ E1} ICF_{o_j} · Σ_{o_k ∈ E2} ICF_{o_k})^{1/2} + w2 · Σ_{a_l ∈ I1∩I2} (AF^{c1}_{a_l} · ICF_{a_l} · AF^{c2}_{a_l} · ICF_{a_l})^{1/2} / (Σ_{a_m ∈ I1} AF^{c1}_{a_m} · ICF_{a_m} · Σ_{a_n ∈ I2} AF^{c2}_{a_n} · ICF_{a_n})^{1/2} + (1 − w1 − w2) · (|I1 ∩ I2| + STse(I1 \ I2, I2 \ I1)) / max(|I1|, |I2|) (9)

The proposed S&D similarity considers the importance of members from the viewpoints of both the lattice and the concept, based on our AF-ICF. It also considers the semantic relation between intents as collections of terms, and this semantic relationship reflects the domain-related information contained in the dataset, which is implemented by topic clustering. Thus, the S&D measure can evaluate the similarity relationship between two formal concepts in a domain text context more precisely.
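The overall combination can be sketched schematically. The extent term below follows one reading of Definition 5 (shared objects' ICF mass normalized by the geometric mean of each extent's ICF mass); the object ICF values and the plugged-in intent/semantic scores are illustrative:

```python
from math import sqrt

ICF = {"o1": 1.0, "o2": 2.0, "o3": 4.0}  # toy object ICF values

def extent_term(E1, E2, icf):
    """Extent component: ICF mass of shared objects, normalized by the
    geometric mean of the two extents' total ICF masses."""
    shared = sum(icf[o] for o in E1 & E2)
    return shared / sqrt(sum(icf[o] for o in E1) * sum(icf[o] for o in E2))

def sd_similarity(ext, intent, sem, w1=1/3, w2=1/3):
    """Weighted combination of the three components of the measure."""
    return w1 * ext + w2 * intent + (1 - w1 - w2) * sem

e = extent_term({"o1", "o2"}, {"o2", "o3"}, ICF)
print(sd_similarity(e, 0.5, 0.8))  # combined score in [0, 1]
```

With w1 = w2 = 1/3 all three components weigh equally; lowering the third weight, as suggested above, discounts the semantic term when the available corpus is far from ideal.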

V. EXPERIMENTS AND EVALUATION
Many works have contributed to the measurement of formal concept similarity, but its experimentation and evaluation remain a challenge. Papers [23] and [26] explained the reason: each study has its own inherent assumptions and, of course, individual advantages. For example, some methods require the support of a domain ontology [22], some have the advantage of simplicity [16], some need a specific classification structure [21], and some require the performance of a downstream application to reflect their effect [19], [13]. More importantly, no public data set is available on which the performances of the various approaches can be compared with each other. Therefore, it is difficult to implement the previous approaches to form an effective benchmark and further generate comparable results [12], [26]. At present, it is an advisable choice to compare the computed results of various concept similarity measures with human judgment [45], [13], [26], which provides an objective evaluation of these different works.
Paper [26] gives a simple concept lattice and the human judgment scores of the similarity of 8 formal concepts from the lattice. Therefore, this paper selects 10 documents to form the object set O = {o 1 , o 2 , . . . , o 10 } according to the requirement of this paper and defines the same attribute set A = {a 1 , a 2 , . . . , a 9 }, containing 9 attributes, in which a 1 to a 9 refer to nine noun phrases respectively: Data mining, Formal Concept Analysis, Artificial Intelligence, Ontology Engineering, Description Logic, Fuzzy Logic, Ontology Learning, Information Retrieval, and Pattern Recognition. Table 1 shows the formal context F = (O, A, R) corresponding to the two sets O and A; the concept lattice L(O, A, R) is shown in Table 2; and 8 typical concepts, the same as those in [28], are extracted from Table 2 to form Table 3. For the convenience of observation and comparison, we present the lattice not as a Hasse diagram but as a table. We choose the local similarity measure in [28] (denoted Sl), the matrix method in [17] (denoted Sz), the method in [19] (denoted Sf) and the methods in [26] as benchmarks. Paper [26] proposes several methods, of which the two best performing are selected: CASS Jac and CASS Sor (denoted CASJac and CASSor respectively). We also improve the approach in [21] by replacing the attribute weights of the original literature with the AF of this paper; the improved approach (denoted Sw+) serves as a further benchmark. The technique in [23] is not selected because of the weakness of knowledge-based methods, namely the VD problem. We evaluate the performance of all compared methods by computing the correlation between their results and those of human judgment. The Pearson correlation coefficient and Spearman's rank correlation coefficient are commonly used to measure the correlation of two sets of data from different perspectives.
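For reference, the formal concepts of a context such as Table 1 are exactly the pairs (E, I) with E′ = I and I′ = E. A brief sketch over a made-up toy context (illustrative data only, not the actual objects and attributes of Table 1) shows how such a lattice can be enumerated:

```python
from itertools import combinations

# Toy formal context: object -> set of attributes it has.
CONTEXT = {
    "o1": {"a1", "a2"},
    "o2": {"a2", "a3"},
    "o3": {"a1", "a2", "a3"},
}
ALL_ATTRS = set().union(*CONTEXT.values())

def common_attrs(objs):
    """Derivation operator E': attributes shared by every object in E."""
    if not objs:
        return set(ALL_ATTRS)
    return set.intersection(*(CONTEXT[o] for o in objs))

def common_objs(attrs):
    """Derivation operator I': objects possessing every attribute in I."""
    return {o for o, has in CONTEXT.items() if attrs <= has}

def concepts():
    """Enumerate all formal concepts (E, I) as closures of object subsets."""
    found = set()
    for r in range(len(CONTEXT) + 1):
        for subset in combinations(CONTEXT, r):
            intent = common_attrs(set(subset))
            extent = common_objs(intent)
            found.add((frozenset(extent), frozenset(intent)))
    return found
```

For instance, ({o1, o3}, {a1, a2}) comes out as a concept: a1 and a2 are exactly the attributes shared by o1 and o3, and o1 and o3 are exactly the objects having both a1 and a2.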
The former assesses the linear correlation between them, and the latter the monotonic one. Therefore, to reflect the correlation of the two sets comprehensively, we take the Average of the two Coefficients (denoted as ASP) as a further evaluation metric.
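Both coefficients and the ASP average are straightforward to compute without a statistics library; a self-contained sketch follows (ties are given average ranks, the standard convention for Spearman):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    """1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def asp(x, y):
    """Average of the two coefficients, as in the ASP metric above."""
    return (pearson(x, y) + spearman(x, y)) / 2
```

A perfectly monotonic but nonlinear relation (e.g., y = x^2 on positive x) yields a Spearman coefficient of 1 while the Pearson coefficient stays below 1, which is exactly why the two measures complement each other.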
The similarity values of the 8 typical concepts given by the various techniques and by human judgment (HJ) are listed in Table 4, in which HJ and the methods of [26] directly adopt the results reported in [26]. In Section III.C, we briefly discussed the setting of the weight factors of S&D. Considering the data size and the formal context of this experiment, we set the weight of the semantic component of S&D to 0.2; therefore, in this experiment, all variants of our method ran with the weight factors w 1 = w 2 = 0.4. The three correlation coefficient values of the different methods are shown in Table 5, in which CAJac-a (or CASor-a) denotes the average of the upper and lower limits of the CASJac (or CASSor) method. To show the relationships between the correlations of the methods clearly, the values of Table 5 are replotted in Fig. 1 as line charts.
As can be seen from Fig. 1-(a), Sz has the lowest correlation with HJ, and its result differs markedly from those of the others. Fig. 1-(b) is a partial enlargement of Fig. 1-(a), in which the gaps between the correlation coefficients of the methods other than Sz can be seen clearly. In this figure, both S&D-f (the first two terms of (7) or (9)) and Sf measure the similarity of concepts based on set theory while considering the importance of common members, and the former performs better, which indicates that the AF-ICF-based way of measuring member importance is superior to methods that merely consider the proportion of members in a lattice. Fig. 1 also shows that the Pearson coefficients of Sw+, S&D-t2 (the second term of (7) or (9)), CASor, CAJac, and S&D are higher than their Spearman's coefficients, indicating that these methods have a better linear than monotonic correlation with HJ, whereas Sz and Sf show the opposite. By comparing the performance of S&D-t2 and S&D, we can see that the other two modules of S&D, the object module (the first term of (7) or (9)) and the intent semantic module (the third term of (7) or (9)), promote the overall performance of S&D, especially in terms of monotonic correlation. This also explains why the gap between the two coefficients of S&D is smaller than that of the other methods except Sl and Sf. On the other hand, the ASP values of all the benchmarks, including Sl and Sf, are lower than that of S&D. All of this illustrates the effectiveness of a concept similarity measure based on both the AF-ICF importance of members and the semantic relations between intents containing domain information. One problem is that our method works better when using only the first two terms of (7) or (9) (i.e., S&D-f); that is, the proposed intent semantic module has a negative effect in this experiment.
So, is it necessary to use the semantic similarity of intents to measure the similarity of concepts? It should be noted that although this experiment is based on real-world data, a mini formal context with only 9 attribute words is used for the comparison with the benchmarks. Most text contexts in the real world are larger than the one we used, and in that case a concept will contain more attribute words; in other words, the concept will have a larger intent. The semantic similarity between concepts then becomes impossible to ignore and may have a greater effect on the similarity of concepts. To a certain extent, the mini context is also the reason why S&D-t2 and Sw+ perform poorly: both work only on the attributes of two concepts while considering the importance of those attributes, and the larger the text context, the better AF-ICF reflects the importance of attributes. In summary, the combination of AF-ICF-based member importance and the domain-information-bearing semantic similarity of intents can reflect the difference and relation between two concepts in a specific text context well. For a larger domain text context, we believe and expect that the presented method will find similar formal concepts more accurately than for a mini one.

VI. CONCLUSION AND FUTURE WORK
This paper proposes a method for measuring the similarity of formal concepts, which aims to measure the similarity relations of FCA concepts in text contexts of domain data fully automatically, without supervision and without any direct use of external corpora, ontologies or knowledge bases. The method can be used as a component of any system that applies FCA to knowledge mining and knowledge discovery. The similarity relation of concepts obtained by this technique contains not only syntactic information, in terms of text composition, but also semantic information, in terms of text meaning; it can thus reflect the semantic knowledge contained in general knowledge bases or other large-scale corpora while embodying the domain-related semantic knowledge contained in the data itself. However, just as many methods rely on certain knowledge bases (such as WordNet, Wikipedia or DBpedia), the precision of our method depends to some extent on that of the word embedding technique. As mentioned above, in lexical similarity tasks, word embedding techniques have proved evidently better than other techniques. More encouragingly, the word vectors used in this paper perform, to some extent, better than other word vectors [40] (http://conceptnet.io/). Besides, this paper compares the proposed method with other related works in a relatively comprehensive way, and the experimental results demonstrate the advantages of the method. Future work is to employ the method for knowledge mining in large-scale domain text data.