Unstructured open source information, especially the social, political, economic and cultural events described in web-based text and news articles, often contains possible motives for cyber security and trust issues. Automated processing of numerous open source intelligence sources requires the discovery of key domain terms, their conceptual hierarchies and the coherent relationships among them. A syntactic analysis of the word sequences in unstructured text documents allows for the extraction of subject-predicate-object triples, which form the basis for Term Extraction Patterns (TEPs). In our research, we use TEPs to discover domain-specific multi-word entities which, in turn, can be arranged in a taxonomy based on their semiotic inter-relationships. We explore the use of this method within the cyber security domain and analyze a collection of related news articles gathered from various public web sources. In this paper, we describe our initial results of term extraction and the semantic coherence derived from the TEP analyses. Our work extends beyond current methods, and our contribution is a novel methodology for extracting semantics from unstructured text in domain-specific open source information, together with its application to predicting cyber attack outbreaks.
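As a rough illustration of the subject-predicate-object idea described above, the following toy sketch pulls triples out of a sentence whose tokens already carry part-of-speech tags. It is not the TEP method of this paper: the tag names (`DET`, `ADJ`, `NOUN`, `VERB`), the helper functions `chunk` and `extract_triples`, and the hand-tagged example sentence are all assumptions made for illustration, and a real pipeline would rely on a proper tagger or parser.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs; the tag set is hypothetical


def chunk(tagged: Tagged) -> List[Tuple[str, str]]:
    """Group tokens into noun-phrase ("NP") and verb ("VP") chunks."""
    chunks = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag in ("ADJ", "NOUN"):
            # Collect a maximal run of adjectives and nouns.
            j = i
            while j < len(tagged) and tagged[j][1] in ("ADJ", "NOUN"):
                j += 1
            run = tagged[i:j]
            # Keep the run as a noun phrase only if it contains a noun.
            if any(t == "NOUN" for _, t in run):
                chunks.append(("NP", " ".join(w for w, _ in run)))
            i = j
        elif tag == "VERB":
            # Collect a maximal run of verbs as the predicate.
            j = i
            while j < len(tagged) and tagged[j][1] == "VERB":
                j += 1
            chunks.append(("VP", " ".join(w for w, _ in tagged[i:j])))
            i = j
        else:
            # Determiners are dropped; any other token breaks adjacency.
            if tag != "DET":
                chunks.append(("O", word))
            i += 1
    return chunks


def extract_triples(tagged: Tagged) -> List[Tuple[str, str, str]]:
    """Return (subject, predicate, object) for each NP-VP-NP sequence."""
    chunks = chunk(tagged)
    triples = []
    for k in range(1, len(chunks) - 1):
        before, verb, after = chunks[k - 1], chunks[k], chunks[k + 1]
        if verb[0] == "VP" and before[0] == "NP" and after[0] == "NP":
            triples.append((before[1], verb[1], after[1]))
    return triples


# Hand-tagged example sentence (an assumption, not data from the paper).
sentence = [
    ("The", "DET"), ("hacker", "NOUN"), ("group", "NOUN"),
    ("launched", "VERB"),
    ("a", "DET"), ("phishing", "NOUN"), ("campaign", "NOUN"),
]

triples = extract_triples(sentence)
# Multi-word subjects and objects are candidate domain terms.
multi_word_terms = [t for s, _, o in triples for t in (s, o) if " " in t]
```

In this sketch, the subjects and objects of the extracted triples that span more than one word serve as candidate multi-word entities; arranging them into a taxonomy is a separate step that the sketch does not attempt.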