2. Dark & Clear Web Crawling Mechanism for Indicator of Concern (IoC):
This layer aims to discover and acquire Surface, Deep, and Dark Web content, as well as open social media content (OSINT), relevant to cybersecurity risks as defined by the end-user requirements. Relevant Web multimedia content will be identified and crawled by domain-specific visual and programmatic focused crawlers (e.g., ACHE) able to navigate seamlessly between the Surface Web, the Deep Web, and the various darknets of the Dark Web. To this end, a crawling infrastructure will be developed, based for instance on Apache Nutch, and the crawled content will be indexed in Elasticsearch.
Moreover, keyword-based queries will be submitted to search APIs to retrieve further content: the APIs of search engines will be queried to identify additional relevant Web content, while the APIs of social media platforms (e.g., Twitter) will be queried to identify relevant social media posts. To increase search coverage and to address vocabulary mismatch and concept drift between queries and targeted content, these queries will be automatically reformulated using word embedding models. A pool of target-page classifiers will also be used to update the vocabulary of the crawler.
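
The embedding-based query reformulation described above could be sketched roughly as follows. This is a minimal illustration and not the project's implementation: the `EMBEDDINGS` table holds made-up three-dimensional toy vectors, and the function name `expand_query` and the parameters `top_k` and `min_sim` are hypothetical. In practice the vectors would come from a word embedding model trained on a cybersecurity corpus.

```python
import math

# Hypothetical toy embeddings (3 dimensions, invented values) used only to
# illustrate the idea; a real system would load vectors from a trained model.
EMBEDDINGS = {
    "malware": [0.9, 0.1, 0.0],
    "ransomware": [0.85, 0.2, 0.05],
    "trojan": [0.8, 0.15, 0.1],
    "exploit": [0.3, 0.9, 0.1],
    "vulnerability": [0.25, 0.85, 0.2],
    "football": [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, top_k=2, min_sim=0.9):
    """Return the original terms followed by their nearest embedding neighbours.

    For each query term, up to `top_k` vocabulary words whose cosine
    similarity is at least `min_sim` are appended. Terms missing from the
    embedding table are kept unchanged.
    """
    expanded = list(terms)
    for term in terms:
        vec = embeddings.get(term)
        if vec is None:
            continue  # out-of-vocabulary term: nothing to expand
        # Rank all other vocabulary words by similarity to the query term.
        scored = sorted(
            ((cosine(vec, other_vec), other)
             for other, other_vec in embeddings.items()
             if other != term),
            reverse=True,
        )
        for sim, other in scored[:top_k]:
            if sim >= min_sim and other not in expanded:
                expanded.append(other)
    return expanded


if __name__ == "__main__":
    print(expand_query(["malware"], EMBEDDINGS))
```

The similarity threshold keeps unrelated vocabulary (here, "football") out of the reformulated query, while `top_k` bounds how much each query grows before it is submitted to a search API.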