1. Databases-Data Lake:

IDUNN will develop a fair AI system for the prediction and real-time analysis of cybersecurity threats. The system will be trained on the following types of data:

  1. Real-time data streamed from hard-wired sensors that may be embedded in the machinery, scattered across the built environment of the site, or derived from sensors embedded in a smart
  2. Sensor data held in historical archives in the form of batch time series that provide access to valuable past data. These data will be used by the AI layer to detect anomalies.
  3. User-generated content (UGC) and open data in the form of images, videos, or text contributed by users and developers on online social network platforms such as Twitter, GitHub, Stack Overflow, Reddit, and
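
As an illustration of how the AI layer might flag anomalies in sensor time series, the following is a minimal sketch only. The detection method is not specified here, so the rolling z-score technique, the function name `rolling_zscore_anomalies`, the window size, the threshold, and the sample readings are all assumptions made for the example.

```python
from collections import deque
from math import sqrt


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return the indices of values that deviate from the mean of the
    preceding `window` values by more than `threshold` standard deviations.

    Assumption: a rolling z-score stands in for whatever detector the
    AI layer will actually use.
    """
    history = deque(maxlen=window)  # trailing window of past readings
    anomalies = []
    for i, x in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((h - mean) ** 2 for h in history) / window
            std = sqrt(var)
            # Skip the test when the window is constant (std == 0).
            if std > 0 and abs(x - mean) / std > threshold:
                anomalies.append(i)
        history.append(x)
    return anomalies


# Hypothetical sensor readings with one obvious spike at index 6.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.3, 25.0, 10.1, 9.8, 10.2]
print(rolling_zscore_anomalies(readings))
```

The same function could be applied to a live stream by feeding readings one at a time, or to a historical batch as shown. Note that a spike inflates the window's variance for the next few readings, which is one reason a production detector would likely use a more robust statistic.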

2. Dark & Clear Web Crawling mechanism for Indicator of Concern (IoC):

This layer aims to discover and acquire Surface/Deep/Dark Web and open social media content (OSINT) relevant to cybersecurity risks, as defined by the end-user requirements. Relevant Web multimedia content will be identified and crawled by domain-specific visual and programmatic focused crawlers (e.g. ACHE) able to navigate seamlessly between the Surface Web, the Deep Web, and the various Darknets in the Dark Web. To this end, a crawling infrastructure will be developed, based for instance on Apache Nutch, and the crawled content will be indexed in Elasticsearch.
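
The idea of a focused crawler can be sketched as a best-first traversal in which links found on relevant pages are visited before links found on irrelevant ones. The example below is a simplified illustration under stated assumptions: it does not use the ACHE or Apache Nutch APIs, and the toy link graph `TOY_WEB`, the keyword list, and the `relevance` scoring function are hypothetical stand-ins for real fetching and page classification.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "site-a/home": ("welcome to our forum", ["site-a/malware", "site-a/cooking"]),
    "site-a/malware": ("new ransomware exploit kit for sale", ["site-b/exploit"]),
    "site-a/cooking": ("pasta recipes and baking tips", ["site-a/home"]),
    "site-b/exploit": ("zero-day exploit leaked credentials dump", []),
}

# Assumed domain vocabulary; a real crawler would use trained classifiers.
KEYWORDS = {"ransomware", "exploit", "credentials", "zero-day", "malware"}


def relevance(text):
    """Fraction of words in the page that belong to the domain vocabulary."""
    words = text.lower().split()
    return sum(w in KEYWORDS for w in words) / len(words) if words else 0.0


def focused_crawl(seeds, fetch, min_score=0.1, max_pages=100):
    """Best-first crawl: links inherit the score of the page they were found on."""
    frontier = [(-1.0, url) for url in seeds]  # negated score gives a max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    while frontier and max_pages > 0:
        _, url = heapq.heappop(frontier)
        max_pages -= 1
        page = fetch(url)
        if page is None:  # unreachable page
            continue
        text, links = page
        score = relevance(text)
        if score >= min_score:
            relevant.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant


results = focused_crawl(["site-a/home"], TOY_WEB.get)
for url, score in results:
    print(f"{url}: {score:.2f}")
```

In this sketch the `fetch` argument is injected, so the same loop could be pointed at an HTTP client or a Tor proxy instead of the dictionary lookup; the pages judged relevant would then be passed to the indexing step.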

Moreover, appropriate keyword-based queries will be submitted to search APIs to retrieve further content: the APIs of search engines will be queried to identify additional relevant Web content, while the APIs of social media platforms (e.g., Twitter) will be queried to identify relevant social media posts. To increase search coverage and address vocabulary mismatch and concept drift between queries and targeted content, these queries will be automatically reformulated using word embedding models. A pool of target-page classifiers will also be used to update the vocabulary of the crawler.
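
Query reformulation with word embeddings can be illustrated by expanding each query term with its nearest neighbours in the embedding space. The following is a minimal sketch, assuming cosine similarity as the distance measure; the three-dimensional vectors in `EMBEDDINGS`, the function `expand_query`, and the parameters `k` and `min_sim` are hypothetical and chosen only to keep the example self-contained. A real deployment would load a trained embedding model instead.

```python
from math import sqrt

# Hypothetical toy embeddings, hand-written for illustration only.
EMBEDDINGS = {
    "malware": [0.9, 0.1, 0.0],
    "ransomware": [0.85, 0.2, 0.05],
    "trojan": [0.8, 0.15, 0.1],
    "phishing": [0.3, 0.9, 0.1],
    "scam": [0.25, 0.85, 0.2],
    "recipe": [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_query(terms, k=2, min_sim=0.9):
    """Append up to `k` sufficiently similar vocabulary words per query term."""
    expanded = list(terms)
    for term in terms:
        vec = EMBEDDINGS.get(term)
        if vec is None:  # out-of-vocabulary terms are kept unchanged
            continue
        neighbours = sorted(
            ((cosine(vec, other), word)
             for word, other in EMBEDDINGS.items() if word != term),
            reverse=True,
        )
        for sim, word in neighbours[:k]:
            if sim >= min_sim and word not in expanded:
                expanded.append(word)
    return expanded


print(expand_query(["malware"]))
print(expand_query(["phishing"]))
```

The similarity threshold keeps unrelated neighbours out of the expanded query, which matters because every added term widens the search and can reduce precision.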

3. Analytics (Reasoning):

The purpose of this layer of the architecture is to transform the information stored in the data lake into actionable knowledge that can be used by both machines and humans, enabling best-of-breed tools for intrusion detection and threat analysis. The utility of the Big Data analytics will be measured by suitable Key Performance Indicators (KPIs), Key Risk Indicators (KRIs), and Key Fairness Indicators (KFIs), which in this project are defined in terms of Data Repository infrastructure metrics. The analysis is further elaborated below: