DATA REQUIREMENTS FOR ANOMALY DETECTION


Steven Horn *, Cheryl Eisler *, Dr. Peter Dobias *, Lt(N) Joe Collins

Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2016

ABSTRACT

New processing techniques are being developed to extract and highlight anomalous maritime behaviour by leveraging the abundance of open source and/or commercially available information on a global scale. This common approach to data science relies on the exploitation of large datasets through methods such as data analytics and data mining. In the reverse aspect, one can instead explicitly consider the definition and types of anomalies that a maritime security operator would desire to know about to derive the quantity of data required to achieve a given level of confidence in detection. The requirements gap between the available information and the desired effect can be identified by working the problem of anomaly detection from both ends: exploiting the data available, and quantifying the desired end state. This work presents a framework for the definition of data requirements for a set of operationally relevant anomalies. By formally quantifying the data gaps, resource investment for additional data can be better directed in order to improve operational utility of the dataset.

Index Terms: Maritime, Surveillance, Anomaly Detection, Requirements, Fusion, Information, Knowledge.

1. INTRODUCTION

The quantification of requirements for data which may be used for multiple decision-making purposes is a challenging task, which is further complicated if there is a wide spectrum of available data. First, several concepts and definitions are required to describe and conceptualize the process from data to decision.
This process is described herein as a chain of derived utility, building on lower-level data in order to achieve higher-level awareness, and has previously been described using multiple models, such as the Data Fusion Information Group (DFIG) model [1] and the Data, Information, Knowledge, Wisdom (DIKW) model [2]. In this paper, the concepts of data, information, and knowledge from the latter model are adopted, acknowledging that the specific definition of each of these elements often varies throughout the literature [3]. It is noted here, however, that there is generally a difference between the concepts of data, information, and knowledge: information is something of value derived from data, and knowledge is a yet higher-level cognitive situational understanding.

In practice, although it can be argued that more data (and therefore more information) is desirable, this is not necessarily achievable if the cost of extracting useful information grows with the data volume faster than the actual information value of those data. Consideration of these diminishing returns is of crucial importance when resource costs are associated with data collection.

Due to the volume and associated costs of both data acquisition and data processing, it is highly desirable to know the right amount of data that will enable detection of anomalies with a desired confidence. In other words, what is the minimum volume of data that will give operators a minimum desired confidence in their surveillance and anomaly detection capabilities? This is the question that this paper strives to answer.

The paper is organized as follows. First, the nature of various types of anomalies and the data types and information necessary to detect an anomaly are discussed in Section 2. Then the detection algorithms are presented in Section 3, followed by the description of the quantification of the operational requirements in Section 4.
Finally, the methodology and results are given in Section 5, and a brief summary is provided in Section 6.

2. TYPES OF ANOMALIES AND DATA REQUIREMENTS

In this section, a review and description of types of maritime anomalies is presented with the objective of characterizing several fundamental properties of the anomalies. It is important to note here that the objective is to describe anomalies in the context of the ships² that are being described by the data, and not in the data itself, i.e., behavioural anomalies rather than data anomalies. Maritime anomalies are divided here into two categories: kinematic, meaning the anomaly is exhibited in the motion of a ship (such as unusual maneuvering); and static, meaning that the anomaly is exhibited by the properties of a ship (such as unusual crew or cargo). For the purposes of this paper,

² In the maritime environment, data are most commonly collected on surface ships, although data may also include submarines, aircraft, and any unmanned forms of the aforementioned vessel types.


some objective value. This is a straightforward task since the log-logistic distribution is well described in closed form.

[Figure 4: Log-logistic model for the inter-detection times of satellite-based AIS detections over two time frames]

5. RESULTS

The data requirements for the three types of kinematic anomalies are presented. Note here that the metrics have been applied to positional data, which are common to all anomaly types discussed. Also note that sensor false alarm rates, depending on the performance of the data fusion process, will also impact the data requirements. These results, however, are still widely applicable to maritime surveillance data from sources such as AIS.

5.1. Detection under models I, II, and III

In Figure 5, each panel presents data for a total of 5 million Monte Carlo (MC) simulation runs, consisting of 100,000 iterations for each of 50 distribution shapes chosen by varying the mean inter-detection time. The upper plot in each figure pair presents, on the y-axis, the distribution of the occurrences where the anomaly could not be detected (t_stopped) for an inter-detection time with the mean shown on the x-axis. The lower plot presents the same information, but with the x-axis being the 95th percentile of the inter-detection times in the log-logistic sensor model. The curves represent the frequency of occurrence where the anomaly would not be detectable in 99.7%, 95%, on average, or in 50% of the MC evaluations. The visible blip on the right end of the figures (most visible for the 50% line) is due to the relatively small quantity of runs that populated that region of the results.

[Figure 5: Plot of simulated bounds as a function of inter-detection time distribution. Panels: (a) Model I; (b) Model II, d_i = 20 nautical miles; (c) Model III]

5.2. Joint requirements

For each of the models presented in Figure 5, the maximum undetected time (worst case) is selected for each MC run and plotted in Figure 6. This represents the requirements to achieve detection of all three anomalies investigated. Given some operational requirements, such as being able to detect a ship which is loitering for longer than T_s hours, being able to detect a ship spending longer than T_z hours in an exclusion zone, or being able to detect a potential rendezvous which lasted longer than T_r hours, one can extract the minimum required inter-detection times. By choosing a desired level of detection, the corresponding acceptable mean and upper tail for a required sensor performance can be read directly from the x-axes of the upper and lower plots in the figure pair. For example, given an objective of the maximum T_s, T_z, or T_r being 1 hour, for 95% of inter-detection times, the mean time between updates must be less than 28 minutes, and no more than 5% of updates should exceed 80 minutes.

[Figure 6: Combined simulated bounds for models I-III]

6. SUMMARY

A framework for quantifying data requirements based on operational processes has been presented. This method enables identification of operational requirements for data acquisition in a robust, objective, and repeatable manner. Notably, it supports the optimization of the quantity of data to meet operator requirements under constraints, thereby preventing over-expenditure in procurement costs while maximizing operational benefits. The results, when compared with the data being collected, can also be used to identify any gaps in data collection, which can then direct the investment of additional data collection resources. In addition, this approach helps to identify the confidence level that the operators can achieve in the anomaly detections, and thus provides an indicator of the associated risks. Future work includes investigation of real-time anomaly detection requirements, and the formalization of the data and information requirements process so that it can be generally applied to tasks such as the procurement of data and information.

7. ACKNOWLEDGEMENTS

This work was conducted under the Defence Research and Development Canada (DRDC) Project in Maritime Information Warfare (MIW) 01da Next Generation Naval Command and Control Systems.

REFERENCES

[1] Blasch, E., Steinberg, A., Das, S., Llinas, J., Chong, C., Kessler, O., Waltz, E., White, F., Revisiting the JDL model for information exploitation, in Information Fusion (FUSION), International Conference on, IEEE.
[2] Rowley, J. E., The wisdom hierarchy: representations of the DIKW hierarchy, Journal of Information Science.
[3] Zins, C., Conceptual approaches for defining data, information, and knowledge, Journal of the American Society for Information Science and Technology, 58.4.
[4] Roy, J., Davenport, M., Categorization of maritime anomalies for notification and alerting purpose, NATO workshop on data fusion and anomaly detection for maritime situational awareness, La Spezia, Italy.
[5] Martineau, E., Roy, J., Maritime anomaly detection: Domain introduction and review of selected literature, DRDC-VALCARTIER-TM, Defence Research and Development Canada.
[6] Food and Agriculture Organization of the United Nations, Code of conduct for responsible fisheries.
[7] Bar-Shalom, Y., Willett, P. K., Tian, X., Tracking and data fusion: a handbook of algorithms.
[8] Pallotta, G., Vespe, M., Bryan, K., Vessel pattern knowledge discovery from AIS data: A framework for anomaly detection and route prediction, Entropy, vol. 15, no. 6.
[9] Page, E. S., Continuous inspection schemes, Biometrika, 41, no. 1/2.
[10] Ru, J., Jilkov, V. P., Li, X. R., Bashi, A., Detection of target maneuver onset, in IEEE Transactions on Aerospace and Electronic Systems, vol. 45, no. 2, April.
[11] ITU-R M, Technical characteristics for an automatic identification system using time division multiple access in the VHF maritime mobile band.
[12] Kapur, P. K., Pham, H., Gupta, A., Jha, P. C., Software reliability assessment with OR applications, London: Springer, 2011.
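
The Monte Carlo evaluation of Section 5.1 and the requirement extraction of the joint-requirements discussion can be illustrated with a minimal sketch. It is not the authors' implementation: Models I-III are not reproduced, and the sketch makes the simplifying assumption that the longest interval in which an anomaly could go unobserved equals a single inter-detection gap, so it should not be expected to reproduce the 28-minute and 80-minute figures quoted above. The shape parameter `beta = 4.0` and all function names are hypothetical choices made for illustration. The sketch does rely on the closed-form quantile function of the log-logistic distribution, which is what makes both the sampling and the requirement inversion straightforward.

```python
import math
import random


def loglogistic_quantile(p, alpha, beta):
    """Closed-form inverse CDF; the CDF is F(t) = 1 / (1 + (t / alpha) ** -beta)."""
    return alpha * (p / (1.0 - p)) ** (1.0 / beta)


def scale_for_mean(mean, beta):
    """Scale alpha that yields the requested mean (defined only for beta > 1)."""
    if beta <= 1.0:
        raise ValueError("log-logistic mean is undefined for beta <= 1")
    b = math.pi / beta
    return mean * math.sin(b) / b


def mean_for_objective(t_max, level, beta):
    """Largest mean gap such that a fraction `level` of gaps stays below t_max."""
    alpha = t_max / loglogistic_quantile(level, 1.0, beta)
    b = math.pi / beta
    return alpha * b / math.sin(b)


def simulated_bounds(mean, beta=4.0, runs=100_000, seed=0):
    """Monte Carlo percentiles of the inter-detection gap for one mean value.

    Assumption (not from the paper): the undetected time is one sampled gap.
    """
    rng = random.Random(seed)
    alpha = scale_for_mean(mean, beta)
    # Inverse-transform sampling: push uniform draws through the quantile function.
    gaps = sorted(loglogistic_quantile(rng.random(), alpha, beta) for _ in range(runs))

    def percentile(q):
        # Nearest-rank percentile on the sorted sample.
        return gaps[max(1, math.ceil(q * runs)) - 1]

    return {
        "p50": percentile(0.50),
        "mean": sum(gaps) / runs,
        "p95": percentile(0.95),
        "p99.7": percentile(0.997),
    }


if __name__ == "__main__":
    # Sweep the mean inter-detection time (minutes), as in a bounds-versus-mean plot.
    for mean_minutes in (10, 20, 30, 40, 50):
        bounds = simulated_bounds(mean_minutes, runs=20_000)
        print(mean_minutes, {k: round(v, 1) for k, v in bounds.items()})
    # Closed-form shortcut under the same simplification: 95% of gaps under 60 minutes.
    print(round(mean_for_objective(60.0, 0.95, beta=4.0), 1))
```

Sampling through the quantile function keeps the sketch dependency-free. With a fitted shape parameter and a model of how each anomaly type maps detection gaps to undetected time, the same loop structure would yield percentile curves of the kind described for Figures 5 and 6.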