ATHABASCA UNIVERSITY AN INTELLIGENT AGENT-BASED APPROACH TO NETWORK MANAGEMENT. Munir Ahmad. MASTER OF SCIENCE in INFORMATION SYSTEMS


ATHABASCA UNIVERSITY
AN INTELLIGENT AGENT-BASED APPROACH TO NETWORK MANAGEMENT
BY Munir Ahmad
An essay submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in INFORMATION SYSTEMS
Athabasca, Alberta
April 2015
Munir Ahmad, 2015

DEDICATION
This essay is dedicated to my wife, Yasamin, my personal cheerleader, who has been a constant source of support and encouragement during the challenges of graduate school, work and life. This work is also dedicated to my parents, for their unconditional love, support and encouragement, and for teaching me to work hard for the things that I aspire to achieve.

ABSTRACT
Multi-agent systems (MAS) are becoming more dominant in the world of information systems. Our day-to-day life is increasingly influenced by the rapid growth of complex, interconnected network-based systems. Effective event and fault management in such systems increases network uptime and improves customer experience. The ever-increasing expectation for high-availability systems, together with competitive pressure, motivates service providers to look for new ways to manage their infrastructure. It cannot be disputed that the industry needs to adopt some form of automated fault management in order to continue providing highly available, self-healing networks. Agent-based technology is a powerful technology for the deployment of distributed systems in dynamic environments. An agent-based approach to network management is particularly suitable for complex, dynamic and interconnected networks. This essay reviews current research on multi-agent systems and their effectiveness in network and server incident management, focusing on computer networks operated by Internet Service Providers (ISPs) to deliver services such as internet, video and voice. Based on the review, the problems of the existing methods and approaches are identified. The essay concludes by making recommendations to solve the problems identified for such systems.

ACKNOWLEDGMENTS
First and foremost, I would like to express my sincere gratitude to my advisor, Dr. Fuhua Lin, for the useful comments, remarks and engagement throughout the process of learning about multi-agent systems and writing this master's essay. I would also like to take this opportunity to thank the faculty and staff at Athabasca University's School of Computing and Information Systems for all their support during my graduate studies. Last but not least, I would like to thank my family, friends and colleagues for their support, both by keeping me harmonious and by encouraging me with their best wishes.

TABLE OF CONTENTS
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
Chapter 2 Review of Related Literature
Chapter 3 Methodology
Chapter 4 Design
Chapter 5 Conclusions and Future Research
References

LIST OF TABLES
1. Scenarios
2. Initial Goals
3. Detailed Goals
4. Functionality Descriptors
5. Percepts
6. Actions
7. Event Functionality
8. Resource Functionality
9. Ticketing System Functionality
10. Administrative Functionality

LIST OF FIGURES
1. High Level Fault Detection, Treatment, and Resolution Workflow
2. High Level Event Detection, Treatment, and Resolution Process
3. System Overview Diagram
4. Data Coupling Diagram
5. Agent Acquaintance Diagram
6. Interaction diagram when a new incident arrives
7. Interaction diagram when a new incident arrives requiring escalation
8. Main Interaction Protocol
9. Final System Overview Diagram

CHAPTER 1 INTRODUCTION
The focus of this essay is research into the use of multi-agent systems (MAS) for network incident management. Because little current research addresses the use of MAS for network incident management, this essay draws on the use of MAS for automated power system restoration. The main contribution of this essay is the design of a multi-agent system for network incident management. The suggested method is aimed at reducing manual tasks by automating incident correction and, in turn, improving network stability.

A multi-agent system is composed of a set of agents in an environment collaborating with each other to solve a problem. An agent is an autonomous entity that is at least partially independent, such as a process, a robot or a human [1]. In recent years MAS have been an active research field. Multi-agent based systems offer many advantages for managing computer networks and data centres, including real-time incident isolation and correction that minimizes mean time to recovery, as well as improved productivity through the automation of manual tasks [2]. Current network management systems perform basic tasks in response to an incident and rely greatly on human intervention to correct faults [3]. In this decade, significant improvements in semantic technology have been used to improve the decision-making process over time [4]. MAS combined with semantic technologies can reduce the cognitive load of system decision making by promoting collaborative problem solving among systems called agents. Agents are capable of reasoning and interacting with their environment, making decisions based on their beliefs, desires and intentions [1]. The agents interact with each other, engaging in cooperative decision making. This is different from a typical computer program, which has a very structured and rigid process for interaction. This difference allows MAS to be applied to simulations that closely represent real-life scenarios.

Network monitoring is essential for a view of network health. It also allows the network operations center (NOC) to observe changes in the behavior of network elements. Network management has become increasingly distributed due to complexity, with each area requiring a highly specialized team [5]. The distribution of tasks among specialized teams creates an opportunity for intelligent agents to take on delegated tasks, reducing the repeat tasks performed by subject matter experts (SMEs). The monitoring process can be defined as the process of collecting, interpreting, and presenting information concerning objects or software processes [6]. Some of the common areas of network monitoring include configuration, fault, accounting, and security management. As part of monitoring, system behavior is observed and information is collected and then used to make decisions. There are many systems and processes (e.g. servers, network devices, software processes) that require monitoring. This essay focuses only on network monitoring.

The status of a monitored object may change, and each change is considered an event. An event can trigger other events: for example, a threshold being reached or a failure on a secondary node might trigger a critical notification to raise awareness of the issue, or simply produce a notification report to senior management advising of the potential for outages or service degradation. The model shown in Figure 1 shows the main activities involved when a network device is experiencing a problem such as the failure of one or more components.

1. Incident detection: the monitoring system receives a notification.
2. Incident treatment: validation, correlation, filtration, etc.
3. Presentation: information is presented to users in an appropriate form.
4. Resolution: the necessary action is performed to resolve the issue.

Figure 1: High Level Fault Detection, Treatment, and Resolution Workflow.

As part of this essay, an intelligent agent-based network management architecture is proposed and a proof of concept is designed. The objective of this proof of concept is to show the possibilities of intelligent agent-based network management while relying on current fault management systems for information gathering. Therefore this essay does not focus on how faults are received by agents; rather, faults are treated as a data source. Depending on the type of event triggered, no action may be required; such an event is referred to as a notification event. Other events require one or more actions in order to recover from the state that triggered the event; throughout this essay such events are referred to as incidents or faults interchangeably. The actions vary greatly depending on the incident, and the one detailed in Figure 1 requires remote validation and intervention in order to be corrected. On-site intervention, where a technician is physically present, can be time consuming, and reducing it is critical to economic efficiency. Remote troubleshooting is therefore introduced, which is an ideal role for an intelligent agent. Deploying a device-specific agent that collaborates with other agents to perform its task makes the use of agent technology for such tasks scalable, since new agents can be created as new devices are brought in to be managed [7].

The process of network management is highly dynamic, being affected by both internal and external factors. The impacts range from unexpectedly high traffic volume, to network outages caused by the loss of major components, to external factors such as weather conditions affecting transmission towers. The dynamic nature of computer networks and the way networks are managed make MAS an ideal candidate. If a major network outage occurs and several devices need to be recovered and restored, it may take network technicians and subject matter experts hours or days to fully recover using manual processes. With intelligent agent technology, on the other hand, the recovery time may be greatly reduced by having agents perform the recovery activity.
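The distinction between notification events and incidents, together with the four workflow stages listed for Figure 1, can be illustrated with a minimal Python sketch. It is not part of the proposed design: the `NetworkEvent` type, the `action_required` flag and the step descriptions are hypothetical names introduced here only to show the idea that a fault arrives as data and is then classified before any resolution is attempted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EventKind(Enum):
    NOTIFICATION = "notification"  # informational, no action required
    INCIDENT = "incident"          # requires one or more corrective actions


@dataclass(frozen=True)
class NetworkEvent:
    device: str
    message: str
    # Hypothetical flag; in practice the fault management system
    # (the data source) would supply whatever drives this decision.
    action_required: bool


def classify(event: NetworkEvent) -> EventKind:
    """Treat an event as an incident only if it needs corrective action."""
    return EventKind.INCIDENT if event.action_required else EventKind.NOTIFICATION


def handle(event: NetworkEvent) -> List[str]:
    """Walk one event through detection, treatment, presentation, resolution."""
    kind = classify(event)
    steps = [
        f"detection: received '{event.message}' from {event.device}",
        f"treatment: classified as {kind.value}",
        f"presentation: {kind.value} shown to NOC users",
    ]
    if kind is EventKind.INCIDENT:
        # Only incidents reach the resolution stage.
        steps.append("resolution: remote troubleshooting handed to a device agent")
    return steps
```

In this sketch a notification event stops after presentation, while an incident continues to a resolution step that stands in for the remote troubleshooting role described above.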

The agent roles that are involved in network incident management are shown in Figure 2.

Figure 2: High Level Event Detection, Treatment, and Resolution Process

The process of event management shown in Figure 2 may involve all or some of the resources shown, such as the network operations technician, tier 2 technician, field technician and subject matter expert, and other resources such as vendor support may be involved to aid the resolution of an issue. Each of the roles described provides skills necessary for the above tasks. The decisions taken by each role, and its level of knowledge about similar past experiences, may affect the resolution time [8]. Uncertainty about whether the right resources will be available during a network incident also affects the time to resolution; having agent resources of all skill levels available at all times addresses this concern.

CHAPTER II LITERATURE REVIEW
A complete Fault Management System has functions to detect, isolate, and correct incidents in a telecommunications network; however, the incident correction functionality is not easily achieved with existing systems [9]. One of the major obstacles to achieving automatic incident correction is the lack of a common platform for interacting with network devices. Other challenges include the complexity of networks and the variety of devices requiring management [9]. The ever-increasing expectation for high-availability systems, together with competitive pressure, motivates service providers to look for new ways to manage their infrastructure.

MAS have been used successfully for a variety of applications ranging from computer games to transportation and logistics. A significant amount of research is available on the use of MAS for managing power distribution systems, where agent technologies are used for fault isolation as well as automated restoration [10]. Several articles related to MAS were researched and contribute to this essay. It is important to note that the review of related literature found little recent academic research in the area of network incident management using agent technologies. However, there are some articles advocating self-healing networks using agent technologies by modeling network devices as the agents [11]. The aim of this essay is to extend this work and show how MAS can be used for network event management in order to increase network availability by automating fault correction through remote agents. This approach allows service providers to take advantage of agent technologies without having to fully upgrade their network devices. In order to use MAS effectively for network event management, it is important to first explore MAS architecture.

As introduced in Chapter 1, an agent is an autonomous entity that is at least partially independent, such as a process, a robot or a human. This paper does not attempt to settle on one definition or another, but treats an agent simply as a software entity that responds to changes in an environment and is responsible for representing user interests. Agents possess an internal state and make decisions based on their perceptions, as opposed to being told what to do by someone else [12]. The fundamental properties of an intelligent agent and its main components are defined during the agent's design phase. In this essay an agent is autonomous, can communicate, and can cooperate with and delegate tasks to other agents.

Before looking into MAS it is appropriate to briefly explore agents and their properties. There are four classes of agents, and it is important to differentiate between them and select the most appropriate class during the design phase, based on agent functions and system requirements [13-17]:
1. Logic based agents
2. Reactive agents
3. Belief Desire Intention (BDI)
4. Layered Architecture

In this essay agents with proactive components, such as the subject matter expert (SME) agent, are also referred to as intelligent agents. SME agents can be developed to perform complex decision making and are dispatched after reactive agents are unable to correct an incident. It cannot be disputed that some form of agent coordination, such as the Contract Net Protocol (CNP) [18] or voting [19], will be necessary in order to distribute tasks to the appropriate agents efficiently. However, it is important to note that the accuracy of the resolution should be a priority, as it is vital to the health of the network.

Multi-agent Systems
Multi-agent systems enable the resolution of complex problems by dividing them into sub-problems [20]. With each agent specializing in a specific task, complex problems can be solved efficiently, and the distributed model works well. Agents in a system may operate on their own and/or pursue common interests. The implementation of these features is made possible by the MAS communication infrastructure, which allows communication and cooperation among agents. Communication is one of the key aspects of a multi-agent system and is geared to human language [14]. It allows agents to communicate and cooperate among themselves using communication protocols and an evolving language [21]. Agents may send messages to a specific agent or broadcast messages to the agent community; it is also possible to narrowcast a message to a specific group of agents [22]. Results of the study "A Cooperative Multiagent Framework for Self-Healing Mechanisms in Distribution Systems" show that two-way communication between agents provides a good solution for fault isolation and an effective restoration plan [21]. Two-way communication will play an important role in validating the current device state within the network, because as device states change an alternative plan may be necessary.

Computer networks are increasingly complex and dynamic, with a variety of devices interconnected to provide services across platforms, such as the delivery of traditional television service to mobile devices. Internet Service Providers (ISPs) are providing a range of new services in response to customer demand [24]. Over the last decade the growth in mobile technology and the delivery of a variety of services over mobile networks has been an enormous change for traditional ISPs. Delivering new and existing services over a variety of platforms requires rigorous processes for managing such services. As part of effective network management it is important to become aware of problems before or as soon as they occur. It is equally important to isolate and resolve such problems without affecting services and customer experience. The impact on the end user is greatly reduced by having redundancy in parts of the system; however, not all systems are redundant, for various reasons including cost.

Using traditional operational support systems for fault management is not optimal because resolution depends on manual human intervention, where automation would greatly reduce the resolution time and decrease manual repeat tasks. In order to resolve critical issues more quickly, subject matter experts are engaged promptly after minimal effort has been made to correct the issue, resulting in a high workload for subject matter experts. The number of repeat tasks for subject matter experts is reduced by training tier 1 and tier 2 technicians to assist with some of the common tasks. A solution that automates the tasks performed by tier 1 and tier 2 technicians would therefore greatly reduce the manual human intervention needed to correct such issues. However, doing so in an environment as complex as the computer networks operated by ISPs today would require a framework that allows SMEs to develop procedures to be performed in the event of issues. MAS and artificial intelligence techniques have been proposed for automated wireless intrusion detection to reduce human intermediation [25].
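The agent coordination discussed earlier in this chapter can be made concrete with a small Python sketch of one Contract Net style round: a manager announces a task, capable agents bid, and the task is awarded to the lowest bid. This is a simplified illustration under stated assumptions, not the protocol as specified in [18]. The `TechnicianAgent` class, the use of the number of open tasks as the bid cost, and the device type names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class TechnicianAgent:
    name: str
    device_types: Set[str]   # device types this agent can work on
    open_tasks: int = 0      # hypothetical workload measure used as the bid

    def bid(self, device_type: str) -> Optional[int]:
        """Return a cost (lower is better), or None to decline the task."""
        if device_type not in self.device_types:
            return None
        return self.open_tasks


def contract_net(device_type: str,
                 agents: List[TechnicianAgent]) -> Optional[TechnicianAgent]:
    """Run one announce/bid/award round and return the winning agent."""
    # Announce the task to every agent and collect the proposals received.
    proposals = []
    for agent in agents:
        cost = agent.bid(device_type)
        if cost is not None:
            proposals.append((cost, agent))
    if not proposals:
        return None  # nobody can take the task; a caller could escalate
    # Award the task to the cheapest proposal (first one wins a tie).
    _, winner = min(proposals, key=lambda proposal: proposal[0])
    winner.open_tasks += 1
    return winner
```

A round that returns `None` corresponds to the case where no agent is able to handle the task, which in the setting described above would be a natural point to involve an SME.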
Automation has been adopted by many industries, including automated fault diagnosis and correction on servers, where a number of Operational Support Systems (OSS) exist that specifically target automated fault correction for servers [26]. Although this automation is achieved through a client and server application, server management has greatly benefited from such tools, which allow server administrators to monitor servers and perform automated actions based on predefined rules, reducing manual and repeat tasks and, as a result, increasing server stability [5]. However, traditional ISPs with a variety of network devices are unable to take advantage of such existing systems, because the architecture of these fault management systems requires an agent application to be installed locally on the server [26]. Today's networks are made up of many different devices with few commonalities and varying proprietary operating systems. In some cases network devices are treated as appliances on which ISPs are not permitted to install third-party applications for purposes such as monitoring. The success of server management, on the other hand, comes mainly from the commonalities of the base operating systems: there are only a handful of commonly used operating systems, which makes it practical for vendors of fault management systems to develop client applications. Agent clients installed locally on servers monitor various aspects of the servers, such as processes, physical components and performance-related activities, and also allow server administrators to define automated actions to be performed [26].

Due to the lack of current research on the use of MAS for network incident management, no case studies were found to indicate that agent cooperation in resolving network issues would be better than current practice. However, case studies on the use of MAS for automated power system restoration show successful power restoration with minimal impact on the end user [26-28]. The authors of "Conceptual Design of A Multi-Agent System for Interconnected Power Systems Restoration" used simulation to demonstrate the use of agent technology to aid power restoration. The authors investigated three cases of power system failure. The first failure case had minimal impact.
In the second scenario one of the generators failed and load balancing was required. The third case involved both buses having faults concurrently. In all three cases the multi-agent systems approach was effective in restoring service [28]. Gulnara Zhabelova and Valeriy Vyatkin (2012) outline, in an article published by IEEE, how multi-agent systems can be used to achieve self-healing power grids using collaborative fault isolation and restoration [27]. The proposed approach automatically senses a fault in the power system and, after correlating and isolating the problem, switches customers to a non-faulted section of the system [27]. Other research on intelligent energy systems indicates the importance of monitoring performance in addition to monitoring faults, as a way to find the optimal network configuration during a failure in support of self-healing functionality [29]. IBM researchers have demonstrated, using practical scenarios, how agent technology can help manage power consumption in a medium-sized server cluster [30].

Effective network management is essential to the success of ISPs, as networks are the backbone of the services they provide [5]. As this essay has found, multi-agent systems have been used effectively for managing faults and automating fault restoration in power systems. There are a number of reasons why MAS are well suited to managing distributed systems, such as agent communication and negotiation when solving problems collaboratively, and other notable features including agent adaptability in dynamic environments [23]. Today's computer networks are dynamic and are becoming even more dynamic as additional services are offered using the same infrastructure [31]. Reports indicate that the number of Internet-connected devices is on the rise and is estimated to reach 75 billion connected devices by the end of 2020, which translates to nine connected devices per person [31].

The growth in the Internet of Things makes the case for automated fault correction as a step towards self-healing networks. As already mentioned, the need to automate fault correction is not unique to computer networks. Automated network fault correction has many similarities with modern smart power grids [32]. ISPs can build on success in other areas, such as the power systems work explored in this essay, by using multi-agent systems for monitoring and automatic resolution of network-related issues. The research results suggest that the multi-agent systems approach, with agent cooperation and autonomy, has a major advantage over conventional systems [32-33]. Existing methods of fault management rely extensively on human intervention, and the expected rapid growth in the Internet of Things makes multi-agent systems a sustainable long-term solution for automating fault management in computer networks [33]. Finally, MAS have been shown to work well in dynamic environments, as is the case with computer networks.

CHAPTER III METHODOLOGY
This essay presents the design of a multi-agent system aimed at network management, specifically automated incident management. High-level diagrams are used to describe the process for identifying and correcting a network fault. The overall system should be designed to be open, to allow ISPs to add new agents quickly as new devices are introduced into the network.

1. Device: one or more devices of a given type may exist, and each device is responsible for a given task. Devices within an Internet Service Provider (ISP) may range from network devices to billing systems. ISPs may configure their devices to trigger a notification event when part or all of a device is at fault. An event or fault occurs as a result of factors such as a hardware or software problem, capacity, intrusion, etc.
2. Fault Management System: receives a notification as an indication of an event such as a fault. The notification is delivered in one of several forms, through an Element Management System (EMS) or directly from the end device. If no EMS is present and the device is incapable of sending a notification, the device may be polled periodically for logs or other information by the fault management system or through custom middleware. Event validation, correlation, and filtration may be performed in order to reduce some of the work for NOC Administrators.
3. NOC Administrator: usually performs tier one tasks for events received, by creating a trouble ticket and dispatching it to the appropriate queue, usually based on technology or service. In the ideal scenario event severity is used to determine whether immediate attention is required; if so, the appropriate technicians are contacted by phone to investigate the issue.
4. Technician: performs the tier two role by investigating issues, and may attempt the steps necessary to resolve the problem. Technicians are classified by their roles and responsibilities, such as network technician or cable technician, and may perform tasks remotely.
5. Subject Matter Expert: performs the tier three role in troubleshooting and resolution. SMEs specialize in a specific area or technology and are highly trained on the systems they support.
6. Incident Manager: gathers information related to a given incident to study its impact, and may send a notification to the service advisory distribution describing the issue and its impact on customers. The notification may contain the number of customers impacted, and if business customers are impacted, customer names may be listed. The Incident Manager works closely with the NOC and field technicians to provide subsequent updates.
7. Vendor Technical Support: unresolved issues may be escalated to vendor support for investigation and resolution. Vendor support may refer the issue to their design teams in order to resolve it. Issues escalated to vendors are usually complex, and escalation is done after in-house SMEs are unable to resolve the issue.
8. Business Intelligence Analyst: responsible for finding repeat problems based on problem history by analyzing network event data. The feedback is usually sent to subject matter experts and/or engineering to find the root cause and/or find alternative products, depending on the outcome of the root cause analysis. This process may in turn trigger other processes that are not of concern for the purpose of this essay.

During the design stage the roles detailed in the above steps will be mapped into one or more agents responsible for performing the role of the stakeholders involved. In order to bring network events into the multi-agent system it will be necessary to create one or more fault management system agents responsible for retrieving event information and communicating it to the appropriate device agents for further processing. Event correlation logic may be in place to discard notification events that do not require any action. Since the aim of this essay is to outline how an agent can be designed for an existing device and be responsible for its management, it is necessary to create an agent for each device type in order to keep these technician agents as simple as possible, whereas in the real world a technician may be trained on more than one device. Technician agents will communicate with the network devices of their type and perform actions defined by a human subject matter expert when an event is triggered; device agents can therefore be reactive agents. There will be one or more subject matter expert (SME) agents responsible for each device type. Similar to their technician agent counterparts, SME agents can have a reactive component as well as a BDI component working towards the long-term goal of network stability. The balance between an SME agent's reactive and BDI components depends on the network device it is responsible for managing.

The Incident Manager (IM) agents can be mapped into logic based agents. For example, their decision to send a notification on service outages or customer impacts caused by a fault in the network can be based on the percentage or number of customers impacted. The role of the Vendor Technical Support agents would be limited, given that today's method of communicating with vendor support is to submit trouble tickets, which has to remain a manual process. However, it is important to note that if vendor ticketing systems are robust and accept system-generated tickets, then creating an agent to automate the ticketing process may be feasible. The role of the Business Intelligence Analyst agent is to dynamically analyze and report repeat patterns of failure; the information collected can be used by engineers to improve the system design in order to prevent repeat failures.

The Prometheus methodology is used to design the system. The Prometheus methodology is a detailed procedure for the specification, design and implementation of intelligent agent systems [34]. It was developed by the Royal Melbourne Institute of Technology (RMIT) in close collaboration with Agent Oriented Software Pty. Ltd., the company that sells JACK Intelligent Agents, a commercial agent platform. The Prometheus development process consists of three major phases:

1. The system specification phase involves identifying the goals and sub-goals of the system. It describes how the agents will interact with the environment. Agent interactions with the environment are called actions and are to be distinguished from incidents (events). The goals describe the task the system is to solve, and the functionalities describe the individual functions by which this occurs.

2. In the architectural design phase numerous diagrams are created to specify the structure of the system. Agents are identified, as well as the events they have to react to and the actions they need to perform to affect the environment.
3. The detailed design phase looks at the internal workings of each agent in terms of capabilities, events, data structures, and plans [34]. Agents are plan based and rely on user-defined plans, which is a major advantage for network management. Agent plans are defined based on the devices they manage. In this phase all events are defined, including external events, events between agents and events within agents.

The result of this essay is the design of a multi-agent based system aimed at automated incident isolation and correction.
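The idea of a reactive, plan-based device agent described in this chapter can be sketched in Python as follows. This is an illustrative assumption rather than a Prometheus or JACK artifact: the `DeviceAgent` class, the event type strings and the `"escalate_to_sme"` outcome are hypothetical names, and each plan is simply an ordered list of SME-defined actions that report success or failure.

```python
from typing import Callable, Dict, List

# A plan is an ordered list of actions; each action returns True on success.
Plan = List[Callable[[], bool]]


class DeviceAgent:
    """Reactive agent for one device type, driven by user-defined plans."""

    def __init__(self, device_type: str, plans: Dict[str, Plan]) -> None:
        self.device_type = device_type
        self.plans = plans  # event type -> plan defined by a human SME

    def on_event(self, event_type: str) -> str:
        """React to an event by running its plan, escalating on any failure."""
        plan = self.plans.get(event_type)
        if plan is None:
            # No predefined procedure exists for this event type.
            return "escalate_to_sme"
        for action in plan:
            if not action():
                # A step failed, so the reactive agent hands over to an SME agent.
                return "escalate_to_sme"
        return "resolved"
```

The agent holds no long-term goals of its own, which is what keeps it purely reactive; any reasoning about network stability is left to the SME agents that receive the escalation.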

CHAPTER IV DESIGN
The Prometheus methodology is followed in this chapter for system specification, high-level architectural design and detailed design. The system specification includes identification of agents, the initial goals and the functionality descriptors. In the high-level architectural design stage the overall system structure is described using a system overview diagram, whereas the detailed design stage is concerned with the internals of each agent [36]. Prometheus is an iterative methodology aimed at the design and development of intelligent agents [35]. More information on the Prometheus methodology can be found in Chapter III and in the reference section.

As previously mentioned in Chapter III, computer networks have become complex with the recent addition of new services, from video on demand to next-generation home security [37]. Due to the complexity of networks and the variety of problems they may experience, the focus of this chapter is limited to one specific incident. The incident outlined in Figure 1 is that of a primary network switch experiencing a problem with one of its physical links, which carries three virtual interfaces connected to different routers. The network events generated as a result of this problem are listed below:
Primary switch sends a notification indicating failure in one of its links.
Secondary switch sends a notification indicating loss of its peer.
The three routers each send a notification indicating loss of primary link.
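The five notifications listed above all stem from a single fault, which is what event correlation is meant to recover. The following Python sketch shows one simple way to group them, assuming a hand-written dependency map. The `DEPENDS_ON` table and the device names are hypothetical and stand in for whatever topology information a fault management system would actually hold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    device: str
    symptom: str


# Hypothetical topology: the upstream device each device depends on.
DEPENDS_ON: Dict[str, str] = {
    "secondary-switch": "primary-switch",
    "router-1": "primary-switch",
    "router-2": "primary-switch",
    "router-3": "primary-switch",
}


def correlate(events: List[Event]) -> Dict[str, List[Event]]:
    """Group events under the device suspected to be the root cause."""
    groups: Dict[str, List[Event]] = {}
    for event in events:
        # A device with no known upstream dependency is its own root.
        root = DEPENDS_ON.get(event.device, event.device)
        groups.setdefault(root, []).append(event)
    return groups


events = [
    Event("primary-switch", "link failure"),
    Event("secondary-switch", "loss of peer"),
    Event("router-1", "loss of primary link"),
    Event("router-2", "loss of primary link"),
    Event("router-3", "loss of primary link"),
]
groups = correlate(events)
```

With this map all five events fall into one group keyed by the primary switch, so a single device agent would be notified instead of five.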

26 The system actors are NOC Administrator, Technician, Subject Matter Expert (SME), Incident Manager, Vendor Technical Support, and Business Intelligence Analyst. In this section we can identify the tasks associated with each of the actors: Table 1: Scenarios NOC Administrator tasks Collects information about the device Finds appropriate dispatch procedure Creates an incident ticket and dispatches to appropriate queue Pages prime technician if the event occurs outside working hours of the team and if the event requires immediate attention based on information gathered Technician tasks Accesses ticketing system for new work assigned Attempts to identify problems with the device or service Attempts to resolve the problem If unable to identify the cause or unsuccessful in resolving the issue contacts device subject matter expert Updates ticking system Subject Matter Expert tasks Accesses ticketing system for new work assigned Reviews procedures followed by the technician agent If unable to identify the cause or unsuccessful in resolving the issue contacts device vendor technical support Collaborates with vendor support team to provide access to the device or send diagnostics data Incident Manager tasks Accesses ticketing system periodically for new incidents Accesses ticketing system periodically for changes to existing open tickets Asses situation by identifying customer and service impact Send notification to stakeholders Escalate the incidents that do not get resolved quickly Vendor Technical Support tasks Review ticketing system for customer tickets Attempt to assess the issue using information provided 26

- Communicates with the customer to perform a procedure and/or ask for more details
- Communicates internally with other teams

Business Intelligence Analyst tasks
- Analyzes the ticketing system and fault history
- Isolates devices with repeat patterns of problems
- Communicates the findings to stakeholders

The tasks outlined in Table 1 for each actor can be presented as use cases. The use case considered here is:

Use Case 1: A network device loses one of its communication links
- The fault management system receives event notifications via SNMP traps from multiple devices.
- The fault management system correlates events based on predefined logic to isolate the issue to the primary switch.
- A NOC agent is created that responds to the event by gathering device information from different sources, such as the destination queue.
- The NOC agent's goal is to identify a device technician agent.
- A device agent is notified by the NOC agent to look into the issue.
- If event correlation was not done by the fault management system, the NOC agent notifies a device agent for every device from which a notification was received.

- The device agent receiving the notification attempts to identify the root cause of the issue.
- After verifying the interface card, the device agent is able to conclude that the issue is not the result of a component failure.
- With the root cause of the issue unknown, the device agent notifies the Incident Manager Agent to issue a threat advisory.
- The Incident Manager Agent is created and continues to issue periodic updates until the issue is resolved, at which point a final update is sent.

Based on the scenarios identified we can determine the system goals. The initial goals of the system are based only on Use Case 1.

Table 2: Initial Goals
- Receive a Notification
- Correlate Events
- Assign a Technician
- Troubleshoot an Issue
- Assign SME
- Update the Ticketing System
- Send Service Advisory Notification
- Escalate Unresolved Issue
- Identify Repeat Problem

From the initial goals we can identify additional sub-goals.

Table 3: Detailed Goals

Receive a Notification
- Keep track of active incidents
- Update list of active incidents
- Archive resolved incidents

Correlate Events
- List active events
- Look for relationships between events
- Group events based on relationship
- Present a few events that are really important

Assign a Technician
- List technicians
- Look for an appropriate technician
- Provide event and ticket details
- Notify for device status changes
- Keep track of active technicians

Troubleshoot an Incident
- List active incidents
- Search for resolution procedures
- Select resolution procedure
- Attempt resolution procedure
- Confirm resolution
- Update ticketing system

Assign SME
- List SMEs
- Look for an appropriate SME
- Provide event and ticket details
- Review attempted steps performed by technician
- Notify for device status changes
- Keep track of active SMEs

Update the Ticketing System
- Search using ticket number
- View event details
- Update ticket with steps performed for issue resolution
- Change ticket status

Send Service Advisory Notification
- Search for active tickets
- Filter on ticket severity requiring service advisory notification
- Associate services/users impact caused by event
- Distribute findings

Escalate Unresolved Incident

- Search for active tickets
- View service-level agreements (SLAs)
- Escalate issue to meet SLAs
- Update ticketing system

Identify Repeat Problem
- Search for archived events
- Group events by device and root cause
- Identify repeat problems
- Calculate resolution costs
- Send periodic reports

The goals identified in Table 3 can be grouped by function, removing repeated functions. The groupings listed in Table 4 describe the system functionality.

Table 4: Functionality Descriptors

Event Function
- Accept new events
- Keep track of active events
- Update event details based on new events and actions
- Look for relationships between events
- Group events based on relationship
- Present a few events that are really important

Resource Function
- List resources
- Look for an appropriate resource
- Provide event and ticket details
- Notify for device status changes
- Keep track of active resources
- Search for resolution procedures
- Select resolution procedure
- Attempt resolution procedure
- Confirm resolution

Ticketing System Function

- Search using ticket number
- Search using device name
- View event details
- Update ticket notes
- Change ticket status

Administrative Function
- Search for active tickets
- View service-level agreements (SLAs)
- Escalate issue to meet SLAs
- Search for archived events
- Filter on ticket severity requiring service advisory notification
- Associate services/users impact caused by incident
- Group events by device and root cause
- Identify repeat problems
- Calculate resolution costs
- Send periodic reports

The percepts that intelligent agents receive from the external environment play an important role in network management. The following percepts are identified:

Table 5: Percepts

Receive new events
  The system receives new events to be added to the list of active events and subsequently dispatched to a resource for troubleshooting.

Receive updates for existing events
  The system receives updates for active events from network devices. The updates are forwarded to the assigned agents.

Ticket closures
  The system receives notification of a ticket closure by human users after they perform activities to correct issues, such as replacing a power supply.

Problem escalations

  The system receives notification if issues are not resolved within a defined period of time based on severity.

Power outages
  The system receives notification when a central office switches its power source to battery/generator.

Service outages
  The system receives notifications when there are service outages.

In this section of the design we can define actions. Actions are the opposite of percepts; they include communication from intelligent agents to the external environment. The actions of the system are defined in Table 6.

Table 6: Actions

Resolve device issues
  The overall purpose of the system is to resolve network issues.

Receive event notifications
  A notification is received by the system indicating a problem with a device.

Request device related information
  Device information, such as credentials, is accessed.

Troubleshoot issues
  Establish a connection with the device and identify issues.

Run commands
  Agents run specified commands in an attempt to resolve issues.
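Three of the percepts in Table 5 (new events, updates to existing events, and ticket closures) describe how the list of active events changes over time. The fragment below is a minimal sketch of that bookkeeping. The class name, method names and record layout are hypothetical, and the choice of which agent receives an event is simply passed in by the caller, since the essay does not fix a selection rule at this point.

```python
class EventFunction:
    """Tracks active events and reacts to three of the percepts in Table 5."""

    def __init__(self):
        self.active = {}    # event_id -> {"details": [...], "assigned_to": str}
        self.archived = {}  # event_id -> same record, kept after closure
        self.outbox = []    # (agent, event_id, text) messages sent to agents

    def on_new_event(self, event_id, details, agent):
        # New events join the active list and are dispatched to a resource.
        self.active[event_id] = {"details": [details], "assigned_to": agent}
        self.outbox.append((agent, event_id, details))

    def on_event_update(self, event_id, details):
        # Updates for active events are forwarded to the assigned agent.
        record = self.active.get(event_id)
        if record is None:
            return False  # update for an unknown or already closed event
        record["details"].append(details)
        self.outbox.append((record["assigned_to"], event_id, details))
        return True

    def on_ticket_closure(self, event_id):
        # A closure moves the event from the active list to the archive.
        record = self.active.pop(event_id, None)
        if record is None:
            return False
        self.archived[event_id] = record
        return True


tracker = EventFunction()
tracker.on_new_event("E1", "link failure on primary switch", "technician-agent")
tracker.on_event_update("E1", "link restored")
tracker.on_ticket_closure("E1")
```

After this sequence the active list is empty, the event sits in the archive, and two messages have been queued for the assigned agent.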

The initial functionality descriptors provide an abstract view of the system by combining percepts, actions and goals. The system performs the functionalities listed in Tables 7 to 10 together with others such as agents, humans, or other systems.

Table 7: Event Functionality

Functionality: Event
Description: This functionality manages the list of active events. Resources are dispatched to troubleshoot and resolve incidents reported using the event. Events reported to resources are tracked using the ticketing system.
Goals: Accept new events, Keep track of active events, Update incident details based on new events and actions, Look for relationships between events, Group events based on relationship, Present a few events that are really important
Actions: Assign a resource
Triggers: New event, Update to an existing active event
Information Used: Event description, severity, device name, type
Information Produced: Assign a resource

Table 8: Resource Functionality

Functionality: Resource
Description: This functionality manages various resources such as technicians and subject matter experts (SMEs). Resources are assigned to troubleshoot and resolve issues. A technician agent can assign an incident to an SME agent when unsuccessful in resolving the issue.

Goals: List resources, Look for an appropriate resource, Provide event and ticket details, Notify for device status changes, Keep track of active resources, Search for resolution procedures, Select resolution procedure, Attempt resolution procedure, Confirm resolution
Actions: Troubleshoot, resolve issues
Triggers: Assign SME resource
Information Used: Event details, Device type
Information Produced: Resolve issue, escalate issue

Table 9: Ticketing System Functionality

Functionality: Ticketing System
Description: This functionality tracks event-related activities such as resources assigned and actions performed. The system identifies devices by name and provides device-specific details. It also allows a search of all tickets created for a specific device. If issues are not resolved within a specified time period, it sends a notification.
Goals: Search using ticket number, View event details, Search using device name
Actions: Update ticket notes, Change ticket status
Triggers: Escalate ticket
Information Used: Event details
Information Produced: Event maintenance history
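Table 9 states that a notification is sent when an issue is not resolved within a specified time period. A minimal sketch of that check is given below. The per-severity time limits in `RESOLUTION_TARGET`, the ticket fields and the function name are hypothetical; real limits would come from the service-level agreements.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per severity; real values would come from SLAs.
RESOLUTION_TARGET = {
    "critical": timedelta(hours=1),
    "major": timedelta(hours=4),
    "minor": timedelta(hours=24),
}


def tickets_to_escalate(tickets, now):
    """Return the IDs of open tickets that have exceeded their resolution target."""
    overdue = []
    for ticket in tickets:
        if ticket["status"] != "open":
            continue  # closed tickets are never escalated
        limit = RESOLUTION_TARGET[ticket["severity"]]
        if now - ticket["opened"] > limit:
            overdue.append(ticket["id"])
    return overdue


now = datetime(2015, 4, 1, 12, 0)
tickets = [
    {"id": "T1", "severity": "critical", "status": "open",
     "opened": datetime(2015, 4, 1, 10, 0)},
    {"id": "T2", "severity": "minor", "status": "open",
     "opened": datetime(2015, 4, 1, 10, 0)},
    {"id": "T3", "severity": "major", "status": "closed",
     "opened": datetime(2015, 3, 31, 10, 0)},
]
overdue = tickets_to_escalate(tickets, now)
```

In this example only the critical ticket, open for two hours against a one-hour target, would be escalated.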

Table 10: Administrative Functionality

Functionality: Administrative
Description: This functionality keeps track of event-related activity, such as actions performed when a problem occurs. The resulting information is used by the Business Intelligence agent to find repeat problems.
Goals: Search for active tickets, View service-level agreements (SLAs), Search for archived events, Filter on ticket severity requiring service advisory notification, Associate services/users impact caused by event, Group events by device and root cause, Identify repeat problems, Calculate resolution costs
Actions: Report
Triggers: Escalate issue to meet SLAs, Send periodic reports
Information Used: Event, Device, and Resources data
Information Produced: Report of devices with repeat problems

In the high-level architectural design stage of Prometheus, the overall structure of the system is described using a system overview diagram. The System Overview Diagram is produced based on the system specifications discussed so far in this chapter.

Figure 3: System Overview Diagram

The data-coupling diagram presented in Figure 4 is based on the initial functionality descriptors and the initial system overview diagram.

Figure 4: Data Coupling Diagram

As part of the high-level architectural design stage of Prometheus, we can define which agents will exist within the system. Based on the system overview diagram there are six types of agents, pictured in Figure 5:

- NOC Administrator Agent
- Technician Agent
- Incident Manager Agent
- Subject Matter Expert Agent
- Business Intelligence Analyst Agent
- Device Vendor Technical Support Agent

Figure 5: Agent Acquaintance Diagram

The interaction diagram pictured in Figure 6 describes the arrival of a new incident that can be resolved by the technician agent.

Figure 6: Interaction diagram when a new incident arrives

The interaction diagram shown in Figure 7 describes a scenario where an incident requires escalation to the Subject Matter Expert as well as to Device Vendor Technical Support before being resolved. However, it does not involve the Incident Manager Agent.
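The two scenarios just described differ only in how far an incident travels along the escalation chain. The sketch below illustrates that idea with plain functions standing in for the agents. The handler functions, the `cause` field and the rule each handler uses to decide whether it can resolve an incident are hypothetical and are not taken from the interaction diagrams.

```python
def resolve_with_escalation(incident, handlers):
    """Try each handler in order; return (resolver_name, names_attempted)."""
    attempted = []
    for name, attempt in handlers:
        attempted.append(name)
        if attempt(incident):
            return name, attempted
    return None, attempted


# Hypothetical stand-ins for the agents; each returns True if it resolved the incident.
def technician(incident):
    return incident["cause"] == "known procedure"


def sme(incident):
    return incident["cause"] in ("known procedure", "configuration")


def vendor_support(incident):
    return True  # assumed to resolve everything, for the sake of the example


chain = [
    ("Technician Agent", technician),
    ("Subject Matter Expert Agent", sme),
    ("Device Vendor Technical Support Agent", vendor_support),
]

simple_resolver, simple_path = resolve_with_escalation({"cause": "known procedure"}, chain)
hard_resolver, hard_path = resolve_with_escalation({"cause": "software defect"}, chain)
```

The first call stops at the technician, as in the simpler scenario; the second passes through all three agents before the incident is resolved.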

Figure 7: Interaction diagram when a new incident arrives requiring escalation

The interaction diagrams presented in Figures 6 and 7 cover a basic scenario. For the purpose of this essay only a few interaction diagrams were produced. Depending on the complexity of the network and the devices to be managed, there will be several additional interaction diagrams for each scenario.

Agents must interact with each other to perform their tasks. Agent communication can be achieved through the communication protocol presented in Figure 8. The aim of a communication protocol is to allow agents to synchronize and exchange messages.

Figure 8: Main Interaction Protocol

As part of the detailed design stage of Prometheus, the System Overview Diagram can be produced using the previous diagrams and the system specification. The System Overview Diagram is presented in Figure 9, where interactions between agents are made through the main communication protocol. It also shows that each agent has actions, percepts and messages associated with it.

Figure 9: Final System Overview Diagram

The multi-agent system approach can be applied to a variety of problems aimed at automating manual processes. As previously explored in Chapter III, MAS has been advocated as a way of automating fault correction in power grid networks. The MAS approach to automated fault correction in computer networks faces greater challenges than in power grid networks because of the variety of services offered over these networks, which end users of such services consider as one. However, as proposed in this essay, multi-agent systems can be beneficial in automating some of the basic tasks performed by network technicians.

There are several ways of identifying system improvements achieved using a Multi-Agent System (MAS); companies may have their own way of measuring such improvements, based on the problems that led them to look into solutions such as MAS. One way to measure network fault correction after adopting MAS is to compare it with the time the manual process takes to resolve similar incidents. However, this is not ideal: the manual process involves a human reacting to the incident, which can introduce time delays, whereas automated fault correction using MAS requires the development of the procedures an agent would execute in response to an incident. Other benchmarks may include the cost savings from deploying MAS, where the savings would come from a combination of human resource reduction and improved network uptime, as automated actions are performed to resolve incidents with near-instant reaction from agents.

Computation cost among agents allows a tradeoff between time and solution quality. This can play an important role in network fault correction, where an agent makes a tradeoff between a lengthy procedure and a temporary yet quicker solution to restore service. Such decision making can help restore service as soon as possible, after which agents focus on correcting the issue using the original (lengthy) procedure.

An implementation of the multi-agent system approach to automating network fault correction must focus on a specific device type, targeting a subset of faults where the correction process can be clearly identified and the steps performed to correct the fault are the same every time. As discovered in the design chapter, agent communication is minimal and agents are relatively autonomous. This essay suggests a bottom-up methodology for implementing an agent-based fault correction system; this allows the system to be highly adaptable and scalable without relying on an abstract representation. However, when following a bottom-up methodology it is important to keep in mind a high-level design of the global system.
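The tradeoff described earlier in this discussion, between a quick temporary fix and the lengthy original procedure, can be sketched as a small selection rule. The procedure records, the `outage_budget_minutes` parameter and the function name below are hypothetical, and the rule shown (prefer a permanent fix if it fits the time budget, otherwise restore service first and correct afterwards) is only one possible policy. The sketch assumes at least one procedure is supplied.

```python
def plan_correction(procedures, outage_budget_minutes):
    """Pick a procedure to run now and, if needed, a permanent one to run afterwards."""
    permanent = [p for p in procedures if p["permanent"]]
    fits = [p for p in procedures if p["minutes"] <= outage_budget_minutes]
    if not fits:
        # Nothing restores service within the budget; fall back to the fastest option.
        first = min(procedures, key=lambda p: p["minutes"])
    else:
        # Prefer a permanent fix when it fits the budget, else the quickest workaround.
        first = min(fits, key=lambda p: (not p["permanent"], p["minutes"]))
    follow_up = None
    if not first["permanent"] and permanent:
        # Service is restored by a workaround; schedule the full correction next.
        follow_up = min(permanent, key=lambda p: p["minutes"])
    return first, follow_up


# Illustrative procedures only; names and durations are made up for the example.
procedures = [
    {"name": "fail over to secondary link", "minutes": 2, "permanent": False},
    {"name": "replace and reconfigure interface", "minutes": 90, "permanent": True},
]

urgent_first, urgent_follow_up = plan_correction(procedures, outage_budget_minutes=10)
relaxed_first, relaxed_follow_up = plan_correction(procedures, outage_budget_minutes=120)
```

With a ten-minute budget the agent would fail over first and schedule the full procedure as a follow-up; with a two-hour budget it would run the permanent procedure directly.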

CHAPTER V CONCLUSIONS AND FUTURE RESEARCH

The research results show that there is great potential for agent technology to be used in dynamic environments, and they show the general viability of MAS for automating manual tasks. Similarly, the MAS approach to network incident management presented in this essay shows great potential for self-healing networks. MAS offers several advantages over conventional systems, mainly through agent autonomy and cooperation, making it well suited to solving complex problems within a network using specialized agents that each focus on a specific issue. The distribution of agents allows for distributed network incident management, with several agents working concurrently, each with its own focus area. This allows network managers to build and deploy new agents as they add new device types to their networks, making the MAS approach highly scalable. Through the integration of new technologies such as artificial intelligence and semantics, multi-agent systems are increasingly flexible in dealing with uncertainties in dynamic environments. These features will allow powerful new agent-based applications to support ever-growing computer networks, and will allow network managers to change agent internals to accommodate device software upgrades or simply to add tasks. This is a great advantage over a single system being responsible for managing all devices: changes can be made to one specific device agent without disturbing the others, in turn minimizing the risks to the network. Other advantages include the ability to automate incident resolution, minimizing manual tasks and reducing recovery times. However, it is important to note that multi-agent systems lack a common conceptual and technological foundation for mass deployment in production environments. Future developments in MAS, as well as the future growth of the Internet of Things, will promote the notion of self-healing networks, where MAS can play an important role. Finally, the methodology proposed in this essay is practical when used for basic automation as a starting point, with an iterative approach to further improvements as network device behaviors become apparent.

IMPLEMENTATION METHOD & FUTURE RESEARCH

Research shows that software agents and agent-based systems have been an active research area in the past decade. However, most agent-based applications have focused on simulation and modeling issues. To verify the usefulness of multi-agent systems for network incident management, one must implement the solution with agents performing routine tasks on switches, routers and other network devices within a lab environment. A first-stage implementation of automated fault correction should cover commonly recurring incidents, identified by studying network fault data from the fault management system currently in use. Test results using the MAS approach should be compared with those of the traditional fault management systems currently in use. The comparisons would support further implementation of multi-agent systems to automate additional incidents. The Prometheus methodology can be used to design multi-agent fault management systems in multiple phases in order to reduce the complexity of such implementations. This allows companies to see the benefits of MAS as the system is being implemented across the network, and allows new discoveries to be incorporated into the system, using an iterative model to refine the design. The Java Agent Development Framework (JADE) can be used as an implementation platform.
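Selecting the commonly recurring incidents for a first-stage implementation could begin with a simple frequency count over exported fault records. The sketch below assumes each record carries a `device_type` and a `root_cause` field and uses a hypothetical threshold of three occurrences; both the record layout and the threshold are illustrative assumptions rather than properties of any particular fault management system.

```python
from collections import Counter


def recurring_incidents(fault_history, min_occurrences=3):
    """Count (device_type, root_cause) pairs and keep those seen often enough."""
    counts = Counter((f["device_type"], f["root_cause"]) for f in fault_history)
    return [(key, n) for key, n in counts.most_common() if n >= min_occurrences]


# Illustrative fault records only.
fault_history = [
    {"device_type": "switch", "root_cause": "link failure"},
    {"device_type": "router", "root_cause": "power supply"},
    {"device_type": "switch", "root_cause": "link failure"},
    {"device_type": "switch", "root_cause": "link failure"},
    {"device_type": "router", "root_cause": "software defect"},
]
candidates = recurring_incidents(fault_history)
```

In this example only the switch link failures recur often enough to qualify, which matches the recommendation to target a specific device type and a subset of faults first.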
