IT Service Management: Understanding Office 365 contracts


In this RoboTech we will cover a very important topic for organizations using Cloud services, especially Office 365: contracts, what guarantees they include, and how to optimize your relationship with Microsoft.

With Software as a Service (SaaS) like Office 365, the role of the admin team has shifted dramatically from infrastructure support and maintenance to service management. The purpose of IT Service Management is to manage end-user expectations, reduce the number of issues, reduce the mean time to repair, and ensure the best possible service delivery to your users. So let's take a look at Microsoft contracts.

Microsoft Office 365 Contract Management

Contracting with Microsoft for Office 365 gives you a best-in-class collaboration SaaS application. It comes with several guarantees in terms of service availability from a Microsoft perspective. An SLA is a kind of insurance against service disruption, so the first thing to do is to understand the limitations of this insurance. Basically, the exclusions concern:

- Anything happening outside of Microsoft's reasonable control (force majeure)
- Anything happening outside of their datacenter
- Anything that has been caused by your company (not following Microsoft recommendations, bad configuration, bandwidth, unauthorized actions, etc.)
- Any downtime happening during scheduled downtime

To be clear, the service delivered to your end-users is NOT guaranteed by the Microsoft SLA. That is completely normal, as Microsoft is not running your network, your ISP or anything inside your infrastructure. Only the service delivered to the edge of their datacenter is guaranteed, provided that you didn't contribute to its failure.

Now that we understand what is excluded, let's look at what the famous 99.9% means for you. Microsoft calculates a downtime ratio based on your total number of user minutes of use of the service.

The calculation is:

[(User Minutes - Downtime Minutes) / User Minutes] * 100

As downtime only counts for the users that are impacted, you might surmise that it takes a big incident to go under 99.9% availability. Let's do a short calculation for a company with 10,000 users. To breach the 99.9% SLA, the downtime would have to exceed 0.1% of the total user minutes for the month. That means an incident of almost 45 minutes affecting all 10,000 of your users in a month, or a proportionally longer incident affecting fewer of them.

We now understand the general limitations of the contract and what is generally insured. Now let's look at the services.

Microsoft Services Guaranteed

Microsoft Exchange Online

For your users, and so for your ITSM, Exchange Online encompasses a wide range of actions including accessing mailboxes from Outlook, sending email, creating meetings, checking free/busy statuses, searching for mail in the mailbox, etc. But in the Microsoft SLA, the only service guaranteed for Exchange Online is the ability to send or receive email with Outlook Web Access. Here we are speaking about availability only, not about performance. If the service is slow, it is still considered up from an SLA perspective, even if your users might consider it down. Any other Microsoft Exchange feature is excluded from the SLA.

Let's continue with Microsoft Teams. The calculation is the same, but it only covers the ability for a user to read or post to chat conversations for which they have appropriate permissions. Nothing is guaranteed regarding calls, video sharing, etc.
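As a rough illustration of the uptime formula given earlier on this page, the following Python sketch computes the monthly uptime percentage and the downtime budget for a given SLA. The function names are ours, and the example assumes a 31-day month with no scheduled downtime subtracted from the user minutes; your contract defines the exact terms.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percentage(user_minutes: float, downtime_minutes: float) -> float:
    """Monthly uptime % = (user minutes - downtime minutes) / user minutes * 100."""
    if user_minutes <= 0:
        raise ValueError("user_minutes must be positive")
    return (user_minutes - downtime_minutes) / user_minutes * 100


def downtime_budget(users: int, days_in_month: int, sla_percent: float) -> float:
    """User minutes of downtime that can accrue before the SLA is breached."""
    total_user_minutes = users * days_in_month * MINUTES_PER_DAY
    return total_user_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    # Assumption: every user is counted for every minute of a 31-day month.
    users, days = 10_000, 31
    budget = downtime_budget(users, days, 99.9)
    print(f"Downtime budget: {budget:,.0f} user minutes")
    print(f"Equivalent outage if all users are affected: {budget / users:.1f} minutes")
```

Under these assumptions the budget for 10,000 users is 446,400 user minutes, which is roughly 44.6 minutes of outage if every user is affected.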

If we look at OneDrive, the only service guaranteed is the ability for a user to view or edit files that are stored on their personal OneDrive for Business storage.

If we look at Microsoft SharePoint Online, the SLA is similar and consists of the ability for a user to read or write any portion of a SharePoint Online site collection for which they have appropriate permissions.

Things are a bit different with Skype for Business Online. It is one of the only services that also has a kind of performance SLA. There are 3 SLAs for Skype:

- The first is based on the ability for a user to see presence status, conduct instant messaging conversations, or initiate online meetings.
- The second is based on PSTN Calling and Conferencing, guaranteeing the ability of a user to initiate a PSTN call or conference.
- The last SLA on Skype is about voice quality.

For voice quality, Microsoft basically calculates a Network MOS that predicts how the end-user would rate the call quality. They then check how long the poor-quality calls last and express that as a ratio of the total number of user minutes in a month. The Network MOS is based on a constant measurement of round-trip time, packet loss, jitter and concealment factors. The calls need to be placed on Skype for Business Certified IP desk phones on wired Ethernet. Any latency found on your own network would prevent you from claiming any credit in case of a major issue.

If you want more details on Skype for Business voice quality monitoring, you can watch our webinar about Skype for Business performance or read our dedicated GSX RoboTech articles.

At this point, to maximize your relationship with Microsoft, we would recommend the following:

- Read your contract and make sure you differentiate what Microsoft promises you from what you are promising to your business lines.
- Microsoft SLAs are a good starting point but cannot be the basis for your Service Delivery.
- 90% of the time, the root cause of a user's performance issue will be found outside Microsoft's range of responsibility. So, you need to implement Cloud Service Delivery best practices to deal with the end-to-end service delivery to your end-users.

But let's say that you have identified issues and you want to talk with Microsoft. What is required for that?

How to let Microsoft help you

To help you, Microsoft needs a certain number of statistics and facts:

- A detailed description of the incident
- Information regarding the time and duration of the downtime
- The number of locations and affected users
- A description of your attempts to resolve the incident

The question is then, how do you collect this information? How do you know that the service was possibly down on a Sunday at 3 am if you are not constantly monitoring it? These questions point to the necessity of monitoring, from a Microsoft service perspective and from an end-user perspective. We would also recommend that you do not forget to report your outages and performance issues. But for that you will need statistics, and there is an easy way and a hard way to get them.

The hard way: deploy and maintain complex scripts running from every location, alerting you when an issue arises and feeding databases that can be easily used to share the data with Microsoft.

The easy way: use a third-party solution, like GSX Gizmo Robot User for Office 365. The GSX Robot Users are small Windows services that you can install anywhere you want. The Robot Users act exactly as a user would on Office 365, performing complex end-user scenarios. They alert you in case of any availability or performance issue and provide all the data you need through Power BI or any other BI solution. To know more about the GSX Robot User, please read the dedicated article.

Finally, before contacting Microsoft, go through incident analysis to make sure that you are not responsible for what is happening. One way to do that is with an Exchange Online Service Level dashboard, which you can easily build in Power BI from the GSX Gizmo Robot User data.
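To give an idea of what the "hard way" involves, here is a minimal Python sketch of a probe that records timestamped results per location and then summarizes outage windows (start, end, number of failed checks), which is the kind of evidence listed above. Everything in it is an assumption made for illustration: the table layout, the function names, and the `check` callable, which stands in for whatever end-user action you script yourself. It is not how GSX Robot Users are implemented.

```python
import sqlite3
import time
from typing import Callable, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create (or open) the results database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS probe ("
        "ts REAL, location TEXT, action TEXT, ok INTEGER, latency_ms REAL)"
    )
    return conn


def run_probe(conn: sqlite3.Connection, location: str, action: str,
              check: Callable[[], None], now: Optional[float] = None) -> bool:
    """Run one check, time it, and store the result. check() raises on failure."""
    started = time.perf_counter()
    try:
        check()
        ok = True
    except Exception:
        ok = False
    latency_ms = (time.perf_counter() - started) * 1000
    ts = time.time() if now is None else now
    conn.execute("INSERT INTO probe VALUES (?, ?, ?, ?, ?)",
                 (ts, location, action, int(ok), latency_ms))
    conn.commit()
    return ok


def outage_windows(conn: sqlite3.Connection, action: str) -> list:
    """Group consecutive failed probes into (location, start_ts, end_ts, failures)."""
    rows = conn.execute(
        "SELECT ts, location, ok FROM probe WHERE action = ? ORDER BY location, ts",
        (action,),
    ).fetchall()
    windows, current = [], None
    for ts, location, ok in rows:
        # A success, or a change of location, closes the window in progress.
        if current and (ok or current[0] != location):
            windows.append(tuple(current))
            current = None
        if not ok:
            if current is None:
                current = [location, ts, ts, 0]
            current[2] = ts
            current[3] += 1
    if current:
        windows.append(tuple(current))
    return windows
```

Run on a schedule from each location, a script like this answers the "Sunday at 3 am" question: the stored rows show when the failures started, when they stopped, and which locations were affected.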

In the Exchange Online Service Level dashboard, you can see the service delivered per location, but also per action that your users are performing. With the convenience of the GSX Service Level dashboard, you not only have the service availability information, but also vital information about the performance you deliver and reach on a daily or monthly basis. It is also a perfect way to share your data with Microsoft to help them help you. We built these dashboards using the Microsoft and Gartner recommendations that we are about to detail.

Now that we've seen the benefits and limitations of the contract with Microsoft, we understand that you need to go a step further if you want to ensure service delivery to your end-users. The next RoboTech will focus on what Microsoft recommends to help you manage the service delivered to your end-users.

IT Service Management: Modern Service Management with Microsoft Recommendations

Microsoft states in their blog that Service Management should be a focus for all customers. Microsoft defines Modern Service Management as a way to ensure business consumption and productivity. This model is based on 4 pillars. Let's start with the Service Desk.

Service Desk and Normal Incident Management needs processes in place to support your users in their day-to-day lives. For that, Microsoft recommends leveraging the automation investment from the Office 365 service. They also recommend using your existing ticketing tools to measure the number of tickets and the escalation rates in order to focus on the right areas.

Regarding Administration and Feature Management, use every tool that Microsoft provides to manage the workload and configuration.

Evergreen Management defines the processes in place to test, implement and be ready for the continuous improvements that Microsoft regularly brings to Office 365. To manage that, you should make someone accountable, on a daily basis, for the triage of Message Center content and the integration of the notifications into your existing tooling.
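As a sketch of what a daily triage step could look like, the Python below turns a notification into a ticket stub with a priority and an owning queue. It makes no call to any Microsoft API: the `Notification` fields, the `QUEUES` mapping and the priority rules are all hypothetical and would need to be adapted to your own ticketing tool and to the data you actually receive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Notification:
    """Hypothetical, simplified shape of a Message Center post."""
    title: str
    services: List[str]
    action_required: bool
    major_change: bool


# Hypothetical mapping from an affected service to the owning team's queue.
QUEUES = {
    "Exchange Online": "messaging",
    "SharePoint Online": "collaboration",
}
DEFAULT_QUEUE = "service-management"


def triage(note: Notification) -> dict:
    """Turn one notification into a ticket stub for the existing ticketing tool."""
    if note.action_required:
        priority = "high"
    elif note.major_change:
        priority = "medium"
    else:
        priority = "low"
    # Route to every team that owns an affected service, without duplicates.
    queues = sorted({QUEUES.get(service, DEFAULT_QUEUE) for service in note.services})
    return {"title": note.title, "priority": priority, "queues": queues}
```

The point of the design is that the accountable person only reviews the stubs; routing and prioritization follow rules that the teams have agreed on in advance.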

You should stay updated by reading blogs, TechNet articles and the Message Center notifications, to understand when new features will be implemented and how they could impact your end-user experience. Let's focus now on monitoring and major incident management.

Monitoring and Major Incident Management

This pillar defines the processes that ensure the detection and troubleshooting of end-user service delivery issues, regardless of root cause. The main challenge here is to bring back visibility on the services that you actually deliver to your end-users. For that, Microsoft recommends integrating the Office 365 Service Health Dashboard notifications into your existing incident workflow and tools, and then combining this information with end-to-end monitoring scenarios that measure the true end-user experience for each main capability of Office 365, from where your users are.

By capabilities, Microsoft means, for example:

For Microsoft Exchange:
- Login via Outlook
- Mail flow
- Mobile sync

For SharePoint Online or OneDrive:
- Login
- Download / upload document

For Skype for Business Online:
- Login via Skype client
- Instant messaging and presence
- Voice call

As we can see, Microsoft clearly states that you need to have visibility on what users are able to accomplish with Office 365. To complete this first set of recommendations, we would recommend that you:

- Find the right balance between measuring too many complex scenarios, which would quickly become unmanageable, and not measuring enough, which would decrease the visibility you have on the service.
- Check performance as well; measuring availability is not enough.
- Measure whatever is creating the most user complaints and tickets for your Service Desk.

If we take the example of Exchange Online again, you would then also include:

- The availability and performance of creating meetings and of using the free/busy feature
- The availability and performance of searching through a mailbox
- The availability and performance of downloading an attachment

That would be a good first step for your Service Delivery of Exchange. As we already mentioned, the GSX Gizmo Robot Users have been designed to perform the end-user transactions you need to measure. More information is available in the article about GSX Robot Users. That is why GSX has been approved by Microsoft to deliver our solution on both the Azure Marketplace and the Office 365 AppSource.

To finish our discussion of the Microsoft recommendations about Office 365 Service Management, it is also interesting to understand how Microsoft is organized to manage incidents and communicate on them.

Response to Incidents: Microsoft Organization

Within the Microsoft Office 365 organization, multiple teams work together to prevent and communicate about service outages.

The teams involved are the following:

- Incident managers, who have deep expertise in their relevant area. They determine the scope of the outage and the root cause, then go through incident resolution.
- Communication managers, who also have deep expertise. They coordinate and provide information across internal teams and post customer-facing communication to the Service Health Dashboard.
- Support, made up of technical resources providing 24x7 customer phone and web support.
- Service account management, made up of account representatives who can escalate tickets and provide rapid on-site support.

Below you can see the communication process Microsoft follows for each incident. You don't have to entirely replicate the Microsoft Service Incident organization, but it gives you a good starting basis on how to be organized, especially for large organizations.

This concludes our RoboTech on Microsoft recommendations for Modern Service Management of Office 365. But Microsoft is not the only SaaS provider on the planet, and best practices for Cloud Service delivery have been developed and refined for many years. We will see in the next article how cloud service delivery management best practices can be beneficial for your IT organization and your end-users.

IT Service Management: Cloud Service Management best practices for Office 365

ITSM represents how IT manages, operates and transitions technology, but also how it designs services and manages risks within the organization. You have certainly already seen what Service Strategy, Service Design, Service Transition, Service Operation, or Service Incidents are. They are clearly defined in the ITIL processes, and I would warmly recommend that you read about these processes if you haven't started to implement these best practices for service management.

However, even though they are clearly defined, these processes face serious challenges with Cloud Service Delivery.

In terms of Service Strategy, one of your goals is to avoid the risks of Shadow IT. For that, you need to understand the use of the current solution and the overall business needs, and to anticipate new services. Your team must run regular surveys among your employees and analyze public cloud service offerings.

When you think of Service Design and Service Operation, Software as a Service can be a headache. The challenge is to first define what service you want to manage. In a previous RoboTech, we've seen that even for a simple Exchange Online environment, the notion of service can be very versatile. Then you need to define Service Level Targets instead of SLAs for these services. Service Level Targets are better here because they are based on best effort and are not tied to penalties. You cannot take the risk of an SLA when you are not controlling the entire route of the service to the end-user. Service Level Targets should be defined with the business lines at the location level (per country or region) and should not forget your mobile users. That is why you need a way to measure the mobile experience when it makes sense. Defining SLTs will enable you to start working on continuous service improvement, pinpointing the issues and justifying investment and success in service delivery.

Regarding Service Transition and Change Management, the continuous migration from on-premises devices to mobile ones and the access to hybrid cloud systems clearly complicate the job of IT administrators. To face that challenge, you should monitor the health and usage of your complex hybrid identity management. The correlation between hybrid CMS and SaaS end-user service delivery is critical to manage.

Service Improvement is now even harder to reach than before, because you can only improve what you can measure. For that, you need to implement a way to continuously measure the end-user service that is really delivered to your locations. We've already explained how you can do that with GSX Gizmo Robot Users.

Finally, regarding Service Incident and Service Desk, we've already seen what Microsoft recommends and how GSX can really help you on that topic. We would just stress a few additional recommendations here.

- You need to correlate the end-to-end user experience with the health of every component that can impact it, including the Microsoft Service Health Dashboard.
- You need to determine quickly whether it is a local or a tenant-wide issue.
- You need to know whether anything that can impact the end-user experience is having trouble, and then immediately share the information with the right team to fix the situation.

That will drastically reduce the mean time to repair and clearly improve the whole process of incident resolution and incident assignment, enabling better incident analysis and evaluation and the creation of a knowledge base.

Now that we've covered the main aspects of IT Service Management and what to put in place to face the Cloud delivery challenges, it is important to focus on how to set the right targets for your service delivery.

Define Service Availability & Performance Level Target

We've already seen that defining the right services, based on user capabilities, can be a challenge but is important to do. Next is to define what level of service we want to reach. From a user perspective, the notions of availability and performance are really intertwined. Something too slow to use quickly becomes an availability issue for them. So now we have to decide what level of availability and performance we want in our Service Level Target and how to put it in place.

First you need a way to continuously test the service you want to provide.

Again, one of the best ways to do that is to use GSX Gizmo Robot Users. The purpose is to define baselines in terms of availability and performance, on a stable environment, at the location level, action per action. These baselines should recognize patterns of utilization: usage varies during the day, the week, the month and even the year. With these baselines you can start to define thresholds on service delivery for alerting purposes. Being alerted on any service degradation allows you to react quickly and fix issues before they really impact your users.

Now you can calculate a target, in terms of minutes of downtime per month, that provides a good balance between the business needs and the resources you have. For example:

- A service true availability target of 98%.
- A Service Level Target latency threshold of, for example, 500 milliseconds to download a 1 MB document.
- A service performance target of, for example, 80%. It represents the percentage of the time during which the service latency is below the threshold.

Now that we have these Service Level Targets in place, you can really measure their achievement and enforce them. Here comes Service Capability monitoring.

Implementation of Service Capability monitoring

Service Capability monitoring defines the ability of your own environment to deliver the cloud service. Your environment is of course your hybrid components, your network, your internal applications using the cloud service and generally everything that can impact the end-user experience. What's critical here is to be able to measure and correlate the information.
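The baseline and threshold idea described earlier in this section can be sketched in a few lines of Python. This is only an illustration under our own assumptions: baselines are kept per hour of the day to reflect usage patterns, and the alert rule (mean plus `k` standard deviations, with `k = 3`) is one common choice that we picked for the example, not something the text above prescribes.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baselines(samples):
    """Build {hour: (mean_ms, stdev_ms)} from (hour_of_day, latency_ms) samples.

    The samples should come from a stable period, so that busy and quiet
    hours each get their own baseline.
    """
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)
    return {hour: (mean(values), pstdev(values)) for hour, values in by_hour.items()}


def alert_threshold(baseline, k=3.0):
    """Latency above mean + k standard deviations triggers an alert (k is assumed)."""
    average, deviation = baseline
    return average + k * deviation


def is_degraded(hour, latency_ms, baselines, k=3.0):
    """True when a new measurement breaches the threshold for its hour."""
    if hour not in baselines:
        return False  # no baseline for this hour yet, so no alert
    return latency_ms > alert_threshold(baselines[hour], k)
```

In practice you would keep one set of baselines per location and per action, and probably per day of the week as well, since the same latency can be normal on a Monday morning and abnormal on a Sunday night.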

For that you should have a single pane of glass that breaks down the silos of your IT environments and displays, in real time, the health and main usage statistics of every component that can impact your Service Level Target. GSX Gizmo for Office 365 is the only tool on the market that provides you that.

We have pretty much covered the essential points of Service Delivery Management when it comes to SaaS applications like Office 365. In the next RoboTech we will focus on best practices for real-time and historical dashboards. These are fundamental to break down IT silos and really manage Office 365 service delivery.

IT Service Management: Office 365 Service Level dashboard

Gartner's recommendations for useful service dashboards

For Gartner, a service dashboard needs to show the end-user experience in terms of availability and performance, to help with root cause analysis, and to allow trending and planning based on past statistics while providing insight about what might happen in the future.

The purpose of a service delivery dashboard is to break down the silos that exist across IT departments by providing a common source of reliable statistics that display and explain the end-user performance and the health of the infrastructure components that impact it.

Gartner's name for the service delivery dashboard is the top-level business and end-user experience dashboard. Their name for the service incident dashboard is the triage dashboard, which is there to quickly analyze the main root causes of issues. They also mention a dependency-mapping dashboard that helps identify the hybrid-cloud components at stake. We will see that you can do that with a platform-oriented dashboard.

As we can see, Gartner's service management dashboards are completely in line with Microsoft recommendations and ITSM best practices when it comes to Cloud Service Delivery Management. Let's now see, through examples, how GSX Gizmo for Office 365 enables you to manage your Office 365 service delivery.

Using GSX Solutions to build Office 365 Service Management practice

As a first example, we will focus on the Exchange Online service. Here is a summary of what we will focus on.

Let's look first at our real-time top-level dashboards, which are displayed in our Gizmo real-time UI. What we see here are 3 Robot Users operating from three different locations (Boston, Philadelphia and Azure) but performing the same monitoring of Exchange Online.

Here you can see that the Boston location is not in its best state, but thanks to our Robot User we have been alerted even before users start to realize that something is not working properly. Let's take a deeper look.

We can see that in the same time frame, the Robot User from Azure and the one from Philadelphia had no issue. And if we want to correlate the data with the Service Level Dashboard, we can also see that there was no issue from an Office 365 perspective at that time.

From the data that we gather, it is safe to say that Exchange Online is meeting its SLA. Let's go deeper into the Boston location to see what is going on.

Here you can see that Boston is clearly having an issue with the network and the end-user experience. Going deeper into the network statistics:

We can see that there is clearly excessive round-trip time and packet loss from this location. It is now the perfect moment to contact the network team in Boston with this information.

With our GSX Gizmo Robot User, you can easily get real-time, unbiased Exchange Online performance data from your most critical locations. We saw that Boston's user experience was clearly subpar compared to the overall Exchange user experience, and that it clearly was not a tenant issue or a problem in multiple locations, but something specific to the Boston network environment.

The second dashboard we mentioned earlier is the triage dashboard. Let's take a look at one of our real-time triage dashboards.

Our Office 365 real-time dashboard can contain everything that can impact the Office 365 end-user experience. For example, here: ADFS Proxy, ADFS, Azure AD Connect, any mail routing, any ActiveSync, SharePoint Online, etc.

If we take a look at Azure AD Connect, for example, we can see that the synchronization service is clearly down, preventing any sync from taking place. So again, you can quickly contact the identity management team to resolve the issue without contacting Microsoft.

As you can see, these dashboards allow you to be proactive instead of reactive, because you can be alerted on these issues before end-users even realize what is going on.
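The reasoning used in the Boston example (one location failing, the others fine, nothing reported on the Office 365 side) can be written down as a small decision rule. The Python below is a simplified sketch with names and categories of our own choosing; a real triage would also look at which actions fail and at the health of the hybrid components.

```python
def classify_incident(results, service_health_ok, min_locations=2):
    """Classify an incident from the latest probe result of each location.

    results: {location: True if the last probe succeeded, False otherwise}
    service_health_ok: False if the provider's health dashboard reports an incident
    Returns (category, list of failing locations).
    """
    failing = sorted(location for location, ok in results.items() if not ok)
    if not failing:
        return "healthy", failing
    # Every monitored location failing, or a provider-side incident, points
    # away from a purely local cause.
    all_failing = len(failing) == len(results) and len(results) >= min_locations
    if not service_health_ok or all_failing:
        return "tenant-wide", failing
    if len(failing) == 1:
        return "local", failing
    return "multi-location", failing


if __name__ == "__main__":
    latest = {"Boston": False, "Philadelphia": True, "Azure": True}
    print(classify_incident(latest, service_health_ok=True))
```

With the three locations from the example and a healthy service status, the rule classifies the incident as local to Boston, which is the signal to call the local network team rather than Microsoft.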

We have seen how real-time top-level dashboards and triage dashboards can significantly guide your understanding of what is going on and help you fix issues without involving Microsoft, and before your users are impacted.

Now let's examine the service level delivery of Office 365 at the location level over time, and how to see your achievements in terms of Service Level Targets for your users. For that we will use Power BI. If you want more information on how to read Power BI dashboards, please read the dedicated article.

You can see here an example of our top-level dashboard for Exchange Online services delivered to 3 different locations. Several gauges at the top show the percentage of achievement of the Service Level Target we defined. For more information about how to define a Service Level Target, please read the corresponding RoboTech article.

To sum up, each action that a user can perform with Exchange can be considered a service that you provide. That service is based on a hybrid infrastructure encompassing Office 365, your ISP, your network and any server and application you maintain that can impact the end-user experience.

The purpose of an SLT is to measure the percentage of achievement of service quality for your users. So, you have to decide what the happiness threshold is for each action. What performance should be delivered to your users for them to consider the service healthy? For example, 200 ms to open a mailbox, 500 ms to download an attachment, etc.

Once you've defined that, you want to know how often you deliver a good service.

That means calculating the percentage of time during which you deliver the service below the threshold. For example, you want to make sure that 98% of the time, the Exchange Online feature can be used by your users with the performance that you have defined. And that is your Service Level Target.

You can see at the top right of the dashboard that the SLT here is 98%, and you can see what it means in terms of minutes. Basically, an SLT of 98% allows the service to be down or degraded for about 29 minutes per day. It allows you to communicate on something real, something that your users understand, and it gives a very good sense of the quality of the service that you provide to your locations and to your users.

The top two gauges are a consolidation of all the services and actions for all the locations. The top left one represents their pure availability, while the top right one shows the overall achievement of the Service Level Target.

So right here, it looks good. But the fact that your overall performance is good does not mean that it is the same for each location. That is why it is important to check what is really happening location by location.

As we can see on the top critical locations chart, Boston seems to have far more problems than the 2 other locations. So let's take a look at the Boston statistics alone to get a better idea of what is going on there.
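To make the Service Level Target arithmetic from this page concrete, here is a short Python sketch. It computes the percentage of measurements that met the latency threshold and the number of minutes a target leaves for downtime or degradation over a given period. The function names and the convention of recording a failed measurement as `None` are our own assumptions.

```python
def slt_achievement(latencies_ms, threshold_ms):
    """Percentage of measurements that succeeded within the threshold.

    A failed measurement is recorded as None and counts against the target.
    """
    if not latencies_ms:
        raise ValueError("no measurements")
    good = sum(1 for value in latencies_ms if value is not None and value <= threshold_ms)
    return good / len(latencies_ms) * 100


def allowed_minutes(target_percent, period_minutes):
    """Minutes the service may be down or degraded while still meeting the target."""
    return period_minutes * (100 - target_percent) / 100


if __name__ == "__main__":
    day = 24 * 60
    print(allowed_minutes(98, day))        # budget over one day
    print(allowed_minutes(98, 30 * day))   # budget over a 30-day month
```

A 98% target leaves 28.8 minutes over one day and 864 minutes over a 30-day month, so the minute budget you communicate always needs to state the period it applies to.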

Now that we have isolated Boston from the other locations, you can see that Boston is not meeting all of our Service Level Targets. Several actions, corresponding to services that you deliver to your users, experienced more issues than they should. Free/busy, searching through the mailbox, and downloading attachments do not provide the desired quality of service in Boston.

So now we want to know what happened and quickly understand who or what is responsible for these issues. That is why we are going to take a look at our Power BI triage dashboard.

Triage Dashboard

Here we are focusing again on the Boston statistics in order to quickly understand how to improve the situation in Boston.

As you can see, the network performance uptime percentage shows that almost 50% of the time, the network between Boston and Office 365 does not meet our performance threshold. You can see below that it often shows excessive round-trip time and packet loss, as well as high DNS resolution times. Right here already, you clearly have enough information to ask your Boston network team to investigate the issue and fix it.
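A "network performance uptime" figure like the one discussed above can be approximated as the share of network samples that stay within all of your thresholds. The Python sketch below shows one way to compute it. The `NetworkSample` fields and the limit values are illustrative assumptions, not the thresholds used in the GSX dashboard.

```python
from dataclasses import dataclass


@dataclass
class NetworkSample:
    rtt_ms: float
    packet_loss_pct: float
    dns_ms: float


# Illustrative limits only; derive real values from your own baselines.
LIMITS = NetworkSample(rtt_ms=150.0, packet_loss_pct=1.0, dns_ms=100.0)


def within_limits(sample: NetworkSample, limits: NetworkSample = LIMITS) -> bool:
    """A sample is good only if every metric is at or below its limit."""
    return (sample.rtt_ms <= limits.rtt_ms
            and sample.packet_loss_pct <= limits.packet_loss_pct
            and sample.dns_ms <= limits.dns_ms)


def network_performance_uptime(samples, limits: NetworkSample = LIMITS) -> float:
    """Percentage of samples for which the network met all of its limits."""
    if not samples:
        raise ValueError("no samples")
    good = sum(within_limits(sample, limits) for sample in samples)
    return good / len(samples) * 100
```

Keeping the three metrics separate in the stored samples matters: the percentage tells you that there is a problem, while the individual values tell the network team whether to look at latency, loss or DNS first.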

As problems rarely come alone, we can also see here that the ADFS Proxy experienced some problems. We also see below that the federation request time rose dramatically, which of course impacted the end-user experience. Again, you can directly contact the ADFS team to have the problem fixed. But you can also provide more information by going into the platform-level dashboard of the ADFS Proxy.

Here we can see that our federation request performance was 50% below our defined performance threshold. But we can also see in the graph below that the ADFS Proxy server experienced excessively high CPU, RAM and disk time.

So again, instead of contacting Microsoft because of performance issues that you don't understand, you now have more than enough information to check with the identity management team so they can resolve the ADFS Proxy issues that we see coming out of Boston.

To sum up this part, we can see here how the triage dashboard can break down the silos between your IT departments, avoiding the blame game and going straight to the root cause of the issue. And finally, we have seen with our top-level dashboard that even if the performance of the services looks good overall, it is important to track it at the location level as well.
