Implementation of Automatic Monitoring for all your environments allows you to deploy better systems at lower risk!


Sophie Harper
5 years ago

How DevOpsPro helped Twist Bioscience reduce downtime and prevent potential problems by using diverse tools such as Prometheus, Grafana, Sentry and New Relic.

Twist Bioscience: leading innovation in DNA synthesis technology, used by its customers to revolutionize industries such as medicine, agriculture, industrial chemicals and data storage.

HIGHLIGHTS

CHALLENGES:
- More than 15 microservices running on the client's 5 different Kubernetes clusters.
- Many AWS services that need to be monitored.
- An on-premise environment with several VPN tunnel connections to the client's AWS accounts.
- Problematic code versions deployed to a number of AWS Lambda functions.
- A broken third-party library that caused many problematic symptoms.
- Non-scalable in-house code that had been deployed.

SOLUTIONS:
- Added various alerts using the Prometheus Alertmanager system.
- Added an alert that sends a notification immediately each time a human error occurs.
- Implemented Sentry, providing immediate notifications on integration and code issues.
- Grafana graphs helped us understand, and continue to refine our understanding of, the problems.
- Grafana shows all the resources consumed by the services running in the Kubernetes clusters.

RESULTS:
- Saved more than 25% per month on the cost of EC2 instances.
- Errors are reported quickly, so a solution can be implemented before customers experience any issue.
- Time saved by finding and fixing bugs and problems quickly.

A happy customer is one that has his bugs fixed before he realizes there were bugs to be fixed!

You may not realize it, but it's not magic to get error information as it is happening. These days, when a lot of companies are moving to microservices, containers, the cloud and so on, we expose ourselves to many different systems that can break and create downtime. For our customers, downtime equals losing money, and losing money is unacceptable to us.

Downtime can be prevented in many ways, but two essential factors are alerting and monitoring. By implementing the correct methods and tools for our customers, we reduce downtime and prevent the loss of money.
- With alerting, you are notified as soon as you have a problem with your systems.
- Monitoring helps you predict a potential problem and gives you an inside look at the core problems.

In the next several paragraphs, I'll walk through a case study we implemented for a client of ours, Twist Bioscience, showing how we helped them reduce downtime and prevent potential problems by using diverse tools such as Prometheus, Grafana, Sentry and New Relic. These tools cover all parts of the environment: cloud resources, infrastructure, dependencies and applications.

So let's get started with a description of the infrastructure and services we were dealing with: approximately 15 different microservices on 5 different Kubernetes clusters (dev, qa, staging, production and tools) on 3 different AWS accounts, several AWS Lambda functions with API Gateways in front of them, more than 10 AWS RDS DBs, an ElasticSearch cluster, Redis instances and more.

The tools are just a means to an end. Here we will focus on the improvements that allow the client to feel safe and secure in their systems. At the bottom of this article, I have provided an overview of the tools, taken from their official websites.


Next, I'll show several examples of problems we encountered, and how the alerting system notified us about them instantly.

1. Twist Bioscience has an on-premise environment. We created several VPN tunnel connections between their on-premise environment and their AWS accounts. A number of services in the cloud have to connect to on-premise services via these tunnels, so we want to know instantly if one of the VPN tunnels goes down for any reason. We added an alert using the Prometheus Alertmanager system that sends a notification to specific Slack channels each time this happens (AWS side alert). We can see here a Production VPN tunnel that is down.

2. The K8s clusters we created for them in specific environments are accessible to some developers, and human errors can happen: a developer can delete a deployment (a K8s resource) by mistake, and we want to know immediately when this occurs. When it does, an alert is sent to an #alert Slack channel that the developers are monitoring. It looks like this: We can see here an example of the mes-clu-celery deployment that is down in the Staging cluster.
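Both alerts above follow the same pattern: check a state, then notify a channel when the state is bad. The sketch below illustrates that pattern in plain Python. It is not the Prometheus Alertmanager configuration used in the project. The function names, the `Alert` type, the 1 = up / 0 = down convention and the message wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    channel: str  # Slack channel the notification would go to
    text: str     # human-readable message


def vpn_tunnel_alerts(tunnel_states, channel="#alert"):
    """Return one alert per VPN tunnel that is down.

    tunnel_states maps a tunnel name to 1 (up) or 0 (down).
    This encoding is an assumption, not taken from the article.
    """
    return [
        Alert(channel, f"VPN tunnel {name} is DOWN")
        for name, state in sorted(tunnel_states.items())
        if state == 0
    ]


def deployment_alerts(available_replicas, cluster, channel="#alert"):
    """Return one alert per deployment with no available replicas.

    available_replicas maps a deployment name to its number of
    available replicas (hypothetical input shape).
    """
    return [
        Alert(channel, f"Deployment {name} is down in {cluster} cluster")
        for name, replicas in sorted(available_replicas.items())
        if replicas == 0
    ]


if __name__ == "__main__":
    for alert in vpn_tunnel_alerts({"production": 0, "staging": 1}):
        print(alert.channel, alert.text)
    for alert in deployment_alerts({"mes-clu-celery": 0}, "Staging"):
        print(alert.channel, alert.text)
```

In a real setup the check would live in an alerting rule and the notification would be routed by the alerting system rather than built by hand; the sketch only shows the decision logic.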

3. Unfortunately, bugs are unavoidable because developers are human; all we can do is try our best to prevent them. Here, we can see an error in one of the third-party libraries they use. It causes a connectivity error from the local service to a SaaS service they use. Sentry notifies them immediately when the exception happens, so they were able to understand the error and fix it by downgrading the third-party library version.


Now, I will show some graphs from Grafana that helped us understand the problems, and explain how they saved Twist Bioscience money.

1. Twist Bioscience has AWS Lambda functions on several different AWS accounts. The functions are triggered by their services via API Gateway requests. On one occasion, a problematic function version was deployed and pushed the function's execution time past 30 seconds, and AWS API Gateway has a timeout of 30 seconds! As a result, the service that called this function got a timeout from the API Gateway. With the correct graph in Grafana we can see the timeouts: Here we can see API Gateway Latency hit 30 seconds.

2. Another example of an AWS Lambda function problem came when a problematic function was deployed. We can see in the graph below the number of errors the function produced:
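What the two graphs reveal can also be expressed as simple checks over the underlying series. The following Python sketch assumes latency samples in seconds and plain invocation and error counts. The function names, the tolerance value and the input shapes are hypothetical; the 30-second figure is the timeout quoted above.

```python
API_GATEWAY_TIMEOUT_S = 30.0  # timeout value quoted in this article


def timeout_hits(latencies_s, timeout_s=API_GATEWAY_TIMEOUT_S, tolerance_s=0.5):
    """Return the indices of latency samples that reached the timeout.

    A sample counts as a hit when it is within tolerance_s of the
    timeout, because a timed-out request is reported at roughly the
    timeout value. The tolerance is an illustrative assumption.
    """
    return [
        index
        for index, latency in enumerate(latencies_s)
        if latency >= timeout_s - tolerance_s
    ]


def error_rate(invocations, errors):
    """Return the fraction of invocations that failed (0.0 if none ran)."""
    if invocations == 0:
        return 0.0
    return errors / invocations


if __name__ == "__main__":
    samples = [0.4, 1.2, 30.0, 29.8, 2.0]
    print("samples at the timeout:", timeout_hits(samples))
    print("error rate:", error_rate(200, 50))
```

A flat line at the timeout value, as in the first graph, shows up here as a run of consecutive indices.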


3. More than 15 microservices were running on their Kubernetes clusters. We followed Kubernetes best practices and set compute resource requests and limits on the pods. Initially, Twist Bioscience chose the request and limit values based on the requirements of the frameworks they used. In Grafana, we have a graph that shows all the resources consumed by the services running in the Kubernetes clusters. When we looked at the metrics of some of the services in this graph, we saw a number of services with far more resources than they needed. After updating the services' resources to the correct numbers, they saved more than 25% per month on the cost of EC2 instances. Take a look at the graph below regarding memory usage: We can see here that the configured limit is much higher than the actual usage, so the pods need far less than what we allocated to them.
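The right-sizing step can be sketched as a small calculation: take the peak observed usage, add some headroom, and compare the result with what is configured. The Python below is one possible heuristic, not the method used in the project. The function names, the 20% headroom and the 64 MiB rounding step are assumptions chosen for illustration.

```python
import math


def suggest_request_mib(usage_samples_mib, headroom=1.2, step_mib=64):
    """Suggest a memory request from observed usage samples (in MiB).

    Takes the peak observed usage, multiplies it by a headroom factor,
    and rounds up to a multiple of step_mib. Both defaults are
    illustrative assumptions.
    """
    if not usage_samples_mib:
        raise ValueError("at least one usage sample is required")
    peak = max(usage_samples_mib)
    return math.ceil(peak * headroom / step_mib) * step_mib


def overprovision_ratio(configured_mib, usage_samples_mib):
    """Return how many times larger the configured value is than peak usage."""
    peak = max(usage_samples_mib)
    if peak <= 0:
        raise ValueError("peak usage must be positive")
    return configured_mib / peak


if __name__ == "__main__":
    usage = [180, 210, 195]  # hypothetical memory samples in MiB
    print("suggested request (MiB):", suggest_request_mib(usage))
    print("overprovision ratio:", round(overprovision_ratio(1024, usage), 2))
```

A ratio well above 1 corresponds to the gap between the limit line and the usage line in the memory graph.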

RESULTS

When I asked Roy Nevo, Director of Product Development at Twist Bioscience, if he could measure the time it took to identify a problem in his systems before and after implementing the alerting and monitoring systems, he said:

1. "Our Beta testing customers were reporting problems they experienced in the system before we knew about them. Since DevOpsPro implemented their solution for our full scale e-commerce launch, we now know about, and can fix, a problem before a customer experiences it."

2. "In regards to the time it takes for us to find out about a problem; it moved from days to minutes!!!"

By implementing monitoring and alerting, Twist Bioscience's team sees many long-term benefits, including improved efficiency across the entire company. This includes the streamlining of several crucial processes and the elimination of needless overhead that was wasting human resources. Twist Bioscience's team is left not only with a more efficient system, but also with the confidence that they're meeting their requirements and that their new features will withstand future company changes and growth.

TOOLS OVERVIEW

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. We use Prometheus to collect metrics from our K8s clusters, AWS CloudWatch, CI server and more, to check thresholds and to send alerts.

Grafana is an open-source metric analytics and visualization suite. It is most commonly used for visualizing time series data for infrastructure and applications. We use Grafana to build informative dashboards that give us a better in-depth view of what the hell is happening in our environments. Grafana uses AWS and Prometheus as data sources.

Sentry provides open-source exception tracking. It tracks every exception in your applications as it happens and sends the stack trace and environment information needed to prioritize, identify, reproduce, and fix each issue. For us, Sentry is a must: it is very easy to implement, we use it to track errors in all our applications, and it allows us to find bugs in the development stage and fix them before they move to a higher environment.

New Relic gives you deep performance analytics for every part of your software environment. With New Relic, we can optimize our services in terms of memory, CPU, number of workers and more.

"We knew we needed to improve our working production environment and after much research and based on outstanding recommendations, we chose to work with the experienced and trusted DevOpsPro."

Beneficial monitoring and alerting doesn't have to be a pain. Find out how we can help you prevent downtime and save money by leveraging our expertise to provide the right solutions, so you can achieve the desired results.

Contact DevOpsPro
17 Broad Ct
London, WC2B 5QN, United Kingdom
info@devopspro.co.uk