Practice 14 - Service Discovery, Monitoring, and Alerting in Cloud Infrastructure
In this lab, we will take a look at service discovery, monitoring, and alerting services used in cloud infrastructure.
Service discovery helps discover and track nodes, services, and applications. Monitoring keeps track of important events and metrics so that node, service, and application downtime or other anomalies are noticed immediately. When such abnormal behavior is detected, alerts are generated and notifications are sent via email, webhooks, or other HTTP endpoints. In this practice session, we will configure open-source tools for service discovery, monitoring, and alerting. The following figure shows an overview of the services to be configured on three OpenStack VMs.
- Consul: Used for automated service discovery, i.e. tracking the machines to be load balanced and scaled. Consul agents are installed on the servers, which then join the Consul cluster.
- Prometheus: Used for collecting metrics and configuring alerts.
- Prometheus Alertmanager: Used for sending alerts. It forwards alerts to external webhooks, i.e. arbitrary REST API endpoints.
- Prometheus Node Exporter: Installed on the machines to expose their metrics. Prometheus is configured to periodically pull metrics from every server.
- Grafana: Used for visualizing performance metrics.
References
- Consul documentation: https://developer.hashicorp.com/consul/tutorials
- Prometheus: https://prometheus.io/docs/introduction/overview/
- Alert Manager: https://prometheus.io/docs/alerting/latest/alertmanager/#alertmanager
- Grafana: https://grafana.com/oss/
Exercise 14.1 Setting up Consul, Prometheus, and Grafana services in VM1
In this task, we are going to set up Consul, Prometheus, and Grafana on VM1; these are used for service discovery, metrics collection, and monitoring of nodes and services.
- Create a Virtual Machine with name
Lastname_VM1
- Image:
ubuntu22.04
- Flavour: should be
m3.tiny
- Security Group: should be
lab14
- SSH into the VM1
- Installing Consul as a Server
- Install Consul
sudo apt-get update && sudo apt-get install consul
- Check for the Consul installation
consul -v
We have installed Consul; it can be configured to act as either a server or a client agent through its configuration settings. In the next step, we configure this machine as a server. - Create an encryption key with the Consul CLI
consul keygen | sudo tee /etc/consul.d/consul.keygen
- Edit/Create the Consul's configuration file
sudo nano /etc/consul.d/consul.hcl
(NB! Change the bind_addr to your VM1_IP_ADDRESS)
datacenter = "lab14"
bootstrap_expect = 1
server = true
data_dir = "/opt/consul"
log_level = "INFO"
client_addr = "0.0.0.0"
bind_addr = "VM1_IP_ADDRESS"
node_name = "consul"
leave_on_terminate = true
rejoin_after_leave = true
connect {
  enabled = true
}
ui = true
encrypt = "3EY9MGe2a1ElUNrsl5E6rHAtJEL7lqe+1CDZAyY8yTU="
- We set the datacenter group name with
datacenter = "lab14"
and designated this node as a server with server = true
- Start the Consul service
sudo service consul start
- Access Consul's web UI at http://VM1_IP_ADDRESS:8500 and see the up/down services; currently it is only the Consul server itself
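- You can also verify the cluster membership from the command line; the standard Consul CLI command below lists the agents that have joined and their status:
consul members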
- Installing Prometheus
- Download the Prometheus release archive
wget https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz
- Untar
tar -xvzf prometheus-2.37.0.linux-amd64.tar.gz
- Copy to bin
sudo cp prometheus-2.37.0.linux-amd64/prometheus /usr/bin/prometheus
- Create user
sudo useradd --system prometheus
- Make directory
sudo mkdir /etc/prometheus /var/lib/prometheus
- Copy config file
sudo cp prometheus-2.37.0.linux-amd64/prometheus.yml /etc/prometheus/prometheus.yml
- Copy the files
sudo cp -r prometheus-2.37.0.linux-amd64/consoles /etc/prometheus/consoles
- Copy the libs
sudo cp -r prometheus-2.37.0.linux-amd64/console_libraries /etc/prometheus/console_libraries
- Change ownership
sudo chown -R prometheus:prometheus /etc/prometheus
- Change ownership
sudo chown -R prometheus:prometheus /var/lib/prometheus
- Let us create Prometheus as a service
- Create a service file for Prometheus
sudo nano /etc/systemd/system/prometheus.service
Take the content of the file from here: prometheus.service to create the service (a reference sketch of such a unit is shown after these installation steps). - Enable the service
sudo systemctl enable prometheus.service
- Start the service
sudo service prometheus start
- Try accessing Prometheus in the browser at http://VM1_IP_ADDRESS:9090
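- For reference, a minimal prometheus.service unit matching the paths used above might look like the sketch below; take the actual content from the linked file, as the exact flags here are an assumption:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Type=simple
ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target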
- Installing Grafana
- Install required libraries:
sudo apt-get install -y libfontconfig1
- Get the Grafana .deb pkg
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.0.4_amd64.deb
- Install the pkg
sudo dpkg -i grafana-enterprise_9.0.4_amd64.deb
- Start the service
sudo service grafana-server start
- Try accessing Grafana in the browser at http://VM1_IP_ADDRESS:3000
Exercise 14.2 Service Discovery: Configuring the Consul agent and node-exporter (metric scraper) on the System Under Test machines
In this task, we will configure the Consul agent as a client and node-exporter to scrape metrics from the System Under Test machines. The Consul client connects to the Consul server and enables service discovery: the node-exporter services get listed in Consul's service catalog under the datacenter group lab14.
- Create VM2 with name: Lastname_VM2, flavour:
m3.nano
- SSH to the VM2
- Installing Node Exporter - Used to expose the resource utilization metrics of the VMs
- Download the node-exporter pkg
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
- Extract the pkg
tar -xvzf node_exporter-1.3.1.linux-amd64.tar.gz
- Copy to the bin
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/bin/node_exporter
- Make it run as a service
sudo vi /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/bin/node_exporter --collector.systemd --collector.processes

[Install]
WantedBy=multi-user.target
- Enable the service
sudo systemctl enable node_exporter.service
- Start the service
sudo service node_exporter start
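- To quickly verify that node-exporter is serving metrics, you can fetch a few lines from its metrics endpoint (9100 is the default port):
curl -s http://localhost:9100/metrics | head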
- Installing Consul client
- Install Consul
sudo apt-get update && sudo apt-get install consul
- To configure the Consul client and the services it reports to the server, we have to create two files: consul.json (which tells Consul to act as a client) and services.json (which registers the node-exporter service in Consul's service list)
- Create consul.json at /etc/consul.d/
sudo nano /etc/consul.d/consul.json
(PS! Change VM1_IP_ADDRESS in retry_join to your VM1 IP address)
{
  "datacenter": "lab14",
  "data_dir": "/var/consul",
  "encrypt": "3EY9MGe2a1ElUNrsl5E6rHAtJEL7lqe+1CDZAyY8yTU=",
  "log_level": "INFO",
  "enable_syslog": true,
  "enable_debug": true,
  "enable_script_checks": true,
  "server": false,
  "leave_on_terminate": true,
  "rejoin_after_leave": true,
  "retry_join": ["VM1_IP_ADDRESS"]
}
- Create services.json at /etc/consul.d/
sudo nano /etc/consul.d/services.json
{ "services": [ { "name": "node_exporter", "tags": [ "node_exporter" ], "port": 9100, "checks": [ { "tcp": "localhost:9100", "interval": "10s" } ] } ] }
- Start the agent
sudo service consul start
- Go to the Consul server UI at http://VM1_IP_ADDRESS:8500. You will find that the node-exporter service has been registered.
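- Besides the UI, the registration can also be checked against Consul's HTTP catalog API on the server, for example:
curl http://VM1_IP_ADDRESS:8500/v1/catalog/services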
- Configure Prometheus to pull the metrics from the Consul services in the datacenter group
lab14
- NB! SSH to VM1
- Edit prometheus.yml file
sudo nano /etc/prometheus/prometheus.yml
- Append the following inside the
scrape_configs:
block
# node exporter configuration
  - job_name: "node_exporter"
    consul_sd_configs:
      - server: "VM1_IP_ADDRESS:8500"
        datacenter: "lab14"
        services: [node_exporter]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
- Change the server IP address in the configuration (VM1_IP_ADDRESS). Note that
datacenter: "lab14"
indicates that the node_exporter services of this datacenter group are used for pulling the metrics.
- Restart the Prometheus service.
- Now, go to the Prometheus UI at http://VM1_IP_ADDRESS:9090 and check the targets; you should see the node-exporter service.
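- The same target information is also available from Prometheus's HTTP API, for example:
curl http://VM1_IP_ADDRESS:9090/api/v1/targets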
- Configure a Grafana data source to query the metrics from Prometheus and create a dashboard to visualize the data. This is similar to the task you performed in 9.4.6 in Practice 9.
- Make sure you are in VM1
- Go to Grafana and add the Prometheus data source
- Change URL as
http://127.0.0.1:9090
- Click on Test and Save
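- (Optional) The data source can also be created through Grafana's HTTP API instead of the UI; the admin:admin credentials below are Grafana's defaults and are an assumption, so adjust them if you changed the password:
curl -u admin:admin -H "Content-Type: application/json" -X POST http://VM1_IP_ADDRESS:3000/api/datasources -d '{"name":"Prometheus","type":"prometheus","url":"http://127.0.0.1:9090","access":"proxy","isDefault":true}'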
- Add the dashboard to visualize the resource utilization metrics of System Under Machines
- Go to Grafana dashboards at http://VM1_IP_ADDRESS:3000/dashboard/import and, in the import from grafana.com section, enter
1860
as the id and select Prometheus as the data source.
- Go to the dashboard and you should see graphs as shown below. Here, you are visualizing the data of VM2.
- Now, let us create VM3 and monitor its resource metrics. You need to configure the following; the service should then appear registered in Consul, and you can directly visualize its data in Grafana. Note that you do not need to edit/modify prometheus.yml.
- Install and configure the following
- Install Consul
- Configure Consul client
- Install node-exporter
- Go to the Consul UI; you should see the service registered.
- Take a screenshot of the Consul dashboard showing two node-exporter services
- Go to Grafana, select the host, and visualize the resource metrics.
- Take a screenshot of the Grafana dashboard showing VM3 metrics (you can select the host) with graphs showing CPU, system load, etc.
Exercise 14.3 Monitoring with Prometheus
In this task, we will write queries to get data from the Prometheus time series database. We will use the Prometheus Query Language (PromQL) to query the data. Further, we will deploy a Flask application and generate load on the System Under Test machines using Locust.
- Understanding Prometheus Queries
- Go to the Prometheus service at http://VM1_IP_ADDRESS:9090 and click on Graph. You will get the query editor to write PromQL queries. You can read the documentation on PromQL
- Write a query to check whether the target services in Prometheus are up or down.
- Query is
up == 0
and click on Execute. - Targets that are in the UP state (running and accessible to Prometheus) have the value 1, while targets that are DOWN (not running or inaccessible to Prometheus) have the value 0, so this query returns only the targets that are currently down.
- Write a query to compute the current average CPU Utilization of VM2 and VM3.
(1 - avg(irate(node_cpu_seconds_total{mode="idle"}[10m])) by (instance)) * 100
- node_cpu_seconds_total: cumulative CPU time of a node in seconds, broken down per CPU core and mode.
- mode indicates where the CPU time was spent, e.g. idle, iowait, user, system.
- irate: irate(v range-vector) calculates the per-second instant rate of increase of the time series in the range vector, for example, 10m in the query.
- avg ... by (instance): averages the idle-CPU rate over the last 10m across all CPU cores of each instance.
- Here, the query calculates CPU usage as a percentage by subtracting the idle fraction from 1 and multiplying by 100.
- Likewise, memory utilization is
100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))
- Write a query to calculate the average 5-minute load on the node
avg(node_load5{job="node_exporter"}) / count(count(node_cpu_seconds_total{job="node_exporter"}) by (cpu)) * 100
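- These queries can also be issued against Prometheus's HTTP query API, which is handy for scripting; for example, to run the up query from above:
curl 'http://VM1_IP_ADDRESS:9090/api/v1/query?query=up'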
- Now, let us make the systems busy and consume more resources using the Flask app and the Locust load testing tool.
- Clone the flask app on VM2
git clone https://shivupoojar87@bitbucket.org/lab2app/lab14.git
- Go to the folder flaskapp, run the application in the background
nohup flask run --host=0.0.0.0 &
(PS!! Create a venv and install the required packages from requirements.txt. This is similar to Practice Session 1, Task 1.4.) - This Flask application computes the value of Pi to 5000 decimal places when a user request is invoked.
- Now, install Locust
pip install locust
- Run the load test
locust --headless --host http://127.0.0.1:5000 -f locustfile.py
(PS!! locustfile.py is located outside the flaskapp folder.) - This load test runs for 20 seconds.
- Meanwhile, run the Prometheus queries for CPU utilization and the 5-minute average system load in the Prometheus Query Editor.
- Take a screenshot of the Prometheus query execution with a graph showing the results; the instance IP should be visible
Exercise 14.4 Alerting: Configuring Alertmanager to generate alerts and send notifications to a Zulip channel
In this task, we will study alerts and notifications using Prometheus Alertmanager. Prometheus generates alerts based on rule queries, and Alertmanager triggers notifications via webhooks or other endpoints. Here, we will send notifications to a Zulip stream.
- Installing the Alertmanager service on VM1
- Download the alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.22.2/alertmanager-0.22.2.linux-amd64.tar.gz
- Extract the tar file
tar xzf alertmanager-0.22.2.linux-amd64.tar.gz
- Move to required location
sudo mv -v alertmanager-0.22.2.linux-amd64 /opt/alertmanager
- Change the ownership
sudo chown -Rfv root:root /opt/alertmanager
- Create a data directory
sudo mkdir -v /opt/alertmanager/data
- Change the ownership of data directory
sudo chown -Rfv prometheus:prometheus /opt/alertmanager/data
- Create as a service
sudo nano /etc/systemd/system/alertmanager.service
- Add the script
[Unit]
Description=Alertmanager for Prometheus

[Service]
Restart=always
User=prometheus
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --storage.path=/opt/alertmanager/data --log.level=debug
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no

[Install]
WantedBy=multi-user.target
- Reload the daemon
sudo systemctl daemon-reload
- Start the service
sudo systemctl start alertmanager.service
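- To confirm that Alertmanager started correctly, you can check the service status and its health endpoint:
sudo systemctl status alertmanager.service
curl http://localhost:9093/-/healthy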
- Now, let us add the Alertmanager service to prometheus.yml
- Edit prometheus.yml to add the Alertmanager
/etc/prometheus/prometheus.yml
- Append in scrape_configs with following
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['VM1_IP_ADDRESS:9093']
- Update the targets element under alerting,
targets: ['VM1_IP_ADDRESS:9093']
- It looks like
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['VM1_IP_ADDRESS:9093']
- Restart the Prometheus service
- Writing Alert rules
- We will write an alert rule to notify about the status of the targets (services) in Prometheus
- We will use the up expression to find the state of the targets.
- Write the rules
sudo nano /etc/prometheus/rules.yml
(PS!! Replace Lastname with your last name)
groups:
  - name: test
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          name: Lastname_Node_Exporter
        annotations:
          summary: Node Exporter running in {{$labels.instance}} is down.
- The expression up == 0 is evaluated periodically; if it stays true for 1m (the for: duration), the alert fires.
- Add the rules.yml file in the rule_files section of the prometheus.yml configuration file
sudo nano /etc/prometheus/prometheus.yml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
- Restart the Prometheus service
- Now, navigate to the URL http://VM1_IP_ADDRESS:9090/rules from your browser, and you should see the rule InstanceDown that you’ve just added.
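- If the rule does not show up, you can validate the rules file with promtool, which is included in the Prometheus tarball downloaded earlier (the path below assumes it was extracted in your home directory and not copied to /usr/bin):
~/prometheus-2.37.0.linux-amd64/promtool check rules /etc/prometheus/rules.yml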
- To validate, stop the node-exporter service on VM2 for a while.
- Navigate to the URL http://VM1_IP_ADDRESS:9090/alerts, and you should see the state of the alert InstanceDown.
- You can check the logs of the alertmanager using the command
journalctl -u alertmanager --since "1 hour ago"
for debugging.
- Sending alerts to Zulip Stream
- Join the Zulip Stream Cloud Comuting Lab 14-Monitoring.
- Create a Zulip Bot as mentioned in the document.
- Bot Type: Incoming Webhook
- Full Name: "YOUR_Last_name cloud computing bot" (NB! Replace last name with yours)
- Cookie: Lastname
- Click on create
- After creating the bot, note down the API key, which is required to construct the webhook.
- Create the Webhook URL as mentioned in the document
- The URL should look like the following. (PS!! Replace API_KEY with your API key)
https://zulip.cs.ut.ee/api/v1/external/alertmanager?api_key=API_KEY&stream=Cloud%20Comuting%20Lab%2014-Monitoring&name=name&desc=summary
- Configure the webhook in the Alertmanager configuration
sudo nano /opt/alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'ops-zulip'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
  - name: 'ops-zulip'
    webhook_configs:
      - url: "https://zulip.cs.ut.ee/api/v1/external/alertmanager?api_key=API_KEY&stream=Cloud%20Comuting%20Lab%2014-Monitoring&name=name&desc=summary"
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
- Notifications for a group of alerts are sent to Zulip at most once every 5 minutes, as set by
group_interval: 5m
- Restart the alertmanager service
- You can check the status of the alerts in the Prometheus http://VM1_IP_ADDRESS:9090/alerts
- PS!! You can also check the logs of alertmanager for debugging.
- Finally, you should see the notification in Zulip
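- (Optional) To test the Alertmanager-to-Zulip route without stopping a node-exporter, you can fire a synthetic alert with amtool, which ships in the Alertmanager tarball; the alert name and labels below are arbitrary examples:
/opt/alertmanager/amtool alert add TestAlert instance=manual-test --annotation=summary="Test alert fired from amtool" --alertmanager.url=http://localhost:9093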
Exercise 14.5 Some more alert rules.
In this task, you will write some more alert rules for CPU utilization. This document provides a good overview of various alert rules for hosts and nodes.
- Create a rule that fires when the CPU utilization of a node rises above 50%. To create a new alert rule in
/etc/prometheus/rules.yml
, you need to add it under rules.
- Create a new rule named
CPUOverUtilization
, similar to InstanceDown
- For expr: use the same CPU utilization PromQL query from Exercise 14.3, but append
> 50
- Change the summary to: CPU utilization of the node {{$labels.instance}} has reached the threshold. (A sketch of one possible rule is shown below.)
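- A sketch of one possible rule, following the structure of rules.yml above (the rule name and label are illustrative; adapt them to the instructions):
      - alert: CPUOverUtilization
        expr: (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[10m])) by (instance)) * 100 > 50
        for: 1m
        labels:
          name: Lastname_CPU
        annotations:
          summary: CPU utilization of the node {{$labels.instance}} has reached the threshold.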
- Restart Prometheus.
- Generate load on the Flask app using Locust.
- You should see the Zulip notifications. (PS!! You can also check the logs of alertmanager for debugging)
Deliverables:
- Screenshots from Exercise 14.2-2, 14.3-1
- Prometheus configuration file (/etc/prometheus/prometheus.yml) and rules file (/etc/prometheus/rules.yml)
- Delete all the VMs
Possible solutions to potential issues
- If you are facing problems when Alertmanager sends notifications to the Zulip stream, the Alertmanager logs can help with debugging
journalctl -u alertmanager --since "1 hour ago"