Seminar 14: Monitoring Solutions
Goal: Explore the crucial aspects of monitoring and observability within your distributed system.
Introduction
For Seminar 14 of Distributed Systems, we dive into the crucial aspects of monitoring and observability within your distributed system. Your task for this session is to implement instrumentation, adding observability to your services. Building upon the tests created in the previous session, you will observe the behaviour of your system by defining and collecting logs, traces, and metrics. Throughout the development process, you've been diligently adding logs to your services; now it's time to introduce traces and metrics.
Traces are detailed records of request execution paths in a distributed system, providing insights into service interactions, latency, and contextual data. Metrics are quantitative measures used to evaluate system performance and behaviour, including response times, error rates, and resource usage. They should be carefully defined to align with system goals, with clear definitions and thresholds. Tools like OpenTelemetry (https://opentelemetry.io) simplify the collection of logs, traces, and metrics by offering standardized libraries and instrumentation for various languages and frameworks, facilitating comprehensive monitoring and observability. Your task this week is to enable end-to-end observability with OpenTelemetry and related tools.
Task
- The following service should be added to your docker-compose.yaml file.
```yaml
observability:
  image: grafana/otel-lgtm
  ports:
    - "3000:3000"
    - "4317:4317"
    - "4318:4318"
  environment:
    - OTEL_METRIC_EXPORT_INTERVAL=1000
```
- This service is an OpenTelemetry backend (https://github.com/grafana/docker-otel-lgtm), containing several components that let us collect and visualize traces and metrics. Port 4317 is the OpenTelemetry gRPC endpoint and port 4318 the OpenTelemetry HTTP endpoint; no further configuration is required. Port 3000 serves the Grafana UI, where you will visualize your telemetry (http://localhost:3000, default user: admin, password: admin). Read more here: https://grafana.com/blog/2024/03/13/an-opentelemetry-backend-in-a-docker-image-introducing-grafana/otel-lgtm/
- Add the OpenTelemetry API & SDK (https://opentelemetry.io/docs/languages/python/instrumentation/) as dependencies (in requirements.txt) of the selected services in your backend, to enable trace and metric collection. Check the full Python OpenTelemetry SDK documentation: https://opentelemetry-python.readthedocs.io/en/latest/sdk/index.html
- With the new Docker service, the OpenTelemetry Collector is already set up for you; you only need to install the exporters in your Python applications. Check the following link on how to add the exporter dependencies (also in requirements.txt) and see the usage example: https://opentelemetry.io/docs/languages/python/exporters/#otlp-dependencies. Choose between HTTP and gRPC as the communication protocol, follow the examples, and provide your own endpoint for the above observability service (HTTP example: http://observability:4318/v1/metrics and http://observability:4318/v1/traces).
- Add traces and metrics to some of your services, showcasing at least two meaningful examples of each of the following: Span, Counter, UpDownCounter, Histogram, and Asynchronous Gauge.
- Explore the Grafana UI and create a dashboard to visualize the collected traces and metrics. Prometheus is the data source for the metrics, and Tempo for the traces. These tools have default update intervals, so it may take some time for metric and trace values to appear in your dashboards.
- Note: The dashboards are not persisted across Docker restarts! To save your progress, save the dashboard in the Grafana UI (see figure below), go to the dashboard Settings (button next to the Save button), and store the JSON model in a local file in your repository (preferably in the docs folder). If you restart this service, you can import this JSON when creating a new dashboard. During the checkpoint evaluation, you will also start the observability stack from a clean state and import your JSON model.