Use Prometheus
In this exercise, we will deploy an instance of the proksch/showmetrics image to the cluster. You can inspect the source code of showmetrics to better understand its trivial implementation.
Deploy a Service for Monitoring
Create a new file monitoring.yml somewhere in your file system. We will deploy a Pod of the showmetrics image that has been discussed in the lecture. A Service will be added to make the app discoverable.
# monitoring.yml
apiVersion: v1
kind: Pod
metadata:
  name: showmetrics
  labels:
    app: showmetrics
spec:
  containers:
    - name: showmetrics
      image: proksch/showmetrics:latest
      ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    app: showmetrics-serv
spec:
  type: NodePort
  selector:
    app: showmetrics
  ports:
    - port: 1234
      targetPort: 8080
Add the following ServiceMonitor resource to enable the service discovery feature of Prometheus.
...
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mymonitor
  labels:
    release: myprom
spec:
  selector:
    matchLabels:
      app: showmetrics-serv
  endpoints:
    - interval: 1s
Please note a few things about these resource configurations:
- A ServiceMonitor is not a core type of Kubernetes but is provided as an extension. It does not show up as nicely as the built-in resources in, for example, the dashboard, nor is its documentation on par with the official Kubernetes documentation.
- Find the label definitions in the three resources and their matchers. Please note that only the ServiceMonitor needs to be discovered by Prometheus via the release label; the other labels are used to find the Pod and the Service.
- A poll interval of 1s is very short and is only used to make it easier to see changes within the short timeframe of the exercise. In production, the value should be (much) higher; the challenge is to find a timeframe that makes sense.
- A ServiceMonitor spec can define the targetPort (which must then point to the port of the Pod/Deployment, not to the Service port) and the path (which defaults to /metrics). Both are optional and the default values work in most cases; a sketch with both fields spelled out follows this list.
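For illustration, this is how the endpoints entry of the ServiceMonitor could look with both optional fields spelled out explicitly (a sketch only; the values simply restate the setup above, with targetPort matching the Pod's containerPort and path matching the default):

  endpoints:
    - interval: 1s
      targetPort: 8080
      path: /metrics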
Apply monitoring.yml to your cluster. Make sure that the service is running, then run the following command to create a port-forward to the service; keep this command running.
$ kubectl apply -f monitoring.yml
$ minikube service my-service --url
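To quickly verify that everything is up before continuing, the usual kubectl inspection commands can be used (resource names as defined in monitoring.yml):

$ kubectl get pod showmetrics
$ kubectl get service my-service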
Explore the two trivial pages of the app and reload each page a couple of times. Now visit the /metrics endpoint; you should see the metrics my_random, num_requests, and index_relevance.
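The /metrics endpoint serves the Prometheus exposition format. The relevant part of the output will look roughly like this (all values are illustrative and depend on your interactions):

# TYPE num_requests counter
num_requests{page="index"} 12
# TYPE my_random gauge
my_random 0.42
# TYPE index_relevance gauge
index_relevance 0.8

Once you have visited the second page as well, an additional num_requests series with a different page label will appear.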
If you did not keep it running from the previous exercise, create a port-forward again to expose the Prometheus backend.
$ minikube service myprom-kube-prometheus-sta-prometheus --url
When you access the dashboard, my-service should show up as a monitoring target (Status » Targets).
Query the Collected Metrics with PromQL
Now that the metric collection is working, we can start querying the metrics. For this, open the Prometheus service and access the Graph item in the menu. Switch the view from Table to Graph.
Enter one of the metrics as a query (e.g., my_random) and familiarize yourself with the interactions of the graph view, e.g., zoom in or out to change the timeframe that you are looking at. It is recommended to have the webapp running in a separate browser window and to interact with it every now and then to see changes in the metrics.
You can look at several basic examples of PromQL:
- num_requests: This is the number of requests to the web application. Notice that there are two different curves, which have different page labels.
- num_requests{page="index"}: You can filter the result using one or more labels. In this case, you only see the requests for the index page.
- sum(num_requests): As an alternative, you can add the values of the different labels together.
- rate(num_requests[10s]): A counter is monotonic and can only increase. By providing a timeframe (e.g., [10s]), the data points are converted into a range vector. While such a range vector cannot be plotted directly, utility functions exist to interpret its contents. For example, the rate function will plot the per-second request rate within the timeframe (a combined example follows this list).
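These building blocks can be combined. For example, the following query (an illustrative extra, not part of the exercise) sums the per-second request rates over all page labels into a single curve:

sum(rate(num_requests[1m]))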
Another set of interesting examples to look at comes from the random gauge.
- my_random: By default, the gauge is just wildly jumping between 0 and 1.
- avg_over_time(my_random[5s]): Gauges can be accessed via a timeframe and then need an aggregator function for the range vector. In this example, avg_over_time shows the average. Even within a 5s window, the average over all random numbers is already much more stable.
- avg_over_time(my_random[60s]): If you increase the timeframe even more, the variance in the average becomes smaller and smaller (please note that the axis changes!). If the window is large enough, the average should converge towards 0.5 for every serious random number generator (another range-vector aggregator is sketched after this list).
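Besides the average, other aggregators exist for range vectors. For example, stddev_over_time (an illustrative extra, not part of the exercise) shows how much the raw samples spread within the window; for a uniform random number between 0 and 1, this should hover around 0.29 regardless of the window size:

stddev_over_time(my_random[60s])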
A second example of a gauge is index_relevance, which indicates the percentage of requests that land on the index page.
- index_relevance: This metric is easier to inspect than my_random, as it actually follows your usage of the service.
- deriv(index_relevance[5s]): Instead of looking at the absolute numbers, it is possible to derive the change of the value over a timeframe. This makes it much easier to spot and quantify changes (a related function is sketched below).
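A related function for gauges is delta, which reports the absolute change of the value over the window rather than the per-second slope that deriv computes (again an illustrative extra, not part of the exercise):

delta(index_relevance[1m])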