Kubernetes troubleshooting
Kubernetes cluster
Kubernetes API
Kubernetes stores the cluster state, among other data, in the etcd database. This information can be obtained through the Kubernetes API. To access that data, the commands explained in the following sections must be run on a host where the k8s certificates are configured to allow access to the cluster API.
This section provides troubleshooting ideas and solutions to common problems you might encounter with the Kubernetes cluster. The most relevant tool to learn here is the kubectl command line tool, which lets you control Kubernetes clusters. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. You can specify other kubeconfig files by setting the KUBECONFIG environment variable or by using the --kubeconfig flag. For more information about kubectl syntax, the full set of command operations and details about each command, including all the supported flags and subcommands, see the kubectl reference documentation.
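For example, a different kubeconfig can be selected either for a single command or for the whole shell session (the file path below is an illustrative placeholder, not a file shipped with the platform):

# Use an alternative kubeconfig for a single command
kubectl --kubeconfig=/path/to/other-config config view

# Or export it for the whole shell session
export KUBECONFIG=/path/to/other-config
kubectl config view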
The first step should be checking the cluster API connectivity with the following command:
kubectl config view
If API access is not set up on the VM where the command is executed, the output will contain no data:
quobis@ops-node1-delorean:~$ kubectl config view
apiVersion: v1
clusters: []
contexts: []
current-context: ""
kind: Config
preferences: {}
users: []
The complete output must contain all the data related to the selected cluster:
quobis@OPS-master-delorean:~$ kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://IP:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    namespace: dev-nightly
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
Make sure the selected context is the cluster we want to manage. The row starting with an asterisk is the current context, and the last column shows the default namespace selected (“my-context” and “dev-nightly” in the example below):
console:~/Devops/k8s-installer/ansible$ kubectl config get-contexts

CURRENT   NAME                       CLUSTER                    AUTHINFO                                 NAMESPACE
          akspreGDQuobis             akspreGDQuobis             clusterUser_preGDQuobis_akspreGDQuobis   quobis
          aksproGDQuobis             aksproGDQuobis             clusterUser_proGDQuobis_aksproGDQuobis   wac-quobis
          aksstaging                 aksstaging                 clusterUser_testGDQuobis_aksstaging      stagingquobis
          cluster.quobis.com         cluster.quobis.com         cluster.quobis.com                       quobis-qa
          clusterquobis.quobis.com   clusterquobis.quobis.com   clusterquobis.quobis.com                 kube-system
*         my-context                 kubernetes                 kubernetes-admin                         dev-nightly
          preprod                    preprod                    clusterUser_test_preprod                 quobis
          proGD                      proGD                      clusterUser_pro_gemelo_digital_proGD     wac-quobis
In order to change to another context, use the following command:
kubectl config use-context <desired-context>
In order to change the default namespace, use the following command:
kubectl config set-context --current --namespace=<desired-namespace>
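For example, to work against the cluster and namespace highlighted in the output above:

# Switch to the context marked with the asterisk in the example above
kubectl config use-context my-context

# Make dev-nightly the default namespace for that context
kubectl config set-context --current --namespace=dev-nightly

# Verify the change
kubectl config get-contexts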
Kubernetes cluster state
The main elements in a Kubernetes cluster are the nodes (VMs) where the containers run, so checking the nodes’ state should be the first step once the k8s API is configured. Every cluster node must be in the “Ready” state; any other state indicates a problem.
quobis@OPS-master-delorean:~$ kubectl get nodes -o wide
NAME                        STATUS     ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
k8s-master                  Ready      master   140d   v1.15.5   10.1.72.104   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-16-amd64   docker://18.9.9
kubernetes-janus-edu        NotReady   <none>   140d   v1.15.5   10.1.72.107   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-8-amd64    docker://18.9.9
ops-k3s                     Ready      <none>   107d   v1.15.5   10.1.72.71    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-9-amd64    docker://18.9.9
ops-kubernetes-janus2-edu   Ready      <none>   140d   v1.15.5   10.1.72.108   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-9-amd64    docker://18.9.9
ops-node1-delorean          Ready      <none>   140d   v1.15.5   10.1.72.105   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-12-amd64   docker://18.9.9
ops-node2-delorean          Ready      <none>   140d   v1.15.5   10.1.72.106   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-9-amd64    docker://18.9.9
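As a quick filter, the following one-liner lists only the nodes that are not in the “Ready” state (a minimal sketch assuming the default kubectl column layout; empty output means all nodes are healthy):

kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'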
Kubernetes node troubleshooting
If a node is not in the “Ready” state, start by describing it to see its conditions and recent events:

kubectl describe node <unhealthy_node_name>

Then go through the following checks (a combined sketch is shown after this list):

- Check the state of the VM. The virtual machine must meet the following requirements:
  - It must be powered on.
  - It must have network connectivity to the rest of the cluster machines.
- Check the Kubernetes dependencies running on the VM:
  - The kubelet service must be running: sudo systemctl status kubelet
  - Review the kubelet logs with sudo journalctl -u kubelet (error logs should not appear).
  - Review the iptables rule set with sudo iptables -L; it must contain the Kubernetes entries, and a very large number of entries can introduce additional delays in the system.
- Check the native k8s pods in the kube-system namespace:

  kubectl -n kube-system get pods

  Make sure kube-flannel is deployed and running.
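A minimal sketch combining the checks above (run the first commands on the affected node, and the last one from a host with API access):

# On the affected node: verify that kubelet is active and inspect its recent logs
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "1 hour ago" --no-pager

# On the affected node: review the iptables rule set
sudo iptables -L

# From a host with API access: confirm that the flannel CNI pods are deployed and running
kubectl -n kube-system get pods -o wide | grep flannel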
In order to check the state of the pods in a given namespace, we need to use the following command:

kubectl get pods -n <namespace_name>
In order to get additional information about a specific pod, issue the following command:
$ kubectl -n <namespace> describe pod <pod_name>
In order to restart a pod, we just need to delete it and Kubernetes will recreate it. The command to issue is:
$ kubectl -n <namespace> delete pod <pod_name>
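Before deleting a pod it is usually worth capturing its logs, including those of the previous container instance if it has crashed. A minimal sketch:

# Current logs of the pod
kubectl -n <namespace> logs <pod_name>

# Logs of the previous (crashed) container instance, if any
kubectl -n <namespace> logs <pod_name> --previous

# Delete the pod and watch Kubernetes recreate it
kubectl -n <namespace> delete pod <pod_name>
kubectl -n <namespace> get pods -w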
Kubernetes namespaces and pod list
The following namespaces are always available in our platform after a clean installation:
- “kube-system” namespace: used internally by Kubernetes.
- “monitoring” namespace: hosts the monitoring services (Grafana, Loki, etc.)
- “quobis” namespace: hosts the application services (authentication server, signalling server, databases, etc.)
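Their presence can be verified with:

kubectl get namespaces
# or only the ones listed above
kubectl get namespaces kube-system monitoring quobis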
This is the list of the pods that run in the kube-system namespace, which can be listed with kubectl get pods -n kube-system:

quobis@vm-bar-quo-mas11-1:~$ kubectl get pods -n kube-system
NAME                             READY   STATUS    RESTARTS   AGE
controller-6c4449fc67-pfn48      1/1     Running   1          55d
coredns-5c98db65d4-9r99t         1/1     Running   1          55d
coredns-5c98db65d4-l8v4v         1/1     Running   1          55d
etcd-vm-bar-quo-mas11-1          1/1     Running   9          55d
kube-apiserver-vm                1/1     Running   10         55d
kube-controller-manager-vm       1/1     Running   18         55d
kube-flannel-ds-2z2qm            1/1     Running   0          55d
kube-proxy-9mflq                 1/1     Running   0          55d
kube-scheduler-vm-bar            1/1     Running   14         55d
nginx-ingress-controller         1/1     Running   0          22d
node-exporter-6xm5h              1/1     Running   1          55d
speaker-cxf8r                    1/1     Running   1          55d
Note
There might be more than one pod with the same name but a different suffix (for example, coredns-5c98db65d4-9r99t and coredns-5c98db65d4-l8v4v). These are just replicas of the same image running at the same time in this namespace. You might not find all these services in your deployment, as some of them are optional and/or might be deactivated. For example, the “teal” service is not deployed if the deployment does not require blue/green support.
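Such replicas are managed by a single workload object. As a sketch, in a kubeadm-based cluster the two coredns pods above belong to one Deployment (the k8s-app=kube-dns label used below is the upstream default and is assumed here):

# The coredns Deployment reports how many of its replicas are ready
kubectl -n kube-system get deployment coredns

# The same replica pods can be selected through their label
kubectl -n kube-system get pods -l k8s-app=kube-dns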
This is the list of the pods that run in the monitoring namespace, which can be listed with kubectl get pods -n monitoring:
quobis@vm-bar-quo-mas11-1:~$ kubectl get pods -n monitoring
NAME                         READY   STATUS    RESTARTS   AGE
grafana-54d5f96758-m4gtj     1/1     Running   0          27d
loki-5d8c457556-8qdhb        1/1     Running   0          51d
prometheus-b9d4f8f54-dpxtv   1/1     Running   0          21d
promtail-c626r               1/1     Running   0          20d
| Pod name | Functionality | Deployment considerations |
|---|---|---|
| grafana | Grafana instance for logging and troubleshooting. | No HA required. |
| loki | Loki instance for log collection and management. | No HA required. |
| prometheus | Prometheus instance used as a storage backend for Grafana and Loki. | No HA required. |
| promtail | Agent which ships the contents of local logs to the Grafana Loki instance. | No HA required. |
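If the monitoring dashboards stop updating, a first check is to read the logs of these pods directly with kubectl, using the pod names from the listing above (the random suffixes will differ in each deployment):

# Last lines of the Grafana logs
kubectl -n monitoring logs --tail=100 grafana-54d5f96758-m4gtj

# Follow the Loki logs in real time
kubectl -n monitoring logs -f loki-5d8c457556-8qdhb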
This is the list of the pods that run in the quobis namespace, which can be listed with kubectl get pods -n quobis:
quobis@vm-bar-quo-mas11-1:~$ kubectl get pods -n quobis
NAME                                          READY   STATUS      RESTARTS   AGE
audiomixer-sfu1-0                             1/1     Running     0          22d
audiomixer-sfu2-0                             1/1     Running     0          22d
database-mongo-0                              2/2     Running     0          6d4h
database-mongo-1                              2/2     Running     0          6d4h
database-mongo-2                              2/2     Running     0          6d4h
erebus-787567f497-lnrcr                       1/1     Running     0          6d15h
kapi-759d58f644-m5db2                         2/2     Running     0          22d
keycloak-77f6796b89-9ngw7                     1/1     Running     0          6d16h
kitter-5b5647c888-dbdfw                       1/1     Running     0          13d
kube-state-metrics-bc95796c9-hb42c            2/2     Running     0          43d
message-broker-5675fb65b9-755s6               1/1     Running     0          6d16h
mongodb-exporter-548d4dcddd-7bt67             1/1     Running     0          43d
mongodb-kubernetes-operator-6c6cbdf9bf-8jwg   1/1     Running     0          6d16h
nfs-client-provisioner-5697c964d6-pwgl5       1/1     Running     0          6d16h
nginx-gw-0                                    1/1     Running     0          13d
podhandler-1640265360-4n5vw                   0/1     Running     0          13d
postgresql-5c9f585dc8-bw5t9                   1/1     Running     0          6d16h
qss-audiomixersio-sfu1-86cf778b84-28ct2       1/1     Running     0          6d15h
qss-audiomixersio-sfu2-57454f59d-brswp        1/1     Running     0          6d15h
qss-auth-http-679d4475d7-w65ch                1/1     Running     0          6d15h
qss-calls-574cbbbcdd-j94wx                    1/1     Running     0          6d15h
qss-calltransfer-basic-5c66bf9764-lg2vk       1/1     Running     0          6d15h
qss-conference-state-7b8d4f448-dnbhk          1/1     Running     0          6d15h
qss-invites-rooms-7d7749896f-rhbjw            1/1     Running     0          6d15h
qss-io-websockets-6bcd86694c-pzd8p            1/1     Running     0          6d15h
qss-io-websockets-6bcd86694c-rv8fq            1/1     Running     0          6d15h
qss-log-conference-7fdd67d9f5-gkdrt           1/1     Running     0          6d15h
qss-meeting-basic-6d6789b86-h7642             1/1     Running     0          6d15h
qss-peer-jt-6468689c87-7t5hg                  1/1     Running     0          6d15h
qss-quick-conference-844d87c4f5-4v8j8         1/1     Running     0          6d15h
qss-registry-authenticated-59dc88c8ff-b6w4d   1/1     Running     0          3h14m
qss-resolver-wac-6f4787c4f4-9fcv5             1/1     Running     0          6d15h
qss-rooms-basic-6798d49d48-lmf9v              1/1     Running     0          6d15h
qss-trunk-85dcc5cb9f-jd87x                    1/1     Running     0          3h13m
qss-watchdog-invites-6dc97486f6-qstg2         1/1     Running     0          6d15h
qss-watchdog-registry-675f8b55c9-58bm7        1/1     Running     0          6d15h
sfu-dispatcher-66f58bf667-bll57               1/1     Running     0          8d
sfu-wrapper-sfu1-598d59fd79-mplxb             1/1     Running     4          8d
sfu1-0                                        1/1     Running     0          22d
sfu2-0                                        1/1     Running     0          22d
sip-proxy-1-0                                 1/1     Running     0          22d
sip-proxy-2-0                                 1/1     Running     0          22d
sip-proxy-database-6bf67bc969-q4v2j           1/1     Running     1          55d
sippo-exporter-678ff77cff-4dztt               1/1     Running     0          14d
sippo-maintainer-1640228400-tldsm             0/1     Completed   0          10h
sippo-server-755f6d8f9b-ckgzx                 1/1     Running     7          22d
sippo-server-755f6d8f9b-pzmdh                 1/1     Running     0          6d15h
sippo-server-755f6d8f9b-q7gmr                 1/1     Running     7          22d
sippo-storage-86c7bcb6d4-bhbnf                1/1     Running     0          6d16h
webphone-angular-8677ccbcc8-gkmh9             1/1     Running     0          12d
xmpp-server-58ccd4bb86-b6s4t                  1/1     Running     0          41d
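A quick way to spot unhealthy pods in this namespace is to filter out the healthy ones. Note that pods which are Running but not Ready (for example the podhandler pod above, showing 0/1) are not caught by the phase filter, so the grep variant is also useful:

# Pods whose phase is not Running
kubectl -n quobis get pods --field-selector=status.phase!=Running

# Pods that are neither Running nor Completed
kubectl -n quobis get pods | grep -vE 'Running|Completed'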
The table below gives a brief explanation of each service (identified by its pod name prefix) and its impact in case of an issue with the corresponding pod.
| Pod name | Functionality | Deployment considerations |
|---|---|---|
| audiomixer-sfu1 / audiomixer-sfu2 | Audio MCU, handles the mixing of the audio streams and the media interconnection with the PSTN/NGN. | Deployed in HA. If one instance goes down, all the active calls using it will fail. Subsequent calls will use the active instance. |
| database-mongo | Internal database to store user data. | Deployed in HA as a cluster. |
| erebus | Responsible for listening to WebSocket connections from clients and feeding them into our internal message broker. | No HA, needs to be re-created in case of failure as all the client traffic goes through this element. |
| kapi | Provides a REST API used for the Kubernetes cluster maintenance. | HA not required. No critical impact on the service in case of failure. |
| keycloak | Identity and access management solution. | No HA. Needs to be re-created in case of failure, otherwise users without a session token won't be able to log in. |
| kitter | Provides an API to manage user accounts (CRUD). | No HA. Needs to be re-created in case of failure, otherwise no new users can be added to the system. |
| message-broker | Message broker using the AMQP protocol, based on RabbitMQ. | No HA. Needs to be re-created in case of failure, otherwise there won't be message exchange between services. |
| mongodb-exporter | Exports data from the Mongo database to the monitoring system. | No HA. No impact on the service in case of failure. |
| mongodb-kubernetes-operator | Manages the lifecycle events of a MongoDB cluster by using the K8S API. | No HA. No impact on the service in case of failure. |
| nfs-client-provisioner | Automatic provisioner that uses an existing NFS server to support dynamic provisioning of Kubernetes Persistent Volumes. | No HA. No impact on the service in case of failure. |
| nginx-gw | Web server, reverse proxying, caching and load balancing for HTTP, TCP, and UDP servers. | HA provided by the cluster. |
| postgresql | PostgreSQL database. | No HA. Used by the chat messaging server; voice and video calls are not affected by a potential failure. |
| qss-* | Quobis Signaling Server. | All QSS microservices can be replicated. More info here. |
| | Responsible for post-call processing to create audio and video recordings. | No HA, needs to be re-created in case of failure as calls won't be recorded. |
| sfu-dispatcher | Balances the traffic between the different video SFUs. | No HA, needs to be re-created in case of failure. The call information is kept in the DB to be persistent across restarts. |
| sfu-wrapper | Connector between the SFU dispatcher and the SFU itself. | Supports HA. If one instance goes down, all the active calls using it will fail. Subsequent calls will use the active instance. |
| sfu1 / sfu2 | Video SFU. | Supports HA, which is managed by the SFU wrapper and dispatcher services. |
| sip-proxy | SIP application server that acts as a gateway to an external telephony network. | Supports HA. |
| sip-proxy-database | MySQL database backend for the SIP proxy. | Supports HA if required (not deployed by default). |
| sippo-exporter | Exports aggregated data to the monitoring system. | No HA needed. |
| sippo-maintainer | Runs daily operations for system maintenance. | No HA needed. |
| sippo-server | Main process that orchestrates the system. | HA supported, several replicas can be deployed and traffic is balanced from the message broker. |
| sippo-storage | Storage backend. | No HA needed. |
| teal | Responsible for parsing login responses to support blue/green canary deployments (not needed for standard deployments). | No HA. Needs to be re-created in case of failure, otherwise users without a session cookie won't be able to log in. |
| | Responsible for the implementation of the TURN media relay. | HA supported, several replicas can be deployed. |
| xmpp-server | Messaging server for chat services. | No HA. Needs to be re-created in case of failure, otherwise the chat services won't work. |
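For the services that support several replicas (for example sippo-server or the QSS microservices), the replica count can be adjusted, and a failed non-HA service can be re-created, with standard kubectl commands. A minimal sketch, assuming the Deployment names match the pod name prefixes shown earlier (verify them first with kubectl -n quobis get deployments):

# Scale a replicable service up or down
kubectl -n quobis scale deployment sippo-server --replicas=3

# Re-create the pods of a service that needs to be restarted
kubectl -n quobis rollout restart deployment message-broker

# Follow the rollout until it completes
kubectl -n quobis rollout status deployment message-broker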