Kubernetes troubleshooting
Kubernetes cluster
Kubernetes API
Kubernetes stores the cluster state, among other data, in the etcd database. This information can be obtained through the Kubernetes API. To access that data, the commands explained in the following sections must be run on a host where the k8s certificates are configured to allow access to the cluster API.
This section provides troubleshooting ideas and solutions to common problems you might encounter with the Kubernetes cluster. The most relevant tool to learn here is the kubectl command line tool, which lets you control Kubernetes clusters. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. You can specify other kubeconfig files by setting the KUBECONFIG environment variable or by using the --kubeconfig flag. For more information about kubectl syntax, the full set of command operations and details about each command, including all the supported flags and subcommands, see the kubectl reference documentation.
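For example, a different kubeconfig can be selected either for a single command or for the whole shell session (the file path below is an illustrative placeholder, not a file shipped with the platform):

# Use an alternative kubeconfig for a single command
kubectl --kubeconfig=/path/to/other-config config view

# Or export it for the whole shell session
export KUBECONFIG=/path/to/other-config
kubectl config view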
The first step should be checking the cluster API connectivity with the following command:
kubectl config view
If API access is not set up on the VM where the command is executed, the output will contain no data:
quobis@ops-node1-delorean:~$ kubectl config view
apiVersion: v1
clusters: []
contexts: []
current-context: ""
kind: Config
preferences: {}
users: []
The complete output must contain all the data related to the selected cluster:
quobis@OPS-master-delorean:~$ kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://IP:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    namespace: dev-nightly
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
Make sure the selected context is the cluster we want to manage. The row starting with an asterisk is the current context, and the last column shows the default namespace selected (“my-context” and “dev-nightly” in the example below):
console:~/Devops/k8s-installer/ansible$ kubectl config get-contexts

CURRENT   NAME                       CLUSTER                    AUTHINFO                                 NAMESPACE
          akspreGDQuobis             akspreGDQuobis             clusterUser_preGDQuobis_akspreGDQuobis   quobis
          aksproGDQuobis             aksproGDQuobis             clusterUser_proGDQuobis_aksproGDQuobis   wac-quobis
          aksstaging                 aksstaging                 clusterUser_testGDQuobis_aksstaging      stagingquobis
          cluster.quobis.com         cluster.quobis.com         cluster.quobis.com                       quobis-qa
          clusterquobis.quobis.com   clusterquobis.quobis.com   clusterquobis.quobis.com                 kube-system
*         my-context                 kubernetes                 kubernetes-admin                         dev-nightly
          preprod                    preprod                    clusterUser_test_preprod                 quobis
          proGD                      proGD                      clusterUser_pro_gemelo_digital_proGD     wac-quobis
In order to change to another context, use the following command:
kubectl config use-context <desired-context>
In order to change the default namespace, use the following command:
kubectl config set-context --current --namespace=<desired-namespace>
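For example, to work against the cluster and namespace highlighted in the output above:

# Switch to the context marked with the asterisk in the example above
kubectl config use-context my-context

# Make dev-nightly the default namespace for that context
kubectl config set-context --current --namespace=dev-nightly

# Verify the change
kubectl config get-contexts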
Kubernetes cluster state
The main elements in a Kubernetes cluster are the nodes (VMs) where the containers run, so checking the nodes’ state should be the first step once the k8s API is configured. Every cluster node must be in the “Ready” state; any other state indicates a problem.
quobis@OPS-master-delorean:~$ kubectl get nodes -o wide
NAME                        STATUS     ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
k8s-master                  Ready      master   140d   v1.15.5   10.1.72.104   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-16-amd64   docker://18.9.9
kubernetes-janus-edu        NotReady   <none>   140d   v1.15.5   10.1.72.107   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-8-amd64    docker://18.9.9
ops-k3s                     Ready      <none>   107d   v1.15.5   10.1.72.71    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-9-amd64    docker://18.9.9
ops-kubernetes-janus2-edu   Ready      <none>   140d   v1.15.5   10.1.72.108   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-9-amd64    docker://18.9.9
ops-node1-delorean          Ready      <none>   140d   v1.15.5   10.1.72.105   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-12-amd64   docker://18.9.9
ops-node2-delorean          Ready      <none>   140d   v1.15.5   10.1.72.106   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-9-amd64    docker://18.9.9
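As a quick filter, the following one-liner lists only the nodes that are not in the “Ready” state (a minimal sketch assuming the default kubectl column layout; empty output means all nodes are healthy):

kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'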
Kubernetes node troubleshooting
If a node is not in the “Ready” state, start by describing it to see its conditions and recent events:

kubectl describe node <unhealthy_node_name>

Then go through the following checks (a combined sketch is shown after this list):

- Check the state of the VM. The virtual machine must meet the following requirements:
  - It must be powered on.
  - It must have network connectivity to the rest of the cluster machines.
- Check the Kubernetes dependencies running on the VM:
  - The kubelet service must be running: sudo systemctl status kubelet
  - Review the kubelet logs with sudo journalctl -u kubelet (error logs should not appear).
  - Review the iptables rule set with sudo iptables -L; it must contain the Kubernetes entries, and a very large number of entries can introduce additional delays in the system.
- Check the native k8s pods in the kube-system namespace:

  kubectl -n kube-system get pods

  Make sure kube-flannel is deployed and running.
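A minimal sketch combining the checks above (run the first commands on the affected node, and the last one from a host with API access):

# On the affected node: verify that kubelet is active and inspect its recent logs
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "1 hour ago" --no-pager

# On the affected node: review the iptables rule set
sudo iptables -L

# From a host with API access: confirm that the flannel CNI pods are deployed and running
kubectl -n kube-system get pods -o wide | grep flannel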
In order to check the state of the pods in a given namespace, we need to use the following command:

kubectl get pods -n <namespace_name>
In order to get additional information about a specific pod, issue the following command:
$ kubectl -n <namespace> describe pod <pod_name>
In order to restart a pod, we just need to delete it and Kubernetes will recreate it. The command to issue is:
$ kubectl -n <namespace> delete pod <pod_name>
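Before deleting a pod it is usually worth capturing its logs, including those of the previous container instance if it has crashed. A minimal sketch:

# Current logs of the pod
kubectl -n <namespace> logs <pod_name>

# Logs of the previous (crashed) container instance, if any
kubectl -n <namespace> logs <pod_name> --previous

# Delete the pod and watch Kubernetes recreate it
kubectl -n <namespace> delete pod <pod_name>
kubectl -n <namespace> get pods -w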
Kubernetes namespaces and pod list
The following namespaces are always available in our platform after a clean installation:
- “kube-system” namespace: used internally by Kubernetes.
- “monitoring” namespace: hosts the monitoring services (Grafana, Loki, etc.)
- “quobis” namespace: hosts the application services (authentication server, signalling server, databases, etc.)
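Their presence can be verified with:

kubectl get namespaces
# or only the ones listed above
kubectl get namespaces kube-system monitoring quobis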
This is the list of the pods that run in the kube-system namespace, which can be listed with kubectl get pods -n kube-system:

quobis@vm-bar-quo-mas11-1:~$ kubectl get pods -n kube-system
NAME                             READY   STATUS    RESTARTS   AGE
controller-6c4449fc67-pfn48      1/1     Running   1          55d
coredns-5c98db65d4-9r99t         1/1     Running   1          55d
coredns-5c98db65d4-l8v4v         1/1     Running   1          55d
etcd-vm-bar-quo-mas11-1          1/1     Running   9          55d
kube-apiserver-vm                1/1     Running   10         55d
kube-controller-manager-vm       1/1     Running   18         55d
kube-flannel-ds-2z2qm            1/1     Running   0          55d
kube-proxy-9mflq                 1/1     Running   0          55d
kube-scheduler-vm-bar            1/1     Running   14         55d
nginx-ingress-controller         1/1     Running   0          22d
node-exporter-6xm5h              1/1     Running   1          55d
speaker-cxf8r                    1/1     Running   1          55d
Note
There might be more than one pod with the same name but a different suffix (for example, coredns-5c98db65d4-9r99t and coredns-5c98db65d4-l8v4v). These are just replicas of the same image running at the same time in this namespace. You might not find all these services in your deployment, as some of them are optional and/or might be deactivated. For example, the “teal” service is not deployed if the deployment does not require blue/green support.
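Such replicas are managed by a single workload object. As a sketch, in a kubeadm-based cluster the two coredns pods above belong to one Deployment (the k8s-app=kube-dns label used below is the upstream default and is assumed here):

# The coredns Deployment reports how many of its replicas are ready
kubectl -n kube-system get deployment coredns

# The same replica pods can be selected through their label
kubectl -n kube-system get pods -l k8s-app=kube-dns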
This is the list of the pods that run in the monitoring namespace, which can be listed with kubectl get pods -n monitoring:
quobis@vm-bar-quo-mas11-1:~$ kubectl get pods -n monitoring
NAME                         READY   STATUS    RESTARTS   AGE
grafana-54d5f96758-m4gtj     1/1     Running   0          27d
loki-5d8c457556-8qdhb        1/1     Running   0          51d
prometheus-b9d4f8f54-dpxtv   1/1     Running   0          21d
promtail-c626r               1/1     Running   0          20d
| Pod name | Functionality | Deployment considerations |
|---|---|---|
| grafana | Grafana instance for logging and troubleshooting. | No HA required. |
| loki | Loki instance for log collection and management. | No HA required. |
| prometheus | Prometheus instance used as a storage backend for Grafana and Loki. | No HA required. |
| promtail | Agent which ships the contents of local logs to the Grafana Loki instance. | No HA required. |
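If the monitoring dashboards stop updating, a first check is to read the logs of these pods directly with kubectl, using the pod names from the listing above (the random suffixes will differ in each deployment):

# Last lines of the Grafana logs
kubectl -n monitoring logs --tail=100 grafana-54d5f96758-m4gtj

# Follow the Loki logs in real time
kubectl -n monitoring logs -f loki-5d8c457556-8qdhb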
This is the list of the pods that run in the quobis namespace, which can be listed with kubectl get pods -n quobis:
quobis@vm-bar-quo-mas11-1:~$ kubectl get pods -n quobis
NAME                                          READY   STATUS      RESTARTS   AGE
audiomixer-sfu1-0                             1/1     Running     0          22d
audiomixer-sfu2-0                             1/1     Running     0          22d
database-mongo-0                              2/2     Running     0          6d4h
database-mongo-1                              2/2     Running     0          6d4h
database-mongo-2                              2/2     Running     0          6d4h
erebus-787567f497-lnrcr                       1/1     Running     0          6d15h
kapi-759d58f644-m5db2                         2/2     Running     0          22d
keycloak-77f6796b89-9ngw7                     1/1     Running     0          6d16h
kitter-5b5647c888-dbdfw                       1/1     Running     0          13d
kube-state-metrics-bc95796c9-hb42c            2/2     Running     0          43d
message-broker-5675fb65b9-755s6               1/1     Running     0          6d16h
mongodb-exporter-548d4dcddd-7bt67             1/1     Running     0          43d
mongodb-kubernetes-operator-6c6cbdf9bf-8jwg   1/1     Running     0          6d16h
nfs-client-provisioner-5697c964d6-pwgl5       1/1     Running     0          6d16h
nginx-gw-0                                    1/1     Running     0          13d
podhandler-1640265360-4n5vw                   0/1     Running     0          13d
postgresql-5c9f585dc8-bw5t9                   1/1     Running     0          6d16h
qss-audiomixersio-sfu1-86cf778b84-28ct2       1/1     Running     0          6d15h
qss-audiomixersio-sfu2-57454f59d-brswp        1/1     Running     0          6d15h
qss-auth-http-679d4475d7-w65ch                1/1     Running     0          6d15h
qss-calls-574cbbbcdd-j94wx                    1/1     Running     0          6d15h
qss-calltransfer-basic-5c66bf9764-lg2vk       1/1     Running     0          6d15h
qss-conference-state-7b8d4f448-dnbhk          1/1     Running     0          6d15h
qss-invites-rooms-7d7749896f-rhbjw            1/1     Running     0          6d15h
qss-io-websockets-6bcd86694c-pzd8p            1/1     Running     0          6d15h
qss-io-websockets-6bcd86694c-rv8fq            1/1     Running     0          6d15h
qss-log-conference-7fdd67d9f5-gkdrt           1/1     Running     0          6d15h
qss-meeting-basic-6d6789b86-h7642             1/1     Running     0          6d15h
qss-peer-jt-6468689c87-7t5hg                  1/1     Running     0          6d15h
qss-quick-conference-844d87c4f5-4v8j8         1/1     Running     0          6d15h
qss-registry-authenticated-59dc88c8ff-b6w4d   1/1     Running     0          3h14m
qss-resolver-wac-6f4787c4f4-9fcv5             1/1     Running     0          6d15h
qss-rooms-basic-6798d49d48-lmf9v              1/1     Running     0          6d15h
qss-trunk-85dcc5cb9f-jd87x                    1/1     Running     0          3h13m
qss-watchdog-invites-6dc97486f6-qstg2         1/1     Running     0          6d15h
qss-watchdog-registry-675f8b55c9-58bm7        1/1     Running     0          6d15h
sfu-dispatcher-66f58bf667-bll57               1/1     Running     0          8d
sfu-wrapper-sfu1-598d59fd79-mplxb             1/1     Running     4          8d
sfu1-0                                        1/1     Running     0          22d
sfu2-0                                        1/1     Running     0          22d
sip-proxy-1-0                                 1/1     Running     0          22d
sip-proxy-2-0                                 1/1     Running     0          22d
sip-proxy-database-6bf67bc969-q4v2j           1/1     Running     1          55d
sippo-exporter-678ff77cff-4dztt               1/1     Running     0          14d
sippo-maintainer-1640228400-tldsm             0/1     Completed   0          10h
sippo-server-755f6d8f9b-ckgzx                 1/1     Running     7          22d
sippo-server-755f6d8f9b-pzmdh                 1/1     Running     0          6d15h
sippo-server-755f6d8f9b-q7gmr                 1/1     Running     7          22d
sippo-storage-86c7bcb6d4-bhbnf                1/1     Running     0          6d16h
webphone-angular-8677ccbcc8-gkmh9             1/1     Running     0          12d
xmpp-server-58ccd4bb86-b6s4t                  1/1     Running     0          41d
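A quick way to spot unhealthy pods in this namespace is to filter out the healthy ones. Note that pods which are Running but not Ready (for example the podhandler pod above, showing 0/1) are not caught by the phase filter, so the grep variant is also useful:

# Pods whose phase is not Running
kubectl -n quobis get pods --field-selector=status.phase!=Running

# Pods that are neither Running nor Completed
kubectl -n quobis get pods | grep -vE 'Running|Completed'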
The table below gives a brief explanation of each service (identified by its pod name prefix) and its impact in case of an issue with the corresponding pod.
| Pod name | Functionality | Deployment considerations |
|---|---|---|
| audiomixer-sfu1 / audiomixer-sfu2 | Audio MCU, handles the mixing of the audio streams and the media interconnection with the PSTN/NGN. | Deployed in HA. If one instance goes down, all the active calls using it will fail. Subsequent calls will use the active instance. |
| database-mongo | Internal database to store user data. | Deployed in HA as a cluster. |
| erebus | Responsible for listening to WebSocket connections from clients and feeding them into our internal message broker. | No HA, needs to be re-created in case of failure as all the client traffic goes through this element. |
| kapi | Provides a REST API used for the Kubernetes cluster maintenance. | HA not required. No critical impact on the service in case of failure. |
| keycloak | Identity and access management solution. | No HA. Needs to be re-created in case of failure, otherwise users without a session token won't be able to log in. |
| kitter | Provides an API to manage user accounts (CRUD). | No HA. Needs to be re-created in case of failure, otherwise no new users can be added to the system. |
| message-broker | Message broker using the AMQP protocol, based on RabbitMQ. | No HA. Needs to be re-created in case of failure, otherwise there won't be message exchange between services. |
| mongodb-exporter | Exports data from the Mongo database to the monitoring system. | No HA. No impact on the service in case of failure. |
| mongodb-kubernetes-operator | Manages the lifecycle events of a MongoDB cluster by using the K8S API. | No HA. No impact on the service in case of failure. |
| nfs-client-provisioner | Automatic provisioner that uses an existing NFS server to support dynamic provisioning of Kubernetes Persistent Volumes. | No HA. No impact on the service in case of failure. |
| nginx-gw | Web server, reverse proxying, caching and load balancing for HTTP, TCP, and UDP servers. | HA provided by the cluster. |
| postgresql | PostgreSQL database. | No HA. Used by the chat messaging server; voice and video calls are not affected by a potential failure. |
| qss-* | Quobis Signaling Server. | All QSS microservices can be replicated. More info here. |
| | Responsible for post-call processing to create audio and video recordings. | No HA, needs to be re-created in case of failure as calls won't be recorded. |
| sfu-dispatcher | Balances the traffic between the different video SFUs. | No HA, needs to be re-created in case of failure. The call information is kept in the DB to be persistent across restarts. |
| sfu-wrapper | Connector between the SFU dispatcher and the SFU itself. | Supports HA. If one instance goes down, all the active calls using it will fail. Subsequent calls will use the active instance. |
| sfu1 / sfu2 | Video SFU. | Supports HA, which is managed by the SFU wrapper and dispatcher services. |
| sip-proxy | SIP application server that acts as a gateway to an external telephony network. | Supports HA. |
| sip-proxy-database | MySQL database backend for the SIP proxy. | Supports HA if required (not deployed by default). |
| sippo-exporter | Exports aggregated data to the monitoring system. | No HA needed. |
| sippo-maintainer | Runs daily operations for system maintenance. | No HA needed. |
| sippo-server | Main process that orchestrates the system. | HA supported, several replicas can be deployed and traffic is balanced from the message broker. |
| sippo-storage | Storage backend. | No HA needed. |
| teal | Responsible for parsing login responses to support blue/green canary deployments (not needed for standard deployments). | No HA. Needs to be re-created in case of failure, otherwise users without a session cookie won't be able to log in. |
| | Responsible for the implementation of the TURN media relay. | HA supported, several replicas can be deployed. |
| xmpp-server | Messaging server for chat services. | No HA. Needs to be re-created in case of failure, otherwise the chat services won't work. |
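For the services that support several replicas (for example sippo-server or the QSS microservices), the replica count can be adjusted, and a failed non-HA service can be re-created, with standard kubectl commands. A minimal sketch, assuming the Deployment names match the pod name prefixes shown earlier (verify them first with kubectl -n quobis get deployments):

# Scale a replicable service up or down
kubectl -n quobis scale deployment sippo-server --replicas=3

# Re-create the pods of a service that needs to be restarted
kubectl -n quobis rollout restart deployment message-broker

# Follow the rollout until it completes
kubectl -n quobis rollout status deployment message-broker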