Troubleshooting steps for errors during a cluster deployment

Use the following steps to troubleshoot errors that arise while deploying a cluster.

Scenario - Instances Continuously Delete Every 30 Minutes

An instance is launched and then terminated every 30 minutes, before its deployment completes, and the Events tab lists errors with the following message:

Failed to update kubeadmControlPlane Connection timeout connecting to Kubernetes Endpoint

This behavior can occur when Kubernetes services for the launched instance fail to start properly. Common reasons a service may fail include:

  • The specified image could not be pulled from the image repository.
  • The cloud init process failed.

Debug Steps

  1. Initiate an SSH session with the Kubernetes instance using the SSH key provided during provisioning, and log in as user spectro. If you are initiating an SSH session into an installer instance, log in as user ubuntu.

    ssh -i <pathToYourSSHkey> spectro@X.X.X.X
  2. Elevate the user access.

    sudo -i
  3. Verify the Kubelet service is operational.

    systemctl status kubelet.service
  4. If the Kubelet service does not work as expected, do the following. If the service operates correctly, you can skip this step.

    1. Navigate to the /var/log/ folder.

      cd /var/log/
    2. Scan the cloud-init-output.log file for errors. Take note of any errors and address them.

      cat cloud-init-output.log
  5. If the kubelet service works as expected, do the following.

    • Export the kubeconfig file.

      export KUBECONFIG=/etc/kubernetes/admin.conf
    • Connect with the cluster's Kubernetes API.

      kubectl get pods --all-namespaces
    • When the connection is established, verify the pods are in a Running state. Take note of any pods that are not in Running state.

      kubectl get pods --all-namespaces -o wide
    • If all the pods are operating correctly, verify their connection with the Palette API.

      • For clusters using Gateway, verify the connection between the Installer and Gateway instance:

        curl -k https://<KUBE_API_SERVER_IP>:6443
      • For Public Clouds that do not use Gateway, verify the connection between the public Internet and the Kube endpoint:

        curl -k https://<KUBE_API_SERVER_IP>:6443

        info

        You can obtain the URL for the Kubernetes API by issuing the kubectl cluster-info command.

  6. Check stdout for errors. If the issue persists, open a support ticket through our support page.
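
The log scan in step 4 and the pod check in step 5 can be narrowed so that only suspicious entries are shown. The following is a minimal sketch, not part of Palette. The function names scan_cloud_init_errors and filter_unhealthy_pods are hypothetical, the search pattern (error, fail, traceback) is an assumption about what a failed cloud-init run might log, and the pod filter assumes the default column layout of kubectl get pods --all-namespaces --no-headers, where the status is the fourth column.

```shell
#!/bin/sh
# Hypothetical helper: print numbered lines of a cloud-init log that
# mention an error. The pattern is an assumption; adjust it as needed.
scan_cloud_init_errors() {
  grep -inE 'error|fail|traceback' "${1:-/var/log/cloud-init-output.log}"
}

# Hypothetical helper: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print only pods whose status is neither Running
# nor Completed, formatted as namespace/name: status.
filter_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage on the instance, after exporting KUBECONFIG as in step 5:
#   scan_cloud_init_errors
#   kubectl get pods --all-namespaces --no-headers | filter_unhealthy_pods
```

An empty result from either helper only means that nothing matched the filter. It does not prove the instance is healthy, so review the full output if the problem persists.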

Scenario - Deployment Violates Pod Security

Cluster deployment fails with the following message.

Error creating: pods <name of pod> is forbidden: violates PodSecurity "baseline:v<k8s version>": non-default capabilities …

This can happen when the cluster profile uses Kubernetes 1.25 or later and also includes packs that create pods that require elevated privileges.

Debug Steps

To address this issue, you can change the Pod Security Standards of the namespace where the pod is being created.

  1. Log in to Palette.

  2. Navigate to the left Main Menu and click on Profiles.

  3. Select the profile you are using to deploy the cluster. Palette displays the profile stack and details.

  4. Click on the pack layer in the profile stack that contains the pack configuration.

  5. In the pack's YAML file, add a subfield in the pack section called namespaceLabels if it does not already exist.

  6. In the namespaceLabels section, add a line with the name of your namespace as the key and pod-security.kubernetes.io/enforce=privileged,pod-security.kubernetes.io/enforce-version=v<k8s_version> as its value. Replace <k8s_version> with the Kubernetes version of your cluster, including only the major and minor version after the lowercase letter v, for example, v1.25 or v1.28.

  7. If a key matching your namespace already exists, add the labels to the value corresponding to that key.
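
Because the label value must carry only the major and minor version, it is easy to paste a full version such as 1.28.3 by mistake. The following is a small sketch of how the value could be derived from a full version string. The function name pss_label_value is hypothetical and is not a Palette or kubectl command.

```shell
#!/bin/sh
# Hypothetical helper: build the namespaceLabels value from a Kubernetes
# version, keeping only the major and minor components.
pss_label_value() {
  version="${1#v}"          # drop a leading "v" if present
  major="${version%%.*}"    # text before the first dot
  rest="${version#*.}"      # text after the first dot
  minor="${rest%%.*}"       # text before the next dot, if any
  printf 'pod-security.kubernetes.io/enforce=privileged,pod-security.kubernetes.io/enforce-version=v%s.%s\n' "$major" "$minor"
}

# Usage:
#   pss_label_value 1.28.3
```

After the cluster applies the updated profile, you can check the labels on the namespace with kubectl get namespace <namespace> --show-labels.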

warning

We recommend only applying the labels to namespaces where pods fail to be created. If your pack creates multiple namespaces, and you are unsure which ones contain pods that need the elevated privileges, you can access the cluster with the kubectl CLI and use the kubectl get pods command. This command lists pods and their namespaces so you can identify the pods that are failing at creation.

For guidance in using the CLI, review Access Cluster with CLI. To learn more about kubectl pod commands, refer to the Kubernetes documentation.

Examples

The following example shows a pack that creates a namespace called "monitoring". In this example, the monitoring namespace does not have any pre-existing labels. You need to add the namespaceLabels line as well as the corresponding key-value pair under it to apply the labels to the monitoring namespace.

pack:
  namespace: "monitoring"

  namespaceLabels:
    "monitoring": "pod-security.kubernetes.io/enforce=privileged,pod-security.kubernetes.io/enforce-version=v1.28"

This second example is similar to the first one. However, in this example, the monitoring key already exists under namespaceLabels, with its original value being "org=spectro,team=dev". Therefore, you add the labels to the existing value:

pack:
  namespace: "monitoring"

  namespaceLabels:
    "monitoring": "org=spectro,team=dev,pod-security.kubernetes.io/enforce=privileged,pod-security.kubernetes.io/enforce-version=v1.28"

Scenario - Nutanix CAPI Deployment Updates

If the internal Cluster API (CAPI) configurations of a Nutanix cluster are updated, the cluster's Kubernetes deployments may encounter issues, resulting in an unhealthy cluster. This can occur when the CAPI changes are incompatible with the newer version of Palette. The following steps will help you troubleshoot and resolve this issue.

Debug Steps

  1. Open a terminal session and ensure you have the kubectl CLI installed. If you do not have the CLI installed, you can download it from the Kubernetes website.

  2. Set up your terminal session to use the kubeconfig file for your Nutanix cluster. You can find the kubeconfig for your cluster in the Palette UI by visiting the Nutanix cluster's details page. Check out the Access Cluster with CLI guide for guidance on how to set up your terminal session to use the kubeconfig file.

  3. To restore the cluster to a healthy state, delete the following three deployments so that Palette can re-create them with the updated machine template.

     kubectl delete deployment capi-controller-manager --namespace capi-system
     kubectl delete deployment capi-kubeadm-bootstrap-controller-manager --namespace capi-kubeadm-bootstrap-system
     kubectl delete deployment capi-kubeadm-control-plane-controller-manager --namespace capi-kubeadm-control-plane-system
  4. Palette will automatically re-create the deleted deployments with the updated machine template. You can monitor the progress of the re-creation by checking the status of the deployments using the following command.

    kubectl get deployments --all-namespaces
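
As an alternative to polling the full deployment list, you can wait on each of the three deployments individually. The following is a minimal sketch under stated assumptions: the function name wait_for_capi_deployments, the KUBECTL override variable, and the 120 second timeout are all illustrative choices and are not part of Palette. The kubectl rollout status command returns an error while a deployment does not yet exist, so you may need to retry until Palette has re-created it.

```shell
#!/bin/sh
# KUBECTL can be overridden (for example, KUBECTL=echo) to preview the
# commands without contacting a cluster. This variable is an assumption
# of this sketch, not a kubectl feature.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: wait for each re-created CAPI deployment to finish
# rolling out. Stops at the first deployment that fails or times out.
wait_for_capi_deployments() {
  while read -r name namespace; do
    "$KUBECTL" rollout status "deployment/$name" \
      --namespace "$namespace" --timeout=120s </dev/null || return 1
  done <<EOF
capi-controller-manager capi-system
capi-kubeadm-bootstrap-controller-manager capi-kubeadm-bootstrap-system
capi-kubeadm-control-plane-controller-manager capi-kubeadm-control-plane-system
EOF
}

# Usage:
#   wait_for_capi_deployments && echo "CAPI deployments are available"
```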