Troubleshooting steps for errors during a cluster deployment

The following steps will help you troubleshoot errors that may arise while deploying a cluster.

Scenario - Unable to Upgrade EKS Worker Nodes from AL2 to AL2023

AWS does not provide a direct upgrade path from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023) for EKS worker nodes. This is due to significant changes between AL2 and AL2023, including differences in worker node initialization and bootstrapping prior to joining an EKS cluster. Refer to the AWS documentation for more details.

You can use the following debug steps for existing clusters that were deployed using AL2 worker nodes and need to be upgraded to AL2023 worker nodes.

info

After January 10, 2026, you can only create node pools with the AL2023 AMI type in Palette. If AL2 is needed, consider using custom AMIs. Ensure you have accounted for this change in any of your automation, such as Terraform, API, etc.

Debug Steps

  1. Check the Compatibility Requirements for AL2023 to ensure your applications run correctly on AL2023.

    If your applications are not ready to run on AL2023, continue with the following steps but use a custom AL2 AMI instead of AL2023.

  2. Log in to Palette.

  3. Ensure you are in the correct project scope.

  4. From the left main menu, click Profiles and select the profile used to deploy your EKS cluster.

  5. Create a new version of the cluster profile.

  6. Select the Kubernetes layer of your cluster profile.

  7. On the Edit Pack page, select Values under Pack Details.

  8. Add your IRSA roles to the managedControlPlane.irsaRoles and managedMachinePool.roleAdditionalPolicies sections if they are not already configured.

    Example configuration
    managedControlPlane:
      ...
      irsaRoles:
        - name: "{{.spectro.system.cluster.name}}-irsa-cni"
          policies:
            - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
          serviceAccount:
            name: aws-node
            namespace: kube-system
        - name: "{{.spectro.system.cluster.name}}-irsa-csi" # optional, defaults to audience sts.amazonaws.com
          policies:
            - arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
      ...
    managedMachinePool:
      ...
      roleAdditionalPolicies:
        - "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"

    Refer to the Scenario - PV/PVC Stuck in Pending Status for EKS Cluster Using AL2023 AMI section for further information on why it is necessary to configure IRSA roles for AL2023.

  9. Click Confirm Updates after editing.

  10. Select the Storage layer of your cluster profile.

  11. On the Edit Pack page, select Values under Pack Details.

  12. Use the YAML editor to add an IAM role ARN annotation to the AWS EBS CSI Driver so that the IRSA role is correctly referenced. Replace <aws-account-id> with your AWS account ID.

    Example configuration
    charts:
      ...
      aws-ebs-csi-driver:
        ...
        controller:
          ...
          serviceAccount:
            # A service account will be created for you if set to true. Set to false if you want to use your own.
            create: true
            name: ebs-csi-controller-sa
            annotations: {
              "eks.amazonaws.com/role-arn": "arn:aws:iam::<aws-account-id>:role/{{.spectro.system.cluster.name}}-irsa-csi"
            }
            ## Enable if EKS IAM for SA is used
            # eks.amazonaws.com/role-arn: arn:<partition>:iam::<account>:role/ebs-csi-role
            automountServiceAccountToken: true
  13. Click Confirm Updates after editing.

  14. Click Save Changes on the cluster profile page.

  15. From the left main menu, click Clusters and select your EKS cluster.

  16. Select the Profile tab.

  17. Click the version drop-down for INFRASTRUCTURE LAYERS and select the new version of the cluster profile that you created with these changes.

  18. Click Review & Save. In the pop-up window, click Review changes in Editor.

  19. Review your changes and click Apply Changes when ready.

  20. Select the Nodes tab.

  21. Click New Node Pool.

  22. Fill out the input fields on the Add node pool page according to your requirements.

    Ensure that you select an Amazon Linux 2023 AMI type, or, if you are using a custom AL2 AMI, select Custom AMI and provide the AMI ID.

  23. Click Confirm to create the new node pool.

  24. Wait for the new nodes to be Healthy (green tick) in the Health column and show a Status of Running.
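
    You can also verify the operating system of the new nodes from the command line. As a minimal check, the OS-IMAGE column of the following command's output should report Amazon Linux 2023 for nodes in the new pool.

    Example command
    kubectl get nodes --output wide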

  25. Repeat steps 20-24 to create additional AL2023 node pools as needed. Ensure that the total number of nodes in the AL2023 node pools meets your requirements to replace the AL2 node pools.

  26. On the Nodes tab, click the Edit option for your existing AL2 node pool.

  27. Click the Add New Taint option and add a taint to the AL2 node pool. Use the NoExecute effect to evict workloads from the AL2 nodes.

    Example:

    • Key = migrate-to-al2023
    • Value = true
    • Effect = NoExecute
  28. Click Confirm to update the AL2 node pool.
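
    Optionally, confirm that the taint is now present on the AL2 nodes. The following sketch uses standard kubectl output; replace <al2-node-name> with the name of one of your AL2 nodes.

    Example command
    kubectl describe node <al2-node-name> | grep --after-context=2 Taints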

  29. Wait for the workloads to be evicted from the AL2 nodes and rescheduled on the AL2023 nodes.

    You can check for running pods on the AL2 nodes by issuing the following command. Replace <al2-node-identifier> with the name of one of your AL2 nodes.

    Example command
    kubectl get pods --all-namespaces --output wide --field-selector spec.nodeName=<al2-node-identifier>

    The AL2 nodes will display only system pods, such as aws-node, ebs-csi-controller, ebs-csi-node, kube-proxy, and potentially palette-webhook.

    Example output from AL2 node
    NAMESPACE        NAME                                  READY   STATUS    RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
    kube-system      aws-node-rvkbv                        2/2     Running   0          29m   10.0.206.110   ip-10-11-12-13.ec2.internal    <none>           <none>
    kube-system      ebs-csi-controller-6f6f7d776b-q7hlj   5/5     Running   0          58m   10.0.196.103   ip-10-11-12-13.ec2.internal    <none>           <none>
    kube-system      ebs-csi-node-tbm7t                    3/3     Running   0          29m   10.0.215.130   ip-10-11-12-13.ec2.internal    <none>           <none>
    kube-system      kube-proxy-xqm5w                      1/1     Running   0          29m   10.0.206.110   ip-10-11-12-13.ec2.internal    <none>           <none>
    palette-system   palette-webhook-86c7b5f99d-tdcf7      1/1     Running   0          29m   10.0.205.155   ip-10-11-12-13.ec2.internal    <none>           <none>

    You can compare this output with the output from the same command issued for one of your AL2023 nodes to confirm that the workloads have been successfully migrated.

    Example output from AL2023 node
    NAMESPACE                          NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
    capi-webhook-system                capa-controller-manager-59c947b948-lwwrv                          1/1     Running   0          8m22s   10.0.201.130   ip-20-21-22-23.ec2.internal    <none>           <none>
    capi-webhook-system                capi-controller-manager-5455d67696-pv69l                          1/1     Running   0          8m21s   10.0.215.178   ip-20-21-22-23.ec2.internal    <none>           <none>
    capi-webhook-system                capi-kubeadm-control-plane-controller-manager-67b7d996cd-96lqp   1/1     Running   0          8m21s   10.0.252.246   ip-20-21-22-23.ec2.internal    <none>           <none>
    cert-manager                       cert-manager-webhook-6b5c469577-wwsqb                             1/1     Running   0          8m22s   10.0.234.189   ip-20-21-22-23.ec2.internal    <none>           <none>
    cluster-68ee17be1ccd1304cf843c0f   cluster-management-agent-645f84964f-sw9n4                         1/1     Running   0          8m21s   10.0.212.43    ip-20-21-22-23.ec2.internal    <none>           <none>
    cluster-68ee17be1ccd1304cf843c0f   metrics-server-56594bcd99-mkjlp                                   1/1     Running   0          8m17s   10.0.209.82    ip-20-21-22-23.ec2.internal    <none>           <none>
    cluster-68ee17be1ccd1304cf843c0f   palette-controller-manager-68c698776c-xjnn4                       3/3     Running   0          8m22s   10.0.204.59    ip-20-21-22-23.ec2.internal    <none>           <none>
    kube-system                        aws-node-77hv6                                                    2/2     Running   0          43m     10.0.200.120   ip-20-21-22-23.ec2.internal    <none>           <none>
    kube-system                        coredns-6b9575c64c-trp5v                                          1/1     Running   0          8m22s   10.0.204.97    ip-20-21-22-23.ec2.internal    <none>           <none>
    kube-system                        ebs-csi-controller-6f6f7d776b-q7hlj                               5/5     Running   0          42m     10.0.196.103   ip-20-21-22-23.ec2.internal    <none>           <none>
    kube-system                        ebs-csi-node-mjth6                                                3/3     Running   0          42m     10.0.216.41    ip-20-21-22-23.ec2.internal    <none>           <none>
    kube-system                        kube-proxy-544fg                                                  1/1     Running   0          40m     10.0.200.120   ip-20-21-22-23.ec2.internal    <none>           <none>
    palette-system                     palette-webhook-86c7b5f99d-f8n5v                                  1/1     Running   0          40m     10.0.206.77    ip-20-21-22-23.ec2.internal    <none>           <none>
  30. Once all workloads have been successfully migrated to the AL2023 nodes, you can delete the AL2 node pools.

Scenario - PV/PVC Stuck in Pending Status for EKS Cluster Using AL2023 AMI

After deploying an Amazon EKS cluster using an Amazon Linux 2023 (AL2023) Amazon Machine Image (AMI), PersistentVolumes (PVs) or PersistentVolumeClaims (PVCs) are stuck in a pending status.

Example
NAMESPACE   NAME                                 STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS            VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
wordpress   data-wordpress-wordpress-mariadb-0   Pending                                      spectro-storage-class   <unset>                 16m   Filesystem
wordpress   wordpress-wordpress                  Pending                                      spectro-storage-class   <unset>                 16m   Filesystem

This issue can arise when an add-on pack or Helm chart that requires a PV or PVC is deployed, either to an existing cluster or during the creation of a new cluster.

The PV or PVC provisioning fails because IAM Roles for Service Accounts (IRSA) have not been configured for the AWS CSI packs, such as Amazon EBS CSI, Amazon EFS CSI, and Amazon Cloud Native. IRSA is also required if you use the AWS Application Load Balancer.
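
To confirm that a missing IRSA configuration is the cause, you can inspect the events of a pending claim. This is a standard kubectl check; replace <pvc-name> and <namespace> with your own values. The events may include provisioning failures from the CSI driver caused by missing AWS credentials or authorization errors.

Example command
kubectl describe pvc <pvc-name> --namespace <namespace>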

For instances launched on AL2023, IMDSv2 is enforced by default, and IRSA is the recommended approach for providing IAM permissions to Amazon EBS CSI and Amazon EFS CSI.
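
If you want to check whether IMDSv2 is enforced on a worker node, you can query its metadata options with the AWS CLI. This is a sketch; replace <instance-id> with the EC2 instance ID of one of your worker nodes. A value of "HttpTokens": "required" indicates that IMDSv2 is enforced.

Example command
aws ec2 describe-instances --instance-ids <instance-id> --query 'Reservations[].Instances[].MetadataOptions'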

Debug Steps

  1. Log in to Palette.

  2. Ensure you are in the correct project scope.

  3. From the left main menu, navigate to the Profiles page. Find and click on your cluster profile.

  4. Create a new version of the cluster profile.

  5. Select the Kubernetes layer of your cluster profile.

  6. Use the YAML editor to configure IRSA roles for the managedControlPlane and managedMachinePool resources.

    Example configuration
    managedControlPlane:
      ...
      irsaRoles:
        - name: "{{.spectro.system.cluster.name}}-irsa-cni"
          policies:
            - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
          serviceAccount:
            name: aws-node
            namespace: kube-system
        - name: "{{.spectro.system.cluster.name}}-irsa-csi"
          policies:
            - arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
      ...
    managedMachinePool:
      roleAdditionalPolicies:
        - "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
  7. Click Confirm Updates after editing.

  8. Select the Storage layer of your cluster profile.

  9. Use the YAML editor to add an IAM role ARN annotation to the AWS CSI Driver so that the IRSA role is correctly referenced. Replace <aws-account-id> with your AWS account ID.

    Example configuration
    charts:
      ...
      aws-ebs-csi-driver:
        ...
        controller:
          ...
          serviceAccount:
            # A service account will be created for you if set to true. Set to false if you want to use your own.
            create: true
            name: ebs-csi-controller-sa
            annotations: {
              "eks.amazonaws.com/role-arn": "arn:aws:iam::<aws-account-id>:role/{{.spectro.system.cluster.name}}-irsa-csi"
            }
            ## Enable if EKS IAM for SA is used
            # eks.amazonaws.com/role-arn: arn:<partition>:iam::<account>:role/ebs-csi-role
            automountServiceAccountToken: true
  10. Update the custom labels for the AWS CSI Driver to retrigger the deployment.

    Example
    charts:
      ...
      aws-ebs-csi-driver:
        ...
        customLabels: {
          restart: "true"
        }
  11. Click Confirm Updates after editing.

  12. Click Save Changes on the cluster profile page.

  13. Update your cluster to use the new cluster profile version that you created with these changes. Refer to Update a Cluster for guidance.

  14. Wait for the nodes to be repaved and the AWS CSI Driver to be redeployed.
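
    You can monitor the redeployment of the driver with a rollout status check. This assumes the default controller deployment name and namespace used by the aws-ebs-csi-driver chart.

    Example command
    kubectl rollout status deployment/ebs-csi-controller --namespace kube-system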

  15. Check that the PV or PVC status is Bound by issuing one of the following kubectl commands.

    Example command for PVs
    kubectl get pv --output wide
    Example output for PVs
    NAME        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                           STORAGECLASS            VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
    pv-xyz...   10Gi       RWO            Delete           Bound    wordpress/data-wordpress-wordpress-mariadb-0   spectro-storage-class   <unset>                 16m   Filesystem
    pv-abc...   8Gi        RWO            Delete           Bound    wordpress/wordpress-wordpress                  spectro-storage-class   <unset>                 16m   Filesystem
    Example command for PVCs
    kubectl get pvc --all-namespaces --output wide
    Example output for PVCs
    NAMESPACE   NAME                                 STATUS   VOLUME       CAPACITY   ACCESS MODES   STORAGECLASS            VOLUMEATTRIBUTESCLASS   AGE   VOLUMEMODE
    wordpress   data-wordpress-wordpress-mariadb-0   Bound    pvc-xyz...   10Gi       RWO            spectro-storage-class   <unset>                 16m   Filesystem
    wordpress   wordpress-wordpress                  Bound    pvc-abc...   8Gi        RWO            spectro-storage-class   <unset>                 16m   Filesystem

Scenario - Instances Continuously Delete Every 30 Minutes

An instance is launched and terminated every 30 minutes before its deployment completes, and the Events tab lists errors with the following message:

Failed to update kubeadmControlPlane Connection timeout connecting to Kubernetes Endpoint

This behavior can occur when Kubernetes services for the launched instance fail to start properly. Common reasons a service may fail include:

  • The specified image could not be pulled from the image repository.
  • The cloud-init process failed.

Debug Steps

  1. Initiate an SSH session with the Kubernetes instance using the SSH key provided during provisioning, and log in as user spectro. If you are initiating an SSH session into an installer instance, log in as user ubuntu.

    ssh -i <pathToYourSSHkey> spectro@X.X.X.X
  2. Elevate the user access.

    sudo -i
  3. Verify that the kubelet service is operational.

    systemctl status kubelet.service
  4. If the kubelet service does not work as expected, do the following. If the service operates correctly, you can skip this step.

    1. Navigate to the /var/log/ folder.

      cd /var/log/
    2. Scan the cloud-init-output file for any errors. Take note of any errors and address them.

      cat cloud-init-output.log
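
      The log can be lengthy. As a starting point, you can filter it for common failure keywords; this is only a rough filter and may not surface every error.

      Example command
      grep --ignore-case --extended-regexp 'error|fail|fatal' cloud-init-output.log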
  5. If the kubelet service works as expected, do the following.

    • Export the kubeconfig file.

      export KUBECONFIG=/etc/kubernetes/admin.conf
    • Connect with the cluster's Kubernetes API.

      kubectl get pods --all-namespaces
    • When the connection is established, verify that the pods are in a Running state, and take note of any that are not.

      kubectl get pods -o wide
    • If all the pods are operating correctly, verify their connection with the Palette API.

      • For clusters using Gateway, verify the connection between the Installer and Gateway instance:

        curl -k https://<KUBE_API_SERVER_IP>:6443
      • For Public Clouds that do not use Gateway, verify the connection between the public Internet and the Kubernetes endpoint:

        curl -k https://<KUBE_API_SERVER_IP>:6443
        info

        You can obtain the URL for the Kubernetes API using this command: kubectl cluster-info.

  6. Check the output of the preceding commands for errors. If you cannot resolve the issue, you can open a support ticket. Visit our support page.

Scenario - Deployment Violates Pod Security

Cluster deployment fails with the following message.

Error creating: pods <name of pod> is forbidden: violates PodSecurity "baseline:v<k8s version>": non-default capabilities …

This can happen when the cluster profile uses Kubernetes 1.25 or later and also includes packs that create pods that require elevated privileges.

Debug Steps

To address this issue, you can change the Pod Security Standards of the namespace where the pod is being created.

  1. Log in to Palette.

  2. Navigate to the left Main Menu and click on Profiles.

  3. Select the profile you are using to deploy the cluster. Palette displays the profile stack and details.

  4. Click on the pack layer in the profile stack that contains the pack configuration.

  5. In the pack's YAML file, add a subfield in the pack section called namespaceLabels if it does not already exist.

  6. In the namespaceLabels section, add a line with the name of your namespace as the key and pod-security.kubernetes.io/enforce=privileged,pod-security.kubernetes.io/enforce-version=v<k8s_version> as its value. Replace <k8s_version> with the version of Kubernetes on your cluster, including only the major and minor version after the lowercase letter v. For example, v1.25 or v1.28.

  7. If a key matching your namespace already exists, add the labels to the value corresponding to that key.

warning

We recommend only applying the labels to namespaces where pods fail to be created. If your pack creates multiple namespaces, and you are unsure which ones contain pods that need the elevated privileges, you can access the cluster with the kubectl CLI and use the kubectl get pods --all-namespaces command. This command lists pods and their namespaces so you can identify the pods that are failing at creation.

For guidance in using the CLI, review Access Cluster with CLI. To learn more about kubectl pod commands, refer to the Kubernetes documentation.
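
To narrow the output down to pods that are failing, you can filter out running pods. This relies on the standard status.phase field selector; note that it also excludes pods that completed successfully.

Example command
kubectl get pods --all-namespaces --field-selector status.phase!=Running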

Examples

The following example shows a pack that creates a namespace called "monitoring". In this example, the monitoring namespace does not have any pre-existing labels. You need to add the namespaceLabels line as well as the corresponding key-value pair under it to apply the labels to the monitoring namespace.

pack:
  namespace: "monitoring"

  namespaceLabels:
    "monitoring": "pod-security.kubernetes.io/enforce=privileged,pod-security.kubernetes.io/enforce-version=v1.28"

This second example is similar to the first one. However, in this example, the monitoring key already exists under namespaceLabels, with its original value being "org=spectro,team=dev". Therefore, you add the labels to the existing value:

pack:
  namespace: "monitoring"

  namespaceLabels:
    "monitoring": "org=spectro,team=dev,pod-security.kubernetes.io/enforce=privileged,pod-security.kubernetes.io/enforce-version=v1.28"

Scenario - Nutanix CAPI Deployment Updates

If the internal Cluster API (CAPI) configurations of a Nutanix cluster are updated, the cluster's Kubernetes deployments may encounter issues, resulting in an unhealthy cluster. This can occur when the CAPI changes are incompatible with a newer version of Palette. The following steps will help you troubleshoot and resolve this issue.

Debug Steps

  1. Open a terminal session and ensure you have the kubectl CLI installed. If you do not have the CLI installed, you can download it from the Kubernetes website.

  2. Set up your terminal session to use the kubeconfig file for your Nutanix cluster. You can find the kubeconfig for your cluster in the Palette UI by visiting the Nutanix cluster's details page. Check out the Access Cluster with CLI guide for guidance on how to set up your terminal session to use the kubeconfig file.

  3. To restore the cluster to a healthy state, delete the following three deployments so that Palette can re-create them with the updated machine template. Issue the following commands.

     kubectl delete deployment capi-controller-manager --namespace capi-system
     kubectl delete deployment capi-kubeadm-bootstrap-controller-manager --namespace capi-kubeadm-bootstrap-system
     kubectl delete deployment capi-kubeadm-control-plane-controller-manager --namespace capi-kubeadm-control-plane-system
  4. Palette will automatically re-create the deleted deployments with the updated machine template. You can monitor the progress of the re-creation by checking the status of the deployments using the following command.

    kubectl get deployments --all-namespaces
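
    Optionally, you can wait for each deployment to become available again. The following example waits for the CAPI controller manager; apply the same pattern to the other two deployments.

    Example command
    kubectl wait --for=condition=Available deployment/capi-controller-manager --namespace capi-system --timeout=300s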