Overview

Kubernetes clusters in Spectro Cloud are instantiated from cluster profiles. A cluster definition in Spectro Cloud consists of a reference to a cluster profile, cloud configuration, as well as the cluster size and placement configuration. The following high-level tasks are performed as part of the cluster creation:

  • Orchestration of computing, network, and storage resources on the cloud environments along with the required placement infrastructure.
  • Installation and configuration of various Kubernetes components like Kubelet, API servers, etcd, scheduler, etc.
  • Installation and configuration of the cloud-specific network (CNI) and storage (CSI) plugins.
  • Securing of the cluster infrastructure and configuration in accordance with the relevant OS, Kubernetes, and cloud security best practices.
  • Deployment of additional add-ons such as Prometheus, Permissions Manager, Vault, etc., as specified in the cluster profile.

Images

Spectro Cloud provides VM images for cluster computing infrastructure out of the box for the most recent versions of operating systems such as Ubuntu, CentOS, RHEL. These images are security-hardened based on the respective CIS Benchmarks. Kubernetes components such as kubelet, kubeadm, etc. are pre-installed in these images. The specific image for a cluster is derived from the Operating System and Kubernetes packs configured in the cluster profile.

The out of the box images are hosted either in the public cloud (AWS - AMI, Azure - VHD) or Spectro Cloud's storage repository (vSphere - OVA). During provisioning, the image is copied (if missing) to the desired cloud region or downloaded onto a private datacenter.

Customization

Spectro Cloud provides various forms of customization options for VM images. All these customization options require a private pack registry to be set up with customized OS packs.

Customize out of the box images

Spectro Cloud's out of the box images are security-hardened and have Kubernetes components pre-installed. Additional components can be installed on the images at runtime by defining one or more Ansible roles in the customized OS pack. Spectro Cloud’s orchestration engine creates a new image by instantiating a VM instance from the out of the box image and executing the specified Ansible roles on the instance. This custom image is used for cluster provisioning. The customized image is tagged with a unique signature generated from the pack definition so that it can be reused for future cluster provisioning requests.

Bring your own Image

Users can bring their own OS images by building custom OS packs and providing a reference to the desired image in the pack annotations. These images can be:

  • Pre-configured with all desired OS packages and Kubernetes components for the desired version installed. No Ansible roles are specified in the OS pack. The “skip k8s installation” option in the OS pack is set to true. ("skipK8sInstall": "true")

  • Base images with none of the desired packages or Kubernetes components installed. Ansible roles are specified in the OS pack to install additional packages. The “skip K8s installation” option in the OS pack is set to false ("skipK8sInstall": "false")

  • A combination of the two options above.

Spectro Cloud’s orchestration engine examines the OS pack configuration and determines if customization is required. If customization is required, a VM instance is launched from the image reference provided in the pack. Installation of Kubernetes components and/or execution of additional Ansible roles is performed on this VM instance. A VM image is then created from this instance and used for cluster provisioning. The customized image is tagged with a unique signature generated from the pack definition so that it can be reused for future cluster provisioning requests.
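
For illustration, a custom OS pack corresponding to the second option above (a base image plus Ansible roles) might carry an image reference and role list along these lines. Apart from "skipK8sInstall", which is quoted in the options above, the field names and values here are hypothetical placeholders rather than the exact pack schema:

{
  "annotations": {
    "imageId": "<reference-to-your-base-image>",
    "skipK8sInstall": "false"
  },
  "ansibleRoles": [
    "os-hardening",
    "additional-packages"
  ]
}

With "skipK8sInstall" set to "false", the orchestration engine launches a VM from the referenced image, installs the Kubernetes components, runs the listed roles, and captures the resulting instance as the image used for provisioning.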

Security

Spectro Cloud secures the Kubernetes clusters provisioned by following security best practices at the OS, Kubernetes, and Cloud Infrastructure level.

Operating System

Spectro Cloud’s out of the box VM images are hardened in accordance with the relevant OS CIS benchmark. Additionally, the images are scanned for vulnerabilities regularly and fixes are applied to these images when available from the provider. The upgraded images are released in the form of updated OS packs in Spectro Cloud’s pack registry and are available to the users to apply to their existing clusters at the time convenient to them.

Kubernetes

Kubernetes components and configuration are hardened in accordance with the Kubernetes CIS Benchmark. Spectro Cloud executes kube-bench, a CIS Benchmark scanner by Aqua Security, for every Kubernetes pack to ensure the master and worker nodes are configured securely.

Cloud

Spectro Cloud follows security best practices recommended by the various cloud providers when provisioning and configuring the computing, network, and storage infrastructure for the Kubernetes clusters. These include practices such as isolating master and worker nodes in dedicated network domains and limiting access through the use of constructs like security groups.

The security measures mentioned above are implemented for Spectro Cloud’s out of the box OS and Kubernetes packs. For customized OS or Kubernetes packs, users are responsible for taking the relevant measures to secure their clusters.

Day-2 Management

Spectro Cloud provides several options to manage Kubernetes clusters on an ongoing basis. These include options to scale up/down the cluster by adding/reducing the number of nodes in a node pool, add additional worker pools, resize nodes in a node pool by modifying the instance type, and add additional fault domains such as availability zones to a node pool.

Cluster management operations result in the update of cluster definitions in Spectro Cloud’s database. The updated definition is retrieved by the management agent running in the cluster. The cluster control plane subsequently reconciles the changes to bring associated clusters to their desired state.

Updates

Spectro Cloud supports various kinds of updates to running clusters. Based on the nature of the change, one of the following two mechanisms can be used to apply updates to a cluster.

Cluster update notifications

Fundamental changes to the cluster’s definition, such as upgrading Kubernetes versions, installing new packs, uninstalling previously installed packs, and updating default pack configuration, need to be applied to the cluster profile. These changes result in update notifications on the clusters and can be propagated to the clusters at an appropriate time. The update notification consists of detailed information about all the changes applied to the profile since the initial installation or since the previous update.

Updates to pack configuration may result in a conflict if the configuration was previously overridden in the cluster. The conflicts are presented to the user and need to be resolved before changes are applied to the cluster.

Configuration overrides

Configuration for packs can be updated in a cluster at any time. The changes are applied immediately to the cluster.
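
As an illustration, an override applied to a monitoring pack on a specific cluster might look like the following. The structure and keys shown are purely hypothetical stand-ins for whatever parameters the pack actually exposes:

{
  "pack": "prometheus-operator",
  "values": {
    "retention": "30d",
    "replicas": 2
  }
}

Such overrides take effect on the cluster directly, without requiring a change to the cluster profile.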

Cluster Health

Spectro Cloud monitors cluster infrastructure on a regular basis and reports health on the management console. Overall health is computed based on the following factors:

  • Heartbeat - Spectro Cloud's management agent, which runs inside the cluster, periodically sends a heartbeat to the management console. Missing heartbeats are typically indicative of a problem such as the cluster infrastructure going down, lack of network connectivity, etc. Failure to detect a heartbeat over a period of time results in an unhealthy status for the cluster.
  • Node Conditions - Kubernetes maintains status for each cluster node in the form of conditions such as DiskPressure, MemoryPressure, NetworkUnavailable, etc. Spectro Cloud monitors these conditions and reports them back to the management console (a sample condition is shown after this list). Any node condition indicating a problem with the node results in an unhealthy status for the cluster.
  • Metrics - Spectro Cloud collects usage metrics such as CPU, Disk, Memory, etc. The cluster is marked as unhealthy if the usage metrics cross specific thresholds over a period of time.
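
The node conditions referenced above are standard Kubernetes node status entries. For example, a node under memory pressure reports a condition similar to the following (as returned by the Kubernetes API), and the cluster is marked unhealthy while such a condition persists:

{
  "type": "MemoryPressure",
  "status": "True",
  "reason": "KubeletHasInsufficientMemory",
  "message": "kubelet has insufficient memory available",
  "lastHeartbeatTime": "2021-05-01T10:15:00Z",
  "lastTransitionTime": "2021-05-01T10:10:00Z"
}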

Usage Monitoring

Spectro Cloud continuously monitors cluster resources and reports usage for the cluster as well as for individual nodes. The following metrics are reported on the cluster overview page of the management console. By default, the metrics are displayed only for the worker nodes in the cluster:

  • Cores Used - A cluster-wide breakdown of the number of cores used.
  • CPU Usage - Current CPU usage across all cluster nodes. Additionally, usage over a period of time is presented as a chart.
  • Memory Usage - Current memory usage across all cluster nodes. Additionally, usage over a period of time is presented as a chart.
  • CPU Requests - Total CPUs requested across all pods.
  • Memory Requests - Total memory requested across all pods.

Additionally, usage metrics for individual nodes as well as node conditions are accessible from the node details page.
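
The CPU Requests and Memory Requests figures above are aggregated from the resource requests declared on each pod's containers. For example, a container declaring the following requests contributes 0.5 CPU cores and 256 MiB of memory to the cluster totals:

{
  "resources": {
    "requests": {
      "cpu": "500m",
      "memory": "256Mi"
    }
  }
}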

Application Services

Spectro Cloud enables quick access to the application services installed on the Kubernetes clusters by providing links to them on the management console. These include not only the applications and services deployed through Spectro Cloud but also the ones deployed through any other means. Services are monitored on an ongoing basis, and all services of the type LoadBalancer or NodePort are displayed on the management console.
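
Only services that expose an endpoint outside the cluster are listed in this way. For example, a standard Kubernetes Service of type LoadBalancer such as the following (names and ports are placeholders) would appear on the management console:

{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "my-app"
  },
  "spec": {
    "type": "LoadBalancer",
    "selector": {
      "app": "my-app"
    },
    "ports": [
      {
        "port": 80,
        "targetPort": 8080
      }
    ]
  }
}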

Troubleshooting

Typically when a cluster lifecycle action such as provisioning, upgrade, or deletion runs into a failure, it does not result in an outright error on the cluster. The Spectro Cloud orchestration engine follows the reconciliation pattern wherein the system repeatedly tries to perform various orchestration tasks to bring the cluster to its desired state until it succeeds. Initial cluster provisioning or subsequent updates can run into a variety of issues related to cloud infrastructure availability, lack of resources, networking issues, etc.

Cluster conditions

Spectro Cloud maintains specific milestones in the cluster lifecycle and presents them as “conditions”. Examples include: Creating Infrastructure, Adding Control Plane Node, Customizing Image, etc. The active condition indicates what task Spectro Cloud’s orchestration system is trying to perform. If a task fails, the condition is marked as failed, with relevant error messages. Reconciliation, however, continues behind the scenes, and continuous attempts are made to perform the task. Failed conditions are a great source of information for troubleshooting provisioning issues.

For example, failure to create a virtual machine in AWS due to the vCPU limit being exceeded causes the relevant error to be shown to the end users. They could choose to bring down some workloads in the AWS cloud to free up capacity. The next time the VM creation task is attempted, it succeeds and the condition is marked as a success.

Event Stream

Spectro Cloud maintains an event stream with low-level details of the various orchestration tasks being performed. This event stream is a good source for identifying issues in the event an operation does not complete for a long time.

Due to Spectro Cloud’s reconciliation logic, intermittent errors show up in the event stream. As an example, after a node is launched, errors about being unable to reach the node might show up in the event stream. However, these errors clear up once the node comes up.

Error messages that persist over a long time or errors indicating issues with underlying infrastructure are an indication of a real problem.

Proxy Whitelists

This table lists the proxy requirements for enabling the Spectro Cloud management console.

Top-level Domain             Port    Description
spectrocloud.com             443     For the Spectro Cloud SaaS.
s3.amazonaws.com             443     To access the Spectro Cloud VMware OVA files.
gcr.io                       443     To access the Spectro Cloud image files.
docker.io                    443     To access the Spectro Cloud Pack Registries.
googleapis.com               443     For pulling Spectro Cloud images.
docker.com                   443     To access the Spectro Cloud docker images.
raw.githubusercontent.com    443
projectcalico.org            443     For egress management.
quay.io                      443     Container image registry access.
grafana.com                  443     To provide access to the dashboard metrics.
github.com                   443

Creating Clusters

The detailed steps for creating clusters (as well as scaling, removing, and reconfiguring them) on different cloud service providers are described here.

Overview

The following is the deployment architecture for an AWS cluster.

The Kubernetes nodes are distributed across multiple AZs to achieve high availability. For each of the AZs that you choose, a public subnet and a private subnet are created.

All the control plane nodes and worker nodes are created within the private subnets so there is no direct public access available.

A NAT gateway is created in the public subnet of each AZ to allow nodes in the private subnet to reach the internet or call other AWS services.
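
Concretely, the route table associated with each private subnet sends internet-bound traffic to that AZ's NAT gateway, while the public subnets route through the internet gateway. A private subnet's routes therefore look roughly like the following (the CIDR blocks and IDs are placeholders):

{
  "Routes": [
    {
      "DestinationCidrBlock": "10.0.0.0/16",
      "GatewayId": "local",
      "State": "active"
    },
    {
      "DestinationCidrBlock": "0.0.0.0/0",
      "NatGatewayId": "nat-0123456789abcdef0",
      "State": "active"
    }
  ]
}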

An internet gateway is created for each VPC to allow SSH access to the bastion node for debugging purposes. SSH access to the Kubernetes nodes is available only through the bastion node: because the EC2 instances are created in a private subnet, the bastion node operates as a secure, single point of entry into the infrastructure. The bastion node can be accessed via SSH or RDP.

The API server endpoint is accessible through an ELB, which load balances across all the control plane nodes.

[Figure: AWS cluster deployment architecture]

Prerequisites

Spectro Cloud creates compute, network, and storage resources on AWS during the provisioning of Kubernetes clusters. The following prerequisites should be met for the successful creation of clusters.

Resource Capacity

Sufficient capacity in the desired AWS region should exist for the creation of the following resources:

  • vCPU
  • VPC
  • Elastic IP
  • Internet Gateway
  • Elastic Load Balancers
  • NAT Gateway

AWS Cloud Account Permissions

To create an AWS cloud account, an access key as well as a secret access key are needed.

Ensure that the IAM user or the root user has the following minimum permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:DescribeStacks",
        "cloudformation:UpdateStack",
        "ec2:AllocateAddress",
        "ec2:AssociateRouteTable",
        "ec2:AttachInternetGateway",
        "ec2:AttachVolume",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CreateInternetGateway",
        "ec2:CreateNatGateway",
        "ec2:CreateRoute",
        "ec2:CreateRouteTable",
        "ec2:CreateSecurityGroup",
        "ec2:CreateSubnet",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateVpc",
        "ec2:DeleteInternetGateway",
        "ec2:DeleteNatGateway",
        "ec2:DeleteRoute",
        "ec2:DeleteRouteTable",
        "ec2:DeleteSecurityGroup",
        "ec2:DeleteSubnet",
        "ec2:DeleteTags",
        "ec2:DeleteVolume",
        "ec2:DeleteVpc",
        "ec2:DescribeAccountAttributes",
        "ec2:DescribeAddresses",
        "ec2:DescribeAvailabilityZones",
        "ec2:DescribeImages",
        "ec2:DescribeInstances",
        "ec2:DescribeInternetGateways",
        "ec2:DescribeKeyPairs",
        "ec2:DescribeNatGateways",
        "ec2:DescribeNetworkInterfaceAttribute",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRegions",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcAttribute",
        "ec2:DescribeVpcs",
        "ec2:DetachInternetGateway",
        "ec2:DetachVolume",
        "ec2:DisassociateAddress",
        "ec2:DisassociateRouteTable",
        "ec2:ModifyInstanceAttribute",
        "ec2:ModifyNetworkInterfaceAttribute",
        "ec2:ModifySubnetAttribute",
        "ec2:ModifyVolume",
        "ec2:ModifyVpcAttribute",
        "ec2:ReleaseAddress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "elasticloadbalancing:AddTags",
        "elasticloadbalancing:ApplySecurityGroupsToLoadBalancer",
        "elasticloadbalancing:AttachLoadBalancerToSubnets",
        "elasticloadbalancing:ConfigureHealthCheck",
        "elasticloadbalancing:CreateListener",
        "elasticloadbalancing:CreateLoadBalancer",
        "elasticloadbalancing:CreateLoadBalancerListeners",
        "elasticloadbalancing:CreateLoadBalancerPolicy",
        "elasticloadbalancing:CreateTargetGroup",
        "elasticloadbalancing:DeleteListener",
        "elasticloadbalancing:DeleteLoadBalancer",
        "elasticloadbalancing:DeleteLoadBalancerListeners",
        "elasticloadbalancing:DeleteTargetGroup",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DescribeListeners",
        "elasticloadbalancing:DescribeLoadBalancerAttributes",
        "elasticloadbalancing:DescribeLoadBalancerPolicies",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:DescribeTags",
        "elasticloadbalancing:DescribeTargetGroups",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticloadbalancing:DetachLoadBalancerFromSubnets",
        "elasticloadbalancing:ModifyListener",
        "elasticloadbalancing:ModifyLoadBalancerAttributes",
        "elasticloadbalancing:ModifyTargetGroup",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:RemoveTags",
        "elasticloadbalancing:SetLoadBalancerPoliciesForBackendServer",
        "elasticloadbalancing:SetLoadBalancerPoliciesOfListener",
        "iam:AddRoleToInstanceProfile",
        "iam:AddUserToGroup",
        "iam:AttachGroupPolicy",
        "iam:AttachRolePolicy",
        "iam:CreateGroup",
        "iam:CreateInstanceProfile",
        "iam:CreatePolicy",
        "iam:CreateRole",
        "iam:CreateServiceLinkedRole",
        "iam:CreateUser",
        "iam:DeleteGroup",
        "iam:DeleteInstanceProfile",
        "iam:DeletePolicy",
        "iam:DeleteRole",
        "iam:DetachGroupPolicy",
        "iam:DetachRolePolicy",
        "iam:GetGroup",
        "iam:GetInstanceProfile",
        "iam:GetPolicy",
        "iam:GetRole",
        "iam:GetUser",
        "iam:PassRole",
        "iam:RemoveRoleFromInstanceProfile",
        "iam:RemoveUserFromGroup",
        "kms:DescribeKey",
        "pricing:GetProducts",
        "tag:GetResources"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:CreateSecret",
        "secretsmanager:DeleteSecret",
        "secretsmanager:GetSecretValue",
        "secretsmanager:TagResource"
      ],
      "Resource": [
        "arn:aws:secretsmanager:*:*:secret:aws.cluster.x-k8s.io/*"
      ]
    }
  ]
}

The policy above cannot be used as an inline policy, as it exceeds the 2048 non-whitespace character limit imposed by AWS on inline policies.
The following warning on this policy is expected:

This policy defines some actions, resources, or conditions that do not provide permissions. To grant access, policies must have an action that has an applicable resource or condition.

Create an AWS Cluster

The following steps need to be performed to provision a new AWS cluster:

  • Provide basic cluster information like name, description, and tags. Tags on a cluster are propagated to the VMs deployed on the cloud/data center environments.
  • Select a cluster profile created for AWS cloud. The profile definition will be used as the cluster construction template.
  • Review and override pack parameters as desired. By default, parameters for all packs are set with values defined in the cluster profile.
  • Provide the AWS Cloud account and placement information.
    • Cloud Account - Select the desired cloud account. AWS cloud accounts with AWS credentials need to be pre-configured in project settings.
    • Region - Choose the desired AWS region where you would like the clusters to be provisioned.
    • SSH Key Pair Name - Choose the desired SSH Key pair. SSH key pairs need to be pre-configured on AWS for the desired regions. The selected key is inserted into the VMs provisioned.
    • Static Placement - By default, Spectro Cloud uses dynamic placement, wherein a new VPC with a public and private subnet is created to place cluster resources for every cluster. These resources are fully managed by Spectro Cloud and deleted when the corresponding cluster is deleted. Turn on the Static Placement option if it is desired to place resources into preexisting VPCs and subnets.
  • Configure the master and worker node pools. A master and a worker node pool are configured by default.
    • Name - a descriptive name for the node pool.
    • Size - Number of VMs to be provisioned for the node pool. For the master pool, this number can be 1, 3, or 5.
    • Allow worker capability (master pool) - Select this option for allowing workloads to be provisioned on master nodes.
    • Instance type - Select the AWS instance type to be used for all nodes in the node pool.
    • Availability Zones - Choose one or more availability zones. Spectro Cloud provides fault tolerance to guard against failures like hardware failures, network failures, etc. by provisioning nodes across availability zones if multiple zones are selected.
  • Review settings and deploy the cluster. Provisioning status with details of ongoing provisioning tasks is available to track progress.
New worker pools may be added if it is desired to customize certain worker nodes to run specialized workloads. As an example, the default worker pool may be configured with the ‘m3.large’ instance type for general-purpose workloads, and another worker pool with the instance type ‘g2.2xlarge’ can be configured to run GPU workloads.
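
For example, the two worker pools described above could be summarized as follows. This is an illustrative sketch of the pool settings collected during cluster creation, not an exact configuration format:

{
  "workerPools": [
    {
      "name": "default-worker-pool",
      "size": 3,
      "instanceType": "m3.large",
      "availabilityZones": ["us-west-2a", "us-west-2b"]
    },
    {
      "name": "gpu-worker-pool",
      "size": 2,
      "instanceType": "g2.2xlarge",
      "availabilityZones": ["us-west-2a"]
    }
  ]
}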

Scaling an AWS Cluster

Scaling a cluster up or down involves changing the size of node pools. The following steps need to be performed to scale up/down an AWS cluster.

  • Access the ‘Nodes’ view of the cluster.
  • For the desired node pool change the size directly from the nodes panel or by editing node pool settings.
  • After the node pool configuration is updated, the scale-up/down operation is initiated in a few minutes.
  • Provisioning status is updated with the ongoing progress of the scale operation.
The master node pool may be scaled from 1 to 3 nodes, or from 3 to 5 nodes. Scaling down is not supported for master nodes.

Add an AWS worker pool

The following steps need to be performed to add a new worker node pool to a cluster:

  • Invoke the option to ‘Add Node Pool’ from the cluster’s node information page.

  • Provide node pool settings as follows:

    • A descriptive name for the node pool.
    • The number of nodes in the node pool.
    • One or more availability zones. Nodes are distributed across availability zones when multiple zones are selected.
    • The instance type to be used for all the nodes launched in the node pool.
  • Save the node pool settings. New worker pool settings are updated and cluster updates begin within a few minutes. The provisioning status is updated with the ongoing progress of tasks related to the addition of new nodes.

Remove an AWS worker pool

The following steps need to be performed to remove a worker pool from the cluster:

  • Access the 'Nodes' view of the cluster.
  • Delete the desired worker pool and confirm the deletion.
  • Upon confirmation, the worker node deletion begins in a few minutes.

Reconfigure AWS nodes

The following steps need to be performed to reconfigure worker pool nodes:

  • Access the 'Nodes' view of the cluster.
  • Edit the settings of the desired node pool.
  • Change the instance type to the desired instance type.
  • Save the node pool settings. After the node pool settings are updated, the node pool reconfiguration begins within a few minutes. The older nodes in the node pool are deleted one by one and replaced by new nodes launched with the new instance type configured.
  • The provisioning status is updated with the ongoing progress of nodes being deleted and added.