1 - Cluster Provisioning Tools Contract

Cloud provider assumptions on Azure resources that provisioning tools should follow.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Here is a list of Azure resource assumptions that are required for cloud provider Azure:

  • All Azure resources MUST be under the same tenant.
  • All virtual machine names MUST be the same as their hostname.
  • The cluster name set for kube-controller-manager --cluster-name=<cluster-name> MUST NOT end with -internal.

After the cluster is provisioned, cloud provider Azure MAY update the following Azure resources based on workloads:

  • New routes would be added for each node if --configure-cloud-routes is enabled.
  • New LoadBalancers (both external and internal) would be created if they don't exist yet.
  • Virtual machines and virtual machine scale sets would be added to LoadBalancer backend address pools if they haven't been added yet.
  • New public IPs and NSG rules would be added when LoadBalancer typed services are created.

2 - Azure LoadBalancer

Azure LoadBalancer basics.

The way Azure defines a LoadBalancer is different from GCE or AWS. An Azure LB can have multiple frontend IP references, while GCE and AWS only allow one; if you want more, you need multiple LBs. Public IPs are not part of the LB in Azure, and neither is the NSG. However, they cannot be deleted in parallel: a Public IP can only be deleted after the LB's frontend IP reference is removed.

Azure resources such as the LB, Public IP, and NSG sit at the same tier, so circular dependencies between them need to be avoided. In other words, they should only depend on the service state.

By default the basic SKU is selected for a load balancer. Services can be annotated to allow auto selection of available load balancers. Service annotations can also be used to specify which availability sets host the load balancers. Note that in the case of auto selection or specific availability set selection, services are currently not automatically reassigned to an available load balancer when the availability set is lost due to downtime or cluster scale-down.

LoadBalancer annotations

Below is a list of annotations supported for Kubernetes services with type LoadBalancer:

| Annotation | Value | Description | Kubernetes Version |
| ---------- | ----- | ----------- | ------------------ |
| service.beta.kubernetes.io/azure-load-balancer-internal | true or false | Specify whether the load balancer should be internal. It defaults to public if not set. | v1.10.0 and later |
| service.beta.kubernetes.io/azure-load-balancer-internal-subnet | Name of the subnet | Specify which subnet the internal load balancer should be bound to. It defaults to the subnet configured in the cloud config file if not set. | v1.10.0 and later |
| service.beta.kubernetes.io/azure-load-balancer-mode | auto, {vmset-name} | Specify the Azure load balancer selection algorithm based on VM sets (VMSS or VMAS). There are currently three possible load balancer selection modes: default, auto or {vmset-name}. This only works for basic LB (see below for how it works). | v1.10.0 and later |
| service.beta.kubernetes.io/azure-dns-label-name | Name of the PIP DNS label | Specify the DNS label name for the service's public IP address (PIP). If it is set to an empty string, the DNS label on the PIP would be deleted. Because of a bug, before v1.15.10/v1.16.7/v1.17.3, the DNS label on the PIP would also be deleted if the annotation is not specified. | v1.15.0 and later |
| service.beta.kubernetes.io/azure-shared-securityrule | true or false | Specify that the service should be exposed using an Azure security rule that may be shared with another service, trading specificity of rules for an increase in the number of services that can be exposed. This relies on the Azure "augmented security rules" feature. | v1.10.0 and later |
| service.beta.kubernetes.io/azure-load-balancer-resource-group | Name of the PIP resource group | Specify the resource group of the service's PIP when it is not in the same resource group as the cluster. | v1.10.0 and later |
| service.beta.kubernetes.io/azure-allowed-service-tags | List of allowed service tags | Specify a list of allowed service tags separated by commas. | v1.11.0 and later |
| service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout | TCP idle timeout in minutes | Specify the time, in minutes, for TCP connection idle timeouts to occur on the load balancer. Default and minimum value is 4. Maximum value is 30. Must be an integer. | v1.11.4, v1.12.0 and later |
| service.beta.kubernetes.io/azure-pip-name | Name of PIP | Specify the PIP that will be applied to the load balancer. It is used for IPv4 or IPv6 in a single-stack cluster. | v1.16 and later |
| service.beta.kubernetes.io/azure-pip-name-ipv6 | Name of IPv6 PIP | After v1.27, specify the IPv6 PIP that will be applied to the load balancer in a dual-stack cluster. For single-stack clusters, this annotation will be ignored. | v1.27 and later |
| service.beta.kubernetes.io/azure-pip-prefix-id | ID of Public IP Prefix | Specify the Public IP Prefix that will be applied to the load balancer. It is for IPv4 or IPv6 in a single-stack cluster. | v1.21 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/azure-pip-prefix-id-ipv6 | ID of IPv6 Public IP Prefix | After v1.27, specify the IPv6 Public IP Prefix that will be applied to the load balancer in a dual-stack cluster. For single-stack clusters, this annotation will be ignored. | v1.27 and later |
| service.beta.kubernetes.io/azure-pip-tags | Tags of the PIP | Specify the tags of the PIP that will be associated with the load balancer typed service. | v1.20 and later |
| service.beta.kubernetes.io/azure-load-balancer-health-probe-interval | Health probe interval | Refer to the detailed docs here. | v1.21 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe | The minimum number of unhealthy responses of health probe | Refer to the detailed docs here. | v1.21 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path | Request path of the health probe | Refer to the detailed docs here. | v1.20 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/azure-load-balancer-ipv4 | Load balancer IPv4 address | Specify the IPv4 load balancer IP, deprecating Service.spec.loadBalancerIP. | v1.21 and later |
| service.beta.kubernetes.io/azure-load-balancer-ipv6 | Load balancer IPv6 address | Specify the IPv6 load balancer IP, deprecating Service.spec.loadBalancerIP. | v1.21 and later |
| service.beta.kubernetes.io/port_{port}_no_lb_rule | true/false | {port} is the port number in the service. When it is set to true, no lb rule and health probe rule for this port will be generated. Health check services should not be exposed to the public internet (e.g. istio/envoy health check service). | v1.24 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/port_{port}_no_probe_rule | true/false | {port} is the port number in the service. When it is set to true, no health probe rule for this port will be generated. | v1.24 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/port_{port}_health-probe_protocol | Health probe protocol | {port} is the port number in the service. Explicit protocol for the health probe for the service port {port}, overriding port.appProtocol if set. Refer to the detailed docs here. | v1.24 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/port_{port}_health-probe_port | Port number or port name in the service manifest | {port} is the port number in the service. Explicit port for the health probe for the service port {port}, overriding the default value. Refer to the detailed docs here. | v1.24 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/port_{port}_health-probe_interval | Health probe interval | {port} is the port number of the service. Refer to the detailed docs here. | v1.21 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/port_{port}_health-probe_num-of-probe | The minimum number of unhealthy responses of health probe | {port} is the port number of the service. Refer to the detailed docs here. | v1.21 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/port_{port}_health-probe_request-path | Request path of the health probe | {port} is the port number of the service. Refer to the detailed docs here. | v1.20 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/azure-load-balancer-enable-high-availability-ports | Enable high availability ports on internal SLB | HA ports are required when applications require IP fragments. | v1.20 and later |
| service.beta.kubernetes.io/azure-deny-all-except-load-balancer-source-ranges | true or false | Deny all traffic to the service except that from loadBalancerSourceRanges. This is helpful when service.Spec.LoadBalancerSourceRanges is set on an internal load balancer typed service: although the generated NSG adds rules for loadBalancerSourceRanges, the default rule (65000) still allows any vnet traffic, so the whitelist has no effect. This annotation solves that issue. | v1.21 and later |
| service.beta.kubernetes.io/azure-additional-public-ips | External public IPs besides the service's own public IP | Mainly used for a global VIP on the Azure cross-region LoadBalancer. | v1.20 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip | true or false | Disable the Floating IP configuration for the load balancer. | v1.21 and later with out-of-tree cloud provider |
| service.beta.kubernetes.io/azure-pip-ip-tags | Comma-separated key-value pairs a=b,c=d, for example RoutingPreference=Internet | Refer to the doc. | v1.21 and later with out-of-tree cloud provider |

Please note that

  • When loadBalancerSourceRanges have been set on service spec, service.beta.kubernetes.io/azure-allowed-service-tags won’t work because of DROP iptables rules from kube-proxy. The CIDRs from service tags should be merged into loadBalancerSourceRanges to make it work.
  • When allocateLoadBalancerNodePorts is set to false, ensure the following conditions are met:
    • Set externalTrafficPolicy to Local.
    • And enable the FloatingIP feature by either not setting annotation service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip, or setting its value to false.
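
For reference, here is a minimal sketch combining a few of the annotations above: an internal load balancer bound to a specific subnet with a longer TCP idle timeout. The subnet name, timeout value and ports are placeholders for illustration.

apiVersion: v1
kind: Service
metadata:
  name: internal-app
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "apps-subnet"
    service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout: "10"
spec:
  type: LoadBalancer
  selector:
    app: internal-app
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080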

Setting LoadBalancer IP

If you want to specify an IP address for the load balancer, there are two ways:

  • Recommended: Set the Service annotations service.beta.kubernetes.io/azure-load-balancer-ipv4 for an IPv4 address and service.beta.kubernetes.io/azure-load-balancer-ipv6 for an IPv6 address. Dual-stack support will be implemented soon. This is highly recommended for new Services.
  • Deprecated: Set the Service field Service.Spec.LoadBalancerIP. This field is being deprecated following upstream Kubernetes and cannot support dual-stack. However, current usage remains the same and existing Services are expected to work without modification.
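
A minimal sketch of the recommended approach, assuming a public IP with address 20.1.2.3 has already been allocated in the cluster's (or the configured) resource group:

apiVersion: v1
kind: Service
metadata:
  name: fixed-ip-app
  annotations:
    # The address must match a pre-allocated public IP, otherwise an error is reported
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "20.1.2.3"
spec:
  type: LoadBalancer
  selector:
    app: fixed-ip-app
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080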

Load balancer selection modes

This is only useful for clusters with basic SKU load balancers. There are currently three possible load balancer selection modes:

  1. Default mode - the service has no annotation ("service.beta.kubernetes.io/azure-load-balancer-mode"). In this case the load balancer of the primary availability set is selected.
  2. "auto" mode - the service is annotated with the __auto__ value. In this case, the service would be associated with the load balancer that has the minimum number of rules.
  3. "{vmset-name}" mode - the service is annotated with the name of a VMSS/VMAS. In this case, only load balancers of the specified VMSS/VMAS would be selected, and the service would be associated with the one that has the minimum number of rules.

Note that the "auto" mode is valid only when the service is newly created. Changing the annotation value to __auto__ on an existing service is not allowed.
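
For illustration, a sketch of a Service requesting the auto selection mode on a basic-SKU cluster; note the literal annotation value is __auto__ as described above, and the other fields are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: auto-mode-app
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-mode: "__auto__"
spec:
  type: LoadBalancer
  selector:
    app: auto-mode-app
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080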

The selection mode for a load balancer only works for basic load balancers. Below is the detailed information about the allowed number of VMSS/VMAS behind a load balancer.

  • Standard SKU supports any virtual machine in a single virtual network, including a mix of virtual machines, availability sets, and virtual machine scale sets. So all the nodes would be added to the same standard LB backend pool with a max size of 1000.
  • Basic SKU only supports virtual machines in a single availability set, or a virtual machine scale set. Only nodes with the same availability set or virtual machine scale set would be added to the basic LB backend pool.

LoadBalancer SKUs

Azure cloud provider supports both basic and standard SKU load balancers, which can be set via loadBalancerSku option in cloud config file. A list of differences between these two SKUs can be found here.

Note that the public IPs used in load balancer frontend configurations should be the same SKU. That is a standard SKU public IP for standard load balancer and a basic SKU public IP for a basic load balancer.

Azure doesn't support a network interface joining load balancers with different SKUs, hence dynamic migration between them is not supported.

If you do require migration, please delete all Services with type LoadBalancer (or change them to another type) first.

Outbound connectivity

Outbound connectivity is also different between the two load balancer SKUs:

  • For the basic SKU, the outbound connectivity is opened by default. If multiple frontends are set, the outbound IP is selected randomly from them (and is configurable).

  • For the standard SKU, the outbound connectivity is disabled by default. There are two ways to open the outbound connectivity: use a standard public IP with the standard load balancer or define outbound rules.

Standard LoadBalancer

Because the load balancer in a Kubernetes cluster is managed by the Azure cloud provider and may change dynamically (e.g. the public load balancer would be deleted if no services are defined with type LoadBalancer), outbound rules are the recommended path if you want to ensure outbound connectivity for all nodes.

Especially note:

  • In the context of outbound connectivity, a single standalone VM, all the VMs in an Availability Set, and all the instances in a VMSS behave as a group. This means that if a single VM in an Availability Set is associated with a Standard SKU, all VM instances within this Availability Set now behave by the same rules as if they were associated with the Standard SKU, even if an individual instance is not directly associated with it.

  • Public IPs used as instance-level public IPs are mutually exclusive with outbound rules.

Here is the recommended way to define the outbound rules when using separate provisioning tools:

  • Create a separate IP (or multiple IPs for scale) in a standard SKU for outbound rules. Make use of the allocatedOutboundPorts parameter to allocate sufficient ports for your desired scenario scale.
  • Create a separate pool definition for outbound, and ensure all virtual machines or VMSS virtual machines are in this pool. Azure cloud provider will manage the load balancer rules with another pool, so that provisioning tools and the Azure cloud provider won’t affect each other.
  • Define inbound with load balancing rules and inbound NAT rules as needed, and set disableOutboundSNAT to true on the load balancing rule(s). Don't rely on the side effects of these rules for outbound connectivity; it makes things messier than they need to be and limits your options. Use inbound NAT rules to create port-forwarding mappings for SSH access to the VMs rather than burning public IPs per instance.

Exclude nodes from the load balancer

Excluding nodes from Azure LoadBalancer is supported since v1.20.0.

The kubernetes controller manager supports excluding nodes from the load balancer backend pools by enabling the feature gate ServiceNodeExclusion. To exclude nodes from Azure LoadBalancer, label node.kubernetes.io/exclude-from-external-load-balancers=true should be added to the nodes.

  1. To use the feature, the feature gate ServiceNodeExclusion should be on (enabled by default since its beta in v1.19).

  2. The labeled nodes would be excluded from the LB in the next LB reconcile loop, which needs one or more LB typed services to trigger. Basically, users could trigger the update by creating a service. If there are one or more LB typed services existing, no extra operations are needed.

  3. To re-include the nodes, just remove the label and the update would be operated in the next LB reconcile loop.
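
For example, assuming a node named node-1 (a placeholder), the label can be added and removed with kubectl:

# Exclude the node from Azure LoadBalancer backend pools
kubectl label node node-1 node.kubernetes.io/exclude-from-external-load-balancers=true

# Re-include the node by removing the label
kubectl label node node-1 node.kubernetes.io/exclude-from-external-load-balancers-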

Limitations

  • Excluding nodes from LoadBalancer is not supported on AKS managed nodes.

Using SCTP

SCTP protocol services are only supported on internal standard LoadBalancer, hence annotation service.beta.kubernetes.io/azure-load-balancer-internal: "true" should be added to SCTP protocol services. See below for an example:

apiVersion: v1
kind: Service
metadata:
  name: sctpservice
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: sctpserver
  ports:
    - name: sctpserver
      protocol: SCTP
      port: 30102
      targetPort: 30102

Custom Load Balancer health probe

As documented here, Tcp, Http and Https are the three protocols supported by the load balancer service.

Currently, the default protocol of the health probe varies among services with different transport protocols, app protocols, annotations and external traffic policies.

  1. for local services, HTTP and /healthz would be used. The health probe will query NodeHealthPort rather than actual backend service
  2. for cluster TCP services, TCP would be used.
  3. for cluster UDP services, no health probes.

Note: For local services with PLS integration and PLS proxy protocol enabled, the default HTTP+/healthz health probe does not work. Thus the health probe can be customized in the same way as for cluster services to support this scenario. For more details, please check the PLS Integration Note.

Since v1.20, service annotation service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path is introduced to determine the health probe behavior.

  • For clusters <=1.23, spec.ports.appProtocol would only be used as the probe protocol when service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path is also set.
  • For clusters >=1.24, spec.ports.appProtocol would be used as the probe protocol and / would be used as the default probe request path (service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path could be used to change to a different request path).

Note that the request path would be ignored when using TCP or the spec.ports.appProtocol is empty. More specifically:

| loadbalancer sku | externalTrafficPolicy | spec.ports.Protocol | spec.ports.AppProtocol | service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path | LB Probe Protocol | LB Probe Request Path |
| ---------------- | --------------------- | ------------------- | ---------------------- | ------------------------------------------------------------------------ | ----------------- | --------------------- |
| standard | local | any | any | any | http | /healthz |
| standard | cluster | udp | any | any | null | null |
| standard | cluster | tcp | | (ignored) | tcp | null |
| standard | cluster | tcp | tcp | (ignored) | tcp | null |
| standard | cluster | tcp | http/https | | TCP(<=1.23) or http/https(>=1.24) | null(<=1.23) or /(>=1.24) |
| standard | cluster | tcp | http/https | /custom-path | http/https | /custom-path |
| standard | cluster | tcp | unsupported protocol | /custom-path | tcp | null |
| basic | local | any | any | any | http | /healthz |
| basic | cluster | tcp | | (ignored) | tcp | null |
| basic | cluster | tcp | tcp | (ignored) | tcp | null |
| basic | cluster | tcp | http | | TCP(<=1.23) or http/https(>=1.24) | null(<=1.23) or /(>=1.24) |
| basic | cluster | tcp | http | /custom-path | http | /custom-path |
| basic | cluster | tcp | unsupported protocol | /custom-path | tcp | null |

Since v1.21, two service annotations, service.beta.kubernetes.io/azure-load-balancer-health-probe-interval and service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe, have been introduced to customize the configuration of the health probe. If service.beta.kubernetes.io/azure-load-balancer-health-probe-interval is not set, the default value of 5 is applied. If service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe is not set, the default value of 2 is applied. The total probe duration (interval times number of probes) should be less than 120 seconds.
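
For example, a sketch of a Service where the probe runs every 15 seconds and marks the backend down after 4 consecutive failures (60 seconds in total, within the 120-second limit); the name and ports are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: probe-tuned-app
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-health-probe-interval: "15"
    service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "4"
spec:
  type: LoadBalancer
  selector:
    app: probe-tuned-app
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080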

Custom Load Balancer health probe for port

Different ports in a service may require different health probe configurations. This could be because of service design (such as a single health endpoint controlling multiple ports), or Kubernetes features like the MixedProtocolLBService.

The following annotations can be used to customize probe configuration per service port.

| Port-specific annotation | Global probe annotation | Usage |
| ------------------------ | ----------------------- | ----- |
| service.beta.kubernetes.io/port_{port}_no_lb_rule | N/A (no equivalent globally) | If set to true, no lb rules and probe rules will be generated |
| service.beta.kubernetes.io/port_{port}_no_probe_rule | N/A (no equivalent globally) | If set to true, no probe rules will be generated |
| service.beta.kubernetes.io/port_{port}_health-probe_protocol | N/A (no equivalent globally) | Sets the health probe protocol for this service port (e.g. Http, Https, Tcp) |
| service.beta.kubernetes.io/port_{port}_health-probe_port | N/A (no equivalent globally) | Sets the health probe port for this service port (e.g. 15021) |
| service.beta.kubernetes.io/port_{port}_health-probe_request-path | service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path | For Http or Https, sets the health probe request path. Defaults to / |
| service.beta.kubernetes.io/port_{port}_health-probe_num-of-probe | service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe | Number of consecutive probe failures before the port is considered unhealthy |
| service.beta.kubernetes.io/port_{port}_health-probe_interval | service.beta.kubernetes.io/azure-load-balancer-health-probe-interval | The amount of time between probe attempts |

In the following manifest, the probe rule for port httpsserver is different from the one for httpserver because annotations for port httpsserver are specified.

apiVersion: v1
kind: Service
metadata:
  name: appservice
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-health-probe-num-of-probe: "5"
    service.beta.kubernetes.io/port_443_health-probe_num-of-probe: "4"
spec:
  type: LoadBalancer
  selector:
    app: server
  ports:
    - name: httpserver
      protocol: TCP
      port: 80
      targetPort: 30102
    - name: httpsserver
      protocol: TCP
      appProtocol: HTTPS
      port: 443
      targetPort: 30104

In the following manifest, the https port uses a different probe endpoint: an HTTP readiness check against kube-proxy's healthz endpoint at port 10256 on /healthz.

apiVersion: v1
kind: Service
metadata:
  name: istio
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/port_443_health-probe_protocol: "http"
    service.beta.kubernetes.io/port_443_health-probe_port: "10256"
    service.beta.kubernetes.io/port_443_health-probe_request-path: "/healthz"
spec:
  ports:
    - name: https
      protocol: TCP
      port: 443
      targetPort: 8443
      nodePort: 30104
      appProtocol: https
  selector:
    app: istio-ingressgateway
    gateway: istio-ingressgateway
    istio: ingressgateway
  type: LoadBalancer
  sessionAffinity: None
  externalTrafficPolicy: Local
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  allocateLoadBalancerNodePorts: true
  internalTrafficPolicy: Cluster

In the following manifest, the https port uses a different health probe endpoint: an HTTP readiness check at port 30000 on /healthz/ready.

apiVersion: v1
kind: Service
metadata:
  name: istio
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/port_443_health-probe_protocol: "http"
    service.beta.kubernetes.io/port_443_health-probe_port: "30000"
    service.beta.kubernetes.io/port_443_health-probe_request-path: "/healthz/ready"
spec:
  ports:
    - name: https
      protocol: TCP
      port: 443
      targetPort: 8443
      appProtocol: https
  selector:
    app: istio-ingressgateway
    gateway: istio-ingressgateway
    istio: ingressgateway
  type: LoadBalancer
  sessionAffinity: None
  externalTrafficPolicy: Local
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  allocateLoadBalancerNodePorts: true
  internalTrafficPolicy: Cluster

Probing kube-proxy with a shared health probe

This feature is supported since v1.28.5

In externalTrafficPolicy: Local, SLB directly probes kube-proxy – the thing it is directing traffic to. If kube-proxy is experiencing an issue on a given node, this will be visible on the healthCheckNodePort and SLB will stop sending traffic to this node.

In externalTrafficPolicy: Cluster, the probes are directed to the backend application, so SLB can only know kube-proxy's health indirectly: by whether the probes are forwarded to a backend application and answered successfully. This indirection causes confusion and problems in multiple ways.

Since v1.28.5, a shared health probe can be used to probe kube-proxy. This feature is enabled by setting clusterServiceLoadBalancerHealthProbeMode: "shared" in the cloud provider configuration. When this feature is enabled, the health probe is configured to probe kube-proxy on the healthCheckNodePort. This allows SLB to probe kube-proxy directly and thus detect kube-proxy issues more quickly and accurately. The customization options are listed below:

| Configuration | Default | Description |
| ------------- | ------- | ----------- |
| clusterServiceLoadBalancerHealthProbeMode | servicenodeport | Supported values are shared and servicenodeport. All ETP cluster services share one health probe if shared is set. Otherwise, each ETP cluster service has its own health probe. |
| clusterServiceSharedLoadBalancerHealthProbePort | 10256 | Defaults to the kube-proxy healthCheckNodePort. |
| clusterServiceSharedLoadBalancerHealthProbePath | /healthz | Defaults to the kube-proxy health check path. |
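
A minimal sketch of the corresponding cloud provider configuration, assuming the default port and path are kept and all other settings are omitted:

{
  "clusterServiceLoadBalancerHealthProbeMode": "shared",
  "clusterServiceSharedLoadBalancerHealthProbePort": 10256,
  "clusterServiceSharedLoadBalancerHealthProbePath": "/healthz"
}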

When a service is integrated with a private link service and uses the proxy protocol, the health check requests to kube-proxy fail because kube-proxy's health check service does not accept the proxy protocol. A new cloud-node-manager sidecar, health-probe-proxy, is introduced to solve this issue. The sidecar reads the health check requests from the load balancer, parses the proxy protocol header, forwards the request to kube-proxy, and returns the response to the load balancer. If the proxy protocol is not used, the sidecar forwards the request to kube-proxy without any modification. To enable the health-probe-proxy sidecar, turn on cloudNodeManager.enableHealthProbeProxy in the helm chart, or deploy it as a daemonset manually. To read more, check this documentation.

Configure Load Balancer backend

This feature is supported since v1.23.0

The backend pool type can be configured by specifying loadBalancerBackendPoolConfigurationType in the cloud configuration file. There are three possible values:

  1. nodeIPConfiguration (default). In this case we attach nodes to the LB by calling the VMSS/NIC API to associate the corresponding node IP configuration with the LB backend pool.
  2. nodeIP. In this case we attach nodes to the LB by calling the LB API to add the node private IP addresses to the LB backend pool.
  3. podIP (not supported yet). In this case we do not attach nodes to the LB. Instead, pod IPs are added directly to the LB backend pool.

To migrate from one backend pool type to another, just change the value of loadBalancerBackendPoolConfigurationType and re-apply the cloud configuration file. There will be downtime during the migration process.

Migration API from nodeIPConfiguration to nodeIP

This feature is supported since v1.24.0

The migration from nodeIPConfiguration to nodeIP can be done without downtime by configuring "enableMigrateToIPBasedBackendPoolAPI": true in the cloud configuration file.
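
A sketch of the relevant cloud configuration keys when switching to IP-based backend pools and using the no-downtime migration API (other required settings are omitted):

{
  "loadBalancerBackendPoolConfigurationType": "nodeIP",
  "enableMigrateToIPBasedBackendPoolAPI": true
}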


## Load balancer limits

The limits of the load balancer related resources are listed below:

**Standard Load Balancer**

| Resource                                | Limit                                           |
| --------------------------------------- | ----------------------------------------------- |
| Load balancers                          | 1,000                                           |
| Rules per resource                      | 1,500                                           |
| Rules per NIC (across all IPs on a NIC) | 300                                             |
| Frontend IP configurations              | 600                                             |
| Backend pool size                       | 1,000 IP configurations, single virtual network |
| Backend resources per Load Balancer     | 150                                             |
| High-availability ports                 | 1 per internal frontend                         |
| Outbound rules per Load Balancer        | 600                                             |
| Load Balancers per VM                   | 2 (1 Public and 1 internal)                     |

The limit is up to 150 resources, in any combination of standalone virtual machine resources, availability set resources, and virtual machine scale-set placement groups.

**Basic Load Balancer**

| Resource                                | Limit                                          |
| --------------------------------------- | ---------------------------------------------- |
| Load balancers                          | 1,000                                          |
| Rules per resource                      | 250                                            |
| Rules per NIC (across all IPs on a NIC) | 300                                            |
| Frontend IP configurations              | 200                                            |
| Backend pool size                       | 300 IP configurations, single availability set |
| Availability sets per Load Balancer     | 1                                              |
| Load Balancers per VM                   | 2 (1 Public and 1 internal)                    |

3 - Azure Permissions

Permissions required to set up Azure resources.

Azure cloud provider requires a set of permissions to manage the Azure resources. Here is a list of all the permissions and the reasons why they're required.

// Required to create, delete or update LoadBalancer for LoadBalancer service
Microsoft.Network/loadBalancers/delete
Microsoft.Network/loadBalancers/read
Microsoft.Network/loadBalancers/write
Microsoft.Network/loadBalancers/backendAddressPools/read
Microsoft.Network/loadBalancers/backendAddressPools/write
Microsoft.Network/loadBalancers/backendAddressPools/delete

// Required to allow query, create or delete public IPs for LoadBalancer service
Microsoft.Network/publicIPAddresses/delete
Microsoft.Network/publicIPAddresses/read
Microsoft.Network/publicIPAddresses/write

// Required if public IPs from another resource group are used for LoadBalancer service
// This is because of the linked access check when adding the public IP to LB frontendIPConfiguration
Microsoft.Network/publicIPAddresses/join/action

// Required to create or delete security rules for LoadBalancer service
Microsoft.Network/networkSecurityGroups/read
Microsoft.Network/networkSecurityGroups/write

// Required to create, delete or update AzureDisks
Microsoft.Compute/disks/delete
Microsoft.Compute/disks/read
Microsoft.Compute/disks/write
Microsoft.Compute/locations/DiskOperations/read

// Required to create, update or delete storage accounts for AzureFile or AzureDisk
Microsoft.Storage/storageAccounts/delete
Microsoft.Storage/storageAccounts/listKeys/action
Microsoft.Storage/storageAccounts/read
Microsoft.Storage/storageAccounts/write
Microsoft.Storage/operations/read

// Required to create, delete or update routeTables and routes for nodes
Microsoft.Network/routeTables/read
Microsoft.Network/routeTables/routes/delete
Microsoft.Network/routeTables/routes/read
Microsoft.Network/routeTables/routes/write
Microsoft.Network/routeTables/write

// Required to query information for VM (e.g. zones, faultdomain, size and data disks)
Microsoft.Compute/virtualMachines/read

// Required to attach AzureDisks to VM
Microsoft.Compute/virtualMachines/write

// Required to query information for vmssVM (e.g. zones, faultdomain, size and data disks)
Microsoft.Compute/virtualMachineScaleSets/read
Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read
Microsoft.Compute/virtualMachineScaleSets/virtualmachines/instanceView/read

// Required to add VM to LoadBalancer backendAddressPools
Microsoft.Network/networkInterfaces/write
// Required to add vmss to LoadBalancer backendAddressPools
Microsoft.Compute/virtualMachineScaleSets/write
// Required to attach AzureDisks and add vmssVM to LB
Microsoft.Compute/virtualMachineScaleSets/virtualmachines/write
// Required to upgrade VMSS models to latest for all instances
// only needed for Kubernetes 1.11.0-1.11.9, 1.12.0-1.12.8, 1.13.0-1.13.5, 1.14.0-1.14.1
Microsoft.Compute/virtualMachineScaleSets/manualupgrade/action

// Required to query internal IPs and loadBalancerBackendAddressPools for VM
Microsoft.Network/networkInterfaces/read
// Required to query internal IPs and loadBalancerBackendAddressPools for vmssVM
Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces/read
// Required to get public IPs for vmssVM
Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces/ipconfigurations/publicipaddresses/read

// Required to check whether subnet existing for ILB in another resource group
Microsoft.Network/virtualNetworks/read
Microsoft.Network/virtualNetworks/subnets/read

// Required to create, update or delete snapshots for AzureDisk
Microsoft.Compute/snapshots/delete
Microsoft.Compute/snapshots/read
Microsoft.Compute/snapshots/write

// Required to get vm sizes for getting AzureDisk volume limit
Microsoft.Compute/locations/vmSizes/read
Microsoft.Compute/locations/operations/read

// Required to create, update or delete PrivateLinkService for Service
Microsoft.Network/privatelinkservices/delete
Microsoft.Network/privatelinkservices/privateEndpointConnections/delete
Microsoft.Network/privatelinkservices/read
Microsoft.Network/privatelinkservices/write
Microsoft.Network/virtualNetworks/subnets/write
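
In practice, these permissions are usually granted by assigning a role to the identity used by the cloud provider. Below is a minimal sketch of a custom role definition usable with az role definition create; the role name, description, and subscription scope are placeholders, only the load balancer and public IP actions are shown, and the remaining permissions from the list above should be appended to Actions:

{
  "Name": "Kubernetes Cloud Provider",
  "Description": "Permissions required by cloud provider Azure",
  "Actions": [
    "Microsoft.Network/loadBalancers/delete",
    "Microsoft.Network/loadBalancers/read",
    "Microsoft.Network/loadBalancers/write",
    "Microsoft.Network/loadBalancers/backendAddressPools/read",
    "Microsoft.Network/loadBalancers/backendAddressPools/write",
    "Microsoft.Network/loadBalancers/backendAddressPools/delete",
    "Microsoft.Network/publicIPAddresses/delete",
    "Microsoft.Network/publicIPAddresses/read",
    "Microsoft.Network/publicIPAddresses/write"
  ],
  "NotActions": [],
  "AssignableScopes": [
    "/subscriptions/<subscription-id>"
  ]
}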

4 - Use Availability Zones

Use availability zones in provider azure.

Feature Status: Alpha since v1.12.

Kubernetes v1.12 adds support for Azure availability zones (AZ). Nodes in an availability zone will be labeled with failure-domain.beta.kubernetes.io/zone=<region>-<AZ>, and topology-aware provisioning is added for the Azure managed disks storage class.


Pre-requirements

Because only the standard load balancer is supported with AZ, enabling it is a prerequisite for the cluster. It should be configured in the Azure cloud provider configuration file (e.g. /etc/kubernetes/cloud-config/azure.json):

{
    "loadBalancerSku": "standard",
    ...
}

If the topology-aware provisioning feature is used, the feature gate VolumeScheduling should be enabled on master components (e.g. kube-apiserver, kube-controller-manager and kube-scheduler).

Node labels

Both zoned and unzoned nodes are supported, but the value of the node label failure-domain.beta.kubernetes.io/zone differs:

  • For zoned nodes, the value is <region>-<AZ>, e.g. centralus-1.
  • For unzoned nodes, the value is faultDomain, e.g. 0.

For example:

$ kubectl get nodes --show-labels
NAME                STATUS    AGE   VERSION    LABELS
kubernetes-node12   Ready     6m    v1.11      failure-domain.beta.kubernetes.io/region=centralus,failure-domain.beta.kubernetes.io/zone=centralus-1,...

Load Balancer

Since loadBalancerSku has been set to standard in the cloud provider configuration file, a standard load balancer and standard public IPs will be provisioned automatically for services with type LoadBalancer. Both the load balancer and the public IPs are zone redundant.

Managed Disks

Zone-aware and topology-aware provisioning are supported for Azure managed disks. To support these features, a few options are added in AzureDisk storage class:

  • zoned: indicates whether new disks are provisioned with AZ. Default is true.
  • allowedTopologies: indicates which topologies are allowed for topology-aware provisioning. It can only be set if zoned is not false.

StorageClass examples

An example of zone-aware provisioning storage class is:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
  zoned: "true"
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer

Another example of topology-aware provisioning storage class is:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
  labels:
    kubernetes.io/cluster-service: "true"
  name: managed-premium
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
provisioner: kubernetes.io/azure-disk
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - centralus-1
    - centralus-2

PV examples

When the feature gate VolumeScheduling is disabled, no NodeAffinity would be set for the zoned PV:

$ kubectl describe pv
Name:              pvc-d30dad05-9ad8-11e8-94f2-000d3a07de8c
Labels:            failure-domain.beta.kubernetes.io/region=southeastasia
                   failure-domain.beta.kubernetes.io/zone=southeastasia-2
Annotations:       pv.kubernetes.io/bound-by-controller=yes
                   pv.kubernetes.io/provisioned-by=kubernetes.io/azure-disk
                   volumehelper.VolumeDynamicallyCreatedByKey=azure-disk-dynamic-provisioner
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      default
Status:            Bound
Claim:             default/pvc-azuredisk
Reclaim Policy:    Delete
Access Modes:      RWO
Capacity:          5Gi
Node Affinity:
  Required Terms:
    Term 0:        failure-domain.beta.kubernetes.io/region in [southeastasia]
                   failure-domain.beta.kubernetes.io/zone in [southeastasia-2]
Message:
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     k8s-5b3d7b8f-dynamic-pvc-d30dad05-9ad8-11e8-94f2-000d3a07de8c
    DiskURI:      /subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Compute/disks/k8s-5b3d7b8f-dynamic-pvc-d30dad05-9ad8-11e8-94f2-000d3a07de8c
    Kind:         Managed
    FSType:
    CachingMode:  None
    ReadOnly:     false
Events:           <none>

When the feature gate VolumeScheduling is enabled, NodeAffinity will be populated for the zoned PV:

$ kubectl describe pv
Name:              pvc-0284337b-9ada-11e8-a7f6-000d3a07de8c
Labels:            failure-domain.beta.kubernetes.io/region=southeastasia
                   failure-domain.beta.kubernetes.io/zone=southeastasia-2
Annotations:       pv.kubernetes.io/bound-by-controller=yes
                   pv.kubernetes.io/provisioned-by=kubernetes.io/azure-disk
                   volumehelper.VolumeDynamicallyCreatedByKey=azure-disk-dynamic-provisioner
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      default
Status:            Bound
Claim:             default/pvc-azuredisk
Reclaim Policy:    Delete
Access Modes:      RWO
Capacity:          5Gi
Node Affinity:
  Required Terms:
    Term 0:        failure-domain.beta.kubernetes.io/region in [southeastasia]
                   failure-domain.beta.kubernetes.io/zone in [southeastasia-2]
Message:
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     k8s-5b3d7b8f-dynamic-pvc-0284337b-9ada-11e8-a7f6-000d3a07de8c
    DiskURI:      /subscriptions/<subscription-id>/resourceGroups/<rg-name>/providers/Microsoft.Compute/disks/k8s-5b3d7b8f-dynamic-pvc-0284337b-9ada-11e8-a7f6-000d3a07de8c
    Kind:         Managed
    FSType:
    CachingMode:  None
    ReadOnly:     false
Events:           <none>

Since unzoned disks cannot be attached to zoned nodes, NodeAffinity will also be set for them so that they are only scheduled to unzoned nodes:

$ kubectl describe pv pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
Name:              pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
Labels:            <none>
Annotations:       pv.kubernetes.io/bound-by-controller=yes
                   pv.kubernetes.io/provisioned-by=kubernetes.io/azure-disk
                   volumehelper.VolumeDynamicallyCreatedByKey=azure-disk-dynamic-provisioner
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      azuredisk-unzoned
Status:            Bound
Claim:             default/unzoned-pvc
Reclaim Policy:    Delete
Access Modes:      RWO
Capacity:          5Gi
Node Affinity:
  Required Terms:
    Term 0:        failure-domain.beta.kubernetes.io/region in [southeastasia]
                   failure-domain.beta.kubernetes.io/zone in [0]
    Term 1:        failure-domain.beta.kubernetes.io/region in [southeastasia]
                   failure-domain.beta.kubernetes.io/zone in [1]
    Term 2:        failure-domain.beta.kubernetes.io/region in [southeastasia]
                   failure-domain.beta.kubernetes.io/zone in [2]
Message:
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     k8s-5b3d7b8f-dynamic-pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
    DiskURI:      /subscriptions/<subscription>/resourceGroups/<rg-name>/providers/Microsoft.Compute/disks/k8s-5b3d7b8f-dynamic-pvc-bdf93a67-9c45-11e8-ba6f-000d3a07de8c
    Kind:         Managed
    FSType:
    CachingMode:  None
    ReadOnly:     false
Events:           <none>

Appendix

Note that, unlike in most cases, fault domains and availability zones mean different things on Azure:

  • A Fault Domain (FD) is essentially a rack of servers; the servers in it share subsystems like network, power, cooling etc.
  • Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking.

An Availability Zone in an Azure region is a combination of a fault domain and an update domain (similar to an FD, but for updates: when upgrading a deployment, it is carried out one update domain at a time). For example, if you create three or more VMs across three zones in an Azure region, your VMs are effectively distributed across three fault domains and three update domains.

Reference

See design docs for AZ in KEP for Azure availability zones.

5 - Support Multiple Node Types

Node type description in provider azure.

Kubernetes v1.26 adds support for using Azure VMSS Flex VMs as cluster nodes. In addition, mixing different VM types in the same cluster is also supported. There is no API change expected from the end users' perspective when operating the Kubernetes cluster; however, users can choose to specify the VM type when configuring the cloud provider to further optimize the API calls in the Cloud Controller Manager. Below are the suggested configurations based on the cluster modes; a configuration sketch follows the table.

| Node Type | Configurations | Comments |
| --------- | -------------- | -------- |
| Standalone VMs or AvailabilitySet VMs | vmType == standard | This bypasses the node type check and assumes all the nodes in the cluster are standalone VMs / AvailabilitySet VMs. This should only be used for pure standalone VM / AvailabilitySet VM clusters. |
| VMSS Uniform VMs | vmType == vmss && DisableAvailabilitySetNodes == true && EnableVmssFlexNodes == false | This bypasses the node type check and assumes all the nodes in the cluster are VMSS Uniform VMs. This should only be used for pure VMSS Uniform VM clusters. |
| VMSS Flex VMs | vmType == vmssflex | This bypasses the node type check and assumes all the nodes in the cluster are VMSS Flex VMs. This should only be used for pure VMSS Flex VM clusters (since v1.26.0). |
| Standalone VMs, AvailabilitySet VMs, VMSS Uniform VMs and VMSS Flex VMs | vmType == vmss && (DisableAvailabilitySetNodes == false \|\| EnableVmssFlexNodes == true) | This should be used for clusters whose nodes are a mix of standalone VMs, AvailabilitySet VMs, VMSS Flex VMs (since v1.26.0) and VMSS Uniform VMs. The node type is checked and the corresponding cloud provider API is called based on the node type. |
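
For example, a cluster mixing VMSS Uniform, VMSS Flex and AvailabilitySet nodes could be configured roughly as below in azure.json; the camelCase key names are assumed here to mirror the options in the table:

{
  "vmType": "vmss",
  "disableAvailabilitySetNodes": false,
  "enableVmssFlexNodes": true
}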

6 - Deploy Cross Resource Group Nodes

Deploy cross resource group nodes.

Feature status: GA since v1.21.

Kubernetes v1.21 adds support for cross resource group (RG) nodes and unmanaged (such as on-prem) nodes in Azure cloud provider. A few assumptions are made for such nodes:

  • Cross-RG nodes are in the same region and are set with the required labels (as clarified below)
  • Nodes will not be part of the load balancer managed by cloud provider
  • Both node and container networking should be configured properly by provisioning tools
  • AzureDisk is supported for Azure cross-RG nodes, but not for on-prem nodes

Pre-requirements

Because cross-RG nodes and unmanaged nodes won’t be added to Azure load balancer backends, feature gate ServiceNodeExclusion should be enabled for master components (ServiceNodeExclusion has been GA and enabled by default since v1.21).

Cross-RG nodes

Cross-RG nodes should register themselves with the required labels and the cloud provider configuration:

  • node.kubernetes.io/exclude-from-external-load-balancers, which is used to exclude the node from load balancer.
    • alpha.service-controller.kubernetes.io/exclude-balancer=true should be used if the cluster version is below v1.16.0.
  • kubernetes.azure.com/resource-group=<rg-name>, which provides external RG and is used to get node information.
  • cloud provider config
    • --cloud-provider=azure when using kube-controller-manager
    • --cloud-provider=external when using cloud-controller-manager

For example,

kubelet ... \
  --cloud-provider=azure \
  --cloud-config=/etc/kubernetes/cloud-config/azure.json \
  --node-labels=node.kubernetes.io/exclude-from-external-load-balancers=true,kubernetes.azure.com/resource-group=<rg-name>

Unmanaged nodes

On-prem nodes are different from Azure nodes: all Azure-coupled features (such as load balancers and Azure managed disks) are not supported for them. To prevent such nodes from being deleted, the Azure cloud provider always assumes they exist.

On-prem nodes should register themselves with labels node.kubernetes.io/exclude-from-external-load-balancers=true and kubernetes.azure.com/managed=false:

  • node.kubernetes.io/exclude-from-external-load-balancers=true, which is used to exclude the node from load balancer.
  • kubernetes.azure.com/managed=false, which indicates the node is on-prem or on other clouds.

For example,

kubelet ...\
  --cloud-provider= \
  --node-labels=node.kubernetes.io/exclude-from-external-load-balancers=true,kubernetes.azure.com/managed=false

Limitations

Cross resource group nodes and unmanaged nodes are unsupported when joined to an AKS cluster. Using these labels on AKS-managed nodes is not supported.

Reference

See design docs for cross resource group nodes in KEP 20180809-cross-resource-group-nodes.

7 - Multiple Services Sharing One IP Address

Bind one IP address to multiple services.

This feature is supported since v1.20.0.

Provider Azure supports sharing one IP address among multiple load balancer typed external or internal services. To share an IP address among multiple public services, a public IP resource is needed. This public IP could be created in advance, or you can let the cloud provider provision it when creating the first external service; specifically, Azure would create a public IP resource automatically when an external service is discovered.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer

Note that the annotations service.beta.kubernetes.io/azure-load-balancer-ipv4 and service.beta.kubernetes.io/azure-load-balancer-ipv6 and the field Service.Spec.LoadBalancerIP are not set; otherwise Azure would look for a pre-allocated public IP with that address. After obtaining the IP address of the service, you could create other services using this address.

apiVersion: v1
kind: Service
metadata:
  name: https
  namespace: default
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-ipv4: 1.2.3.4 # the IP address could be the same as it is of `nginx` service
spec:
  ports:
    - port: 443
      protocol: TCP
      targetPort: 443
  selector:
    app: https
  type: LoadBalancer

Note that if you specify the annotations service.beta.kubernetes.io/azure-load-balancer-ipv4, service.beta.kubernetes.io/azure-load-balancer-ipv6 or field Service.Spec.LoadBalancerIP but there is no corresponding public IP pre-allocated, an error would be reported.

DNS

Even if multiple services can refer to one public IP, the DNS label cannot be re-used. The public IP would have the label kubernetes-dns-label-service: <svcName> to indicate which service is binding to the DNS label. In this case if there is another service sharing this specific IP address trying to refer to the DNS label, an error would be reported. For managed public IPs, this label will be added automatically by the cloud provider. For static public IPs, this label should be added manually.

Using public IP name instead of IP address to share the public IP

This feature is supported since v1.24.0.

In addition to using the IP address annotation, you could also use the public IP name to share the public IP. The public IP name could be specified by the annotation service.beta.kubernetes.io/azure-pip-name. You can point to a system-created public IP or a static public IP.

apiVersion: v1
kind: Service
metadata:
  name: https
  namespace: default
  annotations:
    service.beta.kubernetes.io/azure-pip-name: pip-1
spec:
  ports:
    - port: 443
      protocol: TCP
      targetPort: 443
  selector:
    app: https
  type: LoadBalancer

Restrictions

Cloud provider azure manages the lifecycle of system-created public IPs. By default, there are two kinds of system-managed tags: kubernetes-cluster-name and service (see the picture below). The controller manager adds the service name to the service tag when a service starts referring to the public IP, and removes the name when the service is deleted. The public IP would be deleted if there is no service left in the service tag. However, according to the docs on Azure tags, there are several restrictions:

  • Each resource, resource group, and subscription can have a maximum of 50 tag name/value pairs. If you need to apply more tags than the maximum allowed number, use a JSON string for the tag value. The JSON string can contain many values that are applied to a single tag name. A resource group or subscription can contain many resources that each have 50 tag name/value pairs.

  • The tag name is limited to 512 characters, and the tag value is limited to 256 characters. For storage accounts, the tag name is limited to 128 characters, and the tag value is limited to 256 characters.

Based on that, we suggest using static public IPs when there are more than 10 services sharing the IP address.

(Figure: tags on the public IP)

8 - Tagging resources managed by Cloud Provider Azure

This feature is supported since v1.20.0.

Tags can be used to organize your Azure resources and management hierarchy. Cloud Provider Azure supports tagging managed resources through the configuration file or service annotations.

Specifically, the shared resources (load balancer, route table and security group) could be tagged by setting tags in azure.json:

{
  "tags": "a=b,c=d"
}

The controller manager would parse this configuration and tag the shared resources once restarted.

The non-shared resource (public IP) could be tagged by setting tags in azure.json or service annotation service.beta.kubernetes.io/azure-pip-tags. The format of the two is similar and the tags in the annotation would be considered first when there are conflicts between the configuration file and the annotation.

The annotation service.beta.kubernetes.io/azure-pip-tags only works for managed public IPs. For BYO public IPs, the cloud provider would not apply any tags to them.

When the configuration (file or annotation) is updated, the old tags would be updated if there are conflicts. For example, after updating {"tags": "a=b,c=d"} to {"tags": "a=c,e=f"}, the new tags would be a=c,c=d,e=f.

Integrating with system tags

This feature is supported since v1.21.0.

Normally the controller manager doesn't delete existing tags even if they are not included in the new version of the azure configuration file, because the controller manager doesn't know which tags should be deleted and which should not (e.g., tags managed by the cloud provider itself). The systemTags config in the cloud configuration file can be leveraged to control which tags can be deleted. Here are some examples:

| Tags | SystemTags | Existing tags on resources | New tags on resources |
| ---- | ---------- | -------------------------- | --------------------- |
| "a=b,c=d" | "" | {} | {"a": "b", "c": "d"} |
| "a=b,c=d" | "" | {"a": "x", "c": "y"} | {"a": "b", "c": "d"} |
| "a=b,c=d" | "" | {"e": "f"} | {"a": "b", "c": "d", "e": "f"} /* won't delete e because the SystemTags is empty */ |
| "c=d" | "a" | {"a": "b"} | {"a": "b", "c": "d"} /* won't delete a because it's in the SystemTags */ |
| "c=d" | "x" | {"a": "b"} | {"c": "d"} /* will delete a because it's not in Tags or SystemTags */ |
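
For instance, mirroring the fourth row of the table above, a configuration that applies c=d while protecting an existing a tag from deletion might look like the sketch below, assuming systemTags takes a comma-separated string like tags does:

{
  "tags": "c=d",
  "systemTags": "a"
}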

Please consider migrating existing "tags" to "tagsMap"; support for the "tags" configuration will be removed in a future release.

Including special characters in tags

This feature is supported since v1.23.0.

Normally, special characters such as = or , are not supported in key-value pairs; these characters would be treated as separators and would not be included in the key/value literal. To solve this problem, tagsMap has been introduced since v1.23.0, in which JSON-style tags are accepted.

{
  "tags": "a=b,c=d",
  "tagsMap": {"e": "f", "g=h": "i,j"}
}

tags and tagsMap will be merged, and similarly, they are case-insensitive.

9 - Kubelet Credential Provider

Detailed steps to setup out-of-tree Kubelet Credential Provider.

Note: The Kubelet credential provider feature is still in alpha and shouldn’t be used in production environments. Please use --azure-container-registry-config=/etc/kubernetes/cloud-config/azure.json if you need pulling images from ACR in production.

As part of Out-of-Tree Credential Providers, the kubelet's built-in image pulling from ACR (which could be enabled by setting kubelet --azure-container-registry-config=<config-file>) would be moved to the out-of-tree credential plugin acr-credential-provider. Please refer to the original KEP for details.

In order to switch the kubelet credential provider to out-of-tree, you’ll have to

  • Remove --azure-container-registry-config from kubelet configuration options.
  • Add --feature-gates=KubeletCredentialProviders=true to kubelet configuration options.
  • Create directory /var/lib/kubelet/credential-provider, download ‘acr-credential-provider’ binary to this directory and add --image-credential-provider-bin-dir=/var/lib/kubelet/credential-provider to kubelet configuration options.
  • Create the following credential-provider-config.yaml file and add --image-credential-provider-config=/var/lib/kubelet/credential-provider-config.yaml to kubelet configuration options.
# cat /var/lib/kubelet/credential-provider-config.yaml
kind: CredentialProviderConfig
apiVersion: kubelet.config.k8s.io/v1
providers:
- name: acr-credential-provider
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  defaultCacheDuration: 10m
  matchImages:
  - "*.azurecr.io"
  - "*.azurecr.cn"
  - "*.azurecr.de"
  - "*.azurecr.us"
  args:
  - /etc/kubernetes/azure.json

10 - Node IPAM controller

Usage of out-of-tree Node IPAM allocator.

This feature is supported since v1.21.0.

Background

The in-tree Node IPAM controller only supports a fixed node CIDR mask size for all nodes, while in multiple node pool (VMSS) scenarios, different mask sizes are required for different node pools. There is a GCE-specific cloud CIDR allocator for a similar scenario, but that is not exposed in cloud provider API and it is planned to be moved out-of-tree.

Hence this doc proposes an out-of-tree node IPAM controller. Specifically, it allocates pod CIDRs with different CIDR mask sizes for different node pools (VMSS or VMAS).

Usage

There are two kinds of CIDR allocator in the node IPAM controller, which are RangeAllocator and CloudAllocator. The RangeAllocator is the default one which allocates the pod CIDR for every node in the range of the cluster CIDR. The CloudAllocator allocates the pod CIDR for every node in the range of the CIDR on the corresponding VMSS or VMAS.

The pod CIDR mask size of each node that belongs to a specific VMSS or VMAS is set by a specific tag {"kubernetesNodeCIDRMaskIPV4": "24"} or {"kubernetesNodeCIDRMaskIPV6": "64"}. Note that the mask size tagging on the VMSS or VMAS must be within the cluster CIDR, or an error would be thrown.

When the above tag doesn’t exist on VMSS/VMAS, the default mask size (24 for ipv4 and 64 for ipv6) would be used.

To turn on the out-of-tree node IPAM controller:

  1. Disable the in-tree node IPAM controller by setting --allocate-node-cidrs=false in kube-controller-manager.
  2. Enable the out-of-tree counterpart by setting --allocate-node-cidrs=true in cloud-controller-manager.
  3. To use RangeAllocator:
    • configure the --cluster-cidr, --service-cluster-ip-range and --node-cidr-mask-size;
    • if you enable IPv6 dual-stack, set --node-cidr-mask-size-ipv4 and --node-cidr-mask-size-ipv6 instead of --node-cidr-mask-size. An error is reported if --node-cidr-mask-size and --node-cidr-mask-size-ipv4 (or --node-cidr-mask-size-ipv6) are both set to non-zero values. If only --node-cidr-mask-size is set, which is not recommended, --node-cidr-mask-size-ipv4 and --node-cidr-mask-size-ipv6 default to this value.
  4. To use CloudAllocator:
    • set the --cidr-allocator-type=CloudAllocator;
    • configure the mask size of each VMSS/VMAS by tagging {"kubernetesNodeCIDRMaskIPV4": "custom-mask-size"} and/or {"kubernetesNodeCIDRMaskIPV6": "custom-mask-size"} if necessary (an illustrative flag sketch follows this list).
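
For illustration, the cloud-controller-manager flags for the two allocator types could look like the following; the CIDR values are examples only:

# RangeAllocator (default)
--allocate-node-cidrs=true
--cluster-cidr=10.244.0.0/16
--service-cluster-ip-range=10.0.0.0/16
--node-cidr-mask-size=24

# CloudAllocator (mask sizes come from the VMSS/VMAS tags)
--allocate-node-cidrs=true
--cidr-allocator-type=CloudAllocator
--cluster-cidr=10.244.0.0/16
--service-cluster-ip-range=10.0.0.0/16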

Configurations

kube-controller-manager

kube-controller-manager would be configured with option --allocate-node-cidrs=false to disable the in-tree node IPAM controller.

cloud-controller-manager

The following cloud-controller-manager configurations are used, with the defaults shown below:

| name | type | default | description |
| --- | --- | --- | --- |
| allocate-node-cidrs | bool | true | Should CIDRs for Pods be allocated and set on the cloud provider. |
| cluster-cidr | string | "10.244.0.0/16" | CIDR range for Pods in the cluster. Requires --allocate-node-cidrs to be true. It is ignored when dual-stack is enabled. |
| service-cluster-ip-range | string | "" | CIDR range for Services in the cluster; this range is excluded from the allocatable range. Requires --allocate-node-cidrs to be true. |
| node-cidr-mask-size | int | 24 | Mask size for node CIDRs in the cluster. Default is 24 for IPv4 and 64 for IPv6. |
| node-cidr-mask-size-ipv4 | int | 24 | Mask size for IPv4 node CIDRs in a dual-stack cluster. Default is 24. |
| node-cidr-mask-size-ipv6 | int | 64 | Mask size for IPv6 node CIDRs in a dual-stack cluster. Default is 64. |
| cidr-allocator-type | string | "RangeAllocator" | The CIDR allocator type: "RangeAllocator" or "CloudAllocator". |

Limitations

  1. We plan to integrate the out-of-tree node IPAM controller with cluster-api-provider-azure to provide a better experience. Until then, manual configuration is required.
  2. It is not supported to change the custom mask size value on the tag once it is set.
  3. For now, there is no e2e test covering this feature, so there can be potential bugs. It is not recommended to enable it in production environments.

11 - Azure Private Link Service Integration

Connect Azure Private Link service to Azure Standard Load Balancer.

Azure Private Link Service (PLS) is an infrastructure component that allows users to privately connect a Private Endpoint (PE) in a VNET in Azure to a Frontend IP Configuration associated with an Azure Load Balancer (ALB). With Private Link, users as service providers can securely provide their services to consumers, who can connect from within Azure or on-premises without data exfiltration risks.

Before Private Link Service integration, users who wanted private connectivity from on-premises or other VNETs to their services in the Azure Kubernetes cluster were required to create a Private Link Service (PLS) to reference the Azure LoadBalancer. The user would then create a Private Endpoint (PE) to connect to the PLS to enable private connectivity. With this feature, a managed PLS to the LB would be created automatically, and the user would only be required to create PE connections to it for private connectivity.

Note: When the PLS has TCP proxy protocol V2 enabled (service.beta.kubernetes.io/azure-pls-proxy-protocol: "true") and the service’s externalTrafficPolicy is set to Local, the LB health probe fails. This is because when the PLS has proxy protocol enabled, the corresponding LB HTTP health probe uses proxy protocol as well; with externalTrafficPolicy set to Local, the health probe depends on kube-proxy’s health check service, which does not accept proxy protocol, so all health probes fail. PR #3931 allows users to customize the health probe when externalTrafficPolicy is set to Local and thus provides a workaround. It will be released soon.

PrivateLinkService annotations

Below is a list of annotations supported for Kubernetes services with Azure PLS created:

| Annotation | Value | Description | Required | Default |
| --- | --- | --- | --- | --- |
| service.beta.kubernetes.io/azure-pls-create | "true" | Boolean indicating whether a PLS needs to be created. | Required | |
| service.beta.kubernetes.io/azure-pls-name | <PLS name> | String specifying the name of the PLS resource to be created. | Optional | "pls-<LB frontend config name>" |
| service.beta.kubernetes.io/azure-pls-resource-group | Resource Group name | String specifying the name of the Resource Group where the PLS resource will be created. | Optional | MC_ resource |
| service.beta.kubernetes.io/azure-pls-ip-configuration-subnet | <Subnet name> | String indicating the subnet to which the PLS will be deployed. This subnet must exist in the same VNET as the backend pool. PLS NAT IPs are allocated within this subnet. | Optional | If service.beta.kubernetes.io/azure-load-balancer-internal-subnet is set, this ILB subnet is used. Otherwise, the default subnet from the config file is used. |
| service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address-count | [1-8] | Total number of private NAT IPs to allocate. | Optional | 1 |
| service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address | "10.0.0.7 ... 10.0.0.10" | A space-separated list of static IPv4 IPs to be allocated. (IPv6 is not supported right now.) The total number of IPs should not be greater than the IP count specified in service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address-count. If fewer IPs are specified, the rest are dynamically allocated. The first IP in the list is set as Primary. | Optional | All IPs are dynamically allocated. |
| service.beta.kubernetes.io/azure-pls-fqdns | "fqdn1 fqdn2" | A space-separated list of FQDNs associated with the PLS. | Optional | [] |
| service.beta.kubernetes.io/azure-pls-proxy-protocol | "true" or "false" | Boolean indicating whether the TCP PROXY protocol should be enabled on the PLS to pass through connection information, including the link ID and source IP address. Note that the backend service MUST support the PROXY protocol or the connections will fail. | Optional | false |
| service.beta.kubernetes.io/azure-pls-visibility | "sub1 sub2 sub3 … subN" or "*" | A space-separated list of Azure subscription IDs for which the private link service is visible. Use "*" to expose the PLS to all subscriptions (least restrictive). | Optional | Empty list [], indicating role-based access control only: this private link service will only be available to individuals with role-based access control permissions within your directory (most restrictive). |
| service.beta.kubernetes.io/azure-pls-auto-approval | "sub1 sub2 sub3 … subN" | A space-separated list of Azure subscription IDs. PE connection requests from the listed subscriptions to the PLS are automatically approved. | Optional | [] |

For more details about each configuration, please refer to Azure Private Link Service Documentation.

Design Details

Creating managed PrivateLinkService

When a LoadBalancer typed service is created without the annotations service.beta.kubernetes.io/azure-load-balancer-ipv4, service.beta.kubernetes.io/azure-load-balancer-ipv6 or field Service.Spec.LoadBalancerIP set, an LB frontend IP configuration is created with a dynamically generated IP. If the service has the annotation service.beta.kubernetes.io/azure-load-balancer-ipv4 or service.beta.kubernetes.io/azure-load-balancer-ipv6 set, an existing LB frontend IP configuration may be reused if one exists; otherwise a static configuration is created with the specified IP. When a service is created with annotation service.beta.kubernetes.io/azure-pls-create set to true or updated later with the annotation added, a PLS resource attached to the LB frontend is created in the default resource group or the resource group user set in config file with key PrivateLinkServiceResourceGroup.

The Kubernetes service creating the PLS is assigned as the owner of the resource. Azure cloud provider tags the PLS with cluster name and service name kubernetes-owner-service: <namespace>/<service name>. Only the owner service can later update the properties of the PLS resource.

If there’s a managed PLS already created for the LB frontend, the same PLS is reused automatically since each LB frontend can be referenced by only one PLS. If the LB frontend is attached to a user-defined PLS, service creation should fail with a proper error logged.

For now, Azure cloud provider does not manage any Private Link Endpoint resources. Once a PLS is created, users can create their own PEs to connect to the PLS.
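
For example, a consumer could create a PE against the managed PLS with the Azure CLI along these lines (a sketch only; all resource names and the PLS resource ID are placeholders):

az network private-endpoint create \
  --resource-group <consumer-resource-group> \
  --name <pe-name> \
  --vnet-name <consumer-vnet> \
  --subnet <consumer-subnet> \
  --private-connection-resource-id <PLS resource ID> \
  --connection-name <connection-name>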

Deleting managed PrivateLinkService

Once a PLS is created, it shares the lifetime of the LB frontend IP configuration and is deleted only when its corresponding LB frontend gets deleted. As a result, a PLS may still exist even after its owner service is deleted. This is because multiple Kubernetes services can share the same LB frontend IP configuration and thus automatically share the PLS. More details are discussed in the next section.

If there are active PE connections to the PLS when it is deleted, all connections are removed and the PEs become obsolete. Users are responsible for cleaning up the PE resources.

Sharing managed PrivateLinkService

Multiple Kubernetes services can share the same LB frontend by specifying the same annotations service.beta.kubernetes.io/azure-load-balancer-ipv4, service.beta.kubernetes.io/azure-load-balancer-ipv6 or field Service.Spec.LoadBalancerIP (for more details, please refer to Multiple Services Sharing One IP Address). Once a PLS is attached to the LB frontend, these services automatically share the PLS. Users can access these services via the same PE but different ports.

Azure cloud provider tags the service creating the PLS as the owner (kubernetes-owner-service: <namespace>/<service name>) and only allows that service to update the configuration of the PLS. If the owner service is deleted, or if the user wants another service to take control, the user can modify the tag value to a new service in the <namespace>/<service name> pattern.

PLS is only automatically deleted when the LB frontend IP configuration is deleted. One can delete a service while preserving the PLS by creating a temporary service referring to the same LB frontend.
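
As a sketch, two internal services could share one LB frontend, and therefore one PLS, by pinning the same frontend IP; the names, IP, and ports below are examples only (only the first service, the PLS owner, carries the PLS annotations):

apiVersion: v1
kind: Service
metadata:
  name: shared-svc-a
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.240.0.8" # shared frontend IP
    service.beta.kubernetes.io/azure-pls-create: "true"
spec:
  type: LoadBalancer
  selector:
    app: app-a
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: shared-svc-b
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.240.0.8" # same frontend IP, so the PLS is shared
spec:
  type: LoadBalancer
  selector:
    app: app-b
  ports:
    - port: 443
      targetPort: 443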

Managed PrivateLinkService Creation example

Below is an example of a Kubernetes service object that creates an Azure internal load balancer (ILB) with a PLS attached:

apiVersion: v1
kind: Service
metadata:
  name: myservice
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true" # Use an internal LB with PLS
    service.beta.kubernetes.io/azure-pls-create: "true"
    service.beta.kubernetes.io/azure-pls-name: myServicePLS
    service.beta.kubernetes.io/azure-pls-ip-configuration-subnet: pls-subnet
    service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address-count: "1"
    service.beta.kubernetes.io/azure-pls-ip-configuration-ip-address: 10.240.0.9 # Must be available in pls-subnet
    service.beta.kubernetes.io/azure-pls-fqdns: "fqdn1 fqdn2"
    service.beta.kubernetes.io/azure-pls-proxy-protocol: "false"
    service.beta.kubernetes.io/azure-pls-visibility: "*"
    service.beta.kubernetes.io/azure-pls-auto-approval: "subId1"
spec:
  type: LoadBalancer
  selector:
    app: myApp
  ports:
    - name: myAppPort
      protocol: TCP
      port: 80
      targetPort: 80

Restrictions

  • PLS does not support basic Load Balancer or IP-based Load Balancer.
  • PLS connectivity is broken with an Azure external Standard Load Balancer when floating IP is enabled (the default). To use a managed private link service, users can either create an internal service by setting the annotation service.beta.kubernetes.io/azure-load-balancer-internal to true, or disable floating IP by setting the annotation service.beta.kubernetes.io/azure-disable-load-balancer-floating-ip to true (more details here).
  • Due to the limitation of kubernetes#95555, when the service’s externalTrafficPolicy is set to Local, the PLS needs to use a subnet different from the Pod’s subnet. If the same subnet is required, the service should use the Cluster externalTrafficPolicy.
  • PLS only works with IPv4 and cannot be deployed to an SLB with IPv6 frontend ipConfigurations. In dual-stack clusters, users cannot create a service with a PLS if there is an existing IPv6 service deployed on the same load balancer.
  • For other limitations, please check Azure Private Link Service Doc.

12 - Multiple Standard LoadBalancers

Multiple Standard LoadBalancers.

Multiple Standard LoadBalancers

Background

By default there is only a single Standard Load Balancer and a single Internal Load Balancer (if required) per cluster. This imposes a number of limits on clusters based on Azure Load Balancer limits, the most significant being the 300-rules-per-NIC limitation. Any IP:port combination in a frontEndIPConfiguration that maps to a member of a backend pool counts as one of the 300 rules for that node. This limits any AKS cluster to a maximum of 300 LoadBalancer service IP:port combinations (so a maximum of 300 services with one port, or fewer if services have multiple ports). A given load balancer is also limited to no more than 8 private link services targeting it.

Configuration

A new cloud configuration option, multipleStandardLoadBalancerConfigurations, is introduced. Example:

{
  ...
  "loadBalancerBackendPoolConfigurationType": "nodeIP",
  "multipleStandardLoadBalancerConfigurations": [
    {
      "name": "<clusterName>",
      "autoPlaceServices": true
    },
    {
      "name": "lb-2",
      "autoPlaceServices": false,
      "serviceNamespaceSelector": [
        "matchExpressions": [
          {
            "key": "key1",
            "operator": "In",
            "values": [
              "val1"
            ]
          }
        ]
      ],
      "nodeSelector": {
        "matchLabels": {
          "key1": "val1"
        }
      },
      "primaryVMSet": "vmss-1"
    }
  ]
}

To enable multiple standard load balancers, set loadBalancerSKU to Standard, set loadBalancerBackendPoolConfigurationType to nodeIP, and define at least one entry in multipleStandardLoadBalancerConfigurations. If one or more of these conditions are not met, the cloud provider will either throw an error or fall back to a single standard load balancer.

default lbs

A default lb configuration named <clustername> is required in multipleStandardLoadBalancerConfigurations. The cloud provider checks whether there is an lb configuration named <clustername>; if not, an error is reported in the service event.

internal lbs

The behavior of internal lbs remains the same as before. An internal lb shares the same configuration as its public counterpart and is automatically created if needed, with the name <external-lb-name>-internal. Internal lbs are not required in multipleStandardLoadBalancerConfigurations; all lb names in it are considered public ones.

Service selection

In the case of basic lb and the previous revision of the multiple slb design, the service annotation service.beta.kubernetes.io/azure-load-balancer-mode decides which lb the service should be attached to. It can be set to an agent pool name, in which case the service is attached to the lb that belongs to that agent pool. If set to __auto__, an lb with the fewest number of lb rules is picked for the service. This selection logic will be replaced by the following:

  1. A new service annotation service.beta.kubernetes.io/azure-load-balancer-configurations: <lb-config-name1>,<lb-config-name2> replaces the old annotation service.beta.kubernetes.io/azure-load-balancer-mode, which remains useful only for basic SKU load balancers (see the sketch after this list). If none of the selected lbs are eligible, an error is reported in the service events. If multiple eligible lbs are selected, the one with the fewest rules is chosen.

  2. AllowServicePlacement: whether this load balancer can have services placed on it. Defaults to true; it can be set to false to drain and eventually remove a load balancer. This does not impact existing services on the load balancer.

  3. ServiceNamespaceSelector: only services created in namespaces that match the selector are allowed to select this load balancer, either manually or automatically. If not supplied, services created in any namespace can be placed on this load balancer. If the value is changed, all services on this slb are moved onto another one, with their public/internal IP addresses unchanged. If the services have no place to go, an error is thrown in the service event.

  4. ServiceLabelSelector: similar to ServiceNamespaceSelector. Services must match this selector to be placed on this load balancer.
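
A minimal sketch of a service pinned to a specific lb configuration via the new annotation; the configuration name is an example, and multiple names can be given comma-separated as shown in the annotation format above:

apiVersion: v1
kind: Service
metadata:
  name: app-svc
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-configurations: "lb-2"
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 80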

Node selection

When the cluster is initially migrated to or created with multiple standard load balancers, each node will be evaluated to see what load balancer it should be placed into.

Valid placement targets will be determined as follows (rules match from top to bottom, first match wins):

  1. If this node is in an agent pool that is selected as a primary agent pool for a load balancer, that load balancer will be the only potential placement target.
  2. If the nodeSelectors on any load balancer configurations match this node, then all load balancer configurations that match it will be potential placement targets.
  3. If no nodeSelectors on any load balancer configurations match this node, then all load balancers that do not have any nodeSelectors will be potential placement targets.

After the list of potential placement targets has been calculated, the node will be placed into the kubernetes backend pool of the load balancer with the fewest number of nodes already assigned.

Service with ExternalTrafficPolicy=Local

Each local service owns a backend pool named after the service. The backend pool is created in the service reconciliation loop when the service is created or is updated from externalTrafficPolicy Cluster. It is deleted in the service reconciliation loop when: (1) the service is deleted; (2) the service is changed to etp Cluster; (3) the cluster is migrated from multi-slb to single-slb; or (4) the service is moved to another load balancer.

Besides the service reconciliation loop, an endpointslice informer is also responsible for updating the dedicated backend pools. It watches all endpointslices of local services, monitors update events, and updates the corresponding backend pool when the service endpoints change. Since local services may churn quickly, the informer sends backend pool update operations to a buffer queue. The queue merges operations targeting the same backend pool and flushes them every 30s by default. The update interval can be adjusted by changing loadBalancerBackendPoolUpdateIntervalInSeconds in the cloud configuration (see the fragment below).
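
For instance, a cloud configuration fragment lowering the flush interval might look like this (a sketch, assuming the key sits at the top level of the cloud config alongside the multi-slb settings):

{
  ...
  "loadBalancerBackendPoolConfigurationType": "nodeIP",
  "loadBalancerBackendPoolUpdateIntervalInSeconds": 10,
  "multipleStandardLoadBalancerConfigurations": [
    ...
  ]
}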

A local service’s dedicated backend pool and the <clusterName> backend pool cannot be reconciled in the same loop. Hence, operations triggered by the update of a local service or its endpoints do not affect the <clusterName> backend pool.