We follow the Microsoft best practices for AKS clusters listed below.
Physical: Multiple clusters are deployed to separate application environments such as Dev, Staging, and Production.
Logical: With logical isolation, a single AKS cluster can be used for multiple workloads, teams, or environments. Kubernetes namespaces are used to define the different environments and allocate resources accordingly.
Resource Quota: Define quotas at the namespace level for CPU, memory, total number of volumes or disk space, total number of secrets, jobs, and so on, to reserve and limit resources.
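As a minimal sketch of a namespace-level quota, the manifest below creates a namespace and attaches a ResourceQuota to it. The namespace name `dev`, the quota name `dev-quota`, and every numeric value are hypothetical and would need to be sized for the actual workloads.

```yaml
# Hypothetical namespace and quota values, for illustration only.
apiVersion: v1
kind: Namespace
metadata:
  name: dev
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "4"             # total CPU that pods in the namespace may request
    requests.memory: 8Gi          # total memory that pods may request
    limits.cpu: "8"               # total CPU limit across all pods
    limits.memory: 16Gi           # total memory limit across all pods
    persistentvolumeclaims: "10"  # number of volume claims
    requests.storage: 100Gi       # total requested disk space
    secrets: "20"                 # number of secrets
    count/jobs.batch: "10"        # number of jobs
```

Once a quota covers CPU or memory, pods in that namespace are rejected unless they declare the corresponding requests and limits (or a LimitRange supplies defaults).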
Requests and limits are specified in the pod specification of each deployment.
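The following Deployment is a sketch of how requests and limits might be declared per container. The names `web` and `dev`, the image reference, and the CPU and memory figures are assumptions chosen for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: dev
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myregistry.azurecr.io/web:1.0.0  # hypothetical private registry image
          resources:
            requests:        # amount the scheduler reserves for the container
              cpu: 100m
              memory: 128Mi
            limits:          # ceiling the container may not exceed
              cpu: 250m
              memory: 256Mi
```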
Involuntary Disruptions: These include hardware failures on physical machines. They can be mitigated by running multiple replicas (replica sets) across multiple nodes.
Voluntary Disruptions: These include cluster upgrades, deployment updates, and accidental deletion of containers. They can be mitigated with a PodDisruptionBudget. When a cluster is upgraded or a deployment template is updated, the Kubernetes scheduler makes sure additional pods are scheduled on other nodes before the disruption proceeds: it waits to reboot a node until the defined number of pods has been successfully scheduled on other nodes in the cluster.
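A PodDisruptionBudget for this purpose could look like the sketch below. It assumes a hypothetical workload whose pods carry the label `app: web` in a namespace called `dev`, and it assumes that two available pods are enough to keep the service running.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: dev
spec:
  minAvailable: 2      # assumed minimum; voluntary evictions are blocked below this count
  selector:
    matchLabels:
      app: web         # hypothetical label identifying the protected pods
```

`maxUnavailable` can be used instead of `minAvailable` when it is easier to express the budget as the number of pods that may be down at once.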
Taints and Tolerations: Taint dedicated nodes and define matching tolerations on pods, so that only those pods can be scheduled on the tainted nodes.
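The pod below sketches a matching toleration. It assumes that a node has already been tainted with the hypothetical key `sku`, value `gpu`, and effect `NoSchedule`; the pod and image names are also illustrative.

```yaml
# Assumes a node was tainted beforehand with sku=gpu:NoSchedule (hypothetical taint).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  containers:
    - name: gpu-job
      image: myregistry.azurecr.io/gpu-job:1.0.0  # hypothetical image
  tolerations:
    - key: sku
      operator: Equal
      value: gpu
      effect: NoSchedule   # must match the effect of the node taint
```

A toleration only permits scheduling on the tainted node; it does not force it. To pin the pod to that node as well, combine the toleration with a node selector or node affinity.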
Labels: Node selectors and node affinity are used to schedule pods on nodes with matching labels.
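As an illustration, the pod below uses required node affinity to run only on nodes that carry a given label. The label `disktype=ssd`, the pod name, and the image are assumptions; the commented-out `nodeSelector` shows the simpler equivalent for an exact label match.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-app
spec:
  containers:
    - name: ssd-app
      image: myregistry.azurecr.io/ssd-app:1.0.0  # hypothetical image
  # Simpler alternative for an exact match:
  # nodeSelector:
  #   disktype: ssd
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype      # hypothetical node label
                operator: In
                values:
                  - ssd
```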
Azure AD authentication, RBAC, and pod-managed identities are used to define the access levels that the respective users have on the cluster.
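One way to express such an access level is a namespaced Role bound to an Azure AD group, as sketched below. It assumes the cluster has Azure AD integration enabled; the role name, namespace, and the all-zero group object ID are placeholders.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dev-reader
  namespace: dev
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "deployments"]
    verbs: ["get", "list", "watch"]   # read-only access
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-reader-binding
  namespace: dev
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: "00000000-0000-0000-0000-000000000000"  # placeholder Azure AD group object ID
roleRef:
  kind: Role
  name: dev-reader
  apiGroup: rbac.authorization.k8s.io
```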
Version and Updates:
Keep the cluster on the latest supported AKS version and keep node updates applied.
Image Security:
Use private registries and build new images from official base images.
Leverage container security in Azure Security Center.
Define secrets and use Azure Key Vault to limit credential exposure.
Use kubenet or Azure CNI for cluster networking.
Distribute HTTP/S requests using ingress controllers.
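A minimal Ingress resource for this is sketched below. It assumes an NGINX ingress controller is installed and registered under the class name `nginx`; the host name, the TLS secret, and the backend service `web` are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: dev
spec:
  ingressClassName: nginx        # assumes an NGINX ingress controller with this class
  tls:
    - hosts:
        - web.example.com        # hypothetical host name
      secretName: web-tls        # hypothetical secret holding the TLS certificate
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web        # hypothetical backend service
                port:
                  number: 80
```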
Use Application Gateway with a WAF to provide an extra layer of security.
Monitoring and Debugging:
Use Azure metrics and insights to monitor workloads and set alerts accordingly.
Use an open-source metrics monitoring solution such as Prometheus.
Regularly run the latest version of the open-source kube-advisor tool to detect issues in the cluster.