First things first. Assuming that you have a project budget and a willingness to get this done, where do you start? I typically begin with three things:
- Establish requirements, both functional and non-functional
- Establish security guidelines
- Establish architectural governance
Your choice of platform is a key decision. Understanding the strengths and weaknesses of each option may help to influence your decision, although in my experience there is no clear winner, whether an on-premises datacentre or a particular Cloud Services Provider (CSP). If you don't have a preferred CSP, you might like to read another of our articles, How to choose a Cloud Provider.
Document your choices, highlighting the pros and cons as well as any reasoning and logic behind your decisions. However you arrive at your decision, we will assume for the purposes of this article that you already have a preferred CSP.
Before you can begin to establish your platform, you should conduct a design review to ensure that you have a shared understanding of what you are building, as well as a security audit to de-risk your deployment. Regular security reviews and audits are key to ensuring that you fully understand your security landscape and that you haven’t left an exploitable gap (or, at the very least, that you understand what gaps you have and any mitigating factors you can employ to minimise the risk). Similarly, regular design reviews and a build and patching strategy will minimise your attack vectors, and limit the blast radius should anything penetrate the defences.
We are going to assume at this point that your connectivity and your authentication methods are well defined and secure. We will also assume that you are creating technical documentation as you go.
Hand-in-hand with good security practices are good architectural and governance practices. By now you will have produced high-level architectural design documents that explain what you are building, or have already built. As well as identifying the building blocks involved, these designs will call out the data flow in detail. Right from the beginning, it is important to understand the flow and security of your data: if a potential exploit is uncovered, you will need to pinpoint your weak spots and target them for remediation. Not having clear documentation that details your security patterns exposes you to additional risk and makes remediation potentially difficult and time consuming. Clear, detailed architectural design drawings help to communicate the vision and allow for critique by colleagues. Regular peer review of architectural designs at several stages of maturity is good practice, and should be an integral part of your design release process.
I have often built a Minimum Viable Product (MVP) manually to prove my ideas. If this works for you too, then great. One extremely important thing to remember is that an MVP is, generally speaking, rushed and lightly documented, and may hide several manual shortcuts. A demonstrable MVP should always be treated as a temporary and disposable environment. You should always build a Production environment from scratch, following prescriptive documentation.
Once your MVP is completed, it is time to look at build orchestration.
Build Orchestration is a widely adopted approach that relies on Infrastructure as Code (IaC). This essentially means that everything you build is version controlled (perhaps using GitHub) and effortlessly repeatable.
I use both Ansible and Terraform to provision entire environments and use modules wherever possible. Modules are reusable blocks of code (think repeatable functions) and save time for coders and code reviewers. For example, a module can be created to install a specific piece of software. This module can then be reused multiple times when building out servers and containers. Terragrunt is a powerful Terraform wrapper that can further enhance and simplify the process.
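To make the "repeatable environments" idea concrete, here is a minimal Python sketch of a wrapper that applies the same Terraform module code to every environment, varying only a per-environment variables file. The file layout and the `build_terraform_command` helper are hypothetical illustrations, not part of Terraform or Terragrunt.

```python
from typing import List

def build_terraform_command(environment: str) -> List[str]:
    """Build the Terraform CLI invocation for one environment.

    The same module code is reused everywhere; only the variables
    file differs, keeping builds consistent and repeatable.
    (The environments/ file layout is a hypothetical example.)
    """
    return [
        "terraform", "apply",
        f"-var-file=environments/{environment}.tfvars",
        "-auto-approve",
    ]

# Hypothetical usage: provision every environment from the same code.
for env in ["dev", "test", "qa", "stage", "prod"]:
    cmd = build_terraform_command(env)
    # subprocess.run(cmd, check=True)  # uncomment to actually run Terraform
    print(" ".join(cmd))
```

In practice a wrapper such as Terragrunt handles this (and much more) for you; the point is simply that the module code itself never changes between environments.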
Any deployment tooling generally depends on a reliable backend code repository. Ensure that this is structured correctly and secure. All code changes should be backed by a Code Review process, especially if that code is going anywhere near your production environment. There should be zero opportunity for any malicious or unvalidated code to be pushed to any of your repositories!
Our first environment
The first environment build will be a development environment. Development teams often have many parallel development environments as they can be spun up for specific use cases, and can be torn down again once their usefulness has expired. Perhaps you will have one development environment per project group or per regional workforce? There is no right answer; it’s down to whatever suits your organisation. Generally speaking, organisations like to start with a single shared development environment and review performance impact later.
We can test our build procedures in development too. Let’s explore the following statement:
It can be a good idea to establish a frequent rebuild cadence (as often as practically possible, but at least between major releases) for these low-level environments for two reasons:
- Exercise our Disaster Recovery (DR) procedures in a non-critical environment
- Reset our baseline deployment and reverse any platform changes made outside of change control
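As a sketch of what a rebuild cadence policy might look like in code (the 30-day threshold is an illustrative assumption, not a recommendation):

```python
from datetime import date, timedelta

def rebuild_due(last_rebuild: date, today: date,
                major_release_since_rebuild: bool,
                max_age_days: int = 30) -> bool:
    """Decide whether a low-level environment is due for a rebuild.

    Hypothetical policy: rebuild at least between major releases,
    and in any case once the environment is older than max_age_days,
    exercising DR procedures and resetting the baseline.
    """
    if major_release_since_rebuild:
        return True
    return (today - last_rebuild) > timedelta(days=max_age_days)
```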
Rehearsing DR scenarios by following our own documentation word-for-word is a must. I’m pretty sure that I don’t have to explain the benefits in detail! By executing our DR plan often, we give ourselves and our stakeholders confidence (and highlight any gaps in our procedure).
Resetting our baseline also highlights dependencies, i.e., any platform changes that are required to persist beyond a rebuild. Lightweight change control can be used to good effect in these circumstances: a pull request can be raised to update the Terraform code to include the persistent change, which enables a review of any code-security or functionality concerns and ensures that we maintain an audit trail. This is a light-touch approach in support of a simple and logical procedure, which ensures that necessary dependencies are included and that the platform is built consistently across environments. For example: DEV >= TEST >= QA >= STAGE >= PROD
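The DEV >= TEST >= QA >= STAGE >= PROD relationship can be expressed as a simple invariant: a change should only be present in a higher environment if it is already present in every lower one. A minimal sketch, assuming changes are tracked as sets of identifiers per environment:

```python
from typing import Dict, List, Set

def consistent_promotion(envs: List[str],
                         deployed: Dict[str, Set[str]]) -> bool:
    """Check that changes reach environments in order: each higher
    environment's change set must be a subset of the one below it
    (DEV >= TEST >= QA >= STAGE >= PROD). Inputs are hypothetical."""
    for i in range(len(envs) - 1):
        lower, higher = envs[i], envs[i + 1]
        if not deployed[higher] <= deployed[lower]:
            return False
    return True
```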
With our development environment in place, we are now ready to build something! We know that we want to deploy Kubernetes, and that we will use our Orchestration tool to do it. What we need to understand next are the methods by which Kubernetes can be executed and supported.
In AWS, the Managed Service offering is EKS. In Azure, it is AKS, and in GCP it is GKE. Alternatively, with each of these cloud providers it is possible to install Kubernetes onto virtual servers. I know of one large organisation that opted for this strategy in the belief that it was the low-cost option. In my opinion, the managed service should be your first option: let the service provider take the pain until you are ready and prepared to adopt that workload for yourself. While you are developing and maturing your container build process and release strategy, and developing your resilience and DR capability, you might find the bandwidth to decide whether you should self-manage your Kubernetes platform. By then, you can hope to have learned a few things about your application ecosystem, good and bad, and will be able to make better-informed decisions.
If you opt for a managed service, you should dedicate some time to exploring additional capabilities. The Kubernetes Dashboard (the web UI) is not deployed by default, although it is extremely easy to install. My first choice is always to install this, as it is the fastest way to establish single-pane-of-glass visibility into the platform. With that visibility, you can begin to gain insight into how your platform is performing.
Now that we have a way to observe real-time behaviour inside our cluster, it is time to consider platform metrics. Remember: failure can and will happen. Prepare for it and set yourself up for success.
Before onboarding tenant workloads, take time to implement some basic checks and alerts so that the SRE team can be the first to respond. SREs themselves can introduce automated remediation; for example, creating an event trigger/response to scale up or down in response to resource demand.
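As an illustration of the kind of scale-up/scale-down remediation rule described above, here is a minimal threshold-based sketch. The thresholds and replica bounds are hypothetical examples; in a real cluster you would typically lean on the Horizontal Pod Autoscaler rather than hand-rolled logic.

```python
def scaling_decision(cpu_utilisation: float, current_replicas: int,
                     min_replicas: int = 2, max_replicas: int = 10,
                     scale_up_at: float = 0.8,
                     scale_down_at: float = 0.3) -> int:
    """Return the desired replica count for a simple threshold-based
    remediation rule (all numbers are illustrative assumptions)."""
    if cpu_utilisation > scale_up_at and current_replicas < max_replicas:
        return current_replicas + 1
    if cpu_utilisation < scale_down_at and current_replicas > min_replicas:
        return current_replicas - 1
    return current_replicas
```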
Email or Slack notifications - ranging from your typical system events, warnings and errors through to security and critical alerts - can be integrated with a ticketing system such as Jira, so that a ticket is raised for every notable event. This enables you to prioritise your workload accordingly and never lose track of historical incidents.
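A simple way to reason about such routing is a severity-to-channel map. The channel names and ticket priorities below are hypothetical examples, not a fixed scheme:

```python
from typing import Dict

def route_alert(severity: str) -> Dict:
    """Map an alert severity to notification channels and a ticket
    priority (all values are illustrative placeholders)."""
    routes = {
        "critical": {"channels": ["slack", "email"], "ticket_priority": "P1"},
        "security": {"channels": ["slack", "email"], "ticket_priority": "P1"},
        "error":    {"channels": ["slack"], "ticket_priority": "P2"},
        "warning":  {"channels": ["slack"], "ticket_priority": "P3"},
    }
    # Anything unrecognised gets a low-priority ticket and no paging.
    return routes.get(severity, {"channels": [], "ticket_priority": "P4"})
```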
Last but not least - in terms of observation - is logging. Again, for the purposes of retrospective analysis, it is a good idea to implement a logging solution because Kubernetes log retention is typically limited to 30 days, and even then, only if you have adequate storage. Moving container logs and Kubernetes system logs off your cluster and onto external storage is fairly easy to achieve, and this too should be set up before you begin to onboard tenants and services.
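The shape of such a log-shipping component can be sketched as a small batching buffer. The `store` object here stands in for an object-storage or log-service client and is a hypothetical placeholder:

```python
from typing import List

class LogShipper:
    """Buffer container log lines and flush them to external storage
    in batches, so logs survive beyond the cluster's own retention."""

    def __init__(self, store, batch_size: int = 100):
        self.store = store          # hypothetical external storage client
        self.batch_size = batch_size
        self.buffer: List[str] = []

    def ship(self, line: str) -> None:
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.store.append(list(self.buffer))  # one batch per write
            self.buffer.clear()
```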
Choosing a Kubernetes platform is not an easy decision. To offer you a bit of background and highlight a few popular choices, we have written other articles that you might like to read before taking your next step:
Airwalk Consulting has designed and deployed Kubernetes infrastructure, controls and pipelines, for some of the world’s largest organisations. Contact us for more information, advice or assistance.