This article is a detailed write-up of the talk I gave at the Silicon Chalet Meetup on 1 October 2024 on cloud security.
If you’d like to find out more about this event, you can get the slides and (re)watch the presentation on YouTube (French talk).
A few words to begin with#
Cloud…
I’ve been writing about it for a while now, whether through CI/CD, infrastructure as code, containers, etc.
As a consultant, it’s my responsibility to ensure that the various customers I support in their transition to the Cloud are well-informed about the key issues; the two that come up most often today are FinOps and security.
Although security is often taken into account at the start of a migration or transition to the Cloud, it is not necessarily present in all the layers of what will later be deployed.
Frequently, security is seen as a set of configurations to be deployed or tools to be installed with the goal of passing the final audit.
Obviously, the tooling part is essential, but it’s not the only part!
According to Bruce Schneier, a renowned IT security expert:
Security is a process, not a product
In other words, security should be seen as a mindset to be adopted from the beginning, regardless of the team you are part of, you are responsible for the resources deployed and the configurations you create. This approach is intended to be iterative, so you need to start with the essential basics and add certain concepts as you progress.
This approach, known as DevSecOps, aims to promote security from the development of the application through to the implementation of the underlying infrastructure.
In this article, I wanted to share with you the various points I recommend before or during a transition to the Cloud. To do this, I’ve selected the three major Cloud providers at the moment: AWS, Azure and Google Cloud.
Of course, these tips are not exhaustive, but they can give you a sense of the best practices to put in place in the areas mentioned.
What the word security evokes…#
Clearly, security is a broad concept that can be addressed at different levels of the infrastructure. Nonetheless, there are certain concepts that frequently arise:
Reducing the attack surface#
This involves reducing the potential entry points for attackers seeking to compromise your infrastructure. It includes:
- Opening only the necessary ports;
- Limiting public access to services, particularly *Platform as a Service (PaaS)* offerings, which often expose public endpoints by default;
- Setting up a policy of hardening virtual machines, containers and even binaries;
- Applying the principle of least privilege for user permissions or service accounts.
Protecting against vulnerabilities#
When talking about protection against vulnerabilities, this implies:
- Keeping all components up to date;
- Having the tools to test for and identify vulnerabilities as early as possible in the development phase, which is what the Shift Left approach promotes;
- Keeping an eye on your applications, even once they’ve been deployed, thanks to daily scanning tools;
- Integrating security as an essential element before starting a project by adopting the DevSecOps approach;
- Remaining up-to-date with the latest threats and patches for critical vulnerabilities.
Preventing data leakage#
In order to do this, several points need to be defined, particularly based on the criticality of the use case(s):
- Using encryption at rest, in transit and even at runtime;
- Managing encryption keys and sensitive information (secrets);
- Applying strict access controls;
- Anonymising sensitive data and information;
- Classifying data to apply appropriate security policies.
Partitioning, filtering and tracing flows#
During the architecture stage, network segmentation and flow monitoring are essential for detecting and preventing malicious activity:
- Defining networks and sub-networks;
- Using network equipment to filter outgoing access, such as IDS/IPS intrusion detection systems, but also incoming access, such as Web Application Firewalls (WAF);
- Implementing logging and log analysis on your network components (Flow Logs, Firewall Logs, etc.).
As mentioned above, this list represents just a few of the essential points when it comes to the concept of security. However, it’s clear that with these first four points alone, there’s a lot to expand upon, especially when starting a Cloud project or migration.
Good practice when starting up#
In this section, I’d like to outline a few fundamental points that you should consider before deploying your first Cloud service.
Choosing your naming convention#
Although not directly related to security, a specific naming convention will give you the ability to quickly identify your Cloud resources and enhance visibility when analysing logs and metrics in order to detect abnormal behaviour.
To do this, I recommend using each Cloud provider’s guidelines as a starting point. For example, in its Cloud Adoption Framework, Azure gives you a clear approach if you’re lacking inspiration.
A good naming convention must contain at least:
- Key information about the resource concerned:
- The type of resource;
- The name of the application;
- The environment (dev, test, prod, etc.);
- The region;
- An increment if there are several resources with the same name (optional);
- The implementation of understandable abbreviations to reduce the size of the name; it is quite common to abbreviate the resource type or region name. For instance, switzerland-north becomes chn (Azure) and europe-west6 becomes ew6 (Google Cloud);
- The use of hyphens (-) to separate different information where possible.
It’s up to you, of course, to add your own features to suit your needs.
In Azure, for instance, it might be worth adopting this format:
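Reconstructing the example described below: with `pip` as the Azure abbreviation for a public IP, the format might look like:

```
pip-siliconchalet-prod-chn-001
```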
In this example, the resource type refers to a public ip for the Silicon Chalet application, a production project located in the Switzerland North region. Finally, the increment indicates that this is the first resource with this format.
Defining tags or labels#
Tags (AWS and Azure) or labels (Google Cloud) can help you to identify your resources with a set of key/value attributes.
Managing these also supports a FinOps approach, increasing visibility of billing and enabling alerts that monitor abnormal behaviour.
In fact, it’s interesting to monitor this data and understand the decreases or increases in resource consumption depending on the type of environment.
For example, an excessively high level of consumption could mean that an unauthorised person has potentially gained access to your environment. Just as a drop in consumption would indicate that someone in the team may have removed essential resources required for an application to work properly.
As you can see, financial monitoring can be used to detect waste and improve visibility, but it can also contribute to the overall security of your environments.
In addition to this aspect, tags and labels are major assets for the observability of your platform and the management of your infrastructure, just like your applications.
Three classic tags or labels are strongly recommended:
- environment: type of environment
- application: name of the application
- owner or team: resource manager
Of course, it is important to apply this convention across all your resources and to automate this process as much as possible to ensure consistency.
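To illustrate the automation point, here is a minimal sketch in Python that checks resources against the three recommended keys before deployment (the resource data and names are purely illustrative):

```python
# Sketch: verify that every resource carries the three recommended
# tags/labels before it is deployed. Resources here are plain dicts.
REQUIRED_TAGS = {"environment", "application", "owner"}

def missing_tags(resource: dict) -> set:
    """Return the set of required tag keys absent from a resource."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))

resources = [
    {"name": "vm-app-prod", "tags": {"environment": "prod",
                                     "application": "siliconchalet",
                                     "owner": "platform-team"}},
    {"name": "vm-legacy", "tags": {"environment": "dev"}},
]

for r in resources:
    gap = missing_tags(r)
    if gap:
        print(f"{r['name']}: missing {sorted(gap)}")
```

The same check could run in a CI pipeline or as a policy rule to reject untagged resources.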
In the case of AWS, here is an example of tags:
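A hypothetical tag set (all values illustrative) could look like this:

```json
{
  "Name": "ec2-siliconchalet-prod-euc1-001",
  "environment": "prod",
  "application": "siliconchalet",
  "owner": "platform-team",
  "CostCenter": "CC-1234",
  "AutoShutdown": "true"
}
```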
In addition to the above, the `Name` tag is essential in AWS for the name of the resource, and `CostCenter` can give you an indication of the billing details. It may also be useful to add other tags, in particular to indicate who created the resource, the date it was created, the confidentiality of associated data, etc.
Interestingly, in this example, the `AutoShutdown` tag can be linked to the execution of a Lambda to shut down your EC2 instances according to a CRON expression.
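Such a Lambda might look like this sketch (hypothetical tag and resource names; the filtering logic is kept in a pure helper so it can be tested without AWS access):

```python
# Sketch of an AutoShutdown Lambda (hypothetical): stops every running
# EC2 instance tagged AutoShutdown=true, triggered on a schedule by an
# EventBridge CRON rule.
def instance_ids(reservations):
    """Pure helper: flatten a describe_instances result into instance IDs."""
    return [i["InstanceId"] for r in reservations for i in r["Instances"]]

def handler(event, context):
    import boto3  # imported here so the helper above stays dependency-free
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:AutoShutdown", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = instance_ids(resp["Reservations"])
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```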
Managing users and permissions#
This part is often underestimated, but it is crucial when implementing a security strategy, in particular through the principle of least privilege: grant only the necessary permissions, scoped as closely as possible to the user’s needs.
A few good practices round off this point:
- Using groups rather than defining roles at user level, for improved maintainability;
- Assigning predefined roles with a restricted scope based on user needs. Custom roles should only be considered as a last resort, as you will have to maintain them;
- Reviewing permissions regularly and using services for temporary privilege elevation if you occasionally need access to a resource:
- Privileged Identity Management (PIM) for Azure;
- TEAM with IAM Identity Center for AWS;
- Privileged Access Manager (PAM) for Google Cloud.
- Implementing appropriate mechanisms to grant permissions between services and avoid hard-coded credentials:
- Managed Identity for Azure;
- IAM Roles for AWS;
- Service accounts for Google Cloud.
The last bullet point is illustrated in this example:
A service account was created and attached to the virtual machine in order to retrieve one or more objects from the Google Cloud Storage bucket via the `roles/storage.objectViewer` role, without mounting sensitive information, such as a JSON key file, inside the machine.
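As a sketch, with hypothetical project, bucket and account names, the same setup could be performed with the gcloud CLI:

```shell
# Create a dedicated service account (hypothetical names throughout)
gcloud iam service-accounts create vm-reader \
  --display-name="VM bucket reader"

# Grant read-only access on the bucket only (least privilege)
gcloud storage buckets add-iam-policy-binding gs://my-app-bucket \
  --member="serviceAccount:vm-reader@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Attach the service account to the VM: no JSON key ever leaves IAM
gcloud compute instances create vm-app \
  --service-account="vm-reader@my-project.iam.gserviceaccount.com" \
  --scopes="https://www.googleapis.com/auth/cloud-platform"
```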
Separating projects#
In order to isolate your resources, it is crucial to define an organisation, the well-known element that serves as the foundation for the Landing Zone.
These foundations must offer two advantages: scalability and modularity. This is where you can define, as closely as possible to your needs, your IAM permissions, policies (discussed later), budgets (to monitor your costs), quotas (to set resource creation limits), etc.
It’s also important to consider separating your production environments from your non-production environments, as well as dedicating a node in the organisation to your shared services, such as the network or your observability platform.
Here is an example of an Azure organisation that is part of the Cloud Adoption Framework recommended by Microsoft. It is made up of different Management Groups and Subscriptions to distribute resources.
This ensures scalability while separating the responsibilities of each resource.
Deactivating services or features#
Another key feature, once you have defined your organisation, is the ability to activate security policies. It can be useful to lock down certain behaviours across the different nodes of the established hierarchy.
This takes different forms depending on the Cloud provider:
- AWS: Service Control Policies (SCPs)
- Azure: Azure Policy
- Google Cloud: Organization Policy
The purpose is to ensure that security and compliance standards are upheld and to limit the attack surface in the event of an attacker gaining a foothold in your organisation.
An important point is the inheritance mechanism that is implemented by default, as well as the restriction logic in child nodes, as in this example on AWS:
In the Production Organizational Unit, an SCP applies a Deny inherited from a higher-level node, so Account X is deprived of that functionality.
Below are some sample strategies with different behaviours:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "ec2:DeleteFlowLogs",
        "logs:DeleteLogGroup",
        "logs:DeleteLogStream"
      ],
      "Resource": "*"
    }
  ]
}
In AWS, this SCP prevents users from deleting VPC Flow Logs.
{
  "properties": {
    "displayName": "Disable public network access for Azure Key Vault",
    "policyType": "BuiltIn",
    [...]
    "version": "1.1.0",
    "parameters": {
      "effect": {
        "type": "String",
        "metadata": {
          "displayName": "Effect",
          "description": "Enable or disable the execution of the policy"
        },
        "allowedValues": [
          "Audit",
          "Deny",
          "Disabled"
        ],
        "defaultValue": "Audit"
      }
    },
    "policyRule": {
      "if": {
        "allOf": [
          {
            "field": "type",
            "equals": "Microsoft.KeyVault/vaults"
          },
          {
            "not": {
              "field": "Microsoft.KeyVault/vaults/createMode",
              "equals": "recover"
            }
          }
          [...]
        ]
      },
      "then": {
        "effect": "[parameters('effect')]"
      }
    }
  },
  "id": "/providers/Microsoft.Authorization/policyDefinitions/405c5871-3e91-4644-8a63-58e19d68ff5b/versions/1.1.0",
  "type": "Microsoft.Authorization/policyDefinitions/versions",
  "name": "1.1.0"
}
In Azure, this Policy audits or denies, depending on the chosen effect, the deployment of a Key Vault with public network access.
name: organizations/01234567890/policies/gcp.resourceLocations
spec:
rules:
- values:
allowedValues:
- in:europe-locations
In Google Cloud, this Organization Policy restricts the regions in which resources can be deployed to European locations only.
These policies need to be tested continually, on nodes separate from those hosting your production applications. It is often advisable to have a dedicated organisation for testing, or to duplicate your nodes.
Encrypting and protecting data#
Now, it’s time to talk about data and the various functions you can use to protect the sensitive parts of your infrastructure:
- Enabling encryption at rest, in transit and also at runtime if necessary, like Google Cloud’s Confidential Computing;
- Storing your sensitive information in dedicated services such as AWS Secrets Manager, Azure Key Vault or Google Cloud Secret Manager;
- Using your own encryption keys, especially if legal constraints or critical data require this mechanism;
- Don’t forget to rotate your encryption keys and secrets at set intervals. And don’t neglect to configure your applications to take these new secrets into account!
Observability#
You can’t understand what you can’t see! So observability is key!
The idea is to be able to track all the actions you perform in the Cloud:
- Enabling full logging of activities (CloudTrail for AWS, Azure Monitor, Cloud Audit Logs for Google Cloud);
- Using integrated tools to send alerts or centralise logs in a SIEM (Security Information and Event Management) system;
- Deploying monitoring and logging agents across all your services;
- Designing dashboards to monitor your infrastructure right down to your applications;
- Implementing remediation services to counter external threats and correct the effects of misconfiguration.
This remediation aspect can be handled by AWS Config, which can, for example, remove an SSH rule from a Security Group if it does not comply with the security constraints set by the company:
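In the same spirit, here is a hypothetical custom remediation with boto3 that revokes ingress rules exposing SSH to the whole Internet (all names illustrative; the rule-matching logic is a pure helper so it can be tested offline):

```python
# Sketch of a remediation step in the spirit of the AWS Config rule
# described above: revoke Security Group ingress rules that open SSH
# (port 22) to 0.0.0.0/0.
def open_ssh_permissions(ip_permissions):
    """Pure helper: keep only rules opening port 22 to the whole Internet."""
    return [
        p for p in ip_permissions
        if p.get("FromPort") == 22
        and any(r.get("CidrIp") == "0.0.0.0/0" for r in p.get("IpRanges", []))
    ]

def remediate(group_id):
    import boto3  # deferred so the helper above stays dependency-free
    ec2 = boto3.client("ec2")
    sg = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    bad = open_ssh_permissions(sg["IpPermissions"])
    if bad:
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=bad)
```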
Network architecture#
An essential stage in any Cloud project is the definition of a network architecture that will meet your needs, whether in terms of maintenance or scalability. I’d like to take you through the different architectures most commonly used by the three main Cloud providers.
In AWS#
For some time now, AWS has been offering the Transit Gateway as the central component enabling communication between your VPCs, removing the limitations of peering. This service can also act as a kind of router, associating your Direct Connect or VPN connections with on-premises or other networks.
Its advantage is the ability to configure routes and, if necessary, set up a VPC to manage traffic inspection, handled by the Network Firewall.
To add a layer of security between VPCs, it is useful to configure Network ACLs at subnet level, as well as Security Groups for the services that support them.
Remember that in AWS, you can have private or public subnets depending on your requirements.
To increase visibility of flows and traffic, VPC Flow Logs are essential.
In Azure#
In a similar vein to AWS, Azure offers Virtual WAN to interconnect your Virtual Networks, especially if you want to operate on a multi-region basis. It is also possible to link VPN or Express Route connections to add a hybrid dimension.
The Azure Firewall will be responsible for filtering both incoming and outgoing traffic. You’ll be able to write three types of rule from layers 3 to 7 of the OSI model (network rules through to domain name filtering). The Premium version also lets you add TLS inspection and intrusion detection and prevention (IDPS).
In this type of architecture, routing tables are essential elements, particularly for redirecting all traffic to the Firewall.
The Network Watcher section will give you the visibility you need for your traffic and also for diagnosing configuration problems.
Finally, the Network Security Groups (NSG) will act as firewalls both at subnet level and at the level of your services’ network interfaces.
In Google Cloud#
Let’s conclude with Google Cloud in a different configuration. The Shared VPC, whether or not associated with a Hub in a Hub & Spoke architecture, provides the ability to delegate one or more subnets to a team within its own Google Cloud project. In other words, you can have a dedicated team for managing the network and security of the host network (configuring firewall rules, routing, etc.) while granting the necessary autonomy to teams wishing to deploy resources within it.
If you want to filter outgoing HTTP/S traffic, the Secure Web Proxy service will be very useful.
A very specific yet highly complex service on Google Cloud, VPC Service Controls allows you to restrict communications between defined perimeters, like a firewall for APIs.
The VPC Flow Logs will provide full traffic observation capability, whether from a Compute Engine virtual machine or a Google Kubernetes Engine (GKE) cluster.
Finally, segmentation within your VPCs and also between VPCs can be achieved using Firewall Rules, particularly by using Firewall Policies if you need to apply this at organisational level.
Internet exposure#
One topic that arises frequently when discussing architecture is how services are exposed on the Internet. One of the challenges is to maintain private connectivity between all the services and expose only what is necessary, while adding a layer of security.
In this example on Azure, Front Door is the infrastructure’s gateway. It acts as a Content Delivery Network (CDN) and a pass-through in the case of a multi-region architecture. This can be enhanced by a Web Application Firewall (WAF) to protect against common threats.
The Application Gateway operates as a load balancer at layer 7 of the OSI model, providing internal and Internet exposure for applications. It is also essential to secure the connections between Front Door and the Application Gateway to prevent your users from reaching your services without going through Front Door. To achieve this, you can add a WAF rule that checks the request carries your Front Door identifier (sent in the X-Azure-FDID header).
In the case of a Premium Firewall, it will be possible to inspect traffic and protect against threats.
Another crucial feature is Private Endpoints, giving you the option of using the default public services in private mode only. This prevents unnecessary exposure of services on the Internet and retains the traffic visualisation capabilities offered by Network Watcher.
How can I build this?#
One thing is sure, the concepts are well-defined and it’s now time to configure and deploy them. Naturally, Infrastructure as Code (IaC) comes into play!
Regardless of the tool used, from Terraform to Crossplane via OpenTofu and Pulumi, IaC has several advantages:
- Auditability: Simplified tracking of changes and modifications using version control (Git);
- Error minimisation: Reduction in human error due to manual actions;
- Consistency: Uniform deployment from one environment to another;
- Compliance: Enforcement of security best practices such as the principle of least privilege and encryption across the entire infrastructure;
- Security analysis: Detection of vulnerabilities within infrastructure as code scripts, such as public IP creation or insecure configurations.
These tools are the ideal candidates for deploying your infrastructure.
To complete the picture, an automated CI/CD pipeline will consistently check language best practices and formatting, not to mention the security tools that can be plugged into it, such as Terrascan, Checkov or Trivy (which now incorporates tfsec).
In addition to the security aspect, it is recommended that unit and non-regression tests be run using a tool such as Terratest.
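As an illustration, here is a hypothetical GitLab CI snippet wiring some of these checks together (image tags, stage names and severity thresholds are assumptions):

```yaml
# Hypothetical GitLab CI stages for a Terraform repository
stages: [lint, security]

fmt-validate:
  stage: lint
  image:
    name: hashicorp/terraform:latest
    entrypoint: [""]
  script:
    - terraform fmt -check -recursive
    - terraform init -backend=false
    - terraform validate

checkov:
  stage: security
  image:
    name: bridgecrew/checkov:latest
    entrypoint: [""]
  script:
    - checkov -d .

trivy-config:
  stage: security
  image:
    name: aquasec/trivy:latest
    entrypoint: [""]
  script:
    - trivy config --exit-code 1 --severity HIGH,CRITICAL .
```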
To conclude on this topic, a good practice, especially when automating your infrastructure as much as possible, is to prevent manual actions by granting only read permissions to platform users. This means only the automated chain has the authorisation to create, update or delete resources.
Of course, it will always be possible to use temporary privilege elevation, as mentioned above, in the event of an incident or execution error.
The aim is to avoid any drift between your code and the actual infrastructure, ensuring that what’s in your code is what is actually deployed (the GitOps approach). Furthermore, this approach promotes code reviews between colleagues before deployment.
Deploying applications#
Now that we’ve talked enough about infrastructure and deployment, it’s time to discuss the different deployment methods and associated services.
Different formats can be deployed in the Cloud:
- Source code: to be deployed in Functions (serverless) services;
- Binary: widely used for deployment on a virtual machine or numerous PaaS-type services such as App Engine;
- Container: the most portable unit, suitable for the vast majority of Cloud services, from virtual machines to Kubernetes clusters.
As you can see, whatever your format, you can choose from a wide range of services. These should be determined on the basis of your knowledge, the team’s ability to operate it on a day-to-day basis and your security constraints.
If your security policies require a hardening of the operating system layer, then the virtual machine may be more appropriate for your use case.
This ties into the shared responsibility model between you and the Cloud provider. The closer you are to an Infrastructure as a Service (IaaS) product, the more layers you will have to secure yourself, from the operating system to your application data.
Speaking of hardening, it can be useful to set up an image factory in the form of an automated chain to build these according to your needs.
The concept behind this is to use the public images provided by Cloud providers and to iteratively create, using tools such as Packer and Ansible, a hardened image that aligns with the company’s security policies, such as the application of the CIS Benchmark, without omitting the appropriate observability layer. The purpose of this first image is to serve as a base image and then inject whatever you want to deploy with it.
In the example above, two images are generated using this hardened image: one to be used as a worker within a Kubernetes cluster and another to deploy GitLab.
With the aim of regular patching, it would be interesting to recreate these images at a defined frequency (for example, every month).
This concept creates a real image library, with IT teams using these images as a base instead of the public defaults.
In the case of containers, the target is the same.
Based on one or more Dockerfile(s), a continuous integration pipeline will perform the following actions:
- Syntax checking: analysis of best practices for optimising image layers;
- Build phase: use of Kaniko instead of Docker in Docker to avoid privilege elevation and security issues;
- Security analysis: Trivy performs a vulnerability scan within the image;
- Image signing: Cosign ensures that the image deployed later is the image signed by this tool;
- Moving to a registry: the final step involves pushing the image into a container registry to make it available to the teams.
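The stages above could be sketched in a hypothetical GitLab CI pipeline (images, variables and ordering simplified; Kaniko pushes to the registry as part of the build step here):

```yaml
# Hypothetical GitLab CI sketch of the container pipeline described above
stages: [lint, build, scan, sign]

hadolint:
  stage: lint
  image: hadolint/hadolint:latest-alpine
  script:
    - hadolint Dockerfile

kaniko-build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor --context "$CI_PROJECT_DIR" --dockerfile Dockerfile --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

trivy-scan:
  stage: scan
  image:
    name: aquasec/trivy:latest
    entrypoint: [""]
  script:
    - trivy image --exit-code 1 --severity HIGH,CRITICAL "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

cosign-sign:
  stage: sign
  image:
    name: bitnami/cosign:latest
    entrypoint: [""]
  script:
    - cosign sign --key "$COSIGN_KEY" "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
```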
Without delving into too much detail, it would be much the same with the creation of a binary.
The aim in each case is to adopt the Shift Left approach and to protect against what is known as the Supply chain attack by checking the dependencies and libraries used within applications or infrastructures.
The last point I wanted to mention is Kubernetes clusters. Kubernetes is a fascinating ecosystem, but setting it up is a whole world in itself.
A number of points must be highlighted:
- Use private clusters to avoid exposing cluster nodes or the API endpoint;
- Configure a Policy as Code tool such as Kyverno to deploy `ClusterPolicy` resources, for example to prevent the use of the root user;
- Use a network layer (CNI) supporting `NetworkPolicy` or `CiliumNetworkPolicy` to filter network connections and isolate applications from each other;
- Automate the detection of container vulnerabilities, even at runtime, with the Trivy Operator, for example.
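As an illustration of the Kyverno point, here is a simplified `ClusterPolicy` requiring pods to run as non-root (a sketch; the official Kyverno sample policy also checks container-level securityContext):

```yaml
# Kyverno ClusterPolicy sketch: require pods to declare runAsNonRoot
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-run-as-non-root
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "runAsNonRoot must be set to true."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```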
The last part is absolutely crucial: an image that doesn’t contain any vulnerabilities today does not guarantee that it won’t in the future. You must constantly monitor the inventory of your images and define alerts in the event of critical or high-level vulnerabilities.
Final word#
Through this talk and this post, I wanted to give you an overview of the current state of security in the Cloud, even though many of these practices apply equally on-premises. As you may have noticed, security is relevant at every level, without exception, from the development of your application to the installation of your infrastructure.
Frequently forgotten and neglected, security is essential to guarantee the long-term viability of what you deploy and to maintain the trust of your customers. So think about its implementation as early as possible to avoid increasing your technical debt.
Obviously, there is no such thing as zero risk, so you need to keep abreast of the latest news and practices, even if it’s just common sense overall.
Bear in mind that this is an iterative process, the most important thing being to make your actions visible and your future work convincing.
In short, security is much more than a set of tools - it’s a real mindset!