Categories
aws security

Are You Ready To Manage Access At Scale?

Back in university, I tended to enjoy the math courses. The first year, though, wasn’t great: I often lost points in exams merely because I made some pretty lame mistakes. I used to have monumentally bad handwriting, so taking the effort to write and draw my math legibly made a gigantic difference. Neatness was a key factor in making my exams significantly less error-ridden.

“monumentally bad handwriting”

Neatness is also a key factor when designing a solution architecture. It helps you identify systematic patterns, ones that you can repeat and eventually automate. Automation helps cloud operations teams reduce the volume of routine tasks that must otherwise be completed manually.

AWS SSO helps you securely connect your workforce identities and manage their permissions centrally across AWS accounts and, possibly, other business cloud applications. How can it fit into a solution that allows you to manage access at scale? Let’s try to figure it out.

Figure-1 demonstrates a common use-case for an AWS SSO based solution architecture in a multi-account AWS environment:

  1. A developer logs in to the external identity provider (e.g., Okta) and launches the AWS app assigned to her Okta group.
  2. A successful login redirects the developer to the AWS SSO console with a SAML2 assertion token. Depending on her association with Okta user groups, the developer gets to select a specific PermissionSet she is allowed to use in a specific AWS account.
  3. The developer uses the AWS IAM temporary credentials returned from the previous step to access AWS resources/services.
Figure-1 – A User With A Job-Function Access A Workload – Example

Figure-2 visualizes a user’s AWS access permissions, which are granted by first adding users to groups in an external identity provider, and then by creating tuples of group, AWS SSO PermissionSet and AWS account. In the absence of “Neatness”, i.e., a solid model for these permission assignments, it is clear how this can grow chaotic. For example, an implementation that includes 20 groups, 20 permission-sets and 20 accounts can end up having 8,000 unique combinations that represent users’ permissions. What AWS permissions are granted to User 2 when she is added to User-Group 2? In which accounts? It’s not so easy to figure out. Not only might security operations be slowed down by this cumbersome process, it also becomes much more error-prone when humans have to configure it all manually.
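The combinatorial explosion above is just multiplication; a quick sanity check:

```python
# Every (user-group, permission-set, account) tuple is a potential
# assignment, so the unconstrained worst case is a plain product.
groups, permission_sets, accounts = 20, 20, 20
combinations = groups * permission_sets * accounts
print(combinations)  # 8000
```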

Figure-2 – AWS SSO Permission-Sets’ Assignments – Illustration

Luckily, this is not too hard to solve.

The “AWS SSO Operator” Solution

We start by defining the conceptual model (see Figure-3) for managing our AWS SSO PermissionSet Assignments. In addition to Neatness, the model also takes into account the Segregation of Duties principle and the advantage of isolating SDLC environments.

Figure-3 – The Conceptual Model For Permission Assignments

In this conceptual model, a user is associated with one or more job-functions. The model globally fixes the SDLC environments each job-function is allowed to access; e.g., the ‘tester’ job-function is limited to development and staging environments only. Each user is assigned to zero or more user-groups. Each user-group is associated with a single unique combination of job-function and workload. Each PermissionSet is associated with only one job-function but one or more workloads. Each account, on the other hand, is associated with a single unique combination of workload and SDLC environment.

It simply means that user-groups, PermissionSets and accounts are matched based on their workload, job-function and SDLC environment attributes.

The model may change according to your company’s needs and policies. Companies choose to define and implement SDLC environments slightly differently, and job-functions also vary from one company to another. Thus it is important to keep the model highly configurable.
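One way to keep the model configurable is to express it as plain data. A minimal sketch; the job-function names and SDLC environments here are illustrative, not a prescribed set:

```python
# Illustrative encoding of the conceptual model. Changing company policy
# (new job-functions, different SDLC environments) is then a data change,
# not a code change.
MODEL = {
    "sdlc_environments": ["development", "staging", "production"],
    "job_functions": {
        "developer": {"environments": ["development"]},
        "tester": {"environments": ["development", "staging"]},
        "cloudops": {"environments": ["development", "staging", "production"]},
    },
}

def allowed_environments(job_function: str) -> list:
    """Return the SDLC environments a job-function may access."""
    return MODEL["job_functions"][job_function]["environments"]
```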

Figure-4 demonstrates the realization of the model and the matching process by which AWS SSO PermissionSet Assignments are created.

Figure-4 – Permissions As Tuples of User-Group, PermissionSet and Account

  • User-groups follow a naming convention, a syntax, that incorporates job-function and workload identifiers in their names. User profile attributes (e.g., ‘team’, ‘project’) may also be implemented to enable ABAC for finer grained permissions.
  • Each AWS SSO PermissionSet is tagged with ‘workloads’, a list of one or more workload names. Reserved values allow us to keep this list short by indicating the population of workloads to use, e.g.:
    • workload_all – the PermissionSet matches all workloads
    • workload_match – the PermissionSet matches multiple pairs of workload and user-group that share the same workload identifier.
  • There is a special treatment for Sandbox accounts, which are assigned to a single owner.
  • Each AWS SSO PermissionSet is also tagged with ‘job_function’, which semantically is similar to an AWS IAM Job Function.
  • Each AWS Account is tagged with ‘workload’.
  • Each AWS Account can represent a single workload only.
  • AWS Organizational Units (OUs) in the second level of the hierarchy use reserved names that represent the SDLC environments your company uses.
  • More on the “awsssooperator” workload is coming next.
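The matching rules above can be sketched as a single pure function. The group-name syntax used here (`<job_function>_workload_<workload>`, inferred from the `cloudops_workload_myapp` example later in the post) and the data shapes are assumptions; adjust both to your own convention:

```python
import re

# Assumed group-name syntax: "<job_function>_workload_<workload>",
# e.g. "cloudops_workload_myapp". Adapt the regex to your convention.
GROUP_RE = re.compile(r"^(?P<job_function>\w+?)_workload_(?P<workload>\w+)$")

def match_assignments(group_name, permission_sets, accounts, model):
    """Yield (group, permission-set, account) tuples implied by the model."""
    m = GROUP_RE.match(group_name)
    if not m:
        return  # group does not follow the naming convention -> no matches
    jf, wl = m.group("job_function"), m.group("workload")
    allowed_envs = model["job_functions"][jf]["environments"]
    for ps in permission_sets:
        if ps["job_function"] != jf:
            continue
        # Reserved values: 'workload_all' matches every workload;
        # 'workload_match' requires a shared workload identifier.
        workloads = ps["workloads"]
        if not ("workload_all" in workloads
                or "workload_match" in workloads
                or wl in workloads):
            continue
        for account in accounts:
            if account["workload"] == wl and account["sdlc_env"] in allowed_envs:
                yield (group_name, ps["name"], account["id"])
```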

Now that our model is implemented and can be used to systematically produce AWS SSO PermissionSet Assignments, wouldn’t it be nice to automate everything? Figure-5 presents the “AWS SSO Operator”, an opinionated solution that automates the provisioning/de-provisioning of AWS SSO PermissionSet Assignments using the metadata from our implemented model. It keeps the AWS SSO PermissionSet Assignments in sync with the model at virtually all times: it continuously evaluates the actual PermissionSet Assignments state against the desired state and acts to remediate any gap.

Figure 5 – AWS SSO Operator Solution Architecture

The two main flows are distinguished by “event-based” and “time-based” triggers. They perform similar logic, except that the time-based flow (B) runs periodically and analyzes the entire model to ensure the AWS SSO PermissionSet Assignments are in their desired state, initiating remediation actions as needed. The event-based flow (A) analyzes and remediates only the AWS SSO PermissionSet Assignments that are in the context of the event. The main steps of the event-based flow:

  1. Perform cross-region event re-routing for the AWS Organizations MoveAccount event. This is only required if your AWS SSO Operator runs in a region other than N. Virginia (us-east-1). A separate SAM application is used for this purpose.
  2. Event Rules are implemented in AWS CloudWatch Events (the precursor of AWS EventBridge) to intercept the relevant events and target them at the appropriate AWS Lambda Functions (“Event Handlers”), e.g., MoveAccount (source: organizations.amazonaws.com), TagResource (source: sso.amazonaws.com), etc.
  3. The AWS Lambda Functions retrieve Okta & AWS SSO SCIM access tokens. The Event Handlers need these to access the Okta and AWS SSO SCIM APIs.
  4. Okta user-groups are loaded by the Event Handlers so they can be matched against AWS SSO PermissionSets and AWS accounts based on their model attributes (i.e., job-function, workload and SDLC environment). At the time of writing, the AWS SSO SCIM API is still limited; therefore the Okta API is used to iterate over the entire Okta user-group list.
  5. Verify that the user-groups we loaded from Okta exist in AWS SSO SCIM; otherwise we ignore them.
  6. Retrieve the AWS account’s workload tag and its parent OU (SDLC environment) so the account can be matched by workload and job-function (job-functions are restricted to specific SDLC environments).
  7. The AWS SSO API is used to create/delete PermissionSet Assignments. The API is asynchronous and needs to be monitored.
  8. The PermissionSet Assignment request (create/delete) events are sent to AWS SQS (“PermissionSet Assignments Monitoring”), so they can be monitored asynchronously.
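Step 7 can be sketched with the AWS SSO Admin API via boto3. The client is injected to keep the sketch testable, and the ARNs and ids are placeholders:

```python
def request_assignment(sso_admin, instance_arn, permission_set_arn,
                       account_id, group_id):
    """Submit an asynchronous PermissionSet Assignment creation request
    (step 7) and return the request id that step 8 queues for monitoring.
    `sso_admin` is a boto3 'sso-admin' client."""
    response = sso_admin.create_account_assignment(
        InstanceArn=instance_arn,
        PermissionSetArn=permission_set_arn,
        TargetId=account_id,
        TargetType="AWS_ACCOUNT",
        PrincipalType="GROUP",
        PrincipalId=group_id,
    )
    # The request id is what the monitoring flow (C) polls later.
    return response["AccountAssignmentCreationStatus"]["RequestId"]
```

The deletion path (`delete_account_assignment`) is symmetric.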

The PermissionSet Assignments Monitoring flow (the third flow; C) sends audit events to SecOps (or to a SIEM system, for that matter). Each event includes information about the provisioning/de-provisioning of a PermissionSet Assignment:

  1. The SQS2StepFunc Lambda Function is triggered to receive batches of messages from the AWS SQS queue to which steps A8 and B7 of the previous flows send audit events.
  2. The Lambda Function starts an execution of the “Monitor Provisioning Status” AWS Step Functions state machine. Figure-6 presents the corresponding state machine, which processes the events one by one using the Map construct of AWS Step Functions.
  3. If the PermissionSet provisioning status is not final (i.e., still “in-progress”, neither “succeeded” nor “failed”), the state machine continues polling the status from AWS SSO.
  4. Eventually, the final PermissionSet provisioning status is reported back to SecOps/CloudOps by publishing an AWS SNS message.
Figure-6 – AWS SSO Operator – PermissionSet Assignments Monitoring – AWS Step Functions
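The polling step of the state machine boils down to one API call; a sketch with an injected boto3 'sso-admin' client:

```python
FINAL_STATUSES = {"SUCCEEDED", "FAILED"}

def poll_assignment_status(sso_admin, instance_arn, request_id):
    """One iteration of the 'Monitor Provisioning Status' loop: return the
    current status; the state machine keeps looping while IN_PROGRESS and
    publishes to SNS once the status is in FINAL_STATUSES."""
    response = sso_admin.describe_account_assignment_creation_status(
        InstanceArn=instance_arn,
        AccountAssignmentCreationRequestId=request_id,
    )
    return response["AccountAssignmentCreationStatus"]["Status"]
```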

  • The AWS SSO Operator does not mandate the use of SAML-assertion / user-attribute based ABAC, but it is definitely something to consider when implementing fine-grained permissions at scale, as it significantly helps lower the number of user-groups. For example, let’s assume we have a “Team” user attribute in Okta. Users of “Team_Alpha” in the “cloudops_workload_myapp” user-group may have access to the AWS S3 object s3://some-bucket/some-key/myobj (production account; workload “myapp”), while users of “Team_Bravo” in the same user-group may not.
  • To speed up development, the AWS SSO Operator solution is written as a couple of standalone Python modules that can be easily tested locally and then deployed as Lambda Layers, keeping the Lambda function code lean, a mere protocol wrapper. AWS SAM is used for infrastructure-as-code.
  • The serverless application and the code are designed to work around the limitations of the (Okta and AWS) APIs in use, such as rate limits. It uses queues to buffer API calls, batch processing, caching, etc.
  • The AWS SSO Operator has built-in error handling that uses a Dead-Letter Queue (AWS SQS) to surface operational errors to CloudOps personnel.
  • You can now administer AWS SSO from a delegated member account of your AWS Organization. This is a major advantage, as you no longer need to log in to the highly sensitive master account in order to use AWS SSO.
  • AWS SSO PermissionSets may also be used to implement secure remote access to AWS EC2 instances via AWS Systems Manager Session Manager. If you choose a different vendor/service (e.g., Zscaler) for remote access, you would need to handle the permission assignments separately.

The AWS SSO Operator and the model it introduces allow administrators of the external identity provider (e.g., Okta) to define user-groups with clear semantics of how those groups are associated with AWS SSO PermissionSets and AWS accounts. Finally, the association of AWS permissions to user-groups and accounts is done automatically, which makes things more productive and more secure, as it reduces the human risk factor.

Thoughts About Permissions Management

Up until now we have taken care of connecting the dots: assigning permissions to users by associating user-groups, AWS SSO PermissionSets and AWS accounts. But what about the actual permissions? Is there a way to automatically generate the exact least-privilege permissions required for a given job-function and the relevant workloads? Unfortunately, there is still no silver bullet that fully addresses this challenge.

And yet that does not mean we need to give up security, scale or speed – there is a lot we can do.

Permissions are implemented as AWS IAM policies attached to AWS SSO PermissionSets, AWS Organizations SCPs (or/and AWS Control Tower Preventive Guardrails), AWS Resource-based policies, AWS Session policies and IAM permissions boundaries. The AWS Policy Evaluation Logic uses all of these policies to determine whether permission to access a resource is allowed or denied.

Managing permissions is not a one-team show; multiple parties are responsible for making this process of handling requests for permissions secure and effective at scale. Responsibilities may vary depending on your company’s operating model, but usually there are three roles involved:

  1. Engineering is responsible for developing the application and raising their access requirements to the Cloud Infrastructure Security role.
    Any delay in granting the permissions these users need would impact their ability to deliver on time.
  2. Cloud Infrastructure Security is responsible for defining and implementing (via CI/CD and Infrastructure-as-code) the AWS SSO PermissionSets and the corresponding IAM policies.
  3. IT is responsible for implementing and operating the external identity provider (e.g., Okta). That includes provisioning of users, user-groups and adding/removing users to/from user-groups.

Our permission assignments model assists the IT team in managing user-groups and users at scale. The naming convention, the syntax, we use for user-groups helps eliminate ambiguity and also opens the door for further automation.

The Cloud Infrastructure Security team, however, owns the implementation and the responsibility to enforce the least-privilege principle. Numerous requests for cloud permissions take place daily and are expected to be handled almost immediately. The Cloud Infrastructure Security team handles these requests, usually by hand-crafting AWS IAM policies, which are iteratively tested and corrected as needed. This process is manual, time-consuming, inconsistent, and often suffers from trial-and-error repetition. Our model forces a certain order: IAM policies are defined for job-functions in the context of workloads, and to some extent that eases the process, since the job-function must be well defined. Still, the process is quite cumbersome and the Cloud Infrastructure Security team might quickly become a bottleneck.

  • The Cloud Infrastructure Security and IT teams shall handle permission revocations in cases where a permission is unused or a user is no longer entitled to it, e.g., due to user deactivation. Automating this process is not covered here; it deserves its own post.
  • The services and actions of an IAM policy are determined in accordance with the job-function definition.

As the company scales, this kind of centralized, manual management approach falls over, becoming impractical for both operations teams and their users.

The following strategies, which are not mutually exclusive, can come in handy for effectively operating permissions at scale.

Decentralized Permissions Management

Managing permissions centrally may work for small businesses, but it does not scale very well, which is why you would want the option to delegate most of it to application owners, who can become more independent. There are multiple ways to implement the delegation model, but the one that keeps your application owners autonomous is probably preferred, especially when those team members are also accountable for the business outcome. Since the AWS SSO Operator solution takes care of permission assignments to user-groups and workloads, the delegated application team members would only need to manage the lifecycle of AWS SSO PermissionSets (create/update/delete), and only the ones they are entitled to manage (specific workloads, job-functions). Figure-7 demonstrates permission delegation to a SecOps job-function of the “myapp” workload. A more detailed example of how permission delegation works in AWS SSO can be found here.

Figure-7 – AWS SSO PermissionSet Custom Policy
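Figure-7 itself is not reproduced here. A hedged sketch of the shape such a delegation policy could take, assuming permission-sets are tagged with ‘workload’ and ‘job_function’ as described earlier; the actions and condition keys are illustrative, not taken from the post:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ManageMyappSecopsPermissionSets",
      "Effect": "Allow",
      "Action": [
        "sso:UpdatePermissionSet",
        "sso:DeletePermissionSet",
        "sso:PutInlinePolicyToPermissionSet"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/workload": "myapp",
          "aws:ResourceTag/job_function": "secops"
        }
      }
    }
  ]
}
```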

  • Decentralized permissions management does not mean nothing is managed centrally. Certain operational aspects of AWS SSO are likely to remain managed centrally by your Cloud Infrastructure Security team. For example:
    • Permission delegation and policies like the one illustrated in Figure-7
    • ABAC attribute mapping to SAML2 attributes from your external identity provider
    • Cross-Functional job-functions’ permissions, e.g., secops, security-auditor, finops, support-ops, etc.
    • AWS SSO permission boundaries, which allow us to limit the permissions we delegate.
  • None of the AWS SSO Operator responsibilities around permission assignments & provisioning is delegated to anyone; only the AWS SSO Operator is entitled to handle them.
“Shift-Left”

Moving security, and specifically permission-management aspects, to an earlier stage of the development lifecycle makes it easier to put more automation in place and therefore improves your ability to scale. Shifting left in this case means scanning, inspecting and identifying IAM policy security issues right in the IDE or in CI pipelines. You want to detect and remediate, as early as possible, overly permissive IAM policies that might allow unintended capabilities such as privilege escalation, resource exposure, data exfiltration, credentials exposure, infrastructure modification, etc. A tool like Cloudsplaining can be integrated into your CI/CD to fail builds that contain such security issues. Similarly, AWS IAM Access Analyzer can be integrated into your CI/CD (as in this example) to validate IAM policies against AWS security best practices and detect security issues, including those associated with the principle of least privilege. Checkov is a static analysis tool for infrastructure-as-code (IaC), and it too can be integrated into your CI/CD to run security checks. Checkov also integrates with Cloudsplaining to add more IAM checks that flag overly permissive IAM policies in IaC templates.
If you seek best practices around automated testing of IAM policies, the AWS IAM Policy Simulator can be integrated into your CI/CD for unit testing IAM policies (as in this example), making sure they are fully functional in the context of the target accounts (e.g., is there an SCP blocking a service or action you plan to use?). The AWS IAM Policy Simulator is useful for testing policies and identifying permission errors early in the development lifecycle. It is not designed to assist with the least-privilege principle; however, when we implement this principle the chances for errors increase, and we want to make sure our tightened policies still work.
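As one concrete shift-left example, IAM Access Analyzer policy validation is exposed through the `accessanalyzer` API and can gate a pipeline. A sketch; the severity threshold is a choice, not a rule, and the client is injected for testability:

```python
# Finding types that should fail the build; SUGGESTION/WARNING are
# surfaced but non-blocking in this (illustrative) policy gate.
BLOCKING_FINDING_TYPES = {"ERROR", "SECURITY_WARNING"}

def blocking_findings(access_analyzer, policy_document: str) -> list:
    """Validate an identity policy and return the findings that should
    fail the build. `access_analyzer` is a boto3 'accessanalyzer' client."""
    findings = access_analyzer.validate_policy(
        policyDocument=policy_document,
        policyType="IDENTITY_POLICY",
    )["findings"]
    return [f for f in findings if f["findingType"] in BLOCKING_FINDING_TYPES]
```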

These are just examples of tools that can assist in pushing some elements of permission management to the early phases of the development lifecycle in a scalable manner.

Permissions Boundaries

Another useful technique that helps with the least-privilege principle is to set hard boundaries on policies, so that no matter what permissions a policy declares, the effective permissions can never go beyond what is allowed by the permission boundaries. This measure is highly effective in regaining control over permissions that are granted at high scale. An AWS SSO PermissionSet allows defining permission boundaries to limit its policies. AWS Control Tower supports permission boundaries for the entire organization in the form of preventive Guardrails and Custom Guardrails (which are implemented as AWS Organizations SCPs). AWS IAM supports permission boundaries for IAM Roles/Users. Another type of “permission boundary” is controlled via account-level configuration, by which we can block certain permissions for an entire AWS account, e.g., blocking public access to all AWS S3 buckets within an account.
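For AWS SSO specifically, attaching a boundary to a PermissionSet is a single call on the `sso-admin` client. A sketch; the ARNs are placeholders and the client is injected:

```python
def attach_permissions_boundary(sso_admin, instance_arn,
                                permission_set_arn, boundary_policy_arn):
    """Cap a PermissionSet's effective permissions with a managed policy
    acting as its permissions boundary; whatever the PermissionSet's own
    policies allow, access outside the boundary is denied."""
    sso_admin.put_permissions_boundary_to_permission_set(
        InstanceArn=instance_arn,
        PermissionSetArn=permission_set_arn,
        PermissionsBoundary={"ManagedPolicyArn": boundary_policy_arn},
    )
```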

Permission boundaries are crucial for the success of decentralized permission management. These are preventive controls that help ensure the delegated teams cannot go too far wrong. The Cloud Infrastructure Security team is a good candidate to own the management of permission boundaries, and AWS Control Tower preventive Guardrails in particular.

Permissions Monitoring

Detective controls include security monitoring tools, which can also come in handy in supporting the least-privilege principle. More importantly, these security monitoring tools are there to ensure your security compliance requirements are met. They provide you with the capability to automatically detect permission-related issues and respond to them by taking automatic/manual remediation actions. Figure-8 illustrates a common way of doing that on AWS.

Figure-8 – Permissions Monitoring

AWS GuardDuty, AWS IAM Access Analyzer and AWS Config support pre-defined permission-related checks. AWS Config also supports custom rules for custom checks. These are high-level services that require minimal or no coding at all. Don’t miss the opportunity to score some quick wins!

You can also use any of the log events ingested into your SIEM (e.g., AWS CloudTrail, CloudWatch Logs, etc.) to detect security events. The AWS SSO Operator implements very simple monitoring of the AWS SSO PermissionSet assignment and provisioning activities. The solution can be extended to send these logs to a SIEM and, for example, verify whether “break-glass” or “power-user” permissions are assigned to inappropriate user-groups.

  • Continuous permission checks followed by corrective actions are best handled locally, from within the account where the issues are found, i.e., in a decentralized manner (see example). This is in contrast to the centralized approach illustrated in Figure-8, where security events are propagated to other systems, e.g., a SIEM.
  • Continuous permission corrective actions may pose another challenge, since permissions are most likely managed as source code (IaC) in a source control repository, which is also the source of truth. The challenge in this case is keeping the source code in sync with the corrected permissions.
Define Permissions Based on Usage Analysis

Reverse-engineering is the act of dismantling an object to see how it works. It is done primarily to analyze and gain knowledge about the way something works but often is used to duplicate or enhance the object.

AWS IAM Access Analyzer generates IAM policies in a process similar to reverse-engineering. It analyzes the identity’s historical AWS CloudTrail events and generates a corresponding policy based on the access activity. You should not consider the generated policy the final product, but it is still much better than nothing. The generated policy requires tuning: adding/removing permissions, specifying resources, adding conditions, etc.

Let’s assume we want to generate a policy for the CloudOps job-function. We start by creating another job-function, e.g., CloudOpsTest, which, in order to lower the risk, is enabled in development environments only. Then we create an overly permissive AWS SSO PermissionSet for that job-function, for a limited period of time only, during which an identity with that job-function uses the PermissionSet to execute the playbooks/runbooks the CloudOps job-function is required to support. Once we are done, similar to reverse-engineering, we can generate, based on all the actions we performed, an IAM policy that reflects the services and actions CloudOps needs to do the job. Last but not least, we fine-tune the generated policy by specifying resources, adding conditions, etc. Ta-da!
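The reverse-engineering flow above maps to IAM Access Analyzer's policy-generation API; a boto3 sketch in which the ARNs and the 30-day window are illustrative and the client is injected:

```python
from datetime import datetime, timedelta

def generate_policy_job(access_analyzer, principal_arn,
                        trail_arn, access_role_arn, days=30):
    """Start policy generation from a principal's CloudTrail activity
    (e.g., the temporary CloudOpsTest identity). Returns a job id; the
    draft policy is fetched later with get_generated_policy(jobId=...)."""
    end_time = datetime.utcnow()
    job = access_analyzer.start_policy_generation(
        policyGenerationDetails={"principalArn": principal_arn},
        cloudTrailDetails={
            "trails": [{"cloudTrailArn": trail_arn, "allRegions": True}],
            "accessRole": access_role_arn,
            "startTime": end_time - timedelta(days=days),
            "endTime": end_time,
        },
    )
    return job["jobId"]
```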

Not only does AWS IAM Access Analyzer make it easier to implement least-privilege permissions by generating IAM policies based on access activity, it also saves the time spent hand-crafting the PermissionSet policy and figuring out which services and actions should be included.

  • Overly permissive policies are often used in non-production accounts for test purposes, to minimize the effort of securing least-privilege access.
  • This is an iterative process as job-function requirements and the infrastructure used by workloads evolve all the time.
Using ABAC For Fine-Grained Permissions

Imagine your workload is maintained by two different application teams. Each team is responsible for different datasets, which are stored in the same AWS S3 bucket. Each team has its own CloudOps engineer, who is responsible for operating the production environment. The permission assignment model we introduced earlier would assign the same permissions to each of the CloudOps engineers, simply because they share the same workload and the same job-function. How can we implement finer-grained permissions so that each can only access data belonging to her team? One option would be to split the job-function into two (e.g., CloudOpsTeamOne) and use two separate PermissionSets, but that does not scale well. Another option, which scales better, is to take the ABAC approach, using a user attribute and an AWS resource tag to limit access based on their values. Figure-9 demonstrates a policy that implements the desired ABAC for our fictitious use-case.
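The Figure-9 policy is not reproduced here; a sketch of the kind of ABAC condition it could contain, assuming a ‘Team’ user attribute mapped to a principal tag and S3 objects tagged with a matching ‘Team’ tag (the bucket name and tag names are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TeamScopedRead",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::some-bucket/*",
      "Condition": {
        "StringEquals": {
          "s3:ExistingObjectTag/Team": "${aws:PrincipalTag/Team}"
        }
      }
    }
  ]
}
```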

Figure-9 – An Example of a PermissionSet Policy That Uses ABAC

Conclusion

The AWS SSO Operator solution allows you to automate the assignment of AWS permissions to user-groups and workloads. The AWS SSO Operator reduces some operational overhead and makes things more secure, as it reduces the human risk factor.

On the other hand, managing AWS SSO permissions while adhering to the least-privilege principle requires a more holistic strategy. There are pragmatic measures you can take to operate and manage permissions at scale while improving your security posture in this area too. It is not entirely unlikely that we’ll soon start seeing solutions/services from vendors automating permissions monitoring, provisioning and adjustment, making it simple to follow the principle of least privilege (PoLP). Until then, make sure you have a solid story around it.

All the security controls you put in place are meant to address the business risks you have identified. Not all applications and businesses are equally sensitive; thus, you should not invest in fancy solutions without being able to identify the risk you are trying to mitigate and how significant it is to your business.


The Secret Sauce Of Effective Secrets Management

Sometimes secrets are nothing more than credentials used to authenticate client applications & users and provide them access to sensitive systems, services, and information. How would you operate these secrets effectively and at scale, such that they remain secret?
Because secrets have to be distributed securely, secrets management solutions must account for and mitigate the risks to these secrets in transit, at rest and in use.
Do not hold your breath: secrets management vendors will not handle all of that for you, simply because they cannot. Don’t get me wrong, these vendors do provide you with the foundation to manage secrets, but there is still responsibility on your end to keep your secrets management solution secure.

Tip #1: For each service your workload uses, be well aware of the shared responsibility model defined by the provider and make sure you understand where the responsibility of the vendor ends and yours begins.

Generally speaking, cryptographic private and symmetric keys are managed separately via Key Management System (KMS) and Hardware Security Module (HSM) technologies, and this topic deserves its own post. These technologies have a lot in common and usually complement one another, as illustrated in Figure-1. Secrets management is a higher-level concept; don’t confuse it with KMS. Secrets management technologies include built-in KMS capabilities to support the cryptographic operations required to manage secrets.

Figure-1 – Secrets Management, KMS and HSM

The Darkness Before The Light

Just a few years back, enterprises suffered from a massive proliferation of passwords, passphrases, private/symmetric cryptographic keys and API keys, scattered all over the place. The business had to make sure these secrets were stored, distributed and rotated in a secure manner, with minimal or no impact on production environments. The ability to keep humans away from secrets was limited. Operating these processes was no picnic either, which led to certain shortcuts like infrequent secret rotations, ‘standard’ well-known passwords, etc. To make the chaos less chaotic, wherever possible we implemented single sign-on (LDAP, Kerberos, SAML federation, etc.), which helped reduce the volume of secrets we had to manage. But we still had more than a handful of secrets to handle. So we encrypted them using keys, which by themselves are just more secrets to protect, and sometimes those keys were managed via HSMs, which were not always used for the right reasons. If that’s not bad enough, there was no standard, unified approach to operating these technologies, and automation was a luxury we could rarely afford. As you can imagine, this approach did not scale very well, and it was quite a nightmare for both operations and security personnel.

What Good Looks Like?

The ideal secrets management solution allows both humans and machines to securely access and use secrets. Furthermore, it limits the attack surface by making sure our secrets are either automatically rotated or automatically provisioned & revoked at relatively short time intervals (emulating temporary credentials). Basically, the entire lifecycle of our secrets is fully or mostly automated, with no or minimal human intervention. Keeping humans away reduces the risk of secrets eventually being leaked.

Depending on your business, you may need your customers to authorize your system to access their SaaS accounts on their behalf. Sounds familiar? For example, it could be an application that organizes users’ photos in Google Photos, or an application that reads users’ new Gmail messages out loud. For that to work, your system stores and uses secrets that allow it to access customers’ SaaS accounts on their behalf. Your customers trust you to keep their precious secrets safe; do not disappoint them. Make these secrets as important to you as they are to them. To manage customers’ secrets properly, we want to maintain tenant isolation and prevent cross-tenant access. We keep these secrets away even from our own system administrators. Also, as opposed to system secrets, customers’ secrets often need to be supported at a much higher scale. Moreover, despite our desire to always automate secret rotation with no human intervention, occasionally this is simply not up to us, as some service providers do not support it. In these cases we monitor the manual secret rotation process and alert our security personnel if, for any given secret, the rotation policy is not met.
From the moment a secret is created up to the point it is deactivated, it must be secured all the way.

Tip #2: If you have a viable option to use temporary credentials to access a resource/service (e.g., AWS IAM Temporary Security Credentials), seriously consider giving it precedence over alternatives that involve the use of secrets (e.g., on AWS there are multiple alternatives you can use other than SSH keys).

Let’s go through some use cases that illustrate the three main phases: storing a secret, distributing it to its destination(s) and, where possible, rotating it automatically.

Use Case #1: Incident-Response Playbook

Let’s assume your system makes use of a secrets management solution and you are the company’s SecOps engineer. At some point, your security monitoring system generates an alert indicating that the master credentials (a system secret) of a very sensitive database have been leaked. Even though the database resides in the company’s private network, company policy guides you to follow an incident-response playbook that addresses this exact incident. As you can imagine, this sequence of steps is also quite useful for remediating an automatic secret rotation failure. The process is illustrated in Figure-2:

  1. SecOps: signs in to the cloud account
  2. SecOps: triggers a secret rotation sequence for the compromised database credentials via the Secrets Manager
  3. Secrets Manager: generates a new random secret, rotates the database credentials either directly or via the identity provider it is configured to use (e.g., Microsoft Active Directory), tests the new credentials and then stores them in the Secrets Manager
  4. Secrets Manager: notifies the application that the database credentials have changed
  5. Application: retrieves the new database credentials
  6. Application: re-initializes its database connections and starts using the new credentials

CloudOps engineers, who may need to apply database schema fixes, go through a flow very similar to the application’s flow in order to gain database access.
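The trigger step of the playbook (step 2 above) can be sketched with a boto3-style call; the function wrapper and the secret name below are illustrative assumptions, not part of the original playbook.

```python
# Hypothetical sketch: SecOps kicks off an immediate rotation of a
# compromised secret. In production, `secrets_client` would be
# boto3.client("secretsmanager"); the secret name here is made up.

def trigger_emergency_rotation(secrets_client, secret_id):
    """Start the rotation flow attached to the secret right away and
    return the version id of the new, pending secret value."""
    response = secrets_client.rotate_secret(SecretId=secret_id)
    return response["VersionId"]

# e.g.: trigger_emergency_rotation(boto3.client("secretsmanager"),
#                                  "prod/orders-db/master")
```

The same call also remediates a failed scheduled rotation, which is why this playbook doubles as the recovery path mentioned above.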

Tip #3: While secure access to secrets reduces risk, the preferred approach is always to automate your way out of needing human access in the first place.
Figure-2 – Incident-Response Playbook – Compromised (Secret) Credentials
Tip #4: Leverage the alternating users rotation strategy and keep credentials for two users in one secret in order to support high-availability.
Tip #5: Your secrets management solution is incomplete if it is not connected to a SIEM or the like to monitor secret management activity and alert on access anomalies, non-compliant secrets (e.g., secrets which failed to be rotated) and other threats. Since secrets are often credentials for other service providers, make sure these services are regularly audited and connected to your SIEM as well.

Use Case #2: Manually Operating Secret Rotation

Although a great many services already support API-based credential rotation, occasionally we come across services which do not (e.g., at the time of writing, AWS SSO Automatic Provisioning and its access tokens is one such example). Even though automatic rotation cannot be supported in these cases, as described in Figure-3, we can still detect the secrets that must be rotated just before they fail to meet security compliance policies:

  1. Secrets Manager: using its internal scheduler, generates an event notifying that a given secret is reaching its time for rotation.
  2. Secrets Manager: processes the event by sending an alert to SecOps
  3. SecOps: signs in to the cloud account
  4. SecOps: follows the manual sequence defined by the service provider to rotate the credentials.
  5. SecOps: stores the credentials in the Secrets Manager.
  6. Secrets Manager: notifies the application that the service credentials have changed
  7. Application: retrieves the new service credentials
  8. Application: re-establishes its service connections and starts using the new credentials
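The detection in step 1 can be sketched as a small compliance check over `ListSecrets`-shaped entries; the field names mirror the AWS Secrets Manager API, while the policy numbers (90-day maximum age, 7-day warning window) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def overdue_secrets(secrets, max_age_days=90, warn_days=7):
    """Return the names of secrets that are past, or within `warn_days` of,
    their rotation deadline. `secrets` is a list of dicts shaped like
    AWS Secrets Manager ListSecrets entries (Name, LastRotatedDate)."""
    now = datetime.now(timezone.utc)
    # Flag a secret `warn_days` before it actually violates the policy,
    # leaving time for the manual rotation sequence to complete.
    threshold = timedelta(days=max_age_days - warn_days)
    flagged = []
    for secret in secrets:
        last = secret.get("LastRotatedDate") or secret.get("LastChangedDate")
        if last is None or now - last >= threshold:
            flagged.append(secret["Name"])
    return flagged
```

A scheduled job running this check and pushing the result to an alerting topic would implement steps 1 and 2 of the flow above.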
Figure-3 – Monitoring & Manually Operating Secret Rotation (if you must…)
Tip #6: For improved performance and reliability, especially in highly-distributed systems, consider reducing the coupling with the secrets manager by distributing and caching secrets for short time intervals in a secure, ephemeral, local store, which allows applications to process secrets most efficiently.
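One way to read Tip #6 is a short-TTL in-memory cache in front of the secrets manager. Below is a minimal sketch under that assumption; the class, the default TTL and the `fetch` callback are all illustrative, not a prescribed design.

```python
import time

class SecretCache:
    """Serve a secret from memory for a short interval so hot request
    paths do not hit the secrets manager on every use (see Tip #6)."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # e.g., a closure over GetSecretValue
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if now >= self._expires_at:  # expired (or never fetched): refresh
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Keeping the TTL short bounds how stale a cached value can be after a rotation, which is the trade-off the tip is pointing at.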
Tip #7: Secrets are best protected within the secrets manager. If a secret must remain highly protected while it is being used, distributing it to a non-compliant store might not be an option. This is where technologies such as AWS Nitro Enclaves can really shine: they provide an isolated execution environment, which allows you to protect sensitive data while it is in use, even in untrusted environments.

Use Case #3: Authorize Access To 3rd Party Service Account

Figure-4 illustrates the use-case in which a customer of yours follows the steps to authorize your application to access their 3rd party SaaS account (e.g., Salesforce CRM) on their behalf:

  1. Customer: signs in to your system.
  2. Application: redirects the customer to the IdP of the customer’s 3rd party SaaS provider, requesting the customer to authorize your system.
  3. Customer: signs in to her SaaS account and submits her consent to authorize your system.
  4. Application: stores a secret (e.g., the authorization token returned by the IdP) that grants it permission to access the customer’s 3rd party SaaS account.
  5. Secrets Manager: notifies the application that the secret of that customer changed.
  6. Application: retrieves the customer’s new secret.
  7. Application: connects to the customer’s 3rd party SaaS account and starts using it.

A secrets management solution that scales in proportion to your number of customers plays an important role here.
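One simple way to keep customer secrets tenant-isolated at this scale is a per-tenant naming and tagging convention (step 4 above). The scheme below is an illustrative assumption, not a prescribed layout; `secrets_client` is a boto3-shaped client.

```python
def tenant_secret_name(tenant_id, provider):
    """One secret per tenant per SaaS provider; the path-style name lets
    IAM policies be scoped to a single tenant's prefix."""
    return f"tenants/{tenant_id}/{provider}/oauth-token"

def store_customer_token(secrets_client, tenant_id, provider, token_json):
    """Persist the authorization token handed back after the customer's
    consent. The tenant tag enables tenant-aware authorization policies
    (see Tip #9 below)."""
    name = tenant_secret_name(tenant_id, provider)
    secrets_client.create_secret(
        Name=name,
        SecretString=token_json,
        Tags=[{"Key": "tenant", "Value": tenant_id}],
    )
    return name
```

With this layout, an IAM policy can restrict each tenant-scoped workload to `tenants/<tenant_id>/*`, preventing cross-tenant access by construction.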

Tip #8: Every service has its limits (e.g., AWS Secrets Manager quotas); make sure your secrets management solution complies with your requirements (e.g., latency, number of secrets, API call rate limits, secret size, etc.).
Figure-4 – Customer Authorizes Access To Her 3rd Party Service Account
Tip #9: To prevent cross-tenant access, make sure your SaaS architecture enforces tenant-aware authorization policies.
Tip #10: Adhere to the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your secrets management solution. A secrets management solution that is integrated with a strong identity foundation is a key prerequisite for that.

Use Case #4: Automated Secret Rotation

Automating secret rotation significantly reduces the risk of credential leakage by eliminating the need to run such sensitive security operations manually.

Tip #11: Be biased toward automating secret rotation; it is a key enabler for scaling out your security operations around secrets management.

Figure-5 illustrates the use-case in which a service provider is initialized with a secret just once during its deployment and then the automatic secret rotation kicks off immediately:

  1. CI/CD: a deployment tool configures the service provider to use credentials it randomly generates on the fly. The deployment tool then uses the same credentials to initialize the secret’s value in the Secrets Manager.
  2. Secrets Manager: if the secret has just been initialized, automated secret rotation is triggered almost immediately; otherwise, using the secrets manager’s internal scheduler, automated secret rotation is triggered at fixed intervals. In both cases, the secrets manager generates a new secret value.
  3. Secrets Manager: rotates the service provider’s credentials and tests them.
  4. Secrets Manager: notifies the application that the credentials have changed.
  5. Application: retrieves the updated credentials of the service provider.
  6. Application: starts using the updated credentials to access the service provider.
Figure-5 – Automated Secret Rotation
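On AWS, the rotation in steps 2–3 follows the four-step contract Secrets Manager uses when it invokes a rotation Lambda (createSecret, setSecret, testSecret, finishSecret). Below is a condensed sketch of such a handler; `set_provider_password` and `test_provider_login` are hypothetical helpers you would implement for your specific service provider.

```python
def rotation_handler(event, secrets_client, set_provider_password, test_provider_login):
    """Condensed Secrets Manager rotation handler sketch. `event` carries
    the SecretId, a ClientRequestToken identifying the pending version,
    and the current Step; Secrets Manager invokes the steps in order."""
    secret_id = event["SecretId"]
    token = event["ClientRequestToken"]
    step = event["Step"]

    if step == "createSecret":
        # Generate a new random value and stage it as AWSPENDING.
        password = secrets_client.get_random_password(
            ExcludeCharacters="\"'@/\\")["RandomPassword"]
        secrets_client.put_secret_value(
            SecretId=secret_id, ClientRequestToken=token,
            SecretString=password, VersionStages=["AWSPENDING"])
    elif step == "setSecret":
        pending = secrets_client.get_secret_value(
            SecretId=secret_id, VersionStage="AWSPENDING")
        set_provider_password(pending["SecretString"])  # push to the provider
    elif step == "testSecret":
        pending = secrets_client.get_secret_value(
            SecretId=secret_id, VersionStage="AWSPENDING")
        test_provider_login(pending["SecretString"])    # verify before promotion
    elif step == "finishSecret":
        # Promote AWSPENDING to AWSCURRENT; the previously current version
        # must be released from the AWSCURRENT stage in the same call.
        metadata = secrets_client.describe_secret(SecretId=secret_id)
        current_version = next(
            vid for vid, stages in metadata["VersionIdsToStages"].items()
            if "AWSCURRENT" in stages)
        secrets_client.update_secret_version_stage(
            SecretId=secret_id, VersionStage="AWSCURRENT",
            MoveToVersionId=token, RemoveFromVersionId=current_version)
```

Because the new value is tested before being promoted, a failing provider leaves the old credentials current, which is what makes the flow safe to run unattended.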
Tip #12: The more frequent the secret rotation, the smaller the window a potential intruder has to abuse a leaked credential.

Tip #13: You should keep critical manual procedures available for use when automated procedures fail. Monitor your automated secret rotation process and trigger an incident-response procedure whenever the automatic process falls short.

Enough With The Mumbo Jumbo

There are many ways to put our theory into practice. Let’s walk through one implementation example that combines both operational excellence and of course, security.

In our imaginary business, we run a very successful online cookies store. The extremely naive functional view of our eCommerce system and the flow for buying cookies is illustrated in Figure-6.

Figure 6 – arealcookie.com online store – functional view
  1. The payment microservice redirects the user to pay via Paypal
  2. The payment microservice receives a callback from Paypal confirming the payment
  3. The payment microservice publishes an event confirming the order request
  4. The Messenger and Order microservices consume the published event in parallel. The Messenger microservice sends an email to the user confirming the order. The Order microservice takes care of fulfilling the order request

Our application workload runs on AWS, in its managed Kubernetes cluster, AWS EKS. The messaging system the application uses is Amazon MQ, AWS’s managed message broker, running RabbitMQ. The secrets management solution we use allows our three microservices to securely access their unique credentials so they can authenticate to and gain access to Amazon MQ. Moreover, the secrets management solution takes care of automatically and securely rotating these Amazon MQ credentials, which makes our CISO extremely delighted, as no human has to execute this delicate runbook manually on a routine basis.
Table-1 provides the reasoning for the technology choices we made.

AWS Secrets Manager
  Features: extensible secret rotation, event-driven triggers, tagging, versioning, structured & binary secrets, fine-grained permissions, auditing, etc.
  Interfaces: user-friendly UI, CLI and API
  Best of Suite: SaaS that is pre-integrated with AWS services (e.g., CloudWatch, EventBridge, CloudTrail, Config, IAM, KMS, etc.)
  High-Availability: 99.9% (including cross-region replication support)
  Audit & Security Monitoring: via integration with CloudWatch, CloudTrail, Config, Security Hub, etc.
  Compliance: HIPAA, PCI, ISO, etc.

AWS KMS
  Features: master key rotation, event-driven triggers, tagging, versioning, fine-grained permissions, auditing, symmetric and asymmetric keys, high standards for cryptography, native support for envelope encryption, protection of secrets in transit and at rest, etc.
  Interfaces: user-friendly UI, CLI and API
  Best of Suite: SaaS that is pre-integrated with the majority of AWS services
  High-Availability: 99.999% (including cross-region replication support)
  Durability: 99.999999999%
  Scalability: automatically scales to meet demand (see AWS KMS quotas)
  Audit & Security Monitoring: via integration with CloudWatch, CloudTrail, Config, Security Hub, etc.
  Compliance: ISO, PCI-DSS, SOC, etc.

Kubernetes Secrets
  Features: makes secrets easily and natively accessible to authorized service accounts assigned to Kubernetes Pods. The secrets are kept in etcd, encrypted at rest by the AWS EKS KMS plugin.

Kubernetes External Secrets
  Features: allows using external secret management systems (as the source of truth) to securely add secrets to Kubernetes.
  Interfaces: extends the Kubernetes API with an ExternalSecret object (via a Custom Resource Definition) and a controller that implements the object’s behavior. The conversion from ExternalSecret objects is completely transparent to Pods, which can access the resulting Kubernetes Secrets normally.
  Integration: native integration with cloud providers’ identities, service accounts and IAM
  Security: supports fine-grained access permissions
  Multi-Cloud: supports AWS Systems Manager, Akeyless, Hashicorp Vault, Azure Key Vault, Google Secret Manager and Alibaba Cloud KMS Secret Manager
  Possible Future Alternative: AWS Secrets and Configuration Provider (ASCP) for the Kubernetes Secrets Store CSI Driver, with implementations for other cloud providers (GCP, Azure). This is still alpha, but it is definitely something to consider once it becomes production ready.
Table-1 – Secrets Management Solution – Technology Stack

  • Modern secrets management technologies (e.g., Akeyless, Hashicorp Vault) are equipped with much more than just secrets management and may combine several disciplines, e.g., secrets, PAM (allowing authorized clients to get temporary credentials to target systems supported by the PAM function), KMS, PKI, etc. At the time of writing, AWS Secrets Manager still does not support the common functionality that allows an authorized identity to obtain unique, temporary credentials, on demand, to access various 3rd party services. For PKI and KMS, AWS offers complementary managed services (AWS KMS and AWS Certificate Manager Private Certificate Authority).
  • If you choose to go with a managed secrets management technology (SaaS), keep in mind that even though it greatly simplifies security operations, you are still fully responsible for certain things, and you must be aware of your vendor’s Shared Responsibility Model.
  • If you choose to go with a self-managed secrets management technology, you have much more work to do to securely operate it in an effective manner.
  • In our examples, we embraced the best-of-suite strategy and chose AWS Secrets Manager, a SaaS secrets management technology pre-integrated with all the important AWS services.
    Going with the cloud provider’s native secrets management may save you all kinds of wiring and integrations you would otherwise implement yourself if you were to take a different route. In addition, the implementation and operations are often quite consistent within the same cloud provider, e.g., on AWS: resource-based policies, IAM policies, KMS CMK, IaC, CLI, API, AWS Config, AWS Security Hub, CloudTrail, AWS EventBridge, etc.
Tip #14: On AWS, prefer a multi-account strategy to isolate workloads by system and also by SDLC environment (e.g., sandbox, development, staging, production, etc.). An independent and isolated secrets management instance should be used by each of these accounts, preferably sharing no secrets with the other accounts.

Now let’s see how it all fits together. Figure-7 illustrates this architecture in greater detail.

Figure-7 – arealcookie.com online store – technical view
  1. Using an internal scheduler, AWS Secrets Manager triggers an invocation of the Secret Rotation Lambda function
  2. The Secret Rotation Lambda function computes a new password and makes an API call to AWS Secrets Manager to save a new version of the secret in a Pending stage.
  3. The Secret Rotation Lambda function calls the Amazon MQ API to set a new password for the application user.
  4. The Secret Rotation Lambda function tests the new password by creating a new RabbitMQ connection to the Amazon MQ broker.
  5. The Secret Rotation Lambda function finishes the rotation flow by promoting the stage of the secret to Current.
  6. The Secret Rotation Lambda function asynchronously invokes the Secret2EKS Lambda function to notify the EKS cluster about the updated secret.
    1. [6-error] On a Secret Rotation Lambda function error, a message with details about the failed event is published to an AWS SNS topic serving as a Dead-Letter-Queue.
    2. [7-error] An AWS SQS queue that is subscribed to the AWS SNS DLQ topic queues the message, keeping it for further error-handling processing.
    3. [8-error] SecOps gets notified about the error and executes a playbook to investigate and take actions to remediate the problem.
  7. The Secret2EKS Lambda function applies corresponding ExternalSecret objects via the AWS EKS Cluster API.
  8. Kubernetes External Secrets calls the AWS Secrets Manager GetSecretValue API to retrieve the secret corresponding to the ExternalSecret object.
  9. Using the secret retrieved from AWS Secrets Manager, Kubernetes External Secrets applies a Kubernetes Secret object corresponding to the ExternalSecret object via the AWS EKS Cluster API.
  10. Using the AWS EKS Cluster API, our application running in the AWS EKS Cluster gets to use the Kubernetes Secret object.

  • In this scenario we would follow Tip #4 and implement user toggling to ensure stability during secret rotation.
  • Our application must do one of two things: either poll the secret from the volume to detect updates (e.g., the reload feature of Spring Cloud Kubernetes) and re-initialize its RabbitMQ connections, or lazily re-initialize RabbitMQ connections once a connection fails due to invalid credentials.
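The second option above, lazily re-initializing connections on an authentication failure, can be sketched as a retry-once wrapper. `AuthError`, `connect` and `read_credentials` are illustrative stand-ins for your RabbitMQ client's auth exception, connection factory and a reader of the mounted Kubernetes Secret; they are assumptions, not a specific library's API.

```python
class AuthError(Exception):
    """Raised when the broker rejects the supplied credentials."""

def connect_with_refresh(connect, read_credentials):
    """Try to connect with the currently known credentials; on an auth
    failure, re-read the (possibly rotated) secret and retry once."""
    creds = read_credentials()        # e.g., read the mounted Kubernetes Secret
    try:
        return connect(creds)
    except AuthError:
        creds = read_credentials()    # the secret may have rotated underneath us
        return connect(creds)
```

Combined with the alternating-users strategy from Tip #4, the old credentials keep working during the rotation window, so a single retry with freshly read credentials is enough.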
Tip #15: Once you know what you need to protect, you can begin developing secrets management strategies. However, before you spend a dollar of your budget or an hour of your time implementing a secrets management solution to reduce risk, be sure to consider which risk you are addressing, how high its priority is, and whether you are approaching it in the most cost-effective way.

Wrapping Up

Making decisions around secrets management technology is never easy. It requires trading off one item against another: cost, reliability, operational excellence and, of course, security. In this post, however, we preferred more than anything to focus on how to distribute secrets safely to applications and on how to support short secret rotation intervals. This is because no matter how secure your secrets management technology is, once secrets leave its secure boundaries there is always a risk they will be compromised.

I hope you find the use cases and the tips presented here valuable.
I tried to focus on those secrets management areas you should deeply care about even if you are already paying for the best secrets management technology out there.