
Are You Ready To Manage Access At Scale?

Back in university, I tended to enjoy the math courses. The first year wasn’t great, though: I often lost points in exams merely because I made some pretty lame mistakes. I had monumentally bad handwriting, so making the effort to write and draw my math legibly made a gigantic difference. Neatness was a key factor in making my exams significantly less corrupted by errors.


Neatness is also a key factor when designing a solution architecture. It helps you identify systematic patterns, ones that you can repeat and eventually automate. Automation helps cloud operations teams reduce the volume of routine tasks that must be completed manually.

AWS SSO helps you securely connect your workforce identities and manage their permissions centrally across AWS accounts and possibly other business cloud applications. How can it fit into a solution that allows you to manage access at-scale? Let’s try to figure it out.

Figure-1 demonstrates a common use-case for an AWS SSO based solution architecture in a multi-account AWS environment:

  1. A developer logs in to the external identity provider (e.g., Okta) and launches the AWS app assigned to her Okta group.
  2. A successful login redirects the developer to the AWS SSO console with a SAML2 assertion token. Depending on the developer’s association with Okta user groups, the developer gets to select a specific PermissionSet she is allowed to use in a specific AWS account.
  3. The developer uses the AWS IAM temporary credentials returned from the previous step to access AWS resources/services.
Figure-1 – A User With A Job-Function Access A Workload – Example

Figure-2 visualizes the AWS access permissions a user is granted by first adding users to groups in an external identity provider, and then creating tuples of group, AWS SSO PermissionSet and AWS account. Without “Neatness” in place, i.e., a solid model for these permission assignments, it is clear how this can grow chaotically. For example, an implementation that includes 20 groups, 20 permission-sets and 20 accounts can end up with 8,000 unique combinations that represent users’ permissions. What AWS permissions are granted to User 2 when she is added to User-Group 2? In which accounts? It’s not easy to figure out. Not only might security operations be slowed down by this cumbersome process, it also becomes much more error-prone when humans have to configure it all manually.

Figure-2 – AWS SSO Permission-Sets’ Assignments – Illustration

Luckily, this is not too hard to solve.

The “AWS SSO Operator” Solution

We start by defining the conceptual model (see Figure-3) for managing our AWS SSO Permission-Sets Assignments. In addition to Neatness, the model also takes into account the Segregation of Duties principle and the advantage of isolating SDLC environments.

Figure-3 – The Conceptual Model For Permission Assignments

In this conceptual model, a user is associated with one or more job-functions. The model globally fixes the SDLC environments each job-function is allowed to access, e.g., the ‘tester’ job-function is limited to development and staging environments only. Each user is assigned to zero or more user-groups. Each user-group is associated with a single, unique combination of job-function and workload. Each PermissionSet is associated with exactly one job-function but one or more workloads. Each account, on the other hand, is associated with a single, unique combination of workload and SDLC environment.

It simply means that user-groups, PermissionSets and accounts are associated based on their workload, job-function and SDLC environment attributes.
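To make the model concrete, here is a minimal, hypothetical sketch of how these entities and their attributes could be captured in code. The dataclass names, job-functions and environment names are illustrative assumptions, not part of the actual solution:

```python
from dataclasses import dataclass, field
from typing import List

# Which SDLC environments each job-function may access (illustrative values only).
JOB_FUNCTION_ENVIRONMENTS = {
    "developer": ["development", "staging"],
    "tester": ["development", "staging"],
    "cloudops": ["development", "staging", "production"],
}

@dataclass
class UserGroup:
    name: str            # e.g., "cloudops_workload_myapp"
    job_function: str    # e.g., "cloudops"
    workload: str        # e.g., "myapp"

@dataclass
class PermissionSet:
    arn: str
    job_function: str                                    # 'job_function' tag
    workloads: List[str] = field(default_factory=list)   # 'workloads' tag

@dataclass
class Account:
    account_id: str
    workload: str        # 'workload' tag
    sdlc_env: str        # derived from the parent OU name
```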

The model may change according to your company’s needs and policies. Companies choose to define and implement SDLC environments slightly differently. Job-functions also vary from one company to another. Thus, it is important to keep the model highly configurable.

Figure-4 demonstrates the realization of the model and the matching process by which AWS SSO PermissionSet Assignments are created.

Figure-4 – Permissions As Tuples of User-Group, PermissionSet and Account

  • User-groups follow a naming convention, a syntax, that incorporates job-function and workload identifiers in their names (see the matching sketch after this list). User profile attributes (e.g., ‘team’, ‘project’) may also be used to enable ABAC for finer-grained permissions.
  • Each AWS SSO PermissionSet is tagged with ‘workloads’, a list of one or more workload names. Reserved values allow us to keep this list short by indicating the population of workloads to use, e.g.:
    • workload_all – the PermissionSet matches all workloads
    • workload_match – the PermissionSet matches multiple pairs of workload and user-group that share the same workload identifier.
  • There is a special treatment for Sandbox accounts, which are assigned to a single owner.
  • Each AWS SSO PermissionSet is also tagged with ‘job_function’, which semantically is similar to an AWS IAM Job Function.
  • Each AWS Account is tagged with ‘workload’.
  • Each AWS Account can represent a single workload only.
  • AWS Organizational Units (OUs) in the second level of the hierarchy use reserved names that represent the SDLC environments your company uses.
  • More on the “awsssooperator” workload is coming next.
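As an illustration only, the sketch below shows how the matching could work, reusing the illustrative dataclasses from the earlier sketch. The group-name syntax, the reserved values and the tag names are assumptions for the sake of the example; the actual solution may use a different convention:

```python
from typing import List, Tuple

def parse_group_name(name: str) -> Tuple[str, str]:
    """Split a group name such as 'cloudops_workload_myapp' into (job_function, workload)."""
    job_function, sep, workload = name.partition("_workload_")
    if not sep:
        raise ValueError(f"group '{name}' does not follow the naming convention")
    return job_function, workload

def match_assignments(groups: List[UserGroup],
                      permission_sets: List[PermissionSet],
                      accounts: List[Account]) -> List[Tuple[str, str, str]]:
    """Produce (user-group, permission-set, account) tuples from the model (simplified)."""
    assignments = []
    for group in groups:
        for ps in permission_sets:
            if ps.job_function != group.job_function:
                continue
            for account in accounts:
                # Job-functions are restricted to specific SDLC environments.
                if account.sdlc_env not in JOB_FUNCTION_ENVIRONMENTS.get(group.job_function, []):
                    continue
                # 'workload_all' matches every account; 'workload_match' requires the
                # group and the account to share the same workload identifier.
                matches_workload = (
                    "workload_all" in ps.workloads
                    or account.workload in ps.workloads
                    or ("workload_match" in ps.workloads
                        and account.workload == group.workload)
                )
                if matches_workload:
                    assignments.append((group.name, ps.arn, account.account_id))
    return assignments
```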

Now that our model is implemented and can be used to systematically produce AWS SSO PermissionSet Assignments, wouldn’t it be nice to automate everything? Figure-5 presents the “AWS SSO Operator”, an opinionated solution that automates the provisioning/de-provisioning of AWS SSO PermissionSet Assignments using the metadata from our implemented model. It keeps the AWS SSO PermissionSet Assignments in sync with the model pretty much at all times: it continuously evaluates the actual PermissionSet Assignment state against the desired state and acts to remediate the gap.

Figure 5 – AWS SSO Operator Solution Architecture

The two main flows are distinguished by their “event-based” and “time-based” triggers. They perform similar logic, except that the time-based flow (B) runs periodically and analyzes the entire model to ensure the AWS SSO PermissionSet Assignments are in their desired state, initiating remediation actions as needed. The event-based flow (A) analyzes and remediates only the AWS SSO PermissionSet Assignments that are in the context of the event. The main steps of the event-based PermissionSet Assignments flow:

  1. Perform cross-region event re-routing for the AWS Organizations MoveAccount event. This is only required if your AWS SSO Operator runs in a region other than N. Virginia. A separate SAM application is used for this purpose.
  2. Event rules are implemented in AWS CloudWatch Events (the predecessor of Amazon EventBridge) to intercept the relevant events and route them to the appropriate AWS Lambda Functions (“Event Handlers”), e.g., MoveAccount (source: organizations.amazonaws.com), TagResource (source: sso.amazonaws.com), etc.
  3. AWS Lambda Functions retrieve Okta & AWS SSO SCIM access tokens. The Event Handlers need those to access the Okta and AWS SSO SCIM APIs.
  4. Okta user-groups are loaded by the event handlers so they can be matched against AWS SSO PermissionSets and AWS accounts based on the model’s attributes (i.e., job-function, workload and SDLC environment). At the time of writing this post, the AWS SSO SCIM API is still limited; therefore, the Okta API is used to iterate over the entire Okta user-group list.
  5. Verify that the user-groups we loaded from Okta exist in AWS SSO SCIM; otherwise, we ignore them.
  6. Retrieve the AWS account’s workload tag and its parent OU (SDLC environment) so it can be matched by workload and job-function (job-functions are restricted to specific SDLC environments).
  7. The AWS SSO API is used to create/delete PermissionSet Assignments. The API is asynchronous and needs to be monitored (a boto3 sketch of steps 7-8 follows this list).
  8. The PermissionSet Assignment request (create/delete) events are sent to AWS SQS (“PermissionSet Assignments Monitoring”), so they can be monitored asynchronously.
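To illustrate steps 7-8, here is a hedged boto3 sketch of creating an assignment and queuing its request ID for asynchronous monitoring; the instance ARN, queue URL and identifiers are placeholders:

```python
import json
import boto3

sso_admin = boto3.client("sso-admin")
sqs = boto3.client("sqs")

def request_assignment(instance_arn, account_id, permission_set_arn, group_id, queue_url):
    # Asynchronous API call: it returns a creation status that must be tracked separately.
    response = sso_admin.create_account_assignment(
        InstanceArn=instance_arn,
        TargetId=account_id,
        TargetType="AWS_ACCOUNT",
        PermissionSetArn=permission_set_arn,
        PrincipalType="GROUP",
        PrincipalId=group_id,
    )
    status = response["AccountAssignmentCreationStatus"]
    # Queue the request so the monitoring flow (C) can poll it to completion.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({
            "request_id": status["RequestId"],
            "account_id": account_id,
            "permission_set_arn": permission_set_arn,
        }),
    )
    return status["RequestId"]
```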

The PermissionSet Assignments Monitoring flow (the third flow; C) sends audit events to SecOps (or to a SIEM system, for that matter). Each event includes information about the provisioning/de-provisioning of a PermissionSet Assignment:

  1. The SQS2StepFunc Lambda Function is triggered to receive batches of messages from the AWS SQS queue to which steps A8 and B7 from previous flows send audit events.
  2. The Lambda Function starts an execution of the “Monitor Provisioning Status” AWS Step Functions state machine. Figure-6 presents the corresponding state machine, which processes the events one by one using the Map construct of AWS Step Functions.
  3. If the PermissionSet provisioning status is not yet final (i.e., it is still “in progress” rather than “succeeded” or “failed”), the state machine continues polling the status from AWS SSO (a minimal polling sketch follows Figure-6).
  4. Eventually the final PermissionSet provisioning status is reported back to SecOps/CloudOps by publishing an AWS SNS message.
Figure-6 – AWS SSO Operator – PermissionSet Assignments Monitoring – AWS Step Functions
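A minimal polling sketch for the state machine’s status-check task (hedged; in the real solution this would be a Lambda task inside the Step Functions Map state, and the field names of the incoming event are assumptions):

```python
import boto3

sso_admin = boto3.client("sso-admin")

def check_creation_status(event, context):
    """Return the provisioning status for one queued assignment request."""
    status = sso_admin.describe_account_assignment_creation_status(
        InstanceArn=event["instance_arn"],
        AccountAssignmentCreationRequestId=event["request_id"],
    )["AccountAssignmentCreationStatus"]
    # IN_PROGRESS keeps the state machine polling; SUCCEEDED/FAILED ends it,
    # and the final status is published to SNS for SecOps/CloudOps.
    return {**event, "status": status["Status"]}
```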

  • The AWS SSO Operator does not mandate the use of SAML-assertion / user-attribute based ABAC, but it is definitely something to consider when implementing fine-grained permissions at scale, as it significantly helps lower the number of user-groups. For example, let’s assume we have a “Team” user attribute in Okta. Users with “Team_Alpha” who are in the “cloudops_workload_myapp” user-group may have access to the AWS S3 object s3://some-bucket/some-key/myobj (production account; workload “myapp”), while users with “Team_Bravo” in the same user-group may not.
  • To speed up development, the AWS SSO Operator solution is written as a couple of standalone Python modules that can be easily tested locally and then deployed as Lambda Layers, keeping the Lambda function code lean, a mere protocol wrapper. AWS SAM is used for infrastructure-as-code.
  • The serverless application and the code are designed to work around the limitations of the (Okta and AWS) APIs in use, such as rate limits. It uses queues to buffer API calls, batch processing, caching, etc.
  • The AWS SSO Operator has built-in error handling that uses a Dead-Letter Queue (AWS SQS) to surface operational errors to CloudOps personnel.
  • You can now administer AWS SSO from a delegated member account of your AWS Organization. This is a major advantage, as you no longer need to log in to the highly sensitive master account in order to use AWS SSO.
  • AWS SSO PermissionSets may also be used to implement secure remote access to AWS EC2 instances via AWS Systems Manager Session Manager. If you choose a different vendor/service (e.g., zscaler) for remote access, you would need to handle those permission assignments separately.

The AWS SSO Operator and the model it introduces allow administrators of the external identity provider (e.g., Okta) to define user-groups with clear semantics for how those groups are associated with AWS SSO PermissionSets and AWS accounts. Finally, the association of AWS permissions to user-groups and accounts is done automatically, which makes things more productive and more secure as it reduces the human risk factor.

Thoughts About Permissions Management

Up until now we took care of connecting the dots, assigning permissions to users by associating user-groups, AWS SSO PermissionSets and AWS accounts. But what about the actual permissions? Is there a way to automatically generate the exact least-privilege permissions required for a given job-function and the relevant workloads? Unfortunately, there is still no silver bullet to fully address this challenge.

And yet that does not mean we need to give up security, scale or speed – there is a lot we can do.

Permissions are implemented as AWS IAM policies attached to AWS SSO PermissionSets, AWS Organizations SCPs (and/or AWS Control Tower Preventive Guardrails), AWS resource-based policies, AWS session policies and IAM permissions boundaries. The AWS policy evaluation logic uses all of these policies to determine whether permission to access a resource is allowed or denied.

Managing permissions is not a one-team show; there are multiple parties responsible for making the process of handling permission requests secure and effective at scale. Responsibilities may vary depending on your company’s operating model, but usually there are three roles involved:

  1. Engineering is responsible for developing the application and raising their access requirements to the Cloud Infrastructure Security role.
    Any delay in granting the permissions these users need would impact their ability to deliver on time.
  2. Cloud Infrastructure Security is responsible for defining and implementing (via CI/CD and Infrastructure-as-code) the AWS SSO PermissionSets and the corresponding IAM policies.
  3. IT is responsible for implementing and operating the external identity provider (e.g., Okta). That includes provisioning of users, user-groups and adding/removing users to/from user-groups.

Our permission assignments model assists the IT team in managing user-groups and users at scale. The naming convention, the syntax, we use for user-groups helps eliminate ambiguity and also opens the door for new automations to follow.

The Cloud Infrastructure Security team, however, owns the implementation and the responsibility to enforce the least-privilege principle. Numerous requests for cloud permissions take place daily, and they are expected to be handled almost immediately. The Cloud Infrastructure Security team is the one to handle these requests, usually by hand-crafting AWS IAM policies, which are iteratively tested and corrected as needed. This process is manual, time-consuming, inconsistent, and often suffers from trial-and-error repetition. Our model forces a certain order, IAM policies are defined for job-functions in the context of workloads, and to some extent that eases the process since the job-function must be well defined. Still, this process is quite cumbersome, and the Cloud Infrastructure Security team might quickly become a bottleneck here.

  • The Cloud Infrastructure Security and IT teams shall handle permission revocations in cases where a permission is unused and/or a user is no longer entitled to have it, e.g., due to user de-activation. The way to automate this process is not covered here; it deserves its own post.
  • The services and actions of an IAM policy are determined in accordance with the job-function definition.

As the company scales, this kind of centralized and manual management approach falls over, becoming impractical for both operations teams and their users.

The following strategies, which are not mutually exclusive, can come in handy for operating permissions effectively at scale.

Decentralized Permissions Management

Managing permissions centrally may work for small businesses, but it does not scale very well, which is why you would want the option to delegate most of it to application owners, who can become more independent. There are multiple ways you could implement the delegation model, but the one that keeps your application owners autonomous is probably preferred, especially when these application team members are also accountable for the business outcome. Since the AWS SSO Operator solution takes care of permission assignments to user-groups and workloads, the delegated application team members would only need to manage the lifecycle of AWS SSO PermissionSets (create/update/delete), and only the ones they are entitled to manage (specific workloads and job-functions). Figure-7 demonstrates permission delegation to a SecOps job-function of the “myapp” workload (a hedged policy sketch follows the figure). A more detailed example of how permission delegation works in AWS SSO can be found here.

Figure-7 – AWS SSO PermissionSet Custom Policy
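For illustration only, a delegation policy along the lines of Figure-7 could look roughly like the sketch below (expressed here as a Python dict). The exact sso: action names and the tag-based condition support should be verified against the AWS SSO service authorization reference; treat this as an assumption-laden example rather than a verified policy:

```python
# Hypothetical delegation policy: allow managing only PermissionSets tagged for the
# 'myapp' workload and the 'secops' job-function (tag keys/values are assumptions).
DELEGATED_PERMISSION_SET_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ManageMyAppSecOpsPermissionSets",
            "Effect": "Allow",
            "Action": [
                "sso:DescribePermissionSet",
                "sso:UpdatePermissionSet",
                "sso:PutInlinePolicyToPermissionSet",
                "sso:AttachManagedPolicyToPermissionSet",
                "sso:DetachManagedPolicyFromPermissionSet",
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/workloads": "myapp",
                    "aws:ResourceTag/job_function": "secops",
                }
            },
        }
    ],
}
```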

  • Decentralized permissions management does not mean nothing is managed centrally. Certain operational aspects of AWS SSO are likely to still be managed centrally by your Cloud Infrastructure Security team. For example,
    • Permission delegation and policies like the one illustrated in Figure-7
    • ABAC attribute mapping to SAML2 attributes from your external identity provider
    • Cross-Functional job-functions’ permissions, e.g., secops, security-auditor, finops, support-ops, etc.
    • AWS SSO permissions boundaries, which allow us to limit the permissions we delegate.
  • None of the AWS SSO Operator responsibilities around permissions’ assignments & provisioning are delegated to anyone – only the AWS SSO Operator is entitled to handle that.
“Shift-Left”

Moving security, and specifically permission management, to an earlier stage of the development lifecycle makes it easier to put more automation in place and therefore improves your ability to scale. Shifting left in this case means scanning, inspecting and identifying IAM policy security issues right in the IDE or CI pipelines. You want to detect and remediate, as early as possible, overly permissive IAM policies that might allow unintended capabilities such as privilege escalation, exposure of resources, data exfiltration, credentials exposure, infrastructure modification, etc. A tool like Cloudsplaining can be integrated into your CI/CD to fail builds that contain such security issues. Similarly, AWS IAM Access Analyzer can be integrated into your CI/CD (as in this example) to validate IAM policies against AWS security best practices and detect security issues, including those associated with the principle of least privilege. Checkov is a static analysis tool for infrastructure-as-code (IaC) and it can also be integrated into your CI/CD to run security checks. Checkov also integrates with Cloudsplaining to add more IAM checks that flag overly permissive IAM policies in IaC templates.
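As one concrete, hedged example of shifting these checks left, the AWS IAM Access Analyzer ValidatePolicy API can be called from a small CI script; the policy file path and the failure threshold below are assumptions:

```python
import json
import sys

import boto3

def validate_policy_file(path: str) -> int:
    """Validate an identity-based policy and return the number of blocking findings."""
    with open(path) as f:
        policy_document = f.read()

    analyzer = boto3.client("accessanalyzer")
    findings = analyzer.validate_policy(
        policyDocument=policy_document,
        policyType="IDENTITY_POLICY",
    )["findings"]

    blocking = [f for f in findings if f["findingType"] in ("ERROR", "SECURITY_WARNING")]
    for finding in blocking:
        print(f"{finding['findingType']}: {finding['issueCode']} - {finding['findingDetails']}")
    return len(blocking)

if __name__ == "__main__":
    # Fail the CI build when blocking findings exist.
    sys.exit(1 if validate_policy_file(sys.argv[1]) else 0)
```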
If you are looking for best practices around automating tests for IAM policies, the AWS IAM Policy Simulator can be integrated into your CI/CD for unit-testing IAM policies (as in this example), making sure they are fully functional in the context of the target accounts (e.g., is there an SCP blocking a service or action you plan to use?). The AWS IAM Policy Simulator is useful for testing policies and identifying permission errors early in the development lifecycle. The simulator is not designed to assist with the least-privilege principle; however, when we come to implement this principle the chances for errors increase, and we want to make sure our tightened policies still work.
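And a hedged sketch of a unit test using the IAM Policy Simulator API; the action and resource names are placeholders:

```python
import boto3

def assert_policy_allows(policy_json: str, action: str, resource_arn: str) -> None:
    """Fail loudly if the given policy does not allow the action on the resource."""
    iam = boto3.client("iam")
    results = iam.simulate_custom_policy(
        PolicyInputList=[policy_json],
        ActionNames=[action],
        ResourceArns=[resource_arn],
    )["EvaluationResults"]
    for result in results:
        assert result["EvalDecision"] == "allowed", (
            f"{result['EvalActionName']} on {resource_arn} was {result['EvalDecision']}"
        )

# Example usage in a test suite (placeholders):
# assert_policy_allows(open("cloudops-policy.json").read(),
#                      "s3:GetObject", "arn:aws:s3:::some-bucket/some-key/myobj")
```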

These are just examples of tools that can assist in pushing some elements of permission management to the early phases of the development lifecycle in a scalable manner.

Permissions Boundaries

Another useful technique that helps with the least-privilege principle is to set hard boundaries on policies, so that no matter what permissions a policy declares, the effective permissions can never go beyond what is allowed by the permissions boundary. This measure is highly effective in regaining control over permissions that are granted at high scale. An AWS SSO PermissionSet allows defining a permissions boundary to limit its policies. AWS Control Tower supports permission boundaries for the entire organization in the form of preventive Guardrails and Custom Guardrails (which are implemented as AWS Organizations SCPs). AWS IAM supports permissions boundaries for IAM roles/users. Another type of “permission boundary” is controlled via account-level configuration, by which we can block certain permissions for an entire account, e.g., blocking public access to all AWS S3 buckets within an AWS account.
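For example (a hedged sketch; the role and policy names are placeholders), attaching a permissions boundary to an IAM role caps whatever its identity policies grant:

```python
import boto3

iam = boto3.client("iam")

# Cap the effective permissions of a delegated role with a managed boundary policy.
iam.put_role_permissions_boundary(
    RoleName="myapp-delegated-admin",                                           # placeholder
    PermissionsBoundary="arn:aws:iam::123456789012:policy/workload-boundary",   # placeholder
)
```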

Permissions boundaries are crucial for the success of decentralized permission management. They are preventative controls that help ensure the delegated teams cannot go too far wrong. The Cloud Infrastructure Security team is a good candidate to own the management of permissions boundaries, and of AWS Control Tower preventive Guardrails in particular.

Permissions Monitoring

Detective controls include security monitoring tools, which can also come in handy in supporting the least-privilege principle. More importantly, these security monitoring tools are there to ensure your security compliance requirements are met. They provide you with the capability to automatically detect permission-related issues and respond to them by taking automatic/manual remediation actions. Figure-8 illustrates a common way of doing that on AWS.

Figure-8 – Permissions Monitoring

AWS GuardDuty, AWS IAM Access Analyzer and AWS Config support permission-related pre-defined checks. AWS Config also supports custom rules for custom checks. These are high-level services that require minimal or no coding at all. Don’t miss the opportunity to score these quick wins!
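If you do need a custom check, the skeleton of an AWS Config custom-rule Lambda is small; the compliance logic below is only a placeholder:

```python
import json
import boto3

config = boto3.client("config")

def lambda_handler(event, context):
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event.get("configurationItem", {})

    # Placeholder logic: flag IAM policies whose name hints at admin access.
    compliance = "NON_COMPLIANT" if "admin" in item.get("resourceName", "").lower() else "COMPLIANT"

    config.put_evaluations(
        Evaluations=[{
            "ComplianceResourceType": item.get("resourceType", "AWS::::Account"),
            "ComplianceResourceId": item.get("resourceId", event["accountId"]),
            "ComplianceType": compliance,
            "OrderingTimestamp": item.get("configurationItemCaptureTime",
                                          invoking_event["notificationCreationTime"]),
        }],
        ResultToken=event["resultToken"],
    )
```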

You can also use any of the log events ingested into your SIEM (e.g., AWS CloudTrail, CloudWatch Logs, etc.) to detect security events. The AWS SSO Operator implements very simple monitoring of the AWS SSO PermissionSet assignment and provisioning activities. The solution can be extended to send these logs to a SIEM and, for example, verify whether “break-glass” or “power-user” permissions are assigned to inappropriate user-groups.

  • Continuous permission checks followed by corrective actions are best handled locally, from within the account where the issues are found, i.e., in a decentralized manner (see example). This is in contrast to the centralized approach illustrated in Figure-8, where security events are propagated to other systems, e.g., a SIEM.
  • Continuous corrective actions on permissions may pose another challenge, since permissions are most likely managed as source code (IaC) in a source control repository, which is also the source of truth. The challenge in this case is keeping the source code in sync with the corrected permissions.
Define Permissions Based on Usage Analysis

Reverse-engineering is the act of dismantling an object to see how it works. It is done primarily to analyze and gain knowledge about the way something works, but it is often used to duplicate or enhance the object.

AWS IAM Access Analyzer generates IAM policies in a process similar to reverse engineering. It analyzes the identity’s historical AWS CloudTrail events and generates a corresponding policy based on the access activity. You should not consider the generated policy the final product, but it is still much better than nothing. The generated policy requires tuning, like adding/removing permissions, specifying resources, adding conditions to the policy, etc.

Let’s assume we want to generate a policy for the CloudOps job-function. We start by creating another job-function, e.g., CloudOpsTest, that, in order to lower the risk, is enabled in development environments only. Then, we create an overly permissive AWS SSO PermissionSet for that job-function, and for a limited period of time an identity with that job-function uses the PermissionSet to execute the playbooks/runbooks the CloudOps job-function is required to support. Once we are done, similar to reverse-engineering, based on all the actions we performed, we can generate an IAM policy that reflects the services and actions CloudOps needs in order to do the job. Last but not least, we fine-tune the generated policy by specifying resources, adding conditions, etc. Ta-da!

AWS IAM Access Analyzer not only makes it easier to implement least-privilege permissions by generating IAM policies based on access activity, it also saves the time spent hand-crafting the PermissionSet policy and figuring out which services and actions should be included.
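A hedged sketch of kicking off policy generation from CloudTrail activity via the Access Analyzer API; all ARNs and the time window are placeholders:

```python
from datetime import datetime, timedelta

import boto3

analyzer = boto3.client("accessanalyzer")

# Generate a policy from the last 30 days of activity of the CloudOpsTest principal.
job = analyzer.start_policy_generation(
    policyGenerationDetails={
        "principalArn": "arn:aws:iam::123456789012:role/CloudOpsTest"  # placeholder
    },
    cloudTrailDetails={
        "trails": [{"cloudTrailArn": "arn:aws:cloudtrail:us-east-1:123456789012:trail/main",
                    "allRegions": True}],
        "accessRole": "arn:aws:iam::123456789012:role/AccessAnalyzerCloudTrailRole",  # placeholder
        "startTime": datetime.utcnow() - timedelta(days=30),
        "endTime": datetime.utcnow(),
    },
)

# Later, retrieve the draft policy for manual fine-tuning.
generated = analyzer.get_generated_policy(jobId=job["jobId"])
```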

  • Overly permissive policies are often used in non-production accounts for test purposes, to minimize the effort of securing least-privilege access.
  • This is an iterative process as job-function requirements and the infrastructure used by workloads evolve all the time.
Using ABAC For Fine-Grained Permissions

Imagine your workload is maintained by two different application teams. Each team is responsible for different datasets, which are stored in the same AWS S3 bucket. Each team has its own CloudOps, who is responsible for operating the production environments. The permission assignment model we introduced earlier would assign the same permissions to each of the CloudOps users, simply because they share the same workload and the same job-function. How can we implement finer-grained permissions for CloudOps so that each can only access data belonging to her team? One option would be to split the job-function in two (e.g., CloudOpsTeamOne) and use two separate PermissionSets, but that does not scale well. Another option, which scales better, is to take the ABAC approach, using a user attribute and an AWS resource tag to limit access based on their values. Figure-9 demonstrates a policy that implements the desired ABAC for our fictitious use-case (a hedged sketch follows the figure).

Figure-9 – An Example of a PermissionSet Policy That Uses ABAC
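For illustration, one common ABAC variant (a hedged sketch, not necessarily the exact policy shown in Figure-9) scopes S3 access by matching the principal’s ‘team’ attribute, propagated as a principal tag via AWS SSO attribute mapping, against the object key prefix:

```python
# Hypothetical PermissionSet inline policy: each CloudOps user may only touch
# objects under the prefix that matches her 'team' principal tag.
ABAC_TEAM_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TeamScopedObjectAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::myapp-data/${aws:PrincipalTag/team}/*",  # bucket name is a placeholder
        }
    ],
}
```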

Conclusion

The AWS SSO Operator solution allows you to automate the assignment of AWS permissions to user-groups and workloads. The AWS SSO Operator reduces some operational overhead and makes things more secure as it reduces the human risk factor.

On the other hand, managing AWS SSO permissions while adhering to the least-privilege principle requires a more holistic strategy. There are some pragmatic measures you can take to operate and manage permissions at scale while improving your security posture in this area too. It is not unlikely that we’ll soon start seeing solutions/services from vendors that automate permissions monitoring, provisioning and adjustment, making it simple to follow the principle of least privilege (PoLP). Until then, make sure you have a solid story around it.

All the security controls you put in place are meant to address the business risks you have identified. Not all applications and businesses are equally sensitive; thus, you should not invest in fancy solutions without being able to identify the risk you are trying to mitigate and how significant it is for your business.


The Secret Sauce Of Effective Secrets Management

Sometimes secrets are nothing more than credentials used to authenticate client applications & users and provide them access to sensitive systems, services, and information. How would you operate these secrets effectively and at scale such that they remain secret?
Because secrets have to be distributed securely, Secrets Management solutions must account for and mitigate the risks to these secrets, in-transit, at-rest and in-use.
Do not hold your breath: secrets management vendors will not handle all of that for you, simply because they cannot. Don’t get me wrong, these vendors do provide you with the foundation to manage secrets, but there is still responsibility on your end to keep your secrets management solution secure.

Tip #1: For each service your workload uses, be well aware of the shared responsibility model defined by the provider and make sure you understand where the responsibility of the vendor ends and yours begins.

Generally speaking, private and symmetric cryptographic keys are managed separately via Key Management System (KMS) and Hardware Security Module (HSM) technologies, and this topic by itself deserves its own post. These technologies have a lot in common and they usually complement one another, as illustrated in Figure-1. Secrets management is a higher-level concept; don’t confuse it with KMS. Secrets management technologies include built-in KMS capabilities to support the cryptographic operations required to manage secrets.

Figure-1 – Secrets Management, KMS and HSM

The Darkness Before The Light

Just a few years back, enterprises suffered from a massive proliferation of passwords, passphrases, private/symmetric cryptographic keys and API keys, scattered all over the place. The business had to make sure these secrets were stored, distributed and rotated in a secure manner and with minimal/no impact on production environments. The ability to keep humans away from secrets was limited. Operating these processes was not a picnic either, which led to certain shortcuts like infrequent secret rotations, ‘standard’ well-known passwords, etc. To make the chaos less chaotic, wherever possible, we implemented single sign-on (LDAP, Kerberos, SAML federation, etc.), which helped reduce the volume of secrets we had to manage. But we still had more than a handful of secrets to handle. So we encrypted them using keys, which by themselves are just more secrets to protect, and sometimes those keys were managed via HSM, which was not always used for the right reasons. If that’s not bad enough, there was no standard, unified approach for operating these technologies, and automation was a luxury we rarely could afford. As you can imagine, this approach did not scale very well, and it was quite a nightmare for both operations and security personnel.

What Does Good Look Like?

The ideal secrets management solution allows both humans and machines to securely access and use secrets. Furthermore, it limits the attack surface by making sure our secrets are either automatically rotated or automatically provisioned & revoked at relatively short time intervals (emulating temporary credentials). Basically, the entire lifecycle of our secrets is fully or mostly automated with minimal or no human intervention. Keeping humans away reduces the risk of secrets eventually being leaked.

Depending on your business, you may need your customers to authorize your system to access their SaaS accounts on their behalf. Sounds familiar? For example, it could be an application that organizes users’ photos in Google Photos, or maybe an application that reads out loud users’ new emails in their Gmail accounts, etc. For that to work, your system stores and uses secrets that allow it to access customers’ SaaS accounts on their behalf. Your customers trust you to keep their precious secrets in your hands; do not disappoint them: keep these secrets safe and make them as important to you as they are to them. To manage customers’ secrets properly we want to maintain tenant isolation and prevent cross-tenant access. We would keep these secrets away even from our own system administrators. Also, as opposed to system secrets, customers’ secrets often need to be supported at a much higher scale. Moreover, despite our desire to always automate secret rotation with no human intervention, occasionally this is simply not up to us, as some service providers do not support it. In these cases we monitor the manual secret rotation process and alert our security personnel if, for any given secret, the rotation policy is not met.
From the moment a secret is created up to the point it is deactivated, it must be secured all the way.

Tip #2: If you have a viable option to use temporary credentials to access a resource/service (e.g., AWS IAM Temporary Security Credentials), seriously consider giving it precedence over alternatives that involve the use of secrets (e.g., on AWS there are multiple alternatives you can use other than SSH keys).

Let’s go through some use cases that illustrate the three main phases: storing a secret, distributing it to its destination(s), and rotating it, if possible, automatically.

Use Case #1: Incident-Response Playbook

Let’s assume your system makes use of a secrets management solution and you are the company’s SecOps. At some point your security monitoring system generates an alert indicating that the master credentials (a system secret) of a very sensitive database have been leaked. Even though the database resides in the company’s private network, the company policy guides you to follow an incident-response playbook that addresses this exact incident. As you can imagine, this sequence of steps is also quite useful for remediating automatic secret rotation failures. The process is illustrated in Figure-2:

  1. SecOps: signs in to the cloud account
  2. SecOps: triggers a secret rotation sequence via the Secrets Manager for the compromised database credentials (a minimal sketch follows this list)
  3. Secrets Manager: generates a new random secret, rotates the database credentials either directly or via the identity provider it is configured to use (e.g., Microsoft Active Directory), tests the new credentials and then stores them in the Secrets Manager.
  4. Secrets Manager: notifies the application that the database credentials changed
  5. Applications: retrieves the new database credentials
  6. Applications: re-initializes its database connections and starts using them
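Step 2 can be as simple as one API call (a hedged sketch; the secret name is a placeholder, and the secret is assumed to already have a rotation Lambda configured):

```python
import boto3

secretsmanager = boto3.client("secretsmanager")

# Kick off an immediate rotation of the compromised database credentials.
response = secretsmanager.rotate_secret(SecretId="prod/myapp/db-master")  # placeholder name
print("Rotation started, new version:", response["VersionId"])
```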

CloudOps, who may apply certain database schema fixes, goes through a flow very similar to ‘Applications’ flow in order to gain database access.

Tip #3: while secure access to secrets reduces risk, the preferred approach is always to automate your way out of needing human access in the first place.
Figure-2 – Incident-Response Playbook – Compromised (Secret) Credentials
Tip #4: Leverage the alternating users rotation strategy and keep credentials for two users in one secret in order to support high-availability.
Tip #5: Your secrets management solution is incomplete if it is not connected to a SIEM or the like to monitor your secrets management and alert on access anomalies, non-compliant secrets (e.g., secrets which failed to be rotated) and other threats. Since secrets are often credentials for other service providers, make sure those services are regularly audited and connected to your SIEM as well.

Use Case #2: Manually Operating Secret Rotation

Although a great number of services already support API-based credential rotation, occasionally we come across services which do not (e.g., at the time of writing, AWS SSO Automatic Provisioning & its access tokens is one such example). Even though automatic rotation cannot be supported in these cases, as described in Figure-3, we can still detect those secrets that must be rotated just before they fail to meet security compliance policies:

  1. Secrets Manager: using its internal scheduler, generates an event notifying that a given secret is reaching its time for rotation.
  2. Secrets Manager: processes the event by sending an alert to SecOps
  3. SecOps: signs in to the cloud account
  4. SecOps: follows the manual sequence defined by the service provider to rotate the credentials.
  5. SecOps: stores the credentials in the Secrets Manager.
  6. Secrets Manager: notifies the application that the service credentials changed
  7. Applications: retrieves the new service credentials
  8. Applications: re-establishes its service connections and starts using them
Figure-3 – Monitoring & Manually Operating Secret Rotation (if you must…)
Tip #6: For improved performance and reliability, especially in highly-distributed systems, consider reducing the coupling with the secrets manager by distributing and caching secrets for short time intervals in a secure, ephemeral, local store, which allows applications to process secrets most efficiently.
Tip #7: Secrets are best protected within the secrets manager; if a secret must be highly protected while it is being used, distributing it to a non-compliant store might not be an option. This is where technologies such as AWS Nitro Enclaves can really shine. They support an isolated execution environment, which allows you to protect sensitive data while it is in use, even in untrusted environments.
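To make Tip #6 concrete, here is a minimal sketch using the aws-secretsmanager-caching Python library, assuming it fits your runtime; the secret name and refresh interval are placeholders:

```python
import boto3
from aws_secretsmanager_caching import SecretCache, SecretCacheConfig

client = boto3.client("secretsmanager")
# Keep secrets in an in-memory cache for a short interval to reduce coupling
# with (and calls to) the secrets manager.
cache = SecretCache(config=SecretCacheConfig(secret_refresh_interval=300), client=client)

def get_db_credentials():
    return cache.get_secret_string("prod/myapp/db-credentials")  # placeholder name
```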

Use Case #3: Authorize Access To 3rd Party Service Account

Figure-4 illustrates the use-case in which a customer of yours follows the steps to authorize your application to access their 3rd party SaaS account (e.g., Salesforce CRM) on their behalf:

  1. Customer: signs in to your system.
  2. Application: redirects the user to the IdP of the customer’s 3rd party SaaS provider requesting the customer to authorize your system.
  3. Customer: signs in to her SaaS account and submits her consent to authorize your system.
  4. Customer: stores a secret that grants our application permissions to access the customer’s 3rd party SaaS account.
  5. Secrets Manager: notifies the application that the secret of that customer changed.
  6. Applications: retrieves the new customer’s secret
  7. Applications: connects to the customer’s 3rd party SaaS account and starts using it.

A secrets management solution that can scale in proportion to your number of customers plays an important role here.

Tip #8: Every service has its limits (e.g., AWS Secrets Manager quotas), make sure your secrets management solution complies with your requirements (e.g., latency, number of secrets, API call rate limits, secret's size, etc.)
Figure-4 – Customer Authorizes Access To Her 3rd Party Service Account
Tip #9: to prevent cross-tenant access make sure your SaaS architecture enforces tenant-aware authorization policies.
Tip #10: adhere to the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your secret management solution. A secret management solution that is integrated with a strong identity foundation is a key prerequisite to enable that.

Use Case #4: Automated Secret Rotation

Automating secret rotation significantly reduces the risk of credential leakage simply by eliminating the need to run such sensitive security operations manually.

Tip #11: be biased towards automating secrets rotation, it is a key enabler for scaling out your security operations around secrets management.

Figure-5 illustrates the use-case in which a service provider is initialized with a secret just once during its deployment and then the automatic secret rotation kicks off immediately:

  1. CI/CD: a deployment tool configures the service provider to use credentials it randomly generates on the fly. The deployment tool then uses the same credentials to initialize a secret’s value in the Secrets Manager.
  2. Secrets Manager: if the secret has just been initialized, automated secret rotation is triggered almost immediately; otherwise, using the secrets manager’s internal scheduler, automated secret rotation is triggered at fixed time intervals. In both cases, the secrets manager generates a new secret value.
  3. Secrets Manager: rotates the database credentials and tests them.
  4. Secrets Manager: notifies the application that the database credentials changed
  5. Applications: retrieves the updated credentials of the service provider.
  6. Applications: starts using the updated credentials to access the service provider.
Figure-5 – Automated Secret Rotation
Tip #12: the more frequent the secret rotation is, the more difficult it is for a potential intruder to gain unauthorized access to it. 

Tip #13: You should keep critical manual procedures available for use when automated procedures fail - monitor your automated secret rotation process and trigger incident-response procedure whenever the automatic process falls short.

Enough With The Mumbo Jumbo

There are many ways to put our theory into practice. Let’s walk through one implementation example that combines both operational excellence and of course, security.

In our imaginary business, we run a very successful online cookies store. The extremely naive functional view of our eCommerce system and the flow for buying cookies is illustrated in Figure-6.

Figure 6 – arealcookie.com online store – functional view
  1. The payment microservice redirects the user to pay via Paypal
  2. The payment microservice receives a callback from Paypal confirming the payment
  3. The payment microservice publishes an event confirming the order request
  4. The Messenger and Order microservices consume the published event in parallel. The Messenger microservice sends an email to the user confirming the order. The Order microservice takes care of fulfilling the order request

Our application workload runs on AWS, in its managed Kubernetes cluster, AWS EKS. The messaging system the application uses is Amazon MQ, a managed version of RabbitMQ. The secrets management solution we use allows our three microservices to securely access their unique credentials so they can authenticate and gain access to Amazon MQ. Moreover, the secrets management solution takes care of automatically and securely rotating these Amazon MQ credentials, which makes our CISO extremely delighted that no human has to execute this delicate runbook manually and on a routine basis.
Table-1 provides the reasoning for the technology choices we made.

Technology – Reasoning

AWS Secrets Manager
  • Features: extensible secret rotation, event-driven triggers, tagging, versioning, structured & binary secrets, fine-grained permissions, auditing, etc.
  • Interfaces: user-friendly UI, CLI and API
  • Best of Suite: SaaS that is pre-integrated with AWS services (e.g., CloudWatch, EventBridge, CloudTrail, Config, IAM, KMS, etc.)
  • High Availability: 99.9% (including cross-region replication support)
  • Audit & Security Monitoring: via integration with CloudWatch, CloudTrail, Config, Security Hub, etc.
  • Compliance: HIPAA, PCI, ISO, etc.

AWS KMS
  • Features: master key rotation, event-driven triggers, tagging, versioning, fine-grained permissions, auditing, symmetric and asymmetric keys, high standards for cryptography, native support for envelope encryption, protection of secrets in transit and at rest, etc.
  • Interfaces: user-friendly UI, CLI and API
  • Best of Suite: SaaS that is pre-integrated with the majority of AWS services
  • High Availability: 99.999% (including cross-region replication support)
  • Durability: 99.999999999%
  • Scalability: automatically scales to meet demand (see AWS KMS quotas)
  • Audit & Security Monitoring: via integration with CloudWatch, CloudTrail, Config, Security Hub, etc.
  • Compliance: ISO, PCI-DSS, SOC, etc.

Kubernetes Secrets
  • Features: makes secrets easily and natively accessible to authorized service accounts assigned to Kubernetes Pods. The secrets are kept in etcd, encrypted at rest by the AWS EKS KMS plugin.

Kubernetes External Secrets
  • Features: allows using external secrets management systems (as the source of truth) to securely add secrets to Kubernetes.
  • Interfaces: extends the Kubernetes API by adding an ExternalSecrets object using a Custom Resource Definition and a controller to implement the behavior of the object itself. The conversion from ExternalSecrets is completely transparent to Pods, which can access Kubernetes Secrets normally.
  • Integration: native integration with cloud providers’ identities, service accounts and IAM
  • Security: supports fine-grained access permissions
  • Multi-Cloud: supports AWS Systems Manager, Akeyless, Hashicorp Vault, Azure Key Vault, Google Secret Manager and Alibaba Cloud KMS Secret Manager
  • Possible Future Alternative: AWS Secrets and Configuration Provider, ASCP (and implementations for other cloud providers: GCP, Azure) for the Kubernetes Secrets Store CSI Driver. This is still alpha, but it is definitely something to consider once it becomes production-ready.
Table-1 – Secrets Management Solution – Technology Stack

  • Modern secrets management technologies (e.g., Akeyless, Hashicorp Vault) are equipped with much more than just secrets management and may combine several disciplines, e.g., secrets, PAM (allowing authorized clients to get temporary credentials to target systems supported by the PAM function), KMS, PKI, etc. At the time of writing, AWS Secrets Manager still does not support the common functionality that allows an authorized identity to get unique, temporary credentials, on demand, to access various 3rd party services. For PKI and KMS, AWS offers complementary managed services (AWS KMS and AWS Certificate Manager Private Certificate Authority).
  • If you choose to go with a managed secrets management technology (SaaS), keep in mind that even though it greatly simplifies security operations, you are still fully responsible for certain things and you must be aware of your vendor’s Shared Responsibility Model.
  • If you choose to go with a self-managed secrets management technology, you have much more work to do to securely operate it in an effective manner.
  • In our examples, we embraced the best-of-suite strategy and chose to use AWS Secrets Manager, a SaaS secrets management technology pre-integrated with all the important AWS services.
    Going with the cloud provider’s native secrets management may save you all kinds of wiring and integrations you would otherwise implement yourself if you were to take a different route. In addition, the implementation and operations are often quite consistent within the same cloud provider, e.g., on AWS: resource-based policies, IAM policies, KMS CMKs, IaC, CLI, API, AWS Config, AWS Security Hub, CloudTrail, AWS EventBridge, etc.
Tip #14: On AWS, prefer a multi-account strategy to isolate workloads by systems and also by SDLC environment (e.g., sandbox, development, staging, production, etc.). An independent and isolated secrets management instance shall be used by each of these accounts preferably sharing no secret with the other accounts.

Now let’s see how it all fits together. Figure-7 illustrates this architecture in greater detail.

Figure-7 – arealcookie.com online store – technical view
  1. Using an internal scheduler, the AWS Secrets Manager triggers invocation to the Secret Rotation Lambda function
  2. The Secret Rotation Lambda function computes a new password and makes an API call to AWS Secrets Manager to save a new version of the secret in the pending (AWSPENDING) stage (a minimal rotation-handler sketch follows this list).
  3. The Secret Rotation Lambda function calls the Amazon MQ API to set a new password for the application user.
  4. The Secret Rotation Lambda function tests the new password by creating a new RabbitMQ connection to the Amazon MQ broker.
  5. The Secret Rotation Lambda function finishes the rotation flow by promoting the secret’s stage to current (AWSCURRENT).
  6. The Secret Rotation Lambda function asynchronously invokes the Secret2EKS Lambda function to notify the EKS cluster about the updated secret.
    1. [6-error] On a Secret Rotation Lambda function error, a message with details about the failed event is published to an AWS SNS topic serving as a Dead-Letter Queue.
    2. [7-error] An AWS SQS queue subscribed to the AWS SNS DLQ topic queues the message, keeping it for further error-handling processing.
    3. [8-error] SecOps gets notified about the error and she executes a playbook to investigate and take actions to remediate the problem.
  7. The Secret2EKS Lambda function applies corresponding ExternalSecret objects to the AWS EKS Cluster API.
  8. Kubernetes External Secrets makes a call to AWS Secrets Manager GetSecretValue API to retrieve the secret corresponding to the ExternalSecret object.
  9. Using the secret retrieved from AWS Secrets Manager, Kubernetes External Secrets applies Kubernetes Secret object to the AWS EKS Cluster API corresponding to the ExternalSecret object.
  10. Using AWS EKS Cluster API, our application running in AWS EKS Cluster gets to use the Kubernetes Secret object.
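A minimal skeleton of the Secret Rotation Lambda function (steps 1-5) is sketched below. The Secrets Manager rotation contract (the createSecret/setSecret/testSecret/finishSecret steps and the AWSPENDING/AWSCURRENT staging labels) is standard; the Amazon MQ and RabbitMQ specifics are elided and marked as placeholders:

```python
import json

import boto3

secretsmanager = boto3.client("secretsmanager")

def lambda_handler(event, context):
    secret_id = event["SecretId"]
    token = event["ClientRequestToken"]
    step = event["Step"]

    if step == "createSecret":
        # Generate a new password and stage it under the AWSPENDING label.
        new_password = secretsmanager.get_random_password(ExcludeCharacters='/@"')["RandomPassword"]
        secretsmanager.put_secret_value(
            SecretId=secret_id,
            ClientRequestToken=token,
            SecretString=json.dumps({"username": "app-user", "password": new_password}),  # placeholder structure
            VersionStages=["AWSPENDING"],
        )
    elif step == "setSecret":
        pass  # placeholder: call the Amazon MQ API to set the new password for the broker user
    elif step == "testSecret":
        pass  # placeholder: open a RabbitMQ connection using the AWSPENDING credentials
    elif step == "finishSecret":
        # Promote the AWSPENDING version to AWSCURRENT, completing the rotation.
        metadata = secretsmanager.describe_secret(SecretId=secret_id)
        current_version = next(
            version for version, stages in metadata["VersionIdsToStages"].items()
            if "AWSCURRENT" in stages
        )
        secretsmanager.update_secret_version_stage(
            SecretId=secret_id,
            VersionStage="AWSCURRENT",
            MoveToVersionId=token,
            RemoveFromVersionId=current_version,
        )
```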

  • In this scenario we would follow tip #4 and implement user toggling to ensure stability during secret rotation.
  • Our application must do one of two things: either poll the secret from the volume to detect updates (e.g., the reload feature of Spring Cloud Kubernetes) and re-initialize RabbitMQ connections, or lazily re-initialize RabbitMQ connections once a connection fails due to invalid credentials.
Tip #15: Once you know what you need to protect, you can begin developing secrets management strategies. However, before you spend a dollar of your budget or an hour of your time implementing a secrets management solution to reduce risk, be sure to consider which risk you are addressing, how high its priority is, and whether you are approaching it in the most cost-effective way.

Wrapping Up

Making decisions around secrets management technology is never easy. It requires trading off one item against another: cost, reliability, operational excellence and, of course, security. But in this post we preferred, more than anything, to focus on how to distribute secrets safely to applications, and on how to support short secret rotation intervals. This is because no matter how secure your secrets management technology is, once secrets leave its secure boundaries there is always a risk they will be compromised.

I hope you find the use-cases and the tips presented here valuable.
I tried to focus on those secrets management areas you should deeply care about even if you are already paying for the best secrets management technology out there.


Recalculating Route With AWS SaaS Migrations

First Things First

“Almost all of the market segments with enterprise software are being driven by the adoption of Software As A Service (SaaS)”, Gartner, January 2020. Whether you are already a SaaS provider, a founder in the very early stages thinking of becoming one, or a good old enterprise toying with the idea of SaaS migration – this post is written especially for you.

SaaS requirements and implementations tend to vary depending on the use case; there is no one-size-fits-all! But, as you will see, the patterns and principles that are part of AWS SaaS Factory are sustainable, reusable and demonstrate the best practices for building SaaS solutions.

So now that we know that SaaS solutions are booming, let’s crack it!

As trivial as they may sound, these three must-do tips can save you a lot of headache if you choose to go through SaaS migration:

  1. Build the case for you and for your customers that justifies this migration. What will make your SaaS offering more attractive for your customers than your existing COTS products?
  2. Do not be misled: SaaS migrations are not led by technical professionals alone; business and technical experts shall work closely together to assess the implications on both ends. E.g., how are you going to tier different levels of tenants in your system? That affects both technical and business decisions.
  3. Map the areas that are important to address as part of a SaaS migration (e.g., tenant isolation). Make sure you are up to date with the relevant best practices. To move faster and in the right direction, consider acquiring support from a partner if/as needed.
Figure 1- SaaS Journey, a holistic transformation

To meet your SaaS requirements, you most probably want to redesign and optimize certain pieces of your system; that’s your ‘Target System Architecture’, and more on that later on. In this post we explore the layers outside-in, starting with the satellites, the SaaS Enablers.
Be aware that, technically speaking, the design and implementation of these SaaS Enablers are far from agnostic to the SaaS model your target system architecture uses.

The SaaS Enablers

SaaS Enablers are those capabilities we acquire, which surround the ‘Target System Architecture’. They are key for you to meet the business objectives of your SaaS solution. Make sure you focus on the SaaS Enablers first!

Figure 2- SaaS Enablers
Onboarding

In the SaaS model we aim for an automated tenant onboarding and provisioning experience. This is your way to guarantee reliable, consistent, secure and scalable tenant onboarding. A registration form/API is used to collect all of the tenant configuration data before launching the onboarding process. This process executes the onboarding steps needed to introduce a new tenant into the system. If your system is integrated with a billing system, the onboarding process is also used to provision the billing account for the new tenant. Obviously, your solution for tenant onboarding shall not ignore the reverse process of tenant offboarding.

Identity

Your system architecture probably already implements an identity solution. For the sake of your SaaS model, you want to extend the identity solution to securely connect a user’s identity with the identity of its tenant via a tenant context – that’s your SaaS identity.

Figure 3- SaaS Identity

The SaaS identity is attached to all interactions with the SaaS environment, allowing you to reliably resolve and apply this context across all the services of your target system architecture, for example, to support tenant isolation, tenant-level data partitioning, authorization, monitoring, metering, usage & analytics, SLA, etc. Also, keep in mind that your authorization scheme shall distinguish between system identities and tenant identities. For example, a system user is an administrator of the SaaS provider and has access to data of all tenants, whereas a tenant user is constrained to managing configuration and data that is part of their environment only.

Figure 4 – System Roles Vs. Tenant Roles
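One common, hedged way to carry the SaaS identity is to embed the tenant context as custom claims in the token issued at sign-in, and to resolve it on every service call. The claim names and the PyJWT usage below are assumptions for illustration:

```python
import jwt  # PyJWT

def resolve_tenant_context(token: str, public_key: str) -> dict:
    """Extract the tenant context that downstream services use for
    tenant isolation, data partitioning, metering and logging."""
    claims = jwt.decode(token, public_key, algorithms=["RS256"], audience="my-saas-api")
    return {
        "tenant_id": claims["custom:tenant_id"],                        # assumed custom claim names
        "tenant_tier": claims.get("custom:tenant_tier", "standard"),
        "role": claims.get("custom:tenant_role", "tenant-user"),        # system vs. tenant roles
    }
```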

Make identity and isolation an early priority in your migration plans!

Deployment Automation

Support a zero-downtime model, a single code base and one fully automated deployment for all tenants (customization, if supported, is metadata/configuration driven, e.g., feature toggling). This best practice allows you to support frequent, consistent and reliable deployments with minimal impact on customers. For most environments deployment automation is highly recommended; for SaaS environments, where new features/patches are deployed frequently, it is pretty much a must-have capability.

Management & Monitoring

It is essential to have a single pane of glass, multi-tenant-aware dashboards & alarms, to monitor your SaaS environment so you can quickly respond to operational events that may otherwise impact the availability of your SaaS environment, e.g., the outcome of noisy neighbors. You would also want to introduce management functionality at the tenant level, like configuration, administration, enable/disable, deactivation, plan changes, etc.
Practically speaking, not managing your tenants collectively is not really an option. The complexity involved in building that varies depending on how you choose to design your target system architecture. The nature of your tenant partitioning model, the compute model you’re using, and any number of other factors could influence your approach to assembling your multi-tenant-aware management & monitoring solution.

Tenant-aware logging and metrics, is a must-have capability!

Metrics & Analytics

Optimizing your SaaS environment is an ongoing process. Make sure that tenant-aware metric data is collected, aggregated and made available to a range of roles within your SaaS organization, enabling immediate insights into key usage, consumption and activity trends that shape both business and technical direction.

Figure 5 – aggregating metrics data

This data is used to shape architecture strategies, pricing models, product roadmap, and operational efficiency.

Figure 6 – business, technical, and operational agility relies on rich metrics

Instrument for metrics even if you start with just a few, the key is to build the foundation. Collaborate with stakeholders to identify those metrics that give them the visibility and insights they need to run the business.

Metering & Billing

Define and collect metering records reflecting usage at the tenant level. Feed your billing system with these records to generate charges and produce the bill. But how do you know your pricing and tiering strategy and tenant consumption are aligned? You measure tenant consumption and correlate the two, pricing with cost footprint; this helps shape the granularity of your metering strategy. Bandwidth, number of users, storage usage – these are all flavors of billing models used to correlate tenant activity with some billing construct. Remember that metering is targeted at capturing just those metrics that are meant to derive a tenant’s bill; they may not map to the notion of tenant infrastructure consumption, which is focused squarely on determining the actual cost associated with each tenant’s activity.

Figure 7 – sell pay-per-use.. well, sort of..

The Target System Architecture

Tenant Isolation Concepts & Considerations

Isolation is a foundational element of SaaS, and every system that delivers a solution in a multi-tenant model should take measures to ensure that tenant resources are isolated. What is considered “enough” isolation? It depends… For example, some high-compliance industries require that every tenant have its own database. How do we design for isolation? Once again, it depends on what your target system architecture is like: the compute (e.g., containers, serverless, EC2 instances, etc.), the network, the database and the storage. While there are many approaches to achieving tenant isolation, the realities of a given domain may impose constraints that require a specific flavor of isolation.

Figure 8 – tenant isolation architectures
Silo Isolation
Each tenant is running a fully siloed, dedicated stack of resources
Figure 9 – silo’ed isolation – example
Pros:
  • easier to meet security & compliance requirements; strict tenant isolation helps prevent cross-tenant access;
  • no noisy-neighbor concerns;
  • tenant cost tracking: the coarse-grained nature of the silo model provides a simple way to capture and associate infrastructure costs with each tenant;
  • limited blast radius: any failure within a given tenant’s environment is likely constrained to that environment (tenant-level availability);
  • tenant-level tuning is simpler, and it also allows you to statically configure your pricing and tiering strategy;
  • no or minimal architecture adaptation is required, which is often a significant advantage when the desire is to leave an existing legacy architecture intact.
Cons:
  • agility is compromised: the highly decentralized nature of the silo model adds complexity that impacts your ability to easily manage, operate, and support your tenants;
  • onboarding automation is cumbersome: provisioning a new tenant requires provisioning new infrastructure and, potentially, configuring new account limits;
  • cost efficiency: opportunities for saving are limited;
  • deployment is cumbersome and poses challenges, especially with a larger number of tenants;
  • scaling issues arise, mainly due to AWS quota limits and the operational inefficiencies of managing dedicated infrastructure per tenant.
Table 1 – The Trade-Offs – Silo Isolation
Pool Isolation
Tenants are running on a shared stack of resources
Figure 10 – pool’ed isolation – example
Pros:
  • agility: with a shared infrastructure model you get all the natural efficiencies without performing one-off tasks on a tenant-by-tenant basis;
  • cost efficiency: your system scales based on the actual load and activity of all of your tenants;
  • simplified operations: with the shared infrastructure model, centralized management is usually natively built in.
Cons:
  • noisy neighbor is a concern; you should design your architecture to limit these impacts (e.g., via throttling, access policies, etc.);
  • tenant cost tracking: in a silo model it is much easier to attribute the consumption of a resource to a specific tenant;
  • blast radius: having all of your resources shared also introduces operational risk (all-or-nothing availability);
  • due to security & compliance considerations, customers may be unwilling to run in this model.
Table 2 – The Trade-Offs – Pool Isolation
Bridge Model of Isolation
When the isolation landscape is less absolute, consider creating a mix of the silo and pool models
Figure 11 – bridge model of isolation – example

The bridge model is more of a hybrid model that focuses on enabling you to apply the silo or pool model where it makes sense. Its pros and cons basically derive from the trade-offs of the silo and pool models for each resource/layer of your architecture.

Tier-Based Isolation
This is about how you might package and offer different flavors (business-wise) of isolation to tenants with different profiles, and less about the mechanics of preventing cross-tenant access and how you are actually isolating tenants.
Figure 12 demonstrates a scenario where a mix of silo and pool isolation models has been offered up as tiers to the tenants. Beware that if too many of your tenants fall into the silo tier, you will begin to fall back to a fully silo’ed model and inherit many of the challenges outlined above.
Figure 12 – tier-based isolation – example

Similar to the bridge model, the pros and cons of the tier-based model derive from the trade-offs of silo and pool models for each resource/layer of your architecture.

While picking your preferred tenant isolation concept(s), be cautious and keep in mind the trade-offs they bring to the table, mainly in the areas of pricing and tiering strategy, security & compliance, opportunities for cost saving, noisy neighbors, cross-tenant access, leaving an existing legacy architecture intact, etc.

Also keep in mind that an application properly designed using Pool Isolation gives the business the flexibility to decide how and when to apply any of the tenant isolation concepts (e.g., silo, pool, etc.). While targeting Pool Isolation, you can still support Silo Isolation (see Figure 13) – strive to support multiple tenants on day one!

Figure 13 – Target Pool and Support Silo

Common SaaS Migration Models

If you are designing a new product, do not miss the opportunity to design an AWS cloud-native, multi-tenant SaaS solution with all the best practices.
What if there is a business case for your company to introduce a new multi-tenant version of an existing product such that both evolve in parallel? For the new version of the product it is a no-brainer: we redesign it as a proper AWS cloud-native, multi-tenant solution, one that meets all these SaaS best practices. The next part is a bit tricky. The approach we take to evolve both versions, the single-tenant and the multi-tenant, is to build new features as first-class citizens (AWS cloud-native, multi-tenant, etc.) of the multi-tenant version of the product and adopt this exact code base into the legacy version of the product (see Figure 14).

Figure 14 – Single-Tenant & Multi-Tenants Versions Of A Product Evolve In Parallel

Sometimes, reality and constraints lead us to continue maintaining our legacy architecture and make some compromises, which we do not want to make in a random fashion. This is where our common models and value systems come into play. This is by no means an exhaustive list of migration models, and yet we can probably say, with a high level of confidence, that these few are the main flavors.
Note! Regardless of the migration model you choose to follow, these points are very much relevant:

  • Building and integrating SaaS Enablers with your entire target system architecture and making them a priority is one of your most critical tasks to complete
  • As part of your target system architecture design, there are quite a few decisions for you to make, e.g.:
    • How would you design and build the SaaS enablers?
    • How would you design and build tenant-routing functionality (e.g., DNS-based, API & tenant-context-based, etc.)? How do you plan to leverage your solution for identity (SaaS enabler) to enable tenant-routing functionality?
    • How do you plan to support tenant-aware logs, metrics and analytics, metering?
    • How do you plan to support tenant-aware partitioned data access?
    • How do you plan to move away from one-off versions toward a universal version of your application?
    • What isolation mechanism serves you best? AWS VPC? AWS Account?

Some of the most common migration models partially retain the legacy architecture. However, the SaaS enablers are all tenant-aware and deliver results based on tenant context. This basically means that you may need to consider a way to inject the multi-tenant constructs into legacy code (see Figure 15 and the sketch that follows it).

Figure 15 – Injecting Multi-Tenant Constructs Into Legacy Code
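As Figure 15 suggests, one hedged way to inject tenant context into legacy code is a thin wrapper that resolves the tenant from the incoming request (e.g., a JWT claim) before delegating to the existing handler. This is only an illustrative sketch; the claim name, token handling, and handler are assumptions, not the article’s prescribed design:

```python
import base64
import json


def resolve_tenant_context(headers: dict) -> dict:
    """Extract tenant context from a JWT access token (signature validation
    is assumed to happen upstream, e.g., at the API gateway/authorizer)."""
    token = headers["Authorization"].split(" ")[1]
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return {"tenant_id": payload["custom:tenant_id"]}  # hypothetical claim name


def tenant_aware_entry_point(request):
    """Thin multi-tenant shim around an otherwise tenant-unaware legacy handler."""
    context = resolve_tenant_context(request["headers"])
    # Pass tenant context explicitly so logging, metrics, and data access
    # inside the legacy code can become tenant-aware incrementally.
    return legacy_handler(request, tenant_id=context["tenant_id"])


def legacy_handler(request, tenant_id: str):
    ...  # existing business logic, gradually made tenant-aware
```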
Silo Lift and Shift
An existing app moves as-is and it is deployed as isolated single-tenant

You may have one-off solutions for each of your customers running on-premises. If your primary goals are to reduce risk and minimize migration effort, then with the Silo Lift and Shift model we move all these siloed stacks into a universal representation with minimal or no changes, and then we lift and shift this workload to AWS. Our attention and effort are focused on building the SaaS enablers while retaining most of the legacy architecture. When it comes to legacy architectures, technical folks often have a love-hate relationship, as they strive for constant improvement and modernization. However, it is important to note that for some businesses this specific model, Silo Lift and Shift, is quite appealing simply because it is good enough in terms of scale and cost, and it also allows them to move faster with the migration.

Figure 16 – You may find the Silo Lift and Shift model optimal for your business
Figure 17 – Tenant Routing Techniques
Figure 18 – Data-Driven Routing – AWS Example
Figure 19 – Silo Lift and Shift Migration
Layer-By-Layer
An incremental approach; similar to the Silo Lift and Shift model, except that layers of the application are incrementally migrated to a multi-tenant model

Unlike the Silo Lift and Shift model, the Layer-By-Layer model clearly gives you the chance to optimize your target system architecture and become more scalable and efficient. You incrementally pick the layers you choose to transform into multi-tenant. Some layers are simpler to migrate with this approach than others. For instance, the web tier in a traditional three-tier application is likely to have less tenant context in it, it should not have a lot of coupling and dependencies between tenants, and it can basically be pulled out of the monolith with minimal or no impact. The application layer, however, would require significant effort to migrate (see Figures 20, 21).

Figure 20 – Layer-By-Layer Migration – Phase 1 Web-Tier – Example
Figure 21 – Layer-By-Layer Migration – Phase 1 App-Tier – Example
Service-By-Service
An incremental approach; similar to the Layer-By-Layer model, except that unlike a layer, a service is an existing self-contained functional capability that is fully redesigned into a cloud-native, multi-tenant service that follows the SaaS best practices.

Unlike the Layer-By-Layer model, with the Service-By-Service model you gradually carve out capabilities from your existing system and redesign them from scratch as multi-tenant, cloud-native services that carefully adhere to the SaaS best practices. This model fits best organizations willing to invest in getting their solution to the end-state of a modern, cloud-native, multi-tenant architecture with all the benefits that follow, while the legacy architecture is eliminated entirely. However, since this migration model is incremental, the legacy architecture must run in complete harmony, side by side, with the new services we introduce. Note that carving functionality out of a monolith and transforming it into an autonomous (micro)service can be a real challenge, especially at the data layer. SQL JOIN statements, inter-dependencies, and tight coupling inside your monolith are the kinds of challenges you should identify and use to prioritize the low-hanging fruit; in other words, try to surface those capabilities that are relatively straightforward candidates for the Service-By-Service model.

Figure 22 illustrates the Service-By-Service model where the new modernized, multi-tenant services are implemented as Microservices and the existing traditional three-tier web application is migrated and deployed using silo isolation.

Figure 22 – Service-By-Service Migration – Example

Figure 23 presents a variation of the previous Service-By-Service model example. In this example we further optimize our target system architecture as we carve out the web tier from the monolith and then apply the Layer-By-Layer model by turning it into a multi-tenant construct.

Figure 23 – Service-By-Service Migration (variation 1) – Example

Redesigning the web tier to support multi-tenancy, as illustrated in Figure 23, indeed boosts your ability to scale and makes things more efficient, but there is a greater opportunity for modernizing your target system architecture. Figure 24 presents a second variation of the Service-By-Service model example. This time we eliminate the web tier altogether and replace it with AWS S3 (for serving static web resources) and AWS API Gateway to support the web app via a RESTful API.

Figure 24 – Service-By-Service Migration (variation 2) – Example
Which SaaS Migration Models To Use? When?
What do the advantages and disadvantages of the SaaS migration models mean for your business? Could it be that combining these models in a longer-term, multi-phase plan makes more sense for your business?

Assuming that our ultimate goal is to end up with a modern, cloud-native, multi-tenant target system architecture, Figure 25 illustrates a multi-phase plan to meet this end-state. In phase #1 we use Silo Lift and Shift to move faster (short TTM) with minimal effort; in phase #2 we either take a more conservative transformation, using the Layer-By-Layer model first to make our web tier multi-tenant, or we move directly to the Service-By-Service migration model.

Figure 25 – Combining SaaS Migration Models – Example

Which SaaS migration model you should use and when, is really a question for your business to answer. We can definitely point out that each migration model has its challenges and opportunities as listed in Table 3.

Lift & Shift
Pros: TTM; minimally invasive; security & compliance (strict isolation).
Cons: agility; cost; manageability.
Layer-By-Layer
Pros: incremental; moderately invasive; quick wins.
Cons: TTM; manageability; cost.
Service-By-Service
Pros: incremental; full modernization; scale, availability, agility.
Cons: TTM; data model migration; complexity (invasive).
Table 3 – The Trade-Offs – SaaS Migration Models

This Doesn’t End Here…

We touched on some key points pertaining to the AWS SaaS migration journey from business and technical angles, while trying to answer the key questions of what it takes to become a SaaS provider and what the main strategies are. Although we have made some progress, this is just the tip of the iceberg; if you want to go deeper on AWS SaaS, I’d advise you to explore the AWS SaaS Factory Program web site and watch the recording of this excellent breakout session: “AWS re:Invent 2019: SaaS migration: Real-world patterns and strategies (ARC371-P)“.

Categories
aws Cloud

Architecting and Operating Resilient Serverless Solutions On AWS

“AWS Lambda scales automatically by design, so what is left for us (architects) to do aside from sitting back and relaxing?” you may ask. It is true that AWS Lambda scales automatically, but you will easily miss your performance and operational targets unless you follow some guidelines. Let’s walk through some patterns and services you may want to consider when building serverless solutions.


Load Shedding

“You can parallelize a system up to the point where contention becomes the bottleneck” – the Universal Scalability Law (a generalization of Amdahl’s Law).
So as transactions per second (TPS) increase, at some point your system reaches its limits, latency gets higher, your clients experience timeouts, and that does not stop your servers (and downstream servers) from continuing to work hard for nothing. This waste gets worse when clients retry on errors. Here are a few recommendations:

  1. Cheaply reject excess work (a configuration sketch follows this list)
    • Implement concurrency limits using AWS Lambda Function configuration.
    • Implement API throttling using AWS API Gateway.
  2. Do not waste work
    • Implement server-side timeout using AWS Lambda Function configuration.
  3. Do bounded work
    • Implement AWS Lambda Functions that consume a similar amount of resources per event (e.g., pagination is a method that bounds the work). Are you doing orchestration within your AWS Lambda Function? You may want to consider using AWS Step Functions (or the newly introduced AWS Step Functions Express Workflows feature), which allows you to take the orchestration part out of your AWS Lambda Functions. AWS Step Functions is also pre-integrated with several AWS services (e.g., AWS SQS, SNS, Batch, etc.).
      Example – Orchestration Within AWS Lambda Function
  4. Do not take extra work
    • By implementing AWS Lambda functions you already get an isolated execution environment with fixed resources per request (aka, unit of work).
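As promised above, here is a minimal, hedged sketch (using boto3; the function name and numeric values are placeholders) of how points 1 and 2 could be applied, i.e., capping a function’s concurrency and bounding its execution time:

```python
import boto3

lambda_client = boto3.client("lambda")

FUNCTION_NAME = "orders-api-handler"  # hypothetical function name

# 1. Cheaply reject excess work: cap the number of concurrent executions
#    so excess invocations are throttled instead of piling up downstream.
lambda_client.put_function_concurrency(
    FunctionName=FUNCTION_NAME,
    ReservedConcurrentExecutions=50,
)

# 2. Do not waste work: enforce a server-side timeout (in seconds) so a stuck
#    invocation stops consuming resources once the client has likely given up.
lambda_client.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    Timeout=10,
)
```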

Dependency Isolation

Little’s Law: concurrency = arrival rate x latency
In other words, the lower the latency goes – the higher concurrency you can gain (and up to the predefined limits).
So if an API supports different modes that require different compute resources and/or different processing times, one slow mode may end up slowing down the transactions of all modes.
In a different scenario, a service starts generating a higher volume than expected against your database, which badly degrades its performance. Here is how you can compartmentalize dependencies to isolate their concurrency capacity:

  1. Ensure your API is designed to prevent one dependency from affecting unrelated functionality. You do not want your API to become overloaded when dependencies slow down.
  2. Use throttling and concurrency limits in AWS API Gateway and Lambda Functions to protect your services.
  3. Consider placing AWS API Gateway/Lambda Function in front of resources that do not already isolate their concurrency.
  4. Prefer asynchronous invocations over synchronous ones unless you absolutely can’t. Not only does that help isolate the concurrency of each of the services involved, but also, if downstream services fail or, worse yet, take too long to respond, your service remains intact. So even if another service within the application is crippled, there isn’t a ripple effect throughout all of the services due to synchronous dependencies.
    If all you need to know is that the request was successfully processed and that the payload was durably stored, you can most probably put AWS Lambda asynchronous invocation to work (see the sketch below). But if, after all, you still need the API response, consider adhering to any of these Asynchronous Patterns.
Example – Asynchronous Invocation When API Response Is Not Needed
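For illustration, here is a minimal sketch (boto3; the function name and payload are hypothetical) of invoking a Lambda function asynchronously, so the caller only learns that the event was accepted and durably queued:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# InvocationType="Event" queues the request and returns immediately (HTTP 202),
# decoupling the caller's latency and concurrency from the downstream function.
response = lambda_client.invoke(
    FunctionName="order-processor",           # hypothetical function name
    InvocationType="Event",
    Payload=json.dumps({"orderId": "1234"}),  # hypothetical payload
)

assert response["StatusCode"] == 202  # accepted for asynchronous processing
```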

Implementation Tips

If the order of messages matters, use the recently announced AWS Lambda Supports Amazon SQS FIFO as an Event Source.
If a low-latency API at low cost is a high priority, consider implementing the recently announced HTTP APIs for AWS API Gateway. Basically, you can now choose between REST APIs, WebSockets, and HTTP APIs.

Example – Throttling, Timeout, And Exhausted Database Connections

AWS Lambda Function execution environments get reused, so to avoid resource starvation and application bottlenecks, you must clean up resources that are no longer needed before the termination of the Lambda Function (e.g., file descriptors, sockets, etc.). Having said that, especially within the ‘handler’ function of AWS Lambda, avoid expensive re-initialization of resources. It is usually better to keep initialization code outside the function handler so it can be reused across invocations (e.g., static constructors, global/static variables, database connections, etc.). For use cases where you need to refresh these values periodically, use an in-memory cache with an expiry policy.
Note that AWS has just announced AWS RDS Proxy, which can handle persistent database connections (and authentication!) for you. Re-initialization of a database connection (including the TLS handshake) is an expensive operation that can now be reduced significantly.
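A minimal sketch of this tip (Python Lambda; the table name is a placeholder): expensive clients are created once per execution environment, outside the handler, and reused across invocations:

```python
import boto3

# Initialized once per execution environment and reused across invocations.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name


def handler(event, context):
    # Only per-request work happens here; no client/connection re-initialization.
    item = table.get_item(Key={"orderId": event["orderId"]})
    return item.get("Item", {})
```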

To handle concurrency correctly for synchronous AWS Lambda Function invocations, it is important to understand how AWS Lambda Function Scaling works, especially this part from the official AWS documentation: “… For an initial burst of traffic, your function’s concurrency can reach an initial level of between 500 and 3000, which varies per Region. … After the initial burst, your function’s concurrency can scale by an additional 500 instances each minute. This continues until there are enough instances to serve all requests, or a concurrency limit is reached.“ Note that at re:Invent 2019, AWS announced AWS Lambda Function Provisioned Concurrency, which ensures that at any given time you have a predefined number of Lambda Function instances ready to execute concurrently; use it to avoid cold starts.

Example – Lambda Bursting & Ramp-Up
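As a hedged sketch (boto3; the function name, alias, and count are placeholders), provisioned concurrency can be configured on a published alias or version like this:

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep a fixed number of pre-initialized execution environments warm for the
# "live" alias, so bursts up to that level avoid cold starts entirely.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="order-processor",        # hypothetical function name
    Qualifier="live",                      # alias or published version
    ProvisionedConcurrentExecutions=100,   # placeholder value
)
```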

AWS services most often use well-defined API throttling limits, so use these APIs with care; for example, refrain from frequent re-initialization (e.g., avoid retrieving the same secret value from AWS Secrets Manager on every invocation). If your AWS Lambda Function publishes custom metrics to AWS CloudWatch, consider using the recently announced AWS CloudWatch Embedded Metric Format, which lets your AWS Lambda Function push these metrics to AWS CloudWatch by simply logging them.
You should also be aware that several AWS service APIs support batching, e.g., AWS SQS; use it (“write in batches”) to reduce your chances of being throttled. If your AWS Lambda Function is triggered by AWS SQS, you can configure the batch size to optimize the concurrent execution of the function (“read in batches”).
Last but not least, lacking or improper error handling may degrade your application’s resiliency, especially in cases of repeating errors. Embrace the Fail-Fast principle in your application design; there is no point in continuing to generate a steady load on a failing service, it will just make things worse. Consider implementing retries with backoff, together with the Circuit Breaker design pattern, to cope with dependency failures (see the sketch below).
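A minimal sketch of retry with exponential backoff and jitter (plain Python; the limits are illustrative, not prescriptive):

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a flaky dependency call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # fail fast once the retry budget is exhausted
            # Full jitter: sleep a random amount up to the exponential ceiling.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))


# Usage (hypothetical dependency call):
# result = call_with_backoff(lambda: some_client.get_item(Key={"id": "42"}))
```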

Avoiding Queue Backlogs

What happens if your queue fills up because the pace at which messages are produced is higher than the pace at which they are consumed? The processing of the whole backlog slows down, which, depending on your service SLA, may not be acceptable.
Depending on your use case, here is how you can deal with the issue:

  1. The consumer applications of those queues should be designed to scale automatically.
  2. During a spike, work with a low-priority queue (e.g., for enqueuing aged messages) and a high-priority queue to handle the spike. If reading messages from a queue is much faster than processing them, then read a message from the high-priority queue and, if it meets certain age criteria, defer its processing by enqueuing it in the low-priority queue (see the sketch after this list); or simply drop aged messages (TTL) if it makes sense application-wise (e.g., an IoT device’s point-in-time state).
  3. Reduce the number of retry attempts and handle failed events by using either AWS Lambda Function Dead Letter Queues or the recently announced feature AWS Lambda Supports Destinations for Asynchronous Invocations. Also, consider configuring the Maximum Event Age in your AWS Lambda Function to automatically dismiss aged events.
  4. Implement backpressure (throttling) using AWS API Gateway to reject excess load or alternatively (and very similar to the priority queues), implement application logic to route excess load to a ‘surge queue’ and the rest of the traffic to a ‘warm queue’. Each queue is processed separately by a dedicated AWS Lambda Function isolated by its own concurrency configuration.
  5. Implement the Shuffle Sharding design pattern, where you introduce a group of queues behind a “smart” routing layer. Each customer gets assigned two or more queues and this assignment is permanent. This is all about limiting the blast radius.
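A minimal sketch of the priority-queue deferral idea from point 2 (boto3; the queue URLs, age threshold, and message attributes are placeholders):

```python
import json
import time
import boto3

sqs = boto3.client("sqs")

LOW_PRIORITY_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/low-priority"  # placeholder
MAX_AGE_SECONDS = 60  # illustrative age criterion


def handle(message_body: str) -> None:
    event = json.loads(message_body)
    age = time.time() - event["enqueuedAt"]  # hypothetical timestamp attribute
    if age > MAX_AGE_SECONDS:
        # Aged message: defer it so fresh traffic keeps flowing during the spike.
        sqs.send_message(QueueUrl=LOW_PRIORITY_URL, MessageBody=message_body)
        return
    process(event)  # hypothetical business logic


def process(event: dict) -> None:
    ...
```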

Operating

Your serverless solution works like a charm. Until it does not.
How quickly do you diagnose and mitigate issues? You need to be able to analyze the behavior of your distributed application by profiling your code and by monitoring your transactions, application & infrastructure.

  1. Implement a request tracing solution to identify and troubleshoot the root cause of performance issues and errors. There are several open-source and commercial products that can be used (e.g., Epsagon, Dynatrace, Datadog, Zipkin, Jaeger, etc.).
    AWS recommends using AWS X-Ray together with AWS ServiceLens, which “ties together CloudWatch metrics and logs, as well as traces from AWS X-Ray to give you a complete view of your applications and their dependencies” (providing overall system health in one place by combining application and transaction metrics).
  2. Collect, search, and analyze your log data using the AWS Elasticsearch Service, or alternatively using AWS CloudWatch Logs and the recently announced AWS CloudWatch Logs Insights, which requires no setup and is quite useful and very fast. In addition, AWS CloudWatch Contributor Insights, which was also recently announced, can be leveraged to analyze log data and provide a view of the top contributors (e.g., user, service, resource, etc.) influencing system performance.
  3. Implement monitoring dashboards (e.g., by using AWS CloudWatch Dashboard) in a consistent manner by adhering to some pattern. For example, use a layered approach starting from the customer “front door” – AWS ALB then to your Lambda Functions (including breakdowns at API level) then to your cache service and finally to your database.
  4. Consider monitoring your end-user experience by using the recently announced AWS CloudWatch Synthetics that continually generates traffic to verify your customers’ experiences.

Asynchronous Patterns

It is pretty straightforward to implement asynchronous API when your client is not expecting a response back. But what if it does expect a response back? In this case, you may want to consider implementing one or more of these patterns.

APIs In The Front, Async In The Back

Polling

  1. The client submits a job to AWS Step Functions via AWS API Gateway and in return gets an immediate response with a request-id. The client then uses the request-id to poll for status. Once the job is completed and the result is stored in an AWS S3 bucket, the client fetches the results via AWS API Gateway. Now, if the processing time is relatively short (e.g., less than 15 min), then the business logic orchestrated by your AWS Step Function would probably be implemented via AWS Lambda Functions; otherwise, you would prefer your AWS Step Function to trigger an AWS Batch job.
    What if your throughput is relatively large (e.g., greater than 300 RPS)? In that case you would prefer to deploy AWS SQS & Lambda Functions in front of your AWS Step Function, or alternatively, the newly introduced AWS Step Functions Express Workflows capability may be a good fit to meet high throughput requirements.
    What if the object in the AWS S3 bucket is relatively large (e.g., greater than 10MB)? Then you would probably prefer to directly download the object from the AWS S3 bucket using a pre-signed URL (see the sketch below).
    On the upside, it requires minimal changes for clients and it may be used to wrap existing backends.
    On the downside, first, you delay the response (polling time minus job completion time); second, excess compute is wasted on both ends, client and server.
    Example – Asynchronous Patterns – Polling
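A minimal sketch of the pre-signed URL approach mentioned above (boto3; the bucket, key, and expiry are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Hand the client a short-lived, direct download link instead of proxying a
# large result object through API Gateway.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "job-results-bucket", "Key": "results/1234.json"},  # placeholders
    ExpiresIn=300,  # seconds
)
# Return download_url in the status/polling response once the job is done.
```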

WebHooks

  1. In order to establish trust, your client is registered and verified. Then, the backing service does all the work asynchronously. Lastly, using AWS SNS, the backing service calls back to the client when the job is completed (consider setting a dead-letter queue on the AWS SNS subscription to capture and handle undeliverable messages). As before, if the processing time is relatively short, then the business logic orchestrated by your AWS Step Function would probably be implemented via AWS Lambda Functions; otherwise, you may prefer triggering an AWS Batch job. If the object in the AWS S3 bucket is larger than 256KB, which is the AWS SNS payload size limit, you would prefer to directly download the object from the AWS S3 bucket using a pre-signed URL.
    On the upside, compared to polling, it is less resource-intensive for both client and server. In addition, AWS SNS handles all the heavy lifting of delivery & retries.
    On the downside, the client needs to host a web endpoint (highly available and/or supporting the server’s retry policy), and the server must implement a mechanism for establishing trust.
    Example – Asynchronous Patterns – Webhooks

WebSockets

  1. The client submits a job to AWS Step Functions via AWS API Gateway and in return gets an immediate response with the details that enable the AWS Step Function and the client to securely communicate via a WebSockets endpoint of AWS API Gateway. The client then opens a connection to the WebSockets endpoint via AWS API Gateway, and an AWS Lambda Function implements the necessary logic to notify the AWS Step Function, which also identifies the client as the one who submitted the job. Once the job is completed and the client connection is approved, the AWS Step Function executes a callback step updating the client with the result.
    What if your throughput is relatively large (e.g., greater than 300 RPS)? As before, you would rather deploy AWS SQS & Lambda Functions in front of your AWS Step Function, or alternatively use the newly introduced AWS Step Functions Express Workflows capability if it fits better.
    What if the resulting object is relatively large? AWS API Gateway WebSockets defines a payload size limit of 128KB, while each WebSocket frame must not exceed 32KB. Once again, this limitation can be mitigated, for example, by directly downloading the object from an AWS S3 bucket using a pre-signed URL.
    On the upside, there is less waste of compute resources, and the bi-directional communication channel you establish may be beneficial for a wide range of use cases relevant to you. You can now use this ‘push’ (vs. poll) mechanism not only to notify clients when a job is done, but also to proactively push events that update the client on any server-side state changes the client is subscribed to. If you want to learn more about these kinds of architectures, this is a great post to start with: From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.
    On the downside, AWS API Gateway WebSockets has a limit of 500 new connections per second, you and the information security team need to be well familiar with the WebSockets protocol, and you may have a requirement to ensure portability across supported client devices and browsers.
    Example – Asynchronous Patterns – WebSockets

Conclusion

We highlighted a few patterns and services you will probably want to consider when building resilient serverless solutions on AWS. AWS, and Amazon.com for that matter, both embraced the ‘Eat Your Own Dog Food‘ approach to building platforms; AWS services and quite a few of the techniques we covered are already in use by Amazon.com and AWS. At re:Invent 2019, AWS announced the Amazon Builders’ Library, “a collection of living articles that take readers under the hood of how Amazon architects, releases, and operates the software underpinning Amazon.com and AWS“. For more information, including how AWS Lambda internally utilizes some of these techniques, watch the recordings of these breakout sessions: “SVS407 Architecting and operating resilient serverless systems at scale” and “SVS335 Serverless at scale: Design patterns and optimizations“.

Categories
Cloud

Las-Vegas, Here I Come!

After a three-year break, I am traveling back to the place where the modern cloud is reborn every year in December – the AWS re:Invent 2019 event.

Fueled with high expectations and enthusiasm, I am traveling to the cloud technology conference to be inspired by the trends, strategies, and innovations led by AWS. Then I cannot help brainstorming how these newly announced capabilities, platforms, services, and features can be leveraged to design and implement modernized solutions that help our customers better achieve their business objectives. The insights I’ll bring back from the conference are one of the most important items in my Black Friday shopping cart. Stay tuned for updates!

Categories
Cloud

My Key 5 Takeaways from VMWorld 2018

I traveled to VMworld 2018 to explore VMware, its ecosystem, and where it is heading. VMware is pretty much a new world to me, and my main goal was to learn how businesses can benefit from VMware through their journey to the public cloud, and to AWS in particular.

As one who finds “lift & shift” a less attractive strategy (being gentle here…) and who strongly believes in never-ending software modernization, I was thrilled to find some promising messages coming from VMware. Pat Gelsinger, CEO, VMware:
“Our vision is simple: Empower people to access any app on any device, from any cloud, with intrinsic security architected-in across every layer.” and “VMware observes a transition from Data Centers to Centers of Data that need to be connected, operated and secured together.”

These two inspiring statements summarize VMware’s overall strategy well. Although VMware’s strategy refers to “any cloud”, in this post we will focus on VMware Cloud on AWS.

#1 The Variety of Use Cases for Public Cloud

The journey to the public cloud begins with the understanding of how VMware Cloud fits into your cloud strategy.

According to VMware, there is a hybrid-cloud trend to run workloads in the public cloud while continuing to run workloads on-premises. That brings tighter integration across cloud providers and significant cost savings on hardware. However, the main business challenges are operational inconsistency, acquiring new methodologies and tools, monitoring and security, and budget constraints.

Data Center Extension

There are various scenarios for which you may choose to extend your data center in the public cloud.

Whether you need to meet regional footprint expansion and growth needs of the business, or to provision temporary capacity for development and testing, take advantage of the on-demand capacity and global infrastructure the public cloud provides. It allows you to expand your footprint into new regions around the world with all the capacity you need. If you are looking to reduce cost or modernize your disaster recovery & backup, then VMware Site Recovery & AWS S3 may come in handy.

Cloud Migrations

VMware provides you with various capabilities to support live migration of existing workloads to the public cloud.

  • HCX for VM bi-directional migration
  • vSAN backed by AWS EBS scaling compute and storage independently
  • High-performance hybrid connectivity via NSX micro-segmentation & AWS Direct Connect.
  • CPU Core & VM Compute Policies support application licensing requirements (ideal for enterprise applications from Oracle and Microsoft).
  • AWS global presence supports various compliance requirements

VMware on AWS supports several networking options to address various requirements around high availability, bandwidth, and security (including HCX and NSX’s stretched networks):

Next Generation Applications

Next generation applications can further benefit from modern cloud-native architecture, e.g.:

  • DNS Management (AWS Route53 in a customer-owned AWS account routes traffic through its VPC DNS and from there to the VMware Cloud on AWS SDDC account, enabling DNS management via AWS Route53 & AWS Directory Service)

  • AWS ELB integration with SDDC

  • SDDC integration with AWS S3 or EFS storage

  • SDDC integration with AWS RDS

#2 Becoming Cloud Agnostic Via Consistent Multi-Cloud Support

The multi-cloud strategy is really an interesting one and can bring a lot of value if it is well supported. Is such a tremendous level of abstraction indeed achievable, and how smooth can it get (e.g., with native integrations, with no swivel chair across public cloud providers, etc.)?

#3 Containerization & The Bet on Kubernetes

The acquisition of Heptio and VMware’s new Kubernetes-as-a-service capability (VMware Cloud PKS) on top of various public cloud providers is exciting for cloud-native app developers, as well as for those interested in leveraging VMware’s multi-cloud strategy. However, it is not clear when/if a service mesh (e.g., Istio) will be part of this managed service offering. Also, at VMworld 2018 I did not have the chance to learn more about it and compare it with AWS EKS, for example.

#4 VMWare on AWS – Extended Integration & Serverless

More great news for cloud-native developers is the appearance of new application services in VMware’s arsenal.

#5 Security

Shared Responsibility

Hold on! You can’t cross a river without getting wet… In other words, how come VMware is not sharing responsibility for the new managed services, like AWS RDS (the VMware side of it), Project Dimension, VMware Cloud PKS, VMware Blockchain, etc. ?

The shared responsibility illustrated below would make more sense back then in 2004 when VMware was an IaaS provider…

VMware Cloud on AWS – Security Tools

  • Even though it is not an exhaustive list, we are all curious to know what technology stack others use, aren’t we 🙂

Misc.

  • PCI compliance is not yet available but it is in the roadmap
  • Host systems removed from a cluster are cryptographically wiped
  • VMware is a processor – the customer is the controller (GDPR)
  • VMware password strength is 4 characters (!?#%) but it is in the roadmap to fix it
  • VMware has a whole bunch of advanced security capabilities (e.g., NSX Micro-Segmentation) mainly around networking that I did not cover here and it is not news.

Conclusion

VMware’s strategy is indeed impressive, as are some of the demos presented during VMworld 2018. Customers who already use VMware and are interested in hybrid-cloud may find VMware Cloud on AWS quite appealing. However, I was left with some open questions:

  • How much operational consistency will eventually be achieved?
  • What is the give and take comparing VMware’s multi-cloud abstraction to native integration with a public cloud like AWS?
  • What are the pros & cons of VMware Cloud managed services compared to the equivalent public cloud offerings (e.g., VMware Cloud PKS vs. AWS EKS)?