Adopting RHEDcloud for AWS

Background

In response to inquiries about how to deploy RHEDcloud for AWS, which is already in production at Emory University and is being deployed at Rice University, the RHEDcloud Project prepared this overview of the architecture, requirements, and deployment of the solution.

Application Architecture

The RHEDcloud web applications and web services are all Java applications that can be run in a Java application server. The RHEDcloud Project provides CloudFormation templates and scripts to create AWS Elastic Beanstalk environments in which to deploy these Java web services and web applications. These Java applications and web services communicate with each other using a Java Message Service provider. The RHEDcloud Project provides CloudFormation templates and scripts to deploy an Amazon MQ Java Message Service provider.

RHEDcloud for AWS applications store their account metadata, notifications, and other housekeeping data in a relational database such as PostgreSQL, MySQL, or Oracle. The RHEDcloud Project provides CloudFormation templates, scripts, and utilities to deploy an RDS database to serve as the persistent store of this data. The RHEDcloud Security Risk Detection Service persists security risk detection data to the serverless Amazon DynamoDB database. The RHEDcloud Project setup and installation templates and utilities also set up this persistent store.

All of the virtual private cloud and service control policy templates are implemented as AWS CloudFormation templates, and their comprehensive tests are developed in Python. Python was selected over Java for the template and policy unit test suite, because it could be more readily learned and used by some system administrators who are transitioning roles to DevOps engineers.
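
As a rough illustration of what a Python unit test for a template can look like, the sketch below checks a small CloudFormation template held in memory. The template content, the resource name ExampleVpc, the helper resources_of_type, and the specific assertions are all assumptions made for this example; they are not taken from the RHEDcloud test suite.

```python
import json
import unittest

# Hypothetical, minimal template used only for this illustration.
TEMPLATE_JSON = """
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "ExampleVpc": {
      "Type": "AWS::EC2::VPC",
      "Properties": {
        "CidrBlock": "10.0.0.0/23",
        "EnableDnsSupport": true,
        "EnableDnsHostnames": true
      }
    }
  }
}
"""


def resources_of_type(template, resource_type):
    """Return {logical_id: resource} for every resource of the given type."""
    return {
        name: res
        for name, res in template.get("Resources", {}).items()
        if res.get("Type") == resource_type
    }


class VpcTemplateTest(unittest.TestCase):
    def setUp(self):
        self.template = json.loads(TEMPLATE_JSON)

    def test_template_declares_exactly_one_vpc(self):
        vpcs = resources_of_type(self.template, "AWS::EC2::VPC")
        self.assertEqual(len(vpcs), 1)

    def test_vpc_has_dns_support_enabled(self):
        for res in resources_of_type(self.template, "AWS::EC2::VPC").values():
            self.assertTrue(res["Properties"]["EnableDnsSupport"])
```

Saved as a test module, this can be run with python -m unittest. A real suite would load the template files from the repository rather than an inline string.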

Components

  1. The following are required:

    1. RHEDcloud AWS Account Service (Account Metadata Repository and Provisioning)

    2. RHEDcloud Console for AWS (Account, VPC, Service, Network, Provisioning, and Notification Management)

    3. RHEDcloud Landing Page for AWS (Launch page with links into the AWS Console, RHEDcloud Console, Service Request System, AWS Service Inventory and Risk Assessments, etc.)

    4. Security Risk Detection Service (Security Overwatch deployed with any or all of the existing detectors developed by the RHEDcloud project)

    5. E-mail Address Validation Service (E-mail Address Validation for Addresses used with Cloud Accounts)

    6. Temporary Key Issuance (TKI) Service (Accelerates and Simplifies User Access without Long-lived Credentials)

    7. IDM Service (Exposes Roles and Role Assignments/Memberships to the Other Services listed here---presently implemented for NetIQ, with support for Grouper and others in progress)

    8. Directory Service (Exposes an organization’s person search features to the other services listed here)

    9. Financial Account System Service (Exposes financial account system numbers to the rest of these applications and services for validation)

  2. The following are optional; whether a site deploys them depends on what the site is doing and what network, firewall, and other infrastructure it has:

    1. Network Operations Service (Network Automation for Site-to-Site VPN and Static NAT)

    2. Cisco ASR Service (Router-level automation for Site-to-Site VPN and Static NAT---presently implemented for Cisco ASR routers using Netconf/Yang, but could be extended to other equipment and standards)

    3. Elastic IP Service (Orchestrates On-Prem Static NAT for the Cloud)

    4. Firewall Service (Exposes On-prem and Cloud-based Firewall rules for VPCs---presently implemented for Palo Alto firewalls, but could be extended to others)

How RHEDcloud Deploys Environments

RHEDcloud deploys all applications into AWS Elastic Beanstalk using the Tomcat app server, with Bitbucket pipelines to automate build, package, test, and deployment. RHEDcloud uses the AWS-managed Java Message Service provider Amazon MQ. All deployments have been automated to allow push-button promotion of new code from DEV to TEST, STAGE, and PROD. Most sites will only need to run one non-production and one production environment.

The RHEDcloud Project provides deploy-only pipelines that implementing sites can adopt to pull specific builds of each RHEDcloud product from the project repositories in Bitbucket at https://bitbucket.org/rhedcloud and deploy them into their own AWS account to implement for their site. In this way, implementing sites are wired directly into the project to accept and test new updates, features, and fixes easily.

The RHEDcloud Project provides a mechanism to automatically quiesce unused environments and restart them on demand to manage the cost of unused environments. There is nothing less expensive than spending $0/hour on environments that are not actively used, so the goal is to run environments only when they are needed. Presently, unused environments are quiesced on a schedule. The RHEDcloud Project is adding activity awareness to the RHEDcloud Console and AWS Account Service so that, in the future, environments can be quiesced based on their use.
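
The page does not describe how the schedule is expressed, so the following is only a minimal sketch of the idea of schedule-based quiescing. The ActiveWindow class, the desired_state function, and the business-hours window are hypothetical names and values chosen for illustration, not part of RHEDcloud.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class ActiveWindow:
    """Hypothetical schedule: the days and hours when an environment should run."""
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    start: time
    end: time

    def should_run(self, now: datetime) -> bool:
        # Active only on a listed weekday and inside the [start, end) hours.
        return now.weekday() in self.weekdays and self.start <= now.time() < self.end


def desired_state(window: ActiveWindow, now: datetime) -> str:
    """Return "active" inside the window and "quiesced" outside it."""
    return "active" if window.should_run(now) else "quiesced"


# Assumed example window: Monday to Friday, 07:00 to 19:00 local time.
BUSINESS_HOURS = ActiveWindow(frozenset(range(5)), time(7, 0), time(19, 0))
```

A scheduler could call desired_state periodically and stop or start an environment whenever the result differs from its current state. An activity-aware version would replace the fixed window with a check of recent use.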

The cost to run the RHEDcloud for AWS service on AWS will vary, depending on the number of environments a site runs and the class of instances selected for compute and databases. Some examples from implementing sites and their deployment profile will be posted here soon.

Enterprise Application Integration Requirements

Emory chose to integrate the AWS at Emory service deeply into Emory's security, identity, network, and financial accounting infrastructure, so there are many integrations that adopters may choose to use or adapt to their needs. However, three integrations are absolutely required for the solution to work:

  1. Single Sign-on Integration - Emory uses Shibboleth, a SAML2 identity provider, so the solution is presently pre-configured to support authentication with a SAML2 identity and service provider. The SSO integration must include releasing SAML attributes for all of the AWS roles a user is assigned to. These attributes are used by the AWS console and the RHEDcloud Temporary Key Issuance (TKI) Service.
  2. Identity Management Integration - the RHEDcloud security model is based on roles, so it needs a provider of roles. Adopting sites must expose their provider of roles as a web service that the RHEDcloud components can call to:
    1. query for roles
    2. provision new roles
    3. de-provision roles
    4. query for role assignments
    5. assign users to roles
    6. remove users from roles
  3. Directory Service Integration - several RHEDcloud components must be able to look up users by fields such as legal name, directory name, Network ID, unique-person ID, etc. Adopting sites must expose elements of their directory as a web service that the RHEDcloud components can call to look up users and their current e-mail addresses.
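
The six identity management operations listed in item 2 can be pictured as a small interface. The sketch below is an assumption about shape only: the RoleProvider class, its method names, and the in-memory implementation are hypothetical and do not reflect the actual RHEDcloud IDM Service contract, which a site would implement as a web service in front of its own identity system.

```python
from abc import ABC, abstractmethod


class RoleProvider(ABC):
    """Hypothetical interface covering the six role operations."""

    @abstractmethod
    def query_roles(self): ...

    @abstractmethod
    def provision_role(self, role): ...

    @abstractmethod
    def deprovision_role(self, role): ...

    @abstractmethod
    def query_assignments(self, role): ...

    @abstractmethod
    def assign_user(self, role, user_id): ...

    @abstractmethod
    def remove_user(self, role, user_id): ...


class InMemoryRoleProvider(RoleProvider):
    """Toy implementation for illustration; a real one would call the site's IDM."""

    def __init__(self):
        self._members = {}  # role name -> set of user IDs

    def query_roles(self):
        return sorted(self._members)

    def provision_role(self, role):
        # Creating an existing role is a no-op.
        self._members.setdefault(role, set())

    def deprovision_role(self, role):
        self._members.pop(role, None)

    def query_assignments(self, role):
        return sorted(self._members.get(role, set()))

    def assign_user(self, role, user_id):
        if role not in self._members:
            raise KeyError(f"unknown role: {role}")
        self._members[role].add(user_id)

    def remove_user(self, role, user_id):
        self._members.get(role, set()).discard(user_id)
```

Keeping the operations behind one interface is what lets the same components work with different role providers, which is the adaptation an adopting site would need to make.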

Monitoring and Alerts

The RHEDcloud Project has developed a comprehensive set of synthetic transaction monitors and log monitoring for RHEDcloud for AWS using the DataDog platform. Implementing sites may re-use this pre-built monitoring and alerting solution or they may use these monitoring definitions to implement monitoring and alerting with their preferred tools. The RHEDcloud Project selected DataDog, because it is a cost-effective, single provider of synthetic transaction monitoring, log monitoring and analysis, and deep integrations into AWS with consoles and reporting for Amazon MQ, RDS, DynamoDB and other services used by the RHEDcloud for AWS solution.

Network Automation

There are very specific requirements for network hardware, OS version, and configuration that must be met in order to take advantage of the RHEDcloud network automation as it is currently implemented. To summarize, you would need the following:

  • 2 x Cisco ASR1Ks (we use ASR1002-HX with added hardware crypto module)
  • IOS-XE 16.6.2 (we are working on support for 16.9.x now)
  • Public IPs for 2 x 200 VPN end-points (we use 2 x /24s - 200 addresses per router - for this)
  • Private IPs for 200 VPCs (we use 2 x /16s for this)
  • Public IPs for 1:1 Static NAT (we use 1 x /23 for this)
  • Public IPs for PAT for general VPC internet access (we use 1:2048 - public:private)
  • APIPA IPs for GRE tunneled traffic & BGP peering (you should be able to copy ours unless you have a conflict)
  • ASN for your on-prem environment (ours is public 3512 but probably doesn't have to be public)
  • ASN for your AWS VGWs (we use 65533 but this should be configurable in RHEDcloud config docs)
  • RHEDcloud prerequisite router configuration (adapted based on your numbers and specifics of your site)
    • including "Cloud" vrf (we called ours AWS but Cloud would probably be better)
    • including "NAT" vrf
    • netconf-yang
    • VPN
    • 1:1 static NAT
  • Router credentials for the NETCONF API (we use local credentials, one per router per service (VPN and NAT), stored in AWS Secrets Manager)
  • Reachability from RHEDcloud automation environment IPs to router management IPs on port 830 for NETCONF API access
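
The address figures in the list above can be sanity-checked with a little arithmetic. The sketch below uses Python's standard ipaddress module; the concrete address blocks are placeholders (documentation, benchmark, and private ranges), not Emory's allocations, and the helper largest_uniform_prefix and the equal-sized-VPC assumption are illustrative only.

```python
import ipaddress


def largest_uniform_prefix(blocks, needed):
    """Shortest prefix length (largest subnet size) that yields at least
    `needed` equal-sized IPv4 subnets across all of `blocks`."""
    start = max(b.prefixlen for b in blocks)
    for prefix in range(start, 33):
        count = sum(2 ** (prefix - b.prefixlen) for b in blocks)
        if count >= needed:
            return prefix
    raise ValueError("blocks are too small")


# Placeholder blocks with the same sizes as those mentioned in the list.
vpn_block = ipaddress.ip_network("192.0.2.0/24")      # one /24 per router
vpc_blocks = [ipaddress.ip_network("10.64.0.0/16"),
              ipaddress.ip_network("10.65.0.0/16")]    # 2 x /16 for VPCs
nat_block = ipaddress.ip_network("198.18.0.0/23")     # 1 x /23 for static NAT

# A /24 holds 256 addresses, so 200 VPN end-points leave 56 spare.
spare_vpn_addresses = vpn_block.num_addresses - 200

# Assuming equal-sized VPCs, the largest size that fits 200 VPCs is a /23.
vpc_prefix = largest_uniform_prefix(vpc_blocks, 200)

# At a 1:2048 public:private PAT ratio, covering every private address
# in the two /16s takes 131072 / 2048 public addresses (rounded up).
private_total = sum(b.num_addresses for b in vpc_blocks)
pat_public_needed = -(-private_total // 2048)

# A /23 provides 512 addresses for 1:1 static NAT.
static_nat_capacity = nat_block.num_addresses
```

Under these assumptions the numbers work out to 56 spare VPN addresses per router, a /23 per VPC, 64 public PAT addresses, and 512 static NAT addresses. A site with different VPC counts or block sizes can substitute its own values.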

Administration and Operational Requirements

All AWS at Emory components run in fully managed AWS services and their deployment and maintenance have been automated to the greatest extent possible. For example, no operating system patching is required, because instances are automatically retired using environment quiesce and activate processes or AWS Elastic Beanstalk automatic instance replacement features. Traditional database administration and storage management tasks are provided as part of the AWS RDS managed service. Point-in-time recovery, when needed, is available with a few clicks within the AWS console. Deployments and upgrades of new releases of components are automated in the deploy-only pipelines provided by the RHEDcloud Project. Few traditional middleware or database administration tasks remain with such deployments that utilize AWS managed services to the fullest extent.

Implementing sites do need to plan for the following:

  1. Alert monitoring and incident reporting - While monitoring in DataDog can automatically alert support staff and/or create incidents or tickets in a support system, someone must be responsible for reviewing any alerts raised by the monitoring and assigning them to the appropriate support staff.
  2. RHEDcloud application and solution expertise - An implementing site needs to develop a general familiarity with the purpose and function of the solution both to support their users and triage basic problems reported by users or automated alerts.
  3. Security Operations at adopting sites will likely want to ingest some AWS CloudTrail, GuardDuty, and RHEDcloud Security Risk Detection service events and logs into their security information event management infrastructure.
  4. Some common administration and support issues are:
    1. Disruptions of network connectivity between AWS and the implementing site - the solution must communicate at times with web services on premises (outlined in the integrations above). These disruptions are rare and depend on the type of connectivity used by the implementing site. When these outages occur, app server instances must sometimes be restarted to restore service. These rare failures have occurred in two primary forms:
      1. Internet provider issues, such as Internet2 slowing down, failing, or redirecting traffic over a slower route. This manifests either as site-to-site VPN connections that read down on both the on-prem and cloud provider sides, or as tunnels that remain up over a slow network.
      2. The site-to-site VPN is up and reports up from both the on-prem network and the cloud provider, but some routing or other configuration on the on-prem network prevents traffic from being routed properly. In these cases resources in the remote VPC cannot reach the on-prem network and vice versa, even though the site-to-site VPN is up.
    2. AWS slowdowns and disruptions - AWS can experience incidents which slow down API calls to AWS services, usually in a specific region. These slowdowns can increase the error rate in components of RHEDcloud for AWS and increase error notifications. In these cases, a RHEDcloud administrator at a site may need to take action to throttle back RHEDcloud components like the Security Risk Detection Service. An incident like this may also require deleting some superfluous messages from the Amazon MQ message transport to prevent RHEDcloud components from unnecessarily processing all of the errors produced by the slow AWS API calls.
    3. Monitoring of the RHEDcloud Security Risk Detection (SRD) Service may report degraded performance - in these rare cases the problem is generally resolved by stopping and restarting the SRD Service.
    4. Users of the RHEDcloud service will often need assistance implementing their use cases when site security controls implemented in RHEDcloud interfere with or impede their work - usually this assistance involves working with the users to find an alternate way to accomplish their goal or working with information security to adjust the security control.
    5. Occasionally there are infrastructure updates to AWS managed services such as a need to update certificates or other attendant changes related to changes to the managed services themselves. AWS proactively notifies account admins using these services of such changes.
    6. One of the few downsides of extensive automation and automatic updates to infrastructure is that sometimes they introduce unexpected incompatibilities. Occasionally an automatic update of an operating system or platform component may break something, and administrators need to consider these updates as another potential cause of any problem they see. For example, a Shibboleth SSO service provider might be broken by an automatically updated operating system. These problems are rare, but they must be considered. In this model these problems cannot be solved in place, but rather they must be addressed in the pipeline automation that automatically deploys the application. So, these problems are addressed by the developers or a skilled DevOps team and not a system or application administrator.
    7. Presently, durable subscriptions for publish/subscribe services must be manually deleted on the message broker when instances are automatically replaced by the AWS Elastic Beanstalk orchestration or when services are stopped and restarted by terminating and replacing instances. The reason is that these pub/sub services run in multiple instances and must have a unique durable subscription name per instance. Presently they use an IP address as a component of the durable subscription name to make it unique. We have a modification to the pub/sub consumer framework in the works that will eliminate the need for this maintenance in all cases except scale-in scenarios where the total number of instances is reduced.
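
To make the durable subscription issue in item 7 concrete, the sketch below shows how an IP-based subscription name goes stale once its instance is replaced. The name format (service, "@", IP address) and both function names are assumptions for illustration; the actual RHEDcloud naming scheme is not given on this page, and the code does not talk to a broker.

```python
def subscription_name(service: str, ip: str) -> str:
    """Build a per-instance durable subscription name (format is assumed)."""
    return f"{service}@{ip}"


def stale_subscriptions(existing, service, live_ips):
    """Return the names registered for `service` whose instance IP is no
    longer among the live instances; these are the candidates for deletion."""
    expected = {subscription_name(service, ip) for ip in live_ips}
    prefix = f"{service}@"
    return sorted(
        name for name in existing
        if name.startswith(prefix) and name not in expected
    )


# Example: the instance at 10.0.1.5 was replaced, 10.0.1.9 is still running.
existing = ["srd@10.0.1.5", "srd@10.0.1.9", "console@10.0.2.7"]
to_delete = stale_subscriptions(existing, "srd", ["10.0.1.9"])
```

Here to_delete contains only the subscription of the replaced instance. An administrator would still delete it on the broker by hand, as described above.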

RHEDcloud Quick Start

RHEDcloud Project participants have started a site implementation guide here:

https://bitbucket.org/rhedcloud/rhedcloud-aws-admin-master-cfn/src/master/README.md

Professional Services and Managed Services

Candid Cloud (RHEDcloud Foundation Member)

Candid has developed a comprehensive engagement to help adopting sites implement RHEDcloud, set up their preferred cloud networking strategy, and implement integrations with their enterprise infrastructure. For details of the engagement see: RHEDcloud for AWS Launch Accelerator and Knowledge Transfer Engagement (Provider Candid Cloud).

Smartronix

Smartronix has proposed designing a cloud strategy consulting service that would include implementation of RHEDcloud for AWS, perhaps using Surge for the implementation work. More details should be forthcoming from Smartronix on the RHEDcloud wiki.

Surge

Surge and Emory worked together to design a RHEDcloud for AWS Launch and Knowledge Transfer engagement to help sites get up and running quickly with RHEDcloud for AWS with minimum cost and effort. The engagement costs about $28,000 and is intended to run over one to two months, depending on the availability of resources at the implementing site. Surge has performed this work at Emory in implementing RHEDcloud environments both for AWS at Emory and the Emory Cloud Infrastructure Migration Project (CIMP). For details of the engagement see: RHEDcloud for AWS Launch Accelerator and Knowledge Transfer Engagement (Provider Surge).

UNICON

UNICON is developing a managed service to operate RHEDcloud for adopting sites.