RHEDcloud for AWS Monitoring

Background

Several types of monitoring are useful for RHEDcloud services:

  1. synthetic transaction monitoring to confirm backend services are working

  2. log monitoring and alerts to detect underlying error conditions in logs

  3. infrastructure monitoring to detect emergent events in the infrastructure that supports applications and services

Synthetic Transaction Monitoring

Component ID | Service/Application | Method | Impl | Test Spec and Notes
1 | AWS Account Service | SOAP request | DataDog | Endpoint Example, Request, Reply. Notes: the endpoint example is a private Emory endpoint, and the number of results in the response will vary over time as more accounts are added to the account series in question.
2 | Network Operations Service | SOAP request | DataDog | Request, Reply
3 | Cisco ASR 1 Service | SOAP request | DataDog | Request, Reply
4 | Cisco ASR 2 Service | SOAP request | DataDog | Request, Reply
5 | LDS Service | SOAP request | DataDog | Request, Reply
6 | Elastic IP Service | SOAP request | DataDog | Request, Reply
7 | E-mail Validation Service | SOAP request | DataDog | Request, Reply
8 | Firewall Service | SOAP request | DataDog | Request, Reply
9 | Identity Management Service | SOAP request | DataDog | Request, Reply
10 | Financial Account Validation Service | SOAP request | DataDog | Request, Reply
11 | Security Risk Detection Service | SOAP request | DataDog | Request, Reply. Metrics: TotalDetectorsExecuting (Request, Reply); DetectionRatePerSecond (Request, Reply)
12 | Service Desk Service | SOAP request | DataDog | Request, Reply
13 | Temporary Key Issuance Service | SOAP request | DataDog | Request, Reply
14 | RHEDcloud Landing Page | Selenium test | Selenium or SauceLabs invoked from DataDog | Test script
15 | RHEDcloud Console | Selenium test | Selenium or SauceLabs invoked from DataDog | Test script
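The SOAP checks in the table above are implemented in DataDog, and the actual Request and Reply documents are not reproduced on this page. As a rough illustration only, the logic of such a synthetic check could look like the following sketch. The function names, the element name "Account", and the minimum-count rule are all assumptions made for the example. The check asserts a minimum number of results, not an exact number, because the note for component 1 says the result count grows over time.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def build_envelope(body_xml: str) -> str:
    """Wrap a payload in a SOAP 1.1 envelope (payload content is hypothetical)."""
    return (
        f'<soapenv:Envelope xmlns:soapenv="{SOAP_NS}">'
        f"<soapenv:Header/><soapenv:Body>{body_xml}</soapenv:Body>"
        "</soapenv:Envelope>"
    )


def check_reply(reply_xml: str, min_results: int = 1,
                result_tag: str = "Account") -> Tuple[bool, str]:
    """Return (ok, reason) for a SOAP reply.

    Assumption: a healthy reply has no Fault and contains at least
    min_results elements whose local name is result_tag.
    """
    try:
        root = ET.fromstring(reply_xml)
    except ET.ParseError as exc:
        return False, f"unparseable reply: {exc}"
    body = root.find(f"{{{SOAP_NS}}}Body")
    if body is None:
        return False, "no SOAP Body"
    if body.find(f"{{{SOAP_NS}}}Fault") is not None:
        return False, "SOAP Fault returned"
    # Compare local names only, so the check does not depend on the payload namespace.
    count = sum(1 for el in body.iter() if el.tag.rsplit("}", 1)[-1] == result_tag)
    if count < min_results:
        return False, f"expected at least {min_results} {result_tag} element(s), found {count}"
    return True, f"{count} {result_tag} element(s)"
```

In a real monitor, the request envelope would be posted to the service endpoint and the response body passed to check_reply; the transport step is left out here so the sketch stays independent of any particular endpoint.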

Log Alerts

AWS Account Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert
3 | Monitoring Item | OutOfMemoryError | Alert
4 | Monitoring Item | An error occurred generating the Stack | Alert
5 | Monitoring Item | The AWS Access Key Id needs a subscription for the service | Alert
6 | Monitoring Item | P2PConsumer1 - Done handing request | ? (Note: unsure what they were doing with that; there would be a lot of these. Is this a metric or timing being gathered?)
7 | Excluded | Service.v1_0|xml version=|AppConfig:|SessionFactoryUtil-HibernateMoaPersistenceHelper|AccessDenied; | ? (Note: unsure what they were doing with that. It seems like that might be an error we would want to know about.)

Cisco ASR Services

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert
3 | Monitoring Item | P2PConsumer1 - Done handing request | ? (Note: unsure what they were doing with that; there would be a lot of these.)
4 | Monitoring Item | Exception | Alert
5 | Excluded | failed on first attempt|ORA-02396 |

Elastic IP Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert

E-mail Address Validation Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert

Firewall Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert

Identity Management Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert
3 | Monitoring Item | P2PConsumer1 - Done handing request | ? (Note: unsure what they were doing with that; there would be a lot of these. Again, is there a timing metric being gathered here?)
4 | Monitoring Item | Error returned from NetIQ in generate | Alert
5 | Monitoring Item | RoleServiceRequestcommand execution | ? (Note: unclear what action they would take here.)
6 | Exception | Data is already in NetIQ|Resource already exists |

LDS Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert

Network Operations Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert
3 | Monitoring Item | P2PConsumer1 - Done handing request | ? (Note: unsure what they were doing with that; there would be a lot of these.)
4 | Monitoring Item | Exception | Alert

Financial Account Number Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert

Service Desk Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert
3 | Monitoring Item | ServiceNowPointToPointConsumer - Done handling request | ? (Note: unclear what they are doing with these.)
4 | Exception | FirewallExceptionRequest | ? (Note: unclear what they are doing with these.)

TKI Service

Item ID | Type | Text | Action
1 | Monitoring Item | ERROR | Alert
2 | Monitoring Item | FATAL | Alert
4 | Monitoring Item | Generate-Request execution complete in | Metric (Note: unclear what they do with this metric.)
5 | Monitoring Item | [DuoSecurityUtil]authorizaUser Complete | Metric (Note: unclear what they do with this metric.)
6 | Excluded | AccessDenied: | (Note: unclear what exactly is excluded.)
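To make the interplay between "Monitoring Item" and "Excluded" rows concrete, here is a minimal sketch using the patterns from the Cisco ASR Services table above. It is not the actual log monitor configuration. It assumes that an exclusion takes precedence over any alert match, that matching is case-sensitive, and that alert texts are matched as plain substrings; the function name and the sample log lines in the usage note are hypothetical.

```python
import re
from typing import Optional

# Patterns copied from the Cisco ASR Services table.
ALERT_PATTERNS = ["ERROR", "FATAL", "Exception"]
# The excluded text reads as a regular-expression alternation (assumption).
EXCLUDE_PATTERN = re.compile(r"failed on first attempt|ORA-02396")


def classify(line: str) -> Optional[str]:
    """Return the first alert pattern found in the line.

    Returns None when the line matches an exclusion or no alert pattern.
    Assumption: exclusions are checked first and always win.
    """
    if EXCLUDE_PATTERN.search(line):
        return None
    for pattern in ALERT_PATTERNS:
        if pattern in line:
            return pattern
    return None
```

For example, classify("ERROR connection reset") returns "ERROR", while classify("ERROR ORA-02396 during query") returns None because the exclusion suppresses the alert. The same shape applies to the other services by swapping in their pattern lists.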

 

Infrastructure Monitoring

Infrastructure monitoring focuses primarily on destinations on the Java Message Service (JMS) provider used by the solution. The following metrics are monitored, and each triggers an alert at the listed threshold:

Dest ID | Destination Name | Metric | Alert Threshold
1 | AwsAccountServiceQueue | QueueSize | 10
2 | NetworkOpsServiceQueue | QueueSize | 10
3 | CiscoAsr1ServiceQueue | QueueSize | 10
4 | CiscoAsr2ServiceQueue | QueueSize | 10
5 | LdsServiceQueue | QueueSize | 10
6 | ElasticIpServiceQueue | QueueSize | 10
7 | EmailValidationServiceQueue | QueueSize | 10
8 | FirewallServiceQueue | QueueSize | 10
9 | IdmServiceQueue | QueueSize | 10
10 | FinancialAccountValidationServiceQueue | QueueSize | 10
11 | SecurityRiskDetectionServiceQueue | QueueSize | 10
12 | ServiceDeskServiceQueue | QueueSize | 10
13 | TkiService | QueueSize | 10
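The threshold rule in the table can be expressed as a small check over a snapshot of queue sizes. The sketch below is illustrative only: how the QueueSize values are collected from the JMS provider is not covered, the function name is hypothetical, and the choice to alert when the size is at or above the threshold (as opposed to strictly above it) is an assumption, since the table does not say which applies.

```python
from typing import Dict, List

QUEUE_SIZE_THRESHOLD = 10  # alert threshold listed for every destination in the table


def queues_over_threshold(queue_sizes: Dict[str, int],
                          threshold: int = QUEUE_SIZE_THRESHOLD) -> List[str]:
    """Return the destination names that should alert, sorted by name.

    Assumption: a destination alerts when QueueSize >= threshold.
    """
    return sorted(name for name, size in queue_sizes.items() if size >= threshold)
```

Given a snapshot such as {"AwsAccountServiceQueue": 3, "FirewallServiceQueue": 10, "IdmServiceQueue": 42}, the function returns ["FirewallServiceQueue", "IdmServiceQueue"]. Keeping the threshold as a parameter allows per-destination values later if topics or other metrics are added.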

 

TBD: add topics and metrics.