Section 19 - Monitoring, Audit and Performance

CloudWatch provides metrics for every service in AWS
A metric is a monitored variable (ex. CPUUtilization) that belongs to a namespace
A dimension is an attribute of a metric (ex. instance id, environment); a metric can have up to 30 dimensions
Metrics have timestamps
Can create a dashboard of CloudWatch metrics, as well as create custom CloudWatch metrics
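As a rough illustration of how a namespace, dimensions and a timestamp fit together, the sketch below builds the payload for a custom metric. The namespace `MyApp`, the metric name and the dimension values are made up for the example. The dict is shaped like the keyword arguments of boto3's `put_metric_data`, which is deliberately not called here.

```python
from datetime import datetime, timezone

MAX_DIMENSIONS = 30  # per-metric dimension limit mentioned in the notes


def build_metric_datum(name, value, dimensions, unit="Count", timestamp=None):
    """Return one MetricData entry for a custom metric."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"a metric can have at most {MAX_DIMENSIONS} dimensions")
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": value,
        "Unit": unit,
    }


# Hypothetical namespace, metric and dimension values, for illustration only.
payload = {
    "Namespace": "MyApp",
    "MetricData": [
        build_metric_datum(
            "PageViews",
            42,
            {"InstanceId": "i-0123456789abcdef0", "Environment": "dev"},
        )
    ],
}

# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**payload)
```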

CloudWatch Metric Streams continuously stream metrics to a selected destination (ex. Kinesis Data Firehose) with near-real-time delivery and low latency
There's an option to filter metrics so that only a subset is streamed

In CloudWatch Logs, log groups are given an arbitrary name and usually represent an app
Log streams represent instances within the app, such as individual log files or containers
Can define log expiration policies (never expire, or anywhere from 1 day to 10 years)
Encrypted by default; can also set up KMS-based encryption with your own keys
Logs Insights
- Search and analyze log data stored in CloudWatch Logs
- Provides a purpose-built query language and automatically discovers fields from AWS services and JSON log events
  - Can also fetch desired event fields, filter them, etc.
  - Save queries and add to dashboards
- Can query multiple log groups at once, including log groups in different AWS accounts
- Not a real-time engine, it's a query engine
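To give a feel for the query language, here is a minimal sketch that composes a Logs Insights query returning the newest events whose message matches a pattern. The helper name and the `ERROR` pattern are assumptions for illustration. The string could be pasted into the console or, presumably, passed as `queryString` to boto3's `start_query`.

```python
def build_insights_query(pattern, limit=20):
    """Compose a Logs Insights query: fetch fields, filter, sort, limit."""
    return " | ".join(
        [
            "fields @timestamp, @message",         # fetch the desired event fields
            f"filter @message like /{pattern}/",   # keep only matching events
            "sort @timestamp desc",                # newest first
            f"limit {limit}",
        ]
    )


query = build_insights_query("ERROR")
print(query)
```

`@timestamp` and `@message` are among the fields Logs Insights discovers automatically for every log event.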
S3 export
- Log data can take up to 12 hours to become available for export, so this is not near-real-time (use Logs Subscriptions for that)
- API call is CreateExportTask
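A sketch of what a CreateExportTask request might look like. The log group, bucket name and task-naming scheme are hypothetical. The keys follow the parameters of boto3's `create_export_task`, where the time range is given in milliseconds since the epoch. Nothing is sent to AWS here.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(dt):
    """Convert an aware datetime to milliseconds since the Unix epoch."""
    return int(dt.timestamp() * 1000)


def build_export_task(log_group, bucket, start, end, prefix="exported-logs"):
    """Return keyword arguments shaped like a CreateExportTask request."""
    if end <= start:
        raise ValueError("end must be after start")
    return {
        "taskName": "export-" + log_group.strip("/").replace("/", "-"),
        "logGroupName": log_group,
        "fromTime": to_epoch_millis(start),
        "to": to_epoch_millis(end),
        "destination": bucket,  # S3 bucket name
        "destinationPrefix": prefix,
    }


# Hypothetical log group and bucket: export one full day of logs.
end = datetime(2024, 1, 2, tzinfo=timezone.utc)
task = build_export_task(
    "/my-app/prod", "my-log-archive-bucket", end - timedelta(days=1), end
)

# With boto3: boto3.client("logs").create_export_task(**task)
```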
Logs Subscriptions
- Get real-time log events from CloudWatch Logs for processing and analysis
- Can send to Kinesis Data Streams, Kinesis Data Firehose or Lambda
  - With a cross-account subscription, you can send log events to resources in different AWS accounts
- Set up a subscription filter to choose which log events are delivered to your destination
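The subscription filter described above can be sketched as the request below. All names and ARNs are placeholders. The keys mirror boto3's `put_subscription_filter`, and the role check reflects the assumption that Kinesis and Firehose destinations need an IAM role CloudWatch Logs can assume, while a Lambda destination does not.

```python
def build_subscription_filter(log_group, name, pattern, destination_arn, role_arn=None):
    """Return keyword arguments shaped like a PutSubscriptionFilter request."""
    service = destination_arn.split(":")[2]  # e.g. "lambda", "kinesis", "firehose"
    if service in ("kinesis", "firehose") and role_arn is None:
        raise ValueError("Kinesis and Firehose destinations need a role ARN")
    params = {
        "logGroupName": log_group,
        "filterName": name,
        "filterPattern": pattern,  # an empty pattern matches every log event
        "destinationArn": destination_arn,
    }
    if role_arn is not None:
        params["roleArn"] = role_arn
    return params


# Hypothetical example: forward events containing "ERROR" to a Lambda function.
lambda_filter = build_subscription_filter(
    "/my-app/prod",
    "errors-to-lambda",
    "ERROR",
    "arn:aws:lambda:us-east-1:111122223333:function:process-logs",
)

# With boto3: boto3.client("logs").put_subscription_filter(**lambda_filter)
```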
For EC2:
- By default, no logs from EC2 instances go to CloudWatch; you need to run a CloudWatch agent on the instance to push the log files you want
- Make sure the instance's IAM role has permission to write to CloudWatch Logs
- The agent can be set up on-premises as well

CloudWatch Logs Agent is the older agent, which can only send logs to CloudWatch Logs
CloudWatch Unified Agent collects additional system-level metrics and also sends logs to CloudWatch Logs
- Centralized configuration using SSM Parameter Store
- Metrics can include CPU, Disk metrics, RAM, netstat, processes, swap space
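A minimal sketch of a Unified Agent configuration that collects memory, swap and disk metrics and ships one log file. The file path and log group name are invented for the example, and the section and measurement names follow the agent's JSON configuration format as best recalled, so check them against the agent reference before relying on them. The JSON produced here is the kind of document that could be stored centrally in SSM Parameter Store.

```python
import json

# Hypothetical file path and log group name; structure follows the agent's
# "metrics" / "logs" configuration sections.
agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "swap": {"measurement": ["swap_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["*"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/my-app/app.log",
                        "log_group_name": "my-app",
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```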

To test alarms and notifications, set the alarm state to ALARM using the CLI:

aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"

Period:
- Length of time in seconds over which the metric is evaluated
- For high-resolution custom metrics, the period can be 10 seconds, 30 seconds or a multiple of 60 seconds
Targets:
- Stop, terminate, reboot or recover an EC2 instance
- Trigger an Auto Scaling action
- Send a notification to SNS (from which you can do pretty much anything)
Composite Alarms
- Monitor states of MULTIPLE other alarms
- Use AND and OR conditions
- Can reduce alarm noise by creating complex composite alarms
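The AND/OR conditions of a composite alarm are written as a rule expression over other alarms. The sketch below composes such an expression. The alarm names are hypothetical and the helper functions are not part of any AWS SDK. The resulting string is the kind of value that would be passed as `AlarmRule` to boto3's `put_composite_alarm`.

```python
def alarm(name):
    """Rule term that is true while the named alarm is in the ALARM state."""
    return f'ALARM("{name}")'


def all_of(*terms):
    """Combine terms with AND."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Combine terms with OR."""
    return "(" + " OR ".join(terms) + ")"


# Hypothetical alarms: fire only when CPU is high AND either IOPS or latency is high.
rule = all_of(alarm("cpu-high"), any_of(alarm("iops-high"), alarm("latency-high")))
print(rule)
```

Requiring several child alarms to agree before the composite fires is what cuts down on alarm noise.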

EventBridge can schedule cron jobs (scripts) or perform an action based on an event pattern, such as triggering a Lambda function that sends an email notification when someone signs in to the console
Event buses can be accessed by other AWS accounts through resource-based policies
Can archive events sent to an event bus
Ability to replay archived events
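To make "event pattern" concrete, here is a sketch of a pattern for console sign-ins by the root user, together with a deliberately simplified matcher. The matcher only handles exact-value lists and nested objects. Real EventBridge patterns support more operators (prefix, numeric ranges and so on), so treat it as an illustration of the idea rather than the service's actual matching logic. The sample events are made up.

```python
# A schedule expression has six fields:
# minutes, hours, day-of-month, month, day-of-week, year.
daily_schedule = "cron(0 8 * * ? *)"  # every day at 08:00 UTC

# Event pattern: console sign-in events where the identity type is Root.
root_sign_in_pattern = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}},
}


def matches(pattern, event):
    """Simplified matching: lists mean 'any of these values', dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Made-up sample event, trimmed to the fields the pattern inspects.
root_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"eventName": "ConsoleLogin", "userIdentity": {"type": "Root"}},
}
print(matches(root_sign_in_pattern, root_event))
```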

EventBridge can analyze the events in your bus and infer their schema
The Schema Registry can generate code for your app, so it knows in advance how data is structured in the event bus
Schemas can be versioned

A resource-based policy manages permissions for a specific event bus (ex. allow or deny events from another AWS account or region)
Useful for aggregating all events from your Organization in a single AWS account or region
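A sketch of a resource-based policy that lets every account in an Organization put events onto a central bus. The bus ARN, account id and organization id are placeholders. The policy uses the `aws:PrincipalOrgID` condition key to restrict the wildcard principal to members of the Organization.

```python
import json


def org_put_events_policy(bus_arn, org_id):
    """Allow PutEvents on one event bus for all principals in an Organization."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPutEventsFromOrganization",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "events:PutEvents",
                "Resource": bus_arn,
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            }
        ],
    }


# Placeholder ARN and organization id.
policy = org_put_events_policy(
    "arn:aws:events:us-east-1:111122223333:event-bus/central-bus",
    "o-exampleorgid",
)
print(json.dumps(policy, indent=2))
```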

Container Insights collects, aggregates and summarizes metrics and logs from containers
For EKS and Kubernetes, Container Insights uses a containerized version of the CloudWatch agent to discover containers

Lambda Insights is a monitoring and troubleshooting solution for serverless apps running on Lambda
Collects, aggregates and summarizes system-level metrics including CPU time, memory, disk and network
- Also collects diagnostic information such as cold starts and Lambda worker shutdowns
Lambda Insights is provided as a Lambda Layer

Contributor Insights analyzes log data and creates time series that display contributor data
- Find the "top-N" contributors (top talkers) and understand who or what is impacting system performance
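The "top-N" idea can be illustrated locally with a few lines of Python. This is not how the service is implemented or configured. It only shows the kind of aggregation involved, using made-up records with a hypothetical `srcaddr` field.

```python
from collections import Counter


def top_contributors(records, key, n=3):
    """Return the n most frequent values of `key` as (value, count) pairs."""
    counts = Counter(record[key] for record in records if key in record)
    return counts.most_common(n)


# Made-up log records, e.g. parsed from network flow logs.
records = [
    {"srcaddr": "10.0.0.5"},
    {"srcaddr": "10.0.0.9"},
    {"srcaddr": "10.0.0.5"},
    {"srcaddr": "10.0.0.7"},
    {"srcaddr": "10.0.0.9"},
    {"srcaddr": "10.0.0.5"},
]

print(top_contributors(records, "srcaddr", n=2))
# [('10.0.0.5', 3), ('10.0.0.9', 2)]
```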

Application Insights provides automated dashboards that show potential problems with monitored apps, to help isolate ongoing issues
Gives enhanced visibility into your app health to reduce the time it takes to troubleshoot and repair your apps
Findings and alerts are sent to EventBridge and SSM OpsCenter

CloudTrail management events
- Operations performed on resources in your AWS account
- Trails are configured to log management events by default
- Can separate read events from write events
- Ex. configuring security, setting up logging, etc.
CloudTrail data events
- Data events are not logged by default because they are high-volume operations
- Can separate read events from write events
- Ex. S3 activity and Lambda function execution
CloudTrail Insights
- Enable to detect unusual activity in your account
- Analyzes normal management events to create a baseline, then continuously analyzes write events to find unusual patterns
- Anomalies appear in the CloudTrail console, the Insights event is delivered to S3, and an EventBridge event is generated (for automation needs)
Events are stored in CloudTrail for 90 days
- To keep events beyond this period, log them to S3 and use Athena to analyze them
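The split between management and data events, and between read and write events, shows up in a trail's event selectors. A minimal sketch follows. The bucket name is hypothetical and the helper is not an SDK function. The dict is shaped like one entry of the `EventSelectors` list accepted by boto3's `put_event_selectors`.

```python
VALID_READ_WRITE = {"ReadOnly", "WriteOnly", "All"}


def build_event_selector(read_write="All", include_management=True, s3_bucket=None):
    """Build one event selector; data events are added only when a bucket is given."""
    if read_write not in VALID_READ_WRITE:
        raise ValueError(f"read_write must be one of {sorted(VALID_READ_WRITE)}")
    selector = {
        "ReadWriteType": read_write,
        "IncludeManagementEvents": include_management,
        "DataResources": [],
    }
    if s3_bucket is not None:
        # Opt in to S3 object-level data events for every object in the bucket.
        selector["DataResources"].append(
            {"Type": "AWS::S3::Object", "Values": [f"arn:aws:s3:::{s3_bucket}/"]}
        )
    return selector


# Hypothetical bucket: log write-only management events plus S3 write data events.
selector = build_event_selector("WriteOnly", s3_bucket="my-data-bucket")
print(selector)
```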

AWS Config helps with auditing and recording the compliance of your AWS resources, and records configuration changes over time
Can receive alerts (SNS notifications) for any changes
Config is a per-region service
Can be aggregated across regions and accounts
Can store configuration data into S3 which can be analyzed by Athena

Can use AWS managed config rules (there are over 75)
Custom config rules are typically defined in Lambda (AWS Config also supports custom policy rules written in Guard)
Rules can be evaluated or triggered for each config change and/or at regular time intervals
Config rules don't prevent actions from happening (no deny)
There is no free tier: $0.003 per configuration item recorded per region and $0.001 per config rule evaluation per region (pricing at the time of writing)
Remediations:
- Automate remediation of non-compliant resources using SSM Automation Documents
- Use AWS-Managed Automation Documents or create custom Automation Documents
- Can set Remediation Retries if the resource is still non-compliant after auto-remediation
Notifications:
- Use EventBridge to trigger notifications when AWS resources are non-compliant
- Ability to send configuration changes and compliance state notifications to SNS
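A custom rule backed by Lambda essentially receives a configuration item and reports a compliance result. The sketch below shows that evaluation step only, under several assumptions: the rule itself (EC2 instances must use an approved instance type) is invented for the example, the sample event is hand-written, and the real handler would also report the result with boto3's `put_evaluations` along with the `resultToken` from the event. Note that the function only reports a verdict and does nothing to block the change.

```python
import json


def build_evaluation(invoking_event_json, allowed_types=("t3.micro",)):
    """Evaluate one configuration item against a hypothetical instance-type rule."""
    item = json.loads(invoking_event_json)["configurationItem"]
    if item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif item["configuration"]["instanceType"] in allowed_types:
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


# Hand-written sample of the JSON string Config passes as `invokingEvent`.
sample_event = json.dumps(
    {
        "configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "resourceId": "i-0123456789abcdef0",
            "configurationItemCaptureTime": "2024-01-01T00:00:00.000Z",
            "configuration": {"instanceType": "m5.large"},
        }
    }
)

evaluation = build_evaluation(sample_event)
print(evaluation["ComplianceType"])
```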

CloudWatch
- Performance monitoring and dashboards
- Events and alerting
- Log aggregation and analysis
CloudTrail
- Record API calls made within your account by everyone
- Define trails for specific resources
- Global service
Config
- Record configuration changes
- Evaluate resources against compliance rules
- Get a timeline of changes and compliance
For an ELB:
- CloudWatch
  - Monitor the incoming connections metric
  - Visualize error codes as a percentage over time
  - Make a dashboard to visualize load balancer performance
- CloudTrail
  - Track who made changes to the load balancer with API calls
- Config
  - Track security group rules for load balancer
  - Track configuration changes for the load balancer
  - Ensure an SSL certificate is always assigned to the load balancer (compliance)