Section 19 - Monitoring, Audit and Performance
CloudWatch Metrics
- Every service in AWS has CloudWatch metrics
- Metrics are a monitored variable belonging to a namespace
- A dimension is an attribute of a metric (ex. instance ID, environment); a metric can have up to 30 dimensions
- Metrics have timestamps
- Can create a dashboard of CloudWatch metrics, as well as create custom CloudWatch metrics
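To make the namespace, dimension and timestamp vocabulary above concrete, here is a minimal sketch of a custom metric data point shaped the way the CloudWatch PutMetricData call expects it. The namespace, metric name and dimension values are hypothetical, and the 30-dimension check simply mirrors the limit quoted in these notes.

```python
from datetime import datetime, timezone

MAX_DIMENSIONS = 30  # per-metric dimension limit quoted in the notes


def build_metric_datum(name, value, dimensions, unit="Count", timestamp=None):
    """Return one MetricData entry in the shape PutMetricData expects."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"a metric can have at most {MAX_DIMENSIONS} dimensions")
    return {
        "MetricName": name,
        # Each dimension is a Name/Value pair attached to the metric.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        # Every data point carries a timestamp; default to "now" in UTC.
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": value,
        "Unit": unit,
    }


# Hypothetical namespace, metric name and dimension values.
request = {
    "Namespace": "MyApp/Checkout",
    "MetricData": [
        build_metric_datum(
            "OrdersPlaced",
            3,
            {"InstanceId": "i-0123456789abcdef0", "Environment": "dev"},
        )
    ],
}

# With boto3 installed and credentials configured, this could be sent with:
#   boto3.client("cloudwatch").put_metric_data(**request)
print(request["MetricData"][0]["Dimensions"])
```

The request is only built here, not sent, so the sketch runs without AWS credentials.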
CloudWatch Metric Streams
- Stream metrics to a destination of your choice with near-real-time delivery and low latency
- Optionally filter metrics so that only a subset is streamed
CloudWatch Logs
- Log groups are given an arbitrary name and usually represent an app
- Log streams are instances within the app, log files or containers
- Can define log expiration policies
- Encrypted by default, can also set up KMS-based encryption with your own keys
- Logs Insights
- Search and analyze log data stored in CloudWatch Logs
- Provides a purpose-built query language and automatically discovers fields from AWS services and JSON log events
- Can also fetch desired event fields, filter them, etc.
- Save queries and add to dashboards
- Query multiple log groups in different AWS accounts
- Not a real-time engine, it's a query engine
- S3 export
- Log data can take up to 12 hours to become available for export
- API call is CreateExportTask
- Logs Subscriptions
- Get real-time log events from CloudWatch Logs for processing and analysis
- Can send to Kinesis Data Streams, Firehose or Lambda
- With a cross-account subscription, you can send log events to resources in different AWS accounts
- Set up a subscription filter to control which log events are delivered to your destination
- For EC2:
- By default, no logs from EC2 instances go to CloudWatch, you need to run a CloudWatch agent to push log files
- Ensure correct IAM permissions
- The agent can be set up on-premises as well
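As an illustration of the S3 export bullets above, the following sketch builds the arguments for a CreateExportTask call. It assumes the boto3 parameter names (`taskName`, `logGroupName`, `fromTime`, `to`, `destination`, `destinationPrefix`); the log group and bucket names are hypothetical. The time range is given in milliseconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


def build_export_task(log_group, bucket, start, end, prefix="exported-logs"):
    """Return keyword arguments shaped for the CreateExportTask call."""
    if end <= start:
        raise ValueError("end must be after start")
    return {
        "taskName": f"export-{start:%Y%m%d}",
        "logGroupName": log_group,
        "fromTime": to_epoch_millis(start),
        "to": to_epoch_millis(end),
        "destination": bucket,  # name of the target S3 bucket
        "destinationPrefix": prefix,
    }


# Hypothetical log group and bucket; export one full day of logs.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
kwargs = build_export_task(
    "/my-app/prod", "my-log-archive-bucket", start, start + timedelta(days=1)
)

# With boto3 installed and credentials configured:
#   boto3.client("logs").create_export_task(**kwargs)
print(kwargs["taskName"], kwargs["to"] - kwargs["fromTime"])
```

Because exported data can lag by up to 12 hours, a batch job like this suits archival; for real-time delivery the subscription filters described above are the better fit.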
CloudWatch Logs Agent & Unified Agent
- Logs Agent is an old version which can only send logs to CloudWatch Logs
- Unified Agent collects additional system-level metrics and sends logs to CloudWatch Logs
- Centralized configuration using SSM Parameter Store
- Metrics can include CPU, Disk metrics, RAM, netstat, processes, swap space
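A rough sketch of what a Unified Agent configuration covering the metrics listed above might look like, built as a Python dict and serialized to the JSON the agent reads. The file path and log group name are hypothetical, and the measurement names are the ones I believe the agent accepts, so check them against the agent documentation before relying on them.

```python
import json

# Hypothetical file path and log group name; measurement names should be
# verified against the CloudWatch agent documentation before use.
agent_config = {
    "metrics": {
        "metrics_collected": {
            "cpu": {"measurement": ["cpu_usage_idle", "cpu_usage_user"]},
            "disk": {"measurement": ["used_percent"], "resources": ["*"]},
            "mem": {"measurement": ["mem_used_percent"]},
            "netstat": {"measurement": ["tcp_established"]},
            "processes": {"measurement": ["running", "total"]},
            "swap": {"measurement": ["swap_used_percent"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/my-app/app.log",
                        "log_group_name": "/my-app/prod",
                        # One log stream per instance.
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

# The resulting JSON is what would be stored in an SSM Parameter Store
# parameter so that many instances can share one configuration.
config_json = json.dumps(agent_config, indent=2)
print(config_json)
```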
CloudWatch Alarms
- Used to trigger notifications for any metric based off threshold values
- Alarm states include OK, INSUFFICIENT_DATA, ALARM
- Alarms can be created based on CloudWatch Logs Metrics Filters
- For testing alarms and notifications, set alarm state to ALARM using CLI
aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
- Period:
- Length of time in seconds over which the metric is evaluated
- High-resolution custom metrics: 10 seconds, 30 seconds or multiples of 60 seconds
- Targets:
- Stop, terminate, reboot or recover an EC2 instance
- Trigger auto scaling action
- Send notification to SNS (can do pretty much anything from here)
- Composite Alarms
- Monitor states of MULTIPLE other alarms
- Use AND and OR conditions
- Can reduce alarm noise by creating complex composite alarms
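The AND/OR logic of a composite alarm can be modelled in a few lines. This is a simplified local model, not the service itself: it only returns ALARM or OK, whereas a real composite alarm can also be in INSUFFICIENT_DATA. The child alarm names are hypothetical.

```python
# Rule expression in the syntax a composite alarm's AlarmRule uses.
ALARM_RULE = 'ALARM("cpu-high") AND (ALARM("iops-high") OR ALARM("latency-high"))'


def in_alarm(name, alarm_states):
    """True if the named child alarm is currently in the ALARM state."""
    return alarm_states.get(name, "INSUFFICIENT_DATA") == "ALARM"


def composite_state(alarm_states):
    """Evaluate ALARM_RULE by hand against a dict of child alarm states."""
    fires = in_alarm("cpu-high", alarm_states) and (
        in_alarm("iops-high", alarm_states)
        or in_alarm("latency-high", alarm_states)
    )
    return "ALARM" if fires else "OK"


# Child alarm states as they might look at one moment.
states = {"cpu-high": "ALARM", "iops-high": "OK", "latency-high": "ALARM"}
print(composite_state(states))

# High CPU alone does not fire the composite alarm, which is how
# composite alarms cut down on noise.
print(composite_state({"cpu-high": "ALARM", "iops-high": "OK", "latency-high": "OK"}))
```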
EventBridge
- Schedule cron jobs (scripts) or perform an action based on an event pattern, such as triggering a Lambda function that sends an email notification when a user signs in
- Event buses are accessible by other AWS accounts using resource-based policies
- Can archive events sent to an event bus
- Ability to replay archived events
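To show how an event pattern selects events, here is a deliberately simplified matcher. It handles only exact-value matching on nested fields, which is a small subset of what EventBridge supports. The pattern targets console sign-in events; the sample events are hypothetical and trimmed to the fields the pattern inspects.

```python
def matches(pattern, event):
    """Simplified event-pattern check: every pattern key must be present in
    the event, and the event's value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern for console sign-in events.
sign_in_pattern = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"eventName": ["ConsoleLogin"]},
}

# Hypothetical events, reduced to the fields that matter here.
login_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.10"},
}
other_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"state": "stopped"},
}

print(matches(sign_in_pattern, login_event), matches(sign_in_pattern, other_event))
```

Fields that appear in the event but not in the pattern are ignored, so a pattern only needs to name the fields it cares about.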
EventBridge Schema Registry
- EventBridge can analyze events in your bus and infer the schema
- Can generate code for your app that knows in advance how data is structured in the event bus
- Versioning
EventBridge Resource-based Policy
- Manage permissions for a specific event bus
- Useful for aggregating all events from your Organization in a single AWS account or region
CloudWatch Container Insights
- Collect, aggregate and summarize metrics and logs from containers
- For EKS and Kubernetes, Container Insights uses a containerized version of the CloudWatch agent to discover containers
CloudWatch Lambda Insights
- Monitoring and troubleshooting solution for serverless apps running on Lambda
- Collects, aggregates and summarizes system-level metrics including CPU time, memory, disk and network
- Also collects diagnostic information such as cold starts and Lambda worker shutdowns
- Insights is provided as a Lambda Layer
CloudWatch Contributor Insights
- Analyze log data and create time series that display contributor data
- Find "top-N" talkers and understand who's impacting system performance
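The "top-N" idea can be illustrated with a small local model: count how often each contributor appears in a set of log events and keep the largest counts. This is only a conceptual sketch, not the service's rule syntax; the field name `srcaddr` and the sample events are hypothetical.

```python
from collections import Counter


def top_contributors(log_events, key, n=3):
    """Return the n most frequent values of `key` across the log events."""
    counts = Counter(event[key] for event in log_events if key in event)
    return counts.most_common(n)


# Hypothetical flow-log-like events keyed by source address.
events = (
    [{"srcaddr": "10.0.0.5"}] * 5
    + [{"srcaddr": "10.0.0.9"}] * 3
    + [{"srcaddr": "10.0.0.7"}] * 1
    + [{"note": "event without a source address"}]
)

print(top_contributors(events, "srcaddr", n=2))
```

Events that lack the chosen field are skipped rather than counted.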
CloudWatch Application Insights
- Provides automated dashboards that show potential problems with monitored apps to help isolate ongoing issues
- Gives enhanced visibility into your app health to reduce time to troubleshoot and repair your apps
- Findings and alerts are sent to EventBridge and SSM OpsCenter
CloudTrail
- Provides governance, compliance and audit for your AWS account
- Enabled by default
- Can put logs into CloudWatch Logs or S3
- Trails can be applied to all regions (default) or a single region
- If a resource is deleted, investigate CloudTrail first
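As a sketch of the "investigate CloudTrail first" advice, the snippet below scans a CloudTrail log file for deletion calls. It assumes the usual log layout, a JSON document with a top-level `Records` list whose entries carry `eventTime`, `eventSource`, `eventName` and `userIdentity`. The sample records are hypothetical, and filtering on the `Delete` prefix is a simplification (it would miss calls such as `TerminateInstances`).

```python
import json


def find_deletions(trail_log_text):
    """Return (time, service, API call, caller ARN) for each Delete* event."""
    records = json.loads(trail_log_text).get("Records", [])
    return [
        (
            r["eventTime"],
            r["eventSource"],
            r["eventName"],
            r.get("userIdentity", {}).get("arn", "unknown"),
        )
        for r in records
        if r.get("eventName", "").startswith("Delete")
    ]


# Hypothetical log content, trimmed to the fields used above.
sample_log = json.dumps(
    {
        "Records": [
            {
                "eventTime": "2024-01-01T10:00:00Z",
                "eventSource": "s3.amazonaws.com",
                "eventName": "DeleteBucket",
                "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
            },
            {
                "eventTime": "2024-01-01T10:05:00Z",
                "eventSource": "ec2.amazonaws.com",
                "eventName": "DescribeInstances",
                "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
            },
        ]
    }
)

for found in find_deletions(sample_log):
    print(found)
```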
CloudTrail Events
- Management
- Operations performed on resources in your AWS account
- Trails are configured to log management events by default
- Can separate read events from write events
- Ex. configuring security, setting up logging, etc.
- Data
- Data events are not logged by default because they are high-volume operations
- Can separate read events from write events
- Ex. S3 activity and Lambda function execution
- CloudTrail Insights
- Enable to detect unusual activity in your account
- Analyzes normal management events to create a baseline, then continuously analyzes write events to find unusual patterns
- Anomalies appear in the CloudTrail console, an event is sent to S3 and an EventBridge event is generated (for automation needs)
- Events are stored for 90 days
- To keep events beyond this period, log them to S3 and use Athena
Config
- Assists with auditing and recording compliance of resources with changes over time
- Can receive alerts via SNS for changes
- Config is a per region service
- Can be aggregated across regions and accounts
- Can store configuration data into S3 which can be analyzed by Athena
Config Rules
- Can use AWS managed config rules (there are over 75)
- Custom config rules must be defined in Lambda
- Rules can be evaluated or triggered for each config change and/or at regular time intervals
- Config rules don't prevent actions from happening (no deny)
- There is no free tier, it's $0.003 per configuration item recorded per region and $0.001 per config rule evaluation per region
- Remediations:
- Automate remediation of non-compliant resources using SSM Automation Documents
- Use AWS-Managed Automation Documents or create custom Automation Documents
- Can set Remediation Retries if the resource is still non-compliant after auto-remediation
- Notifications:
- Use EventBridge to trigger notifications when AWS resources are non-compliant
- Ability to send configuration changes and compliance state notifications to SNS
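Since Config has no free tier, a quick estimate helps before enabling it broadly. The arithmetic below uses only the per-region rates quoted in these notes, which may no longer match current pricing; the usage figures are hypothetical.

```python
from decimal import Decimal

# Rates as quoted in the notes (per region); verify against current pricing.
PRICE_PER_CONFIG_ITEM = Decimal("0.003")
PRICE_PER_RULE_EVALUATION = Decimal("0.001")


def monthly_config_cost(config_items, rule_evaluations):
    """Estimated cost in USD for one region over one billing period."""
    return (
        config_items * PRICE_PER_CONFIG_ITEM
        + rule_evaluations * PRICE_PER_RULE_EVALUATION
    )


# Hypothetical usage: 10,000 configuration items and 50,000 rule evaluations.
print(monthly_config_cost(10_000, 50_000))  # 80.000
```

Decimal is used instead of float so the fractional-cent rates add up exactly.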
CloudWatch vs. CloudTrail vs. Config
- CloudWatch
- Performance monitoring and dashboards
- Events and alerting
- Log aggregation and analysis
- CloudTrail
- Record API calls made within your account by everyone
- Define trails for specific resources
- Global service
- Config
- Record configuration changes
- Evaluate resources against compliance rules
- Get timeline of changes and compliance
- For an ELB:
- CloudWatch
- Monitor incoming connections metric
- Visualize error codes as percentage over time
- Make dashboard to visualize load balancer performance
- CloudTrail
- Track who made changes to the load balancer with API calls
- Config
- Track security group rules for load balancer
- Track configuration changes for the load balancer
- Ensure an SSL certificate is always assigned to the load balancer (compliance)