Graduate Program KB

Section 19 - Monitoring, Audit and Performance

CloudWatch Metrics

  • Every service in AWS has CloudWatch metrics
  • A metric is a monitored variable that belongs to a namespace
  • A dimension is an attribute of a metric (ex. instance id, environment); a metric can have up to 30 dimensions
  • Metrics have timestamps
  • Can create a dashboard of CloudWatch metrics, as well as create custom CloudWatch metrics
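
To make the namespace / dimension / timestamp vocabulary concrete, here is a minimal sketch that builds one custom metric data point as a plain dict. The dict is shaped like an entry of the PutMetricData request; the namespace, metric name and dimension values are hypothetical, and no AWS call is made.

```python
from datetime import datetime, timezone

MAX_DIMENSIONS = 30  # per-metric dimension limit noted above


def build_metric_datum(name, value, dimensions, unit="Count", timestamp=None):
    """Return a dict shaped like one PutMetricData entry (no AWS call is made)."""
    if len(dimensions) > MAX_DIMENSIONS:
        raise ValueError(f"a metric can have at most {MAX_DIMENSIONS} dimensions")
    return {
        "MetricName": name,
        # Each dimension is a name/value pair attached to the metric.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        # Every data point carries a timestamp; default to "now" in UTC.
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


# Hypothetical namespace and dimension values, for illustration only.
namespace = "MyApp/Checkout"
datum = build_metric_datum(
    "FailedPayments",
    3,
    {"InstanceId": "i-0123456789abcdef0", "Environment": "dev"},
)
```

With boto3 installed and credentials configured, a dict like this could be passed as `put_metric_data(Namespace=namespace, MetricData=[datum])` on a CloudWatch client.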

CloudWatch Metric Streams

  • Stream metrics to a selected destination with near-real-time delivery and low latency
  • There's an option to filter metrics so only a subset of metrics is streamed

CloudWatch Logs

  • Log groups are given an arbitrary name and usually represent an app
  • Log streams are instances within the app, log files or containers
  • Can define log expiration policies
  • Encrypted by default, can also set up KMS-based encryption with your own keys
  • Logs Insights
    • Search and analyze log data stored in CloudWatch Logs
    • A purpose-built query language is available for automatically discovering fields from services and JSON log events
      • Can also fetch desired event fields, filter them, etc.
      • Save queries and add to dashboards
    • Query multiple log groups in different AWS accounts
    • Not a real-time engine, it's a query engine
  • S3 export
    • Log data can take up to 12 hours to become available for export
    • API call is CreateExportTask
  • Logs Subscriptions
    • Get real-time log events from CloudWatch Logs for processing and analysis
    • Can send to Kinesis Data Streams, Kinesis Data Firehose or Lambda
      • With a cross-account subscription, you can send log events to resources in different AWS accounts
    • Set up a subscription filter to control which log events are delivered to your destination
  • For EC2:
    • By default, no logs from EC2 instances go to CloudWatch; you need to run a CloudWatch agent to push log files
    • Ensure correct IAM permissions
    • The agent can be set up on-premises as well
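
As an illustration of the S3 export item above, the following sketch assembles the arguments for a CreateExportTask call. The time window is expressed in milliseconds since the Unix epoch, which is what the API expects; the log group and bucket names are hypothetical, and nothing is sent to AWS.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(dt.timestamp() * 1000)


def build_export_task_args(log_group, bucket, start, end, prefix="exported-logs"):
    """Return keyword arguments shaped for the CreateExportTask API call."""
    if end <= start:
        raise ValueError("end must be after start")
    return {
        "logGroupName": log_group,
        "fromTime": to_epoch_millis(start),  # start of the export window
        "to": to_epoch_millis(end),          # end of the export window
        "destination": bucket,               # S3 bucket name
        "destinationPrefix": prefix,
    }


# Hypothetical log group and bucket, exporting one day of logs.
end = datetime(2024, 1, 2, tzinfo=timezone.utc)
start = end - timedelta(days=1)
args = build_export_task_args("/my-app/prod", "my-log-archive-bucket", start, end)
```

Because log data can take up to 12 hours to become exportable, a real job would probably choose an `end` that is comfortably in the past.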

CloudWatch Logs Agent & Unified Agent

  • Logs Agent is an old version which can only send logs to CloudWatch Logs
  • Unified Agent collects additional system-level metrics and sends logs to CloudWatch Logs
    • Centralized configuration using SSM Parameter Store
    • Metrics can include CPU, Disk metrics, RAM, netstat, processes, swap space
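
A rough sketch of what a Unified Agent configuration might look like, built as a Python dict and serialized to JSON (the form in which it could be stored in SSM Parameter Store). The key and measurement names follow the agent's JSON configuration format as best recalled here and should be checked against the agent documentation; the file path and log group name are hypothetical.

```python
import json

# Assumed structure of the agent configuration; verify key names before use.
agent_config = {
    "agent": {"metrics_collection_interval": 60},
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},    # RAM
            "swap": {"measurement": ["swap_used_percent"]},  # swap space
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/myapp/app.log",  # hypothetical path
                        "log_group_name": "my-app",              # hypothetical group
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

# JSON text that could be stored centrally and fetched by each instance.
config_json = json.dumps(agent_config, indent=2)
```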

CloudWatch Alarms

  • Used to trigger notifications for any metric based on threshold values
  • Alarm states include OK, INSUFFICIENT_DATA, ALARM
  • Alarms can be created based on CloudWatch Logs Metrics Filters
  • For testing alarms and notifications, set alarm state to ALARM using CLI
    aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
    
  • Period:
    • Length of time in seconds to evaluate metric
    • High resolution custom metrics (10 seconds, 30 seconds or multiples of 60 seconds)
  • Targets:
    • Stop, terminate, reboot or recover an EC2 instance
    • Trigger auto scaling action
    • Send notification to SNS (can do pretty much anything from here)
  • Composite Alarms
    • Monitor states of MULTIPLE other alarms
    • Use AND and OR conditions
    • Can reduce alarm noise by creating complex composite alarms

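
The three alarm states and the AND/OR logic of composite alarms can be modelled in a few lines. This is a deliberately simplified sketch, not CloudWatch's actual evaluation algorithm: it assumes a "greater than threshold" comparison, requires every one of the last N datapoints to breach, and ignores missing-data handling. All function names are made up for illustration.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods):
    """Simplified model: ALARM if the last N datapoints all exceed the threshold."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"  # not enough datapoints to decide
    return "ALARM" if all(v > threshold for v in recent) else "OK"


def composite_and(*states):
    """Composite alarm that fires only when every child alarm is in ALARM."""
    return "ALARM" if all(s == "ALARM" for s in states) else "OK"


def composite_or(*states):
    """Composite alarm that fires when at least one child alarm is in ALARM."""
    return "ALARM" if any(s == "ALARM" for s in states) else "OK"


# Hypothetical CPU and IOPS readings, one value per period.
cpu_state = evaluate_alarm([55, 91, 95, 97], threshold=90, evaluation_periods=3)
iops_state = evaluate_alarm([100, 120], threshold=500, evaluation_periods=3)

both = composite_and(cpu_state, iops_state)
either = composite_or(cpu_state, iops_state)
```

Here the AND composite stays quiet because only the CPU alarm is firing, which is the noise-reduction effect described above.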
EventBridge

  • Schedule cron jobs (scripts) or perform some action based on an event pattern, such as triggering a Lambda function that sends an email notification when a user logs in
  • Event buses are accessible by other AWS accounts using resource-based policies
  • Can archive events sent to an event bus
  • Ability to replay archived events
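
To show how an event pattern selects events, here is a small matcher covering only the simplest case: each pattern field lists the allowed values, and nested objects are matched recursively. Real EventBridge patterns support more operators (prefix, numeric, and so on) that this sketch leaves out. The sample event is a trimmed, hypothetical console sign-in event.

```python
def matches(pattern, event):
    """Return True if every field in the pattern matches the event.

    Leaf values in the pattern are lists of allowed values; dict values
    are matched recursively against the nested object in the event.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern selecting console sign-in events.
sign_in_pattern = {
    "source": ["aws.signin"],
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"eventName": ["ConsoleLogin"]},
}

# Trimmed, hypothetical sample events.
sign_in_event = {
    "source": "aws.signin",
    "detail-type": "AWS Console Sign In via CloudTrail",
    "detail": {"eventName": "ConsoleLogin", "sourceIPAddress": "203.0.113.10"},
}
ec2_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"state": "stopped"},
}

sign_in_matched = matches(sign_in_pattern, sign_in_event)
ec2_matched = matches(sign_in_pattern, ec2_event)
```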

EventBridge Schema Registry

  • EventBridge can analyze events in your bus and infer the schema
  • Can generate code for your app that knows in advance how data is structured in the event bus
  • Versioning

EventBridge Resource-based Policy

  • Manage permissions for a specific event bus
  • Useful for aggregating all events from your Organization in a single AWS account or region

CloudWatch Container Insights

  • Collect, aggregate and summarize metrics and logs from containers
  • For EKS and Kubernetes, Insights uses a containerized version of the CloudWatch agent to discover containers

CloudWatch Lambda Insights

  • Monitoring and troubleshooting solution for serverless apps running on Lambda
  • Collects, aggregates and summarizes system-level metrics including CPU time, memory, disk and network
    • Also applicable for diagnostic information such as cold starts and Lambda worker shutdowns
  • Insights is provided as a Lambda Layer

CloudWatch Contributor Insights

  • Analyze log data and create time series that display contributor data
    • Find "top-N" talkers and understand who's impacting system performance
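
The "top-N talkers" idea boils down to counting log records per contributor key and ranking the result. A minimal local sketch over hypothetical, already-parsed log records (Contributor Insights itself is configured with rules rather than code like this):

```python
from collections import Counter


def top_contributors(records, key, n=3):
    """Count records per value of `key` and return the n most frequent."""
    counts = Counter(r[key] for r in records if key in r)
    return counts.most_common(n)


# Hypothetical parsed log records, e.g. from VPC flow logs or access logs.
records = [
    {"sourceIp": "10.0.0.1", "bytes": 512},
    {"sourceIp": "10.0.0.2", "bytes": 128},
    {"sourceIp": "10.0.0.1", "bytes": 2048},
    {"sourceIp": "10.0.0.3", "bytes": 64},
    {"sourceIp": "10.0.0.1", "bytes": 256},
    {"sourceIp": "10.0.0.2", "bytes": 1024},
]

top_two = top_contributors(records, "sourceIp", n=2)
```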

CloudWatch Application Insights

  • Provides automated dashboards that show potential problems with monitored apps to help isolate ongoing issues
  • Gives enhanced visibility into your app health to reduce time to troubleshoot and repair your apps
  • Findings and alerts are sent to EventBridge and SSM OpsCenter

CloudTrail

  • Provides governance, compliance and audit for your AWS account
  • Enabled by default
  • Can put logs into CloudWatch Logs or S3
  • Trails can be applied to all regions (default) or a single region
  • If a resource is deleted, investigate CloudTrail first

CloudTrail Events

  • Management
    • Operations performed on resources in your AWS account
    • Trails are configured to log management events by default
    • Can separate read events from write events
    • Ex. configuring security, setting up logging, etc.
  • Data
    • Data events are not logged by default because they are high-volume operations
    • Can separate read events from write events
    • Ex. S3 activity and Lambda function execution
  • CloudTrail Insights
    • Enable to detect unusual activity in your account
    • Analyzes normal management events to create a baseline, then continuously analyzes write events to find unusual patterns
    • Anomalies appear in CloudTrail console, event is sent to S3 and an EventBridge event is generated (for automation needs)
  • Events are stored for 90 days
    • To keep events beyond this period, log them to S3 and use Athena
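
The management/data and read/write split above can be seen in the event records themselves. The sketch below groups a few trimmed, hypothetical records using the `eventCategory` and `readOnly` fields found in CloudTrail event records; records missing a category are labelled "Unknown", which is a choice made for this example.

```python
from collections import Counter


def classify(record):
    """Return (category, access) for one CloudTrail-style event record."""
    category = record.get("eventCategory", "Unknown")
    access = "read" if record.get("readOnly") else "write"
    return category, access


# Trimmed, hypothetical records for illustration.
records = [
    {"eventName": "DescribeInstances", "eventCategory": "Management", "readOnly": True},
    {"eventName": "TerminateInstances", "eventCategory": "Management", "readOnly": False},
    {"eventName": "GetObject", "eventCategory": "Data", "readOnly": True},
    {"eventName": "PutObject", "eventCategory": "Data", "readOnly": False},
    {"eventName": "DeleteBucket", "eventCategory": "Management", "readOnly": False},
]

summary = Counter(classify(r) for r in records)

# Names of management write events: the first place to look for a deleted resource.
management_writes = [
    r["eventName"] for r in records if classify(r) == ("Management", "write")
]
```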

Config

  • Helps audit and record the compliance of resources and their configuration changes over time
  • Can receive alerts via SNS for changes
  • Config is a per region service
  • Can be aggregated across regions and accounts
  • Can store configuration data into S3 which can be analyzed by Athena

Config Rules

  • Can use AWS managed config rules (there are over 75)
  • Custom config rules are defined in Lambda (AWS Config also supports custom policy rules written in Guard)
  • Rules can be evaluated or triggered for each config change and/or at regular time intervals
  • Config rules don't prevent actions from happening (no deny)
  • There is no free tier, it's $0.003 per configuration item recorded per region and $0.001 per config rule evaluation per region
  • Remediations:
    • Automate remediation of non-compliant resources using SSM Automation Documents
    • Use AWS-Managed Automation Documents or create custom Automation Documents
    • Can set Remediation Retries if the resource is still non-compliant after auto-remediation
  • Notifications:
    • Use EventBridge to trigger notifications when AWS resources are non-compliant
    • Ability to send configuration changes and compliance state notifications to SNS
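
For a feel of what the Lambda behind a custom rule does, here is a sketch of the evaluation logic only. The rule itself (EC2 instances must carry an `Owner` tag) is invented for illustration. The function reads the configuration item from the invoking event and produces a dict shaped like one PutEvaluations entry; the actual call back to AWS Config is omitted.

```python
import json


def evaluate_compliance(configuration_item, required_tag="Owner"):
    """Hypothetical rule: EC2 instances must have the required tag."""
    if configuration_item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"
    tags = configuration_item.get("tags", {})
    return "COMPLIANT" if required_tag in tags else "NON_COMPLIANT"


def build_evaluation(invoking_event_json):
    """Turn the invoking event (a JSON string) into one evaluation result."""
    item = json.loads(invoking_event_json)["configurationItem"]
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": evaluate_compliance(item),
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


# Trimmed, hypothetical invoking event for an untagged instance.
invoking_event = json.dumps(
    {
        "configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "resourceId": "i-0123456789abcdef0",
            "configurationItemCaptureTime": "2024-01-01T00:00:00.000Z",
            "tags": {"Environment": "dev"},
        }
    }
)

evaluation = build_evaluation(invoking_event)
```

The result only reports NON_COMPLIANT; consistent with the note above, nothing here blocks the change, so fixing the resource would be left to a remediation action.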

CloudWatch vs. CloudTrail vs. Config

  • CloudWatch
    • Performance monitoring and dashboards
    • Events and alerting
    • Log aggregation and analysis
  • CloudTrail
    • Record API calls made within your account by everyone
    • Define trails for specific resources
    • Global service
  • Config
    • Record configuration changes
    • Evaluate resources against compliance rules
    • Get timeline of changes and compliance
  • For an ELB:
    • CloudWatch
      • Monitor incoming connections metric
      • Visualize error codes as percentage over time
      • Make dashboard to visualize load balancer performance
    • CloudTrail
      • Track who made changes to the load balancer with API calls
    • Config
      • Track security group rules for load balancer
      • Track configuration changes for the load balancer
      • Ensure an SSL certificate is always assigned to the load balancer (compliance)