Graduate Program KB

Section 17 - Data & Analytics

Athena

  • Serverless query service to analyze data stored in S3
  • Uses SQL to query files
  • Supports CSV, JSON, ORC, Avro and Parquet
  • Pricing is $5 per TB of scanned data
  • Commonly used with Quicksight for reporting/dashboards
  • Useful for business intelligence, analytics and querying VPC Flow Logs, ELB logs, CloudTrail logs, etc.
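
A minimal sketch, assuming hypothetical bucket/database/table names, of running an Athena SQL query over data in S3 with boto3:

  import boto3

  athena = boto3.client("athena")

  # Start an asynchronous query; results land in the S3 output location
  response = athena.start_query_execution(
      QueryString="SELECT elb_status_code, COUNT(*) FROM logs.elb_logs GROUP BY elb_status_code",
      QueryExecutionContext={"Database": "logs"},
      ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
  )
  print(response["QueryExecutionId"])  # poll get_query_execution with this ID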

Athena Performance Improvements

  • Columnar data is cost efficient as less data is scanned
    • Use Apache Parquet or ORC; data can be converted to these formats using AWS Glue
  • Compress data for smaller retrievals
  • Partition datasets in S3 for easy querying on virtual columns
  • Use larger files (greater than 128MB) to minimize overhead
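
A minimal sketch, assuming hypothetical names, of applying these tips with a CTAS query that rewrites CSV data as partitioned, Snappy-compressed Parquet so later queries scan less data:

  import boto3

  athena = boto3.client("athena")

  # Create a new Parquet table from an existing CSV-backed table
  ctas = """
  CREATE TABLE logs.elb_logs_parquet
  WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://example-bucket/elb-logs-parquet/',
    partitioned_by = ARRAY['year']
  ) AS
  SELECT elb_status_code, request_url, year FROM logs.elb_logs
  """

  athena.start_query_execution(
      QueryString=ctas,
      ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
  )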

Athena Federated Query

  • Run SQL queries across data stored in various data sources (cloud or on-premises)
  • Uses data source connectors running on Lambda to execute federated queries
  • The results are stored back in S3
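
A minimal sketch, assuming a hypothetical connector already registered in Athena as the catalog "mysql_catalog", of running a federated query with three-part naming (catalog.database.table):

  import boto3

  athena = boto3.client("athena")

  # The Lambda-based connector is invoked behind the scenes; results go to S3
  athena.start_query_execution(
      QueryString='SELECT * FROM "mysql_catalog"."sales"."orders" LIMIT 10',
      ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
  )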

Redshift

  • For OLAP (online analytical processing), not for OLTP
  • 10x better performance than other data warehouses and can scale to PBs of data
  • Uses a columnar storage structure instead of rows and uses a parallel query engine
  • There are two modes: Provisioned cluster or serverless cluster
  • Integrated with BI tools such as Quicksight and Tableau
  • Compared to Athena, it has faster queries, joins and aggregations due to indexes
  • Cluster:
    • Leader node for query planning and aggregating results
    • Compute node for performing queries and sending results to the leader node
    • Provisioned mode involves selecting an instance type; reserved instances can be used for cost savings
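
A minimal sketch, assuming a hypothetical provisioned cluster and database, of running an analytical query through the Redshift Data API:

  import boto3

  redshift_data = boto3.client("redshift-data")

  # Submit a query to the cluster; the leader node plans it, compute nodes execute it
  resp = redshift_data.execute_statement(
      ClusterIdentifier="example-cluster",
      Database="analytics",
      DbUser="awsuser",
      Sql="SELECT region, SUM(amount) FROM sales GROUP BY region",
  )
  print(resp["Id"])  # poll describe_statement / get_statement_result with this Id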

Redshift Snapshots and Disaster Recovery

  • Multi-AZ mode is available for some clusters
  • Point-in-time snapshots are stored internally in S3
    • Snapshots are incremental, storing only what has changed since the last snapshot
  • Snapshots can be restored into a new cluster
  • Snapshots are automated, occurring every 8 hours, every 5GB of data change or on a schedule, with a retention period of 1 to 35 days
    • Manual snapshots are retained until you delete them
  • Redshift can be configured to automatically copy snapshots of a cluster to another region
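
A minimal sketch, assuming hypothetical identifiers, of enabling automatic cross-region snapshot copy for disaster recovery:

  import boto3

  redshift = boto3.client("redshift", region_name="us-east-1")

  # Copy the cluster's snapshots to another region and keep them for 7 days
  redshift.enable_snapshot_copy(
      ClusterIdentifier="example-cluster",
      DestinationRegion="eu-west-1",
      RetentionPeriod=7,
  )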

OpenSearch

  • Can search any field, including partial matches, unlike DynamoDB which only queries by primary keys or indexes
  • Typically used as a complement to another database
  • There are two modes: Managed cluster or serverless cluster
  • Doesn't support SQL natively (can be enabled via a plugin)
  • Ingestion from Kinesis Data Firehose, AWS IoT and CloudWatch Logs
  • Security via Cognito, IAM, KMS, TLS
  • Has OpenSearch Dashboards for visualization
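
A minimal sketch, assuming a hypothetical domain endpoint, index and credentials, of a partial-match search using the opensearch-py client:

  from opensearchpy import OpenSearch

  client = OpenSearch(
      hosts=[{"host": "search-example-domain.us-east-1.es.amazonaws.com", "port": 443}],
      http_auth=("user", "password"),
      use_ssl=True,
  )

  # Wildcard query: matches documents whose "name" field starts with "lap"
  results = client.search(
      index="products",
      body={"query": {"wildcard": {"name": "lap*"}}},
  )
  print(results["hits"]["total"])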

EMR

  • Creates Hadoop clusters (big data) to analyze and process vast amounts of data
  • Clusters can be made of hundreds of EC2 instances
  • EMR is bundled with Apache Spark, HBase, Presto, Flink, etc. and takes care of all provisioning and configuration
  • Has auto-scaling and integration with Spot instances
  • Useful for data processing, machine learning, web indexing and big data

EMR Node Types

  • Master: Manages the cluster, coordinates tasks and monitors health
  • Core: Runs tasks and stores data
  • Task (optional): Just runs tasks
  • Purchasing options:
    • On-demand: Reliable, predictable and won't be terminated
    • Reserved (min. 1 year): Cost savings (EMR will automatically use them if available)
    • Spot instances: Cheaper but less reliable as they can be terminated at any time
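
A minimal sketch, assuming hypothetical names and the default EMR roles, of creating a cluster with master/core nodes on On-Demand and task nodes on Spot:

  import boto3

  emr = boto3.client("emr")

  emr.run_job_flow(
      Name="example-spark-cluster",
      ReleaseLabel="emr-6.15.0",
      Applications=[{"Name": "Spark"}],
      Instances={
          "InstanceGroups": [
              {"Name": "master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
               "InstanceCount": 1, "Market": "ON_DEMAND"},
              {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
               "InstanceCount": 2, "Market": "ON_DEMAND"},
              {"Name": "task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
               "InstanceCount": 2, "Market": "SPOT"},  # cheaper, can be reclaimed
          ],
          "KeepJobFlowAliveWhenNoSteps": True,
      },
      JobFlowRole="EMR_EC2_DefaultRole",
      ServiceRole="EMR_DefaultRole",
  )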

Quicksight

  • A serverless ML-powered business intelligence service to create interactive dashboards
  • A dashboard is a read-only snapshot of an analysis you can share (configuration of analysis is preserved)
    • Dashboards must be published before they can be shared
    • It can be shared with Users (standard) or Groups (enterprise), these exist within Quicksight, not IAM
  • Fast, automatically scalable and embeddable, with per-session pricing
  • Integrated with RDS, Aurora, Athena, Redshift, S3, etc.
  • In-memory computation using SPICE engine if data is imported into Quicksight
  • With the enterprise edition, you can set up column-level security
  • Useful for business analytics, visualizations, ad-hoc analysis and getting business insights using data
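
A minimal sketch, assuming hypothetical account, user and dashboard IDs, of generating an embed URL for a published dashboard:

  import boto3

  qs = boto3.client("quicksight")

  resp = qs.generate_embed_url_for_registered_user(
      AwsAccountId="123456789012",
      UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
      ExperienceConfiguration={"Dashboard": {"InitialDashboardId": "example-dashboard-id"}},
  )
  print(resp["EmbedUrl"])  # URL to embed the dashboard in another application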

Glue

  • Fully managed extract, transform and load service
  • Useful to prepare and transform data for analytics
  • Glue Job Bookmarks: Prevent re-processing old data (see the sketch after this list)
  • Glue Elastic Views:
    • Combine and replicate data across multiple data stores using SQL
    • No custom code, Glue monitors for changes in the source data
    • Leverages a virtual table, which is just a materialized view
  • Glue DataBrew: Clean and normalize data using pre-built transformations
  • Glue Studio: New GUI to create, run and monitor ETL jobs in Glue
  • Glue Streaming ETL: Compatible with Kinesis Data Streams, Apache Kafka and MSK
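
A minimal sketch, assuming a hypothetical job name, of starting a Glue ETL job run with job bookmarks enabled so already-processed data is skipped:

  import boto3

  glue = boto3.client("glue")

  glue.start_job_run(
      JobName="example-etl-job",
      Arguments={"--job-bookmark-option": "job-bookmark-enable"},
  )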

Lake Formation

  • A fully managed service built on top of AWS Glue that makes it easy to set up a data lake in days
  • Data lakes are a central location to have all your data (structured and unstructured) for analytical purposes
  • You can discover, clean, transform and ingest data in your data lake
  • Many complex steps are automated, such as collecting, cleansing, moving, cataloguing and de-duplicating data
  • Out-of-the-box source blueprints for S3, RDS, relational and NoSQL databases
  • Fine-grained access control for your apps (row and column level)
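
A minimal sketch, assuming hypothetical role, database, table and column names, of granting column-level access on a data lake table:

  import boto3

  lf = boto3.client("lakeformation")

  # The analyst role can SELECT only the listed (non-sensitive) columns
  lf.grant_permissions(
      Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
      Resource={
          "TableWithColumns": {
              "DatabaseName": "sales_db",
              "Name": "orders",
              "ColumnNames": ["order_id", "order_date", "total"],
          }
      },
      Permissions=["SELECT"],
  )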

Kinesis Data Analytics (SQL)

  • A fully managed service providing real-time analytics on Kinesis Data Streams and Firehose using SQL
  • Can add reference data from S3 to enrich streaming data
  • Has automatic scaling; pay based on the consumption rate
  • Output:
    • Data Streams: Create streams out of the real-time analytic queries
    • Data Firehose: Send analytics query results to destinations
  • Useful for timeseries analytics, real-time dashboards, real-time metrics

Kinesis Data Analytics (Apache Flink)

  • Use Flink (Java, Scala or SQL) to process and analyze streaming data
  • Run any Apache Flink app on a managed cluster on AWS
    • Provision compute resources, parallel computation with automatic scaling
    • App backups (checkpoints and snapshots)
    • Use any Apache Flink programming features
    • Flink doesn't read from Firehose (use Kinesis Analytics for SQL instead)
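
A minimal sketch, assuming a hypothetical application name, of starting a managed Flink application with the Kinesis Analytics v2 API:

  import boto3

  kda = boto3.client("kinesisanalyticsv2")

  kda.start_application(
      ApplicationName="example-flink-app",
      RunConfiguration={"FlinkRunConfiguration": {"AllowNonRestoredState": True}},
  )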

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

  • A fully managed Apache Kafka implementation on AWS
    • An alternative to Amazon Kinesis
  • Features:
    • Create, update and delete clusters
    • MSK creates and manages Kafka broker nodes and Zookeeper nodes for you
    • Deploy the MSK cluster in a VPC with Multi-AZ
    • Automatic recovery from common Kafka failures
    • Data is stored on EBS volumes for as long as you want
  • MSK Serverless
    • Run Kafka on MSK without managing capacity
    • MSK automatically provisions resources and scales compute and storage
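
A minimal sketch, assuming hypothetical broker endpoints, of producing a message to an MSK topic over TLS with a standard Kafka client (MSK is wire-compatible with Apache Kafka):

  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9094"],
      security_protocol="SSL",  # TLS in-flight encryption
  )
  producer.send("orders", b'{"order_id": 123}')
  producer.flush()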

Kinesis Data Streams vs. MSK

  • Kinesis Data Streams
    • 1MB message size limit
    • Data streams with shards
    • Shard splitting and merging
    • TLS in-flight encryption
    • KMS at-rest encryption
  • MSK
    • 1MB message size by default, can be configured higher (e.g., 10MB)
    • Kafka topics with partitions
    • Can only add partitions to a topic
    • PLAINTEXT or TLS in-flight encryption
    • KMS at-rest encryption