Section 17 - Data & Analytics
Athena
- Serverless query service to analyze data stored in S3
- Uses SQL to query files
- Supports CSV, JSON, ORC, Avro and Parquet
- Pricing is $5 per TB of scanned data
- Commonly used with Quicksight for reporting/dashboards
- Useful for business intelligence, analytics and reporting; e.g. querying VPC Flow Logs, ELB logs, CloudTrail logs, etc.
- Columnar formats are cost efficient because queries scan less data
- Use Apache Parquet or ORC; AWS Glue can convert existing data into either format
- Compress data for smaller retrievals
- Partition datasets in S3 for easy querying on virtual columns
- Use larger files greater than 128MB to minimize overhead
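The partitioning tip above can be sketched as a minimal boto3 call; the database, table, column and result-bucket names are hypothetical placeholders:

```python
# Minimal sketch of running a partition-pruned Athena query with boto3.
# Database, table, and bucket names below are hypothetical placeholders.

# Filtering on the partition column (`dt`) prunes S3 objects before the scan,
# which directly reduces the $5-per-TB-scanned cost.
QUERY = """
SELECT status, COUNT(*) AS hits
FROM access_logs
WHERE dt = '2024-01-15'
GROUP BY status
"""

def run_query(database: str, output_s3: str) -> str:
    """Start the query asynchronously; returns an execution id to poll."""
    import boto3  # imported here so the module loads without the SDK
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},  # results always land in S3
    )
    return resp["QueryExecutionId"]

# Usage (requires AWS credentials):
#   execution_id = run_query("logs_db", "s3://my-athena-results/")
```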
Athena Federated Query
- Run SQL queries across data stored in various data sources (cloud or on-premises)
- Uses data source connectors (running on Lambda) to execute federated queries
- The results are stored back in S3
Redshift
- For OLAP (online analytical processing), not for OLTP
- Up to 10x better performance than other data warehouses; scales to PBs of data
- Uses a columnar storage structure instead of rows and uses a parallel query engine
- There are two modes: Provisioned cluster or serverless cluster
- Integrated with BI tools such as Quicksight and Tableau
- Compared to Athena: faster queries, joins and aggregations, since data is indexed and pre-loaded rather than scanned from raw S3 files
- Cluster:
- Leader node for query planning and aggregating results
- Compute node for performing queries and sending results to the leader node
- Provisioned mode involves selecting an instance type, reserve instances can be used for cost savings
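As a sketch, a typical warehouse aggregation can be submitted through the Redshift Data API; the cluster id, database and user below are hypothetical placeholders:

```python
# Sketch: submitting an OLAP-style aggregation to a provisioned Redshift
# cluster via the Redshift Data API. Identifiers are hypothetical.

# A typical warehouse query: Redshift's columnar storage and parallel
# query engine are built for scans + aggregations like this one.
SQL = """
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC
"""

def run_statement(cluster_id: str, database: str, db_user: str) -> str:
    """Submit asynchronously; returns a statement id to poll for results."""
    import boto3  # imported here so the module loads without the SDK
    client = boto3.client("redshift-data")
    resp = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=SQL,
    )
    return resp["Id"]

# Usage (requires AWS credentials):
#   stmt_id = run_statement("analytics-cluster", "dev", "awsuser")
```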
Redshift Snapshots and Disaster Recovery
- Multi-AZ deployments are available for some cluster types
- PITR snapshots stored internally in S3
- Snapshots are incremental, storing only what has changed since the previous snapshot
- Snapshots can be restored into a new cluster
- Automated snapshots occur every 8 hours, every 5 GB of data change, or on a schedule, with a set retention between 1 to 35 days
- Manual snapshots are retained until you delete them
- Redshift can be configured to automatically copy snapshots of a cluster to another region
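The cross-region copy setting above maps to a single API call; the cluster id and destination region are hypothetical placeholders:

```python
# Sketch: enabling automatic cross-region snapshot copy for a Redshift
# cluster. The cluster id and destination region are hypothetical.

def snapshot_copy_params(cluster_id: str, dest_region: str,
                         retention_days: int = 7) -> dict:
    """Build the request for `enable_snapshot_copy`."""
    return {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": dest_region,
        # How long copied automated snapshots are kept in the destination
        # region (1-35 days, mirroring the source-side retention window)
        "RetentionPeriod": retention_days,
    }

PARAMS = snapshot_copy_params("analytics-cluster", "us-west-2")

# Applying it (requires AWS credentials):
#   import boto3
#   boto3.client("redshift").enable_snapshot_copy(**PARAMS)
```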
OpenSearch
- Can search any field, including partial matches, unlike DynamoDB which only queries by primary key or index
- Typically used as a complement to another database
- There are two modes: Managed cluster or serverless cluster
- Doesn't natively support SQL unless enabled via a plugin
- Ingestion from Kinesis Data Firehose, AWS IoT and CloudWatch Logs
- Security via Cognito, IAM, KMS, TLS
- Has OpenSearch Dashboards for visualization
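The partial-match capability above is the main reason to pair OpenSearch with a key-value store; a minimal query body sketch (index and field names are hypothetical):

```python
# Sketch: a partial-match query body for OpenSearch, the kind of search
# DynamoDB cannot serve (it only queries by key/index). Index and field
# names are hypothetical.

def partial_match_query(field: str, text: str) -> dict:
    """Build a match_phrase_prefix query: matches documents where `field`
    starts a phrase with `text`, e.g. 'lapt' matches 'laptop stand'."""
    return {"query": {"match_phrase_prefix": {field: text}}}

BODY = partial_match_query("product_name", "lapt")

# Sending it needs an OpenSearch client (e.g. opensearch-py), roughly:
#   client.search(index="products", body=BODY)
```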
EMR
- Creates Hadoop clusters (big data) to analyze and process vast amounts of data
- Clusters can be made of hundreds of EC2 instances
- EMR is bundled with Apache Spark, HBase, Presto, Flink, etc. and takes care of all provisioning and configuration
- Has auto-scaling and integration with Spot instances
- Useful for data processing, machine learning, web indexing and big data
EMR Node Types
- Master: Manages the cluster, coordinates tasks and monitors health
- Core: Runs tasks and stores data
- Task (optional): Just runs tasks
- Purchasing options:
- On-demand: Reliable, predictable and won't be terminated
- Reserved (min. 1 year): Cost savings (EMR will automatically use if available)
- Spot instances: Cheaper but less reliable as it can be terminated anytime
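The node types and purchasing options above can be combined in one cluster definition; a boto3 sketch with hypothetical instance types and counts, using Spot only for the disposable task nodes:

```python
# Sketch: the instance-group layout for an EMR cluster using all three node
# types, with Spot for the disposable task nodes. Types/counts hypothetical.

INSTANCE_GROUPS = [
    {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
     "InstanceCount": 1, "Market": "ON_DEMAND"},  # coordinates the cluster: keep reliable
    {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
     "InstanceCount": 2, "Market": "ON_DEMAND"},  # runs tasks AND stores HDFS data
    {"Name": "Task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
     "InstanceCount": 4, "Market": "SPOT"},       # compute only, so safe to interrupt
]

def launch_cluster(name: str, log_uri: str) -> str:
    """Launch a Spark cluster; returns the cluster (job flow) id."""
    import boto3  # imported here so the module loads without the SDK
    emr = boto3.client("emr")
    resp = emr.run_job_flow(
        Name=name,
        ReleaseLabel="emr-6.15.0",          # assumed release label
        Applications=[{"Name": "Spark"}],
        Instances={"InstanceGroups": INSTANCE_GROUPS,
                   "KeepJobFlowAliveWhenNoSteps": True},
        LogUri=log_uri,
        JobFlowRole="EMR_EC2_DefaultRole",  # default roles, if created
        ServiceRole="EMR_DefaultRole",
    )
    return resp["JobFlowId"]
```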
Quicksight
- A serverless ML-powered business intelligence service to create interactive dashboards
- A dashboard is a read-only snapshot of an analysis you can share (configuration of analysis is preserved)
- Dashboards must be published before they can be shared
- It can be shared with Users (standard) or Groups (enterprise), these exist within Quicksight, not IAM
- Fast, automatically scalable and embeddable, with per-session pricing
- Integrated with RDS, Aurora, Athena, Redshift, S3, etc.
- In-memory computation using SPICE engine if data is imported into Quicksight
- With the enterprise edition, you can set up column-level security
- Useful for business analytics, visualizations, ad-hoc analysis and getting business insights using data
Glue
- Fully managed extract, transform and load (ETL) service
- Useful to prepare and transform data for analytics
- Glue Job Bookmarks: Prevent re-processing old data
- Glue Elastic Views:
- Combine and replicate data across multiple data stores using SQL
- No custom code, Glue monitors for changes in the source data
- Leverages a virtual table, which is just a materialized view
- Glue DataBrew: Clean and normalize data using pre-built transformation
- Glue Studio: New GUI to create, run and monitor ETL jobs in Glue
- Glue Streaming ETL: Compatible with Kinesis Data Streams, Apache Kafka and Amazon MSK
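Job bookmarks are just a job argument; a boto3 sketch of creating a bookmarked ETL job, with a hypothetical role ARN and script path:

```python
# Sketch: creating a Glue ETL job with job bookmarks enabled, so re-runs
# skip data that was already processed. Role ARN and script path are
# hypothetical placeholders.

JOB_ARGS = {
    # Bookmarks persist state between runs; "job-bookmark-disable" and
    # "job-bookmark-pause" are the other options.
    "--job-bookmark-option": "job-bookmark-enable",
}

def create_etl_job(name: str, role_arn: str, script_s3: str) -> dict:
    """Create the job definition (does not run it)."""
    import boto3  # imported here so the module loads without the SDK
    glue = boto3.client("glue")
    return glue.create_job(
        Name=name,
        Role=role_arn,
        Command={"Name": "glueetl",
                 "ScriptLocation": script_s3,
                 "PythonVersion": "3"},
        DefaultArguments=JOB_ARGS,
        GlueVersion="4.0",  # assumed Glue version
    )
```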
Lake Formation
- A fully managed service built on top of AWS Glue that makes it easy to set up a data lake in days
- Data lakes are a central location to have all your data (structured and unstructured) for analytical purposes
- You can discover, clean, transform and ingest data in your data lake
- Many complex steps are automated, such as collecting, cleansing, moving, cataloguing and de-duplicating data
- Out-of-the-box source blueprints for S3, RDS, and relational and NoSQL databases
- Fine-grained access control for your apps (row and column level)
Kinesis Data Analytics (SQL)
- A fully managed service providing real-time analytics on Kinesis Data Streams and Firehose using SQL
- Can add reference data from S3 to enrich streaming data
- Has automatic scaling and pay based on actual consumption
- Output:
- Data Streams: Create streams out of the real-time analytic queries
- Data Firehose: Send analytics query results to destinations
- Useful for timeseries analytics, real-time dashboards, real-time metrics
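A sketch of the kind of SQL such an application runs, here a tumbling 10-second count per key. "SOURCE_SQL_STREAM_001" follows the service's default in-application stream naming; the columns are hypothetical:

```python
# Sketch of a Kinesis Data Analytics (SQL) application query: a tumbling
# 10-second window counting records per key, written to an output stream.
# "SOURCE_SQL_STREAM_001" is the default in-application input stream name;
# the ticker columns are hypothetical.

KDA_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker VARCHAR(4), ticker_count INTEGER);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM ticker, COUNT(*) AS ticker_count
  FROM "SOURCE_SQL_STREAM_001"
  -- STEP(... BY INTERVAL '10' SECOND) buckets rows into tumbling windows
  GROUP BY ticker, STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
"""
```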
Kinesis Data Analytics (Apache Flink)
- Use Flink (Java, Scala or SQL) to process and analyze streaming data
- Run any Apache Flink app on a managed cluster on AWS
- Provisioned compute resources, parallel computation and automatic scaling
- App backups (checkpoints and snapshots)
- Use any Apache Flink programming features
- Flink doesn't read from Firehose (use Kinesis Analytics for SQL instead)
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- A fully managed Apache Kafka implementation on AWS
- An alternative to Amazon Kinesis
- Features:
- Create, update and delete clusters
- MSK creates and manages Kafka broker nodes and Zookeeper nodes for you
- Deploy MSK clusters in your VPC across multiple AZs (Multi-AZ)
- Automatic recovery from common Kafka failures
- Data is stored on EBS volumes for as long as you want
- MSK Serverless
- Run Kafka on MSK without managing capacity
- MSK automatically provisions resources and scales compute and storage
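A sketch of the request shape for an MSK Serverless cluster via boto3's `create_cluster_v2`; subnet and security-group ids are hypothetical placeholders:

```python
# Sketch: the request shape for creating an MSK Serverless cluster with
# boto3's `create_cluster_v2`. Subnet/security-group ids are hypothetical.

SERVERLESS_SPEC = {
    "VpcConfigs": [{
        "SubnetIds": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],  # one per AZ
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    }],
    # MSK Serverless clients authenticate with IAM
    "ClientAuthentication": {"Sasl": {"Iam": {"Enabled": True}}},
}

def create_serverless_cluster(name: str) -> str:
    """Create the cluster; returns its ARN."""
    import boto3  # imported here so the module loads without the SDK
    kafka = boto3.client("kafka")
    resp = kafka.create_cluster_v2(ClusterName=name, Serverless=SERVERLESS_SPEC)
    return resp["ClusterArn"]
```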
Kinesis Data Streams vs. MSK
- Kinesis Data Streams
- 1MB message size limit
- Data streams with shards
- Shard splitting and merging
- TLS in-flight encryption
- KMS at-rest encryption
- MSK
- 1MB default, configurable up to 10MB
- Kafka topics with partitions
- Can only add partitions to a topic
- PLAINTEXT or TLS in-flight encryption
- KMS at-rest encryption