Section 17 - Data & Analytics
Athena
- Serverless query service to analyze data stored in S3
 
- Uses SQL to query files
 
- Supports CSV, JSON, ORC, Avro and Parquet
 
- Pricing is $5 per TB of scanned data
 
- Commonly used with Quicksight for reporting/dashboards
 
- Useful for business intelligence, analytics, and querying VPC Flow Logs, ELB logs, CloudTrail logs, etc.
 
- Columnar formats are cost-efficient because less data is scanned
- Use Apache Parquet or ORC; AWS Glue can convert your data to these formats
 
 
- Compress data for smaller retrievals
 
- Partition datasets in S3 for easy querying on virtual columns
 
- Use large files (greater than 128 MB) to minimize overhead
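
Queries are submitted asynchronously through the Athena API. Below is a minimal boto3 sketch (the database, table, partition columns and results bucket are all hypothetical) that queries a partitioned table so only the matching S3 prefixes are scanned:

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical names: logs_db database, vpc_flow_logs table partitioned by year/month.
resp = athena.start_query_execution(
    QueryString="""
        SELECT action, count(*) AS hits
        FROM vpc_flow_logs
        WHERE year = '2024' AND month = '06'  -- partition pruning limits the data scanned
        GROUP BY action
    """,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
query_id = resp["QueryExecutionId"]

# Athena is asynchronous: poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

Federated queries go through the same start_query_execution call; the SQL simply references a catalog backed by a Lambda data source connector.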
 
Athena Federated Query
- Run SQL queries across data stored in various data sources (cloud or on-premises)
 
- Uses data source connectors that run on Lambda to execute federated queries
 
- The results are stored back in S3
 
Redshift
- For OLAP (online analytical processing), not for OLTP
 
- 10x better performance than other data warehouses; scales to PBs of data
 
- Uses a columnar storage structure instead of rows and uses a parallel query engine
 
- There are two modes: Provisioned cluster or serverless cluster
 
- Integrated with BI tools such as Quicksight and Tableau
 
- Compared to Athena, it has faster queries, joins and aggregations due to indexes
 
- Cluster:
- Leader node for query planning and aggregating results
 
- Compute node for performing queries and sending results to the leader node
 
- Provisioned mode involves selecting an instance type; reserved instances can be used for cost savings
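
A minimal sketch using the Redshift Data API via boto3 (the cluster, database and table names are hypothetical); the leader node plans the query and aggregates what the compute nodes return:

```python
import time
import boto3

rsd = boto3.client("redshift-data")

# Hypothetical provisioned cluster; for a serverless workgroup, pass
# WorkgroupName instead of ClusterIdentifier/DbUser.
resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT region, sum(revenue) FROM sales GROUP BY region;",
)

# The Data API is asynchronous: poll until the statement finishes.
while rsd.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

rows = rsd.get_statement_result(Id=resp["Id"])["Records"]
```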
 
 
Redshift Snapshots and Disaster Recovery
- Multi-AZ is available for some cluster types (RA3)
 
- Snapshots are point-in-time backups of a cluster, stored internally in S3
- Snapshots are incremental, storing only what has changed since the last snapshot
 
 
- Snapshots can be restored into a new cluster
 
- Automated snapshots occur every 8 hours, every 5 GB of data change, or on a schedule; retention can be set between 1 and 35 days
- Manual snapshots are retained until you delete them
 
 
- Redshift can be configured to automatically copy snapshots of a cluster to another region
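
Cross-region copy is enabled per cluster; a short boto3 sketch (the identifiers and region are hypothetical):

```python
import boto3

redshift = boto3.client("redshift")

# Automated snapshots of this cluster will now also be copied to us-west-2 for DR.
redshift.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep the copied automated snapshots
)
```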
 
OpenSearch
- Can search any field, including partial matches, unlike DynamoDB which queries by primary key or index
 
- Typically used as a complement to another database
 
- There are two modes: Managed cluster or serverless cluster
 
- Doesn't natively support SQL unless enabled via a plugin
 
- Ingestion from Firehose, AWS IoT and CloudWatch Logs
 
- Security via Cognito, IAM, KMS, TLS
 
- Has OpenSearch Dashboards for visualization
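
A minimal sketch of the full-text search behaviour using the opensearch-py client (the domain endpoint, credentials and index name are hypothetical):

```python
from opensearchpy import OpenSearch

# Hypothetical OpenSearch domain endpoint and basic-auth credentials.
client = OpenSearch(
    hosts=[{"host": "search-mydomain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

# Index a document, then search by any field with a full-text match --
# no primary key or predefined index key is required, unlike DynamoDB.
client.index(index="products", id="1", body={"name": "blue running shoes", "price": 89})
client.indices.refresh(index="products")  # make the document searchable immediately

results = client.search(
    index="products",
    body={"query": {"match": {"name": "running"}}},
)
```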
 
EMR
- Creates Hadoop clusters (big data) to analyze and process vast amounts of data
 
- Clusters can be made of hundreds of EC2 instances
 
- EMR is bundled with Apache Spark, HBase, Presto, Flink, etc. and takes care of all provisioning and configuration
 
- Has auto-scaling and integration with Spot instances
 
- Useful for data processing, machine learning, web indexing and big data
 
EMR Node Types
- Master: Manages the cluster, coordinates the other nodes and monitors their health
 
- Core: Runs tasks and stores data
 
- Task (optional): Just runs tasks
 
- Purchasing options:
- On-demand: Reliable, predictable and won't be terminated
 
- Reserved (min. 1 year): Cost savings (EMR will automatically use if available)
 
- Spot instances: Cheaper but less reliable, as they can be terminated at any time
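
The node types and purchasing options map directly onto instance groups when a cluster is created; a hedged boto3 sketch (the names, release label and instance counts are hypothetical):

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="spark-analytics",  # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            # Master and core nodes stay on-demand: they manage the cluster and store HDFS data.
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "ON_DEMAND"},
            # Task nodes only run tasks, so losing a spot instance is tolerable.
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```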
 
 
Quicksight
- A serverless ML-powered business intelligence service to create interactive dashboards
 
- A dashboard is a read-only snapshot of an analysis you can share (the analysis configuration is preserved)
- A dashboard must be published before it can be shared
 
- It can be shared with Users (standard edition) or Groups (enterprise edition); these exist within Quicksight, not IAM
 
 
- Fast, automatically scalable and embeddable, with per-session pricing
 
- Integrated with RDS, Aurora, Athena, Redshift, S3, etc.
 
- In-memory computation using SPICE engine if data is imported into Quicksight
 
- With the enterprise edition, you can set up column-level security
 
- Useful for business analytics, visualizations, ad-hoc analysis and getting business insights using data
 
Glue
- Fully managed extract, transform and load (ETL) service
 
- Useful to prepare and transform data for analytics
 
- Glue Job Bookmarks: Prevent re-processing old data (see the sketch after this list)
 
- Glue Elastic Views:
- Combine and replicate data across multiple data stores using SQL
 
- No custom code, Glue monitors for changes in the source data
 
- Leverages a virtual table, which is just a materialized view
 
 
- Glue DataBrew: Clean and normalize data using pre-built transformations
 
- Glue Studio: New GUI to create, run and monitor ETL jobs in Glue
 
- Glue Streaming ETL: Compatible with Kinesis Data Streams, Apache Kafka, MSK
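
A minimal boto3 sketch of creating a Glue job with job bookmarks enabled (the script location and IAM role are hypothetical):

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="daily-etl",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-etl-scripts/job.py"},
    # Job bookmarks track already-processed data so reruns skip it.
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    GlueVersion="4.0",
)
```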
 
Lake Formation
- A fully managed service built on top of AWS Glue that makes it easy to set up a data lake in days
 
- Data lakes are a central location to have all your data (structured and unstructured) for analytical purposes
 
- You can discover, clean, transform and ingest data in your data lake
 
- Many complex steps are automated, such as collecting, cleansing, moving, cataloguing and de-duplicating data
 
- Out-of-the-box source blueprints for S3, RDS, Relational and NoSQL databases
 
- Fine-grained access control for your apps (row and column level)
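
Column-level access is granted through Lake Formation permissions rather than IAM policies; a hedged boto3 sketch (the principal, database and column names are hypothetical):

```python
import boto3

lf = boto3.client("lakeformation")

# Analysts may SELECT only the listed columns of sales_db.orders; any
# other columns (e.g. PII) stay invisible to them.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    Permissions=["SELECT"],
)
```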
 
Kinesis Data Analytics (SQL)
- A fully managed service providing real-time analytics on Kinesis Data Streams and Firehose using SQL
 
- Can add reference data from S3 to enrich streaming data
 
- Has automatic scaling and pricing based on actual consumption
 
- Output:
- Data Streams: Create streams out of the real-time analytic queries
 
- Data Firehose: Send analytics query results to destinations
 
 
- Useful for time-series analytics, real-time dashboards, real-time metrics
 
Kinesis Data Analytics (Apache Flink)
- Use Flink (Java, Scala or SQL) to process and analyze streaming data
 
- Run any Apache Flink app on a managed cluster on AWS
- Provision compute resources, parallel computation with automatic scaling
 
- App backups (checkpoints and snapshots)
 
- Use any Apache Flink programming features
 
- Flink doesn't read from Firehose (use Kinesis Analytics for SQL instead)
 
 
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- A fully managed Apache Kafka implementation on AWS
- An alternative to Amazon Kinesis
 
 
- Features:
- Create, update and delete clusters
 
- MSK creates and manages Kafka broker nodes and Zookeeper nodes for you

- Deploy MSK clusters in a VPC, Multi-AZ (up to 3 AZs)
 
- Automatic recovery from common Kafka failures
 
- Data is stored on EBS volumes for as long as you want
 
 
- MSK Serverless
- Run Kafka on MSK without managing capacity
 
- MSK automatically provisions resources and scales compute and storage
 
 
Kinesis Data Streams vs. MSK
- Kinesis Data Streams
- 1MB message size limit
 
- Data streams with shards
 
- Shard splitting and merging
 
- TLS in-flight encryption
 
- KMS at-rest encryption
 
 
- MSK
- 1MB default, configurable up to 10MB
 
- Kafka topics with partitions
 
- Can only add partitions to a topic
 
- PLAINTEXT or TLS in-flight encryption
 
- KMS at-rest encryption
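
To make the comparison concrete, here is a hedged sketch of producing the same record to each service (the stream, topic and broker endpoint are hypothetical; the Kafka side assumes the kafka-python library):

```python
import boto3
from kafka import KafkaProducer  # kafka-python, assumed installed

# Kinesis Data Streams: records go to shards, 1 MB size limit applies.
kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="orders-stream",  # hypothetical stream
    Data=b'{"order_id": 1}',
    PartitionKey="order-1",      # hashed to choose the shard
)

# MSK: a standard Kafka client pointed at the broker endpoints MSK exposes.
producer = KafkaProducer(
    bootstrap_servers="b-1.mycluster.kafka.us-east-1.amazonaws.com:9094",  # hypothetical
    security_protocol="SSL",     # TLS in-flight encryption
)
producer.send("orders-topic", b'{"order_id": 1}')  # partitioner picks the topic partition
producer.flush()
```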