Graduate Program KB

Section 17 - Data & Analytics

Athena

  • Serverless query service to analyze data stored in S3
  • Uses SQL to query files
  • Supports CSV, JSON, ORC, Avro and Parquet
  • Pricing is $5 per TB of scanned data
  • Commonly used with Quicksight for reporting/dashboards
  • Useful for business intelligence, analytics and querying VPC Flow Logs, ELB logs, CloudTrail logs, etc.
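
A minimal sketch, assuming hypothetical bucket/database/table names, of running an Athena SQL query over data in S3 with boto3:

  import boto3

  athena = boto3.client("athena")

  # Start an asynchronous query; results land in the S3 output location
  response = athena.start_query_execution(
      QueryString="SELECT elb_status_code, COUNT(*) FROM logs.elb_logs GROUP BY elb_status_code",
      QueryExecutionContext={"Database": "logs"},
      ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
  )
  print(response["QueryExecutionId"])  # poll get_query_execution with this ID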

Athena Performance Improvements

  • Columnar data is cost efficient as less data is scanned
    • Use Apache Parquet or ORC; data can be converted to these formats using AWS Glue
  • Compress data for smaller retrievals
  • Partition datasets in S3 for easy querying on virtual columns
  • Use larger files (greater than 128MB) to minimize overhead
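
A minimal sketch, assuming hypothetical names, of applying these tips with a CTAS query that rewrites CSV data as partitioned, Snappy-compressed Parquet so later queries scan less data:

  import boto3

  athena = boto3.client("athena")

  # Create a new Parquet table from an existing CSV-backed table
  ctas = """
  CREATE TABLE logs.elb_logs_parquet
  WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://example-bucket/elb-logs-parquet/',
    partitioned_by = ARRAY['year']
  ) AS
  SELECT elb_status_code, request_url, year FROM logs.elb_logs
  """

  athena.start_query_execution(
      QueryString=ctas,
      ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
  )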

Athena Federated Query

  • Run SQL queries across data stored in various data sources (cloud or on-premises)
  • Uses data source connectors running on Lambda to execute federated queries
  • The results are stored back in S3
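
A minimal sketch, assuming a hypothetical connector already registered in Athena as the catalog "mysql_catalog", of running a federated query with three-part naming (catalog.database.table):

  import boto3

  athena = boto3.client("athena")

  # The Lambda-based connector is invoked behind the scenes; results go to S3
  athena.start_query_execution(
      QueryString='SELECT * FROM "mysql_catalog"."sales"."orders" LIMIT 10',
      ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
  )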

Redshift

  • For OLAP (online analytical processing), not for OLTP
  • 10x better performance than other data warehouses and can scale to PBs of data
  • Uses a columnar storage structure instead of rows and uses a parallel query engine
  • There are two modes: Provisioned cluster or serverless cluster
  • Integrated with BI tools such as Quicksight and Tableau
  • Compared to Athena, it has faster queries, joins and aggregations due to indexes
  • Cluster:
    • Leader node for query planning and aggregating results
    • Compute node for performing queries and sending results to the leader node
    • Provisioned mode involves selecting an instance type; reserved instances can be used for cost savings
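
A minimal sketch, assuming a hypothetical provisioned cluster and database, of running an analytical query through the Redshift Data API:

  import boto3

  redshift_data = boto3.client("redshift-data")

  # Submit a query to the cluster; the leader node plans it, compute nodes execute it
  resp = redshift_data.execute_statement(
      ClusterIdentifier="example-cluster",
      Database="analytics",
      DbUser="awsuser",
      Sql="SELECT region, SUM(amount) FROM sales GROUP BY region",
  )
  print(resp["Id"])  # poll describe_statement / get_statement_result with this Id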

Redshift Snapshots and Disaster Recovery

  • Multi-AZ mode is available for some clusters
  • Point-in-time snapshots are stored internally in S3
    • Snapshots are incremental, storing only what has changed since the last snapshot
  • Snapshots can be restored into a new cluster
  • Snapshots are automated, occurring every 8 hours, every 5GB of data change or on a schedule, with a retention period of 1 to 35 days
    • Manual snapshots are retained until you delete them
  • Redshift can be configured to automatically copy snapshots of a cluster to another region
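
A minimal sketch, assuming hypothetical identifiers, of enabling automatic cross-region snapshot copy for disaster recovery:

  import boto3

  redshift = boto3.client("redshift", region_name="us-east-1")

  # Copy the cluster's snapshots to another region and keep them for 7 days
  redshift.enable_snapshot_copy(
      ClusterIdentifier="example-cluster",
      DestinationRegion="eu-west-1",
      RetentionPeriod=7,
  )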

OpenSearch

  • Can search any field, including partial matches, unlike DynamoDB which only queries by primary keys or indexes
  • Typically used as a complement to another database
  • There are two modes: Managed cluster or serverless cluster
  • Doesn't support SQL natively (can be enabled via a plugin)
  • Ingestion from Kinesis Data Firehose, AWS IoT and CloudWatch Logs
  • Security via Cognito, IAM, KMS, TLS
  • Has OpenSearch Dashboards for visualization
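
A minimal sketch, assuming a hypothetical domain endpoint, index and credentials, of a partial-match search using the opensearch-py client:

  from opensearchpy import OpenSearch

  client = OpenSearch(
      hosts=[{"host": "search-example-domain.us-east-1.es.amazonaws.com", "port": 443}],
      http_auth=("user", "password"),
      use_ssl=True,
  )

  # Wildcard query: matches documents whose "name" field starts with "lap"
  results = client.search(
      index="products",
      body={"query": {"wildcard": {"name": "lap*"}}},
  )
  print(results["hits"]["total"])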

EMR

  • Creates Hadoop clusters (big data) to analyze and process vast amounts of data
  • Clusters can be made of hundreds of EC2 instances
  • EMR is bundled with Apache Spark, HBase, Presto, Flink, etc. and takes care of all provisioning and configuration
  • Has auto-scaling and integration with Spot instances
  • Useful for data processing, machine learning, web indexing and big data

EMR Node Types

  • Master: Manages the cluster, coordinates tasks and monitors health
  • Core: Runs tasks and stores data
  • Task (optional): Just runs tasks
  • Purchasing options:
    • On-demand: Reliable, predictable and won't be terminated
    • Reserved (min. 1 year): Cost savings (EMR will automatically use them if available)
    • Spot instances: Cheaper but less reliable as they can be terminated at any time
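
A minimal sketch, assuming hypothetical names and the default EMR roles, of creating a cluster with master/core nodes on On-Demand and task nodes on Spot:

  import boto3

  emr = boto3.client("emr")

  emr.run_job_flow(
      Name="example-spark-cluster",
      ReleaseLabel="emr-6.15.0",
      Applications=[{"Name": "Spark"}],
      Instances={
          "InstanceGroups": [
              {"Name": "master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
               "InstanceCount": 1, "Market": "ON_DEMAND"},
              {"Name": "core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge",
               "InstanceCount": 2, "Market": "ON_DEMAND"},
              {"Name": "task", "InstanceRole": "TASK", "InstanceType": "m5.xlarge",
               "InstanceCount": 2, "Market": "SPOT"},  # cheaper, can be reclaimed
          ],
          "KeepJobFlowAliveWhenNoSteps": True,
      },
      JobFlowRole="EMR_EC2_DefaultRole",
      ServiceRole="EMR_DefaultRole",
  )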

Quicksight

  • A serverless ML-powered business intelligence service to create interactive dashboards
  • A dashboard is a read-only snapshot of an analysis you can share (configuration of analysis is preserved)
    • Dashboards must be published before they can be shared
    • It can be shared with Users (standard) or Groups (enterprise), these exist within Quicksight, not IAM
  • Fast, automatically scalable and embeddable, with per-session pricing
  • Integrated with RDS, Aurora, Athena, Redshift, S3, etc.
  • In-memory computation using SPICE engine if data is imported into Quicksight
  • With the enterprise edition, you can set up column-level security
  • Useful for business analytics, visualizations, ad-hoc analysis and getting business insights using data
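
A minimal sketch, assuming hypothetical account, user and dashboard IDs, of generating an embed URL for a published dashboard:

  import boto3

  qs = boto3.client("quicksight")

  resp = qs.generate_embed_url_for_registered_user(
      AwsAccountId="123456789012",
      UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
      ExperienceConfiguration={"Dashboard": {"InitialDashboardId": "example-dashboard-id"}},
  )
  print(resp["EmbedUrl"])  # URL to embed the dashboard in another application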

Glue

  • Fully managed extract, transform and load service
  • Useful to prepare and transform data for analytics
  • Glue Job Bookmarks: Prevent re-processing old data (see the sketch after this list)
  • Glue Elastic Views:
    • Combine and replicate data across multiple data stores using SQL
    • No custom code, Glue monitors for changes in the source data
    • Leverages a virtual table, which is just a materialized view
  • Glue DataBrew: Clean and normalize data using pre-built transformations
  • Glue Studio: New GUI to create, run and monitor ETL jobs in Glue
  • Glue Streaming ETL: Compatible with Kinesis Data Streams, Apache Kafka and MSK
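
A minimal sketch, assuming a hypothetical job name, of starting a Glue ETL job run with job bookmarks enabled so already-processed data is skipped:

  import boto3

  glue = boto3.client("glue")

  glue.start_job_run(
      JobName="example-etl-job",
      Arguments={"--job-bookmark-option": "job-bookmark-enable"},
  )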

Lake Formation

  • A fully managed service built on top of AWS Glue that makes it easy to set up a data lake in days
  • Data lakes are a central location to have all your data (structured and unstructured) for analytical purposes
  • You can discover, clean, transform and ingest data in your data lake
  • Many complex steps are automated, such as collecting, cleansing, moving, cataloguing and de-duplicating data
  • Out-of-the-box source blueprints for S3, RDS, relational and NoSQL databases
  • Fine-grained access control for your apps (row and column level)
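
A minimal sketch, assuming hypothetical role, database, table and column names, of granting column-level access on a data lake table:

  import boto3

  lf = boto3.client("lakeformation")

  # The analyst role can SELECT only the listed (non-sensitive) columns
  lf.grant_permissions(
      Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
      Resource={
          "TableWithColumns": {
              "DatabaseName": "sales_db",
              "Name": "orders",
              "ColumnNames": ["order_id", "order_date", "total"],
          }
      },
      Permissions=["SELECT"],
  )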

Kinesis Data Analytics (SQL)

  • A fully managed service providing real-time analytics on Kinesis Data Streams and Firehose using SQL
  • Can add reference data from S3 to enrich streaming data
  • Has automatic scaling; pay based on the consumption rate
  • Output:
    • Data Streams: Create streams out of the real-time analytic queries
    • Data Firehose: Send analytics query results to destinations
  • Useful for timeseries analytics, real-time dashboards, real-time metrics

Kinesis Data Analytics (Apache Flink)

  • Use Flink (Java, Scala or SQL) to process and analyze streaming data
  • Run any Apache Flink app on a managed cluster on AWS
    • Provision compute resources, parallel computation with automatic scaling
    • App backups (checkpoints and snapshots)
    • Use any Apache Flink programming features
    • Flink doesn't read from Firehose (use Kinesis Analytics for SQL instead)
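
A minimal sketch, assuming a hypothetical application name, of starting a managed Flink application with the Kinesis Analytics v2 API:

  import boto3

  kda = boto3.client("kinesisanalyticsv2")

  kda.start_application(
      ApplicationName="example-flink-app",
      RunConfiguration={"FlinkRunConfiguration": {"AllowNonRestoredState": True}},
  )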

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

  • A fully managed Apache Kafka implementation on AWS
    • An alternative to Amazon Kinesis
  • Features:
    • Create, update and delete clusters
    • MSK creates and manages Kafka broker nodes and Zookeeper nodes for you
    • Deploy the MSK cluster in a VPC with Multi-AZ
    • Automatic recovery from common Kafka failures
    • Data is stored on EBS volumes for as long as you want
  • MSK Serverless
    • Run Kafka on MSK without managing capacity
    • MSK automatically provisions resources and scales compute and storage
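
A minimal sketch, assuming hypothetical broker endpoints, of producing a message to an MSK topic over TLS with a standard Kafka client (MSK is wire-compatible with Apache Kafka):

  from kafka import KafkaProducer

  producer = KafkaProducer(
      bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9094"],
      security_protocol="SSL",  # TLS in-flight encryption
  )
  producer.send("orders", b'{"order_id": 123}')
  producer.flush()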

Kinesis Data Streams vs. MSK

  • Kinesis Data Streams
    • 1MB message size limit
    • Data streams with shards
    • Shard splitting and merging
    • TLS in-flight encryption
    • KMS at-rest encryption
  • MSK
    • 1MB message size by default, can be configured higher (e.g., 10MB)
    • Kafka topics with partitions
    • Can only add partitions to a topic
    • PLAINTEXT or TLS in-flight encryption
    • KMS at-rest encryption