Databases & Analytics
Types
- Relational DB:
- Tables that are linked together.
- Use SQL to perform queries.
- NoSQL DB:
- Non-relational database.
- Built for a specific data model and have flexible schemas.
- Benefits:
- Flexibility: easy to evolve data model.
- Scalability: designed to scale-out by using distributed clusters.
- High-performance: optimized for a specific data model.
- Highly functional: types optimized for the data model.
- A common example is to model a NoSQL database with JSON.
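As a rough illustration of the two models above (not AWS-specific), the sketch below stores the same order once as linked tables queried with SQL, using Python's built-in sqlite3 module, and once as a single JSON document. The table names, fields, and data are made up for the example.

```python
import json
import sqlite3

# Relational model: two tables linked by a foreign key, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER "
    "REFERENCES customers(id), item TEXT)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'keyboard')")
relational_row = conn.execute(
    "SELECT c.name, o.item FROM orders o JOIN customers c ON c.id = o.customer_id"
).fetchone()

# Document (NoSQL) model: the same data as one self-contained JSON document.
document = {"order_id": 10, "customer": {"id": 1, "name": "Ana"}, "item": "keyboard"}
document["gift_wrap"] = True  # flexible schema: add an attribute without a migration
document_json = json.dumps(document)
```

The relational version needs a JOIN to reassemble the order, while the document version keeps everything in one record and can gain new fields per document.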
Shared Responsibility on AWS
- AWS offers to manage different databases.
- Benefits include:
- Quick provisioning, high availability, vertical and horizontal scaling.
- Automated backup and restore, operations, upgrades.
- Operating system patching is handled by AWS.
- Monitoring and alerting are integrated.
- Many database technologies can be run on an EC2 instance instead, but then you manage the database yourself, which puts much more responsibility on you.
Amazon RDS
- Relational Database Service.
- Use SQL for its query language.
- Free tier on AWS.
- Allows you to create databases in the cloud that are managed by AWS: PostgreSQL, MySQL, MariaDB, Oracle, Microsoft SQL Server, IBM Db2.
- Advantages of using this service over running your own database on EC2:
- It's a managed service.
- Automated provisioning.
- Continuous backups and restore to specific timestamp (Point in Time Restore).
- Monitoring dashboards.
- Read replicas for improved read performance.
- Multi AZ setup for Disaster Recovery (DR).
- Maintenance windows for upgrades.
- Scale horizontally and vertically.
- Storage backed by EBS.
- Disadvantage: You can't SSH into your instances.
- Deployment Options:
- Read Replicas:
- Scale the read workload of your DB.
- Can create up to 15 read replicas.
- Data is only written to the main DB.
- Multi-AZ:
- Failover in case of AZ outage, high availability.
- Multi-region:
- Multi-region read replicas.
- Replicas only serve reads; all writes still go to the one main DB.
- Disaster recovery in case of region issue.
- Local performance for global reads.
- Replication cost.
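The read replica idea above can be sketched as a toy model: one primary takes every write, and reads are spread across replicas. This is only an illustration of the routing concept, not how RDS is implemented; the class name, the round-robin policy, and the explicit replicate() step are assumptions made for the example.

```python
import itertools

class ReplicatedDatabase:
    """Toy model: one primary takes all writes, replicas serve reads."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Writes only ever go to the primary.
        self.primary[key] = value

    def replicate(self):
        # Stand-in for replication: copy the primary's state to every replica.
        for replica in self.replicas:
            replica.clear()
            replica.update(self.primary)

    def read(self, key):
        # Reads are spread round-robin across replicas to offload the primary.
        index = next(self._next_replica)
        return index, self.replicas[index].get(key)
```

A read issued before replicate() runs returns None for a freshly written key, which mirrors the replication lag an application may see when reading from a replica.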
Amazon Aurora
- Not open source; it is a proprietary technology from AWS.
- PostgreSQL and MySQL are both supported as Aurora DB engines.
- Aurora is AWS cloud optimized and claims 5x performance improvement over MySQL on RDS, over 3x performance of Postgres on RDS.
- Aurora storage automatically grows.
- Costs more than RDS, but is more cost-effective.
- Not in free tier on AWS.
Amazon Aurora Serverless
- Automated database instantiation and auto-scaling based on actual usage.
- PostgreSQL and MySQL are both supported as Aurora Serverless DB.
- No capacity planning needed.
- Least management overhead.
- Pay per second, can be more cost-effective.
- Good for infrequent, intermittent or unpredictable workloads.
Amazon ElastiCache
- A managed Redis or Memcached DB.
- Caches are in-memory databases with high performance, low latency.
- Helps reduce the load off databases for read intensive workloads.
- AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backups.
- Solution Architecture - Cache:
- ELB goes to an EC2 instance which reads and writes data from RDS.
- The EC2 instance can also read and write from an ElastiCache cluster, which will be much faster.
- Good for taking load off the RDS.
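One common way to wire up the architecture above is the cache-aside pattern, sketched below with plain dictionaries standing in for RDS and ElastiCache. The notes do not say which caching strategy is used, so treat cache-aside, the class name, and the invalidate-on-write choice as assumptions.

```python
class CacheAsideStore:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, database):
        self.database = database  # dict standing in for RDS
        self.cache = {}           # dict standing in for ElastiCache
        self.database_reads = 0

    def get(self, key):
        if key in self.cache:          # cache hit: no database work
            return self.cache[key]
        self.database_reads += 1       # cache miss: read from the database
        value = self.database[key]
        self.cache[key] = value        # populate the cache for next time
        return value

    def put(self, key, value):
        self.database[key] = value     # write to the database
        self.cache.pop(key, None)      # invalidate any stale cached copy
```

Repeated reads of the same key hit the database only once, which is the "taking load off RDS" effect described above.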
DynamoDB
- Part of the NoSQL family.
- Scales to massive workloads, distributed "serverless" database (does have servers but they're in the backend).
- Fast and consistent performance.
- Single digit millisecond latency.
- Integrated with IAM for security, authorization and admin.
- Low cost and auto scaling capabilities.
- Standard & Infrequent Access Table Class.
- Type of Data:
- Is a key/value DB consisting of attributes.
- Each attribute has a name (key), a value and a type.
- DynamoDB Accelerator - DAX:
- Fully managed in-memory cache for DynamoDB.
- 10x performance improvement.
- Secure, highly scalable & highly available.
- Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases.
- Global Tables:
- Makes DynamoDB accessible with low latency in multiple regions.
- Active-active replication: you can write to the table in any of the regions and the changes are replicated to the others.
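To make the "name, value and type" idea from the Type of Data notes concrete, here is a minimal sketch that converts a flat Python dict into DynamoDB's typed attribute-value JSON, where each attribute carries a type descriptor such as S (string), N (number) or BOOL. The helper name to_dynamodb_item is hypothetical and only these three types are handled.

```python
def to_dynamodb_item(record):
    """Convert a flat dict into DynamoDB's typed attribute-value format."""
    item = {}
    for name, value in record.items():
        if isinstance(value, bool):          # check bool first: bool is an int subclass
            item[name] = {"BOOL": value}
        elif isinstance(value, (int, float)):
            item[name] = {"N": str(value)}   # numbers are sent as strings
        elif isinstance(value, str):
            item[name] = {"S": value}
        else:
            raise TypeError(f"unsupported type for {name!r}: {type(value).__name__}")
    return item
```

For example, to_dynamodb_item({"user_id": "u1", "age": 30}) yields one typed entry per attribute, which is the shape the low-level DynamoDB API works with.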
Redshift
- Based on PostgreSQL, but it's not used for OLTP.
- It is OLAP - online analytical processing (analytics and data warehousing).
- Load data once every hour, not every second.
- 10x better performance than other data warehouses; scales to PBs of data.
- Columnar storage of data instead of row based.
- Massively parallel query execution (MPP), highly available.
- Pay as you go based on the instances provisioned.
- Has a SQL interface for performing the queries.
- Redshift Serverless:
- Automatically provisions and scales data warehouse underlying capacity.
- Run analytics workloads without managing data warehouse infrastructure.
- Pay only for what you use.
- Use cases: reporting, dashboard apps, real-time analytics.
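The columnar storage point above can be pictured with a small in-memory sketch. It is a conceptual illustration only, not Redshift's actual storage format, and the sample rows are invented.

```python
rows = [
    {"order_id": 1, "region": "eu", "amount": 20.0},
    {"order_id": 2, "region": "us", "amount": 35.0},
    {"order_id": 3, "region": "eu", "amount": 15.0},
]

# Row-based layout: each record is stored together (suits OLTP lookups).
# Summing one field still walks every whole row.
row_total = sum(row["amount"] for row in rows)

# Columnar layout: each column is stored together (suits OLAP scans).
columns = {name: [row[name] for row in rows] for name in rows[0]}
column_total = sum(columns["amount"])  # touches only the "amount" column
```

Both totals are the same, but the columnar version reads one list instead of every field of every record, which is why analytical aggregations favor it.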
Amazon EMR
- Elastic MapReduce
- Creates a Hadoop cluster (Big Data) to analyze and process vast amounts of data.
- The clusters can be made of hundreds of EC2 instances.
- Supports Apache Spark, HBase, Presto, and more.
- EMR takes care of all provisioning and configuration.
- Auto-scaling and integrated with Spot instances.
- Use Cases: data processing, machine learning, web indexing.
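For readers new to the MapReduce model behind the service name, here is a single-process word-count sketch of the map, shuffle and reduce phases. On a real Hadoop cluster these phases run in parallel across many machines; the function names here are illustrative and not part of any EMR or Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine each key's values into a single count.
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))
```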
Amazon Athena
- Serverless query service to perform analytics against S3 objects.
- Uses standard SQL language to query files.
- Supports CSV, JSON, ORC, Avro and Parquet (built on Presto).
- Pricing: $5 per TB of data scanned.
- Use compressed or columnar data to scan less data and save cost.
- Exam Tip: to analyze data in S3 using serverless SQL, use Athena.
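Because pricing is based on data scanned, a quick back-of-the-envelope calculation shows why compressed or columnar data saves money. The sketch uses the $5 per TB figure from these notes (check current pricing for your region) and assumes binary terabytes; the function name is made up for the example.

```python
PRICE_PER_TB_USD = 5.0      # figure quoted in these notes; may differ by region or date
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabytes

def athena_query_cost(bytes_scanned):
    """Estimated cost in USD for one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Under these assumptions, a query scanning 2 TB of raw CSV costs about $10, while the same query scanning only 0.5 TB of compressed columnar data costs about $2.50.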
Amazon QuickSight
- Serverless machine-learning powered business intelligence service to create interactive dashboards.
- Fast, automatically scalable, embeddable, with per-session pricing.
- Use cases: Business analytics, Building visualizations, perform ad-hoc analysis.
- Integrated with RDS, Aurora, Athena, Redshift and S3.
Document DB
- A NoSQL database built to be compatible with MongoDB.
- MongoDB is used to store, query, and index JSON data.
- Similar "deployment concepts" as Aurora.
- Fully managed, highly available with replication across 3 AZs.
- DocumentDB storage automatically grows in increments of 10 GB.
- Automatically scales to workloads with millions of requests per second.
Amazon Neptune
- Fully managed graph database.
- A popular example of a graph dataset is a social network.
- Highly available across 3 AZ, with up to 15 read replicas.
- Build and run applications working with highly connected datasets (optimized for these complex and hard queries.)
- High storage and low latency.
- Highly available with replications across multiple AZs.
- Great for knowledge graphs, fraud detection, recommendation engines, and social networking.
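The kind of "highly connected" query a graph database is built for can be sketched as a friends-of-friends lookup on a tiny social network. This plain-Python breadth-first search only illustrates the query shape; it is not Neptune's query language, and the graph data is invented.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Return every node reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen - {start}

friends = {
    "ana": ["bea", "cal"],
    "bea": ["ana", "dev"],
    "cal": ["ana"],
    "dev": ["bea", "eli"],
    "eli": ["dev"],
}
```

In a relational database each extra hop typically means another self-join, which is why this style of query is a natural fit for a graph engine.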
Amazon Timestream
- Fully managed, fast, scalable, serverless time series database
- Automatically scales up/down to adjust capacity
- Store and analyze trillions of events per day
- Up to 1,000 times faster and as little as 1/10th the cost of relational databases (AWS's claim).
- Built-in time series analytics functions (helps you identify patterns in your data in near real-time)
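A typical time series analytics function is a windowed aggregate. The sketch below averages readings per fixed time window in plain Python, as a rough picture of what such built-in functions compute; it does not use Timestream's query syntax, and the function name and sample points are assumptions.

```python
from collections import defaultdict

def average_per_window(points, window_seconds):
    """Group (timestamp, value) points into fixed windows and average each."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Timestamps in seconds, values from a hypothetical sensor.
points = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 20.0)]
```

Calling average_per_window(points, 60) groups the four readings into two one-minute windows keyed by their start time.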
Amazon QLDB
- Quantum Ledger Database.
- A ledger is a book recording financial transactions.
- Fully managed, serverless, highly available, replication across 3 AZs.
- Used to review history of all changes made to application over time.
- Immutable system: no entry can be removed or modified, cryptographically verifiable.
- 2-3x better performance than common ledger blockchain frameworks.
- Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules.
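The "immutable and cryptographically verifiable" idea can be illustrated with a simple hash chain, where each entry's hash covers the previous entry's hash. This is a generic simplification for intuition only, not QLDB's actual journal implementation; the Ledger class and its fields are hypothetical.

```python
import hashlib
import json

class Ledger:
    """Append-only journal where each entry's hash covers the previous hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(previous_hash, data):
        payload = json.dumps(data, sort_keys=True)
        return hashlib.sha256((previous_hash + payload).encode()).hexdigest()

    def append(self, data):
        previous_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry_hash = self._digest(previous_hash, data)
        self.entries.append({"data": data, "previous_hash": previous_hash, "hash": entry_hash})

    def verify(self):
        # Recompute every hash; an edited or removed entry breaks the chain.
        previous_hash = self.GENESIS
        for entry in self.entries:
            expected = self._digest(previous_hash, entry["data"])
            if entry["previous_hash"] != previous_hash or entry["hash"] != expected:
                return False
            previous_hash = entry["hash"]
        return True
```

Changing any past entry changes its recomputed hash, so verify() fails, which is the property that makes tampering detectable.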
Amazon Managed Blockchain
- Blockchain makes it possible to build apps where multiple parties can execute transactions without the need for a trusted, central authority.
- AMB is a managed service to join public blockchain networks or create your own scalable private network.
- Compatible with the frameworks Hyperledger Fabric and Ethereum.
AWS Glue
- Managed extract, transform, and load (ETL) service.
- Useful to prepare and transform data for analytics.
- Fully serverless service.
- Glue data catalog: catalog of datasets.
- Can be used by Athena, Redshift, EMR.
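The extract, transform, load flow can be sketched in a few lines of plain Python. This shows the ETL pattern only, not the Glue API; the sample CSV, the cleaning rules, and the list standing in for a warehouse table are all invented for illustration.

```python
import csv
import io

RAW_CSV = "order_id,amount,country\n1, 20.50 ,de\n2,,us\n3,15,US\n"

def extract(text):
    # Extract: parse raw CSV text into dict rows.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: drop rows with no amount, cast types, normalize country codes.
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(amount),
            "country": row["country"].strip().upper(),
        })
    return cleaned

def load(rows, target):
    # Load: append cleaned rows to the target and report how many were loaded.
    target.extend(rows)
    return len(rows)
```

Running load(transform(extract(RAW_CSV)), warehouse) with an empty list as warehouse leaves only the rows that survived cleaning, ready for analytics.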
Database Migration Service (DMS)
- Quickly and securely migrate databases to AWS, resilient, self healing.
- The source database remains available during the migration.
- Supports:
- Homogeneous migrations: example, Oracle to Oracle.
- Heterogeneous migrations: example, Microsoft SQL Server to Aurora.
Summary
- Relational Databases - OLTP: RDS & Aurora (SQL)
- Differences between Multi-AZ, Read Replicas, Multi-Region
- In-memory Database: ElastiCache
- Key/Value Database: DynamoDB (serverless) & DAX (cache for DynamoDB)
- Warehouse - OLAP: Redshift (SQL)
- Hadoop Cluster: EMR
- Athena: query data on Amazon S3 (serverless & SQL)
- QuickSight: dashboards on your data (serverless)
- DocumentDB: “Aurora for MongoDB” (JSON – NoSQL database)
- Amazon QLDB: Financial Transactions Ledger (immutable journal, cryptographically verifiable)
- Amazon Managed Blockchain: managed Hyperledger Fabric & Ethereum blockchains
- Glue: Managed ETL (Extract Transform Load) and Data Catalog service
- Database Migration: DMS
- Neptune: graph database
- Timestream: time-series database