Databases
-
There are limitations to storing data on a disk (EFS, EBS, EC2 Instance Store, S3)
-
Databases are optimised for a purpose and come with different features, shapes and constraints
-
Databases allow you to:
- Structure data
- Build indexes to efficiently query / search
- Define relationships between datasets
-
Relational Databases
- Data is organised in tables of rows and columns, similar in appearance to Excel spreadsheets, with links between tables
- Queries can be performed with SQL
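
The points above (structured tables, indexes, relationships and SQL queries) can be sketched locally with Python's built-in `sqlite3` module, used here only as a stand-in for a relational engine. The `customers` and `orders` tables, their columns and the index name are made up for this example.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for a relational engine
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    -- Index to make lookups of orders by customer efficient
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.5), (3, 2, 12.0)],
)

# The join follows the relationship between the two tables
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)
```

The same SQL would look broadly similar on the engines offered by RDS, although data types and dialect details differ between engines.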
-
NoSQL Databases
- NoSQL means non-SQL, i.e. non-relational, databases
- Built for specific data models and have flexible schemas for creating modern applications
- Benefits:
- Flexible (simple to evolve data model)
- Scalable (designed to scale-out using distributed clusters)
- High-performance (optimised for specific data model)
- Highly functional (types optimised for data model)
- A common form of data that fits into a NoSQL model is JSON
- Data can be nested
- Fields can change
- Support for different types of data
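
A small sketch of why JSON suits a flexible schema, using only Python's `json` module. The `users` documents and their field names are hypothetical; the point is that two items in the same collection can have different, nested shapes.

```python
import json

# Two items in the same hypothetical "users" collection with different shapes
docs = [
    {"id": "u1", "name": "Ana", "address": {"city": "Lisbon", "zip": "1000"}},
    {"id": "u2", "name": "Ben", "tags": ["admin", "beta"], "age": 31},
]

# Round-trip through JSON text, as a document store would receive it
restored = [json.loads(json.dumps(d)) for d in docs]

# Missing fields are handled by the reader, not rejected by a fixed schema
cities = [d.get("address", {}).get("city") for d in restored]
print(cities)
```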
Databases & Shared Responsibility on AWS
- Benefits of using AWS managed databases:
- Quick provisioning
- High availability
- Vertical and horizontal scaling
- Automated backup & restore, operations, upgrades
- OS patching
- Monitoring, alerting
- Databases can be run on EC2, but then you lose these benefits and have to handle them yourself, which takes more effort
Amazon RDS
-
Relational Database Service (RDS): A managed service for relational databases that use SQL as their query language
- Create databases in the cloud managed by AWS
- PostgreSQL
- MySQL
- MariaDB
- Microsoft SQL Server
- Oracle
- IBM Db2
- Aurora
-
Advantages of RDS over deploying database on EC2:
- Automated provisioning
- OS patching
- Continuous backups and restoration at specific timestamps
- Monitoring dashboards
- Read replicas improve read performance
- Multiple AZ setup for disaster recovery
- Maintenance windows for upgrades
- Vertical and horizontal scaling capabilities
- Storage backed by EBS
-
However, you can't SSH into the RDS database instance
Amazon Aurora
-
Aurora: A proprietary technology from AWS
- Aurora DB supports PostgreSQL and MySQL
- It's "AWS cloud optimised" and claims:
- 5x performance improvement over MySQL on RDS
- 3x performance improvement over PostgreSQL on RDS
- Storage dynamically grows up to 128 TB, in increments of 10 GB
- Costs about 20% more than RDS but is more efficient
- Not available in free tier
-
Amazon Aurora Serverless option
- Automated database instantiation and auto-scaling based on actual usage (no need to plan for capacity)
- Also supports PostgreSQL and MySQL
- No servers means no management overhead
- Billed per second, which can be more cost-efficient
- Use cases: Good for infrequent, intermittent or unpredictable workloads
RDS Deployments
-
Read Replicas: Scale the read workload of your database
- Create up to 15 Read Replicas
- Data written only to main database, not to any replicas
-
Multi-AZ: Failover in case of AZ outage
- Data read / written only to main database
- Can only have 1 other AZ as failover
-
Multi-Region: Disaster recovery in case of region issue
- Applications in different regions have better read performance since they read from a local database (low latency)
- Replicating the database across regions over the network adds replication cost
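
The read replica idea can be illustrated with a toy model. This is not an AWS API: `ReplicatedDatabase` and its methods are hypothetical, and `replicate()` stands in for the asynchronous replication that copies data from the main database to its replicas.

```python
import itertools

class ReplicatedDatabase:
    """Toy model: one primary plus read replicas (not an AWS API)."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Applications write only to the primary
        self.primary[key] = value

    def replicate(self):
        # Stand-in for asynchronous replication from primary to replicas
        for replica in self.replicas:
            replica.update(self.primary)

    def read(self, key):
        # Reads are spread across replicas in round-robin order
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)

db = ReplicatedDatabase(replica_count=2)
db.write("user:1", "Ana")
stale = db.read("user:1")   # replication has not run yet
db.replicate()
fresh = db.read("user:1")
print(stale, fresh)
```

The first read returns nothing because the replica has not caught up yet, which is why reads from replicas are described as eventually consistent.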
Amazon ElastiCache
- ElastiCache is a managed service for in-memory data stores compatible with Redis or Memcached
- Caches are in-memory databases with high performance and low latency
- The goal is to take load off databases for read-intensive workloads
- AWS takes care of patching, maintenance, recovery, backups, monitoring, etc.
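
One common way to use a cache in front of a database is the cache-aside (lazy loading) pattern. The notes do not name a specific strategy, so treat this as one possible sketch: the `SlowDatabase` class and `cached_get` function are hypothetical, and a plain dict stands in for Redis or Memcached.

```python
class SlowDatabase:
    """Hypothetical backing database that counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.query_count = 0

    def get(self, key):
        self.query_count += 1
        return self.rows.get(key)

def cached_get(cache, db, key):
    # Cache hit: serve from memory without touching the database
    if key in cache:
        return cache[key]
    # Cache miss: read from the database, then populate the cache
    value = db.get(key)
    cache[key] = value
    return value

db = SlowDatabase({"product:1": "keyboard"})
cache = {}  # stands in for Redis or Memcached
results = [cached_get(cache, db, "product:1") for _ in range(3)]
print(results, db.query_count)
```

Three reads reach the database only once. A real setup would also need an expiry or invalidation strategy so that cached values do not go stale.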
DynamoDB
-
DynamoDB: A fully managed and highly available database with replication across 3 AZ
- Not a relational database (NoSQL)
- Scales to massive workloads, distributed "serverless" database
- Millions of requests per second, hundreds of TB of storage
- Fast and consistent in performance
- Low latency retrieval, typically single-digit milliseconds
- Integrated with IAM for security, authorisation and administration
- Low cost and auto-scaling capabilities
- Standard & Infrequent Access Table Class
-
DynamoDB is a key/value database
- Primary key: Made up of partition key and sort key (optional)
- Attributes: All other fields of the row (item)
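
The key structure can be pictured with a toy in-memory table. `KeyValueTable` is a hypothetical class written for this sketch, not the DynamoDB API; the key values such as `user#1` are also made up.

```python
from collections import defaultdict

class KeyValueTable:
    """Toy table keyed by (partition key, sort key); not the DynamoDB API."""

    def __init__(self):
        # partition key -> {sort key -> attributes}
        self._partitions = defaultdict(dict)

    def put_item(self, partition_key, sort_key, attributes):
        self._partitions[partition_key][sort_key] = attributes

    def get_item(self, partition_key, sort_key):
        return self._partitions.get(partition_key, {}).get(sort_key)

    def query(self, partition_key):
        # All items in one partition, ordered by sort key
        items = self._partitions.get(partition_key, {})
        return [(sk, items[sk]) for sk in sorted(items)]

table = KeyValueTable()
table.put_item("user#1", "order#002", {"total": 12})
table.put_item("user#1", "order#001", {"total": 30, "gift": True})
table.put_item("user#2", "order#001", {"total": 7})
orders = table.query("user#1")
print(orders)
```

Items sharing a partition key are grouped together and returned in sort key order, and each item can carry a different set of attributes.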
-
DynamoDB Accelerator (DAX): A fully managed in-memory cache for DynamoDB
- Microsecond latency retrieval, about 10x faster than standard retrieval
- Secure, scalable and highly available
- DAX is integrated and only used for DynamoDB, while ElastiCache can be used for other databases
-
Global tables
- Make a DynamoDB table accessible in multiple regions with low latency
- Active-Active replication (read / write to any AWS Region)
Redshift
-
Redshift: A database based on PostgreSQL, but not used for online transaction processing (OLTP)
- It's online analytical processing (OLAP), which is used for analytics and data warehousing
- Data is typically loaded in batches (for example hourly) rather than written continuously every second
- Claims 10x better performance than other data warehouses, scaling to petabytes of data
- Column-based (columnar) storage rather than row-based
- Uses Massively Parallel Query Execution (MPP) engine to perform computations very quickly
- Pay as you go based on instances provisioned
- Provides an SQL interface for performing queries
- Integrated with Business Intelligence (BI) tools such as Amazon QuickSight or Tableau for creating dashboards
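
To see why columnar storage helps analytical queries, the sketch below stores the same hypothetical sales data in a row layout and a column layout, then sums one column. Counting "fields read" is a deliberately simplified model of I/O, not how Redshift measures anything.

```python
# The same hypothetical sales data in row-based and column-based layouts
rows = [
    {"order_id": 1, "region": "EU", "amount": 40},
    {"order_id": 2, "region": "US", "amount": 25},
    {"order_id": 3, "region": "EU", "amount": 10},
]
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: every field of every row is touched to sum one column
row_fields_read = sum(len(row) for row in rows)
row_total = sum(row["amount"] for row in rows)

# Column layout: only the "amount" column is read
column_fields_read = len(columns["amount"])
column_total = sum(columns["amount"])

print(row_total, column_total, row_fields_read, column_fields_read)
```

Both layouts give the same total, but the column layout touches a third of the values in this example. The gap widens as tables gain more columns.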
-
Redshift Serverless option
- Automatically provisions and scales data warehouse underlying capacity
- Run analytics workloads without managing data warehouse infrastructure
- Pay only for what you use, more cost-efficient
- Use cases: Reporting, dashboarding applications, real-time analytics
Amazon EMR
- Elastic MapReduce (EMR): A managed cluster platform for creating Hadoop clusters (Big Data)
- Hadoop clusters are used to analyse and process vast amounts of data
- Clusters can be made of hundreds of EC2 instances
- Supports Apache Spark, HBase, Presto, Flink
- EMR takes care of all provisioning and configuration
- Auto-scaling and integrated with Spot instances
- Use cases: Data processing, machine learning, web indexing, big data
Amazon Athena
- Athena: A serverless query service to perform analytics against S3 objects
- Uses standard SQL language to query files
- Supports CSV, JSON, ORC, Avro and Parquet
- Pricing is around $5 per TB of data scanned
- Use compressed or columnar data for saving scanning costs
- Use cases: Business intelligence, analytics, reporting, analyse & query VPC Flow Logs, ELB Logs, CloudTrail trails
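
Because pricing depends on data scanned, a rough estimate shows why compressed or columnar formats save money. The $5 per TB figure comes from the notes above and may be out of date. The data sizes, the 8x reduction and the use of binary terabytes are assumptions made purely for illustration.

```python
PRICE_PER_TB_USD = 5.0  # approximate figure from the notes; check current pricing
BYTES_PER_TB = 1024**4  # assumption: binary terabytes

def scan_cost_usd(bytes_scanned):
    """Estimated query cost for a given number of bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD

raw_csv_bytes = 2 * BYTES_PER_TB     # hypothetical 2 TB of uncompressed CSV
columnar_bytes = raw_csv_bytes // 8  # assumed 8x less data scanned, illustrative only

raw_cost = scan_cost_usd(raw_csv_bytes)
columnar_cost = scan_cost_usd(columnar_bytes)
print(raw_cost, columnar_cost)
```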
Amazon QuickSight
- QuickSight: A serverless, machine-learning-powered business intelligence service to create interactive dashboards
- Fast, scales automatically, embeddable
- Pricing is per session
- Integrated with RDS, Aurora, Athena, Redshift, S3, etc.
- Use cases: Business analytics, building visualisations, perform ad-hoc analysis, get business insights using data
DocumentDB
- DocumentDB: A fully managed proprietary NoSQL database service that supports document data structures
- Just as Aurora is an "AWS implementation" of PostgreSQL and MySQL...
- ...DocumentDB is the equivalent for MongoDB, which is a NoSQL database
- MongoDB is used to store, query and index JSON data
- DocumentDB has similar deployment concepts as Aurora
- Highly available with replication across 3 AZ
- Storage automatically grows in increments of 10 GB
- Automatically scales to workloads with millions of requests per second
Amazon Neptune
- Neptune: A fully managed graph database
- Ex. a graph dataset could be a social network, including users, posts, comments, likes, shares, etc.
- Highly available across 3 AZ, with up to 15 Read Replicas
- Used to build and run applications with highly connected datasets
- Optimised to run complex queries
- Store up to billions of relations and query the graph with milliseconds of latency
- Use cases: Knowledge graphs, fraud detection, recommendation engines, social networking
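
A query over a highly connected dataset often means following relationships several hops deep. The sketch below does this with a plain breadth-first search in Python. The `follows` graph and the `within_hops` function are hypothetical, and a real graph database would express this in a graph query language rather than hand-written traversal code.

```python
from collections import deque

# Hypothetical "follows" edges in a small social graph
follows = {
    "ana": ["ben", "cho"],
    "ben": ["dev"],
    "cho": ["dev", "eli"],
    "dev": [],
    "eli": ["ana"],
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: users reachable from start in at most max_hops edges."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distances[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(user, []):
            if neighbour not in distances:
                distances[neighbour] = distances[user] + 1
                queue.append(neighbour)
    del distances[start]
    return distances

reachable = within_hops(follows, "ana", 2)
print(reachable)
```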
Amazon Timestream
- Timestream: A fully managed, fast, scalable and serverless time series database
- Automatically scales up and down to adjust capacity
- Store and analyse trillions of events per day
- Claimed to be thousands of times faster than relational databases for time series workloads
- Claimed to cost about 10% as much as relational databases
- Has built-in time series analytics functions which can help you identify patterns in your data in near real-time
Amazon QLDB
- Quantum Ledger Database (QLDB): A fully managed ledger database of financial transactions
- A ledger is a book recording financial transactions
- Serverless and highly available with replication across 3 AZ
- Used to review history of all the changes made to your application data over time
- It's an immutable system, which means no entry can be removed or modified. Changes are also cryptographically verifiable, so you can check that no financial transactions have disappeared
- Under the hood, there's a QLDB journal containing a sequence of modifications
- Each time a modification is made, a cryptographic hash is computed to verify that nothing has been deleted or modified. This can be verified by anyone using the database
- 2-3x better performance than common ledger blockchain frameworks
- Manipulate data using SQL
- There's no decentralisation component, in accordance with financial regulation rules, unlike Amazon Managed Blockchain
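
The journal and hash idea can be illustrated with a simple hash chain built on Python's `hashlib`. This is a generic sketch of the technique, not QLDB's actual journal format or API. The function names and the entry layout are assumptions.

```python
import hashlib
import json

def entry_hash(previous_hash, data):
    # Each hash covers the entry's data and the previous entry's hash
    payload = json.dumps({"prev": previous_hash, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(journal, data):
    previous_hash = journal[-1]["hash"] if journal else ""
    journal.append({"data": data, "hash": entry_hash(previous_hash, data)})

def verify(journal):
    # Recompute every hash in order; any edit or deletion breaks the chain
    previous_hash = ""
    for entry in journal:
        if entry["hash"] != entry_hash(previous_hash, entry["data"]):
            return False
        previous_hash = entry["hash"]
    return True

journal = []
append(journal, {"account": "A", "change": 100})
append(journal, {"account": "A", "change": -30})
valid_before = verify(journal)

journal[0]["data"]["change"] = 1000  # tamper with an earlier entry
valid_after = verify(journal)
print(valid_before, valid_after)
```

Because each hash depends on the one before it, changing or removing an earlier entry invalidates the chain from that entry onwards.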
Amazon Managed Blockchain
- Blockchain enables us to build applications where multiple parties can execute transactions without needing a trusted, central authority
- Managed Blockchain: A managed service to join public blockchain networks or create your own scalable private network
- Compatible with frameworks such as Hyperledger Fabric and Ethereum
AWS Glue
- Glue: A managed extract, transform and load (ETL) service
- Useful for preparing and transforming data for analytics
- Fully serverless service
- Glue Data Catalog: A catalog of datasets in your AWS infrastructure
- Used by Athena, Redshift, EMR
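
As a minimal picture of what extract, transform and load means, the sketch below runs the three steps on a tiny CSV using only the Python standard library. It does not use Glue itself: the sample data, column names and cleaning rules are made up for illustration.

```python
import csv
import io

raw = "order_id,amount,currency\n1, 40.50 ,eur\n2,,usd\n3,10,USD\n"

# Extract: parse the raw CSV text
records = list(csv.DictReader(io.StringIO(raw)))

# Transform: drop rows without an amount, normalise types and casing
cleaned = [
    {
        "order_id": int(r["order_id"]),
        "amount": float(r["amount"]),
        "currency": r["currency"].strip().upper(),
    }
    for r in records
    if r["amount"].strip()
]

# Load: write the cleaned rows to a new CSV "target"
target = io.StringIO()
writer = csv.DictWriter(target, fieldnames=["order_id", "amount", "currency"])
writer.writeheader()
writer.writerows(cleaned)
print(target.getvalue())
```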
DMS
- Database Migration Service (DMS): A service to quickly and securely migrate databases to AWS
- Resilient and self-healing
- The source database remains available during the migration
- Supports:
- Homogeneous migrations: Same database technology for source and destination
- Ex. Oracle to Oracle
- Heterogeneous migrations: Different database technology for source and destination
- Ex. Microsoft SQL Server to Aurora