Graduate Program KB

Databases

  • There are limitations to storing data on a disk (EFS, EBS, EC2 Instance Store, S3)

  • Databases are optimised for a purpose; they come with different features, shapes and constraints

  • Databases allow you to:

    • Structure data
    • Build indexes to efficiently query / search
    • Define relationships between datasets
  • Relational Databases

    • Data is organised in tables of rows and columns, similar in appearance to an Excel spreadsheet
    • Queries can be performed with SQL
  • NoSQL Databases

    • NoSQL means non-SQL, also called non-relational databases
    • Built for specific data models and have flexible schemas for creating modern applications
    • Benefits:
      • Flexible (simple to evolve data model)
      • Scalable (designed to scale-out using distributed clusters)
      • High-performance (optimised for specific data model)
      • Highly functional (types optimised for data model)
    • A common form of data that fits into a NoSQL model is JSON
      • Data can be nested
      • Fields can change
      • Support for different types of data
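
The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under assumptions: SQLite (from the standard library) stands in for a relational database, plain JSON documents stand in for a NoSQL store, and the table names, fields and sample data are all made up.

```python
import json
import sqlite3

# Relational side: a fixed schema, an index, and a relationship between tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE enrolments (
        student_id INTEGER REFERENCES students(id),
        course TEXT NOT NULL
    );
    CREATE INDEX idx_enrolments_course ON enrolments(course);
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Linus');
    INSERT INTO enrolments VALUES (1, 'Databases'), (2, 'Databases'), (1, 'Networking');
""")

def students_in(course):
    """Join the two tables with SQL and return the student names, sorted."""
    rows = conn.execute(
        "SELECT s.name FROM students s "
        "JOIN enrolments e ON e.student_id = s.id "
        "WHERE e.course = ? ORDER BY s.name",
        (course,),
    ).fetchall()
    return [name for (name,) in rows]

# NoSQL side: JSON documents can nest data and differ in shape from item to item.
documents = [
    json.loads('{"name": "Ada", "courses": ["Databases", "Networking"]}'),
    json.loads('{"name": "Linus", "courses": ["Databases"], '
               '"contact": {"email": "linus@example.com"}}'),
]

def names_with_field(docs, field):
    """Return the names of documents that carry an optional field."""
    return [doc["name"] for doc in docs if field in doc]

print(students_in("Databases"))
print(names_with_field(documents, "contact"))
```

The relational half needs the schema declared up front, while the second document simply gains a nested "contact" field without any migration.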

Databases & Shared Responsibility on AWS

  • Benefits of AWS offering different manageable databases:
    • Quick provisioning
    • High availability
    • Vertical and horizontal scaling
    • Automated backup & restore, operations, upgrades
    • OS patching
    • Monitoring, alerting
  • Databases can be run on EC2, but then you lose these benefits and have to handle them yourself, which takes more effort

Amazon RDS

  • Relational Database Service (RDS): A managed database service using the query language SQL

    • Create databases in the cloud managed by AWS
      • PostgreSQL
      • MySQL
      • MariaDB
      • Oracle
      • Microsoft SQL Server
      • IBM Db2
      • Aurora
  • Advantages of RDS over deploying database on EC2:

    • Automated provisioning
    • OS patching
    • Continuous backups with restore to a specific timestamp (point-in-time restore)
    • Monitoring dashboards
    • Read replicas improve read performance
    • Multi-AZ setup for disaster recovery
    • Maintenance windows for upgrades
    • Vertical and horizontal scaling capabilities
    • Storage backed by EBS
  • However, you can't SSH into the RDS database instance

Amazon Aurora

  • Aurora: A proprietary technology from AWS

    • Aurora DB supports PostgreSQL and MySQL
    • It's "AWS cloud optimised" and claims:
      • 5x performance improvement over MySQL on RDS
      • 3x performance improvement over PostgreSQL on RDS
    • Storage dynamically grows up to 128 TB, in increments of 10 GB
    • Costs about 20% more than RDS but is more efficient
    • Not available in free tier
  • Amazon Aurora Serverless option

    • Automated database instantiation and auto-scaling based on actual usage (no need to plan for capacity)
    • Also supports PostgreSQL and MySQL
    • No servers means no management overhead
    • Pay per second, which can be more cost-efficient
    • Use cases: Good for infrequent, intermittent or unpredictable workloads

RDS Deployments

  • Read Replicas: Scale the read workload of your database

    • Create up to 15 Read Replicas
    • Data written only to main database, not to any replicas
  • Multi-AZ: Failover in case of AZ outage

    • Data read / written only to main database
    • Can only have 1 other AZ as failover
  • Multi-Region: Disaster recovery in case of region issue

    • Applications in different regions have better read performance since they read from a local database (low latency)
    • Replicating data across regions over the network incurs a replication cost
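
How a Read Replica setup splits traffic can be sketched with a toy model. This is not how RDS is implemented: the class, its method names and the explicit replicate() step (a stand-in for asynchronous replication) are assumptions made for illustration.

```python
import itertools

class ReplicatedDatabase:
    """Toy model: one writable primary plus read-only replicas."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Writes only ever go to the primary.
        self.primary[key] = value

    def replicate(self):
        # Stand-in for asynchronous replication: copy the primary to each replica.
        for replica in self.replicas:
            replica.clear()
            replica.update(self.primary)

    def read(self, key):
        # Reads are spread across the replicas in round-robin order.
        index = next(self._next_replica)
        return index, self.replicas[index].get(key)

db = ReplicatedDatabase(replica_count=2)
db.write("user:1", "Ada")
stale = db.read("user:1")   # replication has not run yet, so the replica has no data
db.replicate()
fresh = db.read("user:1")   # served by the next replica, now up to date
print(stale, fresh)
```

Because replication in this model is a separate step, a read that arrives before it sees stale data, which is why applications using replicas have to tolerate slightly out-of-date reads.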

Amazon ElastiCache

  • ElastiCache is a managed in-memory database service compatible with Redis or Memcached
    • Caches are in-memory databases with high performance and low latency
    • The goal is to take load off databases for read-intensive workloads
    • AWS takes care of patching, maintenance, recovery, backups, monitoring, etc.
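
One common way to use a cache in front of a database is the cache-aside pattern, sketched below. The notes do not name a specific pattern, so treat this as one possible approach: a plain dict stands in for Redis or Memcached, SQLite stands in for the database, and the table and function names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO products VALUES (1, 'keyboard')")

cache = {}                      # stand-in for an in-memory store such as Redis
stats = {"database_reads": 0}   # counts how often the database is actually queried

def get_product_name(product_id):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    if product_id in cache:
        return cache[product_id]
    stats["database_reads"] += 1
    row = db.execute(
        "SELECT name FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    name = row[0] if row else None
    cache[product_id] = name    # populate the cache for later reads
    return name

first = get_product_name(1)     # cache miss, goes to the database
second = get_product_name(1)    # cache hit, the database is not touched
print(first, second, stats["database_reads"])
```

A real deployment would also need an expiry or invalidation strategy so that cached values do not go stale after writes; that part is left out here.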

DynamoDB

  • DynamoDB: A fully managed and highly available database with replication across 3 AZ

    • Not a relational database (NoSQL)
    • Scales to massive workloads, distributed "serverless" database
    • Millions of requests per second, hundreds of TB of storage
    • Fast and consistent in performance
    • Low latency retrieval, with single-digit millisecond latency
    • Integrated with IAM for security, authorisation and administration
    • Low cost and auto-scaling capabilities
    • Standard & Infrequent Access Table Class
  • DynamoDB is a key/value database

    • Primary key: Made up of partition key and sort key (optional)
    • Attributes: All the other data stored in the item (row)
  • DynamoDB Accelerator (DAX): A fully managed in-memory cache for DynamoDB

    • Microsecond latency retrieval, about 10x faster than standard retrieval
    • Secure, scalable and highly available
    • DAX is integrated and only used for DynamoDB, while ElastiCache can be used for other databases
  • Global tables

    • Make a DynamoDB table accessible in multiple regions with low latency
    • Active-Active replication (read / write to any AWS Region)
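
The key/value model with a partition key and an optional sort key can be sketched as a small in-memory table. This only mirrors the data model described above; it is not the DynamoDB API, and the class and method names are assumptions chosen for illustration.

```python
class KeyValueTable:
    """Toy table: items are grouped by partition key and ordered by sort key."""

    def __init__(self):
        self._partitions = {}   # partition key -> {sort key: attributes}

    def put_item(self, partition_key, sort_key, attributes):
        # The (partition key, sort key) pair is the primary key; same key overwrites.
        self._partitions.setdefault(partition_key, {})[sort_key] = dict(attributes)

    def get_item(self, partition_key, sort_key):
        # Direct lookup by the full primary key.
        return self._partitions.get(partition_key, {}).get(sort_key)

    def query(self, partition_key):
        # All items in one partition, ordered by sort key.
        partition = self._partitions.get(partition_key, {})
        return sorted(partition.items(), key=lambda pair: pair[0])

orders = KeyValueTable()
orders.put_item("user#1", "2024-02-01", {"total": 20})
orders.put_item("user#1", "2024-01-15", {"total": 5, "coupon": "NEW"})
orders.put_item("user#2", "2024-01-20", {"total": 9})
print(orders.query("user#1"))
```

Items in the same table carry different attributes (only one has "coupon"), which is the flexible-schema property of NoSQL databases.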

Redshift

  • Redshift: A database based on PostgreSQL, but not used for online transaction processing (OLTP)

    • It's online analytical processing (OLAP), which is used for analytics and data warehousing
    • Data is loaded in batches (for example once an hour), not every second
    • AWS claims 10x better performance than other data warehouses, scaling to petabytes of data
    • Column-based (columnar) storage rather than row-based
    • Uses a Massively Parallel Processing (MPP) query engine to perform computations very quickly
    • Pay as you go based on instances provisioned
    • Provides an SQL interface for performing queries
    • Integrated with Business Intelligence (BI) tools such as Amazon QuickSight or Tableau for creating dashboards
  • Redshift Serverless option

    • Automatically provisions and scales data warehouse underlying capacity
    • Run analytics workloads without managing data warehouse infrastructure
    • Pay only for what you use, more cost-efficient
    • Use cases: Reporting, dashboarding applications, real-time analytics
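
Why columnar storage suits analytics can be sketched with plain Python lists. This is a conceptual illustration only, not Redshift's storage format; the sample orders and the to_columns helper are made up.

```python
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

def to_columns(records):
    """Pivot row-based records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row-based: every record is visited in full even though only one field is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Column-based: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

An aggregate over one column only has to touch that column's values, which is the access pattern OLAP queries tend to have.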

Amazon EMR

  • Elastic MapReduce (EMR): A managed cluster platform for creating Hadoop clusters (Big Data)
    • Hadoop clusters are used to analyse and process vast amounts of data
    • Clusters can be made of hundreds of EC2 instances
    • Supports Apache Spark, HBase, Presto, Flink
    • EMR takes care of all provisioning and configuration
    • Auto-scaling and integrated with Spot instances
    • Use cases: Data processing, machine learning, web indexing, big data
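
The MapReduce idea behind Hadoop can be sketched with the classic word count, run here in a single process. On a real cluster the map and reduce phases run in parallel across many machines; the function names below are illustrative and not part of any Hadoop or EMR API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))

print(word_count(["big data", "Big clusters process big data"]))
```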

Amazon Athena

  • Athena: A serverless query service to perform analytics against S3 objects
    • Uses standard SQL to query files
    • Supports CSV, JSON, ORC, Avro and Parquet
    • Pricing is around $5 per TB of data scanned
      • Use compressed or columnar data for saving scanning costs
    • Use cases: Business intelligence, analytics, reporting, analyse & query VPC Flow Logs, ELB Logs, CloudTrail trails
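
The effect of scanning less data on the bill can be worked through with a small calculation. The $5 per TB figure is the approximate price from the notes (check current pricing), 1 TB is assumed to be 1024^4 bytes, and the 10% scan fraction for a columnar file is a made-up number for illustration.

```python
PRICE_PER_TB_USD = 5.0       # approximate figure from the notes, not a quoted price
BYTES_PER_TB = 1024 ** 4     # assumption: binary terabyte

def scan_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

dataset_bytes = 2 * BYTES_PER_TB

# Row-oriented text file: the whole dataset is scanned.
full_scan_cost = scan_cost_usd(dataset_bytes)

# Hypothetical columnar file where the query only touches 10% of the bytes.
columnar_scan_cost = scan_cost_usd(0.1 * dataset_bytes)

print(round(full_scan_cost, 2), round(columnar_scan_cost, 2))
```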

Amazon QuickSight

  • QuickSight: A serverless machine learning powered business intelligence service to create interactive dashboards
    • Fast, scales automatically, embeddable
    • Per-session pricing
    • Integrated with RDS, Aurora, Athena, Redshift, S3, etc.
    • Use cases: Business analytics, building visualisations, perform ad-hoc analysis, get business insights using data

DocumentDB

  • DocumentDB: A fully managed proprietary NoSQL database service that supports document data structures
    • Just like how Aurora is an "AWS-implementation" of PostgreSQL and MySQL...
      • DocumentDB is the equivalent for MongoDB, which is a NoSQL database
    • MongoDB is used to store, query and index JSON data
    • DocumentDB has similar deployment concepts to Aurora
    • Highly available with replication across 3 AZ
    • Storage automatically grows in increments of 10 GB
    • Automatically scales to workloads with millions of requests per second

Amazon Neptune

  • Neptune: A fully managed graph database
    • Ex. a graph dataset could be a social network, including users, posts, comments, likes, shares, etc.
    • Highly available across 3 AZ, with up to 15 Read Replicas
    • Used to build and run applications with highly connected datasets
      • Optimised to run complex queries
    • Store up to billions of relations and query the graph with milliseconds of latency
    • Use cases: Knowledge graphs, fraud detection, recommendation engines, social networking
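
A "highly connected dataset" and the kind of query a graph database is built for can be sketched with an adjacency map and a breadth-first search. This is plain Python rather than a Neptune query language, and the social graph and function names are invented for illustration.

```python
from collections import deque

# Who follows whom in a tiny, made-up social network.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave"},
    "carol": {"dave", "erin"},
    "dave": set(),
    "erin": set(),
}

def within_hops(graph, start, max_hops):
    """Breadth-first search returning each reachable node with its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]
    return distances

def recommendations(graph, user):
    """Suggest accounts exactly two hops away, which the user does not follow yet."""
    return sorted(n for n, d in within_hops(graph, user, 2).items() if d == 2)

print(recommendations(follows, "alice"))
```

In a relational database the same "friends of friends" question needs self-joins that grow with every extra hop, which is the kind of query graph databases are optimised for.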

Amazon Timestream

  • Timestream: A fully managed, fast, scalable and serverless time series database
    • Automatically scales up and down to adjust capacity
    • Store and analyse trillions of events per day
    • AWS claims up to 1,000 times faster than relational databases for time series workloads
    • At about 1/10th the cost of relational databases
    • Has built-in time series analytics functions which can help you identify patterns in your data in near real-time
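
A typical time series analytics function is averaging readings per fixed time window. The sketch below shows the idea in plain Python; it is not a Timestream function, and the sensor readings and the function name are made up.

```python
from collections import defaultdict

def average_per_window(points, window_seconds):
    """Group (timestamp, value) points into fixed windows and average each window."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
    }

# (seconds since start, temperature) readings from a hypothetical sensor
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 25.0), (150, 30.0)]

print(average_per_window(readings, window_seconds=60))
```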

Amazon QLDB

  • Quantum Ledger Database (QLDB): A fully managed ledger database for recording financial transactions
    • A ledger is a book recording financial transactions
    • Serverless and highly available with replication across 3 AZ
    • Used to review history of all the changes made to your application data over time
    • It's an immutable system, which means no entry can be removed or modified. Each change is also cryptographically verifiable, so you can confirm that no financial transaction has disappeared
      • Under the hood, there's a QLDB journal containing a sequence of modifications
      • Each time a modification is recorded, a cryptographic hash is computed to verify nothing has been deleted or modified. This can be verified by anyone using the database
    • 2-3x better performance than common ledger blockchain frameworks
    • Manipulate data using SQL
    • There's no decentralisation component (a central authority owns the ledger), in accordance with financial regulation rules, unlike Amazon Managed Blockchain
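
The idea of a journal protected by cryptographic hashes can be sketched as a simple hash chain. This is a conceptual illustration using SHA-256 from the standard library, not QLDB's actual journal format; the Journal class and its fields are assumptions.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64   # placeholder "previous hash" for the first entry

def entry_hash(previous_hash, data):
    """Hash an entry together with the hash of the entry before it."""
    payload = json.dumps(data, sort_keys=True)
    return hashlib.sha256((previous_hash + payload).encode()).hexdigest()

class Journal:
    """Append-only sequence of entries, each chained to its predecessor."""

    def __init__(self):
        self.entries = []

    def append(self, data):
        previous = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        self.entries.append({
            "data": data,
            "previous_hash": previous,
            "hash": entry_hash(previous, data),
        })

    def verify(self):
        """Recompute every hash; any edited or removed entry breaks the chain."""
        previous = GENESIS_HASH
        for entry in self.entries:
            expected = entry_hash(previous, entry["data"])
            if entry["previous_hash"] != previous or entry["hash"] != expected:
                return False
            previous = entry["hash"]
        return True

journal = Journal()
journal.append({"account": "A", "amount": 100})
journal.append({"account": "A", "amount": -30})
print(journal.verify())
```

Because each hash covers the previous one, changing or deleting an earlier entry makes verification fail for that entry or the one after it.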

Amazon Managed Blockchain

  • Blockchain enables us to build applications where multiple parties can execute transactions without needing a trusted, central authority
  • Managed Blockchain: A managed service to join public blockchain networks or create your own scalable private network
    • Compatible with frameworks such as Hyperledger Fabric and Ethereum

AWS Glue

  • Glue: A managed extract, transform and load (ETL) service
    • Useful for preparing and transforming data for analytics
    • Fully serverless service
    • Glue Data Catalog: A catalog of datasets in your AWS infrastructure
      • Used by Athena, Redshift, EMR
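
The three ETL stages can be sketched as three small functions. This is plain Python rather than a Glue job, and the CSV sample, field names and cleaning rules are all assumptions made for illustration.

```python
import csv
import io
import json

RAW_CSV = """order_id,amount,currency
1,19.99,usd
2,,usd
3,5.00,eur
"""

def extract(text):
    """Extract: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: drop rows without an amount, then fix types and casing."""
    cleaned = []
    for record in records:
        if not record["amount"]:
            continue
        cleaned.append({
            "order_id": int(record["order_id"]),
            "amount": float(record["amount"]),
            "currency": record["currency"].upper(),
        })
    return cleaned

def load(records):
    """Load: serialise as JSON Lines, ready to be written to a target store."""
    return "\n".join(json.dumps(record) for record in records)

print(load(transform(extract(RAW_CSV))))
```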

DMS

  • Database Migration Service (DMS): A service to quickly and securely migrate databases to AWS
    • Resilient and self-healing
    • The source database remains available during the migration
    • Supports:
      • Homogeneous migrations: Same database technology for source and destination
        • Ex. Oracle to Oracle
      • Heterogeneous migrations: Different database technology for source and destination
        • Ex. Microsoft SQL Server to Aurora
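
A homogeneous migration, where the source stays available while data is copied, can be sketched with two SQLite databases. This is only an analogy using the standard library's online backup call, not the DMS API; the table and sample rows are made up.

```python
import sqlite3

# Source database with some existing data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Linus")]
)
source.commit()

# Empty target using the same database technology (homogeneous migration).
target = sqlite3.connect(":memory:")

# Copy schema and data; the source connection stays open and usable.
source.backup(target)

def row_count(connection):
    return connection.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(row_count(source), row_count(target))
```

A heterogeneous migration would add a schema conversion step before the copy, since the source and target engines do not share data types or SQL dialects.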