DisasterDev

DynamoDB - lessons learned

Published on Jul 25, 2022

Tags: dynamodb, database, serverless

Let's talk about AWS DynamoDB. My reason for starting to use DynamoDB was very simple (or naive, if you like): it is a managed cloud service. Frankly speaking, I don't enjoy maintaining IT infrastructure as much as I enjoy working on a codebase, so I personally prefer services that don't require me to fiddle with VPCs, subnets, etc. (not that I can't do it, but I'm a lazy developer trying my best to avoid it). Just like many other beginners, I made tons of mistakes using DynamoDB without realizing them... until perhaps 2 years ago. Now I'm going to share some of the lessons I've learned, and hopefully they can help you as well.


Lesson 1 - Read the documentation


I'm not even joking. One of the best ways to avoid rookie mistakes is to read and understand the official documentation for DynamoDB. Besides that, you should also read the whitepaper. The purpose is not just to read it, but to understand it. Many beginner mistakes come from a lack of basic knowledge of DynamoDB: people treat it as nothing more than a simple key-value store, which leads to misuse that goes unnoticed until production.


Lesson 2 - You don't need many tables


One of my early mistakes was using DynamoDB like a SQL database. In particular, I created one table per domain entity. For example, if an e-commerce system has entities such as User, Order and Product, I'd create corresponding tables users, orders, products, just like I usually do in SQL database systems. If you are thinking "I'd do the same", then let me tell you now - don't.

A DynamoDB table is not like a SQL table. If you think of it as a bucket, then you can put all kinds of stuff in it, because it's schemaless. This means you can create one table to rule all the entities. Suppose the entities are structured like this:

type user = { 
  type: 'User'
  username: string
  date_created: string
}
type order = {
  type: 'Order'
  order_items: string[]
  date_created: string
}
type product = {
  type: 'Product'
  product_code: string
  price: number
  date_created: string
}

With all of them put in one table, it may look like what's displayed below (the table is a bit wide, so you may need to scroll it horizontally to see all columns).


pk               type      username   order_items                                      product_code   price   date_created
USER#42423       User      john.doe                                                                           2018-05-23T18:14:31+10:00
ORDER#53778924   Order                ['product_001', 'product_005', 'product_142']                           2017-07-23T22:29:42+10:00
PRODUCT#34789    Product                                                               product_045    36.50   2015-08-23T22:31:42+10:00

The table above shows that different kinds of items can co-exist in the same DynamoDB table. For columns that are not relevant to an item's type, the values simply don't exist. For example, the order item does not have a username.

You should also notice that the pk (the primary hash key) has a different prefix per item type. This is the usual approach to distinguish different types of items: IDs across different types may clash, but they won't after being prefixed with their own type.
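
To make this concrete, here's a minimal sketch of writing one of these prefixed items with the AWS SDK for JavaScript v3. The table name app-table and the key-builder helper are my own placeholders, not anything DynamoDB prescribes:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// Hypothetical helper: prefixing the raw ID with the item type means
// IDs from different entity types can never collide in the shared table.
const userPk = (id: string) => `USER#${id}`

await client.send(new PutCommand({
  TableName: 'app-table', // placeholder table name
  Item: {
    pk: userPk('42423'),
    type: 'User',
    username: 'john.doe',
    date_created: new Date().toISOString(),
  },
}))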


Lesson 3 - Use sort key


Each DynamoDB table must have one primary index, which contains a required hash key and an optional sort key (also called a range key). Many beginners, like my past self, ignore the sort key and use a UUID for the hash key, making the whole table a simple key-value store. This is not entirely useless, but you lose the chance to enjoy the query power DynamoDB provides.

The hash key is not meant to be a UUID; you should think of it as a partitioning key. The hash key is not required to be unique on its own, while the combination of hash key + sort key is. Imagine you walk into a library - how do you quickly find the book you want? If you know the category of the book, say, fiction, then you can go straight to the fiction area - this is just like using a hash key to quickly narrow down the search scope. Next, knowing the name of the book, you can find it without checking the name of every book in that area, thanks to the fact that the books in the fiction area have already been sorted. Following this example, the data we store in DynamoDB may look like below (only primary index columns are shown).


pk                 sk
BOOK#fiction       BOOK#Anna Karenina
BOOK#fiction       BOOK#To Kill a Mockingbird
BOOK#programming   BOOK#Clean Code
BOOK#programming   BOOK#The Pragmatic Programmer

The example above only shows 4 rows of data, but you can imagine that a library could hold a few thousand books per category. Without the sort key, you can only rely on the hash key. That means whenever you want to find a book in the library, you can only use its category to narrow down the search scope to a certain level, and after that you have to scan every book until you find the one you want. The time complexity is roughly O(1) + O(N) = O(N). With a sort key, you're able to search a sorted list, which can be as fast as O(log N) - much faster, especially when you're facing a large number of books in the same category.

The difference is even more dramatic in DynamoDB, where a query can only use conditions on the hash key and sort key - everything else is a filter applied after the query (or a full scan). Suppose you want to find programming books with names ranging from M to P, written by Martin Fowler. You would ask DynamoDB to query with pk='BOOK#programming' AND sk BETWEEN 'BOOK#M' AND 'BOOK#P', and then filter the results with author='Martin Fowler'. So the complexity will be something like O(1) + O(log N) + O(n), where n is usually much smaller than N and can be ignored, so it is still roughly O(1) + O(log N) = O(log N).
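
As a rough sketch, that query might look like this with the AWS SDK for JavaScript v3 (the table name library-table and the author attribute are assumptions for illustration):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// The key condition narrows the search using the sorted index;
// the filter expression is applied to the matched items afterwards.
const result = await client.send(new QueryCommand({
  TableName: 'library-table', // assumed table name
  KeyConditionExpression: 'pk = :pk AND sk BETWEEN :from AND :to',
  FilterExpression: 'author = :author', // assumed attribute
  ExpressionAttributeValues: {
    ':pk': 'BOOK#programming',
    ':from': 'BOOK#M',
    ':to': 'BOOK#P',
    ':author': 'Martin Fowler',
  },
}))
console.log(result.Items)

Keep in mind that the filter runs after the key condition has selected the items, so DynamoDB still reads (and charges you for) everything the key condition matches - another reason to design your keys so that n stays small.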

Knowing how to use the sort key wisely is key to performing fast queries on DynamoDB.


Lesson 4 - One to many


One-to-many relationships are widely seen in domain data models. The common mistake I've seen is people trying to model them with multiple tables. Let's explore a better way with a simple example: suppose a Blog has many Comments. First question: why don't we have a blogs table and a comments table? Because by doing that, you will end up:

  • Paying for 2 tables' primary indexes;
  • Paying for 2 queries when you want to fetch a blog with its comments;
  • Losing the ability to update the blog and its comments in a single transaction (see the sketch at the end of this lesson).

To model this relationship in one table, we can use the same hash key for both the blog and its comments, and give them different sort keys. For example:


pk          sk
BLOG#1234   BLOG#1234
BLOG#1234   COMMENT#3245
BLOG#1234   COMMENT#3246
BLOG#1678   BLOG#1678
BLOG#1678   COMMENT#4577

Because the blog and its comments share the same hash key, we can query a blog together with its comments just by pk=$blog_pk, for example pk='BLOG#1234'. This will give us 3 items: the first 3 rows in the table above. During deserialization, we can use the sort keys to determine what type each item is - the blog's sort key starts with BLOG#, while comments start with COMMENT#.
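
A minimal sketch of that query and the prefix-based deserialization (again assuming a table named app-table):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, QueryCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// One query fetches the blog together with all of its comments.
const { Items = [] } = await client.send(new QueryCommand({
  TableName: 'app-table', // assumed table name
  KeyConditionExpression: 'pk = :pk',
  ExpressionAttributeValues: { ':pk': 'BLOG#1234' },
}))

// The sort key prefix tells us which type each item is.
const blog = Items.find((item) => item.sk.startsWith('BLOG#'))
const comments = Items.filter((item) => item.sk.startsWith('COMMENT#'))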

This technique can handle all sorts of one-to-many relationships. You can follow the same pattern to link other entities to Blog. If X has many P, Q, R, S, then by querying pk='X#1234' you can get all the items owned by X, along with X itself. If you're only interested in X and P, you can query with pk='X#1234' AND begins_with(sk, 'P#') - I think you've got the idea now.
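
And on the transaction point from the list above: because the blog and its comments live in the same table, a single TransactWriteItems call can update both atomically. A minimal sketch, assuming the same app-table and a hypothetical comment_count attribute on the blog item:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, TransactWriteCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// Bump the blog's comment counter and insert the new comment in one
// atomic transaction: both writes succeed, or neither does.
await client.send(new TransactWriteCommand({
  TransactItems: [
    {
      Update: {
        TableName: 'app-table', // assumed table name
        Key: { pk: 'BLOG#1234', sk: 'BLOG#1234' },
        UpdateExpression: 'ADD comment_count :one', // hypothetical attribute
        ExpressionAttributeValues: { ':one': 1 },
      },
    },
    {
      Put: {
        TableName: 'app-table',
        Item: { pk: 'BLOG#1234', sk: 'COMMENT#3247', body: 'Nice post!' },
      },
    },
  ],
}))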


Lesson 5 - Be aware of eventual consistency


If I were to write a TL;DR for this lesson, it'd be "do not read what you just saved". I know it sounds a bit odd, but reading on should tell you more about it.

The mistake I made was that I had 2 functions: in the first, I save an item to a DynamoDB table and immediately pass the saved item's ID to the second function, where I read the saved item and use it to process another set of data. The problem with this approach is that occasionally I got stale data in the second function, despite the fact that the first function had just updated it. I should have known better: by default, DynamoDB does not guarantee strongly consistent reads.

This is not to say that DynamoDB sucks. There's a trade-off here, because consistency comes at a price. Strongly consistent reads:

  • Might not be available if there is a network delay or outage;
  • May have higher latency than eventually consistent reads;
  • Are not supported on global secondary indexes;
  • Use more throughput capacity than eventually consistent reads.

That being said, if you can accept the disadvantages above, or you are willing to handle the errors they cause, you can consider using strongly consistent reads; otherwise, design your application logic to expect eventual consistency.

You should also know that strongly consistent reads are only available on the primary index.
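
Opting in is just a flag on the read call. A minimal sketch, assuming the single-table key shape from earlier:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, GetCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// ConsistentRead defaults to false (eventually consistent), which is
// why a read immediately after a write can return stale data.
const { Item } = await client.send(new GetCommand({
  TableName: 'app-table', // assumed table name
  Key: { pk: 'USER#42423', sk: 'USER#42423' }, // assumed key shape
  ConsistentRead: true,
}))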


Lesson 6 - Get the DynamoDB book


This is not an advertisement, nor am I the author of the DynamoDB book. My manager bought this book and shared it with me. I've learned quite a lot from it, and the lessons on this page cover perhaps 10 percent of what the book has to offer. I consider it a must-read once you understand the basics of DynamoDB.


There you have it. I hope these 6 lessons help you have a smoother journey with DynamoDB and avoid the beginner mistakes I made.

© 2022 disasterdev.net. All rights reserved