[blackbird-dynamodb] Monitoring AWS DynamoDB

This plugin collects various DynamoDB metrics.

What metrics does it get?

You can get metrics for each table and response times for each API.

Table Metrics

First, the per-table metrics. The Statistic column in the table below shows which CloudWatch statistic is fetched (Sum is the total value per unit time, Average is the average value per unit time).

| Metric Name | Statistic | Detail |
| --- | --- | --- |
| UserErrors | Sum | Client-side errors |
| SystemErrors | Sum | AWS-side errors (one hopes these are rare) |
| ThrottledRequests | Sum | Number of requests throttled for reaching the provisioned capacity units limit |
| ReadThrottleEvents | Sum | Read requests that reached the upper limit of provisioned capacity units |
| WriteThrottleEvents | Sum | Write requests that reached the upper limit of provisioned capacity units |
| ProvisionedReadCapacityUnits | Maximum | Number of read capacity units provisioned for the table |
| ProvisionedWriteCapacityUnits | Maximum | Number of write capacity units provisioned for the table |
| ConsumedReadCapacityUnits | Maximum | Maximum number of read capacity units consumed per unit time |
| ConsumedReadCapacityUnits | Average | Average number of read capacity units consumed per unit time |
| ConsumedWriteCapacityUnits | Maximum | Maximum number of write capacity units consumed per unit time |
| ConsumedWriteCapacityUnits | Average | Average number of write capacity units consumed per unit time |
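For reference, here is a minimal sketch of pulling one of these per-table metrics from CloudWatch with boto3 (an illustration, not the plugin's actual code; the region and table name are placeholders):

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "YOUR_DYNAMODB_TABLE_NAME"}],
    StartTime=now - datetime.timedelta(minutes=5),
    EndTime=now,
    Period=300,                         # one 5-minute window
    Statistics=["Maximum", "Average"],  # statistics from the table above
)

for point in resp["Datapoints"]:
    print(point["Timestamp"], point["Maximum"], point["Average"])
```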

About UserErrors

As the name suggests, UserErrors are errors raised on the SDK (client) side when calling DynamoDB; specifically, requests for which a 4XX status code is returned.
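To make that concrete, here is a small hypothetical boto3 example of a call that would be counted in UserErrors (the table name is deliberately made up):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="ap-northeast-1")

try:
    # Nonexistent table -> ResourceNotFoundException (HTTP 400)
    dynamodb.get_item(
        TableName="no-such-table",
        Key={"id": {"S": "1"}},
    )
except ClientError as e:
    status = e.response["ResponseMetadata"]["HTTPStatusCode"]
    if 400 <= status < 500:
        print("client-side (4XX) error -> counted as UserErrors:", status)
```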

About the CapacityUnits Metrics

DynamoDB is a distributed database with multiple shards on the back end, but the capacity units you set are assigned to the table as a whole. Because shards split and scale out in proportion to the provisioned capacity units, it is easy to assume that sharding is not your problem, and it really is great that you normally don't have to worry about scale-out or load.

However, since the capacity units are assigned to the table, each shard effectively gets that value divided by the number of shards. So it can happen that you get a ProvisionedThroughputExceededException even though ConsumedReadCapacityUnits is nowhere near the limit!

This happens because (as the documentation actually explains quite well) a hot shard is created when the hash keys are not widely distributed and requests concentrate on particular values. Once it happens it is quite hard to fix, so if possible you want to get this right at the design stage.
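As a back-of-the-envelope illustration (the shard count and numbers here are made up; DynamoDB does not expose them):

```python
# Hot-shard arithmetic: capacity is provisioned per table but served per shard.
provisioned_rcu = 400  # ProvisionedReadCapacityUnits for the whole table
shards = 4             # hypothetical shard count (not visible to you)
per_shard_rcu = provisioned_rcu / shards  # each shard serves ~100 RCU

# If a hot hash key pushes 150 RCU of reads onto a single shard, that shard
# throttles even though table-wide consumption (150 of 400) looks healthy.
hot_shard_load = 150
print("throttled:", hot_shard_load > per_shard_rcu)  # -> throttled: True
```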

Per-Operation Metrics

You can get the response time of successful API calls and the number of items returned by range queries. Be careful if, for example, the Scan or Query API is returning too many items. (In practice, you will probably notice the degraded ELB latency first.)

For the response time there is SuccessfulRequestLatency, and for the number of retrieved items there is ReturnedItemCount. Both are collected with the Maximum and Average statistics.
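As a sketch of what fetching these looks like (again an illustration, not the plugin's code), the per-operation metrics live under the same namespace with an extra Operation dimension:

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="SuccessfulRequestLatency",  # or "ReturnedItemCount"
    Dimensions=[
        {"Name": "TableName", "Value": "YOUR_DYNAMODB_TABLE_NAME"},
        {"Name": "Operation", "Value": "Query"},  # Scan, GetItem, ...
    ],
    StartTime=now - datetime.timedelta(minutes=5),
    EndTime=now,
    Period=300,
    Statistics=["Maximum", "Average"],
)
print(resp["Datapoints"])
```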

How to Install

Sorry, I haven't prepared pip or RPM packages yet; I'll take care of that as soon as possible.

Configuration

The options are as follows.

| Key Name | Default | Required | Detail |
| --- | --- | --- | --- |
| region_name | us-east-1 | No | AWS region name |
| aws_access_key_id | - | Yes | AWS Access Key ID |
| aws_secret_access_key | - | Yes | AWS Secret Access Key |
| table_name | - | Yes | Name of the DynamoDB table |
| hostname | - | Yes | Hostname on Zabbix (it's a good idea to create it first) |
| module | - | Yes | Which plugin to use; fixed to `dynamodb` here |
| ignore_metrics | - | No | Comma-separated list of metrics you don't want to collect |
| ignore_operations | - | No | Comma-separated list of operation metrics you don't want to collect |

About the ignore_XXXXX Parameters

If there are metrics you don't want to collect (CloudWatch also costs money once API calls exceed the one-month free tier), separate them with commas like `ignore_metrics = UserErrors, ConsumedWriteCapacityUnits`.

The same goes for ignore_operations: if you don't use the BatchWriteItem or Scan APIs, write something like `ignore_operations = BatchWriteItem, Scan`!
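Presumably the plugin just splits these values on commas and skips the matching CloudWatch calls; a minimal sketch of that idea (not blackbird's actual parsing code):

```python
raw = "BatchWriteItem, Scan"  # value of ignore_operations in the config

# Split on commas and strip whitespace around each name.
ignore_operations = {name.strip() for name in raw.split(",") if name.strip()}

for operation in ["Query", "Scan", "GetItem", "BatchWriteItem"]:
    if operation in ignore_operations:
        continue  # skip CloudWatch calls for ignored operations
    print("would fetch metrics for", operation)
```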

Example

Sample config file

```ini
# The section name can be anything, but since it becomes the name of the
# thread created internally, it's better not to duplicate it. Because the
# debug log includes the thread name, I use the DynamoDB table name.
[ANYTHING_OK]

# AWS information
region_name = ap-northeast-1
aws_access_key_id = XXXXXXXXXX
aws_secret_access_key = YYYYYYYYYY

# Name of the DynamoDB table to monitor
table_name = YOUR_DYNAMODB_TABLE_NAME

# Hostname on Zabbix. Zabbix requires items to be tied to a host even when
# there is no actual server behind them, so I create a host with the same
# name as the table and use that.
hostname = HOSTNAME_ON_ZBX_SERVER

# Fixed to dynamodb.
module = dynamodb
```

It looks like this!
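For illustration, the same file could be read with Python's standard configparser (blackbird has its own config loader; this sketch only shows the INI structure and why the section name matters):

```python
import configparser

config = configparser.ConfigParser()
config.read("dynamodb.cfg")  # hypothetical file name

for section in config.sections():  # e.g. ["ANYTHING_OK"]
    options = dict(config[section])
    # The section name doubles as the internal thread name, which is why
    # it should stay unique (e.g. the DynamoDB table name).
    print(section, options["table_name"], options["module"])
```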
