
SuperMassive – Fast, durable, in-memory, distributed key-value database


SuperMassive is a massively scalable, in-memory, distributed, sharded, fault-tolerant, and self-healing key-value database.

Features

  • Highly scalable: Scale horizontally with ease; simply add more nodes to the cluster.
  • Distributed: Data is distributed across multiple nodes in a sharded fashion.
  • Robust Health Checking System: Health checks are performed on all nodes; if a node is marked unhealthy, the cluster tries to recover it.
  • Smart Data Distribution: Writes are distributed across primary nodes using a sequence-based round-robin approach, ensuring all primary nodes get an equal share of writes (see the sketch after this list).
  • Automatic Fail-over: Primary nodes fail over automatically on write failure; if a primary node is unavailable for a write, the write goes to the next available primary node.
  • Parallel Read Operations: Read operations are performed in parallel.
  • Consistency Management: Timestamp-based version control handles conflicts; the most recent value is always returned and the rest are deleted.
  • Fault-tolerant: Replication and fail-over are supported; if a node goes down, the cluster continues to function.
  • Self-healing: Automatic data recovery. A node can recover from its journal, and a node replica can recover from a primary node via a checkpoint-like algorithm.
  • Simple Protocol: PUT, GET, DEL, INCR, DECR, REGX, STAT, RCNF, PING.
  • Async Node Journal: Operations are written to a journal asynchronously, allowing fast writes and recovery.
  • Multi-platform: Linux, Windows, macOS.
  • Thoroughly Tested: Extensive unit and integration tests for different scenarios. We are always looking for more tests to add.
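To make the write path concrete, here is a minimal Go sketch of sequence-based round-robin selection with fail-over to the next available primary. This is an illustration, not SuperMassive's actual implementation; the Cluster and Primary types are hypothetical.

package main

import (
    "errors"
    "fmt"
    "sync/atomic"
)

// Primary is a hypothetical handle to a primary shard.
type Primary struct {
    Addr    string
    Healthy bool
}

// Cluster distributes writes across primaries round-robin.
type Cluster struct {
    primaries []*Primary
    sequence  atomic.Uint64 // monotonically increasing write sequence
}

// Put picks a primary by sequence; on failure it moves to the next
// available primary, mirroring the fail-over behavior described above.
func (c *Cluster) Put(key, value string) error {
    n := len(c.primaries)
    start := int(c.sequence.Add(1)) % n
    for i := 0; i < n; i++ {
        p := c.primaries[(start+i)%n]
        if !p.Healthy {
            continue // skip unavailable primaries
        }
        fmt.Printf("PUT %s %s -> %s\n", key, value, p.Addr)
        return nil
    }
    return errors.New("no available primary node")
}

func main() {
    c := &Cluster{primaries: []*Primary{
        {Addr: "localhost:4001", Healthy: true},
        {Addr: "localhost:4003", Healthy: true},
    }}
    for i := 0; i < 4; i++ {
        _ = c.Put(fmt.Sprintf("key%d", i), "value")
    }
}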

Discord

Join the SuperMassive Discord server to chat with the maintainers and other users. We are always looking for feedback, bug reports, and discussion.


Benchmarks

Local, 2 shards, no replicas; 32 KB key and value size.

Test system (summarized from neofetch):

  • OS: Ubuntu 23.04 x86_64
  • Kernel: 6.2.0-39-generic
  • CPU: 11th Gen Intel i7-11700K (16) @ 4.900GHz
  • Memory: 10682MiB / 47928MiB


Total Operations: 1000000
- GET operations: 799511
- PUT operations: 150322
- DEL operations: 50167
Time Elapsed: 3.232200699s

Operations/sec: 309386.73
- GETs/sec: 247358.09
- PUTs/sec: 46507.63
- DELs/sec: 15521.00
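
For reference, the throughput figure is simply total operations divided by elapsed time: 1,000,000 / 3.2322 s ≈ 309,387 operations per second.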

You can run your own benchmarks using the benchmarking tool at Benchmarker.

Getting Started

You can use TLS for client-to-cluster communication as well as cluster-to-node and node-to-replica communication. A cluster can be started with TLS, and so can the other instance types, based on their configuration files.

When starting a cluster instance you provide a --username and a --password. When connecting through a client like netcat, you then need to authenticate with AUTH user\0password.

The user\0password should be encoded in base64.

There is only one user for the system. You set it, and you can change the user credentials on any restart. It is advisable to populate these flags from an environment variable, or to use another method that keeps the password secure.

When starting a cluster, node, or node replica, you also always provide a --shared-key flag, which is used to authenticate cluster-to-node and node-to-replica communication. All keys in the chain must match for communication to succeed. This key should also be kept secure.

(echo -n "AUTH "; echo -n 'username\0password' | base64; cat) | nc -C localhost 4000
OK authenticated

-C is used for CRLF line endings. This is required for the protocol.
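
For programmatic access, a minimal Go client could perform the same handshake. This is a sketch based on the protocol as described above (base64-encoded credentials, CRLF line endings); the credential separator is written as the literal characters \0 to match the netcat example, and error handling is abbreviated.

package main

import (
    "bufio"
    "encoding/base64"
    "fmt"
    "net"
)

func main() {
    conn, err := net.Dial("tcp", "localhost:4000")
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    r := bufio.NewReader(conn)

    // AUTH with base64-encoded credentials, separated by \0 as in
    // the netcat example above. Lines are CRLF-terminated.
    creds := base64.StdEncoding.EncodeToString([]byte(`username\0password`))
    fmt.Fprintf(conn, "AUTH %s\r\n", creds)
    line, _ := r.ReadString('\n')
    fmt.Print(line) // expect: OK authenticated

    fmt.Fprint(conn, "PUT key1 value1\r\n")
    line, _ = r.ReadString('\n')
    fmt.Print(line) // expect: OK key-value written

    fmt.Fprint(conn, "GET key1\r\n")
    line, _ = r.ReadString('\n')
    fmt.Print(line) // expect: key1 value1
}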

A cluster requires nodes to write to. By default, .cluster, .node, and .nodereplica YAML configs are created. These configurations give you an example of how a cluster is set up with 1 node and 1 replica.

Configurations

Cluster

health-check-interval: 2
server-config:
    address: localhost:4000
    use-tls: false
    cert-file: /
    key-file: /
    read-timeout: 10
    buffer-size: 1024
node-configs:
    - node:
        server-address: localhost:4001
        use-tls: false
        ca-cert-file: ""
        connect-timeout: 5
        write-timeout: 5
        read-timeout: 5
        max-retries: 3
        retry-wait-time: 1
        buffer-size: 1024
      replicas:
        - server-address: localhost:4002
          use-tls: false
          ca-cert-file: ""
          connect-timeout: 5
          write-timeout: 5
          read-timeout: 5
          max-retries: 3
          retry-wait-time: 1
          buffer-size: 1024

You can add more nodes and replicas to the cluster by adding more node-configs. A node acts as a primary shard and a replica acts as a read replica to the primary shard.

Node

health-check-interval: 2
max-memory-threshold: 75
server-config:
    address: localhost:4001
    use-tls: false
    cert-file: /
    key-file: /
    read-timeout: 10
    buffer-size: 1024
read-replicas:
    - server-address: localhost:4002
      use-tls: false
      ca-cert-file: /
      connect-timeout: 5
      write-timeout: 5
      read-timeout: 5
      max-retries: 3
      retry-wait-time: 1
      buffer-size: 1024

Node Replica

server-config:
    address: localhost:4002
    use-tls: false
    cert-file: /
    key-file: /
    read-timeout: 10
    buffer-size: 1024
max-memory-threshold: 75

Examples

PUT key1 value1
OK key-value written

GET key1
key1 value1

DEL key1
OK key-value deleted

GET key1
ERR key-value not found

-- Match keys with REGEX
--- Match keys starting with prefix 'user_'
REGX ^user_
--- Match keys ending with a suffix '2024'
REGX _2024$
--- Contains 'session'
REGX session
--- Match keys with 'user' and '2024'
REGX user.*2024
--- Match keys with 'user' or '2024'
REGX user|2024
--- Range match keys between 'log_10', 'log_15'
REGX log_(1[0-5])
-- Results are returned OK KEY VALUE CRLF KEY VALUE CRLF...

-- You can offset and limit the results
REGX user.*2024 0 10 -- Offset 0, Limit 10
REGX user.*2024 10 10 -- Offset 10, Limit 10
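
When a result set is large, a client can walk it by advancing the offset until nothing more comes back. A rough Go sketch, assuming an authenticated connection (see the earlier client sketch); the end-of-results framing here is an assumption, not documented behavior.

package main

import (
    "bufio"
    "fmt"
    "net"
    "strings"
)

// pageRegex issues REGX with an advancing offset, pageSize rows at a
// time. Sketch only: it assumes each response arrives as a single
// CRLF-terminated line and that a non-"OK " reply means no more rows.
func pageRegex(conn net.Conn, r *bufio.Reader, pattern string, pageSize int) {
    for offset := 0; ; offset += pageSize {
        fmt.Fprintf(conn, "REGX %s %d %d\r\n", pattern, offset, pageSize)
        line, err := r.ReadString('\n')
        if err != nil || !strings.HasPrefix(line, "OK ") {
            return // assumed end of results (or an error)
        }
        fmt.Print(line)
    }
}

func main() {
    conn, err := net.Dial("tcp", "localhost:4000")
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    // AUTH omitted for brevity; see the earlier client sketch.
    pageRegex(conn, bufio.NewReader(conn), "user.*2024", 10)
}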

PUT key1 100
OK key-value written

INCR key1
OK 101

DECR key1
key1 100

INCR key1 10
key1 110

DECR key1 10
key1 100

PUT key2 1.5
OK key-value written

INCR key2 1.1
key2 2.6

DECR key2 1.1
key2 1.5

STAT -- get stats on all nodes in the cluster
OK
CLUSTER localhost:4000
    current_sequence 0
    client_connection_count 1
PRIMARY localhost:4001 -- get stats on a specific node
DISK
    sync_enabled true
    sync_interval 128ms
    avg_page_size 1024.00
    file_mode -rwxrwxr-x
    is_closed false
    last_page 99
    storage_efficiency 0.9846
    file_name .journal
    page_size 1024
    total_pages 100
    total_header_size 1600
    total_data_size 102400
    page_utilization 1.0000
    header_overhead_ratio 0.0154
    file_size 104000
    modified_time 2025-02-23T04:39:31-05:00
MEMORY
    load_factor 0.3906
    grow_threshold 0.7500
    max_probe_length 2
    empty_buckets 156
    utilization 0.3906
    needs_grow false
    needs_shrink false
    size 256
    used 100
    shrink_threshold 0.2500
    avg_probe_length 0.2600
    empty_bucket_ratio 0.6094
REPLICA localhost:4002 -- each primary is listed first, then its replicas' stats beneath it
.. more

RCNF -- reload configuration files; reloads across the entire cluster, its nodes, and replicas. Useful when you want to change configurations without restarting the cluster or nodes.
OK configs reloaded

[!NOTE] There are NO transactions. All commands except PUT and PING are run in parallel. PUT selects one node based on the current sequence and writes to it. On GET, we always return the most recent value of a key; if multiple values exist for a key, only one value lives on and the rest are deleted.
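
The conflict rule amounts to last-write-wins on timestamps. Here is a minimal sketch of the idea; the versioned type and resolve function are illustrative, not SuperMassive's internals.

package main

import (
    "fmt"
    "time"
)

// versioned is a hypothetical value carrying its write timestamp.
type versioned struct {
    value   string
    written time.Time
}

// resolve returns the most recent value and reports which of the
// candidates should be deleted, mirroring the documented behavior:
// the newest value lives on, the rest are removed.
func resolve(candidates []versioned) (winner versioned, stale []versioned) {
    winner = candidates[0]
    for _, c := range candidates[1:] {
        if c.written.After(winner.written) {
            stale = append(stale, winner)
            winner = c
        } else {
            stale = append(stale, c)
        }
    }
    return winner, stale
}

func main() {
    now := time.Now()
    w, s := resolve([]versioned{
        {"old", now.Add(-time.Minute)},
        {"new", now},
    })
    fmt.Println("keep:", w.value, "delete:", len(s))
}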

Replica consistency?

When a replica is down, the primary node cannot write to it, so the primary continues writing to its other replicas. When the replica comes back up, the primary sends it the missing data, bringing the replica back in sync.

This works through the journal pages and a specific piece of the protocol. After connecting to a replica, a primary sends STARTSYNC; the replica responds with SYNCFROM pgnum, where pgnum is the last page number in the replica's journal. The primary then sends the missing operations to the replica.

Communication looks like this (a code sketch follows the list):

  1. Replica comes online; the primary connects to it and sends STARTSYNC
  2. Replica sends SYNCFROM pgnum to the primary
  3. Primary sends pages to the replica until it reaches the end of its journal
  4. Replica writes the pages to its own journal
  5. Primary signals it is done by sending DONESYNC to the replica
  6. Primary and replica are now in sync
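
As a rough illustration, the replica's side of this handshake could look like the following Go sketch. The message names come from the protocol above; the line-based framing and page payload handling are assumptions, since the actual journal page encoding is not documented here.

package main

import (
    "bufio"
    "fmt"
    "net"
    "strings"
)

// replicaSync answers a primary's STARTSYNC by requesting pages
// after lastPage, then appends incoming pages to the journal until
// DONESYNC arrives. Sketch only: real page frames are not line-based.
func replicaSync(conn net.Conn, lastPage int, appendPage func(string)) error {
    r := bufio.NewReader(conn)
    line, err := r.ReadString('\n')
    if err != nil {
        return err
    }
    if strings.TrimSpace(line) != "STARTSYNC" {
        return fmt.Errorf("unexpected message: %q", line)
    }
    // Tell the primary where our journal ends.
    fmt.Fprintf(conn, "SYNCFROM %d\r\n", lastPage)
    for {
        page, err := r.ReadString('\n')
        if err != nil {
            return err
        }
        if strings.TrimSpace(page) == "DONESYNC" {
            return nil // journals are now in sync
        }
        appendPage(page) // write the missing page to our journal
    }
}

func main() {
    ln, err := net.Listen("tcp", "localhost:4002")
    if err != nil {
        panic(err)
    }
    conn, err := ln.Accept()
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    _ = replicaSync(conn, 99, func(p string) { fmt.Print("page: ", p) })
}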

All nodes are full?

Add more nodes to the cluster; the cluster will automatically distribute data across the new nodes. Primaries can also shrink as keys are deleted, and because new values take precedence, more data can be written over time.

Contributing

To contribute to SuperMassive, simply fork the repository, make your changes, and submit a pull request with a detailed description of why and what changes were made.

License

BSD-3-Clause