Seminar 12: Implementation of Distributed Hash Table (DHT) in python
Goal: To study and practice the implementation of the partitioning method DHT in python for scaling a distributed system.
Definition: Scalability describes an application which can increase its capacity as its load increases. The added resources (capacity) fall into two broad categories: horizontal (scale out) and vertical (scale up) scaling.
- Scaling out – (replicating between multiple nodes)
- Adding a new computer to a distributed software application
- Scaling up – (upgrading to more powerful hardware)
- The addition of CPUs, memory or storage to a single computer
Scaling out dimensions:
- Functional decomposition: The process of taking an application and breaking it down into individual parts.
- Partitioning: When a dataset no longer fits on a single node, it needs to be partitioned across multiple nodes.
- Key-value stores methods – Decentralized architectures
- Range partitioning
- Hash partitioning: Distributed Hash Table (DHT)
- Key-value stores methods – Decentralized architectures
- Duplication: Adding more capacity through replicated instances equipped with load balancers (request routing)
Data and functionality can be scaled in distributed systems following the above techniques. In this seminar, we will learn how to scale data by partition it into multiple nodes.
Hash Partitioning:
- Hash function is utilized to assign keys to partitions
- Hash functions maps a potentially non-uniformly distributed key scape to a uniformly distributed hash space.
- Distributed Hash Table (DHT)
- Data is distributed across multiple nodes, where each node stores additional information to form a ring topology
- Distribute (key, value) pairs over (millions) of nodes (peers)
- Any node can perform a query with a key – with a small number of messages
- Pairs are evenly distributed over nodes without any central coordination
- Each node only knows about a small number of other nodes
- Nodes are continuously coming and going (churn)
- Rule: assign the key-value pair to the node that has the closest id
Exercise: We will get acquainted with the DHT in python. A DHT is typically deployed over a decentralized system architecture (peer-to-peer). By sorting the collection of nodes into a ring, the main purpose of a DHT is to facilitate the location of nodes (containing data) giving a key value (lookup). The DHT is built by taking the p2p network as a reference. In this p2p network, each node can be the client or the server. Nodes in the network are the "same" (same application or protocol). Nodes are connected with other nodes and each node provides a TCP/IP server on a specific port for other nodes to connect. The code for part of the implementation is given here:(Code).
- The file input-file.txt describes the DHT elements.
- key-space defines the range of values that the ring stores.
- nodes describe the list of actual nodes in the ring storing key data. Nodes are assigned with values within the #key-space. If nodes with values outside that range are introduced, then the nodes are discarded and a warning message should be triggered, e.g., 110 is not a valid node. At least one node should be defined in the file. If there is just a single node, this node has a reference to itself.
- shortcuts describe direct connections between nodes that can be used to reduce lookup time.
- Notice that each node has two references by default – successor and next successor. Remember also that each successor node stores all the key values before it, for instance, node 17 stores key values in the range of 6 to 17). Nodes just send requests to successor nodes (neighbors) or nodes in their finger table (if any).
- As you can see in the code repository, there are three files - main.py, node.py and nodeconnection.py. In nodeconnection.py, The class NodeConnection represents connections between two nodes. With this, a node can send a message to successor and the successor can respond if needed. In node.py, the Node class represents a node in DHT. In main.py, it loads the input file, initials the ring and exexcute the commands.
Working Pipeline
$ python main.py input-file.txt
- List: List the nodes in the ring (sorted from lower to higher). It also shows the references that each node is storing. (: Shortcuts, S: successor, NS: next successor)
Task: Based on the exercise, Please implement the DHT by replicating the list function using gRPC. The server will store the key-value pairs, and the client will make requests to the server to retrieve or store data. You can define a gRPC service called ListKeys and implement the ListKeys method by iterating through its local key-value store in the input-file.txt and returning a list of all the keys. The gRPC client can then call the ListKeys method to retrieve the list of keys from the server.
Deliverables: please contain the source python file, and also the screenshot of results in terminals. Then put all the file in a single zip.
Link:
NOTE: Please watch the video or ask us through Slack if something is not clear for you. We usually don't change the points after the deadline.