Paper review

Bigtable: A Distributed Storage System for Structured Data

 

Bigtable is a distributed, column-oriented storage system for managing structured data at Google, designed to be scalable, highly available, and high-performing. The system was influential in the development of powerful open-source databases such as Cassandra. It is used by many Google products and projects, including Google Analytics, Google Finance, Google Earth, and Google Search, which rely on this kind of storage for a variety of demanding workloads.

 

Data Model

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
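The data model above can be sketched as a map keyed by (row, column, timestamp) triples. This is only an illustration of the logical model, not the real implementation; the `com.cnn.www` row and `contents:` column come from the paper's Webtable example.

```python
# Illustrative sketch of Bigtable's logical data model: a sorted map from
# (row key, column key, timestamp) to an uninterpreted string of bytes.
table = {}

def put(row, column, value, timestamp):
    """Store an uninterpreted byte string under (row, column, timestamp)."""
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the most recent value for (row, column), or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", b"<html>...</html>", timestamp=1)
put("com.cnn.www", "contents:", b"<html>v2</html>", timestamp=2)
print(get_latest("com.cnn.www", "contents:"))  # b'<html>v2</html>'
```

Storing multiple timestamped versions under the same (row, column) pair is what makes the map multidimensional: a cell is really a small version history, not a single value.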

API

The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.
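A hedged sketch of what such an API could look like from a client's perspective. The class and method names here are hypothetical, not the actual Bigtable API (which the paper presents in C++); the sketch only shows that column families must be created before columns in them can be written.

```python
# Hypothetical sketch of a Bigtable-style client API; names are illustrative.
class Table:
    def __init__(self, name):
        self.name = name
        self.families = set()   # column-family metadata
        self.rows = {}          # row key -> {column key -> value}

    def create_column_family(self, family):
        """Register a column family; columns are named family:qualifier."""
        self.families.add(family)

    def write(self, row, column, value):
        """Write a value, rejecting columns in unknown families."""
        family = column.split(":")[0]
        if family not in self.families:
            raise KeyError(f"unknown column family {family!r}")
        self.rows.setdefault(row, {})[column] = value

t = Table("webtable")
t.create_column_family("anchor")
t.write("com.cnn.www", "anchor:cnnsi.com", b"CNN")
```

Keeping column-family creation as an explicit metadata operation mirrors the paper's design, where families are the unit of access control and are expected to be few and rarely changed.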

Google infrastructure on which Bigtable depends

Bigtable uses the Google File System (GFS) to store logs and data in the SSTable file format. An SSTable provides an immutable, ordered (key, value) map. Internally, it consists of a sequence of blocks (typically 64 KB each, though this is configurable) plus a block index used to locate them. When an SSTable is opened, this index is loaded into main memory for fast lookups.
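The block-index lookup can be sketched as follows: the in-memory index holds the first key of each block, so a binary search over the index identifies the single block that could contain a given key. This is a simplified sketch, assuming blocks are already in memory rather than read from disk.

```python
import bisect

# Sketch of SSTable lookup via an in-memory block index.
class SSTable:
    def __init__(self, blocks):
        # blocks: list of sorted (key, value) lists whose concatenation
        # is globally sorted, mimicking an SSTable's on-disk layout.
        self.blocks = blocks
        self.index = [block[0][0] for block in blocks]  # first key per block

    def get(self, key):
        # Binary-search for the last block whose first key is <= key,
        # then scan only that block.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

sst = SSTable([[("a", 1), ("c", 2)], [("m", 3), ("z", 4)]])
print(sst.get("m"))  # 3
```

Because the index fits in memory, a point read costs at most one disk seek (to fetch the candidate block), which is the property the paper highlights.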


Bigtable also uses a highly available and persistent distributed lock service called Chubby to handle synchronization. A Chubby service consists of five active replicas, one of which is elected master and actively serves requests. The service is live when a majority of the replicas are running and can communicate with each other. Each Chubby client maintains a session with the Chubby service.
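The liveness condition above is a simple majority quorum, which can be stated in one line (an illustrative check, not part of Chubby's actual code):

```python
# Chubby is live when a majority of its five replicas are up and
# mutually reachable; three of five is the smallest majority.
REPLICAS = 5

def is_live(running):
    """True iff the running replicas form a strict majority."""
    return running > REPLICAS // 2

print([is_live(n) for n in range(6)])  # [False, False, False, True, True, True]
```

With five replicas the service therefore tolerates up to two simultaneous replica failures while remaining available.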

Implementation

The master assigns tablets to tablet servers. When a tablet server has spare capacity, the master can send it a "load tablet" request. When a tablet server dies, the master can reassign its tablets to other tablet servers. On restart, the master scans the Chubby namespace to discover all live tablet servers, then scans the METADATA table to find the currently unassigned tablets and assigns them accordingly. The master also takes an exclusive lock on a common Chubby file, which ensures a single leader and prevents multiple master instances from running at once.
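The assignment bookkeeping described above can be sketched as follows. The data structures and names are illustrative, not Google's implementation; the sketch only captures how tablets move between the unassigned pool and live servers.

```python
# Hedged sketch of the master's tablet-assignment state.
class Master:
    def __init__(self, unassigned):
        self.assignments = {}            # tablet -> tablet server
        self.unassigned = set(unassigned)

    def load_tablet(self, tablet, server):
        """Assign an unassigned tablet to a server with spare capacity."""
        self.unassigned.discard(tablet)
        self.assignments[tablet] = server

    def on_server_death(self, dead_server):
        """Return a dead server's tablets to the unassigned pool."""
        for tablet, server in list(self.assignments.items()):
            if server == dead_server:
                del self.assignments[tablet]
                self.unassigned.add(tablet)

m = Master(unassigned={"t1", "t2"})
m.load_tablet("t1", "srv-a")
m.on_server_death("srv-a")
print(sorted(m.unassigned))  # ['t1', 't2']
```

In the real system this state is recovered, not kept only in memory: the Chubby namespace yields the live servers and the METADATA table yields the tablet-to-server mapping, which is what lets a restarted master rebuild exactly these structures.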

In addition, the master periodically checks with each tablet server to find out which tablets it is serving. If the master cannot reach a server, it tries to acquire the Chubby lock on that server's file; if it succeeds, it knows the tablet server is having trouble reaching Chubby and its tablets can be reassigned.

Persistence is provided by GFS. Every write operation is first appended to a commit log. A memtable holds the most recent updates since the last checkpoint, while older updates are stored in SSTables on disk. As new records reach the tablet server, the memtable keeps growing; once it passes a size threshold, it is frozen and converted to an on-disk SSTable. This relieves memory pressure on the tablet server, acts as a checkpoint, and reduces the amount of data that must be recreated from the commit log during recovery. This is called a minor compaction. Because minor compactions can create too many small SSTables, a major compaction can be run in the background to merge them into a single consolidated SSTable.
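The write path and compactions described above can be sketched as follows. The threshold value and structure names are illustrative; a real tablet server triggers compactions by memory size, not entry count.

```python
# Sketch of the tablet-server write path: commit log -> memtable -> SSTables.
MEMTABLE_THRESHOLD = 3   # illustrative; real systems use a byte-size limit

commit_log = []
memtable = {}
sstables = []            # each flush produces one immutable, sorted SSTable

def write(key, value):
    commit_log.append((key, value))   # durability first: append to the log
    memtable[key] = value             # then update the in-memory buffer
    if len(memtable) >= MEMTABLE_THRESHOLD:
        minor_compaction()

def minor_compaction():
    """Freeze the memtable as a sorted on-disk SSTable; start a fresh one."""
    global memtable
    sstables.append(sorted(memtable.items()))
    memtable = {}

def major_compaction():
    """Merge all SSTables into one consolidated SSTable."""
    global sstables
    merged = {}
    for sst in sstables:
        merged.update(sst)
    sstables = [sorted(merged.items())]

for i in range(5):
    write(f"k{i}", i)
print(len(sstables), len(memtable))  # 1 2
```

A read must consult the memtable and every SSTable and merge the results, which is exactly why too many small SSTables hurt read performance and why the background major compaction pays off.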

 

Relevant refinements

 

Performance

Although the paper offers the authors' own performance evaluation, the following URL gives a clearer explanation of the performance results.

https://people.cs.kuleuven.be/~bettina.berendt/teaching/2010-11-2ndsemester/ctdb/Mini-workshops/H8_copres_Appermont.pdf

 

 

Ramses Alexander Coraspe Valdez

July 2, 2020