What database does Google use?

Bigtable

A Distributed Storage System for Structured Data

Bigtable is a distributed storage system (built by Google) for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

Some features

- Fast and extremely large-scale DBMS.
- A sparse, distributed, multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
- Designed to scale into the petabyte range.
- Works across hundreds or thousands of machines.
- Easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration.
- Each table has multiple dimensions (one of which is a field for time, allowing versioning).
- Tables are optimized for GFS (Google File System) by being split into multiple tablets: segments of the table, split along a row chosen such that each tablet will be ~200 megabytes in size.

Architecture

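BigTable is not a relational database. It does not support joins, nor does it support rich SQL-like queries. Each table is a multidimensional sparse map. Tables consist of rows and columns, and each cell has a timestamp. A cell can have multiple versions with different timestamps. The timestamp allows for operations such as "select n versions of this Web page" or "delete cells that are older than a specific date/time."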
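To make this data model concrete, here is a minimal Python sketch (not Google's implementation) of a sparse map keyed by (row, column, timestamp). The class `SparseTable` and its method names are hypothetical; they only mirror the two operations quoted above, reading the n newest versions of a cell and deleting versions older than a cutoff, plus a scan in sorted row order.

```python
from collections import defaultdict

# Hypothetical in-memory model of the data model described above:
# (row key, "family:qualifier" column, timestamp) -> value.
# Only cells that were actually written are stored, so the map is sparse.
class SparseTable:
    def __init__(self):
        # row -> column -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get_versions(self, row, column, n):
        """Return up to n (timestamp, value) pairs, newest first."""
        versions = self._rows.get(row, {}).get(column, {})
        return sorted(versions.items(), reverse=True)[:n]

    def delete_older_than(self, cutoff):
        """Drop every cell version whose timestamp is earlier than cutoff."""
        for columns in self._rows.values():
            for versions in columns.values():
                for ts in [t for t in versions if t < cutoff]:
                    del versions[ts]

    def scan(self):
        """Yield (row, columns) in lexicographic row-key order."""
        for row in sorted(self._rows):
            yield row, self._rows[row]
```

For example, after writing versions at timestamps 3, 5 and 6, `get_versions(row, "contents:", 2)` returns the versions at 6 and 5, and `delete_older_than(5)` removes only the version at 3.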

In order to manage the huge tables, Bigtable splits tables at row boundaries and saves them as tablets. A tablet is around 200 MB, and each machine saves about 100 tablets. This setup allows tablets from a single table to be spread among many servers. It also allows for fine-grained load balancing: if one tablet is receiving many queries, the machine can shed other tablets or move the busy tablet to another machine that is not so busy. Also, if a machine goes down, its tablets may be spread across many other servers so that the performance impact on any given machine is minimal.
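A rough sketch of splitting at row boundaries, assuming each row's size is known up front. `split_into_tablets` and `find_tablet` are hypothetical helpers; the real system splits tablets as they grow, whereas this only illustrates the invariant that tablets cover contiguous, sorted row ranges and never cut a row in half.

```python
import bisect

# Hypothetical sketch: split (row_key, size_in_bytes) pairs into tablets at row
# boundaries, each at most max_bytes (~200 MB in the text; the example below
# uses a tiny limit so the result is easy to check).
def split_into_tablets(rows, max_bytes):
    tablets, current, current_size = [], [], 0
    for key, size in sorted(rows):
        # Close the current tablet before this row if the row would overflow it.
        # A single oversized row still gets a tablet of its own: rows are never split.
        if current and current_size + size > max_bytes:
            tablets.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        tablets.append(current)
    return tablets

def find_tablet(tablets, row_key):
    """Return the index of the tablet whose row range covers row_key."""
    # Tablets cover contiguous ranges, so each is identified by its last row key;
    # keys past the final end key belong to the last tablet.
    end_keys = [tablet[-1] for tablet in tablets]
    i = bisect.bisect_left(end_keys, row_key)
    return min(i, len(tablets) - 1)
```

With five 60-byte rows `a`..`e` and a 150-byte limit, this yields the tablets `[a, b]`, `[c, d]`, `[e]`, and a key such as `"bb"` is routed to the second tablet.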

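Tables are stored as immutable SSTables plus a tail of logs (one log per machine). When a machine runs out of system memory, it compresses some tablets using Google's proprietary compression techniques (BMDiff and Zippy). Minor compactions involve only a few tablets, while major compactions involve the whole table system and recover hard-disk space.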
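A simplified sketch of the write path as the Bigtable paper describes it: writes go to a commit log and an in-memory memtable; a minor compaction freezes the memtable into a new immutable, sorted SSTable; a major compaction merges all SSTables into one and discards deletion markers, which is where disk space is actually reclaimed. The `Tablet` class is hypothetical, keeps everything in memory, and omits compression (BMDiff/Zippy) entirely.

```python
# Hypothetical, in-memory sketch of one tablet's storage: a commit log, a
# memtable, and a stack of immutable sorted SSTables. Compression is omitted.
TOMBSTONE = object()  # deletion marker, kept until a major compaction

class Tablet:
    def __init__(self, memtable_limit):
        self.memtable_limit = memtable_limit
        self.log = []        # append-only commit log (would live in GFS)
        self.memtable = {}   # recent writes held in memory
        self.sstables = []   # immutable sorted tuples of (key, value), oldest first

    def write(self, key, value):
        self.log.append((key, value))  # log first, so a crash can be replayed
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def minor_compaction(self):
        # Freeze the memtable and persist it as a new immutable, sorted SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.log = []  # simplification: flushed entries no longer need replay

    def read(self, key):
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        # (Linear scans for brevity; real SSTables carry a block index.)
        for source in [self.memtable] + [dict(t) for t in reversed(self.sstables)]:
            if key in source:
                value = source[key]
                return None if value is TOMBSTONE else value
        return None

    def major_compaction(self):
        # Merge oldest to newest so newer values overwrite older ones, then
        # write exactly one SSTable with deletion markers dropped for good.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        live = ((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.sstables = [tuple(sorted(live))]
```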

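The locations of Bigtable tablets are stored in cells. The lookup of any particular tablet is handled by a three-tiered system. Clients get a pointer to the META0 tablet, of which there is only one. The META0 tablet keeps track of many META1 tablets, which contain the locations of the tablets being looked up. Both META0 and META1 make heavy use of prefetching and caching to minimize bottlenecks in the system.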
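A sketch of the three-tiered lookup, assuming each metadata level can be modeled as a sorted list of (last row key, location) pairs. `lookup` and `LocationClient` are hypothetical names; the client here caches per row key only to keep the example short, and caching whole row ranges (plus prefetching neighbors) would be more useful in practice.

```python
import bisect

# Hypothetical sketch of the META0 -> META1 -> tablet lookup. Each level is a
# sorted list of (last_row_key, location); an entry covers every key up to and
# including its last_row_key, and keys past the end fall into the last entry.
def lookup(index, row_key):
    end_keys = [end for end, _ in index]
    i = bisect.bisect_left(end_keys, row_key)
    return index[min(i, len(index) - 1)][1]

class LocationClient:
    def __init__(self, meta0, meta1_tablets):
        self.meta0 = meta0                  # the single META0 tablet -> META1 tablet name
        self.meta1_tablets = meta1_tablets  # META1 tablet name -> index of user tablets
        self.cache = {}                     # client-side cache: row_key -> tablet server
        self.meta_reads = 0                 # counts metadata reads, to show the cache working

    def locate(self, row_key):
        if row_key in self.cache:
            return self.cache[row_key]
        meta1_name = lookup(self.meta0, row_key)                  # tier 1 -> tier 2
        server = lookup(self.meta1_tablets[meta1_name], row_key)  # tier 2 -> tier 3
        self.meta_reads += 2
        self.cache[row_key] = server
        return server
```

Repeated lookups of the same row are answered from the cache without touching META0 or META1 again, which is the bottleneck-avoidance the paragraph above describes.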

Implementation

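BigTable is built on top of the Google File System (GFS), which is used as the backing store for log and data files. GFS provides reliable storage for SSTables, a Google-proprietary file format used to persist table data.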

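Another service that BigTable makes heavy use of is Chubby, a highly available, reliable distributed lock service. Chubby allows clients to take a lock, possibly associating it with some metadata, which the client can renew by sending keep-alive messages back to Chubby. The locks are stored in a filesystem-like hierarchical naming structure.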
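Chubby's real interface is not shown here. The sketch below is a hypothetical lease-based lock service that only mirrors the behavior described above: a lock lives at a hierarchical path, can carry metadata, and stays held only while the holder keeps renewing it with keep-alives. The clock is passed in so expiry can be exercised without sleeping; the path `/ls/cell/master` and the class name are illustrative.

```python
# Hypothetical lease-based lock service in the spirit of Chubby (not its API).
# Locks live at filesystem-like paths, may carry metadata, and expire unless
# the holder renews them with keep-alives before the lease runs out.
class LeaseLockService:
    def __init__(self, lease_seconds, clock):
        self.lease_seconds = lease_seconds
        self.clock = clock   # callable returning the current time in seconds
        self.locks = {}      # path -> [holder, metadata, expiry_time]

    def _expire(self, path):
        entry = self.locks.get(path)
        if entry and entry[2] <= self.clock():
            del self.locks[path]  # lease ran out: the lock is released

    def acquire(self, path, holder, metadata=None):
        self._expire(path)
        if path in self.locks:
            return False
        self.locks[path] = [holder, metadata, self.clock() + self.lease_seconds]
        return True

    def keep_alive(self, path, holder, metadata=None):
        """Renew the lease (optionally updating metadata); False if the lock was lost."""
        self._expire(path)
        entry = self.locks.get(path)
        if entry is None or entry[0] != holder:
            return False
        if metadata is not None:
            entry[1] = metadata
        entry[2] = self.clock() + self.lease_seconds
        return True
```

This is the property that lets "no more than one active master" be enforced: a second master can only take the lock after the first one has stopped sending keep-alives long enough for its lease to lapse.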

There are three primary server types in the Bigtable system:

- Master servers: assign tablets to tablet servers, keep track of where tablets are located, and redistribute tasks as needed.
- Tablet servers: handle read/write requests for tablets and split tablets when they exceed size limits (usually 100-200 MB). If a tablet server fails, then 100 tablet servers each pick up 1 new tablet and the system recovers.
- Lock servers: instances of the Chubby distributed lock service. Many actions within BigTable require acquiring locks, including opening tablets for writing, ensuring that there is no more than one active master at a time, and access control checking.
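The recovery behavior described for tablet servers can be sketched as a master routine that hands a failed server's tablets, one at a time, to the least-loaded surviving servers. `reassign` is a hypothetical helper over a plain dictionary and models only the redistribution step, not failure detection.

```python
import heapq

# Hypothetical sketch of the master's redistribution step after a tablet server
# fails: each orphaned tablet goes to the currently least-loaded live server.
def reassign(assignment, failed_server):
    """assignment: server -> list of tablet ids (mutated in place and returned)."""
    orphans = assignment.pop(failed_server, [])
    # Min-heap of (current load, server name) over the surviving servers.
    heap = [(len(tablets), server) for server, tablets in assignment.items()]
    heapq.heapify(heap)
    if orphans and not heap:
        raise RuntimeError("no live tablet servers left")
    for tablet in orphans:
        load, server = heapq.heappop(heap)
        assignment[server].append(tablet)
        heapq.heappush(heap, (load + 1, server))
    return assignment
```

With 101 servers holding 100 tablets each, losing one server leaves 100 survivors that each pick up exactly one tablet, matching the "100 tablet servers each pick up 1 new tablet" description.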

An example from Google's research paper:

A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
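The example row can be written out with the same nested-map shape (row, then `family:qualifier` column, then timestamp). This is an illustrative reconstruction: t3, t5 and t6 become the integers 3, 5 and 6, while the page contents, the anchor text, and the anchor timestamps (8 and 9) are placeholder values. `reverse_domain` shows the point of the reversed-URL row name: pages from the same domain sort into adjacent rows.

```python
# Illustrative reconstruction of the example row; values and anchor
# timestamps are placeholders, and the helper names are hypothetical.
def reverse_domain(host):
    """'www.cnn.com' -> 'com.cnn.www', so one domain's pages sort together."""
    return ".".join(reversed(host.split(".")))

row_key = reverse_domain("www.cnn.com")
webtable = {
    row_key: {
        # "contents" family: three versions of the page, at t3, t5 and t6.
        "contents:": {3: "contents@t3", 5: "contents@t5", 6: "contents@t6"},
        # "anchor" family: one column per referring site, one version each.
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:my.look.ca": {8: "CNN.com"},
    }
}

def latest(row, column):
    """Value of the newest version of a cell."""
    versions = webtable[row][column]
    return versions[max(versions)]

def columns_in_family(row, family):
    """All column names in a row that belong to the given column family."""
    return sorted(c for c in webtable[row] if c.split(":", 1)[0] == family)
```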

API

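Typical operations in BigTable are creating and deleting tables and column families, writing data, and deleting columns from a row. BigTable provides these functions to application developers through an API. Transactions are supported at the row level, but not across multiple row keys.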
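Row-level transactions can be illustrated with a per-row lock: any set of column writes and deletes within one row is applied atomically, and a single-row read-modify-write (such as a counter increment) is safe under concurrency. `RowStore` is a hypothetical in-memory class, not BigTable's client API, and by design it offers nothing that spans two row keys.

```python
import threading
from collections import defaultdict

# Hypothetical sketch of row-level atomicity: every mutation touches exactly
# one row and runs under that row's lock, so readers never see half of it.
# There is deliberately no operation that locks two rows together.
class RowStore:
    def __init__(self):
        self._rows = defaultdict(dict)           # row -> {column: value}
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()           # protects creation of per-row locks

    def _lock_for(self, row):
        with self._guard:
            return self._locks[row]

    def apply(self, row, sets=None, deletes=()):
        """Atomically delete and set columns within a single row."""
        with self._lock_for(row):
            for column in deletes:
                self._rows[row].pop(column, None)
            self._rows[row].update(sets or {})

    def read_modify_write(self, row, column, fn):
        """Atomically replace one cell with fn(old value) and return the new value."""
        with self._lock_for(row):
            new = fn(self._rows[row].get(column))
            self._rows[row][column] = new
            return new

    def read_row(self, row):
        with self._lock_for(row):
            return dict(self._rows[row])
```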

Here is a link to the research paper (PDF).

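Here you can find a video of Google's Jeff Dean giving a talk at the University of Washington about Bigtable, the content storage system used in Google's backend.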

2008-12-12 14:53:52

Although all of Google's major applications use BigTable, they also use MySQL in other (perhaps minor) applications.

2008-12-12 15:05:23

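Spanner is Google's globally distributed relational database management system (RDBMS) and the successor to BigTable. Google says it is not a purely relational system because every table must have a primary key.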

Here is a link to the paper.

Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
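The "novel time API that exposes clock uncertainty" is TrueTime, whose `TT.now()` returns an interval rather than a single instant. Below is a sketch of the commit-wait rule from the Spanner paper, with TrueTime simulated by the local clock plus an assumed uncertainty `EPSILON`; `tt_now` and `commit` are hypothetical names, and replication and locking are ignored.

```python
import time

# Hypothetical sketch of Spanner-style commit wait. TrueTime's TT.now()
# returns an interval [earliest, latest] that contains true time; here it is
# simulated with the local clock and an assumed uncertainty EPSILON.
EPSILON = 0.005  # assumed clock uncertainty in seconds (illustrative value)

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_write):
    # Choose a commit timestamp no earlier than any clock could currently read.
    s = tt_now()[1]
    apply_write(s)
    # Commit wait: do not acknowledge until s is definitely in the past
    # (TT.after(s) holds), so any later transaction gets a larger timestamp.
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 10)
    return s
```

The cost of external consistency shows up directly: each commit waits roughly twice the clock uncertainty, which is why keeping that uncertainty small matters.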

Another database Google invented is Megastore. Here is the abstract:

Megastore is a storage system developed to meet the requirements of today's interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore's semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.
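Megastore's "fine-grained partitions of data" are called entity groups in the paper. The sketch below shows only the single-partition idea with optimistic concurrency: a transaction remembers the log position it read from and commits only if nobody else appended in the meantime, retrying otherwise. `EntityGroup` and `transfer` are hypothetical, and cross-datacenter replication (Paxos in Megastore) is left out.

```python
# Hypothetical single-process sketch of a Megastore-style partition. Each
# group has its own append-only log; a commit succeeds only if it claims the
# next log position (optimistic concurrency). No cross-group transactions,
# and no replication.
class EntityGroup:
    def __init__(self):
        self.log = []  # committed transactions, each a {key: value} dict

    def snapshot(self):
        """Return (log position, state) for a consistent read."""
        state = {}
        for entry in self.log:
            state.update(entry)
        return len(self.log), state

    def try_commit(self, read_position, writes):
        # Only the first writer to claim the next log position succeeds.
        if read_position != len(self.log):
            return False  # conflict: the caller should re-read and retry
        self.log.append(dict(writes))
        return True

def transfer(group, src, dst, amount):
    """Serializable read-modify-write inside one entity group, with retry."""
    while True:
        pos, state = group.snapshot()
        writes = {src: state[src] - amount, dst: state[dst] + amount}
        if group.try_commit(pos, writes):
            return
```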

2013-09-28 18:44:59

It's something they built themselves, called Bigtable.

http://en.wikipedia.org/wiki/BigTable

There is a Google paper on the database:

http://research.google.com/archive/bigtable.html

2008-12-12 14:49:52

Google mainly uses Bigtable.

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size.