是Oracle, MySQL还是他们自己做的?


这是他们自己创建的,叫做Bigtable。

http://en.wikipedia.org/wiki/BigTable

在数据库上有一篇谷歌的论文:

http://research.google.com/archive/bigtable.html


Bigtable

结构化数据的分布式存储系统

Bigtable is a distributed storage system (built by Google) for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

一些功能

fast and extremely large-scale DBMS a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases. designed to scale into the petabyte range it works across hundreds or thousands of machines it is easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration each table has multiple dimensions (one of which is a field for time, allowing versioning) tables are optimized for GFS (Google File System) by being split into multiple tablets - segments of the table as split along a row chosen such that the tablet will be ~200 megabytes in size.

体系结构

BigTable不是关系数据库。它不支持连接,也不支持富sql类查询。每个表都是一个多维稀疏映射。表由行和列组成,每个单元格都有一个时间戳。具有不同时间戳的单元格可以有多个版本。时间戳允许执行诸如“选择此Web页面的n个版本”或“删除比特定日期/时间更老的单元格”之类的操作。

In order to manage the huge tables, Bigtable splits tables at row boundaries and saves them as tablets. A tablet is around 200 MB, and each machine saves about 100 tablets. This setup allows tablets from a single table to be spread among many servers. It also allows for fine-grained load balancing. If one table is receiving many queries, it can shed other tablets or move the busy table to another machine that is not so busy. Also, if a machine goes down, a tablet may be spread across many other servers so that the performance impact on any given machine is minimal.

表存储为不可变的sstable和日志尾部(每台机器一个日志)。当机器耗尽系统内存时,它使用谷歌专有压缩技术(BMDiff和Zippy)压缩一些平板电脑。小的压缩只涉及少数的平板电脑,而大的压缩涉及整个表系统和恢复硬盘空间。

Bigtable药片的位置存储在单元格中。任何特定平板电脑的查找都由一个三层系统处理。客户机获得一个指向met0表的点,而这个表只有一个。META0表跟踪许多META1片剂,其中包含正在查找的片剂的位置。META0和META1都大量使用预取和缓存来最小化系统中的瓶颈。

实现

BigTable构建在谷歌文件系统(GFS)上,GFS用作日志和数据文件的备份存储。GFS为sstable提供了可靠的存储,sstable是一种google专有的文件格式,用于持久化表数据。

BigTable大量使用的另一个服务是Chubby,这是一个高可用性、可靠的分布式锁服务。Chubby允许客户端获取一个锁,可能将它与一些元数据相关联,它可以通过向Chubby发送keep alive消息来更新这些元数据。锁存储在类似文件系统的分层命名结构中。

在Bigtable系统中有三种主要的服务器类型:

Master servers: assign tablets to tablet servers, keeps track of where tablets are located and redistributes tasks as needed. Tablet servers: handle read/write requests for tablets and split tablets when they exceed size limits (usually 100MB - 200MB). If a tablet server fails, then a 100 tablet servers each pickup 1 new tablet and the system recovers. Lock servers: instances of the Chubby distributed lock service. Lots of actions within BigTable require acquisition of locks including opening tablets for writing, ensuring that there is no more than one active Master at a time, and access control checking.

例子来自谷歌的研究论文:

A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.

API

BigTable的典型操作是创建和删除表和列族,从行中写入数据和删除列。BigTable在API中为应用程序开发人员提供了这些函数。事务支持行级,但不支持跨多个行键。


这里是研究论文的PDF链接。

在这里你可以找到谷歌的Jeff Dean在华盛顿大学演讲的视频,讨论谷歌后端使用的Bigtable内容存储系统。


虽然谷歌所有的主要应用程序都使用BigTable,但他们也在其他(可能是次要的)应用程序中使用MySQL。


正如其他人所提到的,谷歌使用了一种名为BigTable的本地解决方案,他们已经发布了几篇论文,将其描述到现实世界中。

Apache开发人员实现了这些论文中提出的思想,称为HBase。HBase是更大的Hadoop项目的一部分,根据他们的网站,Hadoop是一个软件平台,让人们可以轻松地编写和运行处理大量数据的应用程序。其中一些基准相当令人印象深刻。他们的网站是http://hadoop.apache.org。


知道BigTable不是一个关系数据库(像MySQL),而是一个巨大的(分布式的)哈希表,具有非常不同的特征也很方便。您可以自己在谷歌AppEngine平台上使用BigTable(有限版本)。

除了上面提到的Hadoop之外,还有许多其他实现试图解决与BigTable相同的问题(可伸缩性、可用性)。我昨天看到一篇不错的博客文章,在这里列出了其中的大部分。


谷歌主要使用Bigtable。

Bigtable是一种分布式存储系统,用于管理结构化数据,旨在扩展到非常大的规模。

欲了解更多信息,请从这里下载该文档。

谷歌的一些应用程序还使用Oracle和MySQL数据库。

如果您能提供更多信息,我们将不胜感激。


Spanner是谷歌的全球分布式关系数据库管理系统(RDBMS),是BigTable的继承者。谷歌声称它不是一个纯粹的关系系统,因为每个表必须有一个主键。

这是论文的链接。

Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

谷歌发明的另一个数据库是Megastore。摘要如下:

Megastore is a storage system developed to meet the requirements of today's interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore's semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.


谷歌服务具有多语言持久性体系结构。BigTable利用了它的大多数服务,如YouTube,谷歌搜索,谷歌分析等。搜索服务最初使用MapReduce作为索引基础设施,但后来在Caffeine发布期间过渡到BigTable。

谷歌云数据存储在谷歌的生产中有超过100个面向内部和外部用户的应用程序。应用程序,如Gmail, Picasa,谷歌日历,Android市场和AppEngine使用云数据存储和Megastore。

谷歌趋势使用MillWheel进行流处理。谷歌广告最初使用MySQL,后来迁移到F1 DB -一个自定义编写的分布式关系数据库。Youtube在Vitess中使用MySQL。谷歌在谷歌文件系统的帮助下在商用服务器上存储艾字节的数据。

来源:谷歌数据库:谷歌服务如何存储pb - exabyte规模的数据?

YouTube数据库-它如何存储这么多的视频而不耗尽存储空间?