我的背景——在Hadoop世界工作了4周。使用Cloudera的Hadoop VM对Hive, Pig和Hadoop进行了一些尝试。已阅读谷歌关于Map-Reduce和GFS的论文(PDF链接)。

我明白——

猪的语言猪的拉丁语是一种转变 来自(适合程序员的思维方式) SQL喜欢声明式的 编程和Hive的查询语言密切相关 类似于SQL。 Pig位于Hadoop之上 原则也可以凌驾于之上 德律阿得斯。我可能错了,但蜂巢错了 与Hadoop紧密耦合。 都是Pig Latin和Hive命令 编译映射和减少作业。

我的问题是——当一个(比如猪)可以达到目的时,拥有两者的目标是什么?难道只是因为雅虎宣传了Pig !和Facebook的Hive ?


当前回答

我发现这个是最有帮助的(尽管它已经有一年的历史了)——http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo

它特别谈到了Pig vs Hive,以及他们在雅虎的工作时间和地点。我发现这很有见地。一些有趣的笔记:

关于数据集的增量更改/更新:

方法来连接新的增量数据并使用 结果与以前的结果完全连接在一起就是 正确的方法。这只需要几分钟。标准数据库 操作可以以这种增量的方式在Pig Latin中实现, 这使得Pig成为这个用例的好工具。

关于通过流媒体使用其他工具:

猪与流媒体的集成也使研究人员很容易 使用他们已经调试过的Perl或Python脚本 数据集,并在一个巨大的数据集上运行。

关于使用Hive进行数据仓库:

In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as ODBC.

其他回答

我发现这个是最有帮助的(尽管它已经有一年的历史了)——http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo

它特别谈到了Pig vs Hive,以及他们在雅虎的工作时间和地点。我发现这很有见地。一些有趣的笔记:

关于数据集的增量更改/更新:

方法来连接新的增量数据并使用 结果与以前的结果完全连接在一起就是 正确的方法。这只需要几分钟。标准数据库 操作可以以这种增量的方式在Pig Latin中实现, 这使得Pig成为这个用例的好工具。

关于通过流媒体使用其他工具:

猪与流媒体的集成也使研究人员很容易 使用他们已经调试过的Perl或Python脚本 数据集,并在一个巨大的数据集上运行。

关于使用Hive进行数据仓库:

In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as ODBC.

Pig-latin is data flow style, is more suitable for software engineer. While sql is more suitable for analytics person who are get used to sql. For complex task, for hive you have to manually to create temporary table to store intermediate data, but it is not necessary for pig. Pig-latin is suitable for complicated data structure( like small graph). There's a data structure in pig called DataBag which is a collection of Tuple. Sometimes you need to calculate metrics which involve multiple tuples ( there's a hidden link between tuples, in this case I would call it graph). In this case, it is very easy to write a UDF to calculate the metrics which involve multiple tuples. Of course it could be done in hive, but it is not so convenient as it is in pig. Writing UDF in pig much is easier than in Hive in my opinion. Pig has no metadata support, (or it is optional, in future it may integrate hcatalog). Hive has tables' metadata stored in database. You can debug pig script in local environment, but it would be hard for hive to do that. The reason is point 3. You need to set up hive metadata in your local environment, very time consuming.

看看“dezyre”文章中关于猪和蜂巢的坚果壳比较

Hive在分区、服务器、Web接口和JDBC/ODBC支持方面优于PIG。

一些差异:

Hive is best for structured Data & PIG is best for semi structured data Hive is used for reporting & PIG for programming Hive is used as a declarative SQL & PIG as a procedural language Hive supports partitions & PIG does not Hive can start an optional thrift based server & PIG cannot Hive defines tables beforehand (schema) + stores schema information in a database & PIG doesn't have a dedicated metadata of database Hive does not support Avro but PIG does. EDIT: Hive supports Avro, specify the serde as org.apache.hadoop.hive.serde2.avro Pig also supports additional COGROUP feature for performing outer joins but hive does not. But both Hive & PIG can join, order & sort dynamically.

Hive的设计是为了吸引一个熟悉SQL的社区。它的哲学是我们不需要另一种脚本语言。Hive支持用户选择语言的map和reduce转换脚本(可以嵌入到SQL子句中)。它在Facebook上被熟悉SQL的分析人员以及使用Python编程的数据挖掘人员广泛使用。在Pig中SQL兼容性的努力已经被放弃了,所以这两个项目之间的区别是非常明显的。

支持SQL语法也意味着它可以与现有的BI工具(如Microstrategy)集成。Hive有一个ODBC/JDBC驱动程序(这是一个正在进行的工作),应该可以在不久的将来实现这一点。它还开始添加对索引的支持,这应该允许支持在这种环境中常见的向下钻取查询。

最后——这与问题无关——Hive是一个执行分析查询的框架。虽然它的主要用途是查询平面文件,但它没有理由不能查询其他存储。目前,Hive可以用于查询存储在Hbase中的数据(它是一个键值存储,就像大多数RDBMS内部的键值存储一样),HadoopDB项目已经使用Hive来查询联邦RDBMS层。

简而言之,要对两者进行一个非常高水平的概述:

1) Pig是hadoop上的关系代数

2) Hive是一个SQL over hadoop(比Pig高一级)