我的背景——在Hadoop世界工作了4周。使用Cloudera的Hadoop VM对Hive, Pig和Hadoop进行了一些尝试。已阅读谷歌关于Map-Reduce和GFS的论文(PDF链接)。
我明白——
猪的语言猪的拉丁语是一种转变
来自(适合程序员的思维方式)
SQL喜欢声明式的
编程和Hive的查询语言密切相关
类似于SQL。
Pig位于Hadoop之上
原则也可以凌驾于之上
德律阿得斯。我可能错了,但蜂巢错了
与Hadoop紧密耦合。
都是Pig Latin和Hive命令
编译映射和减少作业。
我的问题是——当一个(比如猪)可以达到目的时,拥有两者的目标是什么?难道只是因为雅虎宣传了Pig !和Facebook的Hive ?
Pig-latin is data flow style, is more suitable for software engineer. While sql is more suitable for analytics person who are get used to sql. For complex task, for hive you have to manually to create temporary table to store intermediate data, but it is not necessary for pig.
Pig-latin is suitable for complicated data structure( like small graph). There's a data structure in pig called DataBag which is a collection of Tuple. Sometimes you need to calculate metrics which involve multiple tuples ( there's a hidden link between tuples, in this case I would call it graph). In this case, it is very easy to write a UDF to calculate the metrics which involve multiple tuples. Of course it could be done in hive, but it is not so convenient as it is in pig.
Writing UDF in pig much is easier than in Hive in my opinion.
Pig has no metadata support, (or it is optional, in future it may integrate hcatalog). Hive has tables' metadata stored in database.
You can debug pig script in local environment, but it would be hard for hive to do that. The reason is point 3. You need to set up hive metadata in your local environment, very time consuming.
看看“dezyre”文章中关于猪和蜂巢的坚果壳比较
Hive在分区、服务器、Web接口和JDBC/ODBC支持方面优于PIG。
一些差异:
Hive is best for structured Data & PIG is best for semi structured data
Hive is used for reporting & PIG for programming
Hive is used as a declarative SQL & PIG as a procedural language
Hive supports partitions & PIG does not
Hive can start an optional thrift based server & PIG cannot
Hive defines tables beforehand (schema) + stores schema information in a database & PIG doesn't have a dedicated metadata of database
Hive does not support Avro but PIG does. EDIT: Hive supports Avro, specify the serde as org.apache.hadoop.hive.serde2.avro
Pig also supports additional COGROUP feature for performing outer joins but hive does not. But both Hive & PIG can join, order & sort dynamically.