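Difference between DataFrame, Dataset, and RDD in Spark

I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (as of Spark 2.0.0, DataFrame is just a type alias for Dataset[Row]).

Can you convert one to the other?

Current answer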

[Image: RDD, DataFrame, and Dataset compared in one picture]

RDD

An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

DataFrame

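A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

Dataset

A Dataset is a distributed collection of data. Dataset is an interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. Note: in Scala/Java, a Dataset of Rows (Dataset[Row]) is usually referred to as a DataFrame.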

[Image: a comparison of all three, with a code snippet]

Q: Can you convert one to the other, e.g. RDD to DataFrame or vice versa?

Yes, both are possible.

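1. RDD to DataFrame, using createDataFrame() with a schema (or .toDF())

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row("first", 2.0, 7.0),
    Row("second", 3.5, 2.5),
    Row("third", 7.0, 5.9)
  )
)

// A Row carries no static type information, so an RDD[Row] needs an explicit schema.
// (.toDF("id", "val1", "val2") works directly on an RDD of tuples or case classes
// once spark.implicits._ is imported.)
val schema = new StructType()
  .add("id", StringType)
  .add("val1", DoubleType)
  .add("val2", DoubleType)

val df = spark.createDataFrame(rowsRdd, schema)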

df.show()
+------+----+----+
|    id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+

See also: How to convert rdd object to dataframe in spark

2. DataFrame/Dataset to RDD, using the .rdd method

val rowsRdd: RDD[Row] = df.rdd // DataFrame to RDD (in Scala, rdd takes no parentheses)

2017-07-22 09:37:56

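Other answers

Because a DataFrame is weakly typed, developers don't get the benefits of the type system. For example, say you want to read something from SQL and run some aggregation on it: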

import org.apache.spark.sql.functions.{avg, max}

val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender")) // groupBy takes all Columns or all Strings, not a mix
  .agg(avg(people("salary")), max(people("age")))

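When you write people("deptId"), you don't get back an Int or a Long; you get back a Column object that you then have to operate on. In a language with a rich type system such as Scala, you end up losing all type safety, which increases the number of runtime errors for things that could have been caught at compile time.

By contrast, Dataset[T] is typed. When you do:

val people: Dataset[People] = sqlContext.read.parquet("...").as[People]

you actually get back People objects, where deptId is an actual integral type rather than a Column, so you can take advantage of the type system.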
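To make the contrast concrete, here is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master. The Person case class, its sample rows, and the object name are hypothetical and only stand in for the People type mentioned above.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type for this sketch; it is not the People class from the answer.
case class Person(name: String, age: Int, deptId: Long)

object TypedDatasetSketch {
  def sample(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._ // provides the Encoder[Person] needed by toDS()
    Seq(Person("Ann", 34, 1L), Person("Bob", 28, 2L)).toDS()
  }

  // Typed API: p is a Person, so p.age is an Int checked by the compiler.
  def namesOver30(spark: SparkSession): Seq[String] =
    sample(spark).filter(p => p.age > 30).collect().map(_.name).toSeq

  // Untyped API: "age" is only a string until Spark analyzes the query plan.
  def countOver30Untyped(spark: SparkSession): Long =
    sample(spark).toDF().filter("age > 30").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-dataset-sketch")
      .getOrCreate()
    println(namesOver30(spark))
    println(countOver30Untyped(spark))
    spark.stop()
  }
}
```

Both calls select the same rows. The difference is where a mistake such as a misspelled field would surface: at compile time in the typed version, and only when the plan is analyzed in the untyped one.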

As of Spark 2.0, the DataFrame and Dataset APIs are unified, and DataFrame is a type alias for Dataset[Row].

2016-05-18 13:39:42

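A DataFrame is defined well by a Google search for "DataFrame definition":

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

So, because of its tabular format, a DataFrame carries additional metadata, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset. It is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.

In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.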

2015-07-20 03:09:05

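Most of the answers are correct; I only want to add one point here.

In Spark 2.0, the two APIs (DataFrame + Dataset) are unified into a single API.

"Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface."

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or for transmission over the network.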
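As a rough illustration of what an Encoder looks like in code, here is a minimal sketch, assuming Spark 2.x or later in local mode. The Reading case class and the object name are hypothetical. Encoders.product derives an encoder for a case class, and the encoder exposes the schema Spark will use for the binary representation.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical record type used only for this sketch.
case class Reading(sensor: String, value: Double)

object EncoderSketch {
  // Derive an encoder for the case class instead of relying on spark.implicits._
  val readingEncoder: Encoder[Reading] = Encoders.product[Reading]

  // The encoder knows the column layout of the type it encodes.
  def fieldNames: Seq[String] = readingEncoder.schema.fieldNames.toSeq

  // Pass the encoder explicitly when building the Dataset, then read the objects back.
  def roundTrip(spark: SparkSession): Seq[Reading] = {
    val ds = spark.createDataset(Seq(Reading("a", 1.5), Reading("b", 2.5)))(readingEncoder)
    ds.collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("encoder-sketch")
      .getOrCreate()
    println(fieldNames)
    println(roundTrip(spark))
    spark.stop()
  }
}
```

In everyday code the encoder is usually supplied implicitly by importing spark.implicits._; passing it by hand here just makes it visible.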

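Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.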
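The two methods just described can be sketched side by side. This is a minimal example, assuming Spark 2.x or later in local mode; the Employee case class, the column names, and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type used only for this sketch.
case class Employee(name: String, age: Int)

object RddToDataFrameSketch {
  // Method 1: reflection. The schema is inferred from the case class fields.
  def viaReflection(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(Employee("Ann", 34), Employee("Bob", 28)))
    rdd.toDF()
  }

  // Method 2: programmatic. Build a StructType and apply it to an RDD[Row].
  def viaSchema(spark: SparkSession): DataFrame = {
    val rowRdd = spark.sparkContext.parallelize(Seq(Row("Ann", 34), Row("Bob", 28)))
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = false)
    ))
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-df-sketch")
      .getOrCreate()
    viaReflection(spark).printSchema()
    viaSchema(spark).printSchema()
    spark.stop()
  }
}
```

The second form is the one to reach for when the column list is only known at runtime, since the StructType can be assembled from data such as a header line or a configuration file.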

Here you can find the answer to the RDD to DataFrame conversion question:

How to convert rdd object to dataframe in spark

2016-11-20 13:53:39

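First of all, DataFrame evolved from SchemaRDD.

Yes, conversion between a DataFrame and an RDD is absolutely possible.

Below are some sample code snippets.

df.rdd is an RDD[Row]

Below are some options for creating a DataFrame.

1) yourRdd.toDF converts to a DataFrame (with spark.implicits._ imported and an RDD of case classes or tuples).
2) Using createDataFrame of the SQL context:

val df = spark.createDataFrame(rddOfRow, schema)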

where schema can come from any of the options below, as described in a nice SO post.

From a Scala case class and the Scala reflection API:

import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[YourScalacaseClass].dataType.asInstanceOf[StructType]

Or using Encoders:

import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema

A schema can also be created using StructType and StructField:

val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("col1", DoubleType, true))
  .add(StructField("col2", DoubleType, true))

In fact, there are now 3 Apache Spark APIs.

RDD API:

The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release. The RDD API provides many transformation methods, such as map(), filter(), and flatMap(), for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

RDD example:

rdd.filter(_.age > 21)            // transformation
  .map(_.last)                    // transformation
  .saveAsObjectFile("over21.bin") // action

Example: filter by attribute with an RDD

rdd.filter(_.age > 21)
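The point that transformations are only definitions until an action runs can be observed directly. Below is a minimal sketch, assuming Spark 2.x or later in local mode; the object name and the sample ages are made up. It counts how often the filter function is invoked by using an accumulator. Accumulator updates made inside transformations are not guaranteed to be applied exactly once if tasks are retried, so treat this as a demonstration only.

```scala
import org.apache.spark.sql.SparkSession

object LazyRddSketch {
  // Returns how many times the filter function had run before and after the action.
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("filter calls")

    // Transformation: this only records what should be computed.
    val adults = sc.parallelize(Seq(18, 25, 40)).filter { age =>
      calls.add(1)
      age > 21
    }
    val before: Long = calls.value // no action has run yet

    // Action: collect() triggers the actual evaluation of the filter.
    adults.collect()
    val after: Long = calls.value

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-rdd-sketch")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```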

DataFrame API

Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization. The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans.

Example SQL style:

df.filter("age > 21");

Limitations: Because the code refers to data attributes by name, the compiler cannot catch any errors. If an attribute name is incorrect, the error will only be detected at runtime, when the query plan is created.
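This limitation is easy to reproduce. The following is a minimal sketch, assuming Spark 2.x or later in local mode; the Customer case class, the misspelled column "agee", and the object name are hypothetical. The misspelled filter compiles without complaint and only fails with an AnalysisException once Spark tries to resolve the column.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Try}

// Hypothetical record type used only for this sketch.
case class Customer(name: String, age: Int)

object ColumnNameErrorSketch {
  // Returns true if the misspelled column is rejected at runtime by the analyzer.
  def misspelledFilterFails(spark: SparkSession): Boolean = {
    import spark.implicits._
    val df = Seq(Customer("Ann", 34)).toDF()

    // "agee" is a typo. The compiler only sees a String, so this line compiles.
    // The typed equivalent, df.as[Customer].filter(_.agee > 21), would not compile.
    Try(df.filter("agee > 21").count()) match {
      case Failure(_: AnalysisException) => true
      case _                             => false
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-name-error-sketch")
      .getOrCreate()
    println(misspelledFilterFails(spark))
    spark.stop()
  }
}
```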

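Another downside of the DataFrame API is that it is very Scala-centric; while it does support Java, the support is limited.

For example, when creating a DataFrame from an existing RDD of Java objects, Spark's Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.

Dataset API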

The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds; the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API. When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release. Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant.

Example Dataset API SQL style:

dataset.filter(_.age < 21);

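Evaluation differs between DataFrame and Dataset:

Catalyst-level flow (from the "Demystifying DataFrame and Dataset" presentation at Spark Summit)

Further reading: the Databricks article "A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets"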

2016-08-19 07:23:53

Apache Spark - RDD, DataFrame, and Dataset

Spark RDD –

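RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records. RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner, which speeds up the task.

Spark DataFrame –

Unlike an RDD, the data is organized into named columns, like a table in a relational database. It is an immutable, distributed collection of data. A DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.

Spark Dataset –

Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset takes advantage of Spark's Catalyst optimizer by exposing expressions and data fields to the query planner.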

2019-12-11 17:54:15
