I just want to know what the difference is between an RDD and a DataFrame in Apache Spark (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row])?

Can you convert one to the other?


Current answer

Spark RDD (Resilient Distributed Dataset):

RDD is the core data abstraction API and has been available since the very first release of Spark (Spark 1.0). It is a lower-level API for manipulating distributed collections of data. The RDD API exposes some extremely useful methods which give very tight control over the underlying physical data structure. It is an immutable (read-only) collection of partitioned data distributed across different machines. RDDs enable in-memory computation on large clusters to speed up big data processing in a fault-tolerant manner. To enable fault tolerance, an RDD uses a DAG (Directed Acyclic Graph), which consists of a set of vertices and edges. The vertices and edges in the DAG represent the RDDs and the operations to be applied to them, respectively. The transformations defined on an RDD are lazy and execute only when an action is called.
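A minimal sketch of this lazy behavior, assuming a SparkContext named sc and a made-up input path:

// map and filter only record lineage in the DAG; nothing runs yet
val lines = sc.textFile("hdfs:///path/to/input.txt") // hypothetical path
val lengths = lines.map(_.length).filter(_ > 0)
// the action triggers the actual distributed computation
val total = lengths.reduce(_ + _)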

Spark DataFrame:

Spark 1.3 introduced a new data abstraction API, the DataFrame (the strongly typed DataSet API followed later, as a preview in Spark 1.6). The DataFrame API organizes the data into named columns, like a table in a relational database. It enables programmers to define a schema on a distributed collection of data. Each row in a DataFrame is an object of type Row. Like an SQL table, each column has the same number of rows in a DataFrame. In short, a DataFrame is a lazily evaluated plan that specifies the operations to be performed on the distributed collection of data. A DataFrame is also an immutable collection.
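A small illustration, assuming a SparkSession named spark (the column names are made up):

import spark.implicits._
// a DataFrame with two named columns backed by a known schema
val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
df.printSchema()         // the schema Spark tracks: name: string, age: int
df.select("name").show() // column access by name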

Spark DataSet:

As an extension of the DataFrame API, Spark 1.6 introduced the DataSet API, which provides a strictly typed, object-oriented programming interface in Spark. It is an immutable, type-safe distributed collection of data. Like DataFrame, the DataSet API also uses the Catalyst engine for execution optimization.
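A minimal sketch of the typed interface, assuming a SparkSession named spark; the Person class is hypothetical:

import spark.implicits._
case class Person(name: String, age: Long)
// an immutable, type-safe distributed collection of Person objects
val ds = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()
// the lambda is type-checked at compile time
ds.filter(_.age > 30).show()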


Other answers

A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (JSON, Parquet, ...), Spark is able to infer a schema by making a pass over the entire dataset that is being loaded. Then, when calculating the execution plan, Spark can use the schema and perform substantially better computation optimizations. Note that DataFrame was called SchemaRDD before Spark v1.3.0.
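For example, assuming a SparkSession named spark and a hypothetical people.json file:

// Spark scans the JSON input and infers column names and types
val df = spark.read.json("people.json") // hypothetical file
df.printSchema() // the inferred schema Catalyst will optimize against
val rows = df.rdd // the underlying RDD[Row] is still accessible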

From a usage point of view, RDD vs DataFrame:

RDDs are amazing, as they give us all the flexibility to deal with almost any kind of data: unstructured, semi-structured, and structured. Since data is often not ready to fit into a DataFrame (even JSON), RDDs can be used to preprocess the data so that it can fit in a DataFrame. RDDs are the core data abstraction in Spark.

Not all transformations that are possible on an RDD are possible on a DataFrame; for example, subtract() is for RDDs while except() is for DataFrames. Since DataFrames are like a relational table, they follow strict rules when applying set/relational-theory transformations. For example, if you want to union two DataFrames, the requirement is that both have the same number of columns with matching column datatypes; the column names can be different. These rules don't apply to RDDs (see the sketch after this paragraph). Here is a good tutorial explaining these facts.

There are performance gains when using DataFrames, as others have already explained in depth. Using DataFrames you don't need to pass an arbitrary function as you do when programming with RDDs.

You need the SQLContext/HiveContext to program DataFrames, as they live in the SparkSQL area of the Spark ecosystem, but for an RDD you only need the SparkContext/JavaSparkContext, which lives in the Spark Core libraries. You can create a DataFrame from an RDD if you can define a schema for it. You can also convert a DataFrame to an RDD and an RDD to a DataFrame.
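A sketch of the union rule and the subtract()/except() split, assuming a SparkSession named spark (the data is made up):

import spark.implicits._
val df1 = Seq((1, "a"), (2, "b")).toDF("id", "tag")
// different column names but the same arity and types, so union is allowed
val df2 = Seq((2, "b"), (3, "c")).toDF("key", "label")
val unioned = df1.union(df2)            // resolves columns by position
val diff    = df1.except(df2)           // the DataFrame counterpart of subtract()
val rddDiff = df1.rdd.subtract(df2.rdd) // the RDD-level equivalent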

I hope this helps!

Apache Spark - RDD, DataFrame and DataSet

Spark RDD –

RDD stands for Resilient Distributed Dataset. It is a read-only partitioned collection of records. RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner, thus speeding up tasks.

Spark DataFrame –

Unlike an RDD, data is organized into named columns, like a table in a relational database. It is an immutable distributed collection of data. A DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction.

Spark DataSet –

A Dataset in Apache Spark is an extension of the DataFrame API which provides a type-safe, object-oriented programming interface. A Dataset takes advantage of Spark's Catalyst optimizer by exposing expressions and data fields to the query planner.

A DataFrame is an RDD of Row objects, each representing a record. A DataFrame also knows the schema (i.e., the data fields) of its rows. While DataFrames look like regular RDDs, internally they store data in a more efficient manner, taking advantage of their schema. In addition, they provide new operations not available on RDDs, such as the ability to run SQL queries. DataFrames can be created from external data sources, from the results of queries, or from regular RDDs.
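For instance, the SQL capability mentioned above could look like this, assuming a SparkSession named spark and a DataFrame df with name and age columns:

// register the DataFrame so SQL can see it as a table
df.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age > 21")
adults.show()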

Reference: Zaharia M., et al. Learning Spark (O'Reilly, 2015)

First of all, DataFrame evolved from SchemaRDD.

Yes.. conversion between a DataFrame and an RDD is absolutely possible.

Below are some sample code snippets.

df.rdd is an RDD[Row]

Below are some of the options to create a DataFrame.

1) yourRddOfRow.toDF converts it to a DataFrame.
2) Using createDataFrame of the SQL context: val df = spark.createDataFrame(rddOfRow, schema)

where the schema can come from some of the options below, as described in a nice SO post:

From a Scala case class and the Scala reflection API:

import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[YourScalaCaseClass].dataType.asInstanceOf[StructType]

OR using Encoders:

import org.apache.spark.sql.Encoders
val mySchema = Encoders.product[MyCaseClass].schema

The schema can also be created using StructType and StructField:

val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("col1", DoubleType, true))
  .add(StructField("col2", DoubleType, true))
  // etc...
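Putting the pieces together, a round-trip sketch (assuming a SparkSession named spark; the schema and data are made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("col1", DoubleType, true))
val rddOfRow = spark.sparkContext.parallelize(Seq(Row("a", 1.0), Row("b", 2.0)))
val df = spark.createDataFrame(rddOfRow, schema) // RDD[Row] -> DataFrame
val backToRdd = df.rdd                           // DataFrame -> RDD[Row]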

In fact, there are now 3 Apache Spark APIs..

RDD API:

The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release. The RDD API provides many transformation methods, such as map(), filter(), and reduce() for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

Example of RDD:

rdd.filter(_.age > 21)              // transformation
   .map(_.last)                     // transformation
   .saveAsObjectFile("under21.bin") // action

Example: filter by attribute with RDD

rdd.filter(_.age > 21)

DataFrame API

Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative, which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization. The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark's Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans.

Example SQL style:

df.filter("age > 21");
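Since a DataFrame is really a query plan, you can inspect what Catalyst built; a sketch assuming the df above:

// prints the logical and physical plans the optimizer produced
df.filter("age > 21").explain(true)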

Limitations: Because the code refers to data attributes by name, it is not possible for the compiler to catch errors. If an attribute name is incorrect, the error is only detected at runtime, when the query plan is created.

Another downside of the DataFrame API is that it is very Scala-centric, and while it does support Java, the support is limited.

For example, when creating a DataFrame from an existing RDD of Java objects, Spark's Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.
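A sketch of the case-class fix, assuming a SparkSession named spark (the Person class is made up):

import spark.implicits._
// case classes implement scala.Product, so Catalyst can derive the schema
case class Person(name: String, age: Int)
val peopleRdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29)))
val df = peopleRdd.toDF() // schema inferred from the case class fields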

Dataset API

The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds; the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API. When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release. Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant.
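A small sketch of an encoder at work, assuming a SparkSession named spark (MyCaseClass is made up):

import org.apache.spark.sql.Encoders
case class MyCaseClass(id: String, value: Double)
// the built-in product encoder maps between JVM objects and Spark's binary format
val enc = Encoders.product[MyCaseClass]
val ds = spark.createDataset(Seq(MyCaseClass("a", 1.0)))(enc)
ds.show()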

Example Dataset API SQL style:

dataset.filter(_.age < 21);

How a DataFrame and a DataSet are evaluated differs:

[Image: Catalyst-level flow, from the "Demystifying DataFrame and Dataset" Spark Summit presentation]

Further reading... the Databricks article "A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets"