I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (as of Spark 2.0.0, DataFrame is just a type alias for Dataset[Row]).
Can you convert one to the other?
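Current answer
Because a DataFrame is weakly typed, developers don't get the benefits of the type system. For example, say you want to read some data (here, from Parquet files) and run an aggregation on it: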
import org.apache.spark.sql.functions.{avg, max}

val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")
people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))
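When you write people("deptId"), you don't get back an Int or a Long; you get back a Column object that you then have to operate on. In a language with a rich type system such as Scala, you end up losing all type safety, which increases the number of run-time errors for things that could have been caught at compile time.
Dataset[T], by contrast, is typed. When you do:
val people: Dataset[People] = sqlContext.read.parquet("...").as[People]
you actually get back a Dataset of People objects, where deptId is an actual integral type and not a Column, so you can take advantage of the type system.
As of Spark 2.0, the DataFrame and Dataset APIs are unified, and DataFrame is a type alias for Dataset[Row].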
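A minimal sketch of the typed and untyped views side by side, assuming Spark 2.x or later on the classpath and a local master. The People case class, its fields, and the sample rows are hypothetical and exist only for this illustration; the statements are written script-style, as you would paste them into spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class People(name: String, deptId: Long, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("typed-vs-untyped").getOrCreate()
import spark.implicits._

// Untyped view: columns are resolved by name at run time.
val df: DataFrame = Seq(("Ann", 1L, 35), ("Bob", 2L, 28)).toDF("name", "deptId", "age")

// DataFrame is a type alias for Dataset[Row], so no conversion is needed here.
val rows: Dataset[Row] = df

// Typed view: each element is a People instance with ordinary Scala fields.
val people: Dataset[People] = df.as[People]
val deptIds: Array[Long] = people.filter(p => p.age > 30).map(p => p.deptId).collect()
```

A misspelled field such as p.depId in the lambda would be rejected by the compiler, whereas a misspelled column name in df("depId") would only fail once the query is analyzed at run time.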
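Other answers
A DataFrame is defined well by a Google search for "DataFrame definition":
A data frame is a table, or a two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
So, because of its tabular format, a DataFrame carries additional metadata, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset. It is more of a black box of data that cannot be optimized, because the operations that can be performed against it are not as constrained.
However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.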
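The round trip described above can be sketched as follows. This assumes Spark 2.x or later with a local master; the column names and sample values are made up for the example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("rdd-df-round-trip").getOrCreate()
import spark.implicits._ // brings toDF into scope for RDDs of tuples and case classes

// An RDD of tuples is tabular enough for toDF to derive a schema.
val pairs: RDD[(String, Int)] = spark.sparkContext.parallelize(Seq(("A", 10), ("B", 20)))

// RDD -> DataFrame, naming the columns explicitly.
val df: DataFrame = pairs.toDF("name", "age")

// DataFrame -> RDD; the elements come back as generic Row objects.
val rows: RDD[Row] = df.rdd
val names: Array[String] = rows.map(_.getString(0)).collect()
```

Note that the way back yields RDD[Row], not RDD[(String, Int)]; the original element type has to be rebuilt by hand or through a typed Dataset.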
Apache Spark provides three types of APIs:
RDD, DataFrame, and Dataset
Here is an API comparison between RDD, DataFrame, and Dataset.
RDD
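The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDD features: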
Distributed collection: RDDs use MapReduce-style operations, a model widely adopted for processing and generating large datasets with a parallel, distributed algorithm on a cluster. This lets users write parallel computations with a set of high-level operators, without having to worry about work distribution and fault tolerance.
Immutable: RDDs are composed of a collection of records that are partitioned. A partition is the basic unit of parallelism in an RDD; each partition is one logical division of the data, is immutable, and is created through transformations on existing partitions. Immutability helps to achieve consistency in computations.
Fault tolerant: If we lose a partition of an RDD, we can replay the transformations on that partition from its lineage to arrive at the same result, rather than replicating data across multiple nodes. This is the biggest benefit of RDDs, because it saves a lot of effort in data management and replication and so makes computations faster.
Lazy evaluation: All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
Functional transformations: RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
Data processing formats: RDDs can easily and efficiently process both structured and unstructured data.
Programming languages supported: The RDD API is available in Java, Scala, Python and R.
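To make the lazy-evaluation and transformation/action points concrete, here is a small sketch assuming Spark 2.x or later and a local master. The accumulator is only a device for observing when the map function runs; accumulator updates made inside transformations can be applied more than once if a task is retried, so treat the counter as illustrative rather than exact.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-lazy-eval").getOrCreate()
val sc = spark.sparkContext

// Counts how many elements the map function has actually processed.
val processed = sc.longAccumulator("processed")

val numbers = sc.parallelize(1 to 4)

// Transformation: only recorded in the lineage, nothing runs yet.
val doubled = numbers.map { n => processed.add(1); n * 2 }
val seenBeforeAction: Long = processed.value // still 0, no job has run

// Action: triggers the computation and returns a value to the driver.
val total = doubled.reduce(_ + _)
val seenAfterAction: Long = processed.value

// The lineage Spark would replay to rebuild a lost partition.
println(doubled.toDebugString)
```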
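RDD limitations:
No built-in optimization engine: When working with structured data, RDDs cannot take advantage of Spark's advanced optimizers, including the Catalyst optimizer and the Tungsten execution engine. Developers need to optimize each RDD based on its attributes.
Handling structured data: Unlike DataFrames and Datasets, RDDs don't infer the schema of the ingested data; the user has to specify it.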
Dataframes
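Spark introduced DataFrames in the Spark 1.3 release. The DataFrame overcomes the key challenges that RDDs had.
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python. Along with the DataFrame, Spark also introduced the Catalyst optimizer, which leverages advanced programming-language features to build an extensible query optimizer.
DataFrame features: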
Distributed collection of Row objects: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
Data processing: It handles structured and unstructured data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.), and can read from and write to all of these data sources.
Optimization using the Catalyst optimizer: Catalyst powers both SQL queries and the DataFrame API. A DataFrame goes through the Catalyst tree transformation framework in four phases: 1. analyzing a logical plan to resolve references, 2. logical plan optimization, 3. physical planning, 4. code generation to compile parts of the query to Java bytecode.
Hive compatibility: Using Spark SQL, you can run unmodified Hive queries on your existing Hive warehouses. It reuses the Hive frontend and MetaStore and gives you full compatibility with existing Hive data, queries, and UDFs.
Tungsten: Tungsten provides a physical execution backend which explicitly manages memory and dynamically generates bytecode for expression evaluation.
Programming languages supported: The DataFrame API is available in Java, Scala, Python, and R.
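The Catalyst phases listed above can be inspected on any query. The following sketch assumes Spark 2.x or later with a local master; the sample table and column names are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("catalyst-explain").getOrCreate()
import spark.implicits._

val people = Seq(("Ann", 35, "F"), ("Bob", 28, "M")).toDF("name", "age", "gender")
val query = people.filter($"age" > 30).select($"name")

// Prints the parsed, analyzed and optimized logical plans plus the physical plan.
query.explain(true)

// The same stages are also reachable programmatically.
val analyzedPlan  = query.queryExecution.analyzed
val optimizedPlan = query.queryExecution.optimizedPlan
val physicalPlan  = query.queryExecution.executedPlan

val result: Array[String] = query.collect().map(_.getString(0))
```

Nothing is executed until collect() is called; explain only shows what Spark plans to do.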
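DataFrame limitations:
Compile-time type safety: As discussed, the DataFrame API does not support compile-time safety, which limits you when manipulating data whose structure is not known. The following example compiles without problems; however, you will get a runtime exception when executing this code.
Example: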
case class Person(name : String , age : Int)
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 10000").show()
// => throws Exception: cannot resolve 'salary' given input age, name
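This is challenging, especially when you are working with several transformation and aggregation steps.
Can't operate on domain objects (lost domain object): Once you have transformed domain objects into a DataFrame, you cannot regenerate them from it directly. In the following example, once we have created personDF from personRDD, we won't get back the original RDD of the Person class (RDD[Person]).
Example: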
case class Person(name : String , age : Int)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
val personDF = sqlContext.createDataFrame(personRDD)
personDF.rdd // returns RDD[Row], not RDD[Person]
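If you do need the domain objects back from a plain DataFrame, one option is to rebuild them from each Row by hand. This is a sketch under the assumption of Spark 2.x or later and a local master; it reuses the Person shape from the example above and is one possible workaround, not the only one.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("row-to-domain-object").getOrCreate()
import spark.implicits._

val personRDD: RDD[Person] = spark.sparkContext.makeRDD(Seq(Person("A", 10), Person("B", 20)))
val personDF = personRDD.toDF()

// personDF.rdd is an RDD[Row]; rebuild the domain objects field by field.
val restored: RDD[Person] = personDF.rdd.map { row =>
  Person(row.getAs[String]("name"), row.getAs[Int]("age"))
}
```

The field names and types are repeated as strings and type parameters, so a mismatch is only detected at run time, which is exactly the weakness this limitation describes.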
Datasets
The Dataset API is an extension to DataFrames that provides a type-safe, object-oriented programming interface. A Dataset is a strongly typed, immutable collection of objects that are mapped to a relational schema. At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and the tabular representation. The tabular representation is stored using Spark's internal Tungsten binary format, allowing for operations on serialized data and improved memory utilization. Spark 1.6 comes with support for automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.
Dataset features:
Provides the best of both RDD and DataFrame: RDD (functional programming, type safety) and DataFrame (relational model, query optimization, Tungsten execution, sorting and shuffling).
Encoders: With the use of Encoders, it is easy to convert any JVM object into a Dataset, allowing users to work with both structured and unstructured data, unlike with DataFrames.
Programming languages supported: The Datasets API is currently only available in Scala and Java. Python and R are not supported in version 1.6; Python support is slated for version 2.0.
Type safety: The Datasets API provides compile-time safety, which was not available in DataFrames. In the example below, we can see how a Dataset can operate on domain objects with compiled lambda functions.
Example:
case class Person(name : String , age : Int)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
val personDF = sqlContext.createDataFrame(personRDD)
val ds: Dataset[Person] = personDF.as[Person]
ds.filter(p => p.age > 25)
ds.filter(p => p.salary > 25)
// compile error: value salary is not a member of Person
ds.rdd // returns RDD[Person]
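Interoperable: Datasets let you easily convert existing RDDs and DataFrames into Datasets without boilerplate code.
Dataset API limitations:
Requires type casting to String: Querying data from Datasets currently requires us to specify the fields of the class as strings. Once we have queried the data, we are forced to cast each column to the required data type. On the other hand, if we use a map operation on Datasets, it will not use the Catalyst optimizer.
Example: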
ds.select(col("name").as[String], $"age".as[Int]).collect()
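No support for Python and R: As of release 1.6, Datasets only support Scala and Java. Python support is slated for Spark 2.0.
The Datasets API brings better type safety and functional-programming benefits over the existing RDD and DataFrame APIs. However, with the type-casting requirement in the API, you still do not get the required type safety, and your code becomes brittle.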
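The trade-off behind the type-casting limitation can be seen by writing the same projection both ways. This is a sketch assuming Spark 2.x or later and a local master; the Person class and sample data are illustrative.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("typed-select-vs-map").getOrCreate()
import spark.implicits._

val ds: Dataset[Person] = Seq(Person("A", 10), Person("B", 20)).toDS()

// Column-based select: field names are strings, so each column is cast back with as[T].
// A typo in "name" or a wrong target type is only detected at run time.
val viaSelect: Dataset[(String, Int)] = ds.select($"name".as[String], $"age".as[Int])

// Lambda-based map: field access is checked by the compiler,
// but the function body is opaque to the Catalyst optimizer.
val viaMap: Dataset[(String, Int)] = ds.map(p => (p.name, p.age))
```

Both produce the same rows; the difference lies in where mistakes are caught and how much of the query the optimizer can see.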
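A DataFrame is equivalent to a table in an RDBMS and can also be manipulated in ways similar to the "native" distributed collections in RDDs. Unlike RDDs, DataFrames keep track of the schema and support various relational operations that lead to more optimized execution. Each DataFrame object represents a logical plan, but because of their "lazy" nature no execution occurs until the user calls a specific "output operation".
Most of the answers are correct; I only want to add one point here.
In Spark 2.0 the two APIs (DataFrame + Dataset) are unified into a single API.
"Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface."
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmitting over the network.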
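As a rough illustration of what an Encoder carries, the sketch below builds one explicitly instead of relying on implicits. It assumes Spark 2.x or later and a local master; Person is a made-up case class.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("encoder-sketch").getOrCreate()

// An explicit encoder for the case class; importing spark.implicits._
// would normally supply an equivalent one implicitly.
val personEncoder: Encoder[Person] = Encoders.product[Person]

// The encoder knows the relational schema that Person maps to.
println(personEncoder.schema.treeString)

// Passing the encoder by hand instead of relying on implicit resolution.
val ds: Dataset[Person] =
  spark.createDataset(Seq(Person("A", 10), Person("B", 20)))(personEncoder)
```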
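Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until run time.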
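The two approaches can be sketched side by side as follows, assuming Spark 2.x or later and a local master. The Person class, the field names, and the nullability flags in the hand-built schema are choices made for this example.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("rdd-to-dataset-two-ways").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

// 1) Reflection: the schema is inferred from the case class fields.
val byReflection: DataFrame = sc.parallelize(Seq(Person("A", 10), Person("B", 20))).toDF()

// 2) Programmatic: build the schema at run time and apply it to an RDD[Row].
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)
))
val rowRDD = sc.parallelize(Seq(Row("A", 10), Row("B", 20)))
val bySchema: DataFrame = spark.createDataFrame(rowRDD, schema)
```

In the programmatic variant the schema could just as well be assembled from a configuration file or a header line, which is the case the reflection-based variant cannot cover.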
Here you can find the answer for RDD to DataFrame conversion:
How to convert rdd object to dataframe in spark
Spark RDD (resilient distributed dataset):
RDD is the core data abstraction API and has been available since the very first release of Spark (Spark 1.0). It is a lower-level API for manipulating distributed collections of data. The RDD API exposes some extremely useful methods which can be used to get very tight control over the underlying physical data structure. An RDD is an immutable (read-only) collection of partitioned data distributed across different machines. RDDs enable in-memory computation on large clusters to speed up big data processing in a fault-tolerant manner. To enable fault tolerance, RDDs use a DAG (Directed Acyclic Graph), which consists of a set of vertices and edges. The vertices and edges in the DAG represent the RDDs and the operations to be applied to them, respectively. The transformations defined on an RDD are lazy and execute only when an action is called.
Spark DataFrame
Spark 1.3 introduced a new data abstraction API: the DataFrame. The DataFrame API organizes the data into named columns, like a table in a relational database. It enables programmers to define a schema on a distributed collection of data. Each row in a DataFrame is an object of type Row. As in an SQL table, every column in a DataFrame must have the same number of rows. In short, a DataFrame is a lazily evaluated plan which specifies the operations that need to be performed on the distributed collection of data. A DataFrame is also an immutable collection.
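Spark Dataset:
As an extension to the DataFrame API, Spark 1.6 introduced the Dataset API, which provides a strictly typed, object-oriented programming interface in Spark. A Dataset is an immutable, type-safe collection of distributed data. Like the DataFrame API, the Dataset API also uses the Catalyst engine to enable execution optimization.
Other differences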