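What is the difference between map and flatMap, and what is a good use case for each?

Can someone explain to me the difference between map and flatMap, and what a good use case for each would be?

What does "flatten the results" mean? What is it good for?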

Current answer

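map: returns a new RDD by applying a function to each element of the RDD. The function passed to map returns exactly one item per input element.

flatMap: like map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.

In addition, the function passed to flatMap can return a list of elements (zero or more).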

For example:

sc.parallelize([3,4,5]).map(lambda x: list(range(1,x))).collect()

Output: [[1, 2], [1, 2, 3], [1, 2, 3, 4]]

sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()

Output: note that the result is a single list: [1, 2, 1, 2, 3, 1, 2, 3, 4]
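
The same behaviour can be reproduced without a Spark cluster. This is only a minimal plain-Python sketch of the semantics, not how Spark implements them; `rdd_map` and `rdd_flat_map` are hypothetical helper names that operate on an ordinary list.

```python
from itertools import chain

def rdd_map(func, items):
    """Mimic RDD.map: exactly one output element per input element."""
    return [func(x) for x in items]

def rdd_flat_map(func, items):
    """Mimic RDD.flatMap: func returns an iterable and the results are concatenated."""
    return list(chain.from_iterable(func(x) for x in items))

nested = rdd_map(lambda x: list(range(1, x)), [3, 4, 5])
flat = rdd_flat_map(lambda x: range(1, x), [3, 4, 5])

print(nested)  # [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
print(flat)    # [1, 2, 1, 2, 3, 1, 2, 3, 4]
```

The only difference between the two helpers is the concatenation step, which is what "flattening" refers to.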

Source: https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey/

2018-06-26 22:45:47

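Other answers

map returns an RDD with the same number of elements as its input, while flatMap may not.

An example use case for flatMap is filtering out missing or incorrect data.

An example use case for map is any of the wide variety of situations where the number of input and output elements is the same.

number.csv

map.py adds all the numbers in add.csv.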

from operator import add

def f(row):
  try:
    return float(row)
  except Exception:
    return 0

rdd = sc.textFile('a.csv').map(f)

print(rdd.count())      # 7
print(rdd.reduce(add))  # 15.0

flatMap.py uses flatMap to filter out the missing data before adding. Fewer numbers are added than in the previous version.

from operator import add

def f(row):
  try:
    return [float(row)]
  except Exception:
    return []

rdd = sc.textFile('a.csv').flatMap(f)

print(rdd.count())      # 5
print(rdd.reduce(add))  # 15.0
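
To try the two versions side by side without Spark, here is a plain-Python sketch. The file contents are not shown above, so the `rows` list below is a hypothetical input (five numbers and two "-" placeholders for missing values) chosen only because it is consistent with the counts printed above.

```python
from functools import reduce
from itertools import chain
from operator import add

# Hypothetical file contents, one string per line of the CSV.
rows = ["1", "2", "3", "-", "4", "-", "5"]

def to_float_or_zero(row):
    """map-style: every row yields exactly one value, bad rows become 0."""
    try:
        return float(row)
    except ValueError:
        return 0.0

def to_float_or_nothing(row):
    """flatMap-style: a bad row yields an empty list and so disappears."""
    try:
        return [float(row)]
    except ValueError:
        return []

mapped = [to_float_or_zero(r) for r in rows]
flat_mapped = list(chain.from_iterable(to_float_or_nothing(r) for r in rows))

print(len(mapped), reduce(add, mapped))            # 7 15.0
print(len(flat_mapped), reduce(add, flat_mapped))  # 5 15.0
```

The sums agree, but the element counts do not, so anything that depends on the count (an average, for instance) would differ between the two versions.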

2016-02-14 23:20:29

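It boils down to your initial question: what do you mean by flattening?

When you use flatMap, a "multi-dimensional" collection becomes a "one-dimensional" collection.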

val array1d = Array("1,2,3", "4,5,6", "7,8,9")
// array1d is an array of strings

val array2d = array1d.map(x => x.split(","))
// array2d will be: Array(Array("1", "2", "3"), Array("4", "5", "6"), Array("7", "8", "9"))

val flatArray = array1d.flatMap(x => x.split(","))
// flatArray will be: Array("1", "2", "3", "4", "5", "6", "7", "8", "9")

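You want to use flatMap when

the result of your map function is a multi-layered structure, but all you want is a simple, flat, one-dimensional structure with all the internal groupings removed.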
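
One detail worth knowing is that only a single level of grouping is removed. The plain-Python sketch below illustrates this with the same strings; `flat_map` is a hypothetical stand-in for the Spark or Scala method, not their actual implementation.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to every item and concatenate the iterables it returns."""
    return list(chain.from_iterable(func(x) for x in items))

rows = ["1,2,3", "4,5,6", "7,8,9"]

# Each split produces one list per row; flat_map merges them into one list.
flat = flat_map(lambda row: row.split(","), rows)

# If func wraps its result in an extra list, that inner grouping survives,
# because only the outermost level is flattened.
still_nested = flat_map(lambda row: [row.split(",")], rows)

print(flat)          # ['1', '2', '3', '4', '5', '6', '7', '8', '9']
print(still_nested)  # [['1', '2', '3'], ['4', '5', '6'], ['7', '8', '9']]
```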

2018-03-03 07:04:54

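The word count example is the usual one in Hadoop. I will take the same use case, apply both map and flatMap to it, and look at how each one processes the data.

Below is the sample data file.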

hadoop is fast
hive is sql on hdfs
spark is superfast
spark is awesome

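The file above will be parsed using map and flatMap.

Using map

>>> wc = data.map(lambda line: line.split(" "))
>>> wc.collect()
[[u'hadoop', u'is', u'fast'], [u'hive', u'is', u'sql', u'on', u'hdfs'], [u'spark', u'is', u'superfast'], [u'spark', u'is', u'awesome']]

The input has 4 lines and the output also has 4 elements (one list of words per line), i.e., N elements ==> N elements.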

Using flatMap

>>> fm = data.flatMap(lambda line: line.split(" "))
>>> fm.collect()
[u'hadoop', u'is', u'fast', u'hive', u'is', u'sql', u'on', u'hdfs', u'spark', u'is', u'superfast', u'spark', u'is', u'awesome']

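The output is different from that of map: it is a single flat list of words.

Let's assign 1 as the value for each key to get the word count.

fm: RDD created using flatMap
wc: RDD created using map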

>>> fm.map(lambda word : (word,1)).collect()
[(u'hadoop', 1), (u'is', 1), (u'fast', 1), (u'hive', 1), (u'is', 1), (u'sql', 1), (u'on', 1), (u'hdfs', 1), (u'spark', 1), (u'is', 1), (u'superfast', 1), (u'spark', 1), (u'is', 1), (u'awesome', 1)]

However, flatMap on the RDD wc gives the following undesired output:

>>> wc.flatMap(lambda word : (word,1)).collect()
[[u'hadoop', u'is', u'fast'], 1, [u'hive', u'is', u'sql', u'on', u'hdfs'], 1, [u'spark', u'is', u'superfast'], 1, [u'spark', u'is', u'awesome'], 1]
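
The reason for this output is that flatMap iterates over whatever the function returns, and a tuple is itself an iterable of two items. A plain-Python sketch of that behaviour follows; `flat_map` is a hypothetical stand-in, not PySpark's implementation, and the data is shortened to two lines.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to every item and concatenate the iterables it returns."""
    return list(chain.from_iterable(func(x) for x in items))

wc = [["hadoop", "is", "fast"], ["spark", "is", "awesome"]]

# The returned tuple (words, 1) is unpacked into its two members.
unpacked = flat_map(lambda words: (words, 1), wc)

# Wrapping the tuple in a list keeps it together as one output element.
kept = flat_map(lambda words: [(words, 1)], wc)

print(unpacked)  # [['hadoop', 'is', 'fast'], 1, ['spark', 'is', 'awesome'], 1]
print(kept)      # [(['hadoop', 'is', 'fast'], 1), (['spark', 'is', 'awesome'], 1)]
```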

You cannot get the word count if map is used instead of flatMap to split the lines.
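
To finish the word count, the (word, 1) pairs still have to be summed per key. In PySpark that step is typically done with `reduceByKey`; the sketch below imitates the whole flatMap, map, reduce-by-key pipeline in plain Python with a dictionary so that it runs without Spark. The variable names are illustrative only.

```python
from collections import defaultdict
from itertools import chain

lines = [
    "hadoop is fast",
    "hive is sql on hdfs",
    "spark is superfast",
    "spark is awesome",
]

# flatMap step: one flat sequence of words across all lines.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map step: pair every word with a count of 1.
pairs = [(word, 1) for word in words]

# reduce-by-key step: add up the counts of identical words.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(len(words))                                        # 14
print(counts["is"], counts["spark"], counts["hadoop"])   # 4 2 1
```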

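By definition, the difference between map and flatMap is:

map: returns a new RDD by applying the given function to each element of the RDD. The function passed to map returns exactly one item.

flatMap: like map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.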

2016-01-15 23:12:10

Use test.md as an example:

➜  spark-1.6.1 cat test.md
This is the first line;
This is the second line;
This is the last line.

scala> val textFile = sc.textFile("test.md")
scala> textFile.map(line => line.split(" ")).count()
res2: Long = 3

scala> textFile.flatMap(line => line.split(" ")).count()
res3: Long = 15

scala> textFile.map(line => line.split(" ")).collect()
res0: Array[Array[String]] = Array(Array(This, is, the, first, line;), Array(This, is, the, second, line;), Array(This, is, the, last, line.))

scala> textFile.flatMap(line => line.split(" ")).collect()
res1: Array[String] = Array(This, is, the, first, line;, This, is, the, second, line;, This, is, the, last, line.)

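If you use the map method, count() gives you the number of lines in test.md; with the flatMap method, it gives you the number of words.

The map method is similar to flatMap in that both return a new RDD. map is typically used to transform each element into exactly one new element, while flatMap is typically used for tasks such as splitting lines into words.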
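
The two counts can be checked without Spark. This is a plain-Python sketch using the three lines of test.md shown above; the list comprehension plays the role of map and `chain.from_iterable` plays the role of flatMap.

```python
from itertools import chain

lines = [
    "This is the first line;",
    "This is the second line;",
    "This is the last line.",
]

# Like map: one list of words per line, so the length is the line count.
per_line = [line.split(" ") for line in lines]

# Like flatMap: all words in one list, so the length is the word count.
all_words = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(per_line))   # 3
print(len(all_words))  # 15
```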

2016-06-17 07:41:27

For everyone who wants a PySpark-related example:

Example transformation: flatMap

>>> a="hello what are you doing"
>>> a.split()

['hello', 'what', 'are', 'you', 'doing']

>>> b=["hello what are you doing","this is rak"]
>>> b.split()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'split'

>>> rline=sc.parallelize(b)
>>> type(rline)

>>> def fwords(x):
...     return x.split()


>>> rword=rline.map(fwords)
>>> rword.collect()

[['hello', 'what', 'are', 'you', 'doing'], ['this', 'is', 'rak']]

>>> rwordflat=rline.flatMap(fwords)
>>> rwordflat.collect()

['hello', 'what', 'are', 'you', 'doing', 'this', 'is', 'rak']

Hope it helps.

2017-05-01 16:55:23
