在闭包外部调用函数时出现奇怪的行为:
当函数在一个对象中时,一切都在工作
当函数在类中,get:
任务不可序列化:java.io.NotSerializableException:测试
问题是我需要在类而不是对象中编写代码。知道为什么会这样吗?Scala对象是否序列化(默认?)?
这是一个工作代码示例:
object working extends App {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
//calling function outside closure
val after = rddList.map(someFunc(_))
def someFunc(a:Int) = a+1
after.collect().map(println(_))
}
这是一个无效的例子:
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable wont help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the fucntion someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
def upper(name: String) : String = {
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
这会产生错误
sparkexception:任务不可序列化
org.apache.spark.util.ClosureCleaner .ensureSerializable美元(ClosureCleaner.scala: 304)
解决方案-
import java.io.Serializable;
object obj_upper extends Serializable {
def upper(name: String) : String =
{
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
}
val df_upperName=
df_name.withColumn("name",obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)
我也遇到过类似的问题,从Grega的回答中我理解到
object NOTworking extends App {
new testing().doIT
}
//adding extends Serializable wont help
class testing {
val list = List(1,2,3)
val rddList = Spark.ctx.parallelize(list)
def doIT = {
//again calling the fucntion someFunc
val after = rddList.map(someFunc(_))
//this will crash (spark lazy)
after.collect().map(println(_))
}
def someFunc(a:Int) = a+1
}
你的doIT方法试图序列化someFunc(_)方法,但由于方法是不可序列化的,它试图序列化类测试,这也是不可序列化的。
为了让你的代码工作,你应该在doIT方法中定义someFunc。例如:
def doIT = {
def someFunc(a:Int) = a+1
//function definition
}
val after = rddList.map(someFunc(_))
after.collect().map(println(_))
}
如果有多个函数进入图中,那么所有这些函数都应该对父上下文可用。
在Spark 2.4中,很多人可能会遇到这个问题。Kryo序列化已经变得更好,但在许多情况下,您不能使用spark.kryo.unsafe=true或幼稚的Kryo序列化器。
为了快速修复,请尝试在Spark配置中更改以下内容
spark.kryo.unsafe="false"
OR
spark.serializer="org.apache.spark.serializer.JavaSerializer"
我通过使用显式广播变量和新的内置twitter-chill api来修改我遇到或亲自编写的自定义RDD转换,将它们从RDD转换过来。将row =>映射到rdd。mapPartitions(partition =>{函数。
例子
老方法(不太好)
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val outputRDD = rdd.map(row => {
val value = sampleMap.get(row._1)
value
})
替代(更好的)方式
import com.twitter.chill.MeatLocker
val sampleMap = Map("index1" -> 1234, "index2" -> 2345)
val brdSerSampleMap = spark.sparkContext.broadcast(MeatLocker(sampleMap))
rdd.mapPartitions(partition => {
val deSerSampleMap = brdSerSampleMap.value.get
partition.map(row => {
val value = sampleMap.get(row._1)
value
}).toIterator
})
这种新方法将每个分区只调用广播变量一次,这更好。如果不注册类,仍然需要使用Java Serialization。
def upper(name: String) : String = {
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
val emp_details = """[{"id": "1","name": "James Butt","country": "USA"},
{"id": "2", "name": "Josephine Darakjy","country": "USA"},
{"id": "3", "name": "Art Venere","country": "USA"},
{"id": "4", "name": "Lenna Paprocki","country": "USA"},
{"id": "5", "name": "Donette Foller","country": "USA"},
{"id": "6", "name": "Leota Dilliard","country": "USA"}]"""
val df_emp = spark.read.json(Seq(emp_details).toDS())
val df_name=df_emp.select($"id",$"name")
val df_upperName= df_name.withColumn("name",toUpperName($"name")).filter("id='5'")
display(df_upperName)
这会产生错误
sparkexception:任务不可序列化
org.apache.spark.util.ClosureCleaner .ensureSerializable美元(ClosureCleaner.scala: 304)
解决方案-
import java.io.Serializable;
object obj_upper extends Serializable {
def upper(name: String) : String =
{
var uppper : String = name.toUpperCase()
uppper
}
val toUpperName = udf {(EmpName: String) => upper(EmpName)}
}
val df_upperName=
df_name.withColumn("name",obj_upper.toUpperName($"name")).filter("id='5'")
display(df_upperName)