ClassPath:
ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:
spark.driver.extraClassPath or its alias --driver-class-path to set extra classpaths on the node running the driver.
spark.executor.extraClassPath to set the extra classpath on the Worker nodes.
If you want a certain JAR to take effect on both the Master and the Worker, you have to specify it separately in both flags (a sketch follows right after this list).
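For instance, a minimal sketch (the JAR path and class name are hypothetical) that puts the same JAR on both the driver's and the executors' classpath:
spark-submit --conf "spark.driver.extraClassPath=/opt/libs/extra-dep.jar" \
 --conf "spark.executor.extraClassPath=/opt/libs/extra-dep.jar" \
 --class MyClass main-application.jar
Keep in mind that extraClassPath only prepends entries to the classpath; it does not copy the file anywhere, so the JAR must already exist at that path on every node (or be shipped separately, e.g. with --jars).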
Separation character:
Following the same rules as the JVM:
Linux: a colon, :
e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"
Windows: a semicolon, ;
e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"
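If you assemble the classpath string programmatically, for example in a small Scala launcher, you can avoid hard-coding the separator by using the JVM's own path separator; a minimal sketch, with placeholder JAR paths:
import java.io.File

// File.pathSeparator is ":" on Linux and ";" on Windows
val jars = Seq("/opt/prog/hadoop-aws-2.7.1.jar", "/opt/prog/aws-java-sdk-1.10.50.jar")
val extraClassPath = jars.mkString(File.pathSeparator)
println(s"--conf spark.driver.extraClassPath=$extraClassPath")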
File distribution:
This depends on the mode under which you're running your job:
Client mode - Spark fires up a Netty HTTP server which distributes the files at startup to each of the worker nodes. You can see that when you start your Spark job:
16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.mycode.jar with timestamp 1462728552732
16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767
Cluster mode - In cluster mode Spark selects a leader Worker node to execute the Driver process on. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JAR files available to all the worker nodes via HDFS, S3, or other sources which are available to all nodes.
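As a hedged sketch of that workflow (the HDFS paths are made up), you upload the JARs once to a location every node can reach and reference them by URI:
spark-submit --deploy-mode cluster \
 --jars hdfs:///libs/additional1.jar,hdfs:///libs/additional2.jar \
 --class MyClass hdfs:///apps/main-application.jar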
Accepted URIs for files
In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:
When using spark-submit, the application jar along with any jars
included with the --jars option will be automatically transferred to
the cluster. Spark uses the following URL scheme to allow different
strategies for disseminating jars:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP
file server, and every executor pulls the file from the driver HTTP
server.
hdfs:, http:, https:, ftp: - these pull down files and JARs
from the URI as expected
local: - a URI starting with local:/ is
expected to exist as a local file on each worker node. This means that
no network IO will be incurred, and works well for large files/JARs
that are pushed to each worker, or shared via NFS, GlusterFS, etc.
Note that JARs and files are copied to the working directory for each
SparkContext on the executor nodes.
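As a short, hypothetical illustration of those schemes: a large dependency pre-installed on every node referenced via local:, a shared one on HDFS, and a small one served from the driver via file: can all be mixed in a single --jars list:
spark-submit --jars local:/opt/libs/big-preinstalled-dep.jar,hdfs:///libs/shared-dep.jar,file:/home/me/small-dep.jar \
 --class MyClass main-application.jar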
As mentioned, the JAR files are copied to the working directory of each Worker node. Where exactly is that? It is usually under /var/run/spark/work, where you'll see them like this:
drwxr-xr-x 3 spark spark 4096 May 15 06:16 app-20160515061614-0027
drwxr-xr-x 3 spark spark 4096 May 15 07:04 app-20160515070442-0028
drwxr-xr-x 3 spark spark 4096 May 15 07:18 app-20160515071819-0029
drwxr-xr-x 3 spark spark 4096 May 15 07:38 app-20160515073852-0030
drwxr-xr-x 3 spark spark 4096 May 15 08:13 app-20160515081350-0031
drwxr-xr-x 3 spark spark 4096 May 18 17:20 app-20160518172020-0032
drwxr-xr-x 3 spark spark 4096 May 18 17:20 app-20160518172045-0033
And when you look inside, you'll see all the JAR files you deployed along:
[*@*]$ cd /var/run/spark/work/app-20160508173423-0014/1/
[*@*]$ ll
total 89988
-rwxr-xr-x 1 spark spark 801117 May 8 17:34 awscala_2.10-0.5.5.jar
-rwxr-xr-x 1 spark spark 29558264 May 8 17:34 aws-java-sdk-1.10.50.jar
-rwxr-xr-x 1 spark spark 59466931 May 8 17:34 com.mycode.code.jar
-rwxr-xr-x 1 spark spark 2308517 May 8 17:34 guava-19.0.jar
-rw-r--r-- 1 spark spark 457 May 8 17:34 stderr
-rw-r--r-- 1 spark spark 0 May 8 17:34 stdout
Affected options:
The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:
Any values specified as flags or in the properties file will be passed
on to the application and merged with those specified through
SparkConf. Properties set directly on the SparkConf take highest
precedence, then flags passed to spark-submit or spark-shell, then
options in the spark-defaults.conf file
So make sure you set those values in the proper place, so you won't be surprised when one takes priority over the other.
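A minimal Scala sketch of that precedence (the memory value is arbitrary): anything set directly on the SparkConf in code wins over the same key passed on the command line:
import org.apache.spark.{SparkConf, SparkContext}

// Even if the job was launched with --conf spark.executor.memory=4g,
// the value set here on the SparkConf takes precedence.
val conf = new SparkConf()
  .setAppName("precedence-demo")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)
println(sc.getConf.get("spark.executor.memory")) // prints 2g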
Let's analyze each option in question:
--jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and the other via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR file to your driver/executor classpath. You'll need to explicitly add them using the extraClassPath configuration on both.
SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes, which isn't a run-time dependency in your code (see the sketch after this list).
--conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases, and it doesn't matter which one you choose.
--conf spark.driver.extraLibraryPath=... or --driver-library-path ...: Same as above, aliases.
--conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an über JAR (for example, because there are compile time conflicts between library versions) and which you need to load at runtime.
--conf spark.executor.extraLibraryPath=... This is passed as the java.library.path option for the JVM. Use this when you need a library path visible to the JVM.
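To make the addJar vs. addFile distinction concrete, here is a hedged Scala sketch (the file names are hypothetical):
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("add-jar-vs-add-file"))

// addJar ships a code dependency to the cluster; as noted above, you still
// need extraClassPath on driver/executors if its classes must be on the classpath.
sc.addJar("/opt/libs/my-udfs.jar")

// addFile ships an arbitrary data file; each executor finds its local copy
// through SparkFiles.get.
sc.addFile("/opt/conf/lookup-table.csv")

sc.parallelize(1 to 3).foreach { _ =>
  val localPath = SparkFiles.get("lookup-table.csv")
  println(s"Executor sees the file at: $localPath")
}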
Would it be safe to assume that for simplicity, I can add additional
application jar files using the 3 main options at the same time:
You can safely assume this only for Client mode, not Cluster mode, as I've said before. Also, the example you gave has some redundant arguments. For example, passing JAR files to --driver-library-path is useless. You need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, what you want when you deploy external JAR files on both the driver and the workers is:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar