Because the Azkaban job depends on Spark, the built jar ends up being several hundred MB. Packaging it into a .zip and uploading it to Azkaban then hangs for a very long time, and sometimes fails outright.
So what can we do?
The jar is so large because pom.xml uses the shade plugin, which bundles every dependency, including spark, hadoop-client and the rest, into the jar.
The fix is simple: edit pom.xml and switch all hadoop/spark-related dependencies to provided scope. The code still compiles against them, but they are no longer packed into the jar at build time, like this:
<!-- Dependencies -->
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.0</version>
        <scope>provided</scope>
    </dependency>
    <!-- Spark dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.0.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.74</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.16</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.12</artifactId>
        <version>3.0.1</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>
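With the scopes changed, you can rebuild and do a quick sanity check that the Spark/Hadoop classes really are gone; the artifact name below is just a placeholder for whatever your build produces:
mvn clean package
# list the jar's contents; org/apache/spark entries should no longer appear
jar tf target/my-job.jar | grep 'org/apache/spark'
# the jar itself should drop from hundreds of MB to just a few MB
ls -lh target/my-job.jar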
Why does this work? Because the hadoop/spark installation on the servers already ships these jars, and when we launch our jar with spark-submit or the hadoop command, they are put on the classpath automatically (via -classpath).
Let's take a look at the server:
[root@10 spark]# pwd
/usr/local/service/spark
[root@10 spark]# ll jars/spark-hive_2.12-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop  693523 Jul 20 16:06 jars/spark-hive_2.12-3.0.0.jar
[root@10 spark]# ll jars/spark-sql_2.12-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop 7119016 Jul 14 11:59 jars/spark-sql_2.12-3.0.0.jar
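In other words, everything under the cluster's jars/ directory is already placed on the driver and executor classpath when the job starts, so a plain submit command is enough. The class and jar names below are made up for illustration:
# spark-sql, spark-hive, hadoop-client, ... come from the cluster installation,
# not from our artifact
spark-submit \
  --master yarn \
  --class com.example.MyAzkabanJob \
  my-job.jar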
Normally these jars are installed together with the hadoop/spark/hive environment. If they are missing, you can also copy them onto the Hadoop cluster by hand, so they never need to be bundled into our jar. (Note: jars that are not part of hadoop/spark/hive still have to be packed into your jar, unless you manually distribute those to the cluster as well.)
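For third-party jars like fastjson, another option besides bundling or pre-distributing them is to hand them to spark-submit at launch time; the path below is illustrative:
# --jars ships extra jars to the driver and executors for this run only
spark-submit \
  --master yarn \
  --class com.example.MyAzkabanJob \
  --jars /path/to/fastjson-1.2.74.jar \
  my-job.jar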
If this article helped you solve a problem at work, you can support my writing by clicking any ad on the page or sponsoring a small amount. Thank you!
