How to execute a Java JAR file from a PySpark application?


#1

Hi,

I have a very interesting scenario :).

I have a Java JAR application (written purely in Java; it is not a Spark application) that reads PDF files, extracts their content, and saves it to a destination folder. I need to run this JAR on a Hadoop cluster, using Spark to leverage the distributed environment. My source files are on HDFS, and the destination is also on HDFS.

How can I use this JAR from my PySpark application? Any clue or template would be highly appreciated.

Note: I am able to execute this JAR from my local directory:

java -jar jar-filename… [options] [local locations] (this worked fine)

If I point it at an HDFS location, however, it cannot read the source files. I searched around and found that I need to create a Spark context / Spark session, etc.
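
For reference, here is roughly what I put together from that advice, and this part works for me. The path is just a placeholder:

```python
# Minimal session setup; read the PDFs from HDFS as (path, bytes) pairs.
# The HDFS path below is a placeholder, not my real one.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdf-extract").getOrCreate()
sc = spark.sparkContext

# RDD of (file path, file content) for every PDF under the source directory
pdfs = sc.binaryFiles("hdfs:///user/me/pdfs/*.pdf")
print(pdfs.keys().collect())
```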

Also, I am able to read the HDFS location from my PySpark application (as above), but how can I embed my JAR file within my app, or execute the Java JAR file from within the PySpark app? A rough sketch of what I have in mind follows below.
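
To make the question concrete, this is the kind of thing I am imagining: copy each PDF from HDFS to the executor's local disk, run the JAR on the local copy with subprocess, and push the result back. Everything here (the JAR name pdf-extractor.jar, the paths, the JAR's arguments) is a placeholder, and I don't know whether this is the right pattern at all, which is exactly my question:

```python
# Placeholder sketch, not working code: run the non-Spark JAR once per
# input file, in parallel across executors. Assumes the JAR was shipped
# to the executors via "spark-submit --files pdf-extractor.jar app.py".
import os
import subprocess
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdf-extract").getOrCreate()
sc = spark.sparkContext

# Hypothetical input list; in practice this would come from listing the
# HDFS source directory.
pdf_paths = ["hdfs:///user/me/pdfs/a.pdf", "hdfs:///user/me/pdfs/b.pdf"]

def extract(hdfs_path):
    name = os.path.basename(hdfs_path)
    local_in = os.path.join("/tmp", name)
    local_out = os.path.join("/tmp", name + ".txt")

    # The JAR only reads local files, so copy the PDF down first.
    subprocess.run(["hdfs", "dfs", "-get", "-f", hdfs_path, local_in],
                   check=True)

    # SparkFiles.get resolves the executor-local copy of the shipped JAR.
    jar = SparkFiles.get("pdf-extractor.jar")
    subprocess.run(["java", "-jar", jar, local_in, local_out], check=True)

    # Push the extracted content back to HDFS.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_out,
                    "hdfs:///user/me/out/"], check=True)
    return hdfs_path

done = sc.parallelize(pdf_paths, len(pdf_paths)).map(extract).collect()
print("processed:", done)
```

Is shelling out like this a sensible pattern, or is there a better way, e.g. shipping the JAR with --jars and calling its classes through the JVM gateway (spark._jvm)?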

I only found one example on YouTube, but it is about a Spark JAR, not a plain Java one:

https://www.youtube.com/watch?v=5uHG0aqir5s&t=185s

Thanks in advance!