Find HDFS Path Location

apache-spark

#1

Hello ,

How can I find the HDFS path of my file so I can create an RDD with sc.textFile(HDFS path)? The file is already available in HDFS.

Thanks
Devendra Shukla


#2

Hi @Devendra

To create an RDD from an HDFS path, an absolute path is not required; a relative path will work fine, since it is resolved against your HDFS home directory.
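As a rough sketch of the two forms (the host, port, user, and file names below are hypothetical placeholders, not values from this thread), a relative path resolves against your HDFS home directory, typically /user/&lt;your-username&gt;:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsPathExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsPathExample"))

    // Absolute HDFS path: a full URI including the NameNode host and port.
    // (Host and port here are hypothetical -- check fs.defaultFS in your
    // cluster's core-site.xml for the real values.)
    val rdd1 = sc.textFile("hdfs://namenode:8020/user/devendra/data.txt")

    // Relative path: resolved against your HDFS home directory,
    // i.e. /user/<your-username>/data.txt on most clusters.
    val rdd2 = sc.textFile("data.txt")

    println(rdd1.count())
    sc.stop()
  }
}
```

You can confirm the exact location of a file beforehand by running `hdfs dfs -ls` on the cluster.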


#3

Thanks for your reply, Ravi. Actually, I’m using the IntelliJ IDE for my code, so to pass my HDFS path in IntelliJ I need to connect IntelliJ to my cluster. Can you please tell me how to connect IntelliJ to the cluster so I can pass the HDFS path?

Thanks
Devendra Shukla


#4

Hi @Devendra,

I understand your query. Please answer the questions below:

  1. Is your IntelliJ IDE the Community edition or the commercial edition?
  2. When you mention ‘my cluster’, do you mean a VM, or a cluster in the cloud (AWS, Azure, GCP, etc.)?

If you can answer the above, I can help you.


#5

Hello Ravi,

Thank you for your reply. Below are my answers-

  1. Is your IntelliJ IDE the Community edition or the commercial edition?
    My IntelliJ is the Community edition (2017).

  2. When you mention ‘my cluster’, do you mean a VM, or a cluster in the cloud (AWS, Azure, GCP, etc.)?
    No, my VM is not in the cloud, and yes, I mean any cluster. Right now I’m using the itversity clusters (Hortonworks).


#6

The Community edition doesn’t have remote-kernel connectivity for interactive execution of your Spark code.
Try the commercial edition, or ship your JAR to your cluster and execute it there.

For remote connectivity, the firewall ports for the Scala/Python remote kernels must be open. That is only possible on machines you control; on itversity or any other publicly accessible cluster these ports are not open.

Bottom line: ship your JAR to the cluster and use ‘spark-submit’ to run your programs.
You can mention “hdfs:////.csv” in your program.
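A hedged sketch of that workflow (the JAR name, class name, gateway host, and HDFS paths below are placeholders I’ve invented for illustration, not values from this thread; sbt is assumed as the build tool):

```shell
# Build the JAR locally from your IntelliJ/sbt project.
sbt package

# Copy the JAR to an edge node of the cluster (host is hypothetical).
scp target/scala-2.11/myapp_2.11-0.1.jar user@gateway.example.com:~/

# On the cluster, verify the input file exists in HDFS and note its path...
hdfs dfs -ls /user/devendra/

# ...then run the program with spark-submit on YARN.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  myapp_2.11-0.1.jar
```

Inside the program, the file is then referenced with its HDFS URI as shown in the earlier posts.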