Spark 1.2.1 on Big Data Cluster

Hello Durga Sir & Team,

I am unable to download Spark 1.2.1 from http://spark.apache.org/downloads.html.

Please help me find an alternative, and the procedure to set up Spark for the CCA175 certification using the Big Data Lab by ITVersity.

Regards,
Amit

Dear Amit,

You don't need to install it in the Big Data Lab.
It is already installed.

Simply type:
pyspark or
spark-shell

and you will get the Spark shell prompt.

See the session below.
Hope it helps.

gw01 login: mohitgupta108
Password:
Last login: Tue Nov 8 07:28:04 on pts/8
CentOS Linux release 7.2.1511 (Core)

Linux gw01.itversity.com 3.14.32-xxxx-grs-ipv6-64 #9 SMP Thu Oct 20 14:53:52 CEST 2016 x86_64 x86_64 x86_64 GNU/Linux

server : 266639
hostname : gw01.itversity.com
eth0 IPv4 : 149.56.24.210
eth0 IPv6 : 2607:5300:61:9d2::/64

[mohitgupta108@gw01 ~]$ pyspark
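
Once the shell comes up, the startup banner prints the Spark version; you can also confirm it from inside pyspark (a minimal check, assuming the pre-created SparkContext is exposed as sc, which it is by default):

# Inside the pyspark shell the SparkContext is already available as `sc`
print(sc.version)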

In this particular case, Spark 1.6.2 is available; however, I want to set up Spark 1.2.1, which is part of the CCA175 certification. Is there any alternative for the same?

There is no need for Spark 1.2.x. CCA 175 does not cover Data Frames or Spark SQL, which are the major differences between Spark 1.2.1 and the later versions starting from Spark 1.3.x.

You can safely practice on Spark 1.6 without getting into Spark SQL and Data Frames.
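
For example, the core RDD transformations that the certification emphasizes work the same way on Spark 1.6. A minimal sketch follows; the HDFS path and the comma-separated column layout of order_items are assumptions based on the lab's retail_db data set:

# Plain RDD API only -- no SQLContext or Data Frames involved
# Assumed layout of order_items: column 1 = order_id, column 4 = subtotal
orderItems = sc.textFile("/public/retail_db/order_items")

revenuePerOrder = orderItems \
    .map(lambda line: (int(line.split(",")[1]), float(line.split(",")[4]))) \
    .reduceByKey(lambda total, subtotal: total + subtotal)

for record in revenuePerOrder.take(10):
    print(record)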

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
jdbcurl = "jdbc:mysql://nn01.itversity.com:3306/retail_db?user=retail_dba&password=itversity"
df = sqlContext.load(source="jdbc", url=jdbcurl, dbtable="departments")

I tried the above code and I am getting the error message below :frowning: Can anyone please help me move ahead?

/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/context.py:535: UserWarning: load is deprecated. Use read.load() instead.
warnings.warn("load is deprecated. Use read.load() instead.")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/context.py", line 536, in load
return self.read.load(path, source, schema, **options)
File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/readwriter.py", line 139, in load
return self._df(self._jreader.load())
File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o40.load.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:50)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:50)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:120)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

Hello @amit0900, first set the driver class path by launching pyspark with the command below:

pyspark --driver-class-path /usr/share/java/mysql-connector-java.jar

Then use the query below; it works:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
jdbcurl = "jdbc:mysql://nn01.itversity.com:3306/retail_db?user=retail_dba&password=itversity"
df = sqlContext.load(url=jdbcurl, source="jdbc", dbtable="departments")

for i in df.collect():
    print i
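
As the deprecation warning in the traceback suggests, sqlContext.load is deprecated in Spark 1.6 in favour of the reader API. Here is an equivalent sketch using sqlContext.read with the same connection details (still launched with --driver-class-path as above):

# Non-deprecated DataFrameReader API (Spark 1.4+), same JDBC source and table
df = sqlContext.read.format("jdbc") \
    .option("url", jdbcurl) \
    .option("dbtable", "departments") \
    .load()

for i in df.collect():
    print i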

Please let us know if you still face any issues.

Awesome, thanks to you Santosh :slight_smile:


Dear Santosh,

  1. How did you come to know that mysql-connector-java.jar is located at /usr/share/java?

  2. Do you think the same path is going to be used in the Cloudera setup as well?

Thanks in Advance :slight_smile:

We are just trying to connect to a remote database using a JDBC driver, in our case the MySQL connector.

Mostly it will be located under the /usr/share/ folder. Yes, the driver is available at the same path in Cloudera too. It's been taught by Durga sir in the tutorials. Thanks!

It is a standard path and most likely it will be the same.
If you cannot find it there, you can ask the SA (system admin) to run this command: sudo find / -name "mysql-connector-java.jar"

@itversity

Hi Santosh/itversity,

I have tried to set the driver path using
os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"

But I still get the error mentioned by Amit. Can you let me know if I am making any mistake here?

Hello Abhi,

You can set it either while launching pyspark or by using an OS environment variable.

launch the pyspark as below:

pyspark --driver-class-path /usr/share/java/mysql-connector-java.jar
(first check whether the connector jar is present at that location; mostly it should be available)

then execute these commands:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
jdbcurl = "jdbc:mysql://nn01.itversity.com:3306/retail_db?user=retail_dba&password=itversity"
df = sqlContext.load(url=jdbcurl, source="jdbc", dbtable="departments")

for i in df.collect():
    print i

Above, please note that the url should be the first parameter. Please let me know if you still run into any issues. Thanks!


Thanks Santhosh. It works now!!!


Hi Santosh,

When I am trying to execute the following command:

from pyspark import SparkContext, SparkConf

it gives me the error ImportError: No module named context

I am stuck, please help.
The Spark version is 1.5.0-cdh5.5.0.

Can you please share the exact error message you are experiencing?

ImportError: No Module named Context

Where are you trying? On quickstart VM or big data lab?