Sqoop vs Spark for Data Ingestion

oracle
apache-spark
sqoop

#1

@viswanath.raju
I am trying to pull data from Oracle into HDFS, and I have tried both Sqoop and Spark.

Sqoop code:

sqoop import \
  -D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/urandom" \
  --connect jdbc:oracle:thin:@<HOST>:1521/<SID> \
  --username <username> \
  --password-file /user/config/password \
  --direct \
  --num-mappers 1 \
  --query "select * from <oracle_schema>.<oracle_table> where \$CONDITIONS" \
  --as-textfile \
  --delete-target-dir \
  --target-dir /user/hive/warehouse/stage_load.db/<HIVETABLE>

Pyspark code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn') \
        .appName('Create Dataframe over JDBC') \
        .config("spark.driver.memory", "2g") \
        .config("spark.logConf", "true") \
        .enableHiveSupport() \
        .getOrCreate()

# "spark.setLoglevel" is not a real config key; set the level on the context instead
spark.sparkContext.setLogLevel("ERROR")

# read.jdbc takes a table name, so a query must be wrapped
# as a parenthesized subquery with an alias
query = "(select * from <oracle_schema>.<oracle_table>) t"

oracleDF = spark.read.jdbc("jdbc:oracle:thin:@<HOST>:1521/<SID>", query,
                           properties={"user": "<username>", "password": "<password>"})

# cache first: otherwise every action on oracleDF re-runs the query over JDBC
oracleDF.cache()
oracleDF.count()

oracleDF.write.saveAsTable("stage_load.hivetable", format="parquet", mode="overwrite")

hiveDF = spark.sql("select * from stage_load.hivetable")
hiveDF.count()

if oracleDF.count() == hiveDF.count():
    print("count matched")
else:
    print("count is not matched")

spark.stop()
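One thing I noticed: the read above opens a single JDBC connection, while `spark.read.jdbc` can also take `column`, `lowerBound`, `upperBound`, and `numPartitions` to split the load across parallel queries (similar to Sqoop mappers). A minimal sketch of how the bound range maps to per-partition WHERE clauses, assuming a numeric `ID` column (this is an illustration, not Spark's exact internals):

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into per-partition WHERE clauses,
    mirroring how a partitioned JDBC read divides the key range."""
    stride = (upper - lower) // num_partitions
    preds = []
    current = lower
    for i in range(num_partitions):
        if i == 0:
            # first partition also picks up NULL keys
            preds.append(f"{column} < {current + stride} or {column} is null")
        elif i == num_partitions - 1:
            # last partition is open-ended to catch the remainder
            preds.append(f"{column} >= {current}")
        else:
            preds.append(f"{column} >= {current} and {column} < {current + stride}")
        current += stride
    return preds

# e.g. 1 million ids split across 4 parallel queries:
for p in jdbc_partition_predicates("ID", 0, 1_000_000, 4):
    print(p)

# The equivalent partitioned read (column name "ID" is hypothetical):
# oracleDF = spark.read.jdbc(
#     "jdbc:oracle:thin:@<HOST>:1521/<SID>", query,
#     column="ID", lowerBound=0, upperBound=1000000, numPartitions=4,
#     properties={"user": "<username>", "password": "<password>"})
```

Each predicate becomes one task reading its own slice of the table, so the ingest is no longer bottlenecked on one connection.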

Source:
Oracle table with 1 million records

Sqoop finished in 2.25 minutes, whereas the Spark job took 3.5 hours. Can anyone help me understand why?

Also, which is better for data ingestion, Sqoop or Spark? What are the best practices?

Thanks