Exercise 12 - Importing data from MySQL to Hadoop

Description

  • This exercise tests your knowledge of connecting to a database with Sqoop and copying data to HDFS as well as to Hive tables.
  • Understand how to validate connectivity to MySQL (a quick non-Sqoop check is sketched after this list).
  • Be able to import a table from MySQL to HDFS.
  • Create a Hive database and import data into Hive tables.
  • Documentation for reference - https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
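
Before reaching for Sqoop, it can help to confirm that MySQL is reachable at all. A minimal sketch, assuming the mysql command-line client is available on the gateway node (host and credentials as given in the problem statement below):

# Prompts for the password, then opens a session against retail_db
mysql -h nn01.itversity.com -u retail_dba -p retail_db

# Inside the session, a trivial query proves read access
show tables;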

Problem Statement

  • List databases, list tables, and validate that you can connect to the database using Sqoop with a JDBC URL. As part of the validation you should query one of the tables in MySQL.
      • Host: nn01.itversity.com
      • Retail database names: retail_db, retail_import, retail_export
      • Username to access the retail databases: retail_dba (password: itversity)
      • NYSE database: nyse
      • Username to access the nyse database: nyse_ro
      • Table name: stocks_eod
  • Import retail_db to HDFS
      • Import all tables
      • File format - avrodatafile
  • Import retail_db to Hive
      • Better to create a database with your lab account id and use it for the import
      • Import all tables
      • File format - text file
      • Delimiter - |
  • Import nyse.stocks_eod to HDFS
      • Try to use compression - Snappy codec
      • Use 8 parallel threads

Import nyse.stocks_eod to HDFS: use compression

sqoop import --connect "jdbc:mysql://nn01.itversity.com/nyse" --username nyse_ro --password itversity --target-dir "/user/selva2389/sqoop_nyse_import_compress" --table stocks_eod --delete-target-dir --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --num-mappers 8 --split-by stockticker

For the above sqoop import I'm getting an error.

What went wrong with my sqoop import?
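
The error text did not come through above, but judging from the fix in the replies it is almost certainly Sqoop's guard against splitting on a text column, which reads roughly like this:

ERROR tool.ImportTool: ... Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter

stockticker is a varchar column, so with 8 mappers Sqoop refuses to compute range splits on it unless that property is set.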

List Databases:
sqoop-list-databases --connect "jdbc:mysql://nn01.itversity.com" --username nyse_ro -P

List Tables:
sqoop-list-tables --connect "jdbc:mysql://nn01.itversity.com/nyse" --username nyse_ro -P

Data check using sqoop eval:
sqoop eval --connect "jdbc:mysql://nn01.itversity.com/nyse" --username nyse_ro -P --query "select count(1) from stocks_eod"

Sqoop import all tables in Avro file format

sqoop import-all-tables --connect "jdbc:mysql://nn01.itversity.com/retail_db" --username retail_dba -P --warehouse-dir "/user/selva2389/sqoop_retail_import-all-tables" --as-avrodatafile

Sqoop import all tables - Hive import

sqoop import-all-tables --connect "jdbc:mysql://nn01.itversity.com/retail_db" --username retail_dba --password itversity --hive-import --hive-database selva2389 --fields-terminated-by '|' --as-textfile
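
To verify the Hive import worked, a quick spot check (database name selva2389 as used above; orders is one of the retail_db tables):

hive -e "use selva2389; show tables;"
hive -e "select * from selva2389.orders limit 5;"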

Solved my issue by using --split-by on the date key column instead of the varchar key column.

sqoop import --connect "jdbc:mysql://nn01.itversity.com/nyse" --username nyse_ro --password itversity --target-dir "/user/selva2389/sqoop_nyse_import_compress" --table stocks_eod --delete-target-dir --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --num-mappers 8 --split-by tradedate


Also, the error clearly says to pass a parameter when you want to split on a column of string type.
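
For reference, this is the earlier failing command with that property added, keeping the varchar stockticker as the split column (generic -D properties must come immediately after the tool name):

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect "jdbc:mysql://nn01.itversity.com/nyse" \
--username nyse_ro --password itversity \
--table stocks_eod \
--target-dir "/user/selva2389/sqoop_nyse_import_compress" \
--delete-target-dir \
--compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--num-mappers 8 --split-by stockticker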


Thanks @itversity! Now I have two solutions for this based on your video!

List Databases:

sqoop list-databases --connect "jdbc:mysql://quickstart.cloudera:3306/" --username retail_dba --password cloudera
or
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/" --username retail_dba --password cloudera --query "show databases"

List Tables:

sqoop list-tables --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera
or
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "show tables"

Query one of the tables in MySQL:

sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "select * from orders limit 10"

Importing retail_db to HDFS

> Import all tables
> File format - avrodatafile

sqoop import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--as-avrodatafile \
--warehouse-dir /user/cloudera/exercise12
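
If the import succeeds, each table gets its own sub-directory under the warehouse directory and the part files carry the .avro extension. A quick check, using the paths from the command above:

hdfs dfs -ls /user/cloudera/exercise12
hdfs dfs -ls /user/cloudera/exercise12/orders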

Importing retail_db to Hive

> Better to create a database with your lab account id and use it while importing
> Import all tables
> File format - text file
> Delimiter - |

hive> create database exercise12_db;

sqoop import-all-tables \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--hive-import \
--hive-overwrite \
--hive-database exercise12_db \
--create-hive-table \
--hive-home /user/hive/warehouse/exercise12_db.db \
--fields-terminated-by '|'

Import nyse.stocks_eod to HDFS

> Try to use compression - snappy codec
> Use 8 parallel threads

sqoop import -m 8 \
--connect "jdbc:mysql://quickstart.cloudera:3306/nyse" \
--username=nyse_ro \
--password=itversity \
--table stocks_eod \
--hive-import \
--create-hive-table \
--hive-database exercise12_db \
--hive-home /user/hive/warehouse/exercise12_db.db \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec

List databases from retail_db

sqoop list-databases \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity

List databases from nyse

sqoop list-databases \
--connect "jdbc:mysql://nn01.itversity.com:3306/nyse" \
--username nyse_ro \
--password itversity

List tables from retail_db

sqoop list-tables \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity

List tables from nyse

sqoop list-tables \
--connect "jdbc:mysql://nn01.itversity.com:3306/nyse" \
--username nyse_ro \
--password itversity

Import all tables from retail_db to HDFS

sqoop import-all-tables \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity \
--warehouse-dir /user/varunu28/exercise_import/text_format

Import all tables from retail_db to HDFS in Avro format

sqoop import-all-tables \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity \
--warehouse-dir /user/varunu28/exercise_import/avro_format \
--as-avrodatafile

Import all tables from retail_db to Hive with delimiter '|'

sqoop import-all-tables \
--connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
--username retail_dba \
--password itversity \
--hive-import \
--hive-database varunu28 \
--fields-terminated-by '|'
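
To spot-check that the delimiter took effect, cat one of the files under the Hive warehouse. The path below assumes the usual HDP default of /apps/hive/warehouse; on the quickstart VM it would be /user/hive/warehouse instead:

hdfs dfs -cat /apps/hive/warehouse/varunu28.db/orders/part-m-00000 | head -5

The fields should be separated by | rather than Sqoop's default comma.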

Import nyse.stocks_eod to HDFS using Snappy codec compression and 8 parallel threads

# Solution 1: use --split-by on a non-text column
sqoop import -m 8 \
--connect "jdbc:mysql://nn01.itversity.com:3306/nyse" \
--username nyse_ro \
--password itversity \
--table stocks_eod \
--warehouse-dir /user/varunu28/exercise_import/nyse_import \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--split-by tradedate

# Solution 2: pass the -D splitter property so Sqoop can split on a text column
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
-m 8 \
--connect "jdbc:mysql://nn01.itversity.com:3306/nyse" \
--username nyse_ro \
--password itversity \
--table stocks_eod \
--warehouse-dir /user/varunu28/exercise_import/nyse_import \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
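
Either way, you can confirm Snappy actually kicked in by listing the output directory; with --compress on a text import the part files get a .snappy extension, and with 8 mappers there should be 8 of them:

hdfs dfs -ls /user/varunu28/exercise_import/nyse_import/stocks_eod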

  1. List databases

    sqoop list-databases \
    --connect "jdbc:mysql://nn01.itversity.com:3306" \
    --username retail_dba \
    --password itversity

  2. List tables from retail_db

    sqoop list-tables \
    --connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
    --username retail_dba \
    --password itversity

  3. List tables from retail_import

    sqoop list-tables \
    --connect "jdbc:mysql://nn01.itversity.com:3306/retail_import" \
    --username retail_dba \
    --password itversity

  4. List tables from retail_export

    sqoop list-tables \
    --connect "jdbc:mysql://nn01.itversity.com:3306/retail_export" \
    --username retail_dba \
    --password itversity

  5. List databases from nyse

    sqoop list-databases \
    --connect "jdbc:mysql://nn01.itversity.com:3306/nyse" \
    --username nyse_ro \
    --password itversity

  6. List tables from nyse

    sqoop list-tables \
    --connect "jdbc:mysql://nn01.itversity.com:3306/nyse" \
    --username nyse_ro \
    --password itversity

  1. Import all tables from retail_db to HDFS

    sqoop import-all-tables \
    --connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
    --username retail_dba \
    --password itversity \
    --warehouse-dir /user/shrikant/exercise12_import

  2. Import all tables from retail_db to HDFS in Avro format

    sqoop import-all-tables \
    --connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
    --username retail_dba \
    --password itversity \
    --warehouse-dir /user/shrikant/exercise12_import/avro_format \
    --as-avrodatafile

  3. Import all tables from retail_db to Hive with delimiter '|'

    sqoop import-all-tables \
    --connect "jdbc:mysql://nn01.itversity.com:3306/retail_db" \
    --username retail_dba \
    --password itversity \
    --hive-import \
    --hive-database shrikant \
    --fields-terminated-by '|'

  4. Import table stocks_eod from nyse to HDFS using Snappy codec compression and 8 parallel threads

    sqoop import \
    --connect "jdbc:mysql://nn01.itversity.com:3306/nyse" \
    --username nyse_ro \
    --password itversity \
    --table stocks_eod \
    --warehouse-dir /user/shrikant/exercise12_import/nyse_import \
    -m 8 \
    --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
    --split-by tradedate