Sqoop import query using $conditions


#1

Could some one please help to develop the equal-ant to the below query for sqoop import ?

select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id group by o.order_id having sum(oi.order_item_subtotal) > 1000 order by order_revenue desc

The above query needs to be developed using $CONDITIONS in order to run inside sqoop import. I tried different ways keeping $CONDITIONS in different places, but couldnt get it working.

Any logic behind where and how to use $CONDITIONS ?


#2

Following worked.

sqoop import
–connect jdbc:mysql://localhost/retail_db
–username root
–password cloudera
–target-dir /user/cloudera/0205/sqoop2/orderitems5
–query ‘select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id and $CONDITIONS group by o.order_id and $CONDITIONS having sum(oi.order_item_subtotal) > 1000 and $CONDITIONS order by order_revenue’
–split-by order_id


#3

But results seems to be not correct. It returned 0 records.


#4

sqoop-import
–connect jdbc:mysql://quickstart.cloudera:3306/retail_db
–username retail_dba
–password cloudera
–query “select * from orders where order_date like ‘2014-02%’ and $CONDITIONS”
–split-by order_id
–target-dir /user/cloudera/sqoop_import/retail_db/orders
–append
–num-mappers 2

To find out more information go to below github link:


#5

How is this equalent to the query that i asked ?


#6

I tried to provide you the reference query under which $CONCITIONS should be use under where clause only but it seems that you are not getting my point.

Let me provide you the complete solution over here.

Step-1: Make sure to execute required query under MySQL

[cloudera@quickstart ~]$ mysql -u retail_dba -h quickstart.cloudera -p -P 3306
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 322
Server version: 5.1.73 Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use retail_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)

mysql> select 
    ->   o.order_id, 
    ->   round(sum(oi.order_item_subtotal),2) order_revenue 
    -> from 
    ->   orders o join order_items oi on o.order_id = oi.order_item_order_id 
    -> group by 
    ->   o.order_id 
    -> having 
    ->   sum(oi.order_item_subtotal) > 1000 
    -> order by order_revenue 
    -> limit 10;
+----------+---------------+
| order_id | order_revenue |
+----------+---------------+
|    12878 |       1001.90 |
|    13698 |       1001.91 |
|    19803 |       1002.89 |
|    25582 |       1003.84 |
|    47578 |       1003.90 |
|    49113 |       1003.91 |
|    31941 |       1004.77 |
|    37376 |       1004.80 |
|     1429 |       1004.80 |
|    10404 |       1004.82 |
+----------+---------------+
10 rows in set (0.18 sec)

Step-2: Execute below sqoop command ($CONDITIONS should be part of where clause only)

[cloudera@quickstart ~]$ sqoop-import \
  --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
  --username retail_dba \
  --password cloudera \
  --query "select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id where \$CONDITIONS group by o.order_id having sum(oi.order_item_subtotal) > 1000 order by order_revenue" \
  --split-by order_id \
  --target-dir /user/cloudera/tmp_order_revenue \
  --num-mappers 2 

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/05/06 06:05:03 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.12.0
18/05/06 06:05:03 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/05/06 06:05:03 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/05/06 06:05:03 INFO tool.CodeGenTool: Beginning code generation
18/05/06 06:05:04 INFO manager.SqlManager: Executing SQL statement: select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id where  (1 = 0)  group by o.order_id having sum(oi.order_item_subtotal) > 1000 order by order_revenue
18/05/06 06:05:04 INFO manager.SqlManager: Executing SQL statement: select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id where  (1 = 0)  group by o.order_id having sum(oi.order_item_subtotal) > 1000 order by order_revenue
18/05/06 06:05:04 INFO manager.SqlManager: Executing SQL statement: select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id where  (1 = 0)  group by o.order_id having sum(oi.order_item_subtotal) > 1000 order by order_revenue
18/05/06 06:05:04 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/a9d272fe6ca6022d735e05c4b2e37484/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/05/06 06:05:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/a9d272fe6ca6022d735e05c4b2e37484/QueryResult.jar
18/05/06 06:05:06 INFO mapreduce.ImportJobBase: Beginning query import.
18/05/06 06:05:06 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
18/05/06 06:05:07 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/05/06 06:05:07 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/05/06 06:05:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/05/06 06:05:09 INFO mapreduce.JobSubmitter: number of splits:2
18/05/06 06:05:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1525607376030_0019
18/05/06 06:05:10 INFO impl.YarnClientImpl: Submitted application application_1525607376030_0019
18/05/06 06:05:10 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1525607376030_0019/
18/05/06 06:05:10 INFO mapreduce.Job: Running job: job_1525607376030_0019
18/05/06 06:05:17 INFO mapreduce.Job: Job job_1525607376030_0019 running in uber mode : false
18/05/06 06:05:17 INFO mapreduce.Job:  map 0% reduce 0%
18/05/06 06:05:27 INFO mapreduce.Job:  map 50% reduce 0%
18/05/06 06:05:28 INFO mapreduce.Job:  map 100% reduce 0%
18/05/06 06:05:29 INFO mapreduce.Job: Job job_1525607376030_0019 completed successfully
18/05/06 06:05:29 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=304888
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=225
		HDFS: Number of bytes written=103226
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Job Counters 
		Launched map tasks=2
		Other local map tasks=2
		Total time spent by all maps in occupied slots (ms)=16931
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=16931
		Total vcore-milliseconds taken by all map tasks=16931
		Total megabyte-milliseconds taken by all map tasks=17337344
	Map-Reduce Framework
		Map input records=7519
		Map output records=7519
		Input split bytes=225
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=311
		CPU time spent (ms)=1960
		Physical memory (bytes) snapshot=333574144
		Virtual memory (bytes) snapshot=3020337152
		Total committed heap usage (bytes)=277217280
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=103226
18/05/06 06:05:29 INFO mapreduce.ImportJobBase: Transferred 100.8066 KB in 22.1199 seconds (4.5573 KB/sec)
18/05/06 06:05:29 INFO mapreduce.ImportJobBase: Retrieved 7519 records.

Step-3: Verify output under HDFS

[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/tmp_order_revenue
Found 3 items
-rw-r--r--   1 cloudera cloudera          0 2018-05-06 06:05 /user/cloudera/tmp_order_revenue/_SUCCESS
-rw-r--r--   1 cloudera cloudera      48421 2018-05-06 06:05 /user/cloudera/tmp_order_revenue/part-m-00000
-rw-r--r--   1 cloudera cloudera      54805 2018-05-06 06:05 /user/cloudera/tmp_order_revenue/part-m-00001

I hope that it will be useful to you.


#7

Appreciate your response. Thanks.

I tried the sqoop query that you suggested before. But results were not same.

Here is the sqoop query that i tried.

sqoop import --connect jdbc:mysql://localhost/retail_db \ --username root --password cloudera
–target-dir /user/cloudera/sqoop/0206/daily_revenue --delete-target-dir
–query ‘select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id where $CONDITIONS group by o.order_id having sum(oi.order_item_subtotal) > 1000 order by order_revenue desc’
–split-by order_id

scala> val dr = sc.textFile("/user/cloudera/sqoop/0206/daily_revenue")
dr: org.apache.spark.rdd.RDD[String] = /user/cloudera/sqoop/0206/daily_revenue MapPartitionsRDD[453] at textFile at :27

scala> dr.take(10).foreach(println)
14539,1899.9
9506,1849.9
7630,1829.86
10129,1799.89
5443,1749.86
10699,1749.8
9084,1719.9
17185,1699.91
5299,1699.91
2801,1699.9

Then i run the similar query in mysql
mysql> select o.order_id, round(sum(oi.order_item_subtotal),2) order_revenue from orders o join order_items oi on o.order_id = oi.order_item_order_id group by o.order_id having sum(oi.order_item_subtotal) > 1000 order by order_revenue desc limit 10
-> ;
±---------±--------------+
| order_id | order_revenue |
±---------±--------------+
| 68703 | 3449.91 |
| 68724 | 2859.89 |
| 68858 | 2839.91 |
| 68809 | 2779.86 |
| 68766 | 2699.90 |
| 68806 | 2629.92 |
| 68821 | 2629.92 |
| 68778 | 2629.90 |
| 68848 | 2399.96 |
| 68875 | 2399.95 |
±---------±--------------+
10 rows in set (0.10 sec)