Count doesn't match in parquet import in Scala


#1

Hello,

I am practicing Sqoop together with Scala. Here is what I found when I import a MySQL table to HDFS in Parquet format and then load the Parquet files in Scala. The import generates two Parquet files.

When I load each Parquet file in Scala, each one shows 6218 rows, so adding them up should presumably give 12436, but I get 12435 when I load them with the folder as the input.

sqlContext.load("/user/paslechoix/sqoop_import/retail_db/customers/0a821e98-cbc2-4a50-bc8e-82a839b6f0ca.parquet", "parquet").count
6218 in total

sqlContext.load("/user/paslechoix/sqoop_import/retail_db/customers/eb5da0eb-58c2-4096-b4b5-433cab44ee2f.parquet", "parquet").count
6218 in total

6218+6218 = 12436

sqlContext.load("/user/paslechoix/sqoop_import/retail_db/customers", "parquet").count
12435 in total

This result is NOT acceptable; the sum should be exactly the same.

Here is how I generated the Parquet files:

sqoop import \
  --connect jdbc:mysql://ms.itversity.com:3306/retail_db \
  --username retail_user \
  --password itversity \
  --table customers \
  --warehouse-dir /user/paslechoix/sqoop_import/retail_db \
  --num-mappers 2 \
  --as-parquetfile

Any idea?

Thank you.


#2

@paslechoix
I tried the same and the counts match for me.
It shows 6218 for the first file and 6217 for the second.
For the complete folder it shows the correct sum of 12435.
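
The arithmetic behind this comparison can be written down as a small check. The following is only a sketch in plain Scala (no Spark needed); the object name `CountCheck` and the method `discrepancy` are made up for illustration, and the numbers are the ones reported in this thread.

```scala
object CountCheck {
  // Difference between the folder-level count and the sum of the per-file counts.
  // Zero means the counts agree; anything else means they do not.
  def discrepancy(perFileCounts: Seq[Long], folderCount: Long): Long =
    folderCount - perFileCounts.sum

  def main(args: Array[String]): Unit = {
    // Counts reported in post #1: 6218 and 6218 per file, 12435 for the folder.
    println(discrepancy(Seq(6218L, 6218L), 12435L)) // prints -1
    // Counts reported in post #2: 6218 and 6217 per file, 12435 for the folder.
    println(discrepancy(Seq(6218L, 6217L), 12435L)) // prints 0
  }
}
```

A non-zero result only says that the per-file counts and the folder count disagree; it does not say which of the counts is the wrong one.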


#3

Thank you Varun,

I just did it again from scratch and this time the count matches. I thought maybe I had mixed up the Parquet file names the first time, but I double-checked my records and I did not mix them up.

This is weird. It reminds me of a rumor that Sqoop is not reliable enough to be used in production. At my work, we did not use Sqoop for our data migration from Netezza to Hive; we used a much more complicated ingestion process that compares not only the total count but also each cell.

Thank you for your time.

I will not trust sqoop from now on.


#4

I checked my first result again and the two Parquet files have different file sizes, so presumably they contain different numbers of records.

So, given different sources as input, Scala generates different counts for them?

I don’t understand this.

Which one is not trustworthy?

If it is Sqoop: but Sqoop does generate two Parquet files with different sizes.
Then is it the Scala import (sqlContext.load)?

I invite everyone to do a test yourself and see what you get.

Thank you.


#5

Can you please elaborate on how the data ingestion takes place in your workflow? For example, which tool is used in place of Sqoop, and what checks do you use to verify the ingested data?


#6

Are you asking about my workflow at work? We use scripting and the Netezza connector. We compare each cell during verification and generate a comparison result table for auditing purposes.
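
To give a rough idea of what a cell-by-cell comparison looks like, here is a minimal sketch in plain Scala. It is not the actual process from my work: the names `CellCompare`, `Row`, `Mismatch` and `compare`, the in-memory maps, and the sample rows are all invented for illustration. It assumes each row can be looked up by a primary key and each cell can be compared as a string.

```scala
object CellCompare {
  // A row is modelled as column name -> cell value (everything as String for simplicity).
  type Row = Map[String, String]

  // One line of the comparison result table: which cell differs, and how.
  final case class Mismatch(key: String, column: String,
                            source: Option[String], target: Option[String])

  // Compare two tables keyed by primary key, cell by cell.
  // A row or column missing on one side shows up as None on that side.
  def compare(source: Map[String, Row], target: Map[String, Row]): Seq[Mismatch] = {
    val keys = (source.keySet ++ target.keySet).toSeq.sorted
    for {
      key <- keys
      srcRow = source.getOrElse(key, Map.empty[String, String])
      tgtRow = target.getOrElse(key, Map.empty[String, String])
      column <- (srcRow.keySet ++ tgtRow.keySet).toSeq.sorted
      s = srcRow.get(column)
      t = tgtRow.get(column)
      if s != t
    } yield Mismatch(key, column, s, t)
  }

  def main(args: Array[String]): Unit = {
    // Invented sample data: row "2" is missing from the target, row "1" differs in one cell.
    val source = Map(
      "1" -> Map("name" -> "Mary", "city" -> "Brownsville"),
      "2" -> Map("name" -> "Ann", "city" -> "Caguas"))
    val target = Map(
      "1" -> Map("name" -> "Mary", "city" -> "Littleton"))
    compare(source, target).foreach(println)
  }
}
```

An empty result means every cell matched. In a real setup the two sides would be read from the source database and from the ingested files instead of from in-memory maps.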