Filter RDD not getting the right count


#1

I am working on a scenario that uses the products table:

Step 1. From MySQL:

mysql> select count(1) from products;
+----------+
| count(1) |
+----------+
|     1345 |
+----------+
1 row in set (0.00 sec)

mysql> select count(1) from products where length(product_price)>0;
+----------+
| count(1) |
+----------+
|     1345 |
+----------+
1 row in set (0.00 sec)

That means every record has a value in its price field.

Step 2. Load the data into an RDD:
val productsRDD = sc.textFile("p93_products")
scala> productsRDD.count
res0: Long = 1345

Step 3. Filter the working data:
val filteredRDD = productsRDD.filter(rec => rec.split(",")(4).length > 0)
scala> filteredRDD.count
1344

Why is one record missing? In the MySQL raw data, every price field has length > 0.
What is wrong?

Thank you.


#2

Hi,

In the filter command the field index you gave is 4, whereas the index for product_price is 5.
With index 4 the count comes out as 1344 because you are splitting by comma.
In a preceding field there is a comma inside the field value.

I am attaching the image for clarification
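
The shift can be reproduced in the Scala shell without Spark. This is only a minimal sketch: the two records below are made up for illustration, and the assumed column layout (id, category id, name, description, price, image) with an empty description is a guess, not taken from the actual p93_products data.

```scala
// Hypothetical records; the values are invented for illustration.
// Assumed layout: id, category_id, name, description, price, image
val clean   = "1,2,Plain Name,,59.98,http://example.com/a"
val shifted = "2,2,Name, With Comma,,59.98,http://example.com/b"

val cleanFields   = clean.split(",")
val shiftedFields = shifted.split(",")

// The embedded comma yields one extra field and moves every later field right.
println(cleanFields.length)        // 6
println(shiftedFields.length)      // 7

// Index 4 is the price in the clean record,
// but the (empty) description in the shifted one.
println(cleanFields(4))            // 59.98
println(shiftedFields(4).isEmpty)  // true
println(shiftedFields(5))          // 59.98

// Same predicate as the filter in the question: only the clean record passes.
val kept = Seq(clean, shifted).filter(rec => rec.split(",")(4).length > 0)
println(kept.size)                 // 1
```

If the real data behaves like this sketch, comparing rec.split(",").length across records should point to the one record that has an extra field.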


#3

Thank you very much for catching that.

In cases like this, with a comma inside the field content, how would you suggest handling it?