How to remove bad data from millions of records


#1

Scenario: suppose I have ten million records in a tab-delimited file, and for some of those records the value in the first column itself contains a tab character. How would you identify and rectify those records using tools and techniques in the Hadoop ecosystem?

Could someone kindly help me?


#2

@itversity, @venkatreddy-amalla, please help us with the answer.


#3

In the above scenario, how do you decide that a row is noise, given that the file is tab-delimited? Please elaborate.


#4

Every dataset usually comes with a dataset description. If the value of a few fields can itself contain a tab, we need to identify those fields; for example, an address field can contain spaces, symbols, and delimiters. We can then parse such records using regular expressions. There should also be some record delimiter in the file.
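A minimal sketch of that idea in Python, assuming a hypothetical three-column layout (name, age, address) where a field with embedded tabs is wrapped in double quotes; the pattern and sample values are illustrative, not from the actual dataset:

```python
import re

# Split a tab-delimited line into fields, treating anything inside double
# quotes as a single field so that embedded tabs survive. The three-column
# layout (name, age, address) is only a hypothetical example.
FIELD = re.compile(r'"([^"]*)"|([^\t]+)')

def parse_line(line):
    # Each match is either a quoted field (group 1) or a plain field (group 2).
    # Caveat: consecutive tabs (empty fields) are skipped by this simple pattern.
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in FIELD.finditer(line.rstrip("\n"))]

print(parse_line('Rajesh\t22\t"123,street1,\tcity1, KA"'))
# -> ['Rajesh', '22', '123,street1,\tcity1, KA']
```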


#5

Can you please be clearer on this? For example: if the first column holds "Praveen Kumar" with a tab between the two words, then splitting on tabs will put "Praveen" and "Kumar" into two separate columns.


#6

Hi Ravi & Raj,
In that situation, the data will usually come like the example below:

name age address
Rajesh 22 "123,street1, city1, KA"
"selva kumar" 33 "193,street5, city4, KL"

So we can process the data using a quote/escape character.
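For example, Spark's built-in CSV reader supports quote and escape options; this is just a sketch, and the file path, header setting, and escape character are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quoted-tsv").getOrCreate()

# Spark's CSV reader treats a quoted field as one value, so the tab inside
# "selva kumar" does not split the name into two columns.
df = (spark.read
      .option("sep", "\t")      # tab-delimited file
      .option("quote", '"')     # quoted fields may contain tabs
      .option("escape", "\\")   # assumed escape character for embedded quotes
      .option("header", "true")
      .csv("/user/data/people.tsv"))  # hypothetical HDFS path

df.show(truncate=False)
```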
If you are not getting the data in that quoted form and just want to remove the bad records, you can split each row and filter on its length: if the split produces more fields than the expected number of columns, reject the row.
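Something like this sketch with a PySpark RDD (the paths and the expected column count are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reject-bad-rows").getOrCreate()
sc = spark.sparkContext

EXPECTED_COLS = 3  # assumption: name, age, address

lines = sc.textFile("/user/data/people.tsv")  # hypothetical input path

# A row whose first column contains an extra tab splits into more fields
# than expected; reject those rows and keep the rest.
good = lines.filter(lambda line: len(line.split("\t")) <= EXPECTED_COLS)
bad = lines.filter(lambda line: len(line.split("\t")) > EXPECTED_COLS)

good.saveAsTextFile("/user/data/people_clean")    # hypothetical output paths
bad.saveAsTextFile("/user/data/people_rejected")  # keep rejects for inspection
```

Writing the rejected rows to a separate location lets you inspect them later and decide whether they can be repaired instead of dropped.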


#7

Could you please explain in detail, @suresh_selvaraj bro? Thanks for your reply.