Kindly answer my questions

Dear Folks,

Below are my questions. If anybody knows the answers, please reply.

Que:1

I have 60 million records in a Hive table and I just want to query it with a plain select, e.g. “Select * from table_name”. Nothing more than a select query.
It takes a few minutes or even hours to display the results. My client's requirement is to improve query performance so that results come back in seconds. How can I do this?

Que:2

I have an order table with four fields: order_id, order_date, order_item_order_ir, order_status. Which field should be used for partitioning, which one for bucketing, and why?

Que:3

Sorry, my question was worded wrongly. How can we predict the number of mappers based on a table in an RDBMS (e.g. a table with millions of records), without using the -m parameter?

Que:4

In the Flume HDFS sink, I have to split my incoming web log data by country. How can we do this, and where do I configure this property?
If you have a sample sink config, please share it; otherwise give me an idea.
E.g. if the data is for “India”, it goes to the India folder, and so on.

Que:5

In HDFS the default replication factor is 3. I want to change it for my existing data that already resides in HDFS. Is that possible?

Que:6

What is the purpose of a map-side join? What is the concept behind it?

Que:7

What is the purpose of the Hive metastore? Where does my metastore reside, and does it need to be available on all nodes (e.g. I have a 5-node cluster)?

Ans 1: Use the Tez engine, the ORC file format, or vectorization.

Ans 2: Don't know. Will check and update.

Ans 3: By default the number of mappers is four. You can change it in your sqoop command with the -m or --num-mappers parameter.
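To add to Ans 3: as far as I know, Sqoop does not derive the mapper count from the table size. The count stays at the default (four) unless you override it; what depends on the table is how the rows are divided. Sqoop queries the minimum and maximum of the split column (the primary key by default) and cuts that range into roughly equal parts, one per mapper. Below is a rough Python sketch of that idea only. It is not Sqoop's actual code, and split_ranges and the sample ids are made up for illustration.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Cut the inclusive id range [min_id, max_id] into contiguous pieces.

    Illustrative only: mimics the idea of dividing MIN..MAX of the split
    column evenly across mappers. Returns a list of (start, end) tuples.
    """
    total = max_id - min_id + 1
    # Never create more ranges than there are distinct ids.
    mappers = max(1, min(num_mappers, total))
    base, extra = divmod(total, mappers)

    ranges = []
    start = min_id
    for i in range(mappers):
        # The first `extra` ranges take one additional id each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    # Hypothetical table whose primary key runs from 1 to 100.
    for start, end in split_ranges(1, 100):
        print(f"mapper reads ids {start}..{end}")
```

Note that the split is on the key range, not on row counts, so if the ids are unevenly distributed some mappers will get far more rows than others.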

Ans 4: Don’t know. Will check and update

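Ans 5: Yes, we can change the HDFS replication factor in hdfs-site.xml by changing the following property. Note that this value is the default applied to newly written files.

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>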

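Ans 6: A map-side join performs the join in the map phase, so no shuffle or reduce step is needed. It can be used when one table is small enough to be loaded into memory on every mapper, or when all the records for a particular key are guaranteed to reside in the same partition. E.g. if both data sets are divided into the same number of partitions and sorted by the same key, we can use a map-side join.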
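To make the concept in Ans 6 a bit more concrete, here is a minimal Python sketch of the in-memory flavour of a map-side join: the small table is loaded into a lookup map once, and each row of the big table is joined as it streams past, with no shuffle or reduce step. This is only an illustration of the idea, not Hive's implementation; map_side_join and the sample rows are hypothetical.

```python
def map_side_join(small_rows, big_rows, key):
    """Inner-join two lists of dicts on `key` without any shuffle.

    small_rows is assumed to fit in memory (as it would on each mapper);
    big_rows is only iterated once, row by row.
    """
    # Build the in-memory lookup from the small table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Stream the big table and emit joined rows for matching keys.
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


if __name__ == "__main__":
    # Hypothetical sample data.
    customers = [
        {"customer_id": 1, "name": "A"},
        {"customer_id": 2, "name": "B"},
    ]
    orders = [
        {"order_id": 10, "customer_id": 1},
        {"order_id": 11, "customer_id": 3},
        {"order_id": 12, "customer_id": 2},
    ]
    for joined in map_side_join(customers, orders, "customer_id"):
        print(joined)
```

The order with customer_id 3 is dropped because it has no match in the small table, which is the usual inner-join behaviour.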

Ans 7: The Hive metastore service stores the metadata for Hive tables and partitions in a relational database. The RDBMS can be MySQL, PostgreSQL or Oracle DB. No, the metastore does not reside on all nodes. It resides on whichever node you configure in the cluster, or it can be a remote node outside the cluster.


Thanks for your answers.

Que:1: I thought using the Tez engine was not recommended. Also, a plain select query just dumps the result to us; no map-reduce process runs, right?

Que:3 Sorry, my question was worded wrongly. How can we predict the number of mappers based on a table in an RDBMS (e.g. a table with millions of records), without using the -m parameter?

Que:5 I think we cannot change the replication factor of existing data that resides in HDFS.

Correct me if I am wrong.

If you are going for certification, then maybe yes, unless they specify that you should not use it. But if nothing is mentioned specifically in the task instructions, you can run your Pig and Hive scripts using Tez to reduce run time wherever applicable.

If you need to change the replication factor of existing files, the following command can be used to adjust the replication factor of all files in the file system.

hdfs dfs -setrep -w <REPLICATION_FACTOR> -R /


Going forward, please break your questions into separate topics so that they are easier to manage and search.

Que:1
I have 60 million records in a Hive table and I just want to query it with a plain select, e.g. “Select * from table_name”. Nothing more than a select query.
It takes a few minutes or even hours to display the results. My client's requirement is to improve query performance so that results come back in seconds. How can I do this?

I do not understand why someone would need to run select * from table_name on a table with millions of records. If you have to dump the data to a file, you can use the hadoop fs -get command.