Tuesday 11 August 2015

What is Hadoop Hive? An Explanation

Despite its simplicity and unique power, Hadoop is still new to many. It has to be learnt and understood before it can be applied easily. To address the difficulty of mastering a new programming environment, developers at Facebook came up with a runtime support structure on top of Hadoop called Hadoop Hive. With this new structure, anyone who is well conversant in SQL can use the Hadoop platform with ease and convenience.
Hadoop Hive allows SQL developers to write HQL, the Hive Query Language. Statements written in HQL are similar to those written in SQL. Although its command set is more limited, HQL is still very useful. Hive breaks the statements down into MapReduce jobs, which are then executed on the Hadoop cluster.
People with knowledge of SQL or an RDBMS will find the system familiar. Hive queries can be run in different ways, such as from the shell (command-line interface) or from JDBC or ODBC applications. For the latter, users have to use the Hive JDBC/ODBC drivers or the Hive Thrift Client.
The Hive Thrift Client is similar to other database clients. It is usually installed on the client-side machine, although in a three-tier architecture it can sit in the middle tier. From the client side it communicates with the Hive services running on the server side. The advantage of the Hive Thrift Client is that it can be used from applications written in languages such as C++, PHP, Java, Python and Ruby.
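To illustrate the JDBC route described above, here is a minimal sketch (not part of the original article) of running an HQL statement through the HiveServer2 JDBC driver. The host, port, user and table name are placeholder assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the host, port and database below are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");

        try (Statement stmt = con.createStatement()) {
            // A simple HQL query; Hive turns it into MapReduce jobs behind the scenes.
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_table");
            while (rs.next()) {
                System.out.println("row count = " + rs.getLong(1));
            }
        }
        con.close();
    }
}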
The process is similar to accessing a SQL database such as DB2 or Informix, but there are several differences as well, because Hive is built on Hadoop and MapReduce. Hadoop is designed for long sequential scans, so Hive queries have high latency. That is why Hive may not be ideal for applications that need fast response times. Hive is also read-oriented and not suited to workloads with a lot of write operations.
Basically, Hive is data warehouse infrastructure built on top of Hadoop. It provides the user with facilities for data summarization, querying and analysis. It was developed internally at Facebook but is now used by many other companies such as Netflix and Amazon.
The basic advantage of Hadoop Hive is that it helps with the analysis of large datasets stored in Hadoop HDFS or other compatible storage systems. Queries written in HQL are converted into MapReduce jobs. Besides MapReduce, the queries can also be translated into Apache Tez or Spark jobs, all of which can run on Hadoop YARN.
Hadoop Hive stores its metadata in an embedded Apache Derby database by default; other databases such as MySQL can optionally be used instead. Hive supports four main file formats: TEXTFILE, SEQUENCEFILE, RCFILE and ORC (ORC support was added in Hive 0.11).
Similarly, Hive offers various storage types such as plain text, HBase, ORC and RCFile. Since the metadata is stored in an RDBMS, the checks performed during query execution take less time. Hive also has built-in functions as well as user-defined functions (UDFs).
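As a small follow-on to the JDBC sketch above, the file format of a Hive table is chosen when the table is created. The table and column names below are hypothetical.

// Reusing the Connection "con" from the JDBC sketch above.
try (java.sql.Statement stmt = con.createStatement()) {
    // STORED AS selects the file format (TEXTFILE, SEQUENCEFILE, RCFILE or ORC).
    stmt.execute("CREATE TABLE page_views (user_id INT, url STRING) STORED AS ORC");
}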

Latest Hadoop Interview Questions 2015

Q1. What is the channel of communication between the client and the name node, and between the client and the data node?
Ans. In both cases the client communicates over TCP: with the name node through Hadoop's RPC protocol, and with the data nodes through the streaming data-transfer protocol. (SSH is only used by the start-up scripts to launch the daemons.)
Q2. How is data stored on the rack?
Ans. While a file is being loaded into the cluster, its contents are divided into blocks. For each block in the file the client is given three data nodes, which indicates where that block will be stored.
Q3. What is the basic rule for data storage in a rack?
Ans. The basic principle is that for each block of data, two copies are stored in one rack and a third copy is stored in a different rack. This rule is known as the replica placement policy in Hadoop.
Q4. Why are the 2nd and 3rd copies placed on rack 2 only?
Ans. This is done in order to guard against the failure of a data node.
Q5. What happens when both rack 2 and the data node fail?
Ans. If this happens then the data can no longer be retrieved. The solution is to replicate the data more than three times.
Q6. How can data be replicated more times?
Ans. The default replication factor in Hadoop is 3, and it can be set to a higher number to withstand the failure of rack 2 and a data node at the same time, as the sketch below shows.
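As an illustration, the sketch below shows one way to raise the replication factor from Java using the standard org.apache.hadoop.fs.FileSystem API; the file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replication factor used for new files created by this client (normally 3).
        conf.set("dfs.replication", "5");

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of an existing (hypothetical) file to 5.
        fs.setReplication(new Path("/data/sample.txt"), (short) 5);
        fs.close();
    }
}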
Q7. Is the secondary name node a substitute for the name node?
Ans. No! The secondary name node is not a substitute for the name node; it constantly reads the metadata from the name node's RAM and writes it back to the hard disk of the file system. If the name node fails, the entire Hadoop system goes down.
Q8. Is there any difference between the Gen 1 and Gen 2 Hadoop name nodes?
Ans. Yes. In Gen 1 the name node is a single point of failure for Hadoop, whereas in Gen 2 there are active and passive name nodes. When the active name node fails, the passive name node takes over.
Q9. What task is accomplished by MapReduce?
Ans. MapReduce can be considered the heart of Hadoop. It has two parts, namely 'map' and 'reduce'. Map processes the data to produce an intermediate output, and reduce generates the final output from it.
Q10. What are the uses of the two parts, map and reduce, in Hadoop?
Ans. They allow distributed processing of the map and reduce operations in Hadoop through the step-wise generation of intermediate and final output.
Q11. How do map and reduce work?
Ans. The framework divides the input into parts (splits) and distributes these parts to the data nodes. The data nodes process them and return key-value pairs as intermediate output. The reducer collects the key-value pairs from all the data nodes and generates the final output, as in the word-count sketch below.
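For illustration only, here is a minimal word-count sketch using the org.apache.hadoop.mapreduce API; the class and variable names are example choices, not part of the original questions.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: read one line at a time and emit (word, 1) as intermediate key-value pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate output
            }
        }
    }
}

// Reduce: receive all the counts for one word and emit the total as the final output.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final output
    }
}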
Q12. What is meant by key-value pairs?
Ans. In Hadoop, key-value pairs denote the intermediate data that is generated by the maps and transmitted to reduce so that the final output can be generated.
Q13. What are the differences between an HDFS cluster and the MapReduce engine?
Ans. The MapReduce engine is the programming module used to retrieve and analyse the data, whereas the HDFS cluster refers to the whole master-slave storage configuration.
Q14. Are different servers required for name node and data node?
Ans. Two different servers are required for name nodes and data nodes. 
Q15. Why are different servers required for name node and data node?
Ans. The name node requires a highly reliable, high-end system because it stores the details of the location of all the files held on the different data nodes. Data nodes can run on low-configuration systems. That is why different servers are required for the two.
Q16. Are the number of splits and maps equal to each other?
Ans. Yes! One map task is created for each input split, and each split is processed as key-value pairs by its own mapper, which is why the number of splits and the number of maps are equal.
Q17. What are the write types in HDFS?
Ans. Two types of writes in HDFS are ‘posted’ and ‘non-posted’. 
Q18. What is the difference between posted and non-posted writes in HDFS?
Ans. Posted writes do not require an acknowledgement, whereas non-posted writes do. Both are asynchronous, but non-posted writes are more expensive.
Q19. Are reading and writing both done in parallel in HDFS?
Ans. No! Reading in HDFS is done in parallel but writing is not.
Q20. Why is writing not performed in parallel?
Ans. Performing writes in parallel could result in data inconsistency. If two nodes wrote to the same file simultaneously, it would not be clear which version should be stored and accessed.
Q21. Is Hadoop similar to a NoSQL database?
Ans. Though there are many similarities, NoSQL databases do not contain a distributed file system. Hadoop, on the other hand, is not a database but a file system (HDFS) plus a distributed programming framework (MapReduce).
Q22. How do you define MapReduce?
Ans. MapReduce is a parallel programming model used to process big data across the numerous servers of a Hadoop cluster.
Q23. What are the tasks accomplished by MapReduce?
Ans. MapReduce brings the computation to the data, whereas traditional parallelism brings the data to the compute locations. The map job takes a data set and converts it into another set in which individual elements are broken down into smaller tuples of data. Reduce takes the key-value pairs produced by map and generates the output.
Q24. What does a MapReduce program consist of?
Ans. There are three parts in a MapReduce program, namely the driver, the mapper and the reducer.
Q25. What are the functions of driver in Hadoop?
Ans. The driver code runs on the client machine; it builds up the configuration of the job and submits it to the Hadoop cluster. The driver code contains the main() method, which accepts the command-line arguments. A minimal driver sketch is shown below.
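Here is a minimal driver sketch, assuming the WordCountMapper and WordCountReducer classes from the earlier example and input/output paths passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input path, args[1] = output path (from the command line).
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}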
Q26. What are the functions of mapper in Hadoop?
Ans. The mapper code reads the input files as key-value pairs; in the classic API it extends MapReduceBase and implements the Mapper interface.
Q27. What is the mapper interface?
Ans. The Mapper interface expects four generic type parameters, which define the types of the input and output key-value pairs. The first two generics define the input key and value, and the last two define the output key and value.
Q28. What is the interface that is used to create Mapper as well as Reducer in Hadoop?
Ans. The interfaces are org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapreduce.Reducer respectively. 
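As a short sketch, the class declarations below annotate which generic parameter plays which role; the word-count types are just an example choice.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<input key, input value, output key, output value>
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { /* map() goes here */ }

// Reducer<input key, input value, output key, output value>
// (its input types must match the mapper's output types)
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { /* reduce() goes here */ }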
Q29. What are the daemon processes that run on a Hadoop cluster?
Ans. There are five separate daemons in a Hadoop cluster, and each of them runs in its own JVM. The Name Node, Secondary Name Node and Job Tracker run on the master nodes, while the Data Node and Task Tracker are the two daemons that run on the slave nodes.
Q30. What is an input split in Hadoop?
Ans. An input split is processed by a single map and represents the data that an individual mapper will process. Each split is further divided into records, and the map processes each record to create a key-value pair. A split covers a number of rows, while a record is one specific row.
Q31. How is the length of an input split measured in Hadoop?
Ans. The length of an input split is measured in bytes.
Q32. What is meant by input format?
Ans. The input format class is fundamental to the MapReduce framework; it defines how the data is split and provides the record reader. A simplified sketch of the split-size rule is shown below.
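For illustration, here is a simplified sketch of the split-size rule used by FileInputFormat (not the exact Hadoop source); it also explains the answer to the next question.

public class SplitSizeSketch {
    // Simplified form of the rule FileInputFormat uses to size splits:
    //   splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // With the defaults (minSize = 1 byte, maxSize = Long.MAX_VALUE) the split
        // size equals the block size, e.g. 64 MB, so a 65 MB file yields 2 splits.
        long blockSize = 64L * 1024 * 1024;
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}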
Q33. In an M/R system the HDFS block size is 64 MB and the input format is FileInputFormat. There are 3 files of sizes 64 KB, 65 MB and 127 MB. How many input splits will the Hadoop framework create in such an environment?
Ans. There would be 5 splits: 1 for the 64 KB file, 2 for the 65 MB file (64 MB + 1 MB) and 2 for the 127 MB file (64 MB + 63 MB), because the splits follow the 64 MB block size.
Q34. What happens when job tracker machine is down?
Ans. In the 1.0 version of Hadoop, when the job tracker fails all running jobs halt and have to be restarted, interrupting the overall execution flow. In the 2.0 versions the job tracker concept has been replaced by YARN.
Q35. What is the significance of YARN in the new version of Hadoop?
Ans. With the arrival of YARN the job tracker and task tracker have both disappeared. YARN divides their functionality into two different daemons, namely a global Resource Manager and a node-specific Node Manager.