Despite its simplicity and unique power, Hadoop is new territory for many developers, and it has to be learnt and understood before it can be applied comfortably. To ease the difficulty of mastering a new programming environment, developers at Facebook came up with a runtime support structure on top of Hadoop called Hadoop Hive. With this structure, anyone who is conversant in SQL can use the Hadoop platform with ease and convenience.
Hadoop Hive lets SQL developers write statements in HQL, the Hive Query Language. HQL statements read much like SQL; its command set is more limited, but it is still very useful. Hive breaks each statement down into one or more MapReduce jobs, which are then executed on the Hadoop cluster.
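As an illustrative sketch, a typical HQL query looks almost identical to SQL (the page_views table and its columns here are hypothetical):

    -- Hive compiles this GROUP BY into one or more MapReduce jobs
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE view_date = '2015-01-01'
    GROUP BY country;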
Anyone with a background in SQL or relational databases will find the system familiar. Hive queries can be run in several ways: from the shell (the command-line interface), or from JDBC or ODBC applications. For the latter, users need the Hive JDBC/ODBC drivers or the Hive Thrift Client.
The Hive Thrift Client is similar to other database clients. It is usually installed on the client-side machine, although in a three-tier architecture it can sit in the middle tier. From the client side it communicates with the Hive services running on the server side. The advantage of the Hive Thrift Client is that it can be used from applications written in languages such as C++, PHP, Java, Python, and Ruby.
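As a minimal sketch of the JDBC route, Hive's own Beeline shell connects to HiveServer2 through a JDBC URL; the host, port, and database below are placeholders (10000 is HiveServer2's default port):

    !connect jdbc:hive2://localhost:10000/default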
The process resembles accessing a database with SQL on DB2 or Informix, but there are several differences, because Hive is built on Hadoop and MapReduce. Hadoop is designed for long sequential scans, so Hive queries have high latency; Hive is therefore not ideal for applications that need fast response times. Hive is also read-oriented and not suited to workloads that involve a lot of write operations.
Basically, Hive is a data warehouse infrastructure developed on top of Hadoop. It provides users with facilities for data summarization, querying, and analysis. It was developed internally at Facebook but is now used by many other companies, such as Netflix and Amazon.
The basic advantage of Hadoop Hive is that it supports analysis of large datasets stored in Hadoop's HDFS or in compatible file systems such as Amazon S3. Queries written in HQL are converted into map/reduce jobs. Besides MapReduce, the jobs can also be translated into Apache Tez or Spark jobs, since all three engines run on Hadoop YARN.
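As one concrete illustration, the execution engine can be chosen per session through the hive.execution.engine property (a sketch; the values actually accepted depend on the Hive version and on which engines are installed):

    -- Run subsequent queries on Tez instead of classic MapReduce
    SET hive.execution.engine=tez;  -- accepted values: mr, tez, spark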
Hadoop Hive stores its metadata in an embedded Apache Derby database by default; other client/server databases such as MySQL can optionally be used instead. Hive supports four main file formats: TEXTFILE, SEQUENCEFILE, RCFILE, and ORC (the columnar ORC format arrived in the 0.11 release).
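The file format is chosen when a table is created, as in this hedged sketch (the table names and columns are made up for illustration):

    -- Store the table's data in the columnar ORC format
    CREATE TABLE logs_orc (ts STRING, msg STRING) STORED AS ORC;
    -- Plain text is the default, but it can also be requested explicitly
    CREATE TABLE logs_text (ts STRING, msg STRING) STORED AS TEXTFILE;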
Similarly, Hive offers various storage types, such as plain text, HBase, ORC, and RCFile. Since the metadata is stored in an RDBMS, the checks performed during query execution take significantly less time. Hive also provides built-in functions and supports user-defined functions (UDFs).
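When a built-in function is not enough, a custom UDF packaged in a JAR can be registered directly from HQL; in this sketch the JAR path, class name, and table are all hypothetical:

    -- Make the JAR containing the UDF visible to the session
    ADD JAR /tmp/my_udfs.jar;
    -- Register the Java class as a function callable from HQL
    CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpperUDF';
    -- Use it like any built-in function
    SELECT to_upper(name) FROM employees;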