31
JanNew Year Special : Self-Learning Courses: Get any course for just $49! - SCHEDULE CALL
Apache Hive is basically an open source data warehouse system which can handle, query, or analyze large datasets and process structured/non-structured data in Hadoop. Apache Hive is built on Hadoop big data platform.
This article discusses the most important data source of HIVE which is Hive tables. Hive has made the task of writing complex Map-Reduce jobs easier and it is written in HiveQL language. HiveQL is very much similar to Structured Query Language.
Apache Hive has a number of beneficial features and a few of them are listed below:
A Detailed Discussion on Apache Hive Data Models
As you know that what is Hive now, here is the time to discuss Hive data model. Following diagram shows the Hive data model:
Imagecourtesy:data-flair.com
Hive, which is an open source data warehouse and built on the top of Hadoop can analyze and store even large datasets, stored in Hadoop files. Apache Hive can store data in following three forms:
Further, we will discuss all three briefly for your reference and why it is important for data storage.
Tables
Apache Hive tables are similar to relational database tables. Tables of Hive consist of data and are their layout is described with the help of associated metadata. Filter, join, project and union operations can be performed on these tables.
Usually, in Hadoop, the data are stored in HDFS but Hive stores metadata in a relational database rather than HDFS. Two types of tables exist is Hive:
Managed or Internal Tables
Read: A Beginner's Tutorial Guide For Pyspark - Python + Spark
Managed Tables of Hive are also called internal tables and are the default tables. If without specifying the type user develop this table, then it will be of an internal type.
All managed tables are created or stored in HDFS and the data of the tables are created or stored in the /user/hive/warehouse directory of HDFS.
If one deletes the table then both the table data and metadata are deleted from HDFS. Following syntax is used to create any managed table:
Create table tablename(Var String, Var Int) row format delimited fields terminated by ‘;’;
Through above command, we can create a managed table. Toload the data in the table created by above command we will have to use the following command:
Load data local in ‘path of the file’ into table tablename;
While if you want to see or check the data of any particular managed table then the following syntax can be used for that:
Hadoop dfs –ls hdfs://localhost:9000/user/hive/warehouse/tablename
In order to delete the table, you can use the following command -
Drop table tablename
While creating any managed or internal table the delimiters ‘;’ should be used to terminate any field. The location of the table can be overwritten by using LOCATION command. Here in Hive tables, the data can be loaded from two locations one is HDFS and other is a local file system.
The user can also create the virtual tables through views and they are faster than creating actual tables. Views can be used just as actual tables and be used in other queries as well. Indexing can also be done for table data to speed up query processing.
Read: Key Features & Components Of Spark Architecture
The columns which are commonly used in querying the table data should be indexed. Indexing columns can make the query processing faster. Dropping a table can delete the data from HDFS.
External Table
External tables are used for external use means when the table data resides outside Hive then these tables are used. These tables are used generally when you want to delete metadata from the table and keep the table data as it is. While deletion only table schema gets deleted. Following are the syntaxe used for external tables: To create the table, we use the following command:
Create external table tablename(Var String, Var Int) row format delimited fields terminated by ‘;’;
To check the table data following command is used:
Load data local inpath ‘path of the file’ into table tablename;
To see or check the table data following command can be used for external tables:
Hadoop dfs –ls hdfs://localhost:9000/user/hive/warehouse/tablename
To delete or drop the table following command can be used:
Drop table tablename; When to use Internal or External Tables?
The user can use any of the tables as per the requirement, but there are a few considerations to choose the type of table. Internal Tables must be used when:
External Tables must be used when:
Read: What Is Hue? Hue Hadoop Tutorial Guide for Beginners
Partition
Table of Hive is organized in partitions by grouping same types of data together based on any column or partition key. Every table has a partition key for identification. Partitions can speed up query and slicing process. Following syntax is used to create a partition:
Create Table tablename(var String,var Int) Partitioned BY (partition1 String, partition2 String);
Partitioning increases the speed of querying by reducing its latency as it only scans relevant data rather than scanning full data.
Buckets
Tables or partitions can again be sub-divided into buckets in Hive for that you will have to use the hash function. Following syntax is used to create table buckets:
Create Table tablename Partitioned BY (partition1 data_type, partition2 data_type ) Clustered BY (column1, column2) Sorted BY(column_name Asc:Desc,___) Into num_buckets Buckets;
Hive buckets are just files in the table directory which may be partitioned or un-partitioned. You can even choose n buckets to partition the data.
Final Words:
From the above discussion, this is clear that Hive data can be categorized into three types on the granular level one is a table, other is a partition and the last is a bucket. The user can use these data models as per his needs and requirements. Hive is a popular database tool for Hadoop developers and has many features apart from internal and external tables.
The concept of tables used in Hive has increased its productivity and the developers can not only use these tables as per their convenience even this concept has increased query processing up to great extent.
Hive can be used even by non-programmers as it has a simple syntax for creating and accessing the database. So, if you are planning to learn Hadoop or Hive then join JanBask’s Training and certification program right away to accelerate your learning.
Read: Your Complete Guide to Apache Hive Installation on Ubuntu Linux
A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.
Cyber Security
QA
Salesforce
Business Analyst
MS SQL Server
Data Science
DevOps
Hadoop
Python
Artificial Intelligence
Machine Learning
Tableau
Search Posts
Related Posts
Receive Latest Materials and Offers on Hadoop Course
Interviews