New Year Special : Self-Learning Courses: Get any course for just $49!  - SCHEDULE CALL

- Hadoop Blogs -

Your Complete Guide to Apache Hive Data Models

Apache Hive is basically an open source data warehouse system which can handle, query, or analyze large datasets and process structured/non-structured data in Hadoop. Apache Hive is built on Hadoop big data platform.

This article discusses the most important data source of HIVE which is Hive tables. Hive has made the task of writing complex Map-Reduce jobs easier and it is written in HiveQL language. HiveQL is very much similar to Structured Query Language.

Apache Hive has a number of beneficial features and a few of them are listed below:

  • The user can perform data querying, summarization, and analysis in an easier manner.
  • As it supports the concepts of external tables so the user need not store data actually in HDFS to process it.
  • It can fit even in Hadoop’s low-level interface perfectly.
  • Data partitioning is also supported by Apache Hive and therefore the performance is improved up to great extent.
  • The logical plans are optimized by rule-based optimizer.
  • Hive is scalable, familiar and extensible.
  • The user need not possess any experience of programming language, only knowledge of SQL is enough.
  • Structured data can be easily processed in Hive just like in SQL.
  • Ad-hoc queries can also be processed easily in Hive.

A Detailed Discussion on Apache Hive Data Models

As you know that what is Hive now, here is the time to discuss Hive data model. Following diagram shows the Hive data model:

Imagecourtesy:data-flair.com

Hive, which is an open source data warehouse and built on the top of Hadoop can analyze and store even large datasets, stored in Hadoop files. Apache Hive can store data in following three forms:

  • Table
  • Partition
  • Bucket

Further, we will discuss all three briefly for your reference and why it is important for data storage.

Tables

Apache Hive tables are similar to relational database tables. Tables of Hive consist of data and are their layout is described with the help of associated metadata. Filter, join, project and union operations can be performed on these tables.

Usually, in Hadoop, the data are stored in HDFS but Hive stores metadata in a relational database rather than HDFS. Two types of tables exist is Hive:

  • Managed or Internal Tables
  • External Tables.

Managed or Internal Tables

Read: A Beginner's Tutorial Guide For Pyspark - Python + Spark

Managed Tables of Hive are also called internal tables and are the default tables. If without specifying the type user develop this table, then it will be of an internal type.

All managed tables are created or stored in HDFS and the data of the tables are created or stored in the /user/hive/warehouse directory of HDFS.

If one deletes the table then both the table data and metadata are deleted from HDFS. Following syntax is used to create any managed table:

Create table tablename(Var String, Var Int) row format delimited fields terminated by ‘;’;

Through above command, we can create a managed table. Toload the data in the table created by above command we will have to use the following command:

Load data local in ‘path of the file’ into table tablename;

While if you want to see or check the data of any particular managed table then the following syntax can be used for that:

Hadoop dfs –ls hdfs://localhost:9000/user/hive/warehouse/tablename

In order to delete the table, you can use the following command -

Drop table tablename

While creating any managed or internal table the delimiters ‘;’ should be used to terminate any field. The location of the table can be overwritten by using LOCATION command. Here in Hive tables, the data can be loaded from two locations one is HDFS and other is a local file system.

The user can also create the virtual tables through views and they are faster than creating actual tables. Views can be used just as actual tables and be used in other queries as well. Indexing can also be done for table data to speed up query processing.

Read: Key Features & Components Of Spark Architecture

The columns which are commonly used in querying the table data should be indexed. Indexing columns can make the query processing faster. Dropping a table can delete the data from HDFS.

External Table

External tables are used for external use means when the table data resides outside Hive then these tables are used. These tables are used generally when you want to delete metadata from the table and keep the table data as it is. While deletion only table schema gets deleted. Following are the syntaxe used for external tables: To create the table, we use the following command:

Create external table tablename(Var String, Var Int) row format delimited fields terminated by ‘;’;

To check the table data following command is used:

Load data local inpath ‘path of the file’ into table tablename;

To see or check the table data following command can be used for external tables:

Hadoop dfs –ls hdfs://localhost:9000/user/hive/warehouse/tablename

To delete or drop the table following command can be used:

Drop table tablename; When to use Internal or External Tables?

The user can use any of the tables as per the requirement, but there are a few considerations to choose the type of table. Internal Tables must be used when:

  • Temporary data is to be used
  • When no external data is there and Hive will use only internal data
  • You do not want table data after deletion

External Tables must be used when:

Read: What Is Hue? Hue Hadoop Tutorial Guide for Beginners
  • The data is also used externally means if the data files are processed and used by any program that does not lock the files
  • There is no control over data settings like ‘dirs accessing’ and you have another program to perform these operations
  • You are creating a fresh table not dependent on an existing table
  • You can access table schema and create the table along with pointing its data location.

  Partition

Table of Hive is organized in partitions by grouping same types of data together based on any column or partition key. Every table has a partition key for identification. Partitions can speed up query and slicing process. Following syntax is used to create a partition:

Create Table tablename(var String,var Int) Partitioned BY (partition1 String, partition2 String);

Partitioning increases the speed of querying by reducing its latency as it only scans relevant data rather than scanning full data.

Buckets

Tables or partitions can again be sub-divided into buckets in Hive for that you will have to use the hash function. Following syntax is used to create table buckets:

Create Table tablename Partitioned BY (partition1 data_type, partition2 data_type ) Clustered BY (column1, column2) Sorted BY(column_name Asc:Desc,___) Into num_buckets Buckets;

Hive buckets are just files in the table directory which may be partitioned or un-partitioned. You can even choose n buckets to partition the data.

Final Words:

From the above discussion, this is clear that Hive data can be categorized into three types on the granular level one is a table, other is a partition and the last is a bucket. The user can use these data models as per his needs and requirements. Hive is a popular database tool for Hadoop developers and has many features apart from internal and external tables.

The concept of tables used in Hive has increased its productivity and the developers can not only use these tables as per their convenience even this concept has increased query processing up to great extent.

Hive can be used even by non-programmers as it has a simple syntax for creating and accessing the database. So, if you are planning to learn Hadoop or Hive then join JanBask’s Training and certification program right away to accelerate your learning.

Read: Your Complete Guide to Apache Hive Installation on Ubuntu Linux


fbicons FaceBook twitterTwitter lingedinLinkedIn pinterest Pinterest emailEmail

     Logo

    JanBask Training

    A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.


  • fb-15
  • twitter-15
  • linkedin-15

Comments

Trending Courses

Cyber Security Course

Cyber Security

  • Introduction to cybersecurity
  • Cryptography and Secure Communication 
  • Cloud Computing Architectural Framework
  • Security Architectures and Models
Cyber Security Course

Upcoming Class

7 days 25 Jan 2025

QA Course

QA

  • Introduction and Software Testing
  • Software Test Life Cycle
  • Automation Testing and API Testing
  • Selenium framework development using Testing
QA Course

Upcoming Class

-0 day 18 Jan 2025

Salesforce Course

Salesforce

  • Salesforce Configuration Introduction
  • Security & Automation Process
  • Sales & Service Cloud
  • Apex Programming, SOQL & SOSL
Salesforce Course

Upcoming Class

7 days 25 Jan 2025

Business Analyst Course

Business Analyst

  • BA & Stakeholders Overview
  • BPMN, Requirement Elicitation
  • BA Tools & Design Documents
  • Enterprise Analysis, Agile & Scrum
Business Analyst Course

Upcoming Class

7 days 25 Jan 2025

MS SQL Server Course

MS SQL Server

  • Introduction & Database Query
  • Programming, Indexes & System Functions
  • SSIS Package Development Procedures
  • SSRS Report Design
MS SQL Server Course

Upcoming Class

7 days 25 Jan 2025

Data Science Course

Data Science

  • Data Science Introduction
  • Hadoop and Spark Overview
  • Python & Intro to R Programming
  • Machine Learning
Data Science Course

Upcoming Class

7 days 25 Jan 2025

DevOps Course

DevOps

  • Intro to DevOps
  • GIT and Maven
  • Jenkins & Ansible
  • Docker and Cloud Computing
DevOps Course

Upcoming Class

6 days 24 Jan 2025

Hadoop Course

Hadoop

  • Architecture, HDFS & MapReduce
  • Unix Shell & Apache Pig Installation
  • HIVE Installation & User-Defined Functions
  • SQOOP & Hbase Installation
Hadoop Course

Upcoming Class

-0 day 18 Jan 2025

Python Course

Python

  • Features of Python
  • Python Editors and IDEs
  • Data types and Variables
  • Python File Operation
Python Course

Upcoming Class

14 days 01 Feb 2025

Artificial Intelligence Course

Artificial Intelligence

  • Components of AI
  • Categories of Machine Learning
  • Recurrent Neural Networks
  • Recurrent Neural Networks
Artificial Intelligence Course

Upcoming Class

7 days 25 Jan 2025

Machine Learning Course

Machine Learning

  • Introduction to Machine Learning & Python
  • Machine Learning: Supervised Learning
  • Machine Learning: Unsupervised Learning
Machine Learning Course

Upcoming Class

20 days 07 Feb 2025

 Tableau Course

Tableau

  • Introduction to Tableau Desktop
  • Data Transformation Methods
  • Configuring tableau server
  • Integration with R & Hadoop
 Tableau Course

Upcoming Class

-0 day 18 Jan 2025

Search Posts

Reset

Receive Latest Materials and Offers on Hadoop Course

Interviews