The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore. Hive query language provides the basic SQL like operations. SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a condition. It filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression, which fulfils the condition.
DDL Statements
- CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
- DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
- TRUNCATE TABLE
- ALTER DATABASE/SCHEMA, TABLE, VIEW
- MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
- SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
- DESCRIBE DATABASE/SCHEMA, table_name, view_name
PARTITION statements are usually options of TABLE statements, except for SHOW PARTITIONS.
Few DDL statements are listed
Create Database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment] [LOCATION hdfs_path] [WITH DBPROPERTIES (property_name=property_value, …)];Drop Database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Alter Database
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, …);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
Use Database
USE database_name;
USE DEFAULT;
Create Table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name — (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], …)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], …)] [CLUSTERED BY (col_name, col_name, …) [SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS] [SKEWED BY (col_name, col_name, …) — (Note: Available in Hive 0.10.0 and later)]ON ((col_value, col_value, …), (col_value, col_value, …), …)
[STORED AS DIRECTORIES] [ [ROW FORMAT row_format] [STORED AS file_format]| STORED BY ‘storage.handler.class.name’ [WITH SERDEPROPERTIES (…)] — (Note: Available in Hive 0.6.0 and later)
] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, …)] — (Note: Available in Hive 0.6.0 and later)
[AS select_statement]; — (Note: Available in Hive 0.5.0 and later; not supported for external tables)CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type — (Note: Available in Hive 0.7.0 and later)
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| STRING
| BINARY — (Note: Available in Hive 0.8.0 and later)
| TIMESTAMP — (Note: Available in Hive 0.8.0 and later)
| DECIMAL — (Note: Available in Hive 0.11.0 and later)
| DECIMAL(precision, scale) — (Note: Available in Hive 0.13.0 and later)
| DATE — (Note: Available in Hive 0.12.0 and later)
| VARCHAR — (Note: Available in Hive 0.12.0 and later)
| CHAR — (Note: Available in Hive 0.13.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], …>
union_type
: UNIONTYPE < data_type, data_type, … > — (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] [NULL DEFINED AS char] — (Note: Available in Hive 0.13 and later)
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, …)]
file_format:
: SEQUENCEFILE
| TEXTFILE — (Default, depending on hive.default.fileformat configuration)
| RCFILE — (Note: Available in Hive 0.6.0 and later)
| ORC — (Note: Available in Hive 0.11.0 and later)
| PARQUET — (Note: Available in Hive 0.13.0 and later)
| AVRO — (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
Partitioned Tables – Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns. Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
If, when creating a partitioned table, you get this error: “FAILED: Error in semantic analysis: Column repeated in partitioning columns,” it means you are trying to include the partitioned column in the data of the table itself. You probably really do have the column defined. However, the partition you create makes a pseudocolumn on which you can query, so you must rename your table column to something else (that users should not query on!). For example, suppose your original unpartitioned table had three columns: id, date, and name.
id int,
date date,
name varchar
Now you want to partition on date. Your Hive definition could use “dtDontQuery” as a column name so that “date” can be used for partitioning (and querying).
create table table_name (
id int,
dtDontQuery string,
name string
)
partitioned by (date string)
Now your users will still query on “where date = ‘…'” but the second column dtDontQuery will hold the original values.
Here’s an example statement to create a partitioned table:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT ‘IP Address of the User’)
COMMENT ‘This is the page view table’
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns (including comments). The table is also partitioned and data is stored in sequence files. The data format in the files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT ‘IP Address of the User’)
COMMENT ‘This is the page view table’
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\001’
STORED AS SEQUENCEFILE;
The above statement lets you create the same table as the previous table. In the previous examples the data is stored in <hive.metastore.warehouse.dir>/page_view. Specify a value for the key hive.metastore.warehouse.dir in the Hive config file hive-site.xml.
External Tables – The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
Example:
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT ‘IP Address of the User’,
country STRING COMMENT ‘country of origination’)
COMMENT ‘This is the staging page view table’
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\054’
STORED AS TEXTFILE
LOCATION ‘<hdfs_location>’;
You can use the above statement to create a page_view table which points to any HDFS location for its storage. But you still have to make sure that the data is delimited as specified in the CREATE statement above.
Create Table As Select (CTAS) – Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS) statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the query results are populated. So other users will either see the table with the complete results of the query or will not see the table at all.
There are two parts in CTAS, the SELECT part can be any SELECT statement supported by HiveQL. The CREATE part of the CTAS takes the resulting schema from the SELECT part and creates the target table with other table properties such as the SerDe and storage format. CTAS has these restrictions
- The target table cannot be a partitioned table.
- The target table cannot be an external table.
- The target table cannot be a list bucketing table.
Example:
CREATE TABLE new_key_value_store
ROW FORMAT SERDE “org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe”
STORED AS RCFile
AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
The above CTAS statement creates the target table new_key_value_store with the schema (new_key DOUBLE, key_value_pair STRING) derived from the results of the SELECT statement. If the SELECT statement does not specify column aliases, the column names will be automatically assigned to _col0, _col1, and _col2 etc. In addition, the new target table is created using a specific SerDe and a storage format independent of the source tables in the SELECT statement.
Starting with Hive 0.13.0, the SELECT statement can include one or more common table expressions (CTEs), as shown in the SELECT syntax. Being able to select data from one table to another is one of the most powerful features of Hive. Hive handles the conversion of the data from the source format to the destination format as the query is being executed.
Bucketed Sorted Tables
Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT ‘IP Address of the User’)
COMMENT ‘This is the page view table’
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\001’
COLLECTION ITEMS TERMINATED BY ‘\002’
MAP KEYS TERMINATED BY ‘\003’
STORED AS SEQUENCEFILE;
In the example above, the page_view table is bucketed (clustered by) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column – in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can be used if any of the columns are lists or maps.
Temporary Tables – A table that has been created as a temporary table will only be visible to the current session. Data will be stored in the user’s scratch directory, and deleted at the end of the session.
If a temporary table is created with a database/table name of a permanent table which already exists in the database, then within that session any references to that table will resolve to the temporary table, rather than to the permanent table. The user will not be able to access the original table within that session without either dropping the temporary table, or renaming it to a non-conflicting name.
Temporary tables have the limitations of partition columns not being supported and no support for creation of indexes.
Rename Table
ALTER TABLE table_name RENAME TO new_table_name;
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, … )
Recover Partitions (MSCK REPAIR TABLE)
MSCK REPAIR TABLE table_name;
ALTER TABLE table_name RECOVER PARTITIONS;
Drop Partitions
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, …] [IGNORE PROTECTION] [PURGE];
Create View
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], …) ] [COMMENT view_comment] [TBLPROPERTIES (property_name = property_value, …)]
AS SELECT …;
Alter View Properties
ALTER VIEW view_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, …)
Drop View
DROP VIEW [IF EXISTS] view_name;
DML Statements
They are similar to SQL statements and their syntax is listed
LOAD
LOAD DATA [LOCAL] INPATH ‘filepath’ [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 …)]
INSERT
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 …) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 …)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 …) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION … [IF NOT EXISTS]] select_statement2] [INSERT INTO TABLE tablename2 [PARTITION …] select_statement2] …;FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 …)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION …] select_statement2] [INSERT OVERWRITE TABLE tablename2 [PARTITION … [IF NOT EXISTS]] select_statement2] …;Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] …) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] …) select_statement FROM from_statement;
UPDATE
UPDATE tablename SET column = value [, column = value …] [WHERE expression]
DELETE
DELETE FROM tablename [WHERE expression]
Examples
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
INSERT INTO TABLE students
VALUES (‘fred flintstone’, 35, 1.28), (‘barney rubble’, 32, 2.32);
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
INSERT INTO TABLE pageviews PARTITION (datestamp = ‘2014-09-23’)
VALUES (‘jsmith’, ‘mail.com’, ‘sports.com’), (‘jdoe’, ‘mail.com’, null);
INSERT INTO TABLE pageviews PARTITION (datestamp)
VALUES (‘tjohnson’, ‘sports.com’, ‘finance.com’, ‘2014-09-23’), (‘tlee’, ‘finance.com’, null, ‘2014-09-21’);