
Hive Data Types

Hive supports many of the primitive data types you find in relational databases, as well as three collection data types that are rarely found in relational databases, for reasons we will discuss shortly. A unique feature of Hive, compared to most databases, is that it provides great flexibility in how data is encoded in files. Most databases take total control of the data, both how it is persisted to disk and its life cycle. By letting you control all these aspects, Hive makes it easier to manage and process data with a variety of tools.

Primitive Data Types

Hive supports several sizes of integer and floating-point types, a Boolean type, and character strings of arbitrary length. Hive v0.8.0 added types for timestamps and binary fields.

Data Type            Description                                      Example
TINYINT              1-byte signed integer.                           20
SMALLINT             2-byte signed integer.                           20
INT                  4-byte signed integer.                           20
BIGINT               8-byte signed integer.                           20
BOOLEAN              Boolean true or false.                           TRUE
FLOAT                Single-precision floating point.                 3.14159
DOUBLE               Double-precision floating point.                 3.14159
STRING               Sequence of characters. The character set can be specified. Single or double quotes can be used.   'Now is the time', "for all good men"
TIMESTAMP (v0.8.0+)  Integer, float, or string.                       1327882394 (Unix epoch seconds), 1327882394.123456789 (Unix epoch seconds plus nanoseconds), '2012-02-03 12:34:56.123456789' (JDBC-compliant java.sql.Timestamp format)
BINARY (v0.8.0+)     Array of bytes.                                  See the discussion below for more information.
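As a sketch of how these primitive types appear in practice, a table declaration might combine several of them as follows (the table and column names are illustrative, not from the original):

```sql
-- Hypothetical table using several Hive primitive types
CREATE TABLE employees (
  id        BIGINT,     -- 8-byte signed integer
  name      STRING,     -- arbitrary-length character sequence
  salary    DOUBLE,     -- double-precision floating point
  active    BOOLEAN,    -- TRUE or FALSE
  hired_at  TIMESTAMP,  -- v0.8.0+ timestamp
  avatar    BINARY      -- v0.8.0+ array of bytes
);
```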

Note that Hive does not support "character arrays" (strings) with maximum-allowed lengths, as is common in other SQL dialects. Relational databases offer this feature as a performance optimization; fixed-length records are easier to index, scan, etc. In the "looser" world in which Hive lives, where it may not own the data files and has to be flexible on file format, Hive relies on the presence of delimiters to separate fields. Also, Hadoop and Hive emphasize optimizing disk reading and writing performance, where fixing the lengths of column values is relatively unimportant.

Values of the new TIMESTAMP type can be integers, which are interpreted as seconds since the Unix epoch (midnight, January 1, 1970); floats, which are interpreted as seconds since the epoch with nanosecond resolution (up to 9 decimal places); or strings, which are interpreted according to the JDBC date string format convention, YYYY-MM-DD hh:mm:ss.fffffffff. TIMESTAMPs are interpreted as UTC times. Hive provides built-in functions for converting timestamps to and from other time zones: to_utc_timestamp and from_utc_timestamp, respectively.
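A brief sketch of working with epoch values and the time-zone conversion functions mentioned above (the table and column names are hypothetical):

```sql
-- Render an integer column of epoch seconds as a date string
SELECT from_unixtime(epoch_seconds) FROM events;

-- Shift a UTC timestamp into a named time zone, and back
SELECT from_utc_timestamp(ts, 'America/New_York') FROM events;
SELECT to_utc_timestamp(local_ts, 'America/New_York') FROM events;
```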

The BINARY type is similar to the VARBINARY type found in many relational databases. It is not like a BLOB type, since BINARY columns are stored within the record, not separately like BLOBs. BINARY can be used as a way of including arbitrary bytes in a record and preventing Hive from attempting to parse them as numbers, strings, etc. Note that you don't need BINARY if your goal is to ignore the tail end of each record. If a table schema specifies three columns and the data files contain five values for each record, the last two will be ignored by Hive.

What if you run a query that compares a FLOAT column to a DOUBLE column, or a value of one integer type with a value of a different integer type? Hive will implicitly cast any integer to the larger of the two integer types, cast FLOAT to DOUBLE, and cast any integer value to DOUBLE, as needed, so that it compares identical types. What if you run a query that needs to interpret a string column as a number? You can explicitly cast one type to another, as in the following fragment, where s is a string column holding a value that represents an integer: ... cast(s AS INT) ... (To be clear, AS and INT are keywords, so lowercase would also be fine.)
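The explicit cast described above can be sketched in a full query like this (the table and column names are hypothetical):

```sql
-- Explicitly interpret the string column s as an integer before comparing
SELECT * FROM some_table
WHERE cast(s AS INT) > 100;
```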

Collection (Complex) Data Types

Hive has three main complex types: ARRAY, MAP, and STRUCT. These data types are built on top of the primitive data types. ARRAY and MAP are similar to their namesakes in Java, while STRUCT is a record type that may contain a set of fields of any types. Complex types allow the nesting of types. The details of the complex types are as follows:

Complex Data Type   Description                                      Example
ARRAY               A list of items of the same type, such as (val1, val2, and so on). Values are accessed as array_name[index], for example fruit[0]='apple'.   ['apple','orange','mango']
MAP                 A set of key-value pairs, such as (key1, val1, key2, val2, and so on). Values are accessed as map_name[key], for example fruit[1]="apple".   {1: "apple", 2: "orange"}
STRUCT              A user-defined structure of fields of any types, such as {val1, val2, val3, and so on}. By default, STRUCT field names are col1, col2, and so on. Values are accessed as struct_name.column_name, for example fruit.col1=1.   {1, "apple"}
NAMED STRUCT        A user-defined structure of any number of typed, named fields, such as (name1, val1, name2, val2, and so on). Values are accessed as struct_name.field_name, for example fruit.apple='gala'.   {"apple":"gala","weight kg":1}
UNION               A structure that holds exactly one of its specified data types at a time. Available since Hive 0.7.0; not commonly used.   {2:["apple","orange"]}

For a MAP, all keys must share one type and all values another, whereas STRUCT is more flexible. A STRUCT is more like a table row, while a MAP is more like an ARRAY with a customized index.
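The three collection types above can be sketched in a table declaration and accessed as shown below (the table and column names are illustrative, not from the original):

```sql
-- Hypothetical table mixing the three collection types
CREATE TABLE customers (
  name     STRING,
  phones   ARRAY<STRING>,                               -- list of same-typed items
  balances MAP<STRING, FLOAT>,                          -- keys share one type, values another
  address  STRUCT<street:STRING, city:STRING, zip:INT>  -- named, typed fields
);

-- Access an array by index, a map by key, and a struct by field name
SELECT phones[0],
       balances['checking'],
       address.city
FROM customers;
```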