
Hive Buckets


Besides partitioning, bucketing is another technique for clustering datasets into more manageable parts to optimize query performance. Unlike a partition, a bucket corresponds to segments of files in HDFS rather than to directories. For example, the employee_partitioned table from the previous section uses year and month as the top-level partitions. If there were a further requirement to use employee_id as a third level of partitioning, it would create many deep and small partitions and directories. Instead, we can bucket the employee_partitioned table on the employee_id column. The value of this column is hashed into a user-defined number of buckets, so records with the same employee_id are always stored in the same bucket (segment of files). Buckets also allow Hive to perform sampling easily and efficiently.
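Under the hood, Hive assigns each row to a bucket by hashing the bucket column and taking the result modulo the number of buckets. As a minimal sketch (using the employee_id_buckets table created below, which has 2 buckets), the bucket each row would land in can be previewed with Hive's built-in hash and pmod functions:

hive> SELECT employee_id, pmod(hash(employee_id), 2) AS bucket_no
> FROM employee_id_buckets;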

Create a bucket table

--Prepare another dataset and table for the bucket table
hive> CREATE TABLE employee_id(name string, employee_id int, work_place ARRAY<string>,
> sex_age STRUCT<sex:string,age:int>, skills_score MAP<string,int>,
> depart_title MAP<string,ARRAY<string>>)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':';

hive> LOAD DATA LOCAL INPATH
> '/home/quicktostudy/employee_id.txt' OVERWRITE INTO TABLE employee_id;

--Create the bucket table
hive> CREATE TABLE employee_id_buckets(name string, employee_id int, work_place ARRAY<string>,
> sex_age STRUCT<sex:string,age:int>, skills_score MAP<string,int>,
> depart_title MAP<string,ARRAY<string>>)
> CLUSTERED BY (employee_id) INTO 2 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':';
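To confirm that the table was defined with the expected bucketing, DESCRIBE FORMATTED reports the bucket metadata (look for the Num Buckets and Bucket Columns fields in its output):

hive> DESCRIBE FORMATTED employee_id_buckets;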

Bucket numbers

To define the proper number of buckets, we should avoid having either too much or too little data in each bucket. A good rule of thumb is roughly two HDFS blocks of data per bucket. For example, if the Hadoop block size is 256 MB, we can plan for about 512 MB of data in each bucket. If possible, use a power of two (2^N) as the number of buckets.
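As a hypothetical worked example under these assumptions: for a table expected to hold about 5 GB of data with a 256 MB block size, 5 GB / 512 MB = 10 buckets, which we would round up to the next power of two, 16 (2^4).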
Bucketing depends closely on how the underlying data is loaded. To load data into a bucket table correctly, we need to either set the maximum number of reducers to the number of buckets specified in the table creation (2 in this example) or enable enforced bucketing, as follows:

hive> set mapred.reduce.tasks = 2;
hive> set hive.enforce.bucketing = true;

To populate the bucket table, we cannot use the LOAD statement as we did for regular tables, because LOAD does not verify the data against the metadata and therefore does not hash rows into the proper buckets. Instead, INSERT should be used to populate the bucket table, as follows:

hive> INSERT OVERWRITE TABLE employee_id_buckets
> SELECT * FROM employee_id;
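
The insert should have split the data into two segment files, and we can now take advantage of the efficient sampling mentioned earlier. The following is a sketch of how to verify and sample the result; the warehouse path below assumes the default Hive warehouse location and the default database, so adjust it for your environment. Listing the table directory should show one file per bucket, and TABLESAMPLE reads a single bucket directly:

hive> dfs -ls /user/hive/warehouse/employee_id_buckets;

hive> SELECT name, employee_id FROM employee_id_buckets
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON employee_id) t;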
