Monday, May 25, 2015

How to install HiveServer2 and use Beeline on Pivotal HD 1.x / 2.x

This article lists the steps required to configure HiveServer2 and provides basic information to get your instance up and running. It assumes you already have a Pivotal HD cluster up and running with HiveServer.

Before we start configuring HiveServer2, let's get a basic understanding of what HiveServer2 brings in.
HiveServer2 is a server interface that enables remote clients to execute queries against Hive and retrieve the results. HiveServer2 accesses Hive data without alteration as long as you are not changing Hive releases; you do not need to update or otherwise transform data in order to begin using it. Simply enable support and start using HiveServer2 instead of HiveServer.
Compared to HiveServer, HiveServer2 brings in the following:
  1. Support for multi-client concurrency
  2. Ability to authenticate users to prevent untrusted user access
  3. Enforcement of authorization around a user's permissions on data
  4. Better support for open API clients such as JDBC and ODBC
Now, let's start configuring HiveServer2!

Step 1: Choose a node on which you will install HiveServer2
You can install it on any node of the Pivotal HD cluster, but prefer a master service node. If HiveServer is already installed, you may choose the same node to run HiveServer2, since HiveServer will be deprecated in due course and you will eventually not be using it at all. Be sure to stop HiveServer before you start HiveServer2.
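Before starting HiveServer2 on the chosen node, it can help to confirm that nothing (such as a still-running HiveServer) is bound to the Thrift port. A minimal sketch in Python; the host and port here are assumptions, so adjust them to your cluster:

```python
import socket

def port_in_use(host, port, timeout=2.0):
    """Return True if something is already listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the default HiveServer/HiveServer2 Thrift port before starting.
if port_in_use("localhost", 10000):
    print("Port 10000 is busy -- stop HiveServer first")
else:
    print("Port 10000 is free")
```

If the port is busy, stop HiveServer (service hive-server stop) before continuing.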

Step 2: Install HiveServer2 package
Since the existing PHD cluster was set up using the Pivotal HD Install & Configuration Manager utility (ICM), /etc/yum.repos.d/gphd.repo will already be available on the cluster node, so you can use yum to install HiveServer2:
[gpadmin@hdw3 ~]$  sudo yum install hive-server2
 Loaded plugins: fastestmirror
 Loading mirror speeds from cached hostfile
  Dependencies Resolved
 ..
 ..
 Installing:
  hive-server2.noarch   0.11.0_gphd_2_1_1_0-83  gphd-admin-localrepo    5.1 k
 ..
 ..
 Installed:
   hive-server2.noarch 0:0.11.0_gphd_2_1_1_0-83
 
 Complete!
 
 Note: If yum is not working, install the hive-server2 package using the rpm command instead.
Ex: rpm -ivh <path to PHD binary>/hive/rpm/hive-server2-x.x.x_gphd_x_x_x_x-x.noarch.rpm
 
Step 3: Modify /etc/gphd/hive/conf/hive-site.xml 
  • By default, both HiveServer and HiveServer2 try to bind to port 10000, so make sure you have stopped HiveServer before starting HiveServer2. There is no need to change the port, but you can change the port number using the hive.server2.thrift.port parameter.
 <property>
   <name>hive.server2.thrift.port</name>
   <value>10001</value>
   <description>TCP port number to listen on, default 10000</description>
 </property>
  • Hive includes a locking feature that uses Apache ZooKeeper for locking. ZooKeeper implements highly reliable distributed coordination. Other than the configuration steps, ZooKeeper is invisible to Hive users. You must set the parameters below to avoid data corruption.
 <property>
   <name>hive.support.concurrency</name>
   <description>Whether Hive supports concurrency or not. A Zookeeper instance must be up and running for the default Hive lock manager to support read-write locks.
 </description>
   <value>true</value>
 </property>
 
 <property>
   <name>hive.zookeeper.quorum</name>
   <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
   <value>hdw1.hadoop.local,hdw2.hadoop.local,hdw3.hadoop.local</value>
 </property>
With these settings configured, Hive automatically starts acquiring locks for certain queries. You can see all current locks using the SHOW LOCKS command.
Hive supports two types of locks, and they are enabled automatically when concurrency is enabled:
- Shared lock: acquired when a table is read. Multiple concurrent shared locks are allowed.
- Exclusive lock: required for all operations that modify the table in some way. Exclusive locks not only block other table-mutating operations, they also prevent queries by other processes.
When the table is partitioned, acquiring an exclusive lock on a partition causes a shared lock to be acquired on the table itself, to prevent incompatible concurrent changes such as attempting to drop the table while a partition is being modified.
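The lock compatibility described above boils down to a small matrix: shared locks coexist with each other, while an exclusive lock conflicts with everything. A toy sketch of that rule (this is not Hive's actual lock manager, which is implemented on top of ZooKeeper):

```python
SHARED, EXCLUSIVE = "SHARED", "EXCLUSIVE"

def compatible(held, requested):
    """Two locks can coexist only if both are shared (i.e. both are readers)."""
    return held == SHARED and requested == SHARED

# Reads can run concurrently; any table-mutating operation blocks everything else.
print(compatible(SHARED, SHARED))       # True  -- two readers
print(compatible(SHARED, EXCLUSIVE))    # False -- writer waits for reader
print(compatible(EXCLUSIVE, SHARED))    # False -- reader waits for writer
print(compatible(EXCLUSIVE, EXCLUSIVE)) # False -- one writer at a time
```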
  • To prevent HiveServer2 from opening too many connections to the NameNode, set ipc.client.connection.maxidletime to its default value of 10 seconds. PHD sets this parameter to 1 hour in core-site.xml by default, which can cause out-of-memory errors in HiveServer2.
<property>
  <name>ipc.client.connection.maxidletime</name>
  <value>10000</value>
</property>
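After editing, a quick sanity check that the required properties actually ended up in hive-site.xml can save a restart cycle. A sketch using only the Python standard library; the file path is an assumption based on the location used in Step 3:

```python
import xml.etree.ElementTree as ET

def read_site_properties(path):
    """Parse a Hadoop-style *-site.xml file into a {name: value} dict."""
    props = {}
    for prop in ET.parse(path).getroot().findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name] = prop.findtext("value")
    return props

# Example: verify the settings from Step 3 are present.
# props = read_site_properties("/etc/gphd/hive/conf/hive-site.xml")
# assert props.get("hive.support.concurrency") == "true"
# assert "hive.zookeeper.quorum" in props
```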

Step 4: Setup is ready, let's start HiveServer2
[gpadmin@hdw3 conf]$ sudo service hive-server2 start
starting hive-server2, logging to /var/log/gphd/hive/hive-server2.log
                                                           [  OK  ]
Step 5: Now let's use the Beeline utility to connect to HiveServer2
Beeline is a JDBC client based on the SQLLine CLI, developed specifically to interact with HiveServer2. Under the hood, the JDBC driver communicates with HiveServer2 using HiveServer2's Thrift APIs. Get accustomed to Beeline, as Hive development has shifted from HiveServer to HiveServer2.
Note: In this example, hdw3 is the server where hive-server2 is installed and 10001 is the port defined for hive-server2; change these as appropriate for your installation.
[gpadmin@hdm1 init.d]$ beeline
 Beeline version 0.11.0-gphd-2.1.1.0 by Apache Hive
 beeline> !connect jdbc:hive2://hdw3:10001 username password org.apache.hive.jdbc.HiveDriver
 Connecting to jdbc:hive2://hdw3:10001
 Connected to: Hive (version 0.11.0-gphd-2.1.1.0)
 Driver: Hive (version 0.11.0-gphd-2.1.1.0)
 Transaction isolation: TRANSACTION_REPEATABLE_READ
 0: jdbc:hive2://hdw3:10001> show tables
 . . . . . . . . . . . . . > ;
 +------------+
 |  tab_name  |
 +------------+
 | passwords  |
 +------------+
 1 row selected (1.487 seconds)
 0: jdbc:hive2://hdw3:10001>

On a secured cluster, use the following:
[gpadmin@hdm1 init.d]$ beeline
2: jdbc:hive2://hdm1:10001> !connect jdbc:hive2://hdm1:10001/default;principal=hive/hdm1.gphd.local@GPHD.LOCAL
Connecting to jdbc:hive2://hdm1:10001/default;principal=hive/hdm1.gphd.local@GPHD.LOCAL
Enter username for jdbc:hive2://hdm1:10001/default;principal=hive/hdm1.gphd.local@GPHD.LOCAL: gpadmin
Enter password for jdbc:hive2://hdm1:10001/default;principal=hive/hdm1.gphd.local@GPHD.LOCAL: *******
Connected to: Hive (version 0.12.0-gphd-3.0.1.0)
Driver: Hive (version 0.12.0-gphd-3.0.1.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
3: jdbc:hive2://hdm1:10001/default> show tables;
+------------+
|  tab_name  |
+------------+
| book       |
| book1      |
| book10     |

Note: If you don't have privileges, or have just secured the cluster, you will not be able to perform any read/write operations and may receive an error like the one below. Grant the appropriate privileges to resolve this.
3: jdbc:hive2://hdm1:10001/default> select * from book;
 Error: Error while processing statement: Authorization failed:No privilege 'Select' found for inputs { database:default, table:book, columnName:word}. Use show grant to get more details. (state=,code=403)
0: jdbc:hive2://hdm1.gphd.local:10001/default> grant select on table book to user gpadmin ; 
0: jdbc:hive2://hdm1.gphd.local:10001> select * from book;
 +-------+
 | word  |
 +-------+
 +-------+ 

Miscellaneous:
  • beeline -e "show tables" : List out the tables
  • beeline -f file_containing_sql : Execute the SQL contained in a file
  • beeline -help | -h : Print out options for beeline utility
  • !quit : Quit beeline terminal
  • set env:TERM; env:TERM=xterm; : Set the value of a variable in beeline client
  • beeline -u jdbc:hive2://localhost:10001 -n gpadmin -p changeme : Syntax to connect to HiveServer2 using beeline
  • beeline wiki: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
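When scripting Beeline calls, it is easy to mis-assemble the JDBC URL, especially on a secured cluster where the Kerberos principal is appended after a semicolon. A small helper sketch; the hostnames and principal below are placeholders taken from the examples earlier in this article:

```python
def hive2_url(host, port=10000, database="default", principal=None):
    """Build a HiveServer2 JDBC URL like the ones used with !connect above."""
    url = "jdbc:hive2://%s:%d/%s" % (host, port, database)
    if principal:
        # Kerberos-secured cluster: append the HiveServer2 service principal.
        url += ";principal=%s" % principal
    return url

print(hive2_url("hdw3", 10001))
# jdbc:hive2://hdw3:10001/default
print(hive2_url("hdm1", 10001, principal="hive/hdm1.gphd.local@GPHD.LOCAL"))
# jdbc:hive2://hdm1:10001/default;principal=hive/hdm1.gphd.local@GPHD.LOCAL
```

The resulting string is what you would pass to beeline -u (or to !connect inside the Beeline shell).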
