Understanding and Setting Up Hive

Posted on February 21, 2017 by yuer

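Before following this article, complete the Hadoop environment setup described in 《理解与搭建hadoop》 ("Understanding and Setting Up Hadoop").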

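A few points to keep in mind before deploying Hive:

At runtime, Hive locates the Hadoop directory through the HADOOP_HOME environment variable and reads the HDFS and MapReduce configuration files inside it.
In this setup, Hive stores its metadata (table schemas and the like) in MySQL, so you need to install a MySQL server yourself.
Hive can run SQL directly over several data sources, chiefly HDFS files and HBase.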
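
Since Hive depends on HADOOP_HOME, it can help to check that variable before going further. The following is a minimal sketch, not part of Hive itself: the function name check_hadoop_home is made up for illustration, and the etc/hadoop layout is an assumption based on stock Hadoop 2.x tarballs.

```shell
# Sketch: verify that a directory looks like a usable Hadoop installation.
# Assumes the stock tarball layout, where config files live in etc/hadoop.
check_hadoop_home() {
  hadoop_home="$1"
  if [ -z "$hadoop_home" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  for conf in core-site.xml hdfs-site.xml; do
    if [ ! -f "$hadoop_home/etc/hadoop/$conf" ]; then
      echo "missing $hadoop_home/etc/hadoop/$conf" >&2
      return 1
    fi
  done
  echo "HADOOP_HOME looks usable: $hadoop_home"
}

# Usage: check_hadoop_home "$HADOOP_HOME"
```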

Deploying Hive

Downloading the release

Download the latest stable release from the official website and extract it to the following directory:

[root@localhost hive]# pwd
/root/hive
[root@localhost hive]#  ll
总用量 80
drwxr-xr-x. 3 root root   209 2月  21 11:24 bin
drwxr-xr-x. 2 root root  4096 2月  21 11:24 conf
drwxr-xr-x. 4 root root    34 2月  21 11:24 examples
drwxr-xr-x. 7 root root    68 2月  21 11:24 hcatalog
drwxr-xr-x. 2 root root    44 2月  21 11:24 jdbc
drwxr-xr-x. 4 root root  8192 2月  21 11:24 lib
-rw-r--r--. 1 root root 29003 11月 29 05:35 LICENSE
-rw-r--r--. 1 root root   578 11月 29 22:09 NOTICE
-rw-r--r--. 1 root root  4122 11月 29 05:35 README.txt
-rw-r--r--. 1 root root 18501 11月 30 03:45 RELEASE_NOTES.txt
drwxr-xr-x. 4 root root    35 2月  21 11:24 scripts

Creating the metastore database

Connect to MySQL and create the database that will hold Hive's metadata:

MariaDB [(none)]> create database hive_metastore;
Query OK, 1 row affected (0.00 sec)

Copying and editing the core Hive configuration

[root@localhost hive]# cp conf/hive-default.xml.template conf/hive-site.xml
[root@localhost hive]# vim conf/hive-site.xml

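Hive reads hive-site.xml as its configuration file by default; hive-default.xml.template merely provides a complete template to start from.

First, several settings need to be changed together. They specify where various kinds of temporary data are stored on the local machine, so remember to create the corresponding directories:

hive.exec.local.scratchdir: only used in Hive's local mode. All jobs run on Hadoop by default, so this one can be ignored.
hive.downloaded.resources.dir: /root/hive/tmpdir/resources
hive.querylog.location: /root/hive/tmpdir/querylog
hive.server2.logging.operation.log.location: /root/hive/log/operation_logs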
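
The directories named above can be created in one step. This is only a sketch: the helper name create_hive_local_dirs is hypothetical, and it takes the Hive directory as an argument so the same paths can be reproduced under a different base.

```shell
# Sketch: create the local directories referenced by the settings above.
# $1 is the Hive installation directory (/root/hive in this article).
create_hive_local_dirs() {
  base="$1"
  mkdir -p \
    "$base/tmpdir/resources" \
    "$base/tmpdir/querylog" \
    "$base/log/operation_logs"
}

# Usage: create_hive_local_dirs /root/hive
```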

Next, and most importantly, configure Hive's metastore so that metadata is stored in MySQL (see the Hive documentation):

MySQL address (in JDBC connection-string format)

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>

MySQL account and password

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mima123</value>
    <description>password to use against metastore database</description>
  </property>

MySQL driver

Download the latest release of the connector from the MySQL website and put the jar in the hive/lib directory:

[root@localhost hive]# pwd
/root/hive
[root@localhost hive]# ll lib/mysql-connector-java-5.1.40-bin.jar
-rw-r--r--. 1 root root 990927 2月  21 11:54 lib/mysql-connector-java-5.1.40-bin.jar

Then set the driver to the fully qualified class name of the MySQL driver (this is unrelated to the jar file name):

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
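
As an alternative to editing the full template, hive-site.xml can also contain only the properties you override, because Hive falls back to built-in defaults for everything else. The sketch below writes the four JDBC settings from this section into such a file. The function name write_minimal_hive_site is made up for illustration, and the values are the ones used in this article; a password containing XML special characters such as & or < would need escaping, which this sketch does not do.

```shell
# Sketch: write a hive-site.xml that holds only the metastore JDBC overrides.
# $1 = output file, $2 = MySQL user, $3 = MySQL password (not XML-escaped).
write_minimal_hive_site() {
  out_file="$1"
  db_user="$2"
  db_password="$3"
  cat > "$out_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>$db_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>$db_password</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>
EOF
}

# Usage: write_minimal_hive_site conf/hive-site.xml root mima123
```

The remaining settings discussed in this article would be added to the same file as further property elements.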

Initializing the metastore

Run the following command to create the metadata tables in the hive_metastore database in MySQL:

[root@localhost hive]# bin/schematool -dbType mysql -initSchema

Configuring logging

[root@localhost hive]# cp conf/hive-log4j2.properties.template conf/hive-log4j2.properties
[root@localhost hive]# vim conf/hive-log4j2.properties

The property.hive.log.dir setting is the log directory. It defaults to a location under /tmp, so change it:

property.hive.log.dir = /root/hive/log/${sys:user.name}
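
If you would rather script this edit than make it in vim, the following sketch rewrites the line with sed. The helper name set_hive_log_dir is hypothetical. Note that the new value must be passed in single quotes, otherwise the shell tries to expand ${sys:user.name} itself.

```shell
# Sketch: replace the property.hive.log.dir line in a log4j2 properties file.
# $1 = properties file, $2 = new log directory value.
set_hive_log_dir() {
  props_file="$1"
  log_dir="$2"
  tmp_file="$props_file.tmp"
  # Write to a temporary file first so a failed sed leaves the original intact.
  sed "s|^property\.hive\.log\.dir *=.*|property.hive.log.dir = $log_dir|" \
    "$props_file" > "$tmp_file" && mv "$tmp_file" "$props_file"
}

# Usage: set_hive_log_dir conf/hive-log4j2.properties '/root/hive/log/${sys:user.name}'
```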

Then create the log directory:

[root@localhost hive]# pwd
/root/hive
[root@localhost hive]# mkdir log

Creating the Hive warehouse directory in HDFS

Data imported into Hive is stored in this HDFS directory. It is configured in conf/hive-site.xml, and we keep the default value:

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

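Hive also produces temporary intermediate data while running, which by default goes under /tmp in HDFS; see conf/hive-site.xml for the details.

So all we need to do is create the two directories /user/hive/warehouse and /tmp in HDFS and make sure they are readable and writable: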

[root@localhost hive]# /root/hadoop/bin/hadoop fs -mkdir -p /user/hive/warehouse
[root@localhost hive]# /root/hadoop/bin/hadoop fs -mkdir -p /tmp
[root@localhost hive]# /root/hadoop/bin/hadoop fs -chmod g+w /user/hive/warehouse
[root@localhost hive]# /root/hadoop/bin/hadoop fs -chmod g+w /tmp
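
To confirm that the directories exist afterwards, one option is hadoop fs -test -d, which exits with status 0 when the path is a directory. The wrapper below is a sketch; the function name check_hdfs_dirs is made up, and the hadoop executable is passed in as the first argument.

```shell
# Sketch: check that each given HDFS path exists and is a directory.
# $1 = path to the hadoop executable, remaining arguments = HDFS paths.
check_hdfs_dirs() {
  hadoop_bin="$1"
  shift
  for dir in "$@"; do
    if "$hadoop_bin" fs -test -d "$dir"; then
      echo "ok: $dir"
    else
      echo "missing: $dir"
      return 1
    fi
  done
}

# Usage: check_hdfs_dirs /root/hadoop/bin/hadoop /user/hive/warehouse /tmp
```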

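Configuring the metastore server

Hive clients can talk to MySQL directly to fetch metadata (embedded mode), but a better approach is to deploy stateless metastore servers separately (they are the ones that talk to MySQL), with all clients fetching metadata from a metastore server over Thrift RPC.

To do so, hive-site.xml needs one more change so that clients know the metastore server's address: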

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>

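Note that this setting only tells clients which address to connect to. The metastore server's own listening port is chosen when the process starts, for example with the -p command-line option. We use 9083 here because the metastore server binds to that port by default, so there is no need to pass -p explicitly, as the section on starting Hive shows.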

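Configuring hiveserver2

Recent Hive clients no longer run MapReduce jobs themselves. Instead they send requests to hiveserver2 (HS2), which interprets and executes them. HS2 supports concurrent jobs, user authentication, and background execution, so a client does not have to block waiting for a result; it can poll HS2 periodically for the job's latest status, which is a very common requirement.

HS2 also speaks Thrift RPC, so a program in any language can talk to it over the network, and building a data platform around Hive is not too complicated. In addition, HS2 appears to support ZooKeeper for service discovery and high availability, but the official documentation does not explain it, so for now we stay with a single-node deployment.

Finally, besides the Thrift RPC API, HS2 ships with a Web UI for interacting with it from a browser, which is a handy tool for getting started.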

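Logging configuration

By default HS2 saves client requests to log files and provides an API for querying the operation history, so a log storage path has to be specified: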

  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/root/hive/log/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
  </property>

Thrift configuration

  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>172.18.9.75</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
  </property>

Web UI configuration

Nothing was changed here; the defaults are fine:

  <property>
    <name>hive.server2.webui.host</name>
    <value>0.0.0.0</value>
    <description>The host address the HiveServer2 WebUI will listen on</description>
  </property>
  <property>
    <name>hive.server2.webui.port</name>
    <value>10002</value>
    <description>The port the HiveServer2 WebUI will listen on. This can beset to 0 or a negative integer to disable the web UI</description>
  </property>

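Starting Hive

Hive has no command for starting its services as daemons, so simply launch each one in the background.

Environment variables

Before starting, set HIVE_HOME in ~/.bashrc so that Hive can find its configuration and files under that path: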

export JAVA_HOME=/usr
export HADOOP_HOME=/root/hadoop
export HIVE_HOME=/root/hive

metastore

nohup bin/hive --service metastore >/dev/null 2>&1 &

hiveserver2

nohup bin/hive --service hiveserver2 > /dev/null 2>&1 &
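
Because both services are started in the background with their output discarded, a script cannot tell from the commands alone when they are ready to accept connections. One way to handle that is to wait for the listening port. The sketch below is an assumption-laden helper, not something Hive provides: the name wait_for_port is made up, and it relies on bash's /dev/tcp redirection, so it needs bash rather than a plain POSIX sh.

```shell
# Sketch (bash only): wait until a TCP port accepts connections.
# $1 = host, $2 = port, $3 = timeout in seconds.
wait_for_port() {
  host="$1"
  port="$2"
  timeout_secs="$3"
  elapsed=0
  while [ "$elapsed" -lt "$timeout_secs" ]; do
    # The subshell opens and immediately closes a connection; it fails if
    # nothing is listening yet.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Usage: wait_for_port localhost 9083 60    # metastore
#        wait_for_port localhost 10000 60   # hiveserver2 Thrift port
```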

Beeline

bin/beeline -u jdbc:hive2://localhost:10000 -n root

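Beeline is the new client that replaces the old hive client. It sends Hive operations to HS2. Before running the command, though, read on:

This involves Hadoop's proxy user mechanism. HS2 runs as the OS user that started it (root here) and acts as a proxy service in front of Hadoop: by default it accesses HDFS on behalf of the user who connected, which is anonymous when the client supplies no user name. We want to operate as root instead, because the Hive data directories created earlier belong to root and anonymous cannot access them, hence the -n root option above. Hadoop only accepts such proxied requests if the HS2 process user is configured as a proxyuser, so root has to be registered as one.

To enable this, edit core-site.xml in the Hadoop installation (remember to restart the namenode and datanode):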

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://172.18.9.75:11000</value>
	</property>
	<property>
		<name>hadoop.proxyuser.root.groups</name>
		<value>*</value>
	</property>
	<property>
		<name>hadoop.proxyuser.root.hosts</name>
		<value>*</value>
	</property>
</configuration>

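hadoop.proxyuser.{$user}.groups: set to *, so that user may act on behalf of accounts in any group.
hadoop.proxyuser.{$user}.hosts: set to *, so that user may issue proxied requests from any host.

Taken together, these two settings let HS2, logged in to HDFS as root from any machine, carry out operations on behalf of whichever user connected through Beeline. In other words, HS2 tells Hadoop: I am running as root, but I am acting for this other user.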
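
The two property names follow a fixed pattern in which only the user name varies. If HS2 ran under a different account, the same pair would be needed for that account. The sketch below prints the pair for a given user; the function name proxyuser_properties is made up for illustration, and the wildcard values mirror the permissive setup in this article rather than a recommended production setting.

```shell
# Sketch: print the two core-site.xml proxyuser properties for one user.
# $1 = the account that HS2 runs as.
proxyuser_properties() {
  user="$1"
  for key in groups hosts; do
    printf '\t<property>\n\t\t<name>hadoop.proxyuser.%s.%s</name>\n\t\t<value>*</value>\n\t</property>\n' \
      "$user" "$key"
  done
}

# Usage: proxyuser_properties root
```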

Run a command to confirm that Hive is working:

0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+
1 row selected (0.259 seconds)
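
The same check can be run without an interactive session by passing the statement to Beeline with -e. The following is a rough sketch of such a smoke test: the function name hive_smoke_test is hypothetical, and it simply looks for the word default in whatever Beeline prints.

```shell
# Sketch: run "show databases;" through Beeline and check for the default database.
# $1 = path to the beeline executable, $2 = JDBC URL, $3 = user name.
hive_smoke_test() {
  beeline_bin="$1"
  jdbc_url="$2"
  user="$3"
  "$beeline_bin" -u "$jdbc_url" -n "$user" -e 'show databases;' | grep -qw 'default'
}

# Usage: hive_smoke_test bin/beeline jdbc:hive2://localhost:10000 root && echo "hive is up"
```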

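Accessing the HS2 Web UI

Open http://172.18.9.75:10002 in a browser to see what Hive is executing.

This article only covers a minimal configuration of Hive's setup and components. Later posts will look further into how to use Hive's various features.

Have fun!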
