Apache Hive

 What is Apache Hive?

    Hive is a data warehouse tool created at Facebook and now maintained by the Apache Software Foundation that facilitates reading, writing and managing large datasets using a SQL-like language (HiveQL).

    The idea is to have a compiler that translates SQL-like queries into MapReduce jobs that act upon the data stored in HDFS, much like a relational database would.

    By using Hive, we are able to impose a structure on some of our data (depending on its stored format) and query it with SQL, which is the tool most of us are used to. This structure imposition does not change the data: Hive uses metadata files to look at it differently. Those metadata files are there to "explain" our raw files to the compiler, so it can build the correct MapReduce, Tez or Spark jobs.

    In summary, Hive can be used as an OLAP (online analytical processing) tool to interact with HDFS using SQL. Hive is not a relational database, and it can't be used in transactional (OLTP) scenarios.

    It supports several storage formats, such as SequenceFile, TextFile, Avro, ORC and RCFile.

    It also supports UDFs (user-defined functions) and specialized join strategies for joining data.

    By default, the metadata is stored in an embedded relational database called Apache Derby.


 Installing Hive

    Firstly, go to https://hive.apache.org/downloads.html and download the latest version of Hive. Here I'm using version 3.1.2.

    Next, unpack it and move it to an appropriate folder on your system:

    $ tar zxvf apache-hive-3.1.2-bin.tar.gz
    $ sudo mv apache-hive-3.1.2-bin /opt/hive

    After that, edit your .bashrc or /etc/profile file to include the following:

    export HIVE_HOME="/opt/hive"
    export PATH=$PATH:$HIVE_HOME/bin
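
    After reloading the file (e.g. with source ~/.bashrc), a quick sanity check can confirm the variables took effect; the paths below are the ones used above:

```shell
# Sanity check: verify HIVE_HOME is set and its bin directory is on PATH.
# Paths match the exports above -- adjust if you unpacked Hive elsewhere.
export HIVE_HOME="/opt/hive"
export PATH=$PATH:$HIVE_HOME/bin

case ":$PATH:" in
  *":$HIVE_HOME/bin:"*) echo "OK: $HIVE_HOME/bin is on the PATH" ;;
  *)                    echo "Missing: $HIVE_HOME/bin is not on the PATH" ;;
esac
```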

    Next, edit the Hive config file to add the HADOOP_HOME variable; it should point to your Hadoop installation path:

    $ sudo nano $HIVE_HOME/bin/hive-config.sh

    export HADOOP_HOME=/opt/hadoop

    Next step: create a tmp folder inside your hdfs and add group write permission to it:

    $ hdfs dfs -mkdir /tmp
    $ hdfs dfs -chmod g+w /tmp
    
    Following that, create a warehouse dir on hdfs and copy the default config to the hive-site.xml file:

    $ hdfs dfs -mkdir -p /user/hive/warehouse
    $ hdfs dfs -chmod g+w /user/hive/warehouse
    $ cd $HIVE_HOME/conf
    $ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and add the following properties:

<property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive/java</value>
</property>
<property>
    <name>system:user.name</name>
    <value>${user.name}</value>
</property>


Now, your Hive is set up.

    Particularly for this version 3.1.2, we still need to do something else. We need to downgrade the JVM from version 11 to 8, because Hive does not support Java 11.
 
    In my tests, I was unable to keep both versions installed on the same system. If that's your case, simply remove all Java versions by issuing:

$ dpkg-query -W -f='${binary:Package}\n' | grep -E -e '^(ia32-)?(sun|oracle)-java' -e '^openjdk-' -e '^icedtea' -e '^(default|gcj)-j(re|dk)' -e '^gcj-(.*)-j(re|dk)' -e '^java-common' | xargs sudo apt-get -y remove

$ sudo apt-get -y autoremove
$ sudo rm -rf /usr/lib/jvm/*

    And then install version 8 with the following command:

$ sudo apt install openjdk-8-jdk

    If you'd like to try running it with more than one JDK installed, make sure to select version 8 for running Hive, or set it as the default JDK:

$ sudo update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode
  
> 2

    Don't forget to also update javac:

    $ sudo update-alternatives --config javac

    Also, edit hadoop-env.sh so that it points to the Java 8 installation.
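
    In practice that means setting JAVA_HOME inside hadoop-env.sh; the path below is an assumption for the Debian/Ubuntu openjdk-8-jdk package, so adjust it to your system:

```shell
# Line to set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
# The path is an assumption for Debian/Ubuntu's openjdk-8-jdk package.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```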

    Sometimes that is still not enough! (Yeah, I know...)

    So, either you give up, uninstall everything and keep only version 8, or you set version 8 as auto mode with top priority. To do that:

    sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/java-8-openjdk-amd64/bin/javac 2000

    sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/java-8-openjdk-amd64/bin/java 2000



    For the Derby database, it's better to put it in a centralized place, in order to be able to share it. To do that, we change the value of the javax.jdo.option.ConnectionURL property in hive-site.xml from

<value>jdbc:derby:;databaseName=metastore_db;create=true</value>

to

<value>jdbc:derby:/opt/hive/metastore_db;databaseName=metastore_db;create=true</value>
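
    The complete property block in hive-site.xml then reads:

```xml
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:/opt/hive/metastore_db;databaseName=metastore_db;create=true</value>
</property>
```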

    Initialize the Derby database:

    $ rm -rf $HIVE_HOME/metastore_db
    $ cd $HIVE_HOME
    $ schematool -initSchema -dbType derby

    Now, start Hive:

    $ hive

hive>  show tables;
OK
hive>


Using Hive

  •     We can create a table using SQL:
     hive> create table tab_users (name string, age int, sex string);

    But this is not very useful if we want to use that table the same way we do in an RDBMS. It would be more appropriate if we were creating a table over data already present in HDFS, as a convenient way of selecting that data.
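
    To illustrate that schema-on-read case, a table over files already sitting in HDFS can be declared as external; the path /data/users and the column layout below are hypothetical, for illustration only:

```sql
-- Hypothetical example: expose an existing CSV directory in HDFS as a table.
-- Dropping an external table removes only the metadata, not the files.
create external table ext_users (name string, age int, sex string)
row format delimited fields terminated by ','
stored as textfile
location '/data/users';
```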

    For now, let's create it as a sort of an ACID database:

    First, we need to change hive-site.xml to enable it to support concurrency and to use DbTxnManager as the transaction manager.


    <property>
        <name>hive.support.concurrency</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.txn.manager</name>
        <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    </property>

    After that, we need to add three properties that allow Hive to properly use Hadoop as the MapReduce executor:

  <property>
     <name>yarn.app.mapreduce.am.env</name>
     <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
     <name>mapreduce.map.env</name>
     <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
     <name>mapreduce.reduce.env</name>
     <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
   </property>

    Now, all we have to do is create our table in ORC format, which is compatible with ACID operations.

    create table tab_users (id int, name string, age int, sex string) clustered by(id) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true');

    All done!
  •     To show the current tables:
    hive> show tables;

  •     To describe a table's format, use:
    hive> describe tab_users;
    OK
    id                   int                                     
    name                 string                                  
    age                  int                                     
    sex                  string                                  
    Time taken: 0.081 seconds, Fetched: 4 row(s)

  •     To insert data into tab_users (already set to ACID):
    hive> insert into tab_users values (1, "User1", 39, "M");

    Notice that Hive will create a MapReduce job to insert the data into this table.

  •     To select data from tab_users:
    hive> select * from tab_users;

  •     To remove data from tab_users (already set to ACID):
    hive> delete from tab_users [WHERE];
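
    ACID tables also support updates with the same kind of condition; the values below are illustrative:

```sql
-- Illustrative: update and delete rows of an ACID table.
-- Both statements spawn MapReduce jobs, like the insert above.
update tab_users set age = 40 where id = 1;
delete from tab_users where id = 1;
```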

    All these operations spawn Hadoop MapReduce jobs, which we can monitor at http://localhost:8088/cluster.

    Hive + Spark

    You can also change the execution engine Hive uses to another one, in particular Spark, by setting the property hive.execution.engine to spark in hive-site.xml.
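
    In hive-site.xml, the relevant property is hive.execution.engine, and the change looks like:

```xml
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
```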

Spark SQL

    Spark SQL connects directly to Hive to run queries against Hive tables. Just execute spark-sql at the command prompt and you're ready to manipulate data in no time:

spark-sql> show tables;
tab_users

spark-sql> 
