Creating a Single Node Hadoop - Spark VM for Development / Learning
Before we go deeper into Data Science, let's take a short break and deal with some basic infrastructure, so that we can build a basic local Hadoop environment for development or learning purposes.
First and foremost, you should have a Linux system installed, up and running. You should also know your way around Linux. Make sure, at least, that you know how to use the terminal, understand the basic file structure, can pack/unpack tar/bz2/gz archives, ssh into a machine, and start a VM (in case you're not using your local machine to install things). To make a long story short: you need a basic level of Linux knowledge.
If you've never used Linux before, I strongly suggest that you stop whatever you're doing right now and go learn it, at least the basics. Dump your Windows, get a Linux distro, and start using it for your daily-life stuff. Yes, it's necessary. Maybe not so much right now, but it will be. Very soon.
Spark Installation
The Spark model follows the classic master-slave architecture. For a single node setup, the same Linux machine will run both the master and the slave (worker) processes. If you wish to simulate a distributed environment, you can start another Linux instance (in a VM or in a Docker container) and install a slave there, pointing to the master.
Both master and slave come from the same package. So, go to https://downloads.apache.org/spark/ and choose a version to download. I prefer to always get the latest one, but in case you're already working with a cluster at your job (for example), you could opt for the same version locally, so you don't hit one set of bugs at work and a different set at home... just in case, right?
While your computer is downloading the Spark tgz, install the basic prerequisites for running Spark. On Ubuntu, that's:
$ sudo apt install default-jdk scala git -y
Once everything has finished, install Spark by doing:
$ tar xvf spark-3.2.0-bin-hadoop3.2.tgz
$ sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark
Edit the ~/.profile file and add these lines (or edit them if they already exist):
export SPARK_HOME=/opt/spark
# If your .profile already exports a PATH, just append Spark's bin and sbin to it:
export PATH=(...):$SPARK_HOME/bin:$SPARK_HOME/sbin
# Otherwise, in case you don't have a PATH exported in .profile yet, add:
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
Now reload your profile settings:
$ source ~/.profile
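Before firing anything up, it's worth confirming that the new PATH entries took effect. This quick check is my own addition, not a required step:

$ which spark-submit
$ spark-submit --version

The first should print /opt/spark/bin/spark-submit, and the second should print the Spark version you just downloaded.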
Finally, start the master and the slave nodes (which here live on the same machine):
$ start-master.sh
$ start-worker.sh spark://(YOUR MASTER SERVER NAME):7077
YOUR MASTER SERVER NAME => the name your Linux system reports when you run $ hostname
Check that the Spark master is up and that the slave is running and registered with it at http://127.0.0.1:8080/
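If you want to go one step further and confirm the worker actually accepts jobs, you can submit one of the example programs that ship with the Spark binary package. This is an extra check of my own; the path below assumes the standard layout of the downloaded distribution:

$ spark-submit --master spark://(YOUR MASTER SERVER NAME):7077 /opt/spark/examples/src/main/python/pi.py 10

Somewhere near the end of the output you should see a line like "Pi is roughly 3.14...", and the finished application will show up in the web UI at port 8080.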
All done for the Spark installation.
Hadoop Installation
Now, let's install Hadoop, so we have access to an HDFS filesystem. That's important because it lets us learn how to work with this distributed FS, and it will be essential, especially later, when we are dealing with a big cluster.
Install the prerequisites:
$ sudo apt-get install ssh pdsh
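A heads-up from my side (based on the standard single-node setup notes, not part of this recipe): the Hadoop start-*.sh scripts ssh into your own machine, so you need passwordless ssh to localhost, and on Ubuntu pdsh sometimes tries rsh instead of ssh. If start-dfs.sh complains about connections later on, something along these lines usually fixes it:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ export PDSH_RCMD_TYPE=ssh

That last export is also a good candidate for your .profile.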
If not already set, set the JAVA_HOME environment variable:
$ export JAVA_HOME=(PATH)
Not sure where it is? Easy, just run the command below and copy its result.
$ jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'
By the way, that export should be saved in your .profile too.
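For example, assuming the default Ubuntu OpenJDK 11 location (the same path used in the hadoop-env.sh snippet further down), you could persist it like this:

$ echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.profile
$ source ~/.profile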
Now go to one of the Hadoop mirrors at http://www.apache.org/dyn/closer.cgi/hadoop/common/ and download a compatible version (for example, if your Spark package says it was built for Hadoop 3.2, you should get a Hadoop 3.2.x release. Do verify what the required version is and go for it.)
$ tar xvf ./hadoop-3.2.2.tar.gz
$ sudo mv hadoop-3.2.2 /opt/hadoop
(The rest of this guide assumes Hadoop lives in /opt/hadoop, just like Spark lives in /opt/spark.)
Now go to /opt/hadoop and edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh, adding the lines below:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
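One small gap worth flagging: nothing so far puts the Hadoop binaries themselves on your shell PATH, and the hdfs commands used further down assume they can be found. My suggestion (an addition of mine, not part of the original recipe) is to extend ~/.profile as well:

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin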
Next, still in /opt/hadoop/etc/hadoop/, edit the following configuration files.
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://(YOUR HOSTNAME):9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///opt/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///opt/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
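Since hdfs-site.xml points the namenode and datanode at directories under /opt/hadoop/hadoopdata, you may want to create them up front. Hadoop can usually create them on its own, so treat this as an optional precaution of mine:

$ mkdir -p /opt/hadoop/hadoopdata/hdfs/namenode
$ mkdir -p /opt/hadoop/hadoopdata/hdfs/datanode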
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>(YOUR HOSTNAME)</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Now format the HDFS namenode:
$ hdfs namenode -format
Start it by executing /opt/hadoop/sbin/start-dfs.sh and then test it:
$ /opt/hadoop/sbin/start-dfs.sh
$ hdfs dfs -ls /
$ echo "aaa" > a.txt
$ hdfs dfs -put a.txt /a.txt
$ hdfs dfs -ls /
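If /a.txt shows up in that last listing, HDFS is working. A couple of extra checks I like to run (my own additions, using standard commands):

$ hdfs dfs -cat /a.txt
$ jps

The first should print "aaa" back, and jps should list at least a NameNode, a DataNode and a SecondaryNameNode process.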
Optionally, I recommend that you create a shell script to initialize the "single node cluster":
#! /bin/sh
echo "Starting HDFS..."
/opt/hadoop/sbin/start-all.sh
echo "Starting Spark..."
start-master.sh
start-worker.sh spark://(YOUR HOSTNAME):7077
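And, if you like symmetry, a companion script to shut everything down again. This is my suggestion, using the matching stop scripts that ship with Spark and Hadoop:

#! /bin/sh
echo "Stopping Spark..."
stop-worker.sh
stop-master.sh
echo "Stopping Hadoop..."
/opt/hadoop/sbin/stop-all.sh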