Creating a Single Node Hadoop - Spark VM for Developing / Learning

    Before we go deeper into Data Science, let's pause for a moment and deal with some basic infrastructure: building a local single-node Hadoop environment for development and learning purposes.

    First and foremost, you should have a Linux system installed and running, and you should know your way around it. At a minimum, make sure you know how to use the terminal, understand the basic file structure, can pack/unpack archives (tar/bz2/gz etc.), ssh into a machine, and start a VM (in case you're not using your local machine for the installation). To make a long story short: you need a basic level of Linux knowledge.

    If you've never used Linux before, I strongly suggest that you stop whatever you're doing right now and go learn at least the basics. Dump your Windows, get a Linux distro, and start using it for your daily tasks. Yes, it's necessary. Maybe not so much right now, but it will be. Very soon.


 Spark Installation

    The Spark model follows the classic master-worker (master-slave) architecture. For a single-node setup, the same Linux system runs both the master and the worker processes. If you wish to simulate a distributed environment, you can start another Linux instance (in a VM or in a Docker container) and install a worker there, pointing it to the master.

    Both master and worker come from the same package. So, go to https://downloads.apache.org/spark/ and choose a version to download. I prefer to always get the latest one, but in case you're already working with a cluster on the job (for example), you could opt for running the same version locally, so the bugs you hit at home are the same ones you hit at work... just in case, right?

    While your computer downloads the Spark tgz, install the basic prerequisites for running Spark. On Ubuntu, that is:

    $ sudo apt install default-jdk scala git -y

    Once everything is finished, install spark by doing:

    $ tar xvf spark-3.2.0-bin-hadoop3.2.tgz
    $ sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark
    

    Edit the ~/.profile file and add the following lines (or adjust them if they already exist):

        export SPARK_HOME=/opt/spark

        export PATH=(...):$SPARK_HOME/bin:$SPARK_HOME/sbin    <--- append to your existing PATH line... OR
        export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin    <--- in case you don't have a PATH exported in .profile yet

        export PYSPARK_PYTHON=/usr/bin/python3
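
If you re-source .profile often, the PATH line can be made idempotent so the same directories aren't appended over and over. A minimal sketch (the `case` pattern is just one common shell idiom for this, not something the Spark docs require):

```shell
# Append the Spark dirs to PATH only if they are not already there,
# so re-sourcing ~/.profile doesn't grow the PATH endlessly.
SPARK_HOME=/opt/spark
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) ;;  # already present: do nothing
  *) PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" ;;
esac
export PATH
echo "$PATH"
```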


Now reload your local profile settings:

    $ source ~/.profile

Finally, start the master and the worker (which here live on the same machine):

    $ start-master.sh

    $ start-worker.sh spark://(YOUR MASTER SERVER NAME):7077

YOUR MASTER SERVER NAME => the name your Linux system reports when you run $ hostname
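
Instead of typing the hostname by hand, you can build the master URL with command substitution; a small sketch (7077 is Spark's default standalone master port):

```shell
# Build the worker's master URL from the local hostname, so the same
# command works on any machine without editing.
MASTER_URL="spark://$(hostname):7077"
echo "$MASTER_URL"
# then start the worker with: start-worker.sh "$MASTER_URL"
```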

    Check that the Spark master is up and that the worker is running and registered with it at http://127.0.0.1:8080/

    All done for the Spark installation.


 Hadoop Installation

    Now, let's install Hadoop so we have access to an HDFS filesystem. That's important for learning how to work with this distributed FS, and it will be essential later, especially when we are dealing with a big cluster.

    1. Install the pre-requisites:

    $ sudo apt-get install ssh pdsh

    If not already set, set the JAVA_HOME environment variable:

    $ export JAVA_HOME=(PATH)

    Not sure where it is? Easy, just run the command below and copy its result:

    $ jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'

    By the way, that export should be saved in your .profile too.
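
To save it in .profile without creating duplicate lines each time you rerun the setup, you can append conditionally. A sketch, assuming Ubuntu's OpenJDK 11 path (use whatever path jrunscript printed on your machine):

```shell
# Append the JAVA_HOME export to ~/.profile only if it isn't there yet.
# The path below is an example (Ubuntu's OpenJDK 11); substitute the
# result of the jrunscript command above.
PROFILE="$HOME/.profile"
LINE='export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64'
grep -qxF "$LINE" "$PROFILE" 2>/dev/null || echo "$LINE" >> "$PROFILE"
```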

    Now go to one of the Hadoop mirrors at http://www.apache.org/dyn/closer.cgi/hadoop/common/ and download a compatible version. For example, if your Spark package was built for Hadoop 3.2 (like spark-3.2.0-bin-hadoop3.2 above), get a Hadoop 3.2.x release. Do verify the required version and go for it.

    $ tar xvf ./hadoop-3.2.2.tar.gz
    $ sudo mv hadoop-3.2.2 /opt/hadoop


    Now edit the file /opt/hadoop/etc/hadoop/hadoop-env.sh and add the lines below:

    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    export PATH=$PATH:$JAVA_HOME/bin
    export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar

    export HADOOP_HOME=/opt/hadoop
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native


    Also, add /opt/hadoop/bin to the PATH in your ~/.profile file in order to have easy access to the hdfs binaries at the command prompt.

    Also add this to the .profile:

    export PDSH_RCMD_TYPE=ssh

    Next, let's set up the namenode for our HDFS filesystem.

    First, let's configure Hadoop to run in pseudo-distributed mode. Edit the following files in the /opt/hadoop/etc/hadoop folder:


core-site.xml

<configuration>  

  <property>    

    <name>fs.defaultFS</name>

    <value>hdfs://(YOUR HOSTNAME):9000</value>

  </property>

</configuration>
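
Since this file is identical on every setup except for the hostname, you can also generate it with a here-document; a sketch that writes to /tmp first, so you can inspect the result before copying it into /opt/hadoop/etc/hadoop/:

```shell
# Generate core-site.xml with the local hostname substituted in,
# then print it for inspection before moving it into place.
HOST=$(hostname)
cat > /tmp/core-site.xml <<EOF
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$HOST:9000</value>
  </property>
</configuration>
EOF
cat /tmp/core-site.xml
```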


hdfs-site.xml

<configuration>  

  <property>    

    <name>dfs.replication</name>

    <value>1</value>

  </property>

  <property>

    <name>dfs.name.dir</name>

    <value>file:///opt/hadoop/hadoopdata/hdfs/namenode</value>

  </property>

  <property>

    <name>dfs.data.dir</name>

    <value>file:///opt/hadoop/hadoopdata/hdfs/datanode</value>

  </property>

</configuration>


mapred-site.xml

<configuration>  

  <property>    

    <name>mapreduce.framework.name</name>

    <value>yarn</value>

  </property>

</configuration>

yarn-site.xml

<configuration>  

  <property>    

    <name>yarn.resourcemanager.hostname</name>

    <value>(YOUR HOSTNAME)</value>

  </property>

  <property>    

    <name>yarn.nodemanager.aux-services</name>

    <value>mapreduce_shuffle</value>

  </property>

</configuration>

      
    Now, set up a passwordless SSH key for the local user:

     $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
     $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     $ chmod 0600 ~/.ssh/authorized_keys

    Then, create folders for namenode and datanode data:

    $ mkdir -p /opt/hadoop/hadoopdata/hdfs/namenode
    $ mkdir -p /opt/hadoop/hadoopdata/hdfs/datanode

    Finally, let's format the namenode:

    $ hdfs namenode -format    

    Start it by executing /opt/hadoop/sbin/start-dfs.sh and then test it:

    $ /opt/hadoop/sbin/start-dfs.sh
    $ hdfs dfs -ls /

    $ echo "aaa" > a.txt
    $ hdfs dfs -put a.txt /a.txt
    $ hdfs dfs -ls /


    Optionally, I recommend that you create a shell script to bring up the whole "single node cluster" in one shot:

    start-hadoop.sh:

        #! /bin/sh
        echo "Starting HDFS..."
        /opt/hadoop/sbin/start-all.sh
        echo "Starting Spark..."
        start-master.sh
        start-worker.sh spark://(YOUR HOSTNAME):7077


    All done!


    


