Posts

Feeding the Data-lake I: Task management / automation using Airflow

What is Airflow?

Airflow is a platform to programmatically author, schedule and monitor workflows. The idea is to program your workflow automation using Python.

We use the DAG concept to build tasks on Airflow. DAG stands for Directed Acyclic Graph. These tasks are then interpreted and scheduled for execution by the Airflow execution engine.

What is a DAG?

In graph theory, a DAG is a graph whose edges have a direction and which contains no cycles.

Airflow install in Docker

Let's build a Docker cluster to run Airflow locally by following the official documentation. Please refer to that documentation if you bump into any issues or to get a more thorough view of this solution.

Step 1: Install docker-compose

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose

Step 2: Get the docker-compose YAML example file

This YAML file will have...
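The scheduling idea behind the DAG concept can be sketched in a few lines of plain Python (this is not Airflow code, and the task names are hypothetical): a valid run order is any topological ordering of the dependency graph, and a cycle would make scheduling impossible.

```python
# A tiny illustration of the DAG idea behind Airflow scheduling:
# each task lists its upstream dependencies, and a valid run order
# is any topological ordering of the graph. Task names are made up.

def topo_order(deps):
    """Return a run order where every task comes after its dependencies."""
    order, done = [], set()

    def visit(task, seen=()):
        if task in done:
            return
        if task in seen:  # a cycle would make scheduling impossible
            raise ValueError("not acyclic: " + task)
        for upstream in deps.get(task, []):
            visit(upstream, seen + (task,))
        done.add(task)
        order.append(task)

    for task in deps:
        visit(task)
    return order

# extract -> transform -> load, plus an independent cleanup task
dag = {"extract": [], "transform": ["extract"],
       "load": ["transform"], "cleanup": []}
print(topo_order(dag))  # ['extract', 'transform', 'load', 'cleanup']
```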

Dataframe manipulation

A DataFrame can be manipulated either by using SQL syntax as the specification (Spark SQL) or by applying manipulation functions to the DataFrame object itself.

Data Manipulation using Functions

This method is more suitable for those who are more used to coding than to writing queries. The way you do data manipulation is simple: given a DataFrame, apply transformation functions to it to get what you need.

As an example, let's suppose you have structured data that corresponds to a table of divisions of an imaginary IT company:

divisions = [
    Row(id='1', name='Data Science'),
    Row(id='2', name='Big Data'),
    Row(id='3', name='Artificial Intelligence'),
    Row(id='4', name='Devops'),
]

Let's also have a list of employees for that company:

Employee = Row("name", "email")
employees = [Employee("John Robbins", '...
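PySpark's Row behaves much like a named tuple, so the transformation-function style can be sketched in plain Python with no Spark session at all. A minimal sketch, assuming the hypothetical divisions table from the excerpt; the list comprehension stands in for DataFrame operations such as `filter` and `select`.

```python
from collections import namedtuple

# Plain-Python stand-in for pyspark.sql.Row
Row = namedtuple("Row", ["id", "name"])

divisions = [
    Row(id="1", name="Data Science"),
    Row(id="2", name="Big Data"),
    Row(id="3", name="Artificial Intelligence"),
    Row(id="4", name="Devops"),
]

# "Apply transformation functions to get what you need":
# keep only the data-related divisions, then project the names.
data_divisions = [r.name for r in divisions if "Data" in r.name]
print(data_divisions)  # ['Data Science', 'Big Data']
```

In real PySpark the same shape appears as `df.filter(...).select(...)`, returning a new DataFrame instead of a list.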

Hive II: Using it with MySQL as a metadata db

In the previous post, we set up Hive with metadata stored in Derby. Sometimes this is not desirable, since Derby is not as popular as MySQL, Postgres or other better-known DBMSs.

In this post, we will make Hive store its metadata in a MySQL database. Remember that the data itself is still stored in HDFS; what we're changing here is only the location of the metadata.

We won't deal with data migration, for the sake of simplicity, so please know that any metadata you have will be lost. We're assuming that you are doing this at the beginning of your system setup. Be aware of that!

MySQL Install

This was already covered in a previous topic, where we installed MySQL to be accessed by the Spark engine. Please refer to it if you have any other questions, but in summary, what we'll do is:

$ sudo apt install mysql...
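The switch from Derby to MySQL boils down to pointing the metastore at a JDBC URL in hive-site.xml. A sketch of the relevant properties, assuming a local MySQL with a hypothetical `metastore` database and `hive` user (the property names are the standard Hive metastore ones; the driver class depends on your Connector/J version):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```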

Apache Hive

What is Apache Hive?

Hive is a data warehouse tool, created by Facebook and now maintained by the Apache Foundation, that aims to facilitate reading, writing and managing datasets using a SQL-like language.

The idea here is to have a SQL-like compiler that produces map-reduce jobs that act upon the data stored in HDFS, much like a relational database would.

By using Hive, we are able to impose a structure on some of our data (depending on its stored format) and query it using SQL queries, which are the tool that most of us are used to. This structure imposition does not change the data; it uses metadata files to look at it differently. Those metadata files are there to "explain" our raw files to the compiler, so it can build the correct MapReduce, Tez or Spark jobs.

In summary, Hive can be used as an OLAP (online analytical processing) tool to interact with the HDFS system by using SQL. Hive is not a relation...
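The "structure imposition" point can be made concrete with a table definition. A hedged HiveQL sketch, assuming a hypothetical CSV file of employees already sitting in HDFS (the path and column names are made up): declaring the table records only metadata, and the raw file is left untouched.

```sql
-- Impose a schema on raw CSV data already in HDFS; the data
-- itself is not moved or rewritten, only described.
CREATE EXTERNAL TABLE employees (
  name   STRING,
  email  STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employees';  -- hypothetical HDFS directory

-- From here on, plain SQL works against the raw files:
SELECT name FROM employees WHERE salary > 1000;
```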

Data Acquisition: Connection to Relational Databases

In this post, we'll cover the connection between Spark and relational databases. The way Spark connects to relational databases is by using a JDBC driver. Let's explore this using MySQL and Oracle as examples.

Creating a MySQL database and connecting to it

For this example, let's explore a locally installed MySQL instance. To install a MySQL instance on Ubuntu Linux, you can issue the following commands:

$ sudo apt update
$ sudo apt install mysql-server

After that, configure your database with the following command:

$ sudo mysql_secure_installation

Next, change the way the root user logs in, to enable it to log in from the prompt. To do that, first you need to log in as the system root; this way, MySQL won't ask for a password:

$ sudo mysql -u root
mysql> USE mysql;
mysql> UPDATE user SET plugin='mysql_...
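Once MySQL is up, the Spark side is a JDBC read. A minimal sketch, with hypothetical host, database, table and credentials; the actual `spark.read` call (shown in comments) needs a running SparkSession and the MySQL Connector/J jar on Spark's classpath, so only the URL construction runs here.

```python
# Sketch of reading a MySQL table through Spark's JDBC data source.
# Host, database, table and credentials below are hypothetical.

def mysql_jdbc_url(host, port, database):
    """Build the JDBC URL Spark expects for a MySQL source."""
    return f"jdbc:mysql://{host}:{port}/{database}"

url = mysql_jdbc_url("localhost", 3306, "company")

# With a SparkSession named `spark`, the read itself would be:
# df = (spark.read.format("jdbc")
#       .option("url", url)
#       .option("dbtable", "employees")   # hypothetical table
#       .option("user", "spark_user")     # hypothetical credentials
#       .option("password", "secret")
#       .load())
print(url)  # jdbc:mysql://localhost:3306/company
```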

Job Programming: User-defined Functions

User-defined functions (UDFs) are a way to run some code over every row (or some rows) of a DataFrame.

Let's imagine that you have a DataFrame called df for which you have created a view, and suppose that you want to run a function you've created when you issue a query. The way you do that is by creating a UDF and registering it with Spark. Once you've done that, you can call it from a query the same way you would a built-in function.

For example, let's take the salary data set from a previous post:

from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
import pyspark.sql.functions as sf

# Creating a spark session
spark = (SparkSession.builder.appName("spark_learn_sc").getOrCreate())

# Reading our input data as a DataFrame
df: DataFrame = spark.read.csv("../examples/salary.dat", inferSchema=True,
    header...
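The register-then-call-from-SQL pattern can be seen without a Spark cluster: Python's built-in sqlite3 module supports the same idea through `Connection.create_function`. A stand-in sketch (the bonus function, names and salary figures are made up; in Spark the registration step would use `spark.udf.register` instead):

```python
import sqlite3

# Register a Python function, then call it from SQL -- illustrated
# with sqlite3 from the standard library. The pattern mirrors what
# Spark does when you register a UDF and use it in a query.

def add_bonus(salary):
    """Hypothetical UDF: add a flat 100.0 bonus."""
    return salary + 100.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, salary REAL)")
conn.executemany("INSERT INTO salaries VALUES (?, ?)",
                 [("John Robbins", 1000.0), ("Ana Silva", 2000.0)])

# Register the Python function so SQL can see it.
conn.create_function("add_bonus", 1, add_bonus)

rows = conn.execute(
    "SELECT name, add_bonus(salary) FROM salaries").fetchall()
print(rows)  # [('John Robbins', 1100.0), ('Ana Silva', 2100.0)]
```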

Creating a Single Node Hadoop - Spark VM for Developing / Learning

Before we go deeper into Data Science, let's pause for a moment and deal with some basic infrastructure, just to allow us to build a basic local Hadoop environment for development or learning purposes.

First and foremost, you should have Linux installed, up and running. Also, you should know your way around Linux. Make sure, at least, that you know how to use the terminal, know the basic file structure, how to pack/unpack things in tar/bz2/gz etc., how to ssh to some machine, how to start a VM (in case you're not using your local machine to install things), and so on. To make a long story short: you need a basic level of Linux knowledge.

If you've never used Linux before, I strongly suggest that you stop whatever you're doing right now and go learn it, at least the basics. Dump your Windows, get a Linux distro, and start using it for your daily-life stuff. Yes, it's necessary. Maybe not so much right now, but it will be. ...