Posts

Feeding the Data-lake I: Task management / automation using Airflow

What is Airflow?

Airflow is a platform to programmatically author, schedule and monitor workflows. The idea is to program your workflow automation using Python.

We use the DAG concept to build tasks on Airflow. DAG stands for Directed Acyclic Graph. These tasks are then interpreted and scheduled for execution by the Airflow execution engine.

What is a DAG?

In graph theory, a DAG is a graph whose edges have a direction and which contains no cycles.

Airflow install in Docker

Let's build a Docker cluster to run Airflow locally by following the official documentation. Please refer to that documentation if you bump into any issues or to get a more thorough view of this solution.

Step 1: Install docker-compose

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose

Step 2: Get the docker-compose YAML example file

This YAML file will have...
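The scheduling idea behind the DAG concept can be sketched in a few lines of plain Python (this is not Airflow code, and the task names are hypothetical): a valid run order is any topological ordering of the dependency graph, and a cycle would make scheduling impossible.

```python
# A tiny illustration of the DAG idea behind Airflow scheduling:
# each task lists its upstream dependencies, and a valid run order
# is any topological ordering of the graph. Task names are made up.

def topo_order(deps):
    """Return a run order where every task comes after its dependencies."""
    order, done = [], set()

    def visit(task, seen=()):
        if task in done:
            return
        if task in seen:  # a cycle would make scheduling impossible
            raise ValueError("not acyclic: " + task)
        for upstream in deps.get(task, []):
            visit(upstream, seen + (task,))
        done.add(task)
        order.append(task)

    for task in deps:
        visit(task)
    return order

# extract -> transform -> load, plus an independent cleanup task
dag = {"extract": [], "transform": ["extract"],
       "load": ["transform"], "cleanup": []}
print(topo_order(dag))  # ['extract', 'transform', 'load', 'cleanup']
```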

Dataframe manipulation

A DataFrame can be manipulated either by using SQL syntax as the specification (Spark SQL) or by applying manipulation functions to the DataFrame object itself.

Data Manipulation using Functions

This method is more suitable for those who are more used to coding than to writing queries. The way you do data manipulation is simple: given a DataFrame, apply transformation functions to it to get what you need.

As an example, let's suppose you have structured data that corresponds to a table of divisions of an imaginary IT company:

divisions = [
    Row(id='1', name='Data Science'),
    Row(id='2', name='Big Data'),
    Row(id='3', name='Artificial Intelligence'),
    Row(id='4', name='Devops'),
]

Let's also have a list of employees for that company:

Employee = Row("name", "email")
employees = [Employee("John Robbins", '...
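PySpark's Row behaves much like a named tuple, so the transformation-function style can be sketched in plain Python with no Spark session at all. A minimal sketch, assuming the hypothetical divisions table from the excerpt; the list comprehension stands in for DataFrame operations such as `filter` and `select`.

```python
from collections import namedtuple

# Plain-Python stand-in for pyspark.sql.Row
Row = namedtuple("Row", ["id", "name"])

divisions = [
    Row(id="1", name="Data Science"),
    Row(id="2", name="Big Data"),
    Row(id="3", name="Artificial Intelligence"),
    Row(id="4", name="Devops"),
]

# "Apply transformation functions to get what you need":
# keep only the data-related divisions, then project the names.
data_divisions = [r.name for r in divisions if "Data" in r.name]
print(data_divisions)  # ['Data Science', 'Big Data']
```

In real PySpark the same shape appears as `df.filter(...).select(...)`, returning a new DataFrame instead of a list.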

Hive II: Using it with MySQL as a metadata db

In the previous post, we set up Hive with metadata stored in Derby. Sometimes this is not desirable, since Derby is not as popular as MySQL, Postgres or other better-known DBMSs.

In this post, we will make Hive store its metadata in a MySQL database. Remember that the data itself is still stored in HDFS; what we're changing here is only the location of the metadata.

We won't deal with data migration, for the sake of simplicity, so please know that any metadata you have will be lost. We're assuming that you are doing this at the beginning of your system setup. Be aware of that!

MySQL Install

This was already covered in a previous topic, where we installed MySQL to be accessed by the Spark engine. Please refer to it if you have any other questions, but in summary, what we'll do is:

$ sudo apt install mysql...
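The switch from Derby to MySQL boils down to pointing the metastore at a JDBC URL in hive-site.xml. A sketch of the relevant properties, assuming a local MySQL with a hypothetical `metastore` database and `hive` user (the property names are the standard Hive metastore ones; the driver class depends on your Connector/J version):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```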

Apache Hive

What is Apache Hive?

Hive is a data warehouse tool, created by Facebook and now maintained by the Apache Foundation, that aims to facilitate reading, writing and managing datasets using a SQL-like language.

The idea here is to have a SQL-like compiler that produces map-reduce jobs that act upon the data stored in HDFS, much like a relational database would.

By using Hive, we are able to impose a structure on some of our data (depending on its stored format) and query it using SQL queries, which are the tool that most of us are used to. This structure imposition does not change the data; it uses metadata files to look at it differently. Those metadata files are there to "explain" our raw files to the compiler, so it can build the correct MapReduce, Tez or Spark jobs.

In summary, Hive can be used as an OLAP (online analytical processing) tool to interact with the HDFS system by using SQL. Hive is not a relation...
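The "structure imposition" point can be made concrete with a table definition. A hedged HiveQL sketch, assuming a hypothetical CSV file of employees already sitting in HDFS (the path and column names are made up): declaring the table records only metadata, and the raw file is left untouched.

```sql
-- Impose a schema on raw CSV data already in HDFS; the data
-- itself is not moved or rewritten, only described.
CREATE EXTERNAL TABLE employees (
  name   STRING,
  email  STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employees';  -- hypothetical HDFS directory

-- From here on, plain SQL works against the raw files:
SELECT name FROM employees WHERE salary > 1000;
```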

Data Acquisition: Connection to Relational Databases

In this post, we'll cover the connection between Spark and relational databases. The way Spark connects to relational databases is by using a JDBC driver. Let's explore this using MySQL and Oracle as examples.

Creating a MySQL database and connecting to it

For this example, let's explore a locally installed MySQL instance. To install a MySQL instance on Ubuntu Linux, you can issue the following commands:

$ sudo apt update
$ sudo apt install mysql-server

After that, configure your database with the following command:

$ sudo mysql_secure_installation

Next, change the way the root user logs in, to enable it to log in from the prompt. To do that, first you need to log in as the system root; this way, MySQL won't ask for a password:

$ sudo mysql -u root
mysql> USE mysql;
mysql> UPDATE user SET plugin='mysql_...
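Once MySQL is up, the Spark side is a JDBC read. A minimal sketch, with hypothetical host, database, table and credentials; the actual `spark.read` call (shown in comments) needs a running SparkSession and the MySQL Connector/J jar on Spark's classpath, so only the URL construction runs here.

```python
# Sketch of reading a MySQL table through Spark's JDBC data source.
# Host, database, table and credentials below are hypothetical.

def mysql_jdbc_url(host, port, database):
    """Build the JDBC URL Spark expects for a MySQL source."""
    return f"jdbc:mysql://{host}:{port}/{database}"

url = mysql_jdbc_url("localhost", 3306, "company")

# With a SparkSession named `spark`, the read itself would be:
# df = (spark.read.format("jdbc")
#       .option("url", url)
#       .option("dbtable", "employees")   # hypothetical table
#       .option("user", "spark_user")     # hypothetical credentials
#       .option("password", "secret")
#       .load())
print(url)  # jdbc:mysql://localhost:3306/company
```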

Job Programming: User-defined Functions

User-defined functions (UDFs) are a way to run some code over every row (or some rows) of a DataFrame.

Let's imagine that you have a DataFrame called df for which you have created a view, and suppose that you want to run a function you've created when you issue a query. The way you do that is by creating a UDF and registering it with Spark. Once you've done that, you can call it from a query the same way you would a built-in function.

For example, let's take the salary data set from a previous post:

from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
import pyspark.sql.functions as sf

# Creating a spark session
spark = (SparkSession.builder.appName("spark_learn_sc").getOrCreate())

# Reading our input data as a DataFrame
df: DataFrame = spark.read.csv("../examples/salary.dat", inferSchema=True,
    header...
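The register-then-call-from-SQL pattern can be seen without a Spark cluster: Python's built-in sqlite3 module supports the same idea through `Connection.create_function`. A stand-in sketch (the bonus function, names and salary figures are made up; in Spark the registration step would use `spark.udf.register` instead):

```python
import sqlite3

# Register a Python function, then call it from SQL -- illustrated
# with sqlite3 from the standard library. The pattern mirrors what
# Spark does when you register a UDF and use it in a query.

def add_bonus(salary):
    """Hypothetical UDF: add a flat 100.0 bonus."""
    return salary + 100.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, salary REAL)")
conn.executemany("INSERT INTO salaries VALUES (?, ?)",
                 [("John Robbins", 1000.0), ("Ana Silva", 2000.0)])

# Register the Python function so SQL can see it.
conn.create_function("add_bonus", 1, add_bonus)

rows = conn.execute(
    "SELECT name, add_bonus(salary) FROM salaries").fetchall()
print(rows)  # [('John Robbins', 1100.0), ('Ana Silva', 2100.0)]
```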

Creating a Single Node Hadoop - Spark VM for Developing / Learning

Before we go deeper into Data Science, let's pause for a moment and deal with some basic infrastructure, just to allow us to build a basic local Hadoop environment for development or learning purposes.

First and foremost, you should have Linux installed, up and running. Also, you should know your way around Linux. Make sure, at least, that you know how to use the terminal, know the basic file structure, how to pack/unpack things in tar/bz2/gz etc., how to ssh to some machine, how to start a VM (in case you're not using your local machine to install things), and so on. To make a long story short: you need a basic level of Linux knowledge.

If you've never used Linux before, I strongly suggest that you stop whatever you're doing right now and go learn it, at least the basics. Dump your Windows, get a Linux distro, and start using it for your daily-life stuff. Yes, it's necessary. Maybe not so much right now, but it will be. ...