Posts

Showing posts from January, 2022

Dataframe manipulation

     A DataFrame can be manipulated either by using SQL syntax (Spark SQL) or by applying manipulation functions to the DataFrame object itself.

Data Manipulation using Functions

     This method is more suitable for those who are more used to coding than to writing queries. The way you do data manipulation is simple: given a DataFrame, apply transformation functions to it until you get what you need.

     As an example, let's suppose you have structured data that corresponds to a table of divisions of an imaginary IT company:

     divisions = [
         Row(id='1', name='Data Science'),
         Row(id='2', name='Big Data'),
         Row(id='3', name='Artificial Intelligence'),
         Row(id='4', name='Devops'),
     ]

     Let's also have a list of employees for that company:

     Employee = Row("name", "email")
     employees = [Employee("John Robbins", '...

Hive II: Using it with MySQL as a metadata db

       In the previous post, we set up Hive with its metadata stored in Derby. Sometimes this is not desirable, since Derby is not as popular as MySQL, Postgres, or other better-known DBMSs. In this post, we will make Hive store its metadata in a MySQL database. Remember that the data itself is still stored in HDFS; what we're changing here is only the location of the metadata.

       For the sake of simplicity, we won't deal with data migration, so please know that any metadata you have will be lost. We're assuming that you are doing this at the beginning of your system setup. Be aware of that!

MySQL Install

     This is already covered in a previous topic, where we installed MySQL to be accessed by the Spark engine. Please refer to it if you have any other questions, but in summary, what we'll do is:

     sudo apt install mysql...
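As a preview of what the change looks like, the metastore connection is configured in hive-site.xml along these lines; the database name ("metastore"), user, and password below are placeholders to adjust to your own setup:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```

The MySQL JDBC connector jar must also be available on Hive's classpath for the driver class above to be found.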

Apache Hive

What is Apache Hive?

     Hive is a data warehouse tool created by Facebook and now maintained by the Apache Foundation that aims to facilitate reading, writing, and managing datasets using a SQL-like language. The idea here is to have a SQL-like compiler that produces map-reduce jobs, which act upon the data stored in HDFS much like a relational database would.

     By using Hive, we are able to impose a structure on some of our data (depending on its stored format) and query it using SQL queries, which are the tool most of us are used to. This structure imposition does not change the data: Hive uses metadata files to look at it differently. Those metadata files are there to "explain" our raw files to the compiler, so it can build the correct MapReduce, Tez, or Spark jobs.

     In summary, Hive can be used as an OLAP tool (online analytical processing) to interact with HDFS by using SQL. Hive is not a relation...

Data Acquisition: Connection to Relational Databases

      In this post, we'll cover the connection between Spark and relational databases. Spark connects to relational databases by using a JDBC driver. Let's explore this using MySQL and Oracle as examples.

Creating a MySQL database and connecting to it

     For this example, let's explore a locally installed MySQL instance. To install a MySQL instance on Ubuntu Linux, you can issue the following commands:

     $ sudo apt update
     $ sudo apt install mysql-server

     After that, configure your database with the following command:

     $ sudo mysql_secure_installation

     Next, change the way the root user logs in, to enable it to log in from the prompt. To do that, first you need to log in as the system root. This way, MySQL won't ask for a password:

     $ sudo mysql -u root
     mysql> USE mysql;
     mysql> UPDATE user SET plugin='mysql_...

Job Programming: User-defined Functions

      User-defined functions (UDFs) are a way to run custom code over every row (or some rows) of a dataframe. Let's imagine that you have a dataframe called df for which you have created a view, and suppose you want to run a function you've written when you execute a query. The way you do that is by creating a UDF and registering it with Spark. Once you've done that, you can call it from a query the same way you would call a builtin function.

     For example, let's take the salary dataset from a previous post:

     from pyspark.sql import SparkSession
     from pyspark.sql.dataframe import DataFrame
     import pyspark.sql.functions as sf

     # Creating a spark session
     spark = (SparkSession.builder.appName("spark_learn_sc").getOrCreate())

     # Reading our input data as a DataFrame
     df: DataFrame = spark.read.csv("../examples/salary.dat", inferSchema=True,
         header ...