Spark first steps - Introduction [3] - Spark SQL
Spark SQL is another way to do data manipulation, whether your data comes from a DataFrame, a Dataset, a database connector, Hive tables, or Avro, ORC, JSON, and Parquet files.
It implements the ANSI SQL:2003 standard, which means your knowledge of relational databases comes in handy here!
But not just that. It comes with two internal engines that make things interesting: the Catalyst Optimizer and Project Tungsten.
The Catalyst Optimizer is responsible for query optimization. Much like a decent relational database would do, it translates your SQL query into an optimized execution plan suited to Spark's distributed way of dealing with the data.
The way it works is basically in four steps:
- Analysis
- Logical Optimization
- Physical Planning
- Code Generation
Analysis
Spark SQL begins by parsing your query into an abstract syntax tree (AST) to work out its intentions. Column and table references are resolved by looking them up in the internal catalog, which must have been fed beforehand by a register command (such as creating a temporary view).
Logical Optimization
After being analyzed, the query is optimized by the Catalyst Optimizer. It applies rule-based optimizations (such as predicate pushdown and constant folding) to the logical plan, and a cost-based optimizer (CBO) can generate multiple candidate plans, assign a cost to each one, and pick the cheapest. The chosen plan is fed to the next step.
Physical Planning
An optimal physical plan will be generated using physical operators that match those available in the Spark execution engine.
Code Generation
This is where Project Tungsten comes into play. In this phase, efficient Java bytecode is generated and dispatched for execution.
A first example
To help us compare all the different ways we have of accessing and manipulating our data, let's rebuild the same solution as Introduction (part 2) and compute the total salary for each employee in our small example dataset.
The beginning of our code is the same as in the previous DataFrame example:
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
import pyspark.sql.functions as sf
# Creating a spark session
spark = (SparkSession.builder.appName("spark_learn_sc").getOrCreate())
# Reading our input data as a DataFrame
df:DataFrame = spark.read.csv("../examples/salary.dat", inferSchema=True,
header=True)
Next, let's register our DataFrame as a temporary view. Here we are letting Spark infer the table's schema for us.
df.createOrReplaceTempView("tab_salary")
Finally, let's execute our query to retrieve the information we want:
res = spark.sql("""
    select Name, sum(Salary) as Salary
    from tab_salary
    group by Name
    order by Salary
""")
res.show()
Pretty easy!
+-------+------+
| Name|Salary|
+-------+------+
|Charles| 3645|
| Maria| 7910|
| Frank| 8756|
| Josh| 13456|
+-------+------+