Spark first steps - Introduction [3] - Spark SQL
Spark SQL is another way to do data manipulation, whether your data comes from a DataFrame, a Dataset, a database connector, Hive tables, or Avro, ORC, JSON, and Parquet files.
It implements the ANSI SQL:2003 standard, which means your knowledge of relational databases comes in handy here!
But not just that. It comes with two internal engines that make things interesting: the Catalyst Optimizer and Project Tungsten.
The Catalyst Optimizer is responsible for query optimization. Much like a decent relational database would do, it translates your SQL query into an optimized execution plan suited to Spark's distributed way of dealing with the data.
The way it works is basically in four steps:
- Analysis
- Logical Optimization
- Physical Planning
- Code Generation
Analysis
Spark SQL begins by parsing your query into an abstract syntax tree (AST) to work out its intentions. Column and table references are resolved by looking them up in the internal catalog, which must have been fed beforehand by a register command (such as creating a temporary view).
Logical Optimization
After being analyzed, the query is optimized by the Catalyst Optimizer. It applies rule-based optimizations (such as predicate pushdown and constant folding) to the logical plan, and a cost-based optimizer (CBO) can generate multiple candidate plans, assign a cost to each one, and pick the cheapest. The chosen plan is fed to the next step.
Physical Planning
An optimal physical plan will be generated using physical operators that match those available in the Spark execution engine.
Code Generation
This is where Project Tungsten comes into play. In this phase, efficient Java bytecode is generated and dispatched for execution.
A first example
To help us compare all the different ways we have of accessing and manipulating our data, let's rebuild the same solution as Introduction (part 2) and compute the total salary for each employee in our small example dataset.
The beginning of our code is the same as in the previous DataFrame example:
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
import pyspark.sql.functions as sf
# Creating a spark session
spark = (SparkSession.builder.appName("spark_learn_sc").getOrCreate())
# Reading our input data as a DataFrame
df:DataFrame = spark.read.csv("../examples/salary.dat", inferSchema=True,
header=True)
Next, let's register our DataFrame as a temporary view. Here we are letting Spark infer the table's schema for us.
df.createOrReplaceTempView("tab_salary")
Finally, let's execute our query to retrieve the information we want:
res = spark.sql("""
    select Name, sum(Salary) as Salary
    from tab_salary
    group by Name
    order by Salary
""")
res.show()
Pretty easy!
+-------+------+
| Name|Salary|
+-------+------+
|Charles| 3645|
| Maria| 7910|
| Frank| 8756|
| Josh| 13456|
+-------+------+