Spark first steps - Introduction [3] - Spark SQL
Spark SQL is another way to do data manipulation, working over DataFrames, Datasets, database connectors, Hive tables, and file formats such as Avro, ORC, JSON, and Parquet. It implements the ANSI SQL:2003 standard, which means your knowledge of relational databases comes in handy here! But not just that. It ships with two internal engines that make things interesting: the Catalyst Optimizer and Project Tungsten.

The Catalyst Optimizer is responsible for query optimization. Much like a decent relational database would do, it compiles your SQL query down to JVM bytecode that is optimized for the distributed, "Hadoop-style" way of dealing with the data. The way it works is by taking basically four steps:

1. Analysis
2. Logical Optimization
3. Physical Planning
4. Code Generation

Analysis

Spark SQL begins by generating an "Abstract Syntax Tree" (AST) when it investigates wha...