Posts

Showing posts from November, 2021

Spark first steps - Introduction [3] - Spark SQL

Spark SQL is another way to do data manipulation, using a DataFrame, a Dataset, a database connector, Hive tables, Avro, ORC, JSON or Parquet files, etc. It implements the ANSI SQL:2003 standard, which means your knowledge of relational databases comes in handy here! But not just that: it comes with two internal engines that make things interesting, the Catalyst Optimizer and Project Tungsten. The Catalyst Optimizer is responsible for query optimization. Much like a decent relational database would, it converts your SQL query into Java bytecode that is optimized for the "Hadoop" way of dealing with the data. The way it works is basically in four steps: Analysis, Logical Optimization, Physical Planning and Code Generation.

Analysis: Spark SQL begins by generating an "Abstract Syntax Tree" (AST) when it investigates wha...
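To make the logical-optimization step less abstract, here is a toy, pure-Python illustration of one optimization Catalyst applies: predicate pushdown, where a filter is moved before a projection so less data flows through the plan. The row data and helper names are made up for the sketch; this is not Spark code, just the idea behind it.

```python
# Toy rows standing in for a table scanned by a query
rows = [
    {"name": "Josh", "role": "Data Scientist", "salary": 10000},
    {"name": "Maria", "role": "Data Engineer", "salary": 5654},
    {"name": "Frank", "role": "Project Manager", "salary": 8756},
]

def naive_plan(rows):
    # Project every row first, then filter: more work up front
    projected = [{"name": r["name"], "salary": r["salary"]} for r in rows]
    return [r for r in projected if r["salary"] > 6000]

def pushed_down_plan(rows):
    # Filter first, then project only the survivors: same answer, less work
    kept = [r for r in rows if r["salary"] > 6000]
    return [{"name": r["name"], "salary": r["salary"]} for r in kept]

# Both plans are semantically equivalent; the optimizer picks the cheaper one
assert naive_plan(rows) == pushed_down_plan(rows)
```

Catalyst does this kind of rewrite on the logical plan automatically, so the SQL you write does not have to be hand-ordered for performance.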

Spark first steps - Introduction [2] - Process data using RDD x DataFrame

Processing Data using RDD

Now, let's see some examples of data manipulation using RDD, DataFrame and Dataset, starting with the most low-level "basic" RDD API. The first thing you will notice here is that it mimics (or exposes) the way the map-reduce paradigm works, so what you get is a set of functions that perform basic functional operations to either modify the data or to reduce it (or merge it, aggregate it, etc.). Let's start with a simple example. We have a simple CSV file representing our data. It could be many files in HDFS, but for now let's keep things simple for these first steps.

Name,Job Role,Salary
Josh,Data Scientist,10000
Josh,Data Engineer,3456
Maria,Data Engineer,5654
Maria,IoT Specialist,2256
Frank,Project Manager,8756
Charles,Data Engineer,3645

Let's now suppose we want to know each person's total salary. Please notice that a person can have more than one job role. For each job role, ...
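The shape of that computation is a map step (each CSV line becomes a (name, salary) pair) followed by a reduce-by-key step (salaries sharing a name are summed). Here is that pipeline sketched in plain Python, not the RDD API itself, just so the pattern is visible before the Spark version:

```python
from collections import defaultdict

# Lines as they would come out of the CSV file above
csv_lines = [
    "Name,Job Role,Salary",
    "Josh,Data Scientist,10000",
    "Josh,Data Engineer,3456",
    "Maria,Data Engineer,5654",
    "Maria,IoT Specialist,2256",
    "Frank,Project Manager,8756",
    "Charles,Data Engineer,3645",
]

# "map" step: each data line becomes a (name, salary) pair
pairs = []
for line in csv_lines[1:]:          # skip the header line
    name, _role, salary = line.split(",")
    pairs.append((name, int(salary)))

# "reduceByKey" step: sum the salaries that share the same name
totals = defaultdict(int)
for name, salary in pairs:
    totals[name] += salary

print(dict(totals))  # {'Josh': 13456, 'Maria': 7910, 'Frank': 8756, 'Charles': 3645}
```

In the RDD API these two steps map directly onto a `map` transformation followed by `reduceByKey`, with Spark distributing the pairs across the cluster instead of a local dictionary.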

Spark first steps - Introduction [1] - Session, RDD, DataFrame, DataSet

Spark can access and aggregate data from multiple sources, either local or remote. Combined with a Hadoop cluster environment, it provides multi-node data access, which allows you to execute everything in parallel: Spark tries to dispatch job tasks to where the data is, in order to minimize network I/O.

How to work with it

Essentially, once everything is set up, the first thing we need to do is create a Spark session. The first option is to use the pyspark shell. It creates a Spark session for you and saves it in a variable called 'spark' (the underlying SparkContext is also exposed, as 'sc'):

Spark session created by the pyspark shell

Another way is to create it manually, in code. You can use an IDE or a Jupyter notebook to do so. In this first example, we run it from a simple Visual Studio ".py" program:

Spark session created manually
...