Spark first steps - Introduction [1] - Session, RDD, DataFrame, DataSet

    Spark can access and aggregate data. It allows you to access multiple sources of data, either locally or remotely.

    Combined with a Hadoop cluster environment, it provides multi-node data access, which allows you to execute everything in parallel. Spark tries to dispatch each job's tasks to where the data is, in order to minimize network I/O.

  

How to work with it

    Essentially, once everything is set up, the first thing we need to do is create a Spark session.

    The first option is to use the pyspark shell. It creates a Spark session for you and stores it in a variable called 'spark' (the underlying SparkContext is also available, in a variable called 'sc'):

Spark session created by the pyspark shell



    Another way is to create it manually, in code. You can use an IDE or a Jupyter notebook to do so. In this first example, we run it from a simple ".py" program in Visual Studio:


Spark session created manually



    Now we do the same thing with a Jupyter Notebook. 

    Essentially, there's no difference between doing it in code using an IDE and doing it in a notebook. Nevertheless, a notebook lets us experiment with our code much more easily: we can write a little bit of code, run it, change it, and test it to see if it works, all on the fly.

Spark session created manually in a Jupyter Notebook


    After creating our Spark session, by any of the above methods, we can now instruct Spark to access some data to work with.

    The way Spark works is that it creates an abstraction called an RDD for each piece of data we instruct it to access. On top of an RDD, Spark can build a DataFrame or a DataSet, which are higher-level ways of accessing that data.

    First and foremost, what is an RDD?

    RDD is short for Resilient Distributed Dataset. At its core, it is an immutable collection that is potentially distributed / partitioned across the cluster and supports parallel operations.

    Each RDD has three key characteristics:
  • Dependency list
  • Partitions
  • Compute (or transform) function
    The dependency list is what makes the RDD resilient. Whenever an RDD needs to be reconstructed for any reason, for example after a memory loss or a node reset, it can be rebuilt by "replaying" it from its dependencies.

    Partitions allow Spark to split the workload across the cluster. In some cases, it can dispatch the execution physically close to the data: for example, when using HDFS, the work can be done on the node where that portion of the data is located. This results in less I/O wait time on the cluster.

    Finally, the compute (or transform) function is responsible for transforming or acting upon our RDD, providing us with an iterator to work with.

    DataFrame

    If we're working with table-like, structured data, such as CSV files or relational database tables (or file representations of them), we can use a more user-friendly abstraction called a DataFrame.

    A DataFrame is a data structure inspired by Pandas DataFrames that works as a distributed table, with named columns and a schema. It therefore allows us to impose a format on our distributed data collection, making it easier to manipulate.

    DataSet

    The DataSet concept was introduced in Spark 1.6. Originally, a DataSet was a collection of typed objects that could be transformed using functional or relational operations.

    Later on, in Spark 2.0, Spark merged DataFrame into the DataSet concept as a single abstraction. So, a DataFrame is now an untyped DataSet (in Scala, a DataSet[Row]).

    In dynamically typed languages, such as Python, typed DataSets do not make sense, so when we use Python to process our data we deal with DataFrames (which are an alias for untyped DataSets).

    In typed languages such as Java or Scala, we can make use of DataSets to better handle our data.

    
    RDD vs DataFrame vs DataSet

    Why would someone use RDDs if we have DataSets to work with?

    Well, first of all, and maybe the most important thing to bear in mind: RDDs can handle unstructured data!

    So, let's imagine we're dealing with a media stream or a text stream, or that for whatever reason your data cannot be manipulated with a fixed schema: you won't be able to use a DataSet, because you won't have a structure.

    In that case, the RDD API lets you manipulate your data with functional programming, or use low-level transformations and actions to control it.

    Also, by using RDDs, you have full control over how Spark should process your data. This comes at a price: any built-in optimizations are lost, and it is up to you to make things perform well.

    Now, why use (typed) DataSets instead of DataFrames?

    The key benefit is that code analysis can surface errors at compile time. When we use untyped DataSets (DataFrames), only syntax errors are caught at compile time; analysis errors, such as referencing a nonexistent column, only show up at execution time, so we potentially have bugs waiting to be discovered while the job runs.

    Depending on the size of our DataSet, that can mean discovering a bug after hours of execution.

    When using dynamically typed languages such as Python or R, this is something we need to learn to live with; but for those using Scala or Java, it is a better approach to use typed DataSets and catch those nasty bugs before running the code for hours.

