In this post, we'll cover how Spark connects to relational databases. Spark talks to relational databases through a JDBC driver. Let's explore this using MySQL and Oracle as examples.
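Both examples share the same shape: a JDBC URL, a driver class name, and credentials, passed as options to Spark's JDBC data source. As a quick preview, here is a sketch of the URL/driver pairs we'll use below (the host, database, and service names are the ones from this post's local setup):

```python
# Sketch: the JDBC URL / driver-class pairs used later in this post.
# Host, database, and service names match the local setup described below.
mysql_opts = {
    "url": "jdbc:mysql://localhost/sparkdb",
    "driver": "com.mysql.cj.jdbc.Driver",
}
oracle_opts = {
    "url": "jdbc:oracle:thin:@localhost:1521/ORCLPDB1",
    "driver": "oracle.jdbc.driver.OracleDriver",
}

# Every JDBC URL starts with the "jdbc:" scheme followed by a
# vendor sub-protocol ("mysql", "oracle:thin", ...).
for opts in (mysql_opts, oracle_opts):
    assert opts["url"].startswith("jdbc:")
```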
Creating a MySQL database and connecting to it.
For this example, let's use a locally installed MySQL instance. To install MySQL on Ubuntu Linux, issue the following commands:
$ sudo apt update
$ sudo apt install mysql-server
After that, configure your database with the following command:
$ sudo mysql_secure_installation
Next, change the root user's authentication method so that it can log in from the prompt with a password. To do that, first log in as the system root user; this way, MySQL won't ask for a password:
$ sudo mysql -u root
mysql> USE mysql;
mysql> UPDATE user SET plugin='mysql_native_password' WHERE User='root';
mysql> FLUSH PRIVILEGES;
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'MyNewPass';
mysql> exit;
Now we can connect to this database by doing:
$ mysql -u root -p
mysql>
Next, we create a database and switch our prompt to it:
mysql> create database sparkdb;
mysql> use sparkdb;
And a table:
mysql> create table tab_users (name varchar(512), age int, sex varchar(2));
Now let's put some data into that table:
mysql> insert into tab_users (name, age, sex) values ('Joao Borges', 31, 'M');
mysql> insert into tab_users (name, age, sex) values ('Madalena Silveira', 29, 'F');
mysql> insert into tab_users (name, age, sex) values ('Inácio Santos', 54, 'M');
mysql> insert into tab_users (name, age, sex) values ('Fernando Santos', 55, 'M');
mysql> insert into tab_users (name, age, sex) values ('Lena Sampaio', 27, 'F');
mysql> insert into tab_users (name, age, sex) values ('Joana Fernandez', 19, 'F');
mysql> insert into tab_users (name, age, sex) values ('Ana Paula Dias', 65, 'F');
mysql> insert into tab_users (name, age, sex) values ('Jose Carlos', 63, 'M');
mysql> select * from tab_users;
+-------------------+------+------+
| name              | age  | sex  |
+-------------------+------+------+
| Joao Borges       |   31 | M    |
| Madalena Silveira |   29 | F    |
| Inácio Santos     |   54 | M    |
| Fernando Santos   |   55 | M    |
| Lena Sampaio      |   27 | F    |
| Joana Fernandez   |   19 | F    |
| Ana Paula Dias    |   65 | F    |
| Jose Carlos       |   63 | M    |
+-------------------+------+------+
Now we can go back to our Spark Jupyter notebook and query that same data from our program. But first, we need to install the MySQL JDBC connector so Spark can access the database. To do that, go to
https://dev.mysql.com/downloads/connector/j/, select your OS (in our case, Ubuntu Linux), then download the version most suitable for your system.
Then, install it with gdebi:
$ sudo gdebi ./mysql-connector-java_8.0.27-1ubuntu20.04_all.deb
All set, now we are able to access it using Spark:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

conf = SparkConf()
conf.setAll([
    ("spark.jars", "/usr/share/java/mysql-connector-java-8.0.27.jar")
])

spark = (SparkSession.builder
    .appName("spark_learn_sc")
    .config(conf=conf)
    .getOrCreate())
df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost/sparkdb")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "tab_users")
    .option("user", "(your user; can be the root user)")
    .option("password", "(the user's password)")
    .load())
print(df.take(10))
[Row(name='Joao Borges', age=31, sex='M'), Row(name='Madalena Silveira', age=29, sex='F'), Row(name='Inácio Santos', age=54, sex='M'), Row(name='Fernando Santos', age=55, sex='M'), Row(name='Lena Sampaio', age=27, sex='F'), Row(name='Joana Fernandez', age=19, sex='F'), Row(name='Ana Paula Dias', age=65, sex='F'), Row(name='Jose Carlos', age=63, sex='M')]
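The read above runs a single query over a single connection. For larger tables, Spark can parallelize a JDBC read if you also pass the partitionColumn, lowerBound, upperBound, and numPartitions options; it then issues one query per partition, each with a generated WHERE clause on the partition column. The sketch below (plain Python, not Spark's actual implementation) shows roughly how those per-partition predicates are derived; the column name and bounds are illustrative:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Roughly how Spark turns partitionColumn / lowerBound / upperBound /
    numPartitions into one WHERE clause per partition.
    Simplified sketch: assumes an integer column and num_partitions >= 2."""
    stride = (upper - lower) // num_partitions
    predicates = []
    bound = lower
    for i in range(num_partitions):
        if i == 0:
            # First partition also catches rows below lowerBound.
            predicates.append(f"{column} < {bound + stride}")
        elif i == num_partitions - 1:
            # Last partition also catches rows at or above upperBound.
            predicates.append(f"{column} >= {bound}")
        else:
            predicates.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    return predicates

print(jdbc_partition_predicates("age", 0, 80, 4))
```

Note that lowerBound and upperBound only shape the partitioning; rows outside that range are still read, by the first and last partitions.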
Creating an Oracle database and connecting to it.
Now, let's set up a locally installed Oracle instance. For this exercise, we'll install the 19c (19.3) version in a Docker container.
First, clone this project. It's a set of files provided by Oracle that contains the scripts and Dockerfiles for building a container that runs the Oracle database.
$ git clone https://github.com/oracle/docker-images
Next, go to the URL http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html and download the Oracle single instance database install file LINUX.X64_193000_db_home.zip.
Place it in its equivalent path inside docker-images, according to the example below (set for the 19.3 version):
$ cp LINUX.X64_193000_db_home.zip docker-images/OracleDatabase/SingleInstance/dockerfiles/19.3.0
Next, go to the dockerfiles directory and run the script that builds the image:
$ cd docker-images/OracleDatabase/SingleInstance/dockerfiles
$ ./buildDockerImage.sh -v 19.3.0 -e
After that, create a folder to hold the database data:
$ mkdir ~/oradata
$ sudo chmod 777 ~/oradata
Finally, run a container from the built image:
$ docker run --name oracle19c -p 1521:1521 -p 5500:5500 \
    -v ~/oradata:/opt/oracle/oradata oracle/database:19.3.0-ee
Now, change the admin password with the following command:
$ docker exec oracle19c ./setPassword.sh (NEW PASSWORD)
Check that the container is running:
$ docker ps -a
CONTAINER ID   IMAGE                       COMMAND                  CREATED       STATUS                 PORTS                                                                                  NAMES
e304f7143c07   oracle/database:19.3.0-ee   "/bin/sh -c 'exec $O…"   6 hours ago   Up 6 hours (healthy)   0.0.0.0:1521->1521/tcp, :::1521->1521/tcp, 0.0.0.0:5500->5500/tcp, :::5500->5500/tcp   oracle19c
You should now be able to connect to the new Oracle database. One way is to use the sqlplus tool inside the Docker container by issuing:
$ docker exec -ti e304f7143c07 sqlplus system/(PASSWORD)@orclpdb1
You can also access it using the Oracle SQL Developer tool. If you don't have this tool on your computer, you can install it as follows:
1. Download the JDK deb install file from https://www.oracle.com/java/technologies/javase-downloads.html. For this example, I've downloaded jdk-17_linux-x64_bin.deb.
2. Install it according to your Linux distribution:
$ sudo gdebi jdk-17_linux-x64_bin.deb
3. Now download the Oracle SQL Developer tool from https://www.oracle.com/tools/downloads/sqldev-downloads.html.
4. Unzip it and move it to /opt (or the right spot according to your Linux distro):
$ unzip sqldeveloper-21.4.1.349.1822-no-jre.zip
$ sudo mv sqldeveloper /opt
5. Go to /opt/sqldeveloper and run sqldeveloper.sh:
$ cd /opt/sqldeveloper
$ ./sqldeveloper.sh
Now click on + to add a new connection and fill it accordingly:
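For reference, the connection fields mirror what we used with sqlplus and the docker run step: the hostname and port come from the container's port mapping, the service name is the pluggable database, and the user is the system account whose password we set with setPassword.sh. A sketch of those values:

```python
# Connection fields for the new SQL Developer connection.
# These values come from the docker run and setPassword.sh steps above.
connection = {
    "username": "system",
    "hostname": "localhost",
    "port": 1521,
    "service_name": "orclpdb1",
}

# The port must match the one published by docker run (-p 1521:1521).
assert connection["port"] == 1521
```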
Creating a user/schema and adding some data.
For this example, let's use the sqlplus command prompt tool.
SQL> create user usr_sparkdb identified by "PASSWORD";
SQL> grant create session to usr_sparkdb;
SQL> alter user usr_sparkdb quota unlimited on USERS;
SQL> create table usr_sparkdb.tab_users (name varchar2(512), age int, sex varchar2(2));
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Joao Borges', 31, 'M');
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Madalena Silveira', 29, 'F');
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Inácio Santos', 54, 'M');
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Fernando Santos', 55, 'M');
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Lena Sampaio', 27, 'F');
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Joana Fernandez', 19, 'F');
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Ana Paula Dias', 65, 'F');
SQL> insert into usr_sparkdb.tab_users (name, age, sex) values ('Jose Carlos', 63, 'M');
SQL> commit;
Now, let's access this data using Spark. First, we need to download the Oracle JDBC driver. For the 19c version, you can find it at
https://www.oracle.com/database/technologies/appdev/jdbc-ucp-19c-downloads.html
Unzip it and put the ojdbc10.jar somewhere accessible. Then:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

conf = SparkConf()
conf.setAll([
    ("spark.jars", "(PATH TO)ojdbc10.jar")
])

spark = (SparkSession.builder
    .appName("spark_learn_sc")
    .config(conf=conf)
    .getOrCreate())
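If you want both the MySQL and the Oracle drivers available in the same session, note that spark.jars takes a single comma-separated string of paths, so you can join them before building the conf. A small sketch (the paths here are illustrative; point them at wherever your jars actually live):

```python
# spark.jars accepts a comma-separated list of jar paths, so both
# connectors can be registered in one Spark session.
# Paths are illustrative; adjust them to your own layout.
jars = [
    "/usr/share/java/mysql-connector-java-8.0.27.jar",
    "/path/to/ojdbc10.jar",
]
spark_jars_value = ",".join(jars)
print(spark_jars_value)
```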
df = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@localhost:1521/ORCLPDB1")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("dbtable", "tab_users")
    .option("user", "usr_sparkdb")
    .option("password", "(PASSWORD)")
    .load())
print(df.take(1))
[Row(NAME='Joao Borges', AGE=Decimal('31'), SEX='M')]
And that's it!