Job Programming: User-defined Functions

     User-defined functions (UDF) are a way to run some code over every line (or some line) of a dataframe.

     Let's imagine that you have a dataframe called df that you have created a view for it. Let's suppose that you want to run a function that you've created once you do a query. The way you do that is by creating an UDF and registering it to spark.

    Once you've done that, you can call if from a query in the same way you'll do with a builtin function.

    For example, lets take the salary data set from a previous post:

from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
import pyspark.sql.functions as sf

# Creating a spark session
spark = (SparkSession.builder.appName("spark_learn_sc").getOrCreate())

# Reading our input data as a DataFrame
df:DataFrame = spark.read.csv("../examples/salary.dat", inferSchema=True,
    header=True)

df.createOrReplaceTempView("tab_salary")

    Now let's suppose we're interested in finding out how far a salary is from the mean.

    First, lets find the mean for this dataset:

    

def rate(s, mean):
    return float(s / mean)

session.sparkSession.udf.register("rate", rate, DoubleType())

res = session.sparkSession.sql("""select Sum(Salary) from tab_salary""")
total = float(res.take(1)[0][0])
res = session.sparkSession.sql("""select distinct Name from tab_salary""")
num_users = res.count()

mean = total / num_users
print ("Mean salary: " + str(mean))

res = session.sparkSession.sql("""select Name, Sum(Salary) as Salary,
rate(Sum(Salary), %d) as RateMean from
tab_salary group by Name order by Salary""" % mean)

res.show()

    Notice that the method "rate" is a local method that is being executed in the dataset as it was a builtin method!


Comentários

Postagens mais visitadas deste blog

Data Acquisition: Connection to Relational Databases