User-defined functions (UDF) are a way to run some code over every line (or some line) of a dataframe.
Let's imagine that you have a dataframe called df that you have created a view for it. Let's suppose that you want to run a function that you've created once you do a query. The way you do that is by creating an UDF and registering it to spark.
Once you've done that, you can call if from a query in the same way you'll do with a builtin function.
For example, lets take the salary data set from a previous post:
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
import pyspark.sql.functions as sf
# Creating a spark session
spark = (SparkSession.builder.appName("spark_learn_sc").getOrCreate())
# Reading our input data as a DataFrame
df:DataFrame = spark.read.csv("../examples/salary.dat", inferSchema=True,
header=True)
df.createOrReplaceTempView("tab_salary")
Now let's suppose we're interested in finding out how far a salary is from the mean.
First, lets find the mean for this dataset:
def rate(s, mean):
return float(s / mean)
session.sparkSession.udf.register("rate", rate, DoubleType())
res = session.sparkSession.sql("""select Sum(Salary) from tab_salary""")
total = float(res.take(1)[0][0])
res = session.sparkSession.sql("""select distinct Name from tab_salary""")
num_users = res.count()
mean = total / num_users
print ("Mean salary: " + str(mean))
res = session.sparkSession.sql("""select Name, Sum(Salary) as Salary,
rate(Sum(Salary), %d) as RateMean from
tab_salary group by Name order by Salary""" % mean)
res.show()
Notice that the method "rate" is a local method that is being executed in the dataset as it was a builtin method!
Comentários
Postar um comentário