Applying a udf function in a distributed fashion in PySpark - python

Say I have a very basic Spark DataFrame that consists of a couple of columns, one of which contains a value that I want to modify.
| value | lang |
|-------|------|
| 3     | en   |
| 4     | ua   |
Say I want a new column per specific class, where I add a float number to the given value. (This is not that relevant to the final question; in reality I run a prediction with sklearn there, but for simplicity let's assume we are just adding something. The idea is that I modify the value in some way.) So, given a dict classes={'1': 2.0, '2': 3.0}, I would like to have a column for each class where I add the class value to the value from the DF, and then save the result for each class to a csv:
class_1.csv
| value | lang | my_class | modified    |
|-------|------|----------|-------------|
| 3     | en   | 1        | 5.0 (3+2.0) |
| 4     | ua   | 1        | 6.0 (4+2.0) |
class_2.csv
| value | lang | my_class | modified    |
|-------|------|----------|-------------|
| 3     | en   | 2        | 6.0 (3+3.0) |
| 4     | ua   | 2        | 7.0 (4+3.0) |
So far I have the following code, which works and modifies the value for each defined class, but it does so with a for loop and I am looking for a better way to do it:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf
from pyspark.sql.functions import lit

# create session and context
spark = pyspark.sql.SparkSession.builder.master("yarn").appName("SomeApp").getOrCreate()
conf = SparkConf().setAppName('Some_App').setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

my_df = spark.read.csv("some_file.csv")

# modify the value here
def do_stuff_to_column(value, separate_class):
    # do stuff to the column; let's pretend we just add a specific value
    # per specific class that is read from a dictionary
    class_dict = {'1': 2.0, '2': 3.0}  # would be loaded from somewhere
    return float(value + class_dict[separate_class])

# iterate over each given class later
class_dict = {'1': 2.0, '2': 3.0}  # in reality there are more than 10 classes

# create a udf function
udf_modify = udf(do_stuff_to_column, FloatType())

# loop over each class
for my_class in class_dict:
    # create the column first with lit
    my_df2 = my_df.withColumn("my_class", lit(my_class))
    # modify using the udf function
    my_df2 = my_df2.withColumn("modified", udf_modify("value", "my_class"))
    # write to csv now
    my_df2.write.format("csv").save("class_" + my_class + ".csv")
So the question is: is there a better/faster way of doing this than a for loop?

I would use some form of join, in this case crossJoin. Here's an MWE:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(3, 'en'), (4, 'ua')], ['value', 'lang'])
classes = spark.createDataFrame([(1, 2.), (2, 3.)], ['class_key', 'class_value'])
res = df.crossJoin(classes).withColumn('modified', F.col('value') + F.col('class_value'))
res.show()
For saving as separate CSVs I think there is no better way than to use a loop.
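Depending on the required output layout, DataFrameWriter.partitionBy may also work: instead of one named file per class it writes one subdirectory per class key. A minimal sketch building on the res frame above ('output_path' is just a placeholder):
# Writes one folder per class, e.g. output_path/class_key=1/, output_path/class_key=2/
res.write.partitionBy('class_key').csv('output_path', header=True)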

Related

Object to dictionary to use get() python pandas

I'm having some issues with a column in my CSV whose type is 'object', but it should be a series of dicts (one dict per row).
The point is to treat each row as a dict so I can call get('id') on it and return the id value for each row in the 'Conta' column.
This is what the 'object' column looks like:
| Conta |
| ---------------------------------------------|
| {'name':'joe','id':'4347176000574713087'} |
| {'name':'mary','id':'4347176000115055151'} |
| {'name':'fred','id':'4347176000574610147'} |
| {'name':'Marcos','id':'4347176000555566806'} |
| {'name':'marcos','id':'4347176000536834310'} |
This is what it should look like in the end:
| Conta |
| ------------------- |
| 4347176000574713087 |
| 4347176000115055151 |
| 4347176000574610147 |
| 4347176000555566806 |
| 4347176000536834310 |
I tried to use:
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df['Conta'] = df['Conta'].to_dict()
df['Conta'] = [x.get('id', 0) for x in df['Conta']]
#return: AttributeError: 'str' object has no attribute 'get'
I also tried to use ast.literal_eval(), but it doesn't work either:
import ast
import pandas as pd
df = pd.read_csv('csv/Modulo_CS.csv')
df = df[['Conta','ID_CS']]
df['Conta'] = df['Conta'].apply(ast.literal_eval)
#return: ValueError: malformed node or string: nan
Can someone help me?
Consider replacing the following line:
df['Conta'] = df['Conta'].apply(ast.literal_eval)
If it's being correctly detected as a dictionary then:
df['Conta'] = df['Conta'].map(lambda x: x['id'])
If each row is a string:
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(x)['id'])
However, if you are getting a malformed node or JSON error, consider converting to str first and then applying ast.literal_eval():
df['Conta'] = df['Conta'].map(lambda x: ast.literal_eval(str(x))['id'])
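The ValueError: malformed node or string: nan in the question suggests the column also contains missing values (empty CSV cells read as NaN). A minimal sketch that skips those rows instead of raising; extract_id is a hypothetical helper name:
import ast
import pandas as pd

def extract_id(cell):
    # Empty cells come back from read_csv as float NaN; return None for those
    if pd.isna(cell):
        return None
    return ast.literal_eval(str(cell)).get('id')

df['Conta'] = df['Conta'].map(extract_id)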

Replace datetime column in DF with hour from int column

I have a pyspark df with an hour column (int) like this:
| hour |
|------|
| 0    |
| 0    |
| 1    |
| ...  |
| 14   |
And I have an execution_datetime variable that looks like 2022-01-02 17:23:11
Now I want to calculate a new column for my DF that holds my execution_datetime but with the hour replaced by the values from the hour column. The output should look like:
| hour | exec_dttm_with_hour |
|------|---------------------|
| 0    | 2022-01-02 00:23:11 |
| 0    | 2022-01-02 00:23:11 |
| 1    | 2022-01-02 01:23:11 |
| ...  | ...                 |
| 14   | 2022-01-02 14:23:11 |
I know there are ways using e.g. .collect(), then editing the list and inserting it as a new column. But I need to make use of Spark's parallel execution since the data volume could be very high. Also, converting to pandas and then editing is not suitable for my use case.
Thanks in advance for any suggestions!
You can use the make_timestamp function if you're on Spark 3+; otherwise you can use a UDF.
import pyspark.sql.functions as func

execution_datetime = '2022-01-02 17:23:11'  # example value from the question

spark.range(15). \
    withColumnRenamed('id', 'hour'). \
    withColumn('static_dttm', func.lit(execution_datetime).cast('timestamp')). \
    withColumn('dttm',
               func.expr('''make_timestamp(year(static_dttm),
                                           month(static_dttm),
                                           day(static_dttm),
                                           hour,
                                           minute(static_dttm),
                                           second(static_dttm)
                                          )''')
               ). \
    drop('static_dttm'). \
    show()
# +----+-------------------+
# |hour| dttm|
# +----+-------------------+
# | 0|2022-01-02 00:23:11|
# | 1|2022-01-02 01:23:11|
# | 2|2022-01-02 02:23:11|
# | 3|2022-01-02 03:23:11|
# | 4|2022-01-02 04:23:11|
# | 5|2022-01-02 05:23:11|
# | 6|2022-01-02 06:23:11|
# | 7|2022-01-02 07:23:11|
# | 8|2022-01-02 08:23:11|
# | 9|2022-01-02 09:23:11|
# | 10|2022-01-02 10:23:11|
# | 11|2022-01-02 11:23:11|
# | 12|2022-01-02 12:23:11|
# | 13|2022-01-02 13:23:11|
# | 14|2022-01-02 14:23:11|
# +----+-------------------+
Using UDF
from pyspark.sql.types import TimestampType

def update_ts(string_ts, hour_col):
    import datetime
    dttm = datetime.datetime.strptime(string_ts, '%Y-%m-%d %H:%M:%S')
    return datetime.datetime(dttm.year, dttm.month, dttm.day, hour_col, dttm.minute, dttm.second)

update_ts_udf = func.udf(update_ts, TimestampType())

spark.range(15). \
    withColumnRenamed('id', 'hour'). \
    withColumn('dttm', update_ts_udf(func.lit(execution_datetime), func.col('hour'))). \
    show()
# +----+-------------------+
# |hour| dttm|
# +----+-------------------+
# | 0|2022-01-02 00:23:11|
# | 1|2022-01-02 01:23:11|
# | 2|2022-01-02 02:23:11|
# | 3|2022-01-02 03:23:11|
# | 4|2022-01-02 04:23:11|
# | 5|2022-01-02 05:23:11|
# | 6|2022-01-02 06:23:11|
# | 7|2022-01-02 07:23:11|
# | 8|2022-01-02 08:23:11|
# | 9|2022-01-02 09:23:11|
# | 10|2022-01-02 10:23:11|
# | 11|2022-01-02 11:23:11|
# | 12|2022-01-02 12:23:11|
# | 13|2022-01-02 13:23:11|
# | 14|2022-01-02 14:23:11|
# +----+-------------------+
You can use concat to concatenate multiple strings in a row and lit to add a constant value to each row.
In the following code a new column timestamp is introduced: the first 11 characters of execution_datetime are concatenated with the characters after the hours, and the hours from the hour column are inserted in between. It also makes sure that the hours have a leading zero.
import pyspark.sql.functions as f

df = df.withColumn(
    'timestamp',
    f.concat(
        f.lit(execution_datetime[0:11]),   # '2022-01-02 ' (date plus trailing space)
        f.lpad(f.col('hour'), 2, '0'),     # hour from the column, zero-padded
        f.lit(execution_datetime[13:])     # ':23:11' (minutes and seconds)
    )
)
Remark: this might be faster than using the timestamp functions suggested in samkart's answer, but it is also less safe in catching malformed inputs.
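If the result is needed as an actual timestamp rather than a string, the concatenated column can be cast afterwards; a small sketch reusing the 'timestamp' column name from the snippet above:
df = df.withColumn('timestamp', f.col('timestamp').cast('timestamp'))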

Adding random samples from one spark dataframe to another

I have two dataframes like this:
| User |
|------|
| 1    |
| 2    |
| 3    |
and
| Articles |
|----------|
| 'A'      |
| 'B'      |
| 'C'      |
What's an intuitive way to assign each user 2 articles randomly?
The output dataframe might look like this:
| User | Articles |
|------|----------|
| 1    | 'A'      |
| 1    | 'C'      |
| 2    | 'C'      |
| 2    | 'B'      |
| 3    | 'C'      |
| 3    | 'A'      |
Here's the code that will generate these two dataframes:
from pyspark.sql import Row

u = [(1,), (2,), (3,)]
rdd = sc.parallelize(u)
users = rdd.map(lambda x: Row(user_id=x[0]))
users_df = sqlContext.createDataFrame(users)

a = [('A',), ('B',), ('C',), ('D',), ('E',)]
rdd = sc.parallelize(a)
articles = rdd.map(lambda x: Row(article_id=x[0]))
articles_df = sqlContext.createDataFrame(articles)
Since your article list is small, it makes sense to keep it as a Python object rather than a distributed list. This lets you create a udf that produces a random list of articles for each user_id. The following is one way you could do so:
from random import sample, seed
from pyspark.sql import Row
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StringType

class ArticleRandomizer(object):
    def __init__(self, article_list, num_articles=2, preseed=0):
        self.article_list = article_list
        self.num_articles = num_articles
        self.preseed = preseed

    def getrandom(self, user):
        seed(user + self.preseed)
        return sample(self.article_list, self.num_articles)

u = [(1,), (2,), (3,)]
rdd = sc.parallelize(u)
users = rdd.map(lambda x: Row(user_id=x[0]))
users_df = sqlContext.createDataFrame(users)

a = [('A',), ('B',), ('C',), ('D',), ('E',)]
#rdd = sc.parallelize(a)
#articles = rdd.map(lambda x: Row(article_id=x[0]))
#articles_df = sqlContext.createDataFrame(articles)
article_list = [article[0] for article in a]

ARandomizer = ArticleRandomizer(article_list)
add_articles = udf(ARandomizer.getrandom, ArrayType(StringType()))
users_df.select('user_id', explode(add_articles('user_id'))).show()
The ArticleRandomizer.getrandom function is seeded with the user_id, so it is deterministic: you will get the same random list of articles for a given user on each run. You can get a potentially different list by changing the preseed value when you instantiate the class, as sketched below.
This hasn't been tested for scalability, but it should work fine on your dataset because both the articles and the users dimensions are fairly small.
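For example (a quick usage sketch; the _alt names are hypothetical), a different preseed gives a different but still reproducible assignment:
ARandomizer_alt = ArticleRandomizer(article_list, preseed=42)
add_articles_alt = udf(ARandomizer_alt.getrandom, ArrayType(StringType()))
users_df.select('user_id', explode(add_articles_alt('user_id'))).show()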
If the Articles DataFrame is indeed pretty small, we can run collect_list, which will take the entire DataFrame and turn it into a single row with an array column.
| Articles        |
|-----------------|
| ['A', 'B', 'C'] |
Then we can cross-join this table with the Users one, randomly generate two different integers (this is the main part of the code below), and pick those two elements from the Articles column.
The explode function is used to achieve the format you presented in the original question.
from pyspark.sql.functions import collect_list, rand, when, col, size, floor, explode, array

articles_collected = articles.agg(collect_list("Articles").alias("articles"))

users \
    .join(articles_collected, how="cross") \
    .withColumn(
        "first_rand",
        floor(rand() * size("articles"))
    ) \
    .withColumn(
        "second_rand",
        when(
            col("first_rand") == 0,
            floor(rand() * (size("articles") - 1)) + 1
        ).otherwise(
            floor(rand() * col("first_rand"))
        )
    ) \
    .withColumn(
        "articles_picked",
        array(
            col("articles").getItem(col("first_rand").cast("int")),
            col("articles").getItem(col("second_rand").cast("int"))
        )
    ) \
    .select(
        "User",
        explode("articles_picked").alias("Articles")
    )
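On Spark 2.4+ there is also a shorter variant using the built-in shuffle and slice array functions; a minimal sketch, assuming the same users and articles_collected frames as above:
import pyspark.sql.functions as F

# shuffle() returns a random permutation of the array per row,
# slice(..., 1, 2) keeps its first two elements
users \
    .join(articles_collected, how="cross") \
    .withColumn("articles_picked", F.slice(F.shuffle(F.col("articles")), 1, 2)) \
    .select("User", F.explode("articles_picked").alias("Articles"))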

Automatically multiprocessing a 'function apply' on a dataframe column

I have a simple dataframe with two columns.
+---------+-------+
| subject | score |
+---------+-------+
| wow     |     0 |
| cool    |     0 |
| hey     |     0 |
| there   |     0 |
| come on |     0 |
| welcome |     0 |
+---------+-------+
For every record in the 'subject' column, I am calling a function and storing the result in the 'score' column:
df['score'] = df['subject'].apply(find_score)
Here find_score is a function which processes strings and returns a score:
def find_score(row):
    # Imports the Google Cloud client library
    from google.cloud import language

    # Instantiates a client
    language_client = language.Client()

    import re
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)

    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score
This works as expected, but it is quite slow since it processes the records one by one.
Is there a way this can be parallelised without manually splitting the dataframe into smaller chunks? Is there a library which does that automatically?
Cheers
The instantiation of language.Client every time you call the find_score function is likely a major bottleneck. You don't need to create a new client instance for every use of the function, so try creating it outside the function, before you call it:
# Import the Google Cloud client library and instantiate the client once,
# outside the function
from google.cloud import language
import re

language_client = language.Client()

def find_score(row):
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    document = language_client.document_from_text(text)
    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment
    print("Sentiment score - %f " % sentiment.score)
    return sentiment.score

df['score'] = df['subject'].apply(find_score)
If you insist, you can use multiprocessing like this:
from multiprocessing import Pool
# <Define functions and datasets here>
pool = Pool(processes = 8) # or some number of your choice
df['score'] = pool.map(find_score, df['subject'])
pool.terminate()
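Since the per-row work here is a network call to the Cloud Natural Language API rather than CPU-bound computation, a thread pool is another reasonable option; a sketch using the same find_score function as above (the worker count is arbitrary):
from concurrent.futures import ThreadPoolExecutor

# Threads avoid pickling the client and simply overlap the API round-trips
with ThreadPoolExecutor(max_workers=8) as executor:
    df['score'] = list(executor.map(find_score, df['subject']))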

Use the result from Cross tab (spark dataframe) for chi-square test in SparkMlib

I've generated a crosstab DataFrame in Spark and want to perform the chi-squared test.
It seems that Statistics.chiSqTest can only be applied to a matrix. My DataFrame looks like the one below, and I want to see whether the level distribution is the same across the three groups: true, false, and Undefined.
from pyspark.mllib.stat import Statistics
+-----------------------------+-------+--------+----------+
|levels | true| false|Undefined |
+-----------------------------+-------+--------+----------+
| 1 |32783 |634460 |2732340 |
| 2 | 2139 | 41248 |54855 |
| 3 |28837 |573746 |5632147 |
| 4 |16473 |320529 |8852552 |
+-----------------------------+-------+--------+----------+
Is there any easy way to transform this in order to be used for chi-squared test?
One way to handle this without using mllib.Statistics:
import scipy.stats

crosstab = ...

scipy.stats.chi2_contingency(
    crosstab.drop(crosstab.columns[0]).toPandas().to_numpy()
)
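For reference, chi2_contingency returns the test statistic, the p-value, the degrees of freedom and the expected frequencies, so the result can be unpacked directly (a small sketch):
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(
    crosstab.drop(crosstab.columns[0]).toPandas().to_numpy()
)
print(p_value)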
If you really want Spark statistics:
from itertools import chain
from pyspark.mllib.linalg import DenseMatrix

Statistics.chiSqTest(DenseMatrix(
    numRows=crosstab.count(), numCols=len(crosstab.columns) - 1,
    values=list(chain(*zip(*crosstab.drop(crosstab.columns[0]).collect())))
))
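chiSqTest returns a ChiSqTestResult whose pValue, statistic, and degreesOfFreedom fields can be read off directly; a sketch, assuming the DenseMatrix built above is bound to a variable (dense_crosstab_matrix is a hypothetical name):
result = Statistics.chiSqTest(dense_crosstab_matrix)  # dense_crosstab_matrix: the DenseMatrix from above
print(result.pValue, result.statistic, result.degreesOfFreedom)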
