I have written some data preprocessing code as a Pandas UDF in PySpark. I'm using a lambda function to extract part of the text from every record of a column.
Here is how my code looks:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("string", PandasUDFType.SCALAR)
def get_X(col):
    return col.apply(lambda x: x.split(',')[-1] if len(x.split(',')) > 0 else x)

df = df.withColumn('X', get_X(df.Y))
This works fine and gives the desired results, but I need to write the same logic using Spark's built-in functions. Is there a way to do it? Thanks.
I think one function substring_index is enough for this particular task:
from pyspark.sql.functions import substring_index
df = spark.createDataFrame([(x,) for x in ['f,l', 'g', 'a,b,cd']], ['c1'])
df.withColumn('c2', substring_index('c1', ',', -1)).show()
+------+---+
| c1| c2|
+------+---+
| f,l| l|
| g| g|
|a,b,cd| cd|
+------+---+
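Applied to the DataFrame from your question (source column Y, new column X), the whole pandas UDF reduces to one line:
# same substring_index call, using the column names Y and X from the question
df = df.withColumn('X', substring_index('Y', ',', -1))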
Given the following DataFrame df:
df.show()
# +-------------+
# | BENF_NME|
# +-------------+
# | Doe, John|
# | Foo|
# |Baz, Quux,Bar|
# +-------------+
You can simply use regexp_extract() to select the first name:
from pyspark.sql.functions import regexp_extract
df.withColumn('First_Name', regexp_extract(df.BENF_NME, r'(?:.*,\s*)?(.*)', 1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
If you don't care about possible leading spaces, substring_index() provides a simple alternative to your original logic:
from pyspark.sql.functions import substring_index
df.withColumn('First_Name', substring_index(df.BENF_NME, ',', -1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
In this case the first row's First_Name has a leading space:
df.withColumn(...).collect()[0]
# Row(BENF_NME=u'Doe, John', First_Name=u' John')
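If the leading space matters, wrapping the expression in trim() keeps the one-liner while matching the regexp_extract() output:
from pyspark.sql.functions import substring_index, trim

df.withColumn('First_Name', trim(substring_index(df.BENF_NME, ',', -1))).show()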
If you still want to use a custom function, you need to create a user-defined function (UDF) using udf():
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
get_first_name = udf(lambda s: s.split(',')[-1], StringType())
df.withColumn('First_Name', get_first_name(df.BENF_NME)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
Note that UDFs are slower than the built-in Spark functions, especially Python UDFs.
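If you do want to keep a vectorized approach like your original code, a pandas UDF is usually faster than a plain Python UDF, though still slower than the built-ins above. A minimal sketch (the name get_first_name_vec is mine; assumes Spark 3.0+ with PyArrow installed for the type-hint style):
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def get_first_name_vec(s: pd.Series) -> pd.Series:
    # take the last comma-separated token, mirroring the original lambda
    return s.str.split(',').str[-1]

# df.withColumn('First_Name', get_first_name_vec(df.BENF_NME)).show()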
You can do the same using when to implement if-then-else logic:
First split the column, then compute its size. If the size is greater than 0, take the last element from the split array. Otherwise, return the original column.
from pyspark.sql.functions import split, size, when
def get_first_name(col):
    col_split = split(col, ',')
    split_size = size(col_split)
    return when(split_size > 0, col_split[split_size - 1]).otherwise(col)
As an example, suppose you had the following DataFrame:
df.show()
#+---------+
#| BENF_NME|
#+---------+
#|Doe, John|
#| Madonna|
#+---------+
You can call the new function just as before:
df = df.withColumn('First_Name', get_first_name(df.BENF_NME))
df.show()
#+---------+----------+
#| BENF_NME|First_Name|
#+---------+----------+
#|Doe, John| John|
#| Madonna| Madonna|
#+---------+----------+
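On Spark 2.4+, element_at() with a negative index reads the last array element directly, so the size() bookkeeping can be dropped. A sketch of that alternative:
from pyspark.sql.functions import split, element_at

# -1 means "last element", counting from the end of the split array
df.withColumn('First_Name', element_at(split(df.BENF_NME, ','), -1)).show()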
Suppose I have a dataframe in which a column has values like:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters, and also remove the trailing _ABZ if the value ends with it.
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods but they did not work.
from pyspark.sql import functions as f
new_df = df.withColumn("new_column", f.when((condition on some column),
                                             f.substring('Existing_COL', 4, f.length(f.col("Existing_COL")))))
Can anyone please tell me which function I can use in pyspark?
trim only removes whitespace and tab-like characters.
Based on your input and expected output, see the logic below:
from pyspark.sql.functions import *
df = spark.createDataFrame(data = [("ABC00909083888",) ,("ABC93890380380",) ,("XYZ7394949",) ,("XYZ3898302",) ,("PQR3799_ABZ",) ,("MGE8983_ABZ",)], schema = ["values",])
(df.withColumn("new_vals", when(col('values').rlike("(_ABZ$)"), regexp_replace(col('values'),r'(_ABZ$)', '')).otherwise(col('values')))
.withColumn("final_vals", expr(("substring(new_vals, 4 ,length(new_vals))")))
).show()
Output
+--------------+--------------+-----------+
| values| new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
| XYZ7394949| XYZ7394949| 7394949|
| XYZ3898302| XYZ3898302| 3898302|
| PQR3799_ABZ| PQR3799| 3799|
| MGE8983_ABZ| MGE8983| 8983|
+--------------+--------------+-----------+
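Both steps can also be collapsed into a single regexp_replace by alternating the two patterns; a sketch over the same values column, removing the first three characters and a trailing _ABZ:
from pyspark.sql.functions import col, regexp_replace

# ^.{3} strips the leading 3 characters, _ABZ$ strips the suffix when present
df.withColumn("final_vals", regexp_replace(col("values"), r'^.{3}|_ABZ$', '')).show()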
If I understand you correctly, and if you don't insist on using the pyspark substring or trim functions, you can define a function that does what you want and use it as a UDF in Spark:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def mysub(word):
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]

udf1 = udf(lambda x: mysub(x), StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id| label| new_label|
+---+--------------+-----------+
| 1|ABC00909083888|00909083888|
| 2|ABC93890380380|93890380380|
| 3| XYZ7394949| 7394949|
| 4| XYZ3898302| 3898302|
| 5| PQR3799_ABZ| 3799|
| 6| MGE8983_ABZ| 8983|
+---+--------------+-----------+
Please let me know if I got you wrong in some cases.
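One caveat (an assumption about your data, not something stated in the question): if the column can contain nulls, the UDF receives None and word.endswith would raise. A null-safe sketch:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def mysub_safe(word):
    # null rows arrive as None in a Python UDF, so guard against them
    if word is None:
        return None
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]

# df.withColumn('new_label', mysub_safe('label')).show()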
I have a df like this one:
df = spark.createDataFrame(
    [("1", "Apple", "cat"), ("2", "2.", "house"), ("3", "<strong>text</strong>", "HeLlo 2.5")],
    ["id", "text1", "text2"])
+---+---------------------+---------+
| id| text1| text2|
+---+---------------------+---------+
| 1| Apple| cat|
| 2| 2.| house|
| 3|<strong>text</strong>|HeLlo 2.5|
+---+---------------------+---------+
I also have multiple functions to clean the text, like:
import re
from lxml import html, etree

def remove_html_tags(text):
    document = html.fromstring(text)
    return " ".join(etree.XPath("//text()")(document))

def lowercase(text):
    return text.lower()

def remove_wrong_dot(text):
    return re.sub(r'(?<!\d)[.,;:]|[.,;:](?!\d)', ' ', text)
and a list of columns to clean
COLS = ["text1", "text2"]
I would like to apply the functions to the columns in the list while keeping the original text:
+---+---------------------+-----------+---------+-----------+
| id| text1|text1_clean| text2|text2_clean|
+---+---------------------+-----------+---------+-----------+
| 1| Apple| apple| cat| cat|
| 2| 2.| 2| house| house|
| 3|<strong>text</strong>| text|HeLlo 2.5| hello 2.5|
+---+---------------------+-----------+---------+-----------+
I already have an approach using UDF but it is not very efficient. I've been trying something like:
rdds = []
for col in COLS:
    rdd = df.rdd.map(lambda x: (x[col], lowercase(x[col])))
    rdds.append(rdd.collect())
return df
My idea would be to join all rdds in the list but I don't know how efficient this would be or how to list more functions.
I appreciate any ideas or suggestions.
EDIT: Not all transformations can be done with regexp_replace. For example, the text can include nested HTML tags, in which case a simple replace wouldn't work, or I may not want to replace all dots, only those at the beginning or end of substrings.
Spark's built-in functions can do all the transformations you want:
from pyspark.sql import functions as F
cols = ["text1", "text2"]
for c in cols:
    df = (df
          .withColumn(f'{c}_clean', F.lower(c))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', r'<[^>]+>', ''))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', r'(?<!\d)[.,;:]|[.,;:](?!\d)', ''))
          )
+---+--------------------+---------+-----------+-----------+
| id| text1| text2|text1_clean|text2_clean|
+---+--------------------+---------+-----------+-----------+
| 1| Apple| cat| apple| cat|
| 2| 2.| house| 2| house|
| 3|<strong>text</str...|HeLlo 2.5| text| hello 2.5|
+---+--------------------+---------+-----------+-----------+
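If you'd rather avoid chaining withColumn in a loop (each call adds another projection to the plan), the same cleaning can be built as a single select. A sketch reusing the same patterns and column list:
from pyspark.sql import functions as F

cols = ["text1", "text2"]
clean_cols = [
    F.regexp_replace(
        F.regexp_replace(F.lower(F.col(c)), r'<[^>]+>', ''),
        r'(?<!\d)[.,;:]|[.,;:](?!\d)', ''
    ).alias(f'{c}_clean')
    for c in cols
]
# keep all original columns and append the cleaned ones
df = df.select('*', *clean_cols)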
I have a DataFrame with the following schema:
+--------------------+--------+-------+-------+-----------+--------------------+
| userid|datadate| runid|variant|device_type| prediction|
+--------------------+--------+-------+-------+-----------+--------------------+
|0001d15b-e2da-4f4...|20220111|1196752| 1| Mobile| 0.8827571312010658|
|00021723-2a0d-497...|20220111|1196752| 1| Mobile| 0.30763173370229735|
|00021723-2a0d-497...|20220111|1196752| 0| Mobile| 0.5336206154783815|
I would like to perform the following operation:
For each "runid" and each "device_type", I want to do some calculations on the rows with variant==1 and variant==0, including a resampling (bootstrap) loop.
The ultimate goal is to store these calculations in another DataFrame.
So with a naive approach the code would look like this:
for runid in df.select('runid').distinct().rdd.flatMap(list).collect():
    for device in ["Mobile", "Desktop"]:
        a_variant = df.filter((df.runid == runid) & (df.device_type == device) & (df.variant == 0))
        b_variant = df.filter((df.runid == runid) & (df.device_type == device) & (df.variant == 1))
        ## do some more calculations here

        # bootstrap loop:
        for samp in range(100):
            sampled_vector_a = a_variant.select("prediction").sample(withReplacement=True, fraction=1.0, seed=123)
            sampled_vector_b = b_variant.select("prediction").sample(withReplacement=True, fraction=1.0, seed=123)
            ## do some more calculations here

        ## do some more calculations here
        ## store calculations in a new DataFrame
Currently the process is too slow.
How can I optimize this process by utilizing spark in the best way?
Thanks!
Here is a way to sample from each group in a dataframe after applying groupBy.
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("Demo").getOrCreate()

data = [["uid1","runid1",1,"Mobile",0.8], ["uid2","runid1",1,"Mobile",0.3],
        ["uid3","runid1",0,"Mobile",0.5], ["uid4","runid2",0,"Mobile",0.7],
        ["uid5","runid2",0,"Mobile",0.9]]
columns = ["userid","runid","variant","device_type","prediction"]

df = spark.createDataFrame(data, columns)
df.show()
# +------+------+-------+-----------+----------+
# |userid| runid|variant|device_type|prediction|
# +------+------+-------+-----------+----------+
# | uid1|runid1| 1| Mobile| 0.8|
# | uid2|runid1| 1| Mobile| 0.3|
# | uid3|runid1| 0| Mobile| 0.5|
# | uid4|runid2| 0| Mobile| 0.7|
# | uid5|runid2| 0| Mobile| 0.9|
# +------+------+-------+-----------+----------+
Define a sampling function that is going to be called by applyInPandas. The function my_sample extracts one sample for each input dataframe:
def my_sample(key, df):
    x = df.sample(n=1)
    return x
applyInPandas also needs a schema for its output; since the function returns whole rows, the output has the same fields as df:
from pyspark.sql.types import *

schema = StructType([StructField('userid', StringType()),
                     StructField('runid', StringType()),
                     StructField('variant', LongType()),
                     StructField('device_type', StringType()),
                     StructField('prediction', DoubleType())])
Just to check, try grouping the data; there are three groups:
df.groupby("runid", "device_type", "variant").mean("prediction").show()
# +------+-----------+-------+---------------+
# | runid|device_type|variant|avg(prediction)|
# +------+-----------+-------+---------------+
# |runid1| Mobile| 0| 0.5|
# |runid1| Mobile| 1| 0.55|
# |runid2| Mobile| 0| 0.8|
# +------+-----------+-------+---------------+
Now apply my_sample to each group using applyInPandas:
df.groupby("runid","device_type","variant").applyInPandas(my_sample, schema=schema).show()
# +------+------+-------+-----------+----------+
# |userid| runid|variant|device_type|prediction|
# +------+------+-------+-----------+----------+
# | uid3|runid1| 0| Mobile| 0.5|
# | uid2|runid1| 1| Mobile| 0.3|
# | uid4|runid2| 0| Mobile| 0.7|
# +------+------+-------+-----------+----------+
Note: I used applyInPandas since pyspark.sql.GroupedData.apply is deprecated.
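Following the same idea, the whole bootstrap loop from your question could be pushed into applyInPandas, so each (runid, device_type) group is resampled on the executors instead of in a driver-side loop. A rough sketch (the statistic computed here is just an illustrative mean difference, not your actual calculation, and the output column names are mine):
import pandas as pd

def bootstrap(key, pdf: pd.DataFrame) -> pd.DataFrame:
    a = pdf.loc[pdf.variant == 0, 'prediction']
    b = pdf.loc[pdf.variant == 1, 'prediction']
    diffs = []
    for i in range(100):
        # resample each variant with replacement and compare the means
        diffs.append(b.sample(frac=1.0, replace=True, random_state=i).mean()
                     - a.sample(frac=1.0, replace=True, random_state=i).mean())
    return pd.DataFrame({'runid': [key[0]],
                         'device_type': [key[1]],
                         'mean_diff': [float(pd.Series(diffs).mean())]})

out_schema = "runid string, device_type string, mean_diff double"
# df.groupby("runid", "device_type").applyInPandas(bootstrap, schema=out_schema).show()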
I'm trying to do the following, but for a column in pyspark, with no luck. Any idea on isolating just the lowercase characters in a column of a Spark df?
''.join('x' if x.islower() else 'X' if x.isupper() else x for x in text)
You can directly use regexp_replace to substitute the lowercase values with any desired value.
In your case you will have to chain regexp_replace calls to get the final output.
Data Preparation
inp_string = """
lRQWg2IZtB
hVzsJhPVH0
YXzc4fZDwu
qRyOUhT5Hn
b85O0H41RE
vOxPLFPWPy
fE6o5iMJ6I
918JI00EC7
x3yEYOCwek
m1eWY8rZwO
""".strip().split()
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'value': inp_string
})

sparkDF = spark.createDataFrame(df)
sparkDF.show()
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
Regex Replace
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value", r'[a-z]', "x"))
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value_modified", r'[A-Z]', "X"))
sparkDF.show()
+----------+--------------+
| value|value_modified|
+----------+--------------+
|lRQWg2IZtB| xXXXx2XXxX|
|hVzsJhPVH0| xXxxXxXXX0|
|YXzc4fZDwu| XXxx4xXXxx|
|qRyOUhT5Hn| xXxXXxX5Xx|
|b85O0H41RE| x85X0X41XX|
|vOxPLFPWPy| xXxXXXXXXx|
|fE6o5iMJ6I| xX6x5xXX6X|
|918JI00EC7| 918XX00XX7|
|x3yEYOCwek| x3xXXXXxxx|
|m1eWY8rZwO| x1xXX8xXxX|
+----------+--------------+
Using the following dataframe as an example
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
You can use the pyspark.sql function regexp_replace to isolate the lowercase letters in the column with the following code:
from pyspark.sql import functions
df = (df.withColumn("value",
functions.regexp_replace("value", r'[A-Z]|[0-9]|[,.;##?!&$]', "")))
df.show()
+-----+
|value|
+-----+
| lgt|
| hzsh|
|zcfwu|
| qyhn|
| b|
| vxy|
| foi|
| |
|xywek|
| merw|
+-----+
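A slightly more robust variant (an alternative sketch, not required by the question) uses a negated character class, so digits and punctuation don't have to be enumerated explicitly:
from pyspark.sql import functions

# drop every character that is not a lowercase ASCII letter
df.withColumn("value", functions.regexp_replace("value", r'[^a-z]', '')).show()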
Is there a way, in pyspark, to perform the substr function on a DataFrame column, without specifying the length? Namely, something like df["my-col"].substr(begin).
I am not sure why this function is not exposed as an API in the pyspark.sql.functions module.
Spark SQL supports the substring function without the len argument, i.e. substring(str, pos).
You can use it through the expr API of the functions module, as below, to achieve the same:
df.withColumn('substr_name', f.expr("substring(name, 2)")).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
How Spark does it internally:
If you look at the physical plan of the above statement, you will notice that when we don't pass len, Spark automatically adds 2147483647.
As @pault said in a comment, 2147483647 is the maximum positive value for a 32-bit signed integer (2^31 - 1).
df.withColumn('substr_name', f.expr("substring(name, 2)")).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 2147483647) AS substr_name#169]
+- Scan ExistingRDD[name#140,id#141L] --> 2147483647 is automatically added
The substring API in the functions module expects us to pass the length explicitly. If you want, you can pass any sufficiently large number as len to cover the maximum length of your column:
df.withColumn('substr_name', f.substring('name', 2, 100)).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
>>> df.withColumn('substr_name', f.substring('name', 2, 100)).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 100) AS substr_name#189]
+- Scan ExistingRDD[name#140,id#141L] --> 100 is what we passed
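If you need this often, a small helper can wrap the SQL form so it reads like the missing two-argument substr. A sketch (the helper name substr_from is mine, not part of the API):
import pyspark.sql.functions as f

def substr_from(colname, pos):
    # relies on Spark SQL defaulting len to 2147483647 when it is omitted
    return f.expr(f"substring({colname}, {pos})")

# df.withColumn('substr_name', substr_from('name', 2)).show()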
If the objective is to make a substring from a position given by a parameter begin to the end of the string, then you can do it as follows:
import pyspark.sql.functions as f
l = [(1, 'Prague'), (2, 'New York')]
df = spark.createDataFrame(l, ['id', 'city'])
begin = 2
l = (f.length('city') - f.lit(begin) + 1)
(
df
.withColumn('substr', f.col('city').substr(f.lit(begin), l))
).show()
+---+--------+-------+
| id| city| substr|
+---+--------+-------+
| 1| Prague| rague|
| 2|New York|ew York|
+---+--------+-------+
I'd create a UDF.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import StringType
>>> df = spark.createDataFrame([('Alice', 23), ('Brian', 25)], schema=["name", "age"])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 23|
|Brian| 25|
+-----+---+
>>> @F.udf(returnType=StringType())
... def substr_udf(col):
... return str(col)[2:]
>>> df = df.withColumn('substr', substr_udf('name'))
>>> df.show()
+-----+---+------+
| name|age|substr|
+-----+---+------+
|Alice| 23| ice|
|Brian| 25| ian|
+-----+---+------+
No, we need to specify both parameters, pos and len.
But make sure that both are of the same type, otherwise it will give an error:
Error: Column not iterable.
You can do it this way:
df = df.withColumn("new", F.col("previous").substr(F.lit(5), F.length("previous")-5))