pyspark replace lowercase characters in column with 'x' - python

I'm trying to do the following, but for a column in pyspark, with no luck. Any idea on isolating just the lowercase characters in a column of a Spark df?
''.join('x' if x.islower() else 'X' if x.isupper() else x for x in text)

You can directly use regexp_replace to substitute the lowercase values with any desired value -
In your case you will have to chain regexp_replace calls to get the final output -
Data Preparation
inp_string = """
lRQWg2IZtB
hVzsJhPVH0
YXzc4fZDwu
qRyOUhT5Hn
b85O0H41RE
vOxPLFPWPy
fE6o5iMJ6I
918JI00EC7
x3yEYOCwek
m1eWY8rZwO
""".strip().split()
df = pd.DataFrame({
'value':inp_string
})
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
Regex Replace
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value", r'[a-z]', "x"))
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value_modified", r'[A-Z]', "X"))
sparkDF.show()
+----------+--------------+
| value|value_modified|
+----------+--------------+
|lRQWg2IZtB| xXXXx2XXxX|
|hVzsJhPVH0| xXxxXxXXX0|
|YXzc4fZDwu| XXxx4xXXxx|
|qRyOUhT5Hn| xXxXXxX5Xx|
|b85O0H41RE| x85X0X41XX|
|vOxPLFPWPy| xXxXXXXXXx|
|fE6o5iMJ6I| xX6x5xXX6X|
|918JI00EC7| 918XX00XX7|
|x3yEYOCwek| x3xXXXXxxx|
|m1eWY8rZwO| x1xXX8xXxX|
+----------+--------------+
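For reference, the two replacements can also be nested into a single withColumn call (a minimal sketch using the same sparkDF):
from pyspark.sql import functions as F

# Same result with the two replacements nested in one expression:
# lowercase -> 'x' first, then uppercase -> 'X'
sparkDF = sparkDF.withColumn(
    'value_modified',
    F.regexp_replace(F.regexp_replace('value', r'[a-z]', 'x'), r'[A-Z]', 'X')
)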

Using the following dataframe as an example
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
You can use a pyspark.sql function called regexp_replace to isolate the lowercase letters in the column with the following code
from pyspark.sql import functions
df = (df.withColumn("value",
                    functions.regexp_replace("value", r'[A-Z]|[0-9]|[,.;##?!&$]', "")))
df.show()
+-----+
|value|
+-----+
| lgt|
| hzsh|
|zcfwu|
| qyhn|
| b|
| vxy|
| foi|
| |
|xywek|
| merw|
+-----+
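Equivalently, a negated character class keeps only the lowercase letters without having to spell out everything that should be removed (a small variant of the same idea, on the same df):
from pyspark.sql import functions

# Delete every character that is not a lowercase letter
df = df.withColumn("value", functions.regexp_replace("value", r'[^a-z]', ""))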

Related

Use enumerate to get partition columns from dataframe

I am trying to get all columns and their datatypes into one variable, and only the partition columns into another variable of list type, in Python.
I am getting the details from describe extended.
df = spark.sql("describe extended schema_name.table_name")
+----------------------------+-----------------------------+
|col_name                    |data_type                    |
+----------------------------+-----------------------------+
|col1                        |string                       |
|col2                        |int                          |
|col3                        |string                       |
|col4                        |int                          |
|col5                        |string                       |
|# Partition Information     |                             |
|# col_name                  |data_type                    |
|col4                        |int                          |
|col5                        |string                       |
|                            |                             |
|# Detailed Table Information|                             |
|Database                    |schema_name                  |
|Table                       |table_name                   |
|Owner                       |owner.name                   |
+----------------------------+-----------------------------+
Converting result into a list.
des_list=df.select(df.col_name,df.data_type).rdd.map(lambda x:(x[0],x[1])).collect()
Here is how I am trying to get all columns (all items before '# Partition Information').
all_cols_name_type = []
for index, item in enumerate(des_list):
    if item[0] == '# Partition Information':
        all_cols_name_type.append(des_list[:index])
For partitions, I would like to get everything between the item '# col_name' and the blank line before '# Detailed Table Information'.
Any help to be able to get this is appreciated.
You can try the following in Scala, or an equivalent in PySpark (sketched below):
val (partitionCols, dataCols) = spark.catalog.listColumns("schema_name.table_name")
  .collect()
  .partition(c => c.isPartition)
val parCols = partitionCols.map(c => (c.name, c.dataType))
val datCols = dataCols.map(c => (c.name, c.dataType))
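Since the question is in Python, a rough PySpark equivalent of the catalog approach would be (a sketch, assuming the table is registered in the catalog):
# Split the catalog's column listing into partition and data columns
cols = spark.catalog.listColumns("table_name", "schema_name")
partition_cols = [(c.name, c.dataType) for c in cols if c.isPartition]
data_cols = [(c.name, c.dataType) for c in cols if not c.isPartition]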
If the table is not defined in the catalog (e.g. when reading a parquet dataset directly from S3 using spark.read.parquet("s3://path/...")), then you can use the following snippet in Scala:
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2ScanRelation, FileScan}
import org.apache.spark.sql.types.StructType

val (partitionSchema, dataSchema) = df.queryExecution.optimizedPlan match {
  case LogicalRelation(hfs: HadoopFsRelation, _, _, _) =>
    (hfs.partitionSchema, hfs.dataSchema)
  case DataSourceV2ScanRelation(_, scan: FileScan, _) =>
    (scan.readPartitionSchema, scan.readDataSchema)
  case _ => (StructType(Seq()), StructType(Seq()))
}
val parCols = partitionSchema.map(f => (f.name, f.dataType))
val datCols = dataSchema.map(f => (f.name, f.dataType))
There is a trick to do so: you can use monotonically_increasing_id to give each row a number, find the row that has '# col_name', and get that index. Something like this:
My sample table
from pyspark.sql import functions as F

df = spark.sql('describe data')
df = df.withColumn('id', F.monotonically_increasing_id())
df.show()
+--------------------+---------+-------+---+
| col_name|data_type|comment| id|
+--------------------+---------+-------+---+
| c1| int| null| 0|
| c2| string| null| 1|
|# Partition Infor...| | | 2|
| # col_name|data_type|comment| 3|
| c2| string| null| 4|
+--------------------+---------+-------+---+
tricky part
idx = df.where(F.col('col_name') == '# col_name').first()['id']
# 3
partition_cols = [r['col_name'] for r in df.where(F.col('id') > idx).collect()]
# ['c2']
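For completeness, the plain-Python slicing described in the question can also be finished off directly on des_list (a sketch, assuming the rows keep the order shown above and the separator row has an empty col_name):
# Slice the collected (col_name, data_type) tuples from describe extended
names = [item[0].strip() for item in des_list]

part_idx = names.index('# Partition Information')   # everything before this is a data column
col_idx = names.index('# col_name')                  # partition columns start after this row
end_idx = names.index('', col_idx)                   # blank row before '# Detailed Table Information'

all_cols_name_type = des_list[:part_idx]
partition_cols_name_type = des_list[col_idx + 1:end_idx]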

Trim String Characters in Pyspark dataframe

Suppose I have a dataframe in which I have values in a column like:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters, and remove the trailing _ABZ suffix if the value ends with it.
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods but they did not work.
from pyspark.sql import functions as f
new_df = df.withColumn("new_column",
                       f.when((condition on some column),
                              f.substring('Existing_COL', 4, f.length(f.col("Existing_COL")))))
Can anyone please tell me which function I can use in pyspark? trim only removes whitespace or tab-like characters.
Based upon your input and expected output, see the logic below -
from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",), ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)],
    schema=["values"]
)

(df.withColumn("new_vals", when(col('values').rlike("(_ABZ$)"),
                                regexp_replace(col('values'), r'(_ABZ$)', '')).otherwise(col('values')))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
| values| new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
| XYZ7394949| XYZ7394949| 7394949|
| XYZ3898302| XYZ3898302| 3898302|
| PQR3799_ABZ| PQR3799| 3799|
| MGE8983_ABZ| MGE8983| 8983|
+--------------+--------------+-----------+
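Alternatively, the two steps can be collapsed into one pass with a single regexp_replace that strips both the leading three characters and an optional _ABZ suffix (a sketch on the same values column):
from pyspark.sql.functions import col, regexp_replace

# '^.{3}' drops the first three characters, '_ABZ$' drops the suffix when present
df.withColumn("final_vals", regexp_replace(col("values"), r'^.{3}|_ABZ$', '')).show()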
If I understand you correctly and you don't insist on using pyspark's substring or trim functions, you can easily define a function that does what you want and then apply it with a UDF in Spark:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def mysub(word):
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]

udf1 = udf(lambda x: mysub(x), StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id| label| new_label|
+---+--------------+-----------+
| 1|ABC00909083888|00909083888|
| 2|ABC93890380380|93890380380|
| 3| XYZ7394949| 7394949|
| 4| XYZ3898302| 3898302|
| 5| PQR3799_ABZ| 3799|
| 6| MGE8983_ABZ| 8983|
+---+--------------+-----------+
Please let me know if I got you wrong in some cases.

PySpark - Transform multiple columns without using udf

I have a df like this one:
df = spark.createDataFrame(
    [("1", "Apple", "cat"), ("2", "2.", "house"), ("3", "<strong>text</strong>", "HeLlo 2.5")],
    ["id", "text1", "text2"])
+---+---------------------+---------+
| id| text1| text2|
+---+---------------------+---------+
| 1| Apple| cat|
| 2| 2.| house|
| 3|<strong>text</strong>|HeLlo 2.5|
+---+---------------------+---------+
I have multiple functions to clean the text, like:
import re
from lxml import etree, html

def remove_html_tags(text):
    document = html.fromstring(text)
    return " ".join(etree.XPath("//text()")(document))

def lowercase(text):
    return text.lower()

def remove_wrong_dot(text):
    return re.sub(r'(?<!\d)[.,;:]|[.,;:](?!\d)', ' ', text)
and a list of columns to clean
COLS = ["text1", "text2"]
I would like to apply the functions to the columns in the list and also keep the original text
+---+---------------------+-----------+---------+-----------+
| id| text1|text1_clean| text2|text2_clean|
+---+---------------------+-----------+---------+-----------+
| 1| Apple| apple| cat| cat|
| 2| 2.| 2| house| house|
| 3|<strong>text</strong>| text|HeLlo 2.5| hello 2.5|
+---+---------------------+-----------+---------+-----------+
I already have an approach using UDF but it is not very efficient. I've been trying something like:
rdds = []
for col in TEXT_COLS:
    rdd = df.rdd.map(lambda x: (x[col], lowercase(x[col])))
    rdds.append(rdd.collect())
return df
My idea would be to join all rdds in the list but I don't know how efficient this would be or how to list more functions.
I appreciate any ideas or suggestions.
EDIT: Not all transformations can be done with regexp_replace. For example, the text can include nested html tags, in which case a simple replace wouldn't work; and I don't want to replace all dots, only those at the beginning or end of substrings.
Spark built-in functions can do all the transformations you wanted
from pyspark.sql import functions as F

cols = ["text1", "text2"]
for c in cols:
    df = (df
          .withColumn(f'{c}_clean', F.lower(c))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', '<[^>]+>', ''))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', r'(?<!\d)[.,;:]|[.,;:](?!\d)', ''))
          )
+---+--------------------+---------+-----------+-----------+
| id| text1| text2|text1_clean|text2_clean|
+---+--------------------+---------+-----------+-----------+
| 1| Apple| cat| apple| cat|
| 2| 2.| house| 2| house|
| 3|<strong>text</str...|HeLlo 2.5| text| hello 2.5|
+---+--------------------+---------+-----------+-----------+
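If you prefer a single projection instead of repeated withColumn calls, the same logic can be expressed in one select (a sketch equivalent to the loop above):
from pyspark.sql import functions as F

cols = ["text1", "text2"]

def clean(c):
    # lower-case, strip tags, then drop punctuation not surrounded by digits
    cleaned = F.lower(F.col(c))
    cleaned = F.regexp_replace(cleaned, '<[^>]+>', '')
    cleaned = F.regexp_replace(cleaned, r'(?<!\d)[.,;:]|[.,;:](?!\d)', '')
    return cleaned.alias(f'{c}_clean')

df = df.select('*', *[clean(c) for c in cols])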

PySpark equivalent for lambda function in Pandas UDF

I have written data preprocessing code using a Pandas UDF in PySpark. I'm using a lambda function to extract a part of the text from all the records of a column.
Here is what my code looks like:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("string", PandasUDFType.SCALAR)
def get_first_name(col):
    return col.apply(lambda x: x.split(',')[-1] if len(x.split(',')) > 0 else x)

df = df.withColumn('X', get_first_name(df.Y))
This is working fine and giving the desired results, but I need to write the same piece of logic in equivalent native Spark code. Is there a way to do it? Thanks.
I think one function substring_index is enough for this particular task:
from pyspark.sql.functions import substring_index
df = spark.createDataFrame([(x,) for x in ['f,l', 'g', 'a,b,cd']], ['c1'])
df.withColumn('c2', substring_index('c1', ',', -1)).show()
+------+---+
| c1| c2|
+------+---+
| f,l| l|
| g| g|
|a,b,cd| cd|
+------+---+
Given the following DataFrame df:
df.show()
# +-------------+
# | BENF_NME|
# +-------------+
# | Doe, John|
# | Foo|
# |Baz, Quux,Bar|
# +-------------+
You can simply use regexp_extract() to select the first name:
from pyspark.sql.functions import regexp_extract
df.withColumn('First_Name', regexp_extract(df.BENF_NME, r'(?:.*,\s*)?(.*)', 1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
If you don't care about possible leading spaces, substring_index() provides a simple alternative to your original logic:
from pyspark.sql.functions import substring_index
df.withColumn('First_Name', substring_index(df.BENF_NME, ',', -1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
In this case the first row's First_Name has a leading space:
df.withColumn(...).collect()[0]
# Row(BENF_NME=u'Doe, John', First_Name=u' John')
If you still want to use a custom function, you need to create a user-defined function (UDF) using udf():
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
get_first_name = udf(lambda s: s.split(',')[-1], StringType())
df.withColumn('First_Name', get_first_name(df.BENF_NME)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
Note that UDFs are slower than the built-in Spark functions, especially Python UDFs.
You can do the same using when to implement if-then-else logic:
First split the column, then compute its size. If the size is greater than 0, take the last element from the split array. Otherwise, return the original column.
from pyspark.sql.functions import split, size, when
def get_first_name(col):
    col_split = split(col, ',')
    split_size = size(col_split)
    return when(split_size > 0, col_split[split_size - 1]).otherwise(col)
As an example, suppose you had the following DataFrame:
df.show()
#+---------+
#| BENF_NME|
#+---------+
#|Doe, John|
#| Madonna|
#+---------+
You can call the new function just as before:
df = df.withColumn('First_Name', get_first_name(df.BENF_NME))
df.show()
#+---------+----------+
#| BENF_NME|First_Name|
#+---------+----------+
#|Doe, John| John|
#| Madonna| Madonna|
#+---------+----------+

PySpark "explode" dict in column

I have a column 'true_recoms' in spark dataframe:
-RECORD 17-----------------------------------------------------------------
item | 20380109
true_recoms | {"5556867":1,"5801144":5,"7397596":21}
I need to 'explode' this column to get something like this:
item | 20380109
recom_item | 5556867
recom_cnt | 1
..............
item | 20380109
recom_item | 5801144
recom_cnt | 5
..............
item | 20380109
recom_item | 7397596
recom_cnt | 21
I've tried to use from_json but it doesn't work:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema_json = StructType(fields=[
    StructField("item", StringType()),
    StructField("recoms", StringType())
])
df.select(col("true_recoms"),from_json(col("true_recoms"), schema_json)).show(5)
+--------+--------------------+------+
| item| true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...| [,]|
|17359322|{"17359392":1,"17...| [,]|
|31480894|{"31480598":1,"31...| [,]|
| 7265665|{"7265891":1,"503...| [,]|
|31350949|{"32218698":1,"31...| [,]|
+--------+--------------------+------+
only showing top 5 rows
The schema is incorrectly defined. You declare it to be a struct with two string fields,
item
recoms
while neither field is present in the document.
Unfortunately from_json can return only structs or arrays of structs, so redefining the schema as
MapType(StringType(), LongType())
is not an option.
Personally I would use a udf:
from pyspark.sql.functions import udf, explode
import json

@udf("map<string, bigint>")
def parse(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        pass
which can be applied like this
df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)
df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# | item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548| 5801144| 5|
# |31746548| 7397596| 21|
# |31746548| 5556867| 1|
# +--------+----------+---------+
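As a side note, newer Spark versions let from_json parse directly into a map, which avoids the UDF altogether (a sketch, assuming a Spark version where from_json accepts a map schema given as a DDL string):
from pyspark.sql.functions import col, explode, from_json

# Parse the JSON string straight into map<string, bigint>, then explode it
parsed = df.withColumn("recoms_map", from_json(col("true_recoms"), "map<string, bigint>"))
parsed.select("item", explode("recoms_map").alias("recom_item", "recom_cnt")).show()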
