I have a column 'true_recoms' in spark dataframe:
-RECORD 17-----------------------------------------------------------------
item | 20380109
true_recoms | {"5556867":1,"5801144":5,"7397596":21}
I need to 'explode' this column to get something like this:
item | 20380109
recom_item | 5556867
recom_cnt | 1
..............
item | 20380109
recom_item | 5801144
recom_cnt | 5
..............
item | 20380109
recom_item | 7397596
recom_cnt | 21
I've tried to use from_json but it doesn't work:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema_json = StructType(fields=[
    StructField("item", StringType()),
    StructField("recoms", StringType())
])
df.select(col("true_recoms"), from_json(col("true_recoms"), schema_json)).show(5)
+--------+--------------------+------+
| item| true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...| [,]|
|17359322|{"17359392":1,"17...| [,]|
|31480894|{"31480598":1,"31...| [,]|
| 7265665|{"7265891":1,"503...| [,]|
|31350949|{"32218698":1,"31...| [,]|
+--------+--------------------+------+
only showing top 5 rows
The schema is incorrectly defined: you declare it to be a struct with two string fields, item and recoms, while neither field is present in the document.
Unfortunately from_json can return only structs or arrays of structs, so redefining the schema as
MapType(StringType(), LongType())
is not an option.
Personally I would use a udf:
from pyspark.sql.functions import udf, explode
import json
@udf("map<string, bigint>")
def parse(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        pass
which can be applied like this:
df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)
df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# | item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548| 5801144| 5|
# |31746548| 7397596| 21|
# |31746548| 5556867| 1|
# +--------+----------+---------+
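Note: more recent Spark releases (2.3 and later, if I remember correctly) do accept a MapType or DDL map schema in from_json, so on a newer version you can likely skip the udf entirely. A minimal sketch, assuming such a version:
# Sketch only: relies on from_json accepting a map schema, available in newer Spark versions.
from pyspark.sql.functions import col, explode, from_json

(df
 .select("item",
         explode(from_json(col("true_recoms"), "map<string, bigint>"))
         .alias("recom_item", "recom_cnt"))
 .show())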
I am trying to get all columns and their datatypes into one variable, and only the partition columns into another variable of list type, in Python.
I am getting the details from describe extended:
df = spark.sql("describe extended schema_name.table_name")
+----------------------------+-----------------------------+
|col_name                    |data_type                    |
+----------------------------+-----------------------------+
|col1                        |string                       |
|col2                        |int                          |
|col3                        |string                       |
|col4                        |int                          |
|col5                        |string                       |
|# Partition Information     |                             |
|# col_name                  |data_type                    |
|col4                        |int                          |
|col5                        |string                       |
|                            |                             |
|# Detailed Table Information|                             |
|Database                    |schema_name                  |
|Table                       |table_name                   |
|Owner                       |owner.name                   |
Converting the result into a list:
des_list=df.select(df.col_name,df.data_type).rdd.map(lambda x:(x[0],x[1])).collect()
Here is how I am trying to get all the columns (all items until just before # Partition Information):
all_cols_name_type = []
for index, item in enumerate(des_list):
    if item[0] == '# Partition Information':
        all_cols_name_type.append(des_list[:index])
For the partitions, I would like to get everything between the item '# col_name' and the blank line that comes before '# Detailed Table Information'.
Any help is appreciated.
You can try the following approach, or its equivalent, in Scala:
val (partitionCols, dataCols) = spark.catalog.listColumns("schema_name.table_name")
  .collect()
  .partition(c => c.isPartition)

val parCols = partitionCols.map(c => (c.name, c.dataType))
val datCols = dataCols.map(c => (c.name, c.dataType))
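Since the question is in Python, here is a rough PySpark equivalent of the same catalog-based approach (a sketch, assuming the table is registered in the catalog):
# Sketch: spark.catalog.listColumns returns Column entries with name, dataType and isPartition.
cols = spark.catalog.listColumns("table_name", "schema_name")
partition_cols = [(c.name, c.dataType) for c in cols if c.isPartition]
data_cols = [(c.name, c.dataType) for c in cols if not c.isPartition]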
If the table is not defined in the catalog (e.g. reading a parquet dataset directly from S3 using spark.read.parquet("s3://path/...")), then you can use the following snippet in Scala:
val (partitionSchema, dataSchema) = df.queryExecution.optimizedPlan match {
  case LogicalRelation(hfs: HadoopFsRelation, _, _, _) =>
    (hfs.partitionSchema, hfs.dataSchema)
  case DataSourceV2ScanRelation(_, scan: FileScan, _) =>
    (scan.readPartitionSchema, scan.readDataSchema)
  case _ => (StructType(Seq()), StructType(Seq()))
}

val parCols = partitionSchema.map(f => (f.name, f.dataType))
val datCols = dataSchema.map(f => (f.name, f.dataType))
There is a trick to do this: use monotonically_increasing_id to give each row a number, find the row that contains '# col_name', and take its index. Something like this.
My sample table:
from pyspark.sql import functions as F

df = spark.sql('describe data')
df = df.withColumn('id', F.monotonically_increasing_id())
df.show()
+--------------------+---------+-------+---+
| col_name|data_type|comment| id|
+--------------------+---------+-------+---+
| c1| int| null| 0|
| c2| string| null| 1|
|# Partition Infor...| | | 2|
| # col_name|data_type|comment| 3|
| c2| string| null| 4|
+--------------------+---------+-------+---+
The tricky part:
idx = df.where(F.col('col_name') == '# col_name').first()['id']
# 3
partition_cols = [r['col_name'] for r in df.where(F.col('id') > idx).collect()]
# ['c2']
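If you also need the data columns (everything before '# Partition Information'), the same trick works with that marker; a minimal sketch:
# Sketch: rows before the '# Partition Information' marker are the regular data columns.
part_info_idx = df.where(F.col('col_name') == '# Partition Information').first()['id']
data_cols = [(r['col_name'], r['data_type'])
             for r in df.where(F.col('id') < part_info_idx).collect()]
# [('c1', 'int'), ('c2', 'string')]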
Suppose I have a dataframe in which the values in a column look like this:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters, and also remove the trailing _ABZ when a value ends with it, so the result looks like this:
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods but they did not work.
from pyspark.sql import functions as f

new_df = df.withColumn("new_column", f.when((condition on some column),
    f.substring('Existing_COL', 4, f.length(f.col("Existing_COL")))))
Can anyone please tell me which function I can use in PySpark? trim only removes whitespace and tab characters.
Based on your input and expected output, see the logic below:
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",), ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)],
    schema=["values"]
)

(df.withColumn("new_vals", when(col('values').rlike(r'(_ABZ$)'), regexp_replace(col('values'), r'(_ABZ$)', '')).otherwise(col('values')))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
| values| new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
| XYZ7394949| XYZ7394949| 7394949|
| XYZ3898302| XYZ3898302| 3898302|
| PQR3799_ABZ| PQR3799| 3799|
| MGE8983_ABZ| MGE8983| 8983|
+--------------+--------------+-----------+
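If you prefer, the two steps can also be collapsed into a single expression built from chained regexp_replace calls; a small sketch that should produce the same final_vals:
# Sketch: first strip a trailing '_ABZ' (if any), then drop the first three characters.
df.withColumn(
    "final_vals",
    regexp_replace(regexp_replace(col("values"), r"(_ABZ$)", ""), r"^.{3}", "")
).show()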
If I understand you correctly, and if you don't insist on using the pyspark substring or trim functions, you can easily define a function to do what you want and then make use of it with a udf in Spark:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def mysub(word):
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]

udf1 = udf(lambda x: mysub(x), StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id| label| new_label|
+---+--------------+-----------+
| 1|ABC00909083888|00909083888|
| 2|ABC93890380380|93890380380|
| 3| XYZ7394949| 7394949|
| 4| XYZ3898302| 3898302|
| 5| PQR3799_ABZ| 3799|
| 6| MGE8983_ABZ| 8983|
+---+--------------+-----------+
Please let me know if I got you wrong in some cases.
I'm trying to do the following, but for a column in PySpark, with no luck. Any idea on isolating just the lowercase characters in a column of a Spark df?
''.join('x' if x.islower() else 'X' if x.isupper() else x for x in text)
You can directly use regexp_replace to substitute the lowercase characters with any desired value.
In your case you will have to chain regexp_replace calls to get the final output.
Data Preparation
inp_string = """
lRQWg2IZtB
hVzsJhPVH0
YXzc4fZDwu
qRyOUhT5Hn
b85O0H41RE
vOxPLFPWPy
fE6o5iMJ6I
918JI00EC7
x3yEYOCwek
m1eWY8rZwO
""".strip().split()
df = pd.DataFrame({
'value':inp_string
})
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
Regex Replace
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value", r'[a-z]', "x"))
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value_modified", r'[A-Z]', "X"))
sparkDF.show()
+----------+--------------+
| value|value_modified|
+----------+--------------+
|lRQWg2IZtB| xXXXx2XXxX|
|hVzsJhPVH0| xXxxXxXXX0|
|YXzc4fZDwu| XXxx4xXXxx|
|qRyOUhT5Hn| xXxXXxX5Xx|
|b85O0H41RE| x85X0X41XX|
|vOxPLFPWPy| xXxXXXXXXx|
|fE6o5iMJ6I| xX6x5xXX6X|
|918JI00EC7| 918XX00XX7|
|x3yEYOCwek| x3xXXXXxxx|
|m1eWY8rZwO| x1xXX8xXxX|
+----------+--------------+
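The two chained withColumn calls can also be nested into a single expression if you prefer; a small sketch:
# Sketch: nest the replacements instead of chaining two withColumn calls.
sparkDF = sparkDF.withColumn(
    'value_modified',
    F.regexp_replace(F.regexp_replace('value', r'[a-z]', 'x'), r'[A-Z]', 'X')
)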
Using the following dataframe as an example
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
You can use a pyspark.sql function called regexp_replace to isolate the lowercase letters in the column with the following code:
from pyspark.sql import functions
df = (df.withColumn("value",
functions.regexp_replace("value", r'[A-Z]|[0-9]|[,.;##?!&$]', "")))
df.show()
+-----+
|value|
+-----+
| lgt|
| hzsh|
|zcfwu|
| qyhn|
| b|
| vxy|
| foi|
| |
|xywek|
| merw|
+-----+
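If the column may contain characters other than the ones listed in that pattern, a negated character class is a slightly more general option; a minimal sketch:
# Sketch: keep only lowercase letters, whatever the other characters happen to be.
df = df.withColumn("value", functions.regexp_replace("value", r'[^a-z]', ""))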
This solution, in theory, works perfectly for what I need, which is to create a new, copied version of a dataframe while excluding certain nested StructFields. Here is a minimal reproducible example of my issue:
>>> df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
 |    |    |-- delete: string (nullable = true)
which you can instantiate like so:
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
My goal is to convert the dataframe (along with the values in the columns I want to keep) to one that excludes certain nested structs, like delete for example.
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
According to the solution I linked, which tries to leverage pyspark.sql's to_json and from_json functions, it should be accomplishable with something like this:
new_schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType())
])))])

test_df = df.withColumn("big", to_json(col("big"))).withColumn("big", from_json(col("big"), new_schema))
>>> test_df.printSchema()
root
 |-- big: struct (nullable = true)
 |    |-- big: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- keep: string (nullable = true)
>>> test_df.show()
+----+
| big|
+----+
|null|
+----+
So either I'm not following his directions right, or it doesn't work. How do you do this without a udf?
Pyspark to_json documentation
Pyspark from_json documentation
It should be working; you just need to adjust your new_schema so that it describes only the column 'big', not the whole dataframe:
new_schema = ArrayType(StructType([StructField("keep", StringType())]))
test_df = df.withColumn("big", from_json(to_json("big"), new_schema))
I have written data preprocessing code as a Pandas UDF in PySpark. I'm using a lambda function to extract part of the text from all the records of a column.
Here is what my code looks like:
@pandas_udf("string", PandasUDFType.SCALAR)
def get_X(col):
    return col.apply(lambda x: x.split(',')[-1] if len(x.split(',')) > 0 else x)

df = df.withColumn('X', get_X(df.Y))
This is working fine and giving the desired results, but I need to write the same piece of logic in equivalent Spark code. Is there a way to do it? Thanks.
I think one function substring_index is enough for this particular task:
from pyspark.sql.functions import substring_index
df = spark.createDataFrame([(x,) for x in ['f,l', 'g', 'a,b,cd']], ['c1'])
df.withColumn('c2', substring_index('c1', ',', -1)).show()
+------+---+
| c1| c2|
+------+---+
| f,l| l|
| g| g|
|a,b,cd| cd|
+------+---+
Given the following DataFrame df:
df.show()
# +-------------+
# | BENF_NME|
# +-------------+
# | Doe, John|
# | Foo|
# |Baz, Quux,Bar|
# +-------------+
You can simply use regexp_extract() to select the first name:
from pyspark.sql.functions import regexp_extract
df.withColumn('First_Name', regexp_extract(df.BENF_NME, r'(?:.*,\s*)?(.*)', 1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
If you don't care about possible leading spaces, substring_index() provides a simple alternative to your original logic:
from pyspark.sql.functions import substring_index
df.withColumn('First_Name', substring_index(df.BENF_NME, ',', -1)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
In this case the first row's First_Name has a leading space:
df.withColumn(...).collect()[0]
# Row(BENF_NME=u'Doe, John', First_Name=u' John')
If you still want to use a custom function, you need to create a user-defined function (UDF) using udf():
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
get_first_name = udf(lambda s: s.split(',')[-1], StringType())
df.withColumn('First_Name', get_first_name(df.BENF_NME)).show()
# +-------------+----------+
# | BENF_NME|First_Name|
# +-------------+----------+
# | Doe, John| John|
# | Foo| Foo|
# |Baz, Quux,Bar| Bar|
# +-------------+----------+
Note that UDFs are slower than the built-in Spark functions, especially Python UDFs.
You can do the same using when to implement if-then-else logic:
First split the column, then compute its size. If the size is greater than 0, take the last element from the split array. Otherwise, return the original column.
from pyspark.sql.functions import split, size, when
def get_first_name(col):
    col_split = split(col, ',')
    split_size = size(col_split)
    return when(split_size > 0, col_split[split_size - 1]).otherwise(col)
As an example, suppose you had the following DataFrame:
df.show()
#+---------+
#| BENF_NME|
#+---------+
#|Doe, John|
#| Madonna|
#+---------+
You can call the new function just as before:
df = df.withColumn('First_Name', get_first_name(df.BENF_NME))
df.show()
#+---------+----------+
#| BENF_NME|First_Name|
#+---------+----------+
#|Doe, John| John|
#| Madonna| Madonna|
#+---------+----------+