Pyspark: Replacing value in a column by searching a dictionary - python

I'm a newbie in PySpark.
I have a Spark DataFrame df that has a column 'device_type'.
I want to replace every value that is "Tablet" or "Phone" with "Mobile", and replace "PC" with "Desktop".
In pandas I can do the following:
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict,inplace=False)
How can I achieve this using PySpark? Thanks!

You can use either na.replace (when a dict is passed, the replacement values come from the dict itself, so the second argument is only a placeholder):
df = spark.createDataFrame([
('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])
df.na.replace(deviceDict, 1).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
or map literal:
from itertools import chain
from pyspark.sql.functions import create_map, lit
mapping = create_map([lit(x) for x in chain(*deviceDict.items())])
df.select(mapping[df['device_type']].alias('device_type')).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| null|
| null|
+-----------+
Please note that the latter solution will convert values not present in the mapping to NULL. If this is not the desired behavior, you can add coalesce:
from pyspark.sql.functions import coalesce
df.select(
    coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+

After a lot of searching and trying alternatives, I think the simplest way to replace values using a Python dict is with the PySpark DataFrame method replace:
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df_replace = df.replace(deviceDict,subset=['device_type'])
This will replace all matching values using the dict. You can get the same result with df.na.replace() if you pass a dict combined with a subset argument. The docs are not entirely clear here: if you search for replace you will find two references, one under pyspark.sql.DataFrame.replace and the other under pyspark.sql.DataFrameNaFunctions.replace, but the sample code for both uses df.na.replace, so it is not obvious that you can also call df.replace directly.
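For completeness, a quick sketch (reusing the df and deviceDict from the first answer) showing that both spellings behave the same, since DataFrame.replace and DataFrameNaFunctions.replace are aliases of each other:
# Both calls restrict the replacement to device_type and produce identical results.
df.replace(deviceDict, subset=['device_type']).show()
df.na.replace(deviceDict, subset=['device_type']).show()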

Here is a little helper function, inspired by the R recode function, that abstracts the previous answers. As a bonus, it adds the option for a default value.
from itertools import chain
from pyspark.sql.functions import col, create_map, lit, when, isnull
from pyspark.sql.column import Column
df = spark.createDataFrame([
('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df.show()
+-----------+
|device_type|
+-----------+
| Tablet|
| Phone|
| PC|
| Other|
| null|
+-----------+
Here is the definition of recode.
def recode(col_name, map_dict, default=None):
    # Allow either a column name string or a Column instance to be passed
    if not isinstance(col_name, Column):
        col_name = col(col_name)
    mapping_expr = create_map([lit(x) for x in chain(*map_dict.items())])
    if default is None:
        return mapping_expr.getItem(col_name)
    else:
        return when(~isnull(mapping_expr.getItem(col_name)),
                    mapping_expr.getItem(col_name)).otherwise(default)
Creating a column without a default gives null/None in all unmatched values.
df.withColumn("device_type", recode('device_type', deviceDict)).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| null|
| null|
+-----------+
On the other hand, specifying a value for default replaces all unmatched values with this default.
df.withColumn("device_type", recode('device_type', deviceDict, default='Other')).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| Other|
+-----------+

You can do this using df.withColumn too:
from itertools import chain
from pyspark.sql.functions import create_map, lit
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
mapping_expr = create_map([lit(x) for x in chain(*deviceDict.items())])
df = df.withColumn('device_type', mapping_expr[df['device_type']])
df.show()

The simplest way to do it is to apply a udf on your dataframe:
from pyspark.sql.functions import col, udf
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
map_func = udf(lambda row : deviceDict.get(row,row))
df = df.withColumn("device_type", map_func(col("device_type")))

Another way of solving this is to build a CASE WHEN expression in traditional SQL, using f-strings and the Python dictionary together with .join to generate the statement automatically:
import pyspark.sql.functions as F

column = 'device_type'  # column to replace
e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'"
                        for k, v in deviceDict.items()])} ELSE {column} END"""
df.withColumn(column, F.expr(e)).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
Note: if you want to return NULL where the keys do not match, just change ELSE {column} END to ELSE NULL END in the CASE statement for the variable e:
column = 'device_type'  # column to replace
e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'"
                        for k, v in deviceDict.items()])} ELSE NULL END"""
df.withColumn('New_Col', F.expr(e)).show()
+-----------+-------+
|device_type|New_Col|
+-----------+-------+
| Tablet| Mobile|
| Phone| Mobile|
| PC|Desktop|
| Other| null|
| null| null|
+-----------+-------+

Related

Replacing column values by dict pyspark

I have a dictionary like this
d = {"animal": ["cat", "dog", "turtle"], "fruit" : ["banana", "apple"]}
and a df:
+-----------+
|some_column|
+-----------+
| banana|
| cat|
| apple|
| other|
| null|
+-----------+
I'd like to get this as output:
+-----------+
|some_column|
+-----------+
| fruit|
| animal|
| fruit|
| other|
| null|
+-----------+
I know that if I had a dictionary like this
{"apple": "fruit", "banana": "fruit", ...}
I could use df.na.replace, and of course I can work through my given dictionary and convert it into that form.
But is there a way of getting my desired output without changing the dictionary?
Create a dataframe from the dictionary and join the dataframes.
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

d = {"animal": ["cat", "dog", "turtle"], "fruit": ["banana", "apple"]}
df = spark.createDataFrame([[d]], ['data'])
df = df.select(f.explode('data'))
df.show()
df.printSchema()

data = ['banana', 'cat', 'apple', 'other', None]
df2 = spark.createDataFrame(data, StringType()).toDF('some_column')
df2.show()

df2.join(df, f.array_contains(f.col('value'), f.col('some_column')), 'left') \
   .select(f.coalesce('key', 'some_column').alias('some_column')) \
   .show()
+------+------------------+
| key| value|
+------+------------------+
|animal|[cat, dog, turtle]|
| fruit| [banana, apple]|
+------+------------------+
root
|-- key: string (nullable = false)
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)
+-----------+
|some_column|
+-----------+
| banana|
| cat|
| apple|
| other|
| null|
+-----------+
+-----------+
|some_column|
+-----------+
| fruit|
| animal|
| fruit|
| other|
| null|
+-----------+
import pandas as pd

lx = {"animal": ["cat", "dog", "turtle"], "fruit": ["banana", "apple"]}
df = pd.DataFrame({'input': ['banana', 'cat', 'apple', 'other', 'null']})
ls_input = df['input'].to_list()

# Invert the dict, see https://stackoverflow.com/questions/483666/reverse-invert-a-dictionary-mapping
lx_inv = {vi: k for k, v in lx.items() for vi in v}

y = []
for x in ls_input:
    try:
        y.append(lx_inv[x])
    except KeyError:
        y.append(x)

df2 = pd.DataFrame(data=y, columns=['output'])
This creates an inverted dictionary. I'm not sure what you mean exactly by 'not changing the dictionary'; this method builds a new dict for the comparisons and leaves the original untouched. Also, there are probably some nuances about duplicates (can a value belong to two keys in the original dict?) and about missing/undefined cases, but you would need to specify the possible cases and the desired outcomes for those.
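If deriving a new inverted dict is acceptable (the original dictionary is never modified), the same idea works directly in PySpark with na.replace; a minimal sketch, assuming the d and df2 defined in the join answer above:
# Invert {"fruit": ["banana", "apple"], ...} into {"banana": "fruit", "apple": "fruit", ...}
d = {"animal": ["cat", "dog", "turtle"], "fruit": ["banana", "apple"]}
d_inv = {vi: k for k, v in d.items() for vi in v}

# Unmatched values ('other', null) are left as-is by na.replace.
df2.na.replace(d_inv, subset=['some_column']).show()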

How to add trailer row to a Pyspark data frame having row count

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder \
    .appName('SparkByExamples.com') \
    .getOrCreate()

data = [('James', 'Smith', 'M', 3000), ('Anna', 'Rose', 'F', 4100),
        ('Robert', 'Williams', 'M', 6200)]
columns = ["firstname", "lastname", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

df2 = df.select(lit("D").alias("S"), "*")
df2.show()
Output:
+---+---------+--------+------+------+
| S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
| D| James| Smith| M| 3000|
| D| Anna| Rose| F| 4100|
| D| Robert|Williams| M| 6200|
+---+---------+--------+------+------+
Required Output:
I need to add an extra row with "T" in column "S" and the row count in column "firstname", like below. Column "firstname" can be of any type.
+---+---------+--------+------+------+
| S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
| D| James| Smith| M| 3000|
| D| Anna| Rose| F| 4100|
| D| Robert|Williams| M| 6200|
| T| 3 | | | |
+---+---------+--------+------+------+
I tried creating a new data frame with the trailer values and applying union, as suggested in most Stack Overflow solutions, but both dataframes must have the same number of columns.
Is there a better way to put the count in the trailer regardless of the column type of "firstname"?
Since you want to create a new row irrespective of column type, you can write a function that takes the column name as an input, and returns a dictionary containing all of the necessary information for the new row including the number of entries in that column.
To create an output pyspark dataframe like the one you've shown, every column will have to be a string type, because the new row has to contain an empty string '' for the columns lastname, gender and salary. You cannot have mixed types in pyspark columns (see here), so when you create a union between df2 and total_row_df, any columns that are string type in total_row_df will be coerced to string type in the resulting dataframe.
from pyspark.sql.functions import count

def create_total_row(col_name):
    total_row = {}
    for col in df2.columns:
        if col == 'S':
            total_row[col] = 'T'
        elif col == col_name:
            total_row[col] = df2.select(count(df2[col_name])).collect()[0][0]
        else:
            total_row[col] = ''
    return total_row
total_row = create_total_row('firstname')
total_row_df = spark.createDataFrame([total_row])
df2.union(total_row_df).show()
Result:
+---+---------+--------+------+------+
| S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
| D| James| Smith| M| 3000|
| D| Anna| Rose| F| 4100|
| D| Robert|Williams| M| 6200|
| T| 3| | | |
+---+---------+--------+------+------+
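If you'd rather not rely on union's implicit type coercion (or on positional column matching), one option is to cast both sides to string explicitly and match columns by name; a sketch, assuming the same df2 and total_row as above:
import pyspark.sql.functions as F

total_row_df = spark.createDataFrame([total_row])

# Cast every column to string on both sides so the schemas line up exactly,
# then union by column name instead of position.
df2_str = df2.select([F.col(c).cast('string') for c in df2.columns])
total_str = total_row_df.select([F.col(c).cast('string') for c in total_row_df.columns])
df2_str.unionByName(total_str).show()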

select from a column made of string array pyspark or python high order function multiple values

I have a table like this and I want to create a new column based on what is listed in the booksInterested column. For example, I tried something like:
df.withColumn('good', when('java' or 'php' isin ['booksInterested']).lit(1).otherwise(0))
The desired output: when the array contains Java or PHP, get 1, else 0.
You can use array_contains for this directly, and a higher-order function such as forall or exists for multiple values; additionally, you can browse through this article to understand more.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

d = {
    'name': ['James', 'Washington', 'Robert', 'Micheal'],
    'booksInterested': [['Java', 'C#', 'Python'], [], ['PHP', 'Java'], ['Java']]
}

sparkDF = spark.createDataFrame(pd.DataFrame(d))
sparkDF.show()
+----------+------------------+
| name| booksInterested|
+----------+------------------+
| James|[Java, C#, Python]|
|Washington| []|
| Robert| [PHP, Java]|
| Micheal| [Java]|
+----------+------------------+
Array Contains
sparkDF = sparkDF.withColumn('good', F.array_contains(F.col('booksInterested'), 'Java'))
sparkDF.show()
+----------+------------------+-----+
| name| booksInterested| good|
+----------+------------------+-----+
| James|[Java, C#, Python]| true|
|Washington| []|false|
| Robert| [PHP, Java]| true|
| Micheal| [Java]| true|
+----------+------------------+-----+
ForAll Array Contains - Multiple
sparkDF = sparkDF.withColumn('good_multiple',F.forall(F.col('booksInterested'), lambda x: x.isin(['Java','Python','PHP'])))
sparkDF.show()
+----------+------------------+-----+-------------+
| name| booksInterested| good|good_multiple|
+----------+------------------+-----+-------------+
| James|[Java, C#, Python]| true| false|
|Washington| []|false| true|
| Robert| [PHP, Java]| true| true|
| Micheal| [Java]| true| true|
+----------+------------------+-----+-------------+
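Note that forall returns true only when every element of the array is in the given list (and true for an empty array), which is why James ends up false above. For the original requirement, 1 if the array contains Java or PHP and 0 otherwise, exists may be closer to what's needed; a minimal sketch, assuming the same sparkDF (the column name good_any is just for illustration):
# exists is true as soon as any element matches the predicate.
sparkDF = sparkDF.withColumn(
    'good_any',
    F.exists(F.col('booksInterested'), lambda x: x.isin('Java', 'PHP')).cast('int')
)
sparkDF.show()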

replace key value from dictionary

Below is my DF:
deviceDict = {'TABLET' : 'MOBILE', 'PHONE':'MOBILE', 'PC':'Desktop', 'CEDEX' : '', 'ST' : 'SAINT', 'AV' : 'AVENUE', 'BD': 'BOULEVARD'}
df = spark.createDataFrame([('TABLET', 'DAF ST PAQ BD'), ('PHONE', 'AVOTHA'), ('PC', 'STPA CEDEX'), ('OTHER', 'AV DAF'), (None, None)], ["device_type", 'City'])
df.show()
Output:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| TABLET|DAF ST PAQ BD|
| PHONE| AVOTHA|
| PC| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
The aim is to replace keys with their values, using the solution from Pyspark: Replacing value in a column by searching a dictionary:
tests = df.na.replace(deviceDict, 1)
Result:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| MOBILE|DAF ST PAQ BD|
| MOBILE| AVOTHA|
| Desktop| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
It worked for device_type but I wasn't able to change the city (even when using subset)
Expected output:
+-----------+------------------------+
|device_type| City|
+-----------+------------------------+
| MOBILE| DAF SAINT PAQ BOULEVARD|
| MOBILE| AVOTHA|
| Desktop| STPA|
| OTHER| AVENUE DAF|
| null| null|
+-----------+------------------------+
The replacement doesn't occur for the column City because you're trying to do a partial replacement inside the column values, whereas DataFrame.replace matches and replaces the entire cell value.
To achieve what you want for column City, you can use multiple nested regexp_replace expressions that you can dynamically generate using Python functools.reduce for example:
from functools import reduce
import pyspark.sql.functions as F
m = list(deviceDict.items())
df1 = df.na.replace(deviceDict, 1).withColumn(
    "City",
    reduce(
        lambda acc, x: F.regexp_replace(acc, rf"\b{x[0]}\b", x[1]),
        m[1:],
        F.regexp_replace(F.col("City"), rf"\b{m[0][0]}\b", m[0][1]),
    )
)
df1.show(truncate=False)
#+-----------+-----------------------+
#|device_type|City |
#+-----------+-----------------------+
#|MOBILE |DAF SAINT PAQ BOULEVARD|
#|MOBILE |AVOTHA |
#|Desktop |STPA |
#|OTHER |AVENUE DAF |
#|null |null |
#+-----------+-----------------------+

PySpark: How to explode list into multiple columns with sequential naming?

I have the following PySpark DF:
+--------+--------------------+--------------------+
| id| resoFacts| heating|
+--------+--------------------+--------------------+
|90179090|[, [No Handicap A...|[Central Heat, Fo...|
+--------+--------------------+--------------------+
created by the following:
(data_filt
.where(col('id') == '90179090')
.withColumn('heating', col("resoFacts").getField('heating')))
I want to create a DF that expands the list in heating into sequentially named columns, as so:
+--------------+------------+----------+----------+---------+
| id |heating_1 |heating_2 | heating_3|heating_4|
+--------------+------------+----------+----------+---------+
| 90179090 |Central Heat|Forced Air| Gas |Heat Pump|
+--------------+------------+----------+----------+---------+
My furthest attempt has generated the following DF:
+---+------------+----------+----+---------+
|pos|Central Heat|Forced Air| Gas|Heat Pump|
+---+------------+----------+----+---------+
| 1| null|Forced Air|null| null|
| 3| null| null| Gas| null|
| 2| null| null|null|Heat Pump|
| 0|Central Heat| null|null| null|
+---+------------+----------+----+---------+
with this code:
(data_filt
.where(col('id') == '90179090')
.withColumn('heating', col("resoFacts").getField('heating'))
.select("heating", posexplode("heating"))
.groupBy('pos').pivot('col').agg(first('col')))
I'm likely doing something wrong with the line beginning with groupBy. Does anyone have thoughts?
If you only have 4 elements in the array, you can simply do this:
from pyspark.sql import functions as F
data_filt.select(
    "id",
    *(
        F.col("heating").getItem(i).alias(f"heating_{i+1}")
        for i in range(4)
    )
)
Increase the range if you have more elements.
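If the number of elements isn't known up front, one option (a sketch, assuming the heating column is built from resoFacts as in the question) is to compute the maximum array length first and then generate the columns from it:
from pyspark.sql import functions as F

df = data_filt.where(F.col('id') == '90179090') \
              .withColumn('heating', F.col("resoFacts").getField('heating'))

# Find the longest heating array, then create one column per position.
n = df.agg(F.max(F.size('heating'))).collect()[0][0]
df.select(
    "id",
    *(F.col("heating").getItem(i).alias(f"heating_{i+1}") for i in range(n))
).show()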
