Replacing column values by dict pyspark - python

I have a dictionary like this
d = {"animal": ["cat", "dog", "turtle"], "fruit" : ["banana", "apple"]}
and a df:
+-----------+
|some_column|
+-----------+
| banana|
| cat|
| apple|
| other|
| null|
+-----------+
I'd like to get this as output:
+-----------+
|some_column|
+-----------+
| fruit|
| animal|
| fruit|
| other|
| null|
+-----------+
I know that if I had a dictionary like this
{"apple": "fruit", "banana": "fruit", ...}
I could use df.na.replace, and of course I could work through my given dictionary and convert it to that form.
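For example, inverting the dictionary and passing it to df.na.replace would look roughly like this:
inverted = {v: k for k, values in d.items() for v in values}
# {'cat': 'animal', 'dog': 'animal', 'turtle': 'animal', 'banana': 'fruit', 'apple': 'fruit'}
df = df.na.replace(inverted, subset=['some_column'])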
But is there a way of getting my desired output without changing the dictionary?

Create a dataframe from the dictionary and join the dataframes.
d = {"animal": ["cat", "dog", "turtle"], "fruit" : ["banana", "apple"]}
df = spark.createDataFrame([[d]], ['data'])
df = df.select(f.explode('data'))
df.show()
df.printSchema()
data = ['banana', 'cat', 'apple', 'other', None]
df2 = spark.createDataFrame(data, StringType()).toDF('some_column')
df2.show()
df2.join(df, f.array_contains(f.col('value'), f.col('some_column')), 'left') \
.select(f.coalesce('key', 'some_column').alias('some_column')) \
.show()
+------+------------------+
| key| value|
+------+------------------+
|animal|[cat, dog, turtle]|
| fruit| [banana, apple]|
+------+------------------+
root
|-- key: string (nullable = false)
|-- value: array (nullable = true)
| |-- element: string (containsNull = true)
+-----------+
|some_column|
+-----------+
| banana|
| cat|
| apple|
| other|
| null|
+-----------+
+-----------+
|some_column|
+-----------+
| fruit|
| animal|
| fruit|
| other|
| null|
+-----------+
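The lookup dataframe can also be built directly from the dictionary items instead of exploding a map column; a rough equivalent:
lookup = spark.createDataFrame(list(d.items()), ['key', 'value'])
df2.join(lookup, f.array_contains(f.col('value'), f.col('some_column')), 'left') \
    .select(f.coalesce('key', 'some_column').alias('some_column')) \
    .show()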

import pandas as pd

lx = {"animal": ["cat", "dog", "turtle"], "fruit": ["banana", "apple"]}
df = pd.DataFrame({'input': ['banana', 'cat', 'apple', 'other', 'null']})
ls_input = df['input'].to_list()

# invert the dict .. see https://stackoverflow.com/questions/483666/reverse-invert-a-dictionary-mapping
lx_inv = {vi: k for k, v in lx.items() for vi in v}

y = []
for x in ls_input:
    try:
        y.append(lx_inv[x])
    except KeyError:
        y.append(x)

df2 = pd.DataFrame(data=y, columns=['output'])
This creates an inverted dictionary. I'm not sure what you mean exactly by 'not changing the dictionary'; this method builds a new dict for making comparisons rather than modifying the original. Also, there are probably some nuances around duplicates (can a value belong to two keys in the original dict?) and missing/undefined cases, but you would need to specify the possible cases and the desired outcome for each.
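The loop can also be replaced with a vectorized lookup via Series.map, falling back to the original value where a key is missing; roughly:
df['output'] = df['input'].map(lx_inv).fillna(df['input'])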

Related

PySpark - Filter dataframe columns based on list

I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
data1 = [("James", "Lee", "Smith","36636"),
("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
| A | C | E |
+--------+------+------+
| James |Smith | |
|Michael |Boots | |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns: if a column is in df1.columns, add it; if it's not, use lit to add it with the proper alias.
You can write similar logic with a for loop if you find list comprehensions less readable (a sketch follows the output below).
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
| A| C| E|
+-------+-----+----+
| James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
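For reference, an equivalent for-loop version might look like this:
from pyspark.sql.functions import lit

cols = []
for c in final_columns:
    if c in df1.columns:
        cols.append(c)                    # column exists, keep it as-is
    else:
        cols.append(lit(None).alias(c))   # column missing, add it as null
df1.select(cols).show()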
Here is one way: use the DataFrame drop() method with a list which represents the symmetric difference between the DataFrame's current columns and your list of final columns.
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 1|0.1|
| 1| 2| 1|0.2|
| 3| 3| 3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
| a| c| d|
+---+---+---+
| 1| 1|0.1|
| 1| 1|0.2|
| 3| 3|0.3|
+---+---+---+
Which I believe is what you want.
Based on your requirement, I have written dynamic code. This will select columns based on the list provided and also create columns with null values if a column is not present in the source/original dataframe.
data1 = [("James", "Lee", "Smith","36636"),
("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
actual_columns = df1.schema.names
final_columns = ['A','C','E']
def Diff(li1, li2):
diff = list(set(li2) - set(li1))
return diff
def Same(li1, li2):
same = list(sorted(set(li1).intersection(li2)))
return same
df1 = df1.select(*Same(actual_columns,final_columns))
for i in Diff(actual_columns,final_columns):
df1 = df1.withColumn(""+i+"",lit(''))
display(df1)

replace key value from dictionary

Below is my DF:
deviceDict = {'TABLET' : 'MOBILE', 'PHONE':'MOBILE', 'PC':'Desktop', 'CEDEX' : '', 'ST' : 'SAINT', 'AV' : 'AVENUE', 'BD': 'BOULEVARD'}
df = spark.createDataFrame([('TABLET', 'DAF ST PAQ BD'), ('PHONE', 'AVOTHA'), ('PC', 'STPA CEDEX'), ('OTHER', 'AV DAF'), (None, None)], ["device_type", 'City'])
df.show()
Output:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| TABLET|DAF ST PAQ BD|
| PHONE| AVOTHA|
| PC| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
The aim is to replace values according to the dictionary, following the solution from Pyspark: Replacing value in a column by searching a dictionary:
tests = df.na.replace(deviceDict, 1)
Result:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| MOBILE|DAF ST PAQ BD|
| MOBILE| AVOTHA|
| Desktop| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
It worked for device_type but I wasn't able to change the city (even when using subset)
Expected output:
+-----------+------------------------+
|device_type| City|
+-----------+------------------------+
| MOBILE| DAF SAINT PAQ BOULEVARD|
| MOBILE| AVOTHA|
| Desktop| STPA|
| OTHER| AVENUE DAF|
| null| null|
+-----------+------------------------+
The replacement doesn't occur for the column City because you're trying to do a partial replacement within the column values, whereas DataFrame.replace matches and replaces entire values only.
To achieve what you want for column City, you can use multiple nested regexp_replace expressions, which you can generate dynamically with Python's functools.reduce, for example:
from functools import reduce
import pyspark.sql.functions as F
m = list(deviceDict.items())
df1 = df.na.replace(deviceDict, 1).withColumn(
    "City",
    reduce(
        lambda acc, x: F.regexp_replace(acc, rf"\b{x[0]}\b", x[1]),
        m[1:],
        F.regexp_replace(F.col("City"), rf"\b{m[0][0]}\b", m[0][1]),
    )
)
df1.show(truncate=False)
#+-----------+-----------------------+
#|device_type|City |
#+-----------+-----------------------+
#|MOBILE |DAF SAINT PAQ BOULEVARD|
#|MOBILE |AVOTHA |
#|Desktop |STPA |
#|OTHER |AVENUE DAF |
#|null |null |
#+-----------+-----------------------+

How to dynamically filter out rows in a Spark dataframe with an exact match?

I have a dictionary like so
dict = {
    "ColA": "A",
    "ColB": "B"
}
I want to use this dictionary to delete a row in a dataframe, df, only if the row matches each value in the dictionary exactly.
So using the input dataframe
+------+------+
| ColA | ColB |
+------+------+
| A | A |
| A | B |
| B | B |
+------+------+
The output would be
+------+------+
| ColA | ColB |
+------+------+
| A | A |
| B | B |
+------+------+
I have tried something like this
for col in dict:
    df = df.filter(df_to_upsert[col] != row[col])
However, this just filters out rows with any single matching value in the dictionary, so in this case every row in the dataframe would be filtered out.
A typical case using a reduce function:
from pyspark.sql.functions import col
from functools import reduce
cond = reduce(lambda x,y: x|y, [ col(k)!=v for k,v in dict.items() ])
df.filter(cond).show()
+----+----+
|ColA|ColB|
+----+----+
| A| A|
| B| B|
+----+----+
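An equivalent formulation builds the 'all columns match' predicate and negates it, which can read more naturally; a rough sketch with the same dictionary:
from functools import reduce
from pyspark.sql.functions import col

# drop a row only if every column equals its value in the dictionary
match_all = reduce(lambda x, y: x & y, [col(k) == v for k, v in dict.items()])
df.filter(~match_all).show()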

Adding a custom column to a pyspark dataframe using udf passing columns as an argument

I have a Spark dataframe with two columns, and I am trying to add a new column referring to a new value for these columns. I am taking these values from a dictionary which contains the correct value for each column.
+--------------+--------------------+
| country| zip|
+--------------+--------------------+
| Brazil| 7541|
|United Kingdom| 5678|
| Japan| 1234|
| Denmark| 2345|
| Canada| 4567|
| Italy| 6031|
| Sweden| 4205|
| France| 6111|
| Spain| 8555|
| India| 2552|
+--------------+--------------------+
The correct value for country should be India and zip should be 1234, and these are stored in a dictionary:
column_dict = {'country': 'India', 'zip': 1234}
I am trying to make the new column value "Brazil: India, Zip :1234" wherever a column value differs from these reference values.
I have tried it in the following way, but it returns an empty column even though the function itself returns the desired value:
cols = list(df.columns)
col_list = list(column_dict.keys())

def update(df, cols=cols, col_list=col_list):
    z = []
    for col1, col2 in zip(cols, col_list):
        if col1 == col2:
            if df.col1 != column_dict[col2]:
                z.append("{'col':" + col2 + ", 'reco': " + str(column_dict[col2]) + "}")
            else:
                z.append("{'col':" + col2 + ", 'reco': }")

my_udf = udf(lambda x: update(x, cols, col_list))
z = y.withColumn("NewValue", lit(my_udf(y, cols, col_list)))
If I export the same output dataframe to CSV, the values come out with the parts escaped with '\'. How can I get the function's value onto the column exactly?
A simple way is to make a dataframe from your dictionary, union() it with your main dataframe, and then group by and take the last value. Here is how you can do it:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
newDF = sc.parallelize([
    {'country': 'India', 'zip': 1234}
]).toDF()
newDF.show()
newDF:
+-------+----+
|country| zip|
+-------+----+
| India|1234|
+-------+----+
and then union it with the original dataframe:
unionDF = df.union(newDF)
unionDF.show()
+--------------+--------------------+
| country| zip|
+--------------+--------------------+
| Brazil| 7541|
|United Kingdom| 5678|
| Japan| 1234|
| Denmark| 2345|
| Canada| 4567|
| Italy| 6031|
| Sweden| 4205|
| France| 6111|
| Spain| 8555|
| India| 2552|
| India| 1234|
+--------------+--------------------+
and in the end do groupby and last:
import pyspark.sql.functions as f
finalDF = unionDF.groupBy('country').agg(f.last('zip').alias('zip'))
finalDF.show()
+--------------+--------------------+
| country| zip|
+--------------+--------------------+
| Brazil| 7541|
|United Kingdom| 5678|
| Japan| 1234|
| Denmark| 2345|
| Canada| 4567|
| Italy| 6031|
| Sweden| 4205|
| France| 6111|
| Spain| 8555|
| India| 1234|
+--------------+--------------------+
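If the goal is specifically the annotation string described in the question, a rough sketch (assuming the column_dict above) could build it with when and concat_ws, flagging each column whose value differs from the reference:
import pyspark.sql.functions as F

column_dict = {'country': 'India', 'zip': 1234}

annotated = df.withColumn(
    "NewValue",
    F.concat_ws(", ", *[
        # emit "col: reference" only where the actual value differs from the reference
        F.when(F.col(c) != F.lit(v), F.lit(f"{c}: {v}"))
        for c, v in column_dict.items()
    ])
)
annotated.show(truncate=False)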

Pyspark: Replacing value in a column by searching a dictionary

I'm a newbie in PySpark.
I have a Spark DataFrame df that has a column 'device_type'.
I want to replace every value that is "Tablet" or "Phone" with "Mobile", and replace "PC" with "Desktop".
In Python I can do the following,
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict,inplace=False)
How can I achieve this using PySpark? Thanks!
You can use either na.replace:
df = spark.createDataFrame([
('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])
df.na.replace(deviceDict, 1).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
or map literal:
from itertools import chain
from pyspark.sql.functions import create_map, lit
mapping = create_map([lit(x) for x in chain(*deviceDict.items())])
df.select(mapping[df['device_type']].alias('device_type'))
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| null|
| null|
+-----------+
Please note that the latter solution will convert values not present in the mapping to NULL. If this is not the desired behavior, you can add coalesce:
from pyspark.sql.functions import coalesce
df.select(
coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
)
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
After a lot of searching through alternatives, I think the simplest way to replace values using a Python dict is with the PySpark DataFrame method replace:
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df_replace = df.replace(deviceDict,subset=['device_type'])
This will replace all matching values using the dict. You can get the same result with df.na.replace() if you pass a dict argument combined with a subset argument (see the sketch below). The docs are not clear enough here: if you search for the function replace you will find two references, one under pyspark.sql.DataFrame.replace and the other under pyspark.sql.DataFrameNaFunctions.replace, but the sample code for both uses df.na.replace, so it is not obvious that you can actually call df.replace directly.
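For reference, the df.na.replace() form mentioned above would look roughly like this:
df_replace = df.na.replace(deviceDict, subset=['device_type'])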
Here is a little helper function, inspired by the R recode function, that abstracts the previous answers. As a bonus, it adds the option for a default value.
from itertools import chain
from pyspark.sql.functions import col, create_map, lit, when, isnull
from pyspark.sql.column import Column
df = spark.createDataFrame([
('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df.show()
+-----------+
|device_type|
+-----------+
| Tablet|
| Phone|
| PC|
| Other|
| null|
+-----------+
Here is the definition of recode.
def recode(col_name, map_dict, default=None):
    if not isinstance(col_name, Column):  # allows either a column name string or a Column instance to be passed
        col_name = col(col_name)
    mapping_expr = create_map([lit(x) for x in chain(*map_dict.items())])
    if default is None:
        return mapping_expr.getItem(col_name)
    else:
        return when(~isnull(mapping_expr.getItem(col_name)), mapping_expr.getItem(col_name)).otherwise(default)
Creating a column without a default gives null/None in all unmatched values.
df.withColumn("device_type", recode('device_type', deviceDict)).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| null|
| null|
+-----------+
On the other hand, specifying a value for default replaces all unmatched values with this default.
df.withColumn("device_type", recode('device_type', deviceDict, default='Other')).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| Other|
+-----------+
You can do this using df.withColumn too:
from itertools import chain
from pyspark.sql.functions import create_map, lit
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
mapping_expr = create_map([lit(x) for x in chain(*deviceDict.items())])
df = df.withColumn('device_type', mapping_expr[df['device_type']])
df.show()
The simplest way to do it is to apply a udf on your dataframe:
from pyspark.sql.functions import col, udf
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
map_func = udf(lambda row : deviceDict.get(row,row))
df = df.withColumn("device_type", map_func(col("device_type")))
Another way of solving this is with a traditional SQL CASE WHEN statement, generated automatically from the Python dictionary using an f-string and .join:
import pyspark.sql.functions as F

column = 'device_type'  # column to replace
e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'"
                        for k, v in deviceDict.items()])} ELSE {column} END"""
df.withColumn(column, F.expr(e)).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
Note: if you want to return NULL where the keys do not match, just change ELSE {column} END to ELSE NULL END in the CASE statement for the variable e:
column = 'device_type'  # column to replace
e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'"
                        for k, v in deviceDict.items()])} ELSE NULL END"""
df.withColumn('New_Col', F.expr(e)).show()
+-----------+-------+
|device_type|New_Col|
+-----------+-------+
| Tablet| Mobile|
| Phone| Mobile|
| PC|Desktop|
| Other| null|
| null| null|
+-----------+-------+
