Remove all rows that are duplicates with respect to some columns - python

I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:
+------+-----+----+
| id|value|type|
+------+-----+----+
|283924| 1.5| 0|
|283924| 1.5| 1|
|982384| 3.0| 0|
|982384| 3.0| 1|
|892383| 2.0| 0|
|892383| 2.5| 1|
+------+-----+----+
I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.
In this case:
Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
Rows 3 and 4 are duplicates, and therefore only rows 5 & 6 should remain:
The output would be:
+------+-----+----+
| id|value|type|
+------+-----+----+
|892383| 2.5| 1|
|892383| 2.0| 0|
+------+-----+----+
I've tried
df.dropDuplicates(subset = ['id', 'value'], keep = False)
But the "keep" feature isn't in PySpark (as it is in pandas.DataFrame.drop_duplicates.
How else could I do this?

You can do that using window functions:
from pyspark.sql import Window, functions as F

df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
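For the sample data this keeps only the (id, value) pairs that occur exactly once, i.e. the two 892383 rows (shown here as a comment; row order may differ):
# Expected result for the sample data above:
#+------+-----+----+
#|    id|value|type|
#+------+-----+----+
#|892383|  2.5|   1|
#|892383|  2.0|   0|
#+------+-----+----+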

You can groupBy the id and value columns to get the count. Then use a join to filter out the rows in your DataFrame where the count is not 1:
df.join(
    df.groupBy('id', 'value').count().where('count = 1').drop('count'),
    on=['id', 'value']
).show()
#+------+-----+----+
#| id|value|type|
#+------+-----+----+
#|892383| 2.5| 1|
#|892383| 2.0| 0|
#+------+-----+----+
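An equivalent variant (a sketch, not part of the original answer) flips the logic and anti-joins away the (id, value) pairs that occur more than once:
# Sketch: collect the (id, value) pairs that appear more than once...
dup_keys = df.groupBy('id', 'value').count().where('count > 1').drop('count')

# ...and drop every row whose key appears in that set
df.join(dup_keys, on=['id', 'value'], how='left_anti').show()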

Related

How to add average value of value column by every unique id column?

I'm attempting to do something very similar to this post here, but I need to use pyspark dataframes and I'm looking to create two columns based off different IDs.
Essentially I am attempting to append my original pyspark dataframe with two news columns each containing the mean value for their paired IDs.
An example initial df and the desired output df can be found below:
[image: example input and output]
To achieve this, you need to create two individual dataframes containing the aggregations and join them back to the original dataframe.
Data Preparation
import pandas as pd
from pyspark.sql import functions as F

d = {
    'id1': [1] * 2 + [2] * 3,
    'id2': [2] * 2 + [1] * 3,
    'value': [i for i in range(1, 100, 20)]
}
df = pd.DataFrame(d)

# 'sql' is the active SparkSession
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+---+---+-----+
|id1|id2|value|
+---+---+-----+
| 1| 2| 1|
| 1| 2| 21|
| 2| 1| 41|
| 2| 1| 61|
| 2| 1| 81|
+---+---+-----+
Aggregation & Join
sparkDF_agg_id1 = sparkDF.groupBy('id1').agg(F.mean(F.col('value')).alias('value_mean_id1'))
sparkDF_agg_id2 = sparkDF.groupBy('id2').agg(F.mean(F.col('value')).alias('value_mean_id2'))

finalDF = sparkDF.join(sparkDF_agg_id1,
                       sparkDF['id1'] == sparkDF_agg_id1['id1'],
                       'inner'
                      ).select(sparkDF['*'],
                               sparkDF_agg_id1['value_mean_id1'])

finalDF = finalDF.join(sparkDF_agg_id2,
                       finalDF['id2'] == sparkDF_agg_id2['id2'],
                       'inner'
                      ).select(finalDF['*'],
                               sparkDF_agg_id2['value_mean_id2'])
finalDF.show()
+---+---+-----+--------------+--------------+
|id1|id2|value|value_mean_id1|value_mean_id2|
+---+---+-----+--------------+--------------+
| 2| 1| 41| 61.0| 61.0|
| 2| 1| 61| 61.0| 61.0|
| 2| 1| 81| 61.0| 61.0|
| 1| 2| 1| 11.0| 11.0|
| 1| 2| 21| 11.0| 11.0|
+---+---+-----+--------------+--------------+
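As a side note, a more compact variant (a sketch, not part of the original answer) computes the same two mean columns with window functions, avoiding the explicit joins:
from pyspark.sql import Window, functions as F

# Sketch: one window per id column; each row gets the mean of its partition
w1 = Window.partitionBy('id1')
w2 = Window.partitionBy('id2')

finalDF = (sparkDF
           .withColumn('value_mean_id1', F.mean('value').over(w1))
           .withColumn('value_mean_id2', F.mean('value').over(w2)))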

How to create a dataframe based on a lookup dataframe, create multiple columns dynamically, and map values into the specific columns

I have two dataframes: one is the main dataframe and the other is a lookup dataframe. I need to produce the third one in the customized form below using pyspark. I need to check the values in the list_IDs column against the lookup dataframe and fill in the matching counts in the final dataframe. I have tried array_intersect and array lookups but it is not working.
Main dataframe:
df = spark.createDataFrame([(123, [75319, 75317]), (212, [136438, 25274]), (215, [136438, 75317])], ("ID", "list_IDs"))
df.show()
+---+---------------+
| ID| list_IDs|
+---+---------------+
|123| [75319, 75317]|
|212|[136438, 25274]|
|215|[136438, 75317]|
+---+---------------+
Lookup Dataframe:
df_2 = spark.createDataFrame([(75319, "Wheat", 20), (75317, "Rice", 10), (136438, "Jowar", 30), (25274, "Rajma", 40)], ("ID", "Material", "Count"))
df_2.show()
+------+--------+-----+
| ID|Material|Count|
+------+--------+-----+
| 75319|   Wheat|   20|
| 75317|    Rice|   10|
|136438|   Jowar|   30|
| 25274|   Rajma|   40|
+------+--------+-----+
The resultant dataframe needed:
+---+---------------+-----+----+-----+-----+
| ID|       list_IDs|Wheat|Rice|Jowar|Rajma|
+---+---------------+-----+----+-----+-----+
|123| [75319, 75317]|   20|  10|    0|    0|
|212|[136438, 25274]|    0|   0|   30|   40|
|215|[136438, 75317]|    0|  10|   30|    0|
+---+---------------+-----+----+-----+-----+
You can join the two dataframes and then pivot:
import pyspark.sql.functions as F
result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material').agg(F.first('Count')).fillna(0)

result.show()
+---+---------------+-----+-----+----+-----+
| ID| list_IDs|Jowar|Rajma|Rice|Wheat|
+---+---------------+-----+-----+----+-----+
|212|[136438, 25274]| 30| 40| 0| 0|
|215|[136438, 75317]| 30| 0| 10| 0|
|123| [75319, 75317]| 0| 0| 10| 20|
+---+---------------+-----+-----+----+-----+
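The pivoted columns come out in alphabetical order; if you want them in a fixed order (and to skip the extra pass Spark needs to discover the distinct values), you can pass the values explicitly to pivot. A sketch:
# Sketch: pin the pivot columns to a known list instead of letting Spark discover them
result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material', ['Wheat', 'Rice', 'Jowar', 'Rajma']).agg(F.first('Count')).fillna(0)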

How to rename DataFrame header with values from mapping table in Pyspark

I have to rename the columns of a table with values from a mapping table (df2 below) in Pyspark.
Thanks for any help!
I tried to do it with pandas, but it runs for 25 minutes on my tables.
import pandas as pd
df = pd.DataFrame({'kod':[1,1,3,4,5], 'freq':[4,8,8,20,16], 'lsv':[100,200,300,250,400]})
df2 = pd.DataFrame({'oldid':['kod','freq','lsv'], 'newid':['code','visits','volume']})
mapping=dict(df2[['oldid', 'newid']].values)
df=df.rename(columns=mapping)
display(df2)
Spark DataFrames work a little differently from pandas dataframes.
After converting your pandas dataframes into Spark dataframes (I am renaming freq to zeq just to demonstrate the sorting):
df = spark.createDataFrame([(4,1,100),(8,1,200),(8,3,300),(20,4,250),(16,5,400)], ['zeq','kod','lsv'])
sorted_df = df.select(sorted(df.columns))
sorted_df.show()
+---+---+---+
|kod|lsv|zeq|
+---+---+---+
| 1|100| 4|
| 1|200| 8|
| 3|300| 8|
| 4|250| 20|
| 5|400| 16|
+---+---+---+
The headers dataframe:
headers = spark.createDataFrame([('code','kod'),('visits','zeq'),('volume','lsv')],['newid','oldid'])
headers.show()
+------+-----+
| newid|oldid|
+------+-----+
| code| kod|
|visits| zeq|
|volume| lsv|
+------+-----+
There is a method called toDF available on a Spark dataframe that takes a list of new column names as an argument and updates the header of the dataframe.
So sort your headers dataframe by oldid, select newid, and collect that column's values into a list, like below:
sorted_headers_list = headers.sort('oldid').select('newid').rdd.flatMap(lambda x: x).collect()
Update your dataframe with the new headers:
df_with_updated_headers = sorted_df.toDF(*sorted_headers_list)
df_with_updated_headers.show()
+----+------+------+
|code|volume|visits|
+----+------+------+
| 1| 100| 4|
| 1| 200| 8|
| 3| 300| 8|
| 4| 250| 20|
| 5| 400| 16|
+----+------+------+
Please let me know if you need more details.
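If you would rather not rely on the column order, a dictionary-based variant (a sketch, not part of the original answer) renames via select and alias:
from pyspark.sql import functions as F

# Sketch: build an old-name -> new-name dict from the headers dataframe
mapping = dict(headers.select('oldid', 'newid').collect())

# Rename every column, falling back to the original name if it has no mapping
df_renamed = df.select([F.col(c).alias(mapping.get(c, c)) for c in df.columns])
df_renamed.show()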

Pyspark - Using two time indices for window function

I have a dataframe where each row has two date columns. I would like to create a window function with a rangeBetween that counts the number of rows in a particular range, where BOTH date columns are within the range. In the case below, both timestamps of a row must be before the timestamps of the current row to be included in the count.
Example df including the count column:
+---+-----------+-----------+-----+
| ID|Timestamp_1|Timestamp_2|Count|
+---+-----------+-----------+-----+
| a| 0| 3| 0|
| b| 2| 5| 0|
| d| 5| 5| 3|
| c| 5| 9| 3|
| e| 8| 10| 4|
+---+-----------+-----------+-----+
I tried creating two windows and creating the new column over both of these:
w_1 = Window.partitionBy().orderBy('Timestamp_1').rangeBetween(Window.unboundedPreceding, 0)
w_2 = Window.partitionBy().orderBy('Timestamp_2').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('count', F.count('ID').over(w_1).over(w_2))
However, this is not allowed in Pyspark and therefore gives an error.
Any ideas? Solutions in SQL are also fine!
Would a self-join work?
from pyspark.sql import functions as F
df_count = (
    df.alias('a')
    .join(
        df.alias('b'),
        (F.col('b.Timestamp_1') <= F.col('a.Timestamp_1')) &
        (F.col('b.Timestamp_2') <= F.col('a.Timestamp_2')),
        'left'
    )
    .groupBy('a.ID')
    .agg(F.count('b.ID').alias('count'))
)
df = df.join(df_count, 'ID')
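Since SQL solutions are welcome too, the same self-join can be expressed against a temp view (a sketch; 'events' is an arbitrary view name and spark is the active SparkSession):
# Sketch: register the dataframe and run the same left self-join in SQL
df.createOrReplaceTempView('events')

df_count = spark.sql("""
    SELECT a.ID, COUNT(b.ID) AS count
    FROM events a
    LEFT JOIN events b
      ON b.Timestamp_1 <= a.Timestamp_1
     AND b.Timestamp_2 <= a.Timestamp_2
    GROUP BY a.ID
""")

df = df.join(df_count, 'ID')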

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.
Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
df.withColumn('total_col', df.a + df.b + df.c)
The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?
This was not obvious. I see no row-based sum of the columns defined in the spark Dataframes API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
Version 1
This is overly complicated, but works as well.
You can do this:
use df.columns to get a list of the names of the columns
use that list of names to build a list of the column objects
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With Python's reduce, some knowledge of how operator overloading works, and the pyspark Column code, that becomes:
def column_add(a, b):
    return a.__add__(b)

newdf = df.withColumn('total_col',
                      reduce(column_add, (df[col] for col in df.columns)))
Note this is a Python reduce, not a Spark RDD reduce, and the parenthesized term in the second parameter to reduce requires the parentheses because it is a generator expression.
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
... return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
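One caveat: on Python 3, reduce is no longer a builtin, so import it first:
from functools import reduce  # built in on Python 2, lives in functools on Python 3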
The most straightforward way of doing it is to use the expr function:
from pyspark.sql.functions import *
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
The solution
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
posted by @Paul works. Nevertheless, I was getting the following error, as many others have:
TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
so that line imported the pyspark sum function, while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the normal Python sum function.
You can delete the reference to the pyspark function with del sum.
Otherwise in my case I changed the import to
import pyspark.sql.functions as F
and then referenced the functions as F.sum.
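A minimal sketch of that fix, keeping Spark's functions behind an alias so the builtin sum stays available:
import pyspark.sql.functions as F   # Spark's aggregate sum stays namespaced as F.sum

# The builtin sum folds the Column objects with +, yielding a single Column expression
newdf = df.withColumn('total', sum(df[col] for col in df.columns))

# Spark's aggregate is still reachable when needed, e.g. newdf.agg(F.sum('total'))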
Summing multiple columns from a list into one column
PySpark's sum function is an aggregate and doesn't support column addition.
This can be achieved using expr function.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns.
My problem was similar to the above (a bit more complex), as I had to add consecutive column sums as new columns in a PySpark dataframe. This approach uses code from Paul's Version 2 above:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df = spark.createDataFrame(
    data=[(1, 2, 3), (4, 5, 6), (3, 2, 1),
          (6, 1, -4), (0, 2, -2), (6, 4, 1),
          (4, 5, 2), (5, -3, -5), (6, 4, -1)],
    schema=['x1', 'x2', 'x3'])
df.show()
+---+---+---+
| x1| x2| x3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 3| 2| 1|
| 6| 1| -4|
| 0| 2| -2|
| 6| 4| 1|
| 4| 5| 2|
| 5| -3| -5|
| 6| 4| -1|
+---+---+---+
colnames=df.columns
add new columns that are cumulative sums (consecutive):
for i in range(0, len(colnames)):
    colnameLst = colnames[0:i+1]
    colname = 'cm' + str(i+1)
    df = df.withColumn(colname, sum(df[col] for col in colnameLst))
df.show()
+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 3| 6|
| 4| 5| 6| 4| 9| 15|
| 3| 2| 1| 3| 5| 6|
| 6| 1| -4| 6| 7| 3|
| 0| 2| -2| 0| 2| 0|
| 6| 4| 1| 6| 10| 11|
| 4| 5| 2| 4| 9| 11|
| 5| -3| -5| 5| 2| -3|
| 6| 4| -1| 6| 10| 9|
+---+---+---+---+---+---+
The 'cumulative sum' columns added are as follows:
cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3
df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
df.show()
+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
| linha1| valor1| 2|
| linha2| valor2| 5|
+--------+--------+--------+
df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()
+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
| linha1| valor1| 2| 1.0|
| linha2| valor2| 5| 2.5|
+--------+--------+--------+--------------+
df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()
+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
| linha1| valor1| 2| 1.0| 3.0|
| linha2| valor2| 5| 2.5| 7.5|
+--------+--------+--------+--------------+------------+
A very simple approach would be to just use select instead of withColumn, as below:
df = df.select('*', (col("a") + col("b") + col("c")).alias("total"))
This should give you the required sum, with minor changes based on your requirements.
The following approach works for me:
Import the pyspark sql functions:
from pyspark.sql import functions as F
Then use F.expr with an addition expression built from the column names:
data_frame.withColumn('Total_Sum', F.expr('col_name1 + col_name2 + ... + col_namen'))
