How to perform dedup in python? - python

I have a DataFrame. I want to partition by one column and order by another column in descending order. I know how to do it in PySpark, but I am not clear on how to do it in plain Python. My PySpark code is as follows:
df =
+----+---+
|Name|age|
+----+---+
| Ram| 20|
|geet| 16|
| ram| 50|
|geet| 15|
| tom| 21|
|hary| 25|
| tom| 36|
+----+---+
partition_col=['Name']
arrange_col =['age']
df= df.select("*",F.row_number().over(Window.partitionBy(partition_col).orderBy(*[F.desc(c) for c in arrange_col ])).alias("Value"))
This gives me:
df=
+----+---+-----+
|Name|age|Value|
+----+---+-----+
| Ram| 20| 1|
|geet| 16| 1|
|geet| 15| 2|
|hary| 25| 1|
| ram| 50| 1|
| tom| 36| 1|
| tom| 21| 2|
+----+---+-----+
Now I want the same in Python. How do I translate this into Python:
df.select("*", F.row_number().over(Window.partitionBy(partition_col).orderBy(*[F.desc(c) for c in arrange_col])).alias("Value"))
given that there is no Window.partitionBy there?
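In plain Python this is usually done with pandas; a minimal sketch of the same partition-and-number logic (pandas is an assumption here, since the question only says "python"):

import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ram", "geet", "ram", "geet", "tom", "hary", "tom"],
     "age": [20, 16, 50, 15, 21, 25, 36]}
)

# Sort by age descending, then number the rows within each Name group;
# this mirrors row_number() over (partition by Name order by age desc).
df = df.sort_values("age", ascending=False)
df["Value"] = df.groupby("Name").cumcount() + 1
print(df)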

Delta Lake Table Storage Sorting

I have a delta lake table and inserting the data into that table. Business asked to sort the data while storing it in the table.
I sorted my DataFrame before creating the Delta table as below:
df.sort()
and then created the Delta table as below:
df.write.format('delta').option('mergeSchema', 'true').save('deltalocation')
When I retrieve this data into a DataFrame, I see that it is still unsorted, and I have to call df.sort in order to display the sorted data.
Per my understanding, the data cannot actually be stored in a sorted order, and the user has to write a sorting query while extracting the data from the table.
I need to understand whether this is correct, and also how Delta Lake stores the data internally.
My understanding is that it partitions the data and doesn't care about the sort order; the data is spread across multiple partitions.
Can someone please clarify this in more detail and advise whether my understanding is correct?
Delta Lake itself does not enforce sorting, because that would require every engine writing to the table to sort the data. To balance simplicity, speed of ingestion, and speed of query, Delta Lake neither requires nor enables sorting per se; i.e., your statement is correct.
My understanding is that it partitions the data and doesn't care about the sort order; the data is spread across multiple partitions.
Note that Delta Lake includes data skipping and OPTIMIZE ZORDER. These let you skip files/data using column statistics and by clustering the data. While sorting can help for a single column, Z-ordering provides better multi-column data clustering. More info is available in Delta 2.0 - The Foundation of your Data Lakehouse is Open.
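For example, once the table is written you could Z-order it; a hedged sketch (requires Delta Lake 2.0+ or Databricks; the path is the one from the question, and col_a/col_b are placeholder column names, not from the question):

# Cluster the existing Delta table by the columns most often used in filters,
# so data skipping can prune files on either column (col_a/col_b are placeholders).
spark.sql("OPTIMIZE delta.`deltalocation` ZORDER BY (col_a, col_b)")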
That said, how Delta Lake stores the data is often a product of what the writer itself does. If you were to specify a sort during the write phase, e.g.:
df_sorted = df.repartition("date").sortWithinPartitions("date", "id")
df_sorted.write.format("delta").partitionBy("date").save('deltalocation')
Then the data should be sorted within each partition, and it will come back sorted within each partition when read as well.
In response to the question about the potential order, allow me to provide a simple example:
from pyspark.sql.functions import expr

data = spark.range(0, 100)
df = data.withColumn("mod", expr("mod(id, 10)"))
df.show()
# Write unsorted table
df.write.format("delta").partitionBy("mod").save("/tmp/df")
# Sort within partitions
df_sorted = df.repartition("mod").sortWithinPartitions("mod", "id")
# Write sorted table
df_sorted.write.format("delta").partitionBy("mod").save("/tmp/df_sorted")
The two DataFrames are saved as Delta tables at their respective /tmp/df and /tmp/df_sorted locations.
You can read the data back as follows:
# Load data
spark.read.format("delta").load("/tmp/df").show()
spark.read.format("delta").load("/tmp/df").orderBy("mod").show()
spark.read.format("delta").load("/tmp/df_sorted").show()
spark.read.format("delta").load("/tmp/df_sorted").orderBy("mod").show()
For the unsorted table, here are the first 20 rows; as expected, the data is not sorted.
+---+---+
| id|mod|
+---+---+
| 63| 3|
| 73| 3|
| 83| 3|
| 93| 3|
| 3| 3|
| 13| 3|
| 23| 3|
| 33| 3|
| 43| 3|
| 53| 3|
| 88| 8|
| 98| 8|
| 28| 8|
| 38| 8|
| 48| 8|
| 58| 8|
| 8| 8|
| 18| 8|
| 68| 8|
| 78| 8|
+---+---+
But in the case of df_sorted:
+---+---+
| id|mod|
+---+---+
| 2| 2|
| 12| 2|
| 22| 2|
| 32| 2|
| 42| 2|
| 52| 2|
| 62| 2|
| 72| 2|
| 82| 2|
| 92| 2|
| 9| 9|
| 19| 9|
| 29| 9|
| 39| 9|
| 49| 9|
| 59| 9|
| 69| 9|
| 79| 9|
| 89| 9|
| 99| 9|
+---+---+
As noted, the data within each partition is sorted. The partitions themselves are not returned in order, because different worker threads read different partitions, so there is no guarantee of partition order unless you explicitly specify one when reading.

How to pass a third-party column after a GroupBy and aggregation in PySpark DataFrame?

I have a Spark DataFrame, say df, to which I need to apply a groupBy on col1, aggregate by the maximum value of col2, and carry along the corresponding value of col3 (which plays no part in the groupBy or the aggregation). It is best to illustrate this with an example.
df.show()
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| 1| 500| 10 |
| 1| 600| 11 |
| 1| 700| 12 |
| 2| 600| 14 |
| 2| 800| 15 |
| 2| 650| 17 |
+-----+-----+-----+
I can easily perform the groupBy and the aggregation to obtain the maximum value of col2 in each group, using
import pyspark.sql.functions as F
df1 = df.groupBy("col1").agg(F.max("col2").alias("Max_col2"))
df1.show()
+-----+---------+
| col1| Max_col2|
+-----+---------+
| 1| 700|
| 2| 800|
+-----+---------+
However, what I am struggling with, and what I would like to do, is to additionally pass along the corresponding value of col3, obtaining the following table:
+-----+---------+-----+
| col1| Max_col2| col3|
+-----+---------+-----+
| 1| 700| 12 |
| 2| 800| 15 |
+-----+---------+-----+
Does anyone know how this can be done?
Many thanks in advance,
Marioanzas
You can aggregate the maximum of a struct, and then expand the struct:
import pyspark.sql.functions as F

# max of a struct compares its fields left to right, so the row with the
# largest col2 wins and its col3 is carried along
df2 = df.groupBy('col1').agg(
    F.max(F.struct('col2', 'col3')).alias('col')
).select('col1', 'col.*')
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 700| 12|
| 2| 800| 15|
+----+----+----+
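For completeness, an alternative sketch (not from the answer above) that uses a window function instead and keeps the whole top row per col1 group:

from pyspark.sql import functions as F, Window

# Rank rows within each col1 group by col2 descending and keep the first row.
w = Window.partitionBy("col1").orderBy(F.col("col2").desc())
df3 = (df.withColumn("rn", F.row_number().over(w))
         .filter("rn = 1")
         .select("col1", F.col("col2").alias("Max_col2"), "col3"))
df3.show()

Ties on col2 are broken arbitrarily unless you add a tie-breaker column to the orderBy.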

How to combine two DataFrames, replacing null values

I have two DataFrames. The sets of columns in them are slightly different.
df1:
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| 1| 15| 20| 8|
| 2| 0|null| 5|
+---+----+----+----+
df2:
+---+----+----+----+
| id|col1|col2|col4|
+---+----+----+----+
| 1| 10| 10| 40|
| 2| 10| 30| 50|
+---+----+----+----+
How can I make a left join on df1 in PySpark, but at the same time replace its null values with values from df2, and also add the columns that are missing from df1 but present in df2?
result_df:
id col1 col2 col3 col4
1 15 20 8 40
2 0 30 5 50
I need to combine the two DataFrames on id to get an extra column col4, and for col1, col2, and col3 take the values from df1, unless a value is null (as col2 is for id 2), in which case it should be replaced with the value from df2.
Use coalesce function after the left join.
from pyspark.sql.functions import *
df1.show()
#+---+----+----+----+
#| id|col1|col2|col3|
#+---+----+----+----+
#| 1| 15| 20| 8|
#| 2| 0|null| 5|
#+---+----+----+----+
df2.show()
#+---+----+----+----+----+
#| id|col1|col2|col3|col4|
#+---+----+----+----+----+
#| 1| 15| 20| 8| 40|
#| 2| 0| 30| 5| 50|
#+---+----+----+----+----+
df1.join(df2, ["id"], "left").select(
    "id",
    coalesce(df2.col1, df1.col1).alias("col1"),
    coalesce(df2.col2, df1.col2).alias("col2"),
    coalesce(df2.col3, df1.col3).alias("col3"),
    df2.col4
).show()
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 1| 15| 20| 8| 40|
| 2| 0| 30| 5| 50|
+---+----+----+----+----+
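If the two frames share many columns, a hedged generalization of the same coalesce-after-left-join idea (a sketch, not from the answer above; it prefers df1's value and only falls back to df2 when df1 is null, matching the question's expected result):

from pyspark.sql import functions as F

# Coalesce every column the two frames share (df1 wins unless null),
# and keep the columns that exist in only one of them.
common = [c for c in df1.columns if c in df2.columns and c != "id"]
df1_only = [c for c in df1.columns if c not in df2.columns]
df2_only = [c for c in df2.columns if c not in df1.columns]

result = df1.join(df2, ["id"], "left").select(
    "id",
    *[F.coalesce(df1[c], df2[c]).alias(c) for c in common],
    *df1_only,   # e.g. col3
    *df2_only,   # e.g. col4
)
result.show()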

Joining on the previous month's data for each month in a PySpark dataset

I have a dataset on a monthly basis, and each month has N accounts. Some months have new accounts, and some accounts disappear after a certain month (this happens randomly).
I need to take an account's current-month balance and deduct the previous month's balance from it (if the account existed in the previous month); otherwise keep the current month's balance as the difference.
I was advised to do a join for each month, i.e. join month1 to month2, month2 to month3, etc., but I am not exactly sure how that would go...
Here is an example dataset:
|date |account |balance |
----------------------------------
|01.01.2019|1 |40 |
|01.01.2019|2 |33 |
|01.01.2019|3 |31 |
|01.02.2019|1 |32 |
|01.02.2019|2 |56 |
|01.02.2019|4 |89 |
|01.03.2019|2 |12 |
|01.03.2019|4 |35 |
|01.03.2019|5 |76 |
|01.03.2019|6 |47 |
----------------------------------
The account id is unique across removed, current, and newly added accounts.
I initially used F.lag, but now that accounts disappear and new ones come in, the number of accounts per month is not constant, so I cannot simply lag. As I said, I was advised to use a join, i.e. join January onto February, February onto March, etc.
But I am not really sure how that would go. Does anyone have any ideas?
P.S. I created this table with an example of an account that stays, an account that is new, and an account that is removed from later months.
The end goal is:
|date |account |balance | balance_diff_with_previous_month |
--------------------------------------------------------------------|
|01.01.2019|1 |40 |na |
|01.01.2019|2 |33 |na |
|01.01.2019|3 |31 |na |
|01.02.2019|1 |32 |-8 |
|01.02.2019|2 |56 |23 |
|01.02.2019|4 |89 |89 |
|01.03.2019|2 |12 |-44 |
|01.03.2019|4 |35 |-54 |
|01.03.2019|5 |76 |76 |
|01.03.2019|6 |47 |47 |
--------------------------------------------------------------------|
As I said, F.lag cannot be used as-is because the number of accounts per month is not constant and I do not control it, so I cannot lag by a fixed number of rows.
Does anyone have ideas about how to join on account and/or date (current month) with date - 1 (previous month)?
Thanks for reading and helping :)
An alternate solution using joins:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("01.01.2019", 1, 40), ("01.01.2019", 2, 33), ("01.01.2019", 3, 31),
    ("01.02.2019", 1, 32), ("01.02.2019", 2, 56), ("01.02.2019", 4, 89),
    ("01.03.2019", 2, 12), ("01.03.2019", 4, 35), ("01.03.2019", 5, 76), ("01.03.2019", 6, 47)],
    ["date", "account", "balance"])

df.alias("current").join(
    df.alias("previous"),
    [F.to_date(F.col("previous.date"), "dd.MM.yyyy") ==
         F.to_date(F.add_months(F.to_date(F.col("current.date"), "dd.MM.yyyy"), -1), "dd.MM.yyyy"),
     F.col("previous.account") == F.col("current.account")],
    "left"
).select(
    F.col("current.date").alias("date"),
    F.coalesce("current.account", "previous.account").alias("account"),
    F.col("current.balance").alias("balance"),
    (F.col("current.balance") - F.coalesce(F.col("previous.balance"), F.lit(0)))
        .alias("balance_diff_with_previous_month")
).orderBy("date", "account").show()
which results in:
+----------+-------+-------+--------------------------------+
| date|account|balance|balance_diff_with_previous_month|
+----------+-------+-------+--------------------------------+
|01.01.2019| 1| 40| 40|
|01.01.2019| 2| 33| 33|
|01.01.2019| 3| 31| 31|
|01.02.2019| 1| 32| -8|
|01.02.2019| 2| 56| 23|
|01.02.2019| 4| 89| 89|
|01.03.2019| 2| 12| -44|
|01.03.2019| 4| 35| -54|
|01.03.2019| 5| 76| 76|
|01.03.2019| 6| 47| 47|
+----------+-------+-------+--------------------------------+
F.lag works well for what you want if you partition by account and order by date:
from pyspark.sql import functions as F, Window

partition = Window.partitionBy("account") \
    .orderBy(F.to_date(F.col("date"), "dd.MM.yyyy"))
previousAmount = data.withColumn("balance_diff_with_previous_month",
    F.col("balance") - F.lag("balance", 1, 0).over(partition))
previousAmount.show(10, False)
>>> from pyspark.sql.functions import *
>>> from pyspark.sql import Window
>>> df.show()
+----------+-------+-------+
| date|account|balance|
+----------+-------+-------+
|01.01.2019| 1| 40|
|01.01.2019| 2| 33|
|01.01.2019| 3| 31|
|01.02.2019| 1| 32|
|01.02.2019| 2| 56|
|01.02.2019| 4| 89|
|01.03.2019| 2| 12|
|01.03.2019| 4| 35|
|01.03.2019| 5| 76|
|01.03.2019| 6| 47|
+----------+-------+-------+
>>> df1 = df.withColumn("date", expr("to_date(date, 'dd.MM.yyyy')"))
>>> W = Window.partitionBy("account").orderBy("date")
>>> df1.withColumn("balance_diff_with_previous_month", col("balance") - lag(col("balance"),1,0).over(W)).show()
+----------+-------+-------+--------------------------------+
| date|account|balance|balance_diff_with_previous_month|
+----------+-------+-------+--------------------------------+
|2019-01-01| 1| 40| 40.0|
|2019-01-01| 2| 33| 33.0|
|2019-01-01| 3| 31| 31.0|
|2019-02-01| 1| 32| -8.0|
|2019-02-01| 2| 56| 23.0|
|2019-02-01| 4| 89| 89.0|
|2019-03-01| 2| 12| -44.0|
|2019-03-01| 4| 35| -54.0|
|2019-03-01| 5| 76| 76.0|
|2019-03-01| 6| 47| 47.0|
+----------+-------+-------+--------------------------------+

How to filter which rows are included in a PySpark window

I have a dataframe like
|TRADEID|time_period|value|
+-------+-----------+-----+
| 1| 31-01-2019| 5|
| 1| 31-05-2019| 6|
| 2| 31-01-2019| 15|
| 2| 31-03-2019| 20|
+-------+-----------+-----+
Entries for some months are missing, so I forward fill them with window operations:
from pyspark.sql import Window
from pyspark.sql.functions import last

window = Window.partitionBy(['TRADEID'])\
    .orderBy('time_period')\
    .rowsBetween(Window.unboundedPreceding, 0)
filled_column = last(df['value'], ignorenulls=True).over(window)
df_filled = df.withColumn('value_filled', filled_column)
After this I get
|TRADEID|time_period|value|
+-------+-----------+-----+
| 1| 31-01-2019| 5|
| 1| 28-02-2019| 5|
| 1| 31-03-2019| 5|
| 1| 30-04-2019| 5|
| 1| 31-05-2019| 6|
| 2| 31-01-2019| 15|
| 2| 28-02-2019| 15|
| 2| 31-03-2019| 20|
+-------+-----------+-----+
However, I do not want to fill a month if the gap between this month and the last available month is more than 2 months. For example, in my case I want to get
|TRADEID|time_period|value |
+-------+-----------+--------+
| 1| 31-01-2019| 5 |
| 1| 28-02-2019| 5 |
| 1| 31-03-2019| 5 |
| 1| 30-04-2019| null|
| 1| 31-05-2019| 6 |
| 2| 31-01-2019| 15 |
| 2| 28-02-2019| 15 |
| 2| 31-03-2019| 20 |
+-------+-----------+--------+
How can I do this?
Would it work to have one more window operation beforehand that removes the entries for months that should not be filled? If that is a good approach, please help me with the code; I am not good at window operations.
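One possible sketch, not an answer from the thread: it assumes the missing months are already present as rows with a null value (as in your filled output), that time_period is a dd-MM-yyyy string, and the helper column names (period_dt, obs_dt, etc.) are made up. The idea is to track the date of the last non-null value alongside the value itself, and only apply the fill when that date is at most 2 months old:

from pyspark.sql import functions as F, Window

w = (Window.partitionBy("TRADEID")
           .orderBy("period_dt")
           .rowsBetween(Window.unboundedPreceding, 0))

df_filled = (
    df
    .withColumn("period_dt", F.to_date("time_period", "dd-MM-yyyy"))
    # remember the date on which a non-null value was last observed
    .withColumn("obs_dt", F.when(F.col("value").isNotNull(), F.col("period_dt")))
    .withColumn("last_value", F.last("value", ignorenulls=True).over(w))
    .withColumn("last_obs_dt", F.last("obs_dt", ignorenulls=True).over(w))
    # keep the forward-filled value only if it is at most 2 months old
    .withColumn(
        "value_filled",
        F.when(F.months_between("period_dt", "last_obs_dt") <= 2, F.col("last_value"))
    )
    .drop("obs_dt", "last_value", "last_obs_dt")
)
df_filled.show()

months_between handles end-of-month dates, so 31-03-2019 is exactly 2 months after 31-01-2019 and still gets filled, while 30-04-2019 is 3 months after and stays null, matching your expected output.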
