PySpark multi groupby with different columns - python

I have data like below
year name percent sex
1880 John 0.081541 boy
1881 William 0.080511 boy
1881 John 0.050057 boy
I need to groupby and count using different columns
df_year = df.groupby('year').count()
df_name = df.groupby('name').count()
df_sex = df.groupby('sex').count()
Then I have to create a Window to get the top-3 rows for each column:
window = Window.partitionBy('year').orderBy(col("count").desc())
top4_res = df_year.withColumn('topn', func.row_number().over(window)).\
    filter(col('topn') <= 4).repartition(1)
Suppose I have hundreds of columns to group by, count, and take the top 3 from.
Can I do it all at once?
Or is there a better way to do it?

I am not sure this meets your requirement, but if you are okay with a single dataframe, I think it can give you a start; let me know otherwise. You can stack these 3 columns (or more), then group by and count:
from pyspark.sql import functions as F
cols = ['year','name','sex']
e = f"""stack({len(cols)},{','.join(map(','.join,
(zip([f'"{i}"' for i in cols],cols))))}) as (col,val)"""
(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
    .groupBy(*['col','val']).agg(F.count("col").alias("Counts")).orderBy('col')).show()
+----+-------+------+
| col| val|Counts|
+----+-------+------+
|name| John| 2|
|name|William| 1|
| sex| boy| 3|
|year| 1881| 2|
|year| 1880| 1|
+----+-------+------+
If you want a wide form you can also pivot, but I think the long form is more helpful:
(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
.groupBy('col').pivot('val').agg(F.count('val')).show())
+----+----+----+----+-------+----+
| col|1880|1881|John|William| boy|
+----+----+----+----+-------+----+
|name|null|null| 2| 1|null|
|year| 1| 2|null| null|null|
| sex|null|null|null| null| 3|
+----+----+----+----+-------+----+
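The `e` string built in this answer is dense, so it can help to print what it expands to before passing it to `selectExpr`. The sketch below is plain Python and needs no Spark session. `build_stack_expr` is a hypothetical helper name, not part of PySpark, and it assumes column names that are valid unquoted SQL identifiers.

```python
def build_stack_expr(cols, key_alias="col", value_alias="val"):
    """Build a Spark SQL stack() expression that unpivots `cols` into (name, value) rows.

    Hypothetical helper for illustration. It assumes every column has already been
    cast to a common type (the answer casts to string) and needs no backtick quoting.
    """
    # Each column contributes a pair: a string literal holding its name, then the column itself.
    pairs = ",".join(f'"{c}",{c}' for c in cols)
    return f"stack({len(cols)},{pairs}) as ({key_alias},{value_alias})"


expr = build_stack_expr(["year", "name", "sex"])
print(expr)
# stack(3,"year",year,"name",name,"sex",sex) as (col,val)
```

The resulting string can be used wherever the answer uses `e`, for example `df.select(...).selectExpr(expr)`.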

If you want the top n values by count for each column, this should work:
from pyspark.sql.functions import col
columns_to_check = [ 'year', 'name' ]
n = 4
for c in columns_to_check:
    # returns a dataframe
    x = df.groupBy(c).count().sort(col("count").desc()).limit(n)
    x.show()
    # returns a list of rows
    x = df.groupBy(c).count().sort(col("count").desc()).take(n)
    print(x)

Related

How to make a new pyspark df column that's the average of the last n values by day of week?

What I'm trying to do is make a pyspark dataframe with item and date and another column "3_avg" that's the average of the last three same day-of-week from the given date back. Said another way, if 2022-5-5 is a thursday, I want the 3_avg value for that row to be the average sales for that item for the last three thursdays, so 4/28, 4/21, and 4/14.
I've got this thus far, but it just averages the whole column for that day of week... I can't figure out how to get it to be distinct by item and date and only use the last three? I was trying to get it to work with day_of_week, but my brain can't connect that to what I need to happen.
df_fcst_dow = (
    df_avg
    .withColumn("day_of_week", F.dayofweek(F.col("trn_dt")))
    .groupBy("item", "date", "day_of_week")
    .agg(
        F.sum(F.col("sales") / 3).alias("3_avg")
    )
)
You can do this with a window or with a groupby. Here I'd encourage the groupby, as it distributes the work better among the worker nodes. We create an array of the current date and the next two dates. We then explode that array, which duplicates each row across all the dates it should contribute to, so we can group by date and take the average.
import pyspark.sql.functions as F
>>> spark.table("trn_dt").show()
+----+----------+-----+
|item| date|sales|
+----+----------+-----+
| 1|2016-01-03| 16.0|
| 1|2016-01-02| 15.0|
| 1|2016-01-05| 9.0|
| 1|2016-01-04| 10.0|
| 1|2016-01-01| 11.0|
| 1|2016-01-07| 10.0|
| 1|2016-01-06| 7.0|
+----+----------+-----+
(df_avg
    .withColumn("dates", F.array(  # build an array of dates
        df_avg["date"],
        F.date_add(df_avg["date"], 1),
        F.date_add(df_avg["date"], 2)))
    .select(
        F.col("item"),
        F.explode("dates").alias("ThreeDayAve"),  # triple the data
        F.col("sales"))
    .groupBy("item", "ThreeDayAve")
    .agg(F.avg("sales").alias("3_avg"))
    .show())
+----+-----------+------------------+
|item|ThreeDayAve| 3_avg|
+----+-----------+------------------+
| 1| 2016-01-05|11.666666666666666|
| 1| 2016-01-04|13.666666666666666|
| 1| 2016-01-07| 8.666666666666666|
| 1| 2016-01-01| 11.0|
| 1| 2016-01-03| 14.0|
| 1| 2016-01-02| 13.0|
| 1| 2016-01-09| 10.0|
| 1| 2016-01-06| 8.666666666666666|
| 1| 2016-01-08| 8.5|
+----+-----------+------------------+
You likely could use window on this but it wouldn't perform as well on large data sets.
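The answer's example averages three consecutive days, while the question asks for the last three occurrences of the same weekday. The same explode-then-group idea can be adapted by shifting each sale forward by 7, 14 and 21 days instead of 0, 1 and 2. Below is a plain-Python model of that logic for checking expected values on a small sample. `same_weekday_avg` is a hypothetical name, and it assumes "last three" excludes the current date, as in the question's example (4/28, 4/21 and 4/14 for 2022-5-5).

```python
from collections import defaultdict
from datetime import date, timedelta


def same_weekday_avg(rows, weeks=3):
    """rows: iterable of (item, day, sales).

    Returns {(item, target_day): mean of the sales recorded 1..weeks weeks earlier}.
    """
    buckets = defaultdict(list)
    for item, day, sales in rows:
        # "explode": each sale contributes to the same weekday in each of the next `weeks` weeks
        for k in range(1, weeks + 1):
            buckets[(item, day + timedelta(weeks=k))].append(sales)
    # "groupBy + avg": average whatever landed on each (item, target_day)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


rows = [
    (1, date(2022, 4, 14), 10.0),
    (1, date(2022, 4, 21), 20.0),
    (1, date(2022, 4, 28), 30.0),
]
result = same_weekday_avg(rows)
print(result[(1, date(2022, 5, 5))])   # 20.0
```

Dates with fewer than three earlier observations are averaged over what is available, which matches how the answer's output treats the dates at the edge of the data.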

PySpark - Filter dataframe columns based on list

I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
data1 = [("James", "Lee", "Smith","36636"),
         ("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
                      StructField("B",StringType(),True),
                      StructField("C",StringType(),True),
                      StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
| A | C | E |
+--------+------+------+
| James |Smith | |
|Michael |Boots | |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns: if a column is in df1.columns, select it; if it's not, use lit to add a null column with the proper alias.
You can write similar logic with a for loop if you find list comprehensions less readable.
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
| A| C| E|
+-------+-----+----+
| James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
Here is one way: use the DataFrame drop() method with a list which represents the symmetric difference between the DataFrame's current columns and your list of final columns.
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 1|0.1|
| 1| 2| 1|0.2|
| 3| 3| 3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
| a| c| d|
+---+---+---+
| 1| 1|0.1|
| 1| 1|0.2|
| 3| 3|0.3|
+---+---+---+
Which I believe is what you want.
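A quick check in plain Python of which names the symmetric difference passes to `drop()`, using the asker's column lists. The result contains `E`, which is not in the DataFrame; `drop()` is a no-op for names that are not in the schema, so the outcome matches a plain set difference. Note that neither variant creates the missing `E` column.

```python
current_cols = ["A", "B", "C", "D"]   # columns of df1 in the question
final_columns = ["A", "C", "E"]       # columns the asker wants

# Symmetric difference: names that appear in exactly one of the two lists.
sym_diff = set(final_columns) ^ set(current_cols)

# Plain difference: names in the DataFrame that are not wanted.
to_drop = set(current_cols) - set(final_columns)

# Wanted names that the DataFrame lacks; drop() cannot create these.
missing = [c for c in final_columns if c not in current_cols]

print(sorted(sym_diff))  # ['B', 'D', 'E']
print(sorted(to_drop))   # ['B', 'D']
print(missing)           # ['E']
```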
Based on your requirement, I have written some dynamic code. It selects columns based on the list provided and also creates a column of empty strings for any listed column that is not present in the source dataframe.
data1 = [("James", "Lee", "Smith","36636"),
("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
from pyspark.sql.functions import lit
actual_columns = df1.schema.names
final_columns = ['A','C','E']
def Diff(li1, li2):
    # columns requested in li2 that are absent from li1
    diff = list(set(li2) - set(li1))
    return diff
def Same(li1, li2):
    # columns present in both lists, in sorted order
    same = list(sorted(set(li1).intersection(li2)))
    return same
df1 = df1.select(*Same(actual_columns,final_columns))
for i in Diff(actual_columns,final_columns):
    df1 = df1.withColumn(i, lit(''))
display(df1)

Finding columns with Null values and write them in a new column per each record in Pyspark

I have a scenario, where I have to find columns with Null values per each record, and write all such column names into a separate column.
Example:
I have this DataFrame:
+---------+---+------------+-----------+------+-------+
|firstName|age|jobStartDate|isGraduated|gender| salary|
+---------+---+------------+-----------+------+-------+
| null|se3| 2006-01-01| 8| M| F|
| null| a3| null| True| F| null|
| Robert| 37| 1992-01-01| null| M|5000.50|
+---------+---+------------+-----------+------+-------+
Expected result should be like the one below:
+---------+---+------------+-----------+------+-------+----------------------+
|firstName|age|jobStartDate|isGraduated|gender| salary| Missing Columns|
+---------+---+------------+-----------+------+-------+----------------------+
| null|se3| 2006-01-01| 8| M| F| firstName|
| null| a3| 2006-01-02| True| F| null| firstName,salary|
| Robert| 37| 1992-01-01| null| M|5000.50| isGraduated|
+---------+---+------------+-----------+------+-------+----------------------+
I have written code which half meets my expected results:
def find_exceptions(df,mand_cols = ['firstName','jobStartDate','salary']):
    miss = "Missing: "
    for column in mand_cols:
        if df[column] is None:
            miss = miss + column + ","
    return miss
I am able to collect the missing values as list:
temp = sourceDF.rdd.map(find_exceptions)
temp.collect()
#result:
['Missing: firstName,', 'Missing: firstName,jobStartDate,salary,', 'Missing: ']
I am finding it difficult to actually write this into a new column. I am fairly new to Spark and would really appreciate if someone could help me with this.
You can do this in three steps.
Step 1: Create an array with one element per column. If a column's value is null, set the corresponding element to the column's name; otherwise leave the element null.
Step 2: Filter the nulls out of the array, leaving only column names.
Step 3: Concatenate the names into a comma-separated list.
df //step 1
  .withColumn("MissingColumns",
    array(
      when(col("firstName").isNull(),lit("firstName")),
      when(col("age").isNull(),lit("age")),
      when(col("jobStartDate").isNull(),lit("jobStartDate")),
      when(col("isGraduated").isNull(),lit("isGraduated")),
      when(col("gender").isNull(),lit("gender")),
      when(col("salary").isNull(),lit("salary"))
    )
  )
  //step 2
  .withColumn("MissingColumns",expr("filter(MissingColumns, c -> c IS NOT NULL)"))
  //step 3
  .withColumn("MissingColumns",concat_ws(",",col("MissingColumns")) )
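The snippet above uses Scala syntax (`//` comments). The three steps can be checked on individual records in plain Python before wiring them into Spark. This is only a model of the logic, assuming each row is represented as a dict; `missing_columns` is a hypothetical name. In PySpark, functions with the same names (`array`, `when`, `expr`, `concat_ws`) are available from `pyspark.sql.functions`.

```python
def missing_columns(row, columns=None):
    """Return a comma-separated list of the columns whose value is None in `row`."""
    columns = list(row) if columns is None else columns
    # Step 1: one slot per column, holding the column name if the value is null.
    slots = [name if row[name] is None else None for name in columns]
    # Step 2: drop the empty slots so only column names remain.
    names = [name for name in slots if name is not None]
    # Step 3: join the names into a comma-separated string.
    return ",".join(names)


rows = [
    {"firstName": None, "age": "se3", "jobStartDate": "2006-01-01",
     "isGraduated": "8", "gender": "M", "salary": "F"},
    {"firstName": None, "age": "a3", "jobStartDate": None,
     "isGraduated": "True", "gender": "F", "salary": None},
    {"firstName": "Robert", "age": "37", "jobStartDate": "1992-01-01",
     "isGraduated": None, "gender": "M", "salary": "5000.50"},
]
labels = [missing_columns(r) for r in rows]
print(labels)
# ['firstName', 'firstName,jobStartDate,salary', 'isGraduated']
```

Passing `columns=['firstName', 'jobStartDate', 'salary']` restricts the check to the mandatory columns from the question.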

PySpark DataFrame: Find closest value and slice the DataFrame

So, I've done enough research and haven't found a post that addresses what I want to do.
I have a PySpark DataFrame my_df which is sorted by value column-
+----+-----+
|name|value|
+----+-----+
| A| 30|
| B| 25|
| C| 20|
| D| 18|
| E| 18|
| F| 15|
| G| 10|
+----+-----+
The summation of all the counts in value column is equal to 136. I want to get all the rows whose combined values >= x% of 136. In this example, let's say x=80. Then target sum = 0.8*136 = 108.8. Hence, the new DataFrame will consist of all the rows that have a combined value >= 108.8.
In our example, this would come down to row D (since combined values up to D = 30+25+20+18 = 93).
However, the hard part is that I also want to include the immediately following rows with duplicate values. In this case, I also want to include row E since it has the same value as row D i.e. 18.
I want to slice my_df by giving a percentage x variable, for example 80 as discussed above. The new DataFrame should consist of the following rows-
+----+-----+
|name|value|
+----+-----+
| A| 30|
| B| 25|
| C| 20|
| D| 18|
| E| 18|
+----+-----+
One thing I could do here is iterate through the DataFrame (which is ~360k rows), but I guess that defeats the purpose of Spark.
Is there a concise function for what I want here?
Use pyspark SQL functions to do this concisely.
result = my_df.filter(my_df.value > target).select(my_df.name,my_df.value)
result.show()
Edit: Based on OP's question edit: compute the running sum and keep rows until the target value is reached. Note that this returns rows up to D and does not include E; including E seems like a strange requirement.
from pyspark.sql import Window
from pyspark.sql import functions as f
# Total sum of all `values`
target = my_df.agg(f.sum("value")).collect()[0][0]
w = Window.orderBy(my_df.name)  # Ideally this should be a column that specifies ordering among rows
running_sum_df = my_df.withColumn('rsum', f.sum(my_df.value).over(w))
running_sum_df.filter(running_sum_df.rsum <= 0.8*target).show()
Your requirements are quite strict, so it's difficult to formulate an efficient solution to your problem. Nevertheless, here is one approach:
First calculate the cumulative sum and the total sum for the value column and filter the DataFrame using the percentage of target condition you specified. Let's call this result df_filtered:
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.orderBy(f.col("value").desc(), "name").rangeBetween(Window.unboundedPreceding, 0)
target = 0.8
df_filtered = df.withColumn("cum_sum", f.sum("value").over(w))\
    .withColumn("total_sum", f.sum("value").over(Window.partitionBy()))\
    .where(f.col("cum_sum") <= f.col("total_sum")*target)
df_filtered.show()
#+----+-----+-------+---------+
#|name|value|cum_sum|total_sum|
#+----+-----+-------+---------+
#| A| 30| 30| 136|
#| B| 25| 55| 136|
#| C| 20| 75| 136|
#| D| 18| 93| 136|
#+----+-----+-------+---------+
Then join this filtered DataFrame back on the original on the value column. Since your DataFrame is already sorted by value, the final output will contain the rows you want.
df.alias("r")\
.join(
df_filtered.alias('l'),
on="value"
).select("r.name", "r.value").sort(f.col("value").desc(), "name").show()
#+----+-----+
#|name|value|
#+----+-----+
#| A| 30|
#| B| 25|
#| C| 20|
#| D| 18|
#| E| 18|
#+----+-----+
The total_sum and cum_sum columns are calculated using a Window function.
The Window w orders on the value column descending, followed by the name column. The name column is used to break ties: without it, rows D and E would both have the same cumulative sum of 111 = 75+18+18, and you'd incorrectly lose both of them in the filter.
w = (
    Window  # Define Window
    .orderBy(  # This will define ordering
        f.col("value").desc(),  # First sort by value descending
        "name"  # Sort on name second
    )
    .rangeBetween(Window.unboundedPreceding, 0)  # Extend back to beginning of window
)
The rangeBetween(Window.unboundedPreceding, 0) specifies that the Window should include all rows from the start up to and including the current row (as defined by the orderBy). This is what makes it a cumulative sum.
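The arithmetic behind this answer can be verified on the sample data in plain Python. The sketch mirrors the two stages: filter on the running total, then pull back in any rows that share a kept value. `slice_by_share` is a hypothetical name, and this is a model for checking expected output rather than a replacement for the Spark code.

```python
from itertools import accumulate

rows = [("A", 30), ("B", 25), ("C", 20), ("D", 18), ("E", 18), ("F", 15), ("G", 10)]


def slice_by_share(rows, share):
    """Keep rows while the running total stays within `share` of the grand total,
    then also keep rows whose value ties with a kept row."""
    # Same ordering as the window: value descending, name as tie-breaker.
    ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
    total = sum(value for _, value in ordered)
    running = list(accumulate(value for _, value in ordered))
    # Stage 1: values of the rows that pass the cumulative-sum filter.
    kept_values = {value for (_, value), cum in zip(ordered, running) if cum <= share * total}
    # Stage 2: the join on value brings back ties such as row E.
    return [r for r in ordered if r[1] in kept_values]


selected = slice_by_share(rows, 0.8)
print([name for name, _ in selected])
# ['A', 'B', 'C', 'D', 'E']
```

With a total of 136 the threshold is 108.8; the running totals are 30, 55, 75, 93, 111, 126, 136, so the filter stops after D and the tie on 18 restores E.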

Select a range in Pyspark

I have a spark dataframe in python. And, it was sorted based on a column. How can I select a specific range of data (for example 50% of data in the middle)? For example, if I have 1M data, I want to take data from 250K to 750K index. How can I do that without using collect in pyspark?
To be more precise, I want something like take function to get results between a range. For example, something like take(250000, 750000).
Here is one way to select a range in a pyspark DF:
Create DF
df = spark.createDataFrame(
    data = [(10, "2018-01-01"), (22, "2017-01-01"), (13, "2014-01-01"), (4, "2015-01-01"),
            (35, "2013-01-01"), (26, "2016-01-01"), (7, "2012-01-01"), (18, "2011-01-01")],
    schema = ["amount", "date"]
)
df.show()
+------+----------+
|amount| date|
+------+----------+
| 10|2018-01-01|
| 22|2017-01-01|
| 13|2014-01-01|
| 4|2015-01-01|
| 35|2013-01-01|
| 26|2016-01-01|
| 7|2012-01-01|
| 18|2011-01-01|
+------+----------+
Sort (on date) and insert index (based on row number)
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window.orderBy("date")
df = df.withColumn("index", F.row_number().over(w))
df.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
| 18|2011-01-01| 1|
| 7|2012-01-01| 2|
| 35|2013-01-01| 3|
| 13|2014-01-01| 4|
| 4|2015-01-01| 5|
| 26|2016-01-01| 6|
| 22|2017-01-01| 7|
| 10|2018-01-01| 8|
+------+----------+-----+
Get The Required Range (assume want everything between rows 3 and 6)
df1=df.filter(df.index.between(3, 6))
df1.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
| 35|2013-01-01| 3|
| 13|2014-01-01| 4|
| 4|2015-01-01| 5|
| 26|2016-01-01| 6|
+------+----------+-----+
This is very simple using between. For example, assuming your index column is named index:
df_sample = df.filter(df.index.between(250000, 750000))
Once you create the new dataframe df_sample, you can perform any operation on it (including take or collect) as needed.
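To turn "the middle 50%" into concrete bounds for `between`, the row count is needed first (in Spark, `df.count()`). Below is a small sketch of the arithmetic, assuming a 1-based index such as the one produced by `row_number()`; `middle_range` is a hypothetical helper name.

```python
def middle_range(n_rows, fraction=0.5):
    """Return inclusive 1-based (low, high) bounds for the middle `fraction` of n_rows rows."""
    keep = int(n_rows * fraction)   # how many rows to keep
    skip = (n_rows - keep) // 2     # rows to skip at the start
    return skip + 1, skip + keep


print(middle_range(8))          # (3, 6)
print(middle_range(1_000_000))  # (250001, 750000)
```

Because `between` is inclusive at both ends, the bounds for one million rows are 250001 and 750000, which selects exactly 500000 rows.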
