Pyspark efficiently create patterns within each window - python

I want to create a base dataframe from the existing one, which does not contain everything I want. For example, I have a dataframe recording the number of candies each person (tracked by "id") bought in each year-month, but each person did not buy candies every month:
id  year_month  num_of_candies_bought
1   2022-01     5
1   2022-03     10
1   2022-04     2
What I want is to track them for a fixed set of year-months I'm interested in, like this (for the first 5 months of this year):
id  year_month  num_of_candies_bought
1   2022-01     5
1   2022-02     0
1   2022-03     10
1   2022-04     2
1   2022-05     0
I think one way to do this is to use crossJoin, but it turns out that this takes a long time to process. Is there any way to do this without a join? In my work the first dataframe is very, very large (a million rows, for instance) while the second is fixed (like in this case, only 5 rows) and much, much smaller. If a crossJoin really is needed, is it possible to improve its performance drastically?
P.S. I want this done separately for each person (so I expect to need the Window.partitionBy thing).
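For reference, the crossJoin route mentioned above would look roughly like this. This is just a sketch (the names months, spark and the reuse of df are assumptions, not from the post); broadcasting the tiny month dataframe is usually what keeps such a join cheap:
from pyspark.sql import functions as F
# small, fixed list of months (5 rows) - broadcast it so the join stays map-side
months = spark.createDataFrame(
    [(m,) for m in ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]],
    ["year_month"])
base = df.select("id").distinct().crossJoin(F.broadcast(months))
# left-join the real purchases back in and fill the missing months with 0
result = (base.join(df, ["id", "year_month"], "left")
              .fillna(0, subset=["num_of_candies_bought"]))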

I'd simply add a 0 (zero) line for each id and each year_month.
Let's assume df is your dataframe.
from pyspark.sql import functions as F
# generate a list of all year_month you need
year_month = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]
df_id = (
    df.select("id")
    .distinct()
    .withColumn("num_of_candies_bought", F.lit(0))
    .withColumn("year_month", F.explode(F.array(*map(F.lit, year_month))))
)
df = (
    df.unionByName(df_id)
    .groupBy("id", "year_month")
    .agg(F.sum("num_of_candies_bought").alias("num_of_candies_bought"))
)
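On the sample data above, ordering the result by id and year_month should give back the table the question asks for. A quick sanity check (output shown approximately, assuming the sample input from the question):
df.orderBy("id", "year_month").show()
# id  year_month  num_of_candies_bought
# 1   2022-01     5
# 1   2022-02     0
# 1   2022-03     10
# 1   2022-04     2
# 1   2022-05     0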

Related

Count occurrences in column based on another column (date)

I am trying to count the number of "Type" occurrences by what month they are in.
Daily data is given, so to group by month I tried using .resample(), but the problem is that it combines all the strings into one LONG string, and then I can't count the number of occurrences using str.count() because it returns the wrong value (it finds too many matches since it isn't looking for the EXACT pattern).
I think it has to be done in more than one step...
I have tried SO many things... I even heard there is a pivot table?
Sample data:
Type  Date
Cat   2020-01-01
Cat   2020-01-01
Bird  2020-01-01
Dog   2020-01-01
Cat   2020-02-01
Cat   2020-03-01
Bird  2020-03-01
Cat   2020-05-02
... For all the months over a few years...
I want it converted to the following format (the column headers can be in numeric form as well):
      January 2020  February 2020
Cat   4             1
Bird  1             0
Dog   1             0
As far as I know, pandas does not have a single standard function for this, but the code snippet below gets your desired result.
If you do not mind using extra packages, there exist some packages which you can use for quicker/easier binary encoding (e.g. category_encoder).
import pandas as pd
# your data in dictionary format
d = {
    "Type": ["Cat", "Cat", "Bird", "Dog", "Cat", "Cat", "Bird", "Cat"],
    "Date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-03-01", "2020-05-02"]
}
# create a dataframe with the dates as index
df = pd.DataFrame(data=d['Type'], index=pd.to_datetime(d['Date']))
animals = list(df[0].unique())     # a list containing all unique animals
ndf = pd.DataFrame(index=animals)  # empty new dataframe with all animals as index
for animal in animals:
    ndf.loc[animal, df.index.month.unique()] = (  # at row = animal, insert all unique months
        (df == animal).groupby(df.index.month)    # group by month number (.month returns 1 for Jan)
        .sum()                                    # sum works because of the boolean comparison
        .transpose()                              # transpose due to desired output format
        .values                                   # array of values to insert
    )
# convert column names back to datetime and save as string in the desired format
ndf.columns = pd.to_datetime(ndf.columns, format='%m').strftime('%B 2020')
Result
      January 2020  February 2020  March 2020  May 2020
Cat   2             1              1           1
Bird  1             0              1           0
Dog   1             0              0           0
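For what it's worth, a shorter route (not part of the original answer, just a sketch) is pd.crosstab, which is essentially the "pivot table" the question hints at; it counts Type against the month period directly, reusing the dictionary d from above:
s = pd.Series(d['Type'], index=pd.to_datetime(d['Date']))
out = pd.crosstab(s.values, s.index.to_period('M'))       # rows: animals, columns: monthly periods, values: counts
out.columns = [p.strftime('%B %Y') for p in out.columns]   # e.g. 'January 2020'
Note that, like the loop above, this only produces columns for months that actually appear in the data.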

Find first unique items selected by user and rank them in order of user selection by date

I am trying to identify only the first orders of unique "items" purchased by "test" customers in a simplified sample dataframe, created below:
df=pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
"date":['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016','03/01/2016',
'04/30/2016', '05/16/2016','09/27/2016', '04/20/2016','04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016' ],
"item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21',
'BG10A', 'CG10BA', 'BG10A']
})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would have all unique items ranked by date of purchase in a new column called "cust_item_rank", with any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should share the same rank; for customer A55, A10BABA and A10DBDB should both be ranked 1.
I have spent a fair bit of time on combinations of groupby and/or rank operations but have been unsuccessful thus far. For example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
         .reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95 and called on the rank function in pandas as follows:
# remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
         .reset_index(drop=True))
# rank by date after grouping by customer
df2["cust_item_rank"] = df2.groupby(["cust"])["date"].rank(ascending=1, method='dense').astype(int)
This produced the desired output. It appears that this problem can be solved using either the "min" or the "dense" ranking method, but I chose "dense" to avoid potentially skipping any rank.
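On the sample dataframe above, the resulting frame would look roughly like this (note that both A55 items bought on 2016-01-11 now share rank 1, which is the behaviour asked for):
   cust       date     item  cust_item_rank
0   A55 2016-01-11  A10BABA               1
1   A55 2016-01-11  A10DBDB               1
2  A987 2016-01-29    BG10A               1
3  A987 2016-10-17   CG10BA               2
4  B080 2016-06-17    A11AD               1
5  B080 2016-08-17   A9GABA               2
6  C019 2016-04-20     CBA1               1
7  D900 2016-03-01    G198A               1
8  D900 2016-05-16     F673               2
9  D900 2016-09-27    A11BB               3
10 Z09c 2016-07-07     DA21               1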

How to resample/reindex/groupby a time series based on a column's data?

So I've got a pandas data frame that contains two water-use values at a 1-second resolution. The values are "hotIn" and "hotOut". The hotIn can record down to a tenth of a gallon at a one-second resolution, while the hotOut records whole-number pulses representing a gallon, i.e. when a pulse occurs, one gallon has passed through the meter. The pulses occur roughly every 14-15 seconds.
Data looks roughly like this:
Index hotIn(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:00 4 0
2019-03-23T00:00:01 5 0
2019-03-23T00:00:02 4 0
2019-03-23T00:00:03 4 0
2019-03-23T00:00:04 3 0
2019-03-23T00:00:05 4 1
2019-03-23T00:00:06 4 0
2019-03-23T00:00:07 5 0
2019-03-23T00:00:08 3 0
2019-03-23T00:00:09 3 0
2019-03-23T00:00:10 4 0
2019-03-23T00:00:11 4 0
2019-03-23T00:00:12 5 0
2019-03-23T00:00:13 5 1
What I'm trying to do is resample or reindex the data frame based on the occurrence of pulses and sum the hotIn between the new timestamps.
For example, sum the hotIn between 00:00:00 - 00:00:05 and 00:00:06 - 00:00:13.
Results would ideally look like this:
Index hotIn sum(gpm) hotOut(pulse=1gal)
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 32 1
I've explored using a two-step for/elif loop that just checks whether hotOut == 1; it works, but it's painfully slow on large datasets. I'm positive the timestamp functionality of pandas will be superior, if this is possible.
I also can't simply resample at a set frequency, because the interval between pulses varies, so a general resample rule would not work. I've also run into problems with mismatched data frame lengths when pulling out the timestamps associated with pulses and applying them to the main frame as a new index.
IIUC, you can do:
s = df['hotOut(pulse=1gal)'].shift().ne(0).cumsum()
(df.groupby(s)
   .agg({'Index': 'last', 'hotIn(gpm)': 'sum'})
   .reset_index(drop=True)
)
Output:
Index hotIn(gpm)
0 2019-03-23T00:00:05 24
1 2019-03-23T00:00:13 33
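To see why this key groups the rows correctly, here is a sketch of the intermediate values of s on the sample data (not part of the original answer): shift() pushes each pulse down one row, so the counter only increments on the row after a pulse, which makes every group end exactly on a pulse row.
# hotOut:          0  0  0  0  0  1  0  0  0  0  0  0  0  1
# .shift():      NaN  0  0  0  0  0  1  0  0  0  0  0  0  0
# .ne(0):          T  F  F  F  F  F  T  F  F  F  F  F  F  F   (NaN != 0 is True)
# .cumsum() = s:   1  1  1  1  1  1  2  2  2  2  2  2  2  2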
You don't want to group on the Index. You want to group whenever 'hotOut(pulse=1gal)' changes.
s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill()
(df.reset_index()
   .groupby(s, as_index=False)
   .agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
   .set_index('Index'))
hotIn(gpm) hotOut(pulse=1gal)
Index
2019-03-23T00:00:05 24 1
2019-03-23T00:00:13 33 1

Divide 2 columns and create new column with results

I have a data frame with columns:
User_id PQ_played PQ_offered
1 5 15
2 12 75
3 25 50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
PQ_offered PQ_played %_PQ_played
User_id
1 15 5 33.333333
2 75 12 16.000000
3 50 25 50.000000
You can use lambda functions
df.groupby('User_id').apply(lambda x: (x['PQ_played'] / x['PQ_offered']) * 100)\
  .reset_index(1, drop=True).reset_index().rename(columns={0: '%_PQ_played'})
You get
User_id %_PQ_played
0 1 33.333333
1 2 16.000000
2 3 50.000000
I totally agree with @mVChr and think you are overcomplicating what you need to do. If you are simply trying to add an additional column, then his response is spot on. If you truly need to groupby, it is worth noting that groupby is typically used for aggregation, e.g. sum(), count(), etc. If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know the average number of games played of the games offered for each user, you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
User_id %_PQ_played
0 1 52.777778
1 2 29.250000
2 3 65.000000

Incremental spend in 6 weeks for two groups using pandas

I have Excel data with the following information:
df.head()
User_id Group Week Spend Purchases Group
170309867 Test 2014-10-13 794.66 2 Test-NonRed
57954586 Test 2014-10-13 55.99 1 Test-Red
118068583 Test 2014-10-13 40.87 1 Test-NonRed
516478393 Test 2014-10-13 17.5 2 Test-NonRed
457873235 Test 2014-10-13 20.44 1 Test-Red
From the above information, I need to calculate the incremental spend in the six weeks for the total Test group (test-Red and test-NonRed) vs. control. I need it in absolute ($) and % terms.
I have tried pandas as,
df2= df.groupby(by=['Group','Week']).sum().abs().groupby(level=[0]).cumsum()
And I have the following result,
df2.head()
And then I calculated the sum for each group as,
df2.groupby(by=['group2']).sum()
df2.head()
I would like to have them (the incremental spend) as an absolute value, which I tried with abs(), and I also need it as a percentage. Any help would be much appreciated.
The expected result is the incremental spend over the six weeks for the total Test group (Test-Red and Test-NonRed) vs. Control, in absolute spend and then as a percentage. Something like this:
Group incremental_spend incremental_%
Control 11450175 #%
test-NonRed 50288158 #%
test-Red 12043938 #%
So my real questions:
1. Is the above-mentioned approach the right way to calculate the incremental spend per Group over the 6 weeks from the Week column on Spend?
2. Also, I need all my results in absolute counts and absolute %.
I think there are several problems here which make your question difficult to answer.
Vocabulary
What you describe as "Incremental spend" is just the sum.
What you do in two steps is the sum of the cumulative sum .cumsum().sum(), which is not right.
Also I am not sure whether you need abs, which gives the absolute value (abs(-1) gives 1) and will thus only have an effect if there are negative values in your data.
Unfortunately the sample dataset is not large enough to get a conclusion.
Dataset
Your dataset has two columns with the identical name Group, which is error-prone.
Missing information
You want to get final values (sums) as a ratio (%), but you do not indicate what the reference value for this ratio is.
Is it the sum of Spend for the Control group?
Potential solution
>>> df # Sample dataframe with one entry as 'Control' group
Out[]:
User_id Group Week Spend Purchases Group.1
0 170309867 Test 2014-10-13 794.66 2 Test-NonRed
1 57954586 Test 2014-10-13 55.99 1 Test-Red
2 118068583 Test 2014-10-13 40.87 1 Test-NonRed
3 516478393 Test 2014-10-13 17.50 2 Control
4 457873235 Test 2014-10-13 20.44 1 Test-Red
df2 = pd.DataFrame(df.groupby('Group.1').Spend.sum()) # Get 'Spend' sum for each group
>>> df2
Out[]:
Spend
Group.1
Control 17.50
Test-NonRed 835.53
Test-Red 76.43
control_spend_total = df2.loc['Control'].values # Get total spend for 'Control' group
>>> control_spend_total
Out[]: array([ 17.5])
df2['Spend_%'] = df2.Spend / control_spend_total * 100 # Add 'Spend_ratio' column
>>> df2
Out[]:
Spend Spend_%
Group.1
Control 17.50 100.000000
Test-NonRed 835.53 4774.457143
Test-Red 76.43 436.742857
Does it look like what you want?
