User-defined expanding window in Pandas - python

Is there a way to modify the expanding windows in Pandas? For example, consider a random DF:
df = pd.DataFrame(np.random.randint(0,60,size=(10,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Apple","Mango","Mango","Apple","Mango","Apple","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon","Orange","lemon","Orange"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df
   a   b   c     d1      d2       date
0  16  25  37  Apple  Orange 2002-01-01
1  24  41  32  Mango   lemon 2002-01-01
2  41  20  53  Apple   lemon 2002-01-01
3   4  28  47  Apple  Orange 2002-01-01
4   7  29  10  Mango   lemon 2002-01-01
5   6  54  15  Mango  Orange 2002-01-01
6  26  54  35  Apple   lemon 2002-01-01
7  31   4  12  Mango  Orange 2002-02-01
8  33  36  54  Apple   lemon 2002-02-01
9  50  22  48  Apple  Orange 2002-02-01
When I do df.expanding(1).apply(), it applies the function to a window that expands by one row at a time. Is it possible to pass the date column to the expanding function so that, instead of every row forming a new window, it accumulates groups of rows based on date?
Existing expanding window:
window 1: 0 16 25 37 Apple Orange 2002-01-01
window 2: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
window 3: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
Expected expanding window:
window 1 (all rows for date "2002-01-01"):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
window 2 (all rows for date "2002-01-01" and "2002-02-01" ):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
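For clarity, the grouping being asked for can be sketched with a plain loop over cumulative date groups (a minimal sketch on data shaped like the example, with a fixed seed so it runs standalone):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 60, size=(10, 3)), columns=["a", "b", "c"])
df["date"] = pd.to_datetime(["2002-01-01"] * 7 + ["2002-02-01"] * 3)

# Each "window" is every row up to and including one distinct date.
sizes = []
for d in df["date"].unique():
    window = df[df["date"] <= d]
    sizes.append(len(window))
print(sizes)  # [7, 10]
```

This makes the intent concrete: window 1 covers the seven 2002-01-01 rows, window 2 all ten rows.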

Assuming df.date is sorted.
I suppose there is a more efficient way to calculate end; if you find a better solution, please let me know.
For another usage of BaseIndexer and possible problems, see Pandas custom window rolling.
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple

class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None
                          ) -> Tuple[np.ndarray, np.ndarray]:
        lst = []
        prev_date = self.custom_name[0]
        for i, date in enumerate(self.custom_name[1:], 1):
            if prev_date != date:
                lst.append(i)
                prev_date = date
        lst.append(len(df))
        end = np.array(lst)
        start = np.zeros_like(end)
        return start, end

indexer = CustomIndexer(custom_name=df.date)

for window in df.rolling(indexer):
    print(window)
Outputs:
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
7 17 17 37 Mango Orange 2002-02-01
8 5 53 0 Apple lemon 2002-02-01
9 16 10 24 Apple Orange 2002-02-01
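On the open question of computing end more efficiently: a hedged sketch of a vectorized alternative using np.unique, still assuming the date column is sorted. np.unique returns the first index of each distinct date; dropping the first of those and appending the total length gives each window's exclusive end.

```python
import numpy as np
import pandas as pd

# Stand-in for df.date from the question: 7 rows of one date, 3 of another.
dates = pd.to_datetime(["2002-01-01"] * 7 + ["2002-02-01"] * 3)

_, first_idx = np.unique(dates, return_index=True)  # first row of each date
end = np.append(first_idx[1:], len(dates))          # exclusive window ends
start = np.zeros_like(end)                          # expanding: always start at 0
print(start, end)  # [0 0] [ 7 10]
```

These start/end arrays match what the loop in get_window_bounds produces, so they could replace its body.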


Efficient mean and total aggregation over multiple Pandas DataFrame columns

Suppose I have a DataFrame that looks something like this:
    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58
Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)
import pandas as pd

df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (ie total) spent on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:
id: 4 total: 70
id: 2 total: 11 + 10 + 58 = 79
So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.
The second column I want to add is the same but for the mean annual count for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:
import math

combos = [[i, j, k] for i in set(df["country"])
                    for j in set(df["grade"])
                    for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"] == c[0]) & (df["grade"] == c[1]) & (df["category"] == c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos, columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping, on=["country","grade","category"], how="left")
Which gives the desired output:
    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52         52            1
1    5      fr     b   banana      68        167            2
2    7      fr     a   banana      73         73            1
3    4      it     c    mango      70         74.5          2
4    5      fr     b   banana      99        167            2
5    9      uk     a    apple      29         52.5          1
6    3      uk     a    mango      83         97.5          1.5
7    0      uk     b   banana      59         59            1
8    2      it     c    mango      11         74.5          2
9    9      uk     a   banana      91        108            2
10   0      uk     b    mango      95         95            1
11   8      uk     a    mango      30         97.5          1.5
12   3      uk     a    mango      82         97.5          1.5
13   1      it     b   banana      78         78            1
14   3      uk     a    apple      76         52.5          1
15   6      it     c    apple      76         76            1
16   2      it     c    mango      10         74.5          2
17   1      it     b    mango      30         30            1
18   9      uk     a   banana      17        108            2
19   2      it     c    mango      58         74.5          2
The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
You can create mean_totals column as follows:
mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)
which gives
0 7 fr a mango 52 52.0
1 5 fr b banana 68 167.0
2 7 fr a banana 73 73.0
3 4 it c mango 70 74.5
4 5 fr b banana 99 167.0
5 9 uk a apple 29 52.5
6 3 uk a mango 83 97.5
7 0 uk b banana 59 59.0
8 2 it c mango 11 74.5
9 9 uk a banana 91 108.0
10 0 uk b mango 95 95.0
11 8 uk a mango 30 97.5
12 3 uk a mango 82 97.5
13 1 it b banana 78 78.0
14 3 uk a apple 76 52.5
15 6 it c apple 76 76.0
16 2 it c mango 10 74.5
17 1 it b mango 30 30.0
18 9 uk a banana 17 108.0
19 2 it c mango 58 74.5
It looks like you need a double groupby. Once for the sum, once for the mean:
out = (df
.groupby(['country', 'grade', 'category', 'id']).sum()
.groupby(['country', 'grade', 'category']).mean()
)
output:
amount
country grade category
fr a banana 73.0
mango 52.0
b banana 167.0
it b banana 78.0
mango 30.0
c apple 76.0
mango 74.5
uk a apple 52.5
banana 108.0
mango 97.5
b banana 59.0
mango 95.0
I hope this will work fast: first group and compute the required details, then merge with the existing df.
import pandas as pd

df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

intermediate_df = (
    df.groupby(by=['country','grade','category','id'], as_index=False)
      .agg(int_totals=pd.NamedAgg(column='amount', aggfunc='sum'),
           int_counts=pd.NamedAgg(column='id', aggfunc='count'))
      .groupby(by=['country','grade','category'], as_index=False)
      .agg(mean_totals=pd.NamedAgg(column='int_totals', aggfunc='mean'),
           mean_counts=pd.NamedAgg(column='int_counts', aggfunc='mean'))
)
output_df = pd.merge(df, intermediate_df,
                     left_on=['country','grade','category'],
                     right_on=['country','grade','category'],
                     how='left')
print(output_df)

How to get the sum of the most recent N values (or all available values) of each item with respect to a given cut-off date?

To be honest, I am very new to programming and have learned some of pandas' basic functionality.
I am able to group by and sum the price of each item, but not able to apply a cut-off date before the summation.
Below are my input data and expected result. Requesting help on how to achieve this using pandas.
In the data below, N=5 (the number of values to consider before the cut-off date). The expected result for item Grape is 88, i.e. the sum of entries 7, 6, 5, 4, and 3. And Orange is 90 (entries 13, 12, 11, 10); only 4 entries are available there, so all are considered.
EntryDate Itemname Price Cut off date Expected result
1 3/9/2020 Grape 16 3/15/2020 88
2 3/10/2020 Grape 15 3/15/2020 88
3 3/11/2020 Grape 12 3/15/2020 88
4 3/12/2020 Grape 18 3/15/2020 88
5 3/13/2020 Grape 20 3/15/2020 88
6 3/13/2020 Grape 18 3/15/2020 88
7 3/14/2020 Grape 20 3/15/2020 88
8 3/15/2020 Grape 12 3/15/2020 88
9 3/16/2020 Grape 19 3/15/2020 88
10 2/10/2020 Orange 22 2/17/2020 90
11 2/11/2020 Orange 21 2/17/2020 90
12 2/12/2020 Orange 26 2/17/2020 90
13 2/13/2020 Orange 21 2/17/2020 90
14 2/20/2020 Orange 26 2/17/2020 90
First convert the columns to datetimes, then filter rows before the cut-off date with Series.lt in boolean indexing, aggregate the sum of the last N values per group with Series.tail inside a lambda, and finally map the result to a new column with Series.map:
N = 5
df['Date'] = pd.to_datetime(df['Date'])
df['Cut off date'] = pd.to_datetime(df['Cut off date'])

s = (df[df['Date'].lt(df['Cut off date'])]
       .groupby('Itemname')['Price']
       .agg(lambda x: x.tail(N).sum()))

df['new'] = df['Itemname'].map(s)
print(df)
Entry Date Itemname Price Cut off date Expected result new
0 1 2020-03-09 Grape 16 2020-03-15 88 88
1 2 2020-03-10 Grape 15 2020-03-15 88 88
2 3 2020-03-11 Grape 12 2020-03-15 88 88
3 4 2020-03-12 Grape 18 2020-03-15 88 88
4 5 2020-03-13 Grape 20 2020-03-15 88 88
5 6 2020-03-13 Grape 18 2020-03-15 88 88
6 7 2020-03-14 Grape 20 2020-03-15 88 88
7 8 2020-03-15 Grape 12 2020-03-15 88 88
8 9 2020-03-16 Grape 19 2020-03-15 88 88
9 10 2020-02-10 Orange 22 2020-02-17 90 90
10 11 2020-02-11 Orange 21 2020-02-17 90 90
11 12 2020-02-12 Orange 26 2020-02-17 90 90
12 13 2020-02-13 Orange 21 2020-02-17 90 90
13 14 2020-02-20 Orange 26 2020-02-17 90 90
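One caveat worth sketching: tail(N) picks the last N rows in their current order, so if the rows are not already sorted by date within each item, an explicit sort_values('Date') is needed first. A self-contained sketch on the question's data:

```python
import pandas as pd

N = 5
df = pd.DataFrame({
    "Itemname": ["Grape"] * 9 + ["Orange"] * 5,
    "Date": pd.to_datetime(
        ["3/9/2020", "3/10/2020", "3/11/2020", "3/12/2020", "3/13/2020",
         "3/13/2020", "3/14/2020", "3/15/2020", "3/16/2020",
         "2/10/2020", "2/11/2020", "2/12/2020", "2/13/2020", "2/20/2020"]),
    "Price": [16, 15, 12, 18, 20, 18, 20, 12, 19, 22, 21, 26, 21, 26],
    "Cut off date": pd.to_datetime(["3/15/2020"] * 9 + ["2/17/2020"] * 5),
})

s = (df[df["Date"].lt(df["Cut off date"])]
       .sort_values("Date")                    # ensure tail(N) = most recent N
       .groupby("Itemname")["Price"]
       .agg(lambda x: x.tail(N).sum()))
df["new"] = df["Itemname"].map(s)
print(s)  # Grape 88, Orange 90
```

The sort is a no-op on this sample (it is already ordered), but it makes the approach safe on unsorted data.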

Pandas - return all rows with identical values in one column and small differences in another column

I have some DataFrame:
df = pd.DataFrame({'fruit': ['apple', 'apple', 'apple', 'pear', 'pear', 'pear',
                             'mango', 'mango', 'mango', 'peach', 'peach', 'peach',
                             'plum', 'plum', 'plum'],
                   'region': [5,5,5,7,7,7,2,2,2,2,2,2,2,2,2],
                   'location': [75000,75000,75000,250,250,250,
                                48897467,48897467,48897467,
                                48897629,48897629,48897629,
                                500000000,500000000,500000000],
                   'unique': np.random.randint(100, size=15)})
fruit region location unique
0 apple 5 75000 51
1 apple 5 75000 1
2 apple 5 75000 44
3 pear 7 250 36
4 pear 7 250 86
5 pear 7 250 99
6 mango 2 48897467 40
7 mango 2 48897467 12
8 mango 2 48897467 33
9 peach 2 48897629 23
10 peach 2 48897629 85
11 peach 2 48897629 65
12 plum 2 500000000 46
13 plum 2 500000000 87
14 plum 2 500000000 42
I'd like to select all rows of different 'fruit' with identical values in the 'region' column and a difference of less than 1000 in the 'location' column.
So, in this example, I'd like to return:
fruit region location unique
6 mango 2 48897467 40
7 mango 2 48897467 12
8 mango 2 48897467 33
9 peach 2 48897629 23
10 peach 2 48897629 85
11 peach 2 48897629 65
I've tried something like:
df.groupby('region')['location'].diff()
But this isn't exactly what I'm trying to do.
You can do it this way
a = df.groupby('region')['location'].transform(lambda x: x.max()-x.min())
b = df.groupby('region')['fruit'].transform('nunique')
df.loc[(a<=1000) & (b>1)]
output
fruit region location unique
6 mango 2 9000 7
7 mango 2 9000 98
8 mango 2 9000 92
9 peach 2 8800 34
10 peach 2 8800 17
11 peach 2 8800 15
I am adding this as a new answer since the previous answer is useful if someone wants the whole group instead of a portion of the group, as you want now.
For your latest need, you can do as below
def func(x):
    return x - x.iloc[0]

a = df.groupby('region')['location'].apply(func)
b = df.groupby('region')['fruit'].transform('nunique')
df.loc[(a<=1000) & (b>1)]
This will only work if the column location is sorted ascending (make sure to sort the df by region & location before you start).
output
fruit region location unique
6 mango 2 48897467 79
7 mango 2 48897467 62
8 mango 2 48897467 68
9 peach 2 48897629 71
10 peach 2 48897629 64
11 peach 2 48897629 69
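An alternative sketch (not from either answer above, and the cluster name is my own): after sorting, start a new cluster whenever the gap between consecutive locations within a region exceeds 1000, then keep only clusters containing more than one distinct fruit. This avoids anchoring every comparison to the group's first row.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple']*3 + ['pear']*3 + ['mango']*3 + ['peach']*3 + ['plum']*3,
    'region': [5]*3 + [7]*3 + [2]*9,
    'location': [75000]*3 + [250]*3 + [48897467]*3 + [48897629]*3 + [500000000]*3,
})

# Sort so consecutive rows within a region are comparable, then start a new
# cluster whenever the gap to the previous row exceeds 1000.
df = df.sort_values(['region', 'location'])
gap = df.groupby('region')['location'].diff().gt(1000)
cluster = gap.cumsum().rename('cluster')

# Keep clusters that contain more than one distinct fruit.
keep = df.groupby(['region', cluster])['fruit'].transform('nunique').gt(1)
result = df[keep]
```

On this data, mango (48897467) and peach (48897629) fall into one cluster in region 2 and survive, while plum, apple, and pear are dropped.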

Repeat columns as rows in python?

Fruit January Shipments January Sales February Shipments February Sales
------------ ------------------- --------------- -------------------- ----------------
Apple 30 11 18 31
Banana 12 49 39 14
Pear 25 50 44 21
Kiwi 41 25 10 25
Strawberry 11 33 35 50
I'm trying to achieve the following result:
Fruit Month Shipments Sales
------------ ---------- ----------- -------
Apple January 30 11
Banana January 12 49
Pear January 25 50
Kiwi January 41 25
Strawberry January 11 33
Apple February 18 31
Banana February 39 14
Pear February 44 21
Kiwi February 10 25
Strawberry February 35 50
I've tried pandas.pivot and pandas.pivot_table and had no luck. I'm in the process of creating two dataframes (Fruit/Month/Shipments) and (Fruit/Month/Sales), and concatenating the two into one with a loop, but I was hoping for a easier way to do this.
One way is to modify the columns into a MultiIndex and then use stack. Let's suppose your dataframe is called df. First set the column Fruit as index, then define the multi-level columns:
df = df.set_index('Fruit')
# manual way to create the multiindex columns
#df.columns = pd.MultiIndex.from_product([['January','February'],
# ['Shipments','Sales']], names=['Month',None])
# more general way to create the multiindex columns thanks to #Scott Boston
df.columns = df.columns.str.split(expand=True)
df.columns.names = ['Month',None]
your data looks like:
Month January February
Shipments Sales Shipments Sales
Fruit
Apple 30 11 18 31
Banana 12 49 39 14
Pear 25 50 44 21
Kiwi 41 25 10 25
Strawberry 11 33 35 50
Now you can use stack on level 0 and reset_index
df_output = df.stack(0).reset_index()
which gives
Fruit Month Sales Shipments
0 Apple February 31 18
1 Apple January 11 30
2 Banana February 14 39
3 Banana January 49 12
4 Pear February 21 44
5 Pear January 50 25
6 Kiwi February 25 10
7 Kiwi January 25 41
8 Strawberry February 50 35
9 Strawberry January 33 11
Finally, if you want a specific order for values in the column Month you can use pd.Categorical:
df_output['Month'] = pd.Categorical(df_output['Month'].tolist(), ordered=True,
categories=['January','February'])
setting that January is before February when sorting. Now, doing
df_output = df_output.sort_values(['Month'])
gives the result:
Fruit Month Sales Shipments
1 Apple January 11 30
3 Banana January 49 12
5 Pear January 50 25
7 Kiwi January 25 41
9 Strawberry January 33 11
0 Apple February 31 18
2 Banana February 14 39
4 Pear February 21 44
6 Kiwi February 25 10
8 Strawberry February 50 35
I see it's not exactly the expected output (order of the Fruit column and order of the columns), but both can easily be changed if needed.
Here is how to use pd.wide_to_long, as #user3483203 suggests.
df1 = df.set_index('Fruit')
# First rename the columns: split into multiindex headers, then swap the levels.
df1.columns = df1.columns.str.split(expand=True)
df1.columns = df1.columns.map('{0[1]}_{0[0]}'.format)
# Reset the index and use pd.wide_to_long:
df1 = df1.reset_index()
df_out = pd.wide_to_long(df1, ['Shipments','Sales'], 'Fruit', 'Month', '_', '\w+')\
           .reset_index()
print(df_out)
Output:
Fruit Month Shipments Sales
0 Apple January 30.0 11.0
1 Banana January 12.0 49.0
2 Pear January 25.0 50.0
3 Kiwi January 41.0 25.0
4 Strawberry January 11.0 33.0
5 Apple February 18.0 31.0
6 Banana February 39.0 14.0
7 Pear February 44.0 21.0
8 Kiwi February 10.0 25.0
9 Strawberry February 35.0 50.0
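A melt-based alternative (a sketch, not from the answers above): reshape to long form, split the combined "Month Metric" header into two columns, then pivot the metric back out into its own columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Pear', 'Kiwi', 'Strawberry'],
    'January Shipments': [30, 12, 25, 41, 11],
    'January Sales': [11, 49, 50, 25, 33],
    'February Shipments': [18, 39, 44, 10, 35],
    'February Sales': [31, 14, 21, 25, 50],
})

# Reshape to long form, split "Month Metric" headers, pivot Metric back out.
long = df.melt(id_vars='Fruit', var_name='col', value_name='value')
long[['Month', 'Metric']] = long['col'].str.split(' ', expand=True)
out = (long.pivot_table(index=['Fruit', 'Month'], columns='Metric', values='value')
           .reset_index()[['Fruit', 'Month', 'Shipments', 'Sales']])
```

As with the stack approach, the row order differs from the expected output, but sorting by a categorical Month column fixes that the same way.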

Appending values from one column to another in pandas

Hi all, I'm doing data cleanup and I'm facing a bit of an obstacle. I have multiple dataframes that look like this:
df1
WL WM WH WP
0 NaN NaN Sea NaN
1 low medium high premium
2 26 26 15 14
3 32 32 18 29
4 41 41 19 42
5 apple dog fur napkins
6 orange cat tesla earphone
7 mango rat tobias controller
I am trying to combine the WL and WM column such that the outcome looks like this:
df1
WM WH WP
0 NaN NaN NaN
1 medium high premium
2 26 15 14
3 32 18 29
4 41 19 42
5 dog fur napkins
6 cat tesla earphone
7 rat tobias controller
8 apple
9 orange
10 mango
My initial attempt was to slice the WL column and append that to the WM column, however that has not yielded a correct output.
for num in range(len(df)):
    low = df.loc[:, df.isin(['WarrantyLow']).any()]
    low = low[5:]
    medium = df.loc[:, df.isin(['WarrantyMedium']).any()]
    medium.append(low)
1. df.append to combine WM and WL; call df.reset_index to reset the index for the next concatenation.
2. pd.concat(..., ignore_index=True, ...) combines the result of (1) with the rest of the dataframe, ignoring the index.
In [400]: pd.concat([df1['WM'].append(df1['WL'].iloc[5:]).reset_index(drop=True),
                     df1.iloc[:, 2:]], ignore_index=True, axis=1)\
            .fillna('')\
            .rename(columns=dict(enumerate(['WM', 'WH', 'WP'])))
Out[400]:
WM WH WP
0 Sea
1 medium high premium
2 26 15 14
3 32 18 29
4 41 19 42
5 dog fur napkins
6 cat tesla earphone
7 rat tobias controller
8 apple
9 orange
10 mango
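Note that Series.append, used above, was removed in pandas 2.0; a pd.concat-based rewrite of the same idea (a sketch on the sample data) is:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'WL': [np.nan, 'low', 26, 32, 41, 'apple', 'orange', 'mango'],
    'WM': [np.nan, 'medium', 26, 32, 41, 'dog', 'cat', 'rat'],
    'WH': ['Sea', 'high', 15, 18, 19, 'fur', 'tesla', 'tobias'],
    'WP': [np.nan, 'premium', 14, 29, 42, 'napkins', 'earphone', 'controller'],
})

# Stack the fruit rows of WL (rows 5 onward) under WM, then reattach WH/WP;
# rows 8-10 of WH/WP become NaN because those indices don't exist there.
wm = pd.concat([df1['WM'], df1['WL'].iloc[5:]], ignore_index=True).rename('WM')
out = pd.concat([wm, df1[['WH', 'WP']].reindex(wm.index)], axis=1)
```

The result has the same shape as the accepted answer's output, with WL's apple/orange/mango appended as rows 8-10 of WM.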
