User defined expanding window in Pandas - python
Is there a way to modify the expanding window in Pandas? For example, consider a random DataFrame:
df = pd.DataFrame(np.random.randint(0,60,size=(10,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Apple","Mango","Mango","Apple","Mango","Apple","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon","Orange","lemon","Orange"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df
a b c d1 d2 date
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
When I call df.expanding(1).apply(), the function is applied to a window that grows by one row at a time. Is it possible to pass the date column to the expanding function so that, instead of adding one row per window, it accumulates groups of rows based on date?
Existing expanding window:
window 1: 0 16 25 37 Apple Orange 2002-01-01
window 2: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
window 3: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
Expected expanding window:
window 1 (all rows for date "2002-01-01"):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
window 2 (all rows for date "2002-01-01" and "2002-02-01" ):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
This assumes df.date is sorted.
There is probably a more efficient way to calculate end. If you find a better solution, please let me know.
See also: Another usage, Pandas custom window rolling, Possible problems.
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple

class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None
                          ) -> Tuple[np.ndarray, np.ndarray]:
        # Record the position at which each new date starts; that position is
        # the exclusive end of the window covering all earlier dates.
        lst = []
        prev_date = self.custom_name[0]
        for i, date in enumerate(self.custom_name[1:], 1):
            if prev_date != date:
                lst.append(i)
                prev_date = date
        # The last window always runs to the end of the data.
        lst.append(len(self.custom_name))
        end = np.array(lst)
        # Every window starts at row 0, which makes the windows expanding.
        start = np.zeros_like(end)
        return start, end

indexer = CustomIndexer(custom_name=df.date)
for window in df.rolling(indexer):
    print(window)
Outputs:
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
7 17 17 37 Mango Orange 2002-02-01
8 5 53 0 Apple lemon 2002-02-01
9 16 10 24 Apple Orange 2002-02-01
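The answer asks for a cheaper way to compute end. A minimal sketch, assuming the frame is already sorted by the date column: find the positions where the date changes with NumPy and slice with iloc. The helper name date_expanding_windows and the small demo frame are made up for illustration. This sketch does not go through BaseIndexer at all, which sidesteps any differences in how pandas versions validate custom window bounds (an assumption, not checked against a specific release).

```python
import numpy as np
import pandas as pd


def date_expanding_windows(frame, key):
    """Yield (label, window) pairs where each window holds every row up to
    and including the last row of one distinct value of `key`.

    Hypothetical helper; assumes `frame` is already sorted by `key`.
    """
    if len(frame) == 0:
        return
    values = frame[key].to_numpy()
    # Positions where the key differs from the previous row mark the
    # exclusive end of the preceding group.
    change = np.flatnonzero(values[1:] != values[:-1]) + 1
    ends = np.append(change, len(frame))
    for end in ends:
        yield values[end - 1], frame.iloc[:end]


# Small stand-in for the question's frame: 7 rows in January, 3 in February.
df = pd.DataFrame({
    "a": range(10),
    "date": pd.to_datetime(["2002-01-01"] * 7 + ["2002-02-01"] * 3),
})

windows = list(date_expanding_windows(df, "date"))
sizes = [len(window) for _, window in windows]
cumulative_sum = [int(window["a"].sum()) for _, window in windows]
print(sizes)
print(cumulative_sum)
```

Each window is a plain DataFrame slice, so any function can be applied to it inside the loop, including one that needs the non-numeric columns.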
Related
Efficient mean and total aggregation over multiple Pandas DataFrame columns
Suppose I have a DataFrame that looks something like this:

    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58

Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)

import pandas as pd
df = pd.DataFrame({
    "id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (ie total) spent on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:

id: 4 total: 70
id: 2 total: 11 + 10 + 58 = 79

So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.
The second column I want to add is the same but for the mean annual count for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:

import math
combos = [[i,j,k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"]==c[0])&(df["grade"]==c[1])&(df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos,columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping,on=["country","grade","category"],how="left")

Which gives the desired output:

    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52           52            1
1    5      fr     b   banana      68          167            2
2    7      fr     a   banana      73           73            1
3    4      it     c    mango      70         74.5            2
4    5      fr     b   banana      99          167            2
5    9      uk     a    apple      29         52.5            1
6    3      uk     a    mango      83         97.5          1.5
7    0      uk     b   banana      59           59            1
8    2      it     c    mango      11         74.5            2
9    9      uk     a   banana      91          108            2
10   0      uk     b    mango      95           95            1
11   8      uk     a    mango      30         97.5          1.5
12   3      uk     a    mango      82         97.5          1.5
13   1      it     b   banana      78           78            1
14   3      uk     a    apple      76         52.5            1
15   6      it     c    apple      76           76            1
16   2      it     c    mango      10         74.5            2
17   1      it     b    mango      30           30            1
18   9      uk     a   banana      17          108            2
19   2      it     c    mango      58         74.5            2

The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
You can create the mean_totals column as follows:

mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)

which gives

    id country grade category  amount  mean_totals
0    7      fr     a    mango      52         52.0
1    5      fr     b   banana      68        167.0
2    7      fr     a   banana      73         73.0
3    4      it     c    mango      70         74.5
4    5      fr     b   banana      99        167.0
5    9      uk     a    apple      29         52.5
6    3      uk     a    mango      83         97.5
7    0      uk     b   banana      59         59.0
8    2      it     c    mango      11         74.5
9    9      uk     a   banana      91        108.0
10   0      uk     b    mango      95         95.0
11   8      uk     a    mango      30         97.5
12   3      uk     a    mango      82         97.5
13   1      it     b   banana      78         78.0
14   3      uk     a    apple      76         52.5
15   6      it     c    apple      76         76.0
16   2      it     c    mango      10         74.5
17   1      it     b    mango      30         30.0
18   9      uk     a   banana      17        108.0
19   2      it     c    mango      58         74.5
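The ratio used in the answer above (group total divided by the number of distinct ids) can also be written with groupby transform. This avoids the row-wise apply and extends naturally to the second requested column, since the mean count per id is the group's row count divided by the number of distinct ids. This is a sketch under the assumption that amount has no missing values (count skips NaN); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [7, 5, 7, 4, 5, 9, 3, 0, 2, 9, 0, 8, 3, 1, 3, 6, 2, 1, 9, 2],
    "country": ["fr", "fr", "fr", "it", "fr", "uk", "uk", "uk", "it", "uk",
                "uk", "uk", "uk", "it", "uk", "it", "it", "it", "uk", "it"],
    "grade": ["a", "b", "a", "c", "b", "a", "a", "b", "c", "a",
              "b", "a", "a", "b", "a", "c", "c", "b", "a", "c"],
    "category": ["mango", "banana", "banana", "mango", "banana", "apple",
                 "mango", "banana", "mango", "banana", "mango", "mango",
                 "mango", "banana", "apple", "apple", "mango", "mango",
                 "banana", "mango"],
    "amount": [52, 68, 73, 70, 99, 29, 83, 59, 11, 91,
               95, 30, 82, 78, 76, 76, 10, 30, 17, 58],
})

grouped = df.groupby(["country", "grade", "category"])
# Number of distinct ids in each row's (country, grade, category) group.
n_ids = grouped["id"].transform("nunique")
# Mean of per-id totals == group total / number of distinct ids.
df["mean_totals"] = grouped["amount"].transform("sum") / n_ids
# Mean of per-id row counts == group row count / number of distinct ids.
df["mean_counts"] = grouped["amount"].transform("count") / n_ids
print(df)
```

Because transform returns a result aligned with the original index, no merge back onto df is needed.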
It looks like you need a double groupby. Once for the sum, once for the mean:

out = (df
   .groupby(['country', 'grade', 'category', 'id']).sum()
   .groupby(['country', 'grade', 'category']).mean()
)

output:

                        amount
country grade category
fr      a     banana      73.0
              mango       52.0
        b     banana     167.0
it      b     banana      78.0
              mango       30.0
        c     apple       76.0
              mango       74.5
uk      a     apple       52.5
              banana     108.0
              mango       97.5
        b     banana      59.0
              mango       95.0
I hope this will work fast. First group and compute the required details, then merge with the existing df.

import pandas as pd
df = pd.DataFrame({
    "id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = (
    df.groupby(by=['country','grade','category','id'], as_index=False)
      .agg(int_totals=pd.NamedAgg(column='amount',aggfunc='sum'),
           int_counts=pd.NamedAgg(column='id',aggfunc='count'))
      .groupby(by=['country','grade','category'], as_index=False)
      .agg(mean_totals=pd.NamedAgg(column='int_totals',aggfunc='mean'),
           mean_counts=pd.NamedAgg(column='int_counts',aggfunc='mean'))
)
output_df = pd.merge(df,intermediate_df, left_on = ['country','grade','category'],right_on = ['country','grade','category'], how='left')
print(output_df)
How to get the sum of the most recent N values (or all available values) of each item with respect to a given cut-off date?
To be honest, I am very new to programming and have learned some of pandas' basic functionality. I am able to group by and sum the price of each item, but I am not able to apply the cut-off date and then do the summation. Below are my input data and the expected result. Please help me achieve this using pandas.
In the data below, N=5 (the number of values to consider before the cut-off date). The expected result for item Grape is 88, i.e. the sum of entries 7, 6, 5, 4 and 3. For Orange it is 90 (entries 13, 12, 11, 10); only 4 entries are available here, so all of them are considered.

Entry  Date       Itemname  Price  Cut off date  Expected result
1      3/9/2020   Grape     16     3/15/2020     88
2      3/10/2020  Grape     15     3/15/2020     88
3      3/11/2020  Grape     12     3/15/2020     88
4      3/12/2020  Grape     18     3/15/2020     88
5      3/13/2020  Grape     20     3/15/2020     88
6      3/13/2020  Grape     18     3/15/2020     88
7      3/14/2020  Grape     20     3/15/2020     88
8      3/15/2020  Grape     12     3/15/2020     88
9      3/16/2020  Grape     19     3/15/2020     88
10     2/10/2020  Orange    22     2/17/2020     90
11     2/11/2020  Orange    21     2/17/2020     90
12     2/12/2020  Orange    26     2/17/2020     90
13     2/13/2020  Orange    21     2/17/2020     90
14     2/20/2020  Orange    26     2/17/2020     90
First convert the columns to datetimes, then keep only the rows before the cut-off date using Series.lt in boolean indexing. Aggregate per item with a lambda function that sums the last N values via Series.tail, and finally use Series.map to create the new column:

N = 5
df['Date'] = pd.to_datetime(df['Date'])
df['Cut off date'] = pd.to_datetime(df['Cut off date'])
s = (df[df['Date'].lt(df['Cut off date'])]
       .groupby('Itemname')['Price']
       .agg(lambda x: x.tail(N).sum()))
df['new'] = df['Itemname'].map(s)
print (df)

    Entry       Date Itemname  Price Cut off date  Expected result  new
0       1 2020-03-09    Grape     16   2020-03-15               88   88
1       2 2020-03-10    Grape     15   2020-03-15               88   88
2       3 2020-03-11    Grape     12   2020-03-15               88   88
3       4 2020-03-12    Grape     18   2020-03-15               88   88
4       5 2020-03-13    Grape     20   2020-03-15               88   88
5       6 2020-03-13    Grape     18   2020-03-15               88   88
6       7 2020-03-14    Grape     20   2020-03-15               88   88
7       8 2020-03-15    Grape     12   2020-03-15               88   88
8       9 2020-03-16    Grape     19   2020-03-15               88   88
9      10 2020-02-10   Orange     22   2020-02-17               90   90
10     11 2020-02-11   Orange     21   2020-02-17               90   90
11     12 2020-02-12   Orange     26   2020-02-17               90   90
12     13 2020-02-13   Orange     21   2020-02-17               90   90
13     14 2020-02-20   Orange     26   2020-02-17               90   90
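Series.tail(N) only returns the most recent values if the rows of each item are in date order, which the answer above relies on implicitly. The following self-contained sketch rebuilds the question's table (typed in from the text, with Entry and Date as separate columns) and adds an explicit stable sort before taking the tail. The sort step and the variable names are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Entry": range(1, 15),
    "Date": ["3/9/2020", "3/10/2020", "3/11/2020", "3/12/2020", "3/13/2020",
             "3/13/2020", "3/14/2020", "3/15/2020", "3/16/2020",
             "2/10/2020", "2/11/2020", "2/12/2020", "2/13/2020", "2/20/2020"],
    "Itemname": ["Grape"] * 9 + ["Orange"] * 5,
    "Price": [16, 15, 12, 18, 20, 18, 20, 12, 19, 22, 21, 26, 21, 26],
    "Cut off date": ["3/15/2020"] * 9 + ["2/17/2020"] * 5,
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Cut off date"] = pd.to_datetime(df["Cut off date"], format="%m/%d/%Y")

N = 5
# Keep rows strictly before the cut-off, in date order so tail() means "most recent".
before_cutoff = df[df["Date"] < df["Cut off date"]].sort_values("Date", kind="stable")
# Sum the last N prices per item; items with fewer than N rows use all of them.
recent_sum = before_cutoff.groupby("Itemname")["Price"].agg(lambda s: s.tail(N).sum())
df["new"] = df["Itemname"].map(recent_sum)
print(df)
```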
Pandas - return all rows with identical values in one column and small differences in another column
I have some DataFrame:

df = pd.DataFrame({'fruit':['apple', 'apple', 'apple', 'pear', 'pear', 'pear', 'mango', 'mango', 'mango', 'peach', 'peach', 'peach', 'plum', 'plum', 'plum'],
                   'region':[5,5,5,7,7,7,2,2,2,2,2,2,2,2,2],
                   'location':[75000,75000,75000,250,250,250,48897467,48897467,48897467,48897629,48897629,48897629,500000000,500000000,500000000],
                   'unique':np.random.randint(100, size=15)})

    fruit  region   location  unique
0   apple       5      75000      51
1   apple       5      75000       1
2   apple       5      75000      44
3    pear       7        250      36
4    pear       7        250      86
5    pear       7        250      99
6   mango       2   48897467      40
7   mango       2   48897467      12
8   mango       2   48897467      33
9   peach       2   48897629      23
10  peach       2   48897629      85
11  peach       2   48897629      65
12   plum       2  500000000      46
13   plum       2  500000000      87
14   plum       2  500000000      42

I'd like to select all rows of different 'fruit' with identical values in the 'region' column and a difference of less than 1000 in the 'location' column. So, in this example, I'd like to return:

    fruit  region  location  unique
6   mango       2  48897467      40
7   mango       2  48897467      12
8   mango       2  48897467      33
9   peach       2  48897629      23
10  peach       2  48897629      85
11  peach       2  48897629      65

I've tried something like:

df.groupby('region')['location'].diff()

But this isn't exactly what I'm trying to do.
You can do it this way:

a = df.groupby('region')['location'].transform(lambda x: x.max()-x.min())
b = df.groupby('region')['fruit'].transform('nunique')
df.loc[(a<=1000) & (b>1)]

output

    fruit  region  location  unique
6   mango       2      9000       7
7   mango       2      9000      98
8   mango       2      9000      92
9   peach       2      8800      34
10  peach       2      8800      17
11  peach       2      8800      15
I am adding this as a new answer since the previous answer is useful if someone wants the whole group instead of a portion of the group, which is what you want now. For your latest need, you can do as below:

def func(x):
    return (x - x.iloc[0])
a = df.groupby('region')['location'].apply(func)
b = df.groupby('region')['fruit'].transform('nunique')
df.loc[(a<=1000) & (b>1)]

This will only work if the location column is sorted in ascending order (make sure to sort the df by region and location before you start).

output

    fruit  region  location  unique
6   mango       2  48897467      79
7   mango       2  48897467      62
8   mango       2  48897467      68
9   peach       2  48897629      71
10  peach       2  48897629      64
11  peach       2  48897629      69
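For comparison, here is a sketch of a different technique from the groupby answers above: a self-join on region that compares every fruit's location with every other fruit's location, so it does not depend on sort order or on the first row of the group. It is not the answerer's method. The unique column is omitted because it is random, and the names keys, distinct, pairs and close are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple"] * 3 + ["pear"] * 3 + ["mango"] * 3 + ["peach"] * 3 + ["plum"] * 3,
    "region": [5] * 3 + [7] * 3 + [2] * 9,
    "location": [75000] * 3 + [250] * 3 + [48897467] * 3
                + [48897629] * 3 + [500000000] * 3,
})

keys = ["region", "fruit", "location"]
# One row per distinct (region, fruit, location) keeps the self-join small.
distinct = df[keys].drop_duplicates()
# Pair every entry with every entry of the same region.
pairs = distinct.merge(distinct, on="region", suffixes=("", "_other"))
# Keep pairs of different fruits whose locations differ by less than 1000.
close = pairs[
    (pairs["fruit"] != pairs["fruit_other"])
    & ((pairs["location"] - pairs["location_other"]).abs() < 1000)
]
# Select the original rows (index preserved) whose key took part in a close pair.
wanted = list(close[keys].drop_duplicates().itertuples(index=False, name=None))
result = df[pd.MultiIndex.from_frame(df[keys]).isin(wanted)]
print(result)
```

The join is quadratic in the number of distinct entries per region, so this is only a reasonable choice when regions contain a modest number of distinct fruit/location combinations.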
Repeat columns as rows in python?
Fruit        January Shipments   January Sales   February Shipments   February Sales
------------ ------------------- --------------- -------------------- ----------------
Apple        30                  11              18                   31
Banana       12                  49              39                   14
Pear         25                  50              44                   21
Kiwi         41                  25              10                   25
Strawberry   11                  33              35                   50

I'm trying to achieve the following result:

Fruit        Month      Shipments   Sales
------------ ---------- ----------- -------
Apple        January    30          11
Banana       January    12          49
Pear         January    25          50
Kiwi         January    41          25
Strawberry   January    11          33
Apple        February   18          31
Banana       February   39          14
Pear         February   44          21
Kiwi         February   10          25
Strawberry   February   35          50

I've tried pandas.pivot and pandas.pivot_table and had no luck. I'm in the process of creating two dataframes (Fruit/Month/Shipments) and (Fruit/Month/Sales), and concatenating the two into one with a loop, but I was hoping for an easier way to do this.
One way is to turn the columns into a multi-level index and then use stack. Let's suppose your dataframe is called df. First set the column Fruit as index, then define the multilevel columns:

df = df.set_index('Fruit')
# manual way to create the multiindex columns
#df.columns = pd.MultiIndex.from_product([['January','February'],
#                ['Shipments','Sales']], names=['Month',None])
# more general way to create the multiindex columns thanks to @Scott Boston
df.columns = df.columns.str.split(expand=True)
df.columns.names = ['Month',None]

Your data looks like:

Month        January        February
           Shipments Sales Shipments Sales
Fruit
Apple             30    11        18    31
Banana            12    49        39    14
Pear              25    50        44    21
Kiwi              41    25        10    25
Strawberry        11    33        35    50

Now you can use stack on level 0 and reset_index:

df_output = df.stack(0).reset_index()

which gives

        Fruit     Month  Sales  Shipments
0       Apple  February     31         18
1       Apple   January     11         30
2      Banana  February     14         39
3      Banana   January     49         12
4        Pear  February     21         44
5        Pear   January     50         25
6        Kiwi  February     25         10
7        Kiwi   January     25         41
8  Strawberry  February     50         35
9  Strawberry   January     33         11

Finally, if you want a specific order for the values in the column Month, you can use pd.Categorical:

df_output['Month'] = pd.Categorical(df_output['Month'].tolist(), ordered=True,
                                    categories=['January','February'])

which specifies that January comes before February when sorting. Now, doing

df_output = df_output.sort_values(['Month'])

gives the result:

        Fruit     Month  Sales  Shipments
1       Apple   January     11         30
3      Banana   January     49         12
5        Pear   January     50         25
7        Kiwi   January     25         41
9  Strawberry   January     33         11
0       Apple  February     31         18
2      Banana  February     14         39
4        Pear  February     21         44
6        Kiwi  February     25         10
8  Strawberry  February     50         35

I see it's not exactly the expected output (the order in the Fruit column and the order of the columns differ), but both can easily be changed if needed.
How to use pd.wide_to_long, as @user3483203 suggests:

df1 = df.set_index('Fruit')

#First we have to do some column renaming: use multiindex column headers and swap the levels.
df1.columns = df1.columns.str.split(expand=True)
df1.columns = df1.columns.map('{0[1]}_{0[0]}'.format)

#Reset index and use pd.wide_to_long:
df1 = df1.reset_index()
df_out = pd.wide_to_long(df1, ['Shipments','Sales'], 'Fruit', 'Month','_',r'\w+')\
    .reset_index()

print(df_out)

Output:

        Fruit     Month  Shipments  Sales
0       Apple   January       30.0   11.0
1      Banana   January       12.0   49.0
2        Pear   January       25.0   50.0
3        Kiwi   January       41.0   25.0
4  Strawberry   January       11.0   33.0
5       Apple  February       18.0   31.0
6      Banana  February       39.0   14.0
7        Pear  February       44.0   21.0
8        Kiwi  February       10.0   25.0
9  Strawberry  February       35.0   50.0
Appending values from one column to another in pandas
Hi all, I'm doing data cleanup, and I'm facing a bit of an obstacle. I have multiple dataframes that look like this:

df1
       WL      WM      WH          WP
0     NaN     NaN     Sea         NaN
1     low  medium    high     premium
2      26      26      15          14
3      32      32      18          29
4      41      41      19          42
5   apple     dog     fur     napkins
6  orange     cat   tesla    earphone
7   mango     rat  tobias  controller

I am trying to combine the WL and WM columns such that the outcome looks like this:

df1
        WM      WH          WP
0      NaN     NaN         NaN
1   medium    high     premium
2       26      15          14
3       32      18          29
4       41      19          42
5      dog     fur     napkins
6      cat   tesla    earphone
7      rat  tobias  controller
8    apple
9   orange
10   mango

My initial attempt was to slice the WL column and append that to the WM column; however, that has not yielded a correct output.

for num in range(len(df)):
    low = df.loc[:, df.isin(['WarrantyLow']).any()]
    low = low[5:]
    medium = df.loc[:, df.isin(['WarrantyMedium']).any()]
    medium.append(low)
1. df.append to combine WM and WL. Call df.reset_index to reset the index for the next concatenation.
2. pd.concat(..., ignore_index=True, ...) combines the result of (1) with the rest of the dataframe, ignoring the index.

In [400]: pd.concat([df1['WM'].append(df1['WL'].iloc[5:]).reset_index(drop=True), \
                     df1.iloc[:, 2:]], ignore_index=True, axis=1).fillna('')\
            .rename(columns=dict(enumerate(['WM', 'WH', 'WP'])))
Out[400]:
        WM      WH          WP
0              Sea
1   medium    high     premium
2       26      15          14
3       32      18          29
4       41      19          42
5      dog     fur     napkins
6      cat   tesla    earphone
7      rat  tobias  controller
8    apple
9   orange
10   mango
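Series.append was removed in pandas 2.0, so on current versions the same two steps can be sketched with pd.concat alone. The frame below is typed in from the question; the variable names wm and result are illustrative, and the sketch assumes, as in the answer above, that only the rows of WL from position 5 onward should be moved under WM.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "WL": [np.nan, "low", 26, 32, 41, "apple", "orange", "mango"],
    "WM": [np.nan, "medium", 26, 32, 41, "dog", "cat", "rat"],
    "WH": ["Sea", "high", 15, 18, 19, "fur", "tesla", "tobias"],
    "WP": [np.nan, "premium", 14, 29, 42, "napkins", "earphone", "controller"],
})

# Step 1: stack the tail of WL under WM and renumber the index 0..10.
wm = pd.concat([df1["WM"], df1["WL"].iloc[5:]], ignore_index=True).rename("WM")
# Step 2: align the longer WM column with the remaining columns by index;
# rows 8..10 have no WH/WP values, so they are filled with empty strings.
result = pd.concat([wm, df1[["WH", "WP"]]], axis=1).fillna("")
print(result)
```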