Suppose I have a DataFrame that looks something like this:
    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58
Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (i.e. total) amount spent per id on each category, for each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:
id: 4 total: 70
id: 2 total: 11 + 10 + 58 = 79
So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.
The second column I want to add is the same, but for the mean annual count (the number of purchases per id) for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:
import math
combos = [[i,j,k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"]==c[0])&(df["grade"]==c[1])&(df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos,columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping,on=["country","grade","category"],how="left")
Which gives the desired output:
    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52           52            1
1    5      fr     b   banana      68          167            2
2    7      fr     a   banana      73           73            1
3    4      it     c    mango      70         74.5            2
4    5      fr     b   banana      99          167            2
5    9      uk     a    apple      29         52.5            1
6    3      uk     a    mango      83         97.5          1.5
7    0      uk     b   banana      59           59            1
8    2      it     c    mango      11         74.5            2
9    9      uk     a   banana      91          108            2
10   0      uk     b    mango      95           95            1
11   8      uk     a    mango      30         97.5          1.5
12   3      uk     a    mango      82         97.5          1.5
13   1      it     b   banana      78           78            1
14   3      uk     a    apple      76         52.5            1
15   6      it     c    apple      76           76            1
16   2      it     c    mango      10         74.5            2
17   1      it     b    mango      30           30            1
18   9      uk     a   banana      17          108            2
19   2      it     c    mango      58         74.5            2
The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
You can create mean_totals column as follows:
mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)
which gives
    id country grade category  amount  mean_totals
0    7      fr     a    mango      52         52.0
1    5      fr     b   banana      68        167.0
2    7      fr     a   banana      73         73.0
3    4      it     c    mango      70         74.5
4    5      fr     b   banana      99        167.0
5    9      uk     a    apple      29         52.5
6    3      uk     a    mango      83         97.5
7    0      uk     b   banana      59         59.0
8    2      it     c    mango      11         74.5
9    9      uk     a   banana      91        108.0
10   0      uk     b    mango      95         95.0
11   8      uk     a    mango      30         97.5
12   3      uk     a    mango      82         97.5
13   1      it     b   banana      78         78.0
14   3      uk     a    apple      76         52.5
15   6      it     c    apple      76         76.0
16   2      it     c    mango      10         74.5
17   1      it     b    mango      30         30.0
18   9      uk     a   banana      17        108.0
19   2      it     c    mango      58         74.5
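If the row-wise apply turns out to be slow on larger data, one possible vectorised sketch uses groupby(...).transform instead and fills both columns at once. It relies on the fact that the mean of the per-id totals in a group equals the group total divided by the number of distinct ids in that group. The names keys, g and n_ids are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk",
                "uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a",
              "b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango",
                 "banana","mango","banana","mango","mango","mango","banana",
                 "apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58],
})

keys = ["country", "grade", "category"]  # illustrative name
g = df.groupby(keys)

# Number of distinct ids in each country/grade/category group, per row.
n_ids = g["id"].transform("nunique")

# Group total (or group row count) divided by the number of ids gives the
# mean of the per-id totals (or per-id counts).
df["mean_totals"] = g["amount"].transform("sum") / n_ids
df["mean_counts"] = g["amount"].transform("count") / n_ids

print(df)
```

Because transform returns a result aligned with the original rows, no merge step is needed here.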
It looks like you need a double groupby. Once for the sum, once for the mean:
out = (df
   .groupby(['country', 'grade', 'category', 'id']).sum()
   .groupby(['country', 'grade', 'category']).mean()
)
output:
                        amount
country grade category
fr      a     banana      73.0
              mango       52.0
        b     banana     167.0
it      b     banana      78.0
              mango       30.0
        c     apple       76.0
              mango       74.5
uk      a     apple       52.5
              banana     108.0
              mango       97.5
        b     banana      59.0
              mango       95.0
I hope this will be fast enough. First group and compute the required values, then merge the result with the existing df.
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = (
    df.groupby(by=['country','grade','category','id'], as_index=False)
      .agg(int_totals=pd.NamedAgg(column='amount', aggfunc='sum'),
           int_counts=pd.NamedAgg(column='id', aggfunc='count'))
      .groupby(by=['country','grade','category'], as_index=False)
      .agg(mean_totals=pd.NamedAgg(column='int_totals', aggfunc='mean'),
           mean_counts=pd.NamedAgg(column='int_counts', aggfunc='mean'))
)
output_df = pd.merge(df, intermediate_df, on=['country','grade','category'], how='left')
print(output_df)
To be honest, I am very new to programming and have only learned some of pandas' basic functionality.
I am able to group by item and sum the price, but I cannot work out how to apply a cut-off date and sum only the entries before it.
Below are my input data and the expected result. Please help me achieve this using pandas.
In the data below, N=5 (the number of values to consider before the cut-off date). The expected result for Grape is 88, i.e. the sum of entries 7, 6, 5, 4 and 3. For Orange it is 90 (entries 13, 12, 11 and 10); only 4 entries are available before the cut-off date, so all of them are used.
Entry Date Itemname Price Cut off date Expected result
1 3/9/2020 Grape 16 3/15/2020 88
2 3/10/2020 Grape 15 3/15/2020 88
3 3/11/2020 Grape 12 3/15/2020 88
4 3/12/2020 Grape 18 3/15/2020 88
5 3/13/2020 Grape 20 3/15/2020 88
6 3/13/2020 Grape 18 3/15/2020 88
7 3/14/2020 Grape 20 3/15/2020 88
8 3/15/2020 Grape 12 3/15/2020 88
9 3/16/2020 Grape 19 3/15/2020 88
10 2/10/2020 Orange 22 2/17/2020 90
11 2/11/2020 Orange 21 2/17/2020 90
12 2/12/2020 Orange 26 2/17/2020 90
13 2/13/2020 Orange 21 2/17/2020 90
14 2/20/2020 Orange 26 2/17/2020 90
First convert the date columns to datetimes, then keep only the rows before the cut-off date using Series.lt in boolean indexing. Aggregate per item with a lambda function that sums the last N values via Series.tail, and finally use Series.map to create the new column:
N = 5
df['Date'] = pd.to_datetime(df['Date'])
df['Cut off date'] = pd.to_datetime(df['Cut off date'])
s = (df[df['Date'].lt(df['Cut off date'])]
.groupby('Itemname')['Price']
.agg(lambda x: x.tail(N).sum()))
df['new'] = df['Itemname'].map(s)
print (df)
Entry Date Itemname Price Cut off date Expected result new
0 1 2020-03-09 Grape 16 2020-03-15 88 88
1 2 2020-03-10 Grape 15 2020-03-15 88 88
2 3 2020-03-11 Grape 12 2020-03-15 88 88
3 4 2020-03-12 Grape 18 2020-03-15 88 88
4 5 2020-03-13 Grape 20 2020-03-15 88 88
5 6 2020-03-13 Grape 18 2020-03-15 88 88
6 7 2020-03-14 Grape 20 2020-03-15 88 88
7 8 2020-03-15 Grape 12 2020-03-15 88 88
8 9 2020-03-16 Grape 19 2020-03-15 88 88
9 10 2020-02-10 Orange 22 2020-02-17 90 90
10 11 2020-02-11 Orange 21 2020-02-17 90 90
11 12 2020-02-12 Orange 26 2020-02-17 90 90
12 13 2020-02-13 Orange 21 2020-02-17 90 90
13 14 2020-02-20 Orange 26 2020-02-17 90 90
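Note that tail(N) takes the last N rows in row order, so the result depends on the rows of each item being sorted by date. A self-contained sketch of the same approach with an explicit sort, assuming the dates are in month/day/year format as in the question (the names before_cutoff and sums are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["3/9/2020", "3/10/2020", "3/11/2020", "3/12/2020", "3/13/2020",
             "3/13/2020", "3/14/2020", "3/15/2020", "3/16/2020",
             "2/10/2020", "2/11/2020", "2/12/2020", "2/13/2020", "2/20/2020"],
    "Itemname": ["Grape"] * 9 + ["Orange"] * 5,
    "Price": [16, 15, 12, 18, 20, 18, 20, 12, 19, 22, 21, 26, 21, 26],
    "Cut off date": ["3/15/2020"] * 9 + ["2/17/2020"] * 5,
})

N = 5
# Assumption: dates are month/day/year, as in the question's sample.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df["Cut off date"] = pd.to_datetime(df["Cut off date"], format="%m/%d/%Y")

# Keep only the rows strictly before each row's cut-off date.
before_cutoff = df[df["Date"] < df["Cut off date"]]

# Sort by date (stable sort keeps ties in their original order),
# then sum the last N prices of each item.
sums = (before_cutoff
        .sort_values("Date", kind="mergesort")
        .groupby("Itemname")["Price"]
        .agg(lambda s: s.tail(N).sum()))

# Broadcast the per-item sum back onto every row of that item.
df["new"] = df["Itemname"].map(sums)
print(df)
```

When an item has fewer than N rows before its cut-off date, tail(N) simply returns all of them, which matches the Orange case in the question.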
I have a dataset in which I need to either fill certain cells or drop certain rows conditionally, but so far I have been unsuccessful.
Idx Fruits Days Name
0 60 20
1 15 85.5
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Some cells in the Name column are empty. I know I can fill them with fillna or a regex, or drop them.
However, I only want to handle the leading empty cells, up to the first row that contains a string, either by dropping those rows or by filling them with "."
Like below
Idx Fruits Days Name
0 60 20 .
1 15 85.5 .
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
and
Idx Fruits Days Name
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Is there any possibility using pandas? or any looping?
You can try this:
df['Name'] = df['Name'].replace('', np.nan)
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')
print(df)
Idx Fruits Days Name
0 0 60 20.0 .
1 1 15 85.5 .
2 2 10 62.0 Peter
3 3 40 90.0 Maria
4 4 5 10.2
5 5 92 66.0
6 6 65 87.0 John
7 7 50 1.0 Eric
8 8 50 0.0 Maria
9 9 80 87.0 John
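The answer above covers the fill variant only. A possible sketch that handles both the fill and the drop variants uses a cumulative flag rather than ffill, which is a different technique from the one in the answer. It assumes the empty cells are empty strings; the names started, filled and dropped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruits": [60, 15, 10, 40, 5, 92, 65, 50, 50, 80],
    "Days": [20, 85.5, 62, 90, 10.2, 66, 87, 1, 0, 87],
    "Name": ["", "", "Peter", "Maria", "", "", "John", "Eric", "Maria", "John"],
})

# False for the leading empty cells, True from the first non-empty Name on.
# Assumption: "empty" means an empty string, not NaN.
started = df["Name"].ne("").cummax()

# Variant 1: put "." only in the leading empty cells; later blanks stay empty.
filled = df.assign(Name=df["Name"].mask(~started, "."))

# Variant 2: drop the leading rows instead.
dropped = df[started]

print(filled)
print(dropped)
```

If the empty cells are NaN instead, replacing ne("") with notna() should give the same flag.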
Is there a way to modify the expanding windows in Pandas? For example, consider a random DF:
df = pd.DataFrame(np.random.randint(0,60,size=(10,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Apple","Mango","Mango","Apple","Mango","Apple","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon","Orange","lemon","Orange"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df
a b c d1 d2 date
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
When I do df.expanding(1).apply(), the function is applied to a window that grows by one row at a time. Is it possible to pass the date column to the expanding function so that, instead of growing row by row, the window accumulates groups of rows based on date?
Existing expanding window:
window 1: 0 16 25 37 Apple Orange 2002-01-01
window 2: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
window 3: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
Expected expanding window:
window 1 (all rows for date "2002-01-01"):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
window 2 (all rows for date "2002-01-01" and "2002-02-01" ):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
This assumes df.date is sorted.
There is probably a more efficient way to calculate end. If you find a better solution, please let me know.
Another usage, Pandas custom window rolling, Possible problems
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple

class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None
                          ) -> Tuple[np.ndarray, np.ndarray]:
        lst = []
        prev_date = self.custom_name[0]
        for i, date in enumerate(self.custom_name[1:], 1):
            if prev_date != date:
                lst.append(i)
                prev_date = date
        lst.append(len(df))
        end = np.array(lst)
        start = np.zeros_like(end)
        return start, end

indexer = CustomIndexer(custom_name=df.date)
for window in df.rolling(indexer):
    print(window)
Outputs:
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
7 17 17 37 Mango Orange 2002-02-01
8 5 53 0 Apple lemon 2002-02-01
9 16 10 24 Apple Orange 2002-02-01
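The indexer above returns fewer window bounds than there are rows, which may not be accepted by every pandas version. A sketch that avoids the rolling machinery entirely computes the group end positions with NumPy and slices with iloc. The function name expanding_by_key is hypothetical, and the frame is assumed to be sorted by the key column already.

```python
import numpy as np
import pandas as pd

def expanding_by_key(frame, key):
    """Yield expanding windows that grow by one whole key-group at a time.

    Assumes `frame` is already sorted by `key`.
    """
    values = frame[key].to_numpy()
    # Positions at which the key differs from the previous row, i.e. the
    # first row of each new group, which is also the end of the window so far.
    change = np.flatnonzero(values[1:] != values[:-1]) + 1
    ends = np.append(change, len(values))
    for end in ends:
        yield frame.iloc[:end]

# Small deterministic stand-in for the random frame in the question.
df = pd.DataFrame({
    "a": range(10),
    "date": pd.to_datetime(["2002-01-01"] * 7 + ["2002-02-01"] * 3),
})

windows = list(expanding_by_key(df, "date"))
window_sizes = [len(w) for w in windows]
for w in windows:
    print(w)
```

Finding the change points is a single vectorised comparison, so this also sidesteps the Python loop used to build end in the indexer.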
I have some DataFrame:
df = pd.DataFrame({'fruit':['apple', 'apple', 'apple', 'pear', 'pear', 'pear', 'mango', 'mango', 'mango', 'peach', 'peach', 'peach', 'plum', 'plum', 'plum'],
'region':[5,5,5,7,7,7,2,2,2,2,2,2,2,2,2],
'location':[75000,75000,75000,250,250,250,48897467,48897467,48897467,48897629,48897629,48897629,500000000,500000000,500000000],
'unique':np.random.randint(100, size=15)})
fruit region location unique
0 apple 5 75000 51
1 apple 5 75000 1
2 apple 5 75000 44
3 pear 7 250 36
4 pear 7 250 86
5 pear 7 250 99
6 mango 2 48897467 40
7 mango 2 48897467 12
8 mango 2 48897467 33
9 peach 2 48897629 23
10 peach 2 48897629 85
11 peach 2 48897629 65
12 plum 2 500000000 46
13 plum 2 500000000 87
14 plum 2 500000000 42
I'd like to select all rows of different 'fruit' with identical values in the 'region' column and a difference of less than 1000 in the 'location' column.
So, in this example, I'd like to return:
fruit region location unique
6 mango 2 48897467 40
7 mango 2 48897467 12
8 mango 2 48897467 33
9 peach 2 48897629 23
10 peach 2 48897629 85
11 peach 2 48897629 65
I've tried something like:
df.groupby('region')['location'].diff()
But this isn't exactly what I'm trying to do.
You can do it this way
a = df.groupby('region')['location'].transform(lambda x: x.max()-x.min())
b = df.groupby('region')['fruit'].transform('nunique')
df.loc[(a<=1000) & (b>1)]
output
fruit region location unique
6 mango 2 9000 7
7 mango 2 9000 98
8 mango 2 9000 92
9 peach 2 8800 34
10 peach 2 8800 17
11 peach 2 8800 15
I am adding this as a new answer since the previous answer is useful if someone wants the whole group instead of a portion of the group, as you want now.
For your latest need, you can do as below
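def func(x):
    # distance of each location from the first (smallest) location in its region
    return (x - x.iloc[0])
# transform keeps the original row index, so the masks below stay aligned with df
a = df.groupby('region')['location'].transform(func)
b = df.groupby('region')['fruit'].transform('nunique')
df.loc[(a<=1000) & (b>1)]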
This will only work if the location column is sorted in ascending order (make sure to sort the df by region and location before you start).
output
fruit region location unique
6 mango 2 48897467 79
7 mango 2 48897467 62
8 mango 2 48897467 68
9 peach 2 48897629 71
10 peach 2 48897629 64
11 peach 2 48897629 69
Hi all I'm doing data cleanup, and I'm facing a bit of an obstacle. I have multiple dataframes that look like this:
df1
WL WM WH WP
0 NaN NaN Sea NaN
1 low medium high premium
2 26 26 15 14
3 32 32 18 29
4 41 41 19 42
5 apple dog fur napkins
6 orange cat tesla earphone
7 mango rat tobias controller
I am trying to combine the WL and WM column such that the outcome looks like this:
df1
WM WH WP
0 NaN NaN NaN
1 medium high premium
2 26 15 14
3 32 18 29
4 41 19 42
5 dog fur napkins
6 cat tesla earphone
7 rat tobias controller
8 apple
9 orange
10 mango
My initial attempt was to slice the WL column and append that to the WM column, however that has not yielded a correct output.
for num in range(len(df)):
    low = df.loc[:, df.isin(['WarrantyLow']).any()]
    low = low[5:]
    medium = df.loc[:, df.isin(['WarrantyMedium']).any()]
    medium.append(low)
1. Use Series.append to combine WM with the tail of WL, then call reset_index to reset the index for the next concatenation.
2. pd.concat(..., ignore_index=True, axis=1) combines the result of (1) with the rest of the dataframe, ignoring the existing column labels, which are then restored with rename.
In [400]: pd.concat([df1['WM'].append(df1['WL'].iloc[5:]).reset_index(drop=True), \
df1.iloc[:, 2:]], ignore_index=True, axis=1).fillna('')\
.rename(columns=dict(enumerate(['WM', 'WH', 'WP'])))
Out[400]:
WM WH WP
0 Sea
1 medium high premium
2 26 15 14
3 32 18 29
4 41 19 42
5 dog fur napkins
6 cat tesla earphone
7 rat tobias controller
8 apple
9 orange
10 mango
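Series.append was removed in pandas 2.0, so on newer versions the same two steps could be sketched with pd.concat only. The frame below is rebuilt from the question's sample; the name wm is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "WL": [np.nan, "low", 26, 32, 41, "apple", "orange", "mango"],
    "WM": [np.nan, "medium", 26, 32, 41, "dog", "cat", "rat"],
    "WH": ["Sea", "high", 15, 18, 19, "fur", "tesla", "tobias"],
    "WP": [np.nan, "premium", 14, 29, 42, "napkins", "earphone", "controller"],
})

# Step 1: stack the tail of WL (rows 5 onwards) under WM and renumber the rows.
wm = pd.concat([df1["WM"], df1["WL"].iloc[5:]], ignore_index=True).rename("WM")

# Step 2: put the longer WM column next to the remaining columns; rows that
# exist only in WM get NaN in WH/WP, which is then replaced by "".
result = pd.concat([wm, df1[["WH", "WP"]]], axis=1).fillna("")

print(result)
```

Naming the combined Series WM up front means the column labels survive the second concat, so no rename of the columns is needed afterwards.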