I am working on a report automation task. I used the groupby function, which yielded a table:
function_d = {"AvgLoadCur": 'mean'}
newdf = df.groupby(['sitename']).agg(function_d)
sitename AvgLoadCur
Biocon-SEZD-66/11KV SS 11 23.0
Biocon-SEZD-GT 1 120V DC 24.2
Biocon-SEZD-GT 2 120V DC 23.9
Biocon-SEZD-PLC 24V 21.4
df contains only four sitenames, hence the groupby table also contains only those four. How can I append the two missing sitenames, which are stored in another dataframe column, site['sitename']?
sitename
Biocon-SEZD-GT 1 120V DC
Biocon-SEZD-GT 2 120V DC
Biocon-SEZD-SCADA UPS
Biocon-SEZD-66/11KV SS 11
Biocon-SEZD-PLC 24V DC
BIOCON SEZ-HT PANEL 220 V
The final dataframe should look like this:
sitename AvgLoadCur
Biocon-SEZD-66/11KV SS 11 23.0
Biocon-SEZD-GT 1 120V DC 24.2
Biocon-SEZD-GT 2 120V DC 23.9
Biocon-SEZD-PLC 24V 21.4
Biocon-SEZD-HT PANEL 220 V --
Biocon-SEZD-SCADA UPS --
In short: how do I append elements from another dataframe that are not present in a groupby table?
groupby table:
Fruit Price
apple 34
The other df table:
Fruit
--------
apple
orange
Final groupby table
Fruit Price
apple 34
orange --
You can first merge your dataframes and then groupby:
df = pd.DataFrame({'Fruit': {0: 'apple'}, 'Price': {0: 34}})
df2 = pd.DataFrame({'Fruit': {0: 'apple', 1: 'orange'}})
(
    pd.merge(df, df2, on='Fruit', how='right')
      .groupby('Fruit')
      .agg(avg=('Price', 'mean'))
      .reset_index()
)
Fruit avg
0 apple 34.0
1 orange NaN
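The question's desired output shows '--' instead of NaN; if you want that literal placeholder, you can fill it in afterwards (a small follow-up to the chain above; note it turns the numeric column into strings, so do this only for display):

result = (
    pd.merge(df, df2, on='Fruit', how='right')
      .groupby('Fruit')
      .agg(avg=('Price', 'mean'))
      .reset_index()
)
result['avg'] = result['avg'].fillna('--')  # '--' placeholder, matching the desired output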
I hope this answers your question:
df = pd.DataFrame([['apple', 1], ['orange', 2]], columns=['Fruit', 'Price'])
df2 = pd.DataFrame(['guava', 'apple', 'orange'], columns=['Fruit'])
for value in df2['Fruit'].values:
    if value not in df['Fruit'].values:
        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame
        df = pd.concat([df, pd.DataFrame([{'Fruit': value, 'Price': '--'}])],
                       ignore_index=True)
df
Output
Fruit Price
0 apple 1
1 orange 2
2 guava --
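A loop-free variant of the same idea, assuming df and df2 as defined above (isin finds all the missing fruits in one pass):

# rows of df2 whose Fruit is absent from df, with the '--' placeholder price
missing = df2[~df2['Fruit'].isin(df['Fruit'])].assign(Price='--')
df = pd.concat([df, missing], ignore_index=True)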
Related
I have a dataframe that shows sales per item per store, it looks like this:
date item storeNbr Sales
2021-06-29 soap 123 100
2021-05-29 hat 129 500
2020-06-29 soap 123 0
2020-05-29 hat 129 10
I'm trying to create a column for last year's sales that should take values that
already exist in the dataframe where the date is equal to the prior year, and where
the store number and item are the same. So it should look like this:
date item storeNbr Sales LY
2021-06-29 soap 123 100 0
2021-05-29 hat 129 500 10
2020-06-29 soap 123 0 NaN
2020-05-29 hat 129 10 NaN
I've tried this:
df['Previous'] = (df.groupby([df['date'].dt.month, df['date'].dt.day, df['storeNbr']])
                    ['Sales'].shift())
but I'm having trouble getting the desired result. Thank you in advance for any help here!
Sample data:
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({'date': {0: Timestamp('2021-06-29 00:00:00'), 1: Timestamp('2021-05-29 00:00:00'), 2: Timestamp('2020-06-29 00:00:00'), 3: Timestamp('2020-05-29 00:00:00')}, 'item': {0: 'soap', 1: 'hat', 2: 'soap', 3: 'hat'}, 'storeNbr': {0: 123, 1: 129, 2: 123, 3: 129}, 'Sales': {0: 100, 1: 500, 2: 0, 3: 10}})
Code:
# create a copy of your data, but add 1 year to the date, then merge.
df2 = df.copy()
df2['date'] = df2['date'] + pd.DateOffset(years=1)
df['LY'] = df.drop('Sales', axis=1).merge(df2, on=['date', 'item', 'storeNbr'])['Sales']
Output:
date item storeNbr Sales LY
0 2021-06-29 soap 123 100 0.0
1 2021-05-29 hat 129 500 10.0
2 2020-06-29 soap 123 0 NaN
3 2020-05-29 hat 129 10 NaN
One-liner provided by @ScottBoston:
(df.merge(df.assign(date=df['date'] + pd.DateOffset(years=1)),
          on=['date', 'item', 'storeNbr'],
          how='left',
          suffixes=('', '_y'))
   .rename(columns={'Sales_y': 'LY'}))
If you sort it first, you can do a groupby and shift.
df = df.sort_values(by=['item','date'])
df['LY'] = df.groupby('item')['Sales'].shift()
Output
date item storeNbr Sales LY
3 2020-05-29 hat 129 10 NaN
1 2021-05-29 hat 129 500 10.0
2 2020-06-29 soap 123 0 NaN
0 2021-06-29 soap 123 100 0.0
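If the same item can appear in more than one store, grouping by both keys is a safer variant of the same idea (the shifted values land back on the original rows through index alignment):

# sort within each (item, store) pair, then shift to pick up the prior year's sales
df['LY'] = (df.sort_values(['item', 'storeNbr', 'date'])
              .groupby(['item', 'storeNbr'])['Sales']
              .shift())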
Your code is close; just three minor changes are needed:
group by one more field, item;
add the parameter sort=False in groupby() to ensure the original order is retained (recent year first);
use shift(-1) to get the next row's value instead of shift(), which gets the previous row's value.
df['LY'] = df.groupby([df['date'].dt.month, df['date'].dt.day,
                       df['storeNbr'], df['item']],
                      sort=False)['Sales'].shift(-1)
Result:
print(df)
date item storeNbr Sales LY
0 2021-06-29 soap 123 100 0.0
1 2021-05-29 hat 129 500 10.0
2 2020-06-29 soap 123 0 NaN
3 2020-05-29 hat 129 10 NaN
In my dataframe, df, I am trying to sum the values from the value column for each Product and Year for two periods of the year (Month), specifically Months 1 through 3 and Months 9 through 11. I know I need to use groupby to group Products and Years, and possibly use a lambda function (or an if statement) to separate the two periods of time.
Here's my data frame df:
import pandas as pd
products = {'Product': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B',
                        'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
            'Month': [1, 1, 3, 4, 5, 10, 4, 5, 10, 11, 2, 3, 5, 3, 9, 10, 11, 12],
            'Year': [1999, 1999, 1999, 1999, 1999, 1999, 2017, 2017, 1988, 1988,
                     2002, 2002, 2002, 2003, 2003, 2003, 2003, 2003],
            'value': [250, 810, 1200, 340, 250, 800, 1200, 400, 250, 800,
                      1200, 300, 290, 800, 1200, 300, 1200, 300]}
df = pd.DataFrame(products, columns=['Product', 'Month', 'Year', 'value'])
df
And I want a table that looks something like this:
products = {'Product': ['A','A','B','B','C','C','C'],
'MonthGroups': ['Month1:3','Month9:11','Month1:3','Month9:11','Month1:3','Month1:3','Month9:11'],
'Year': [1999,1999,2017,1988,2002, 2003, 2003],
'SummedValue': [2260, 800, 0, 1050, 1500, 800, 2700]
}
new_df = pd.DataFrame(products, columns= ['Product', 'MonthGroups','Year','SummedValue'])
new_df
What I have so far is that I should use groupby to group Product and Year. What I'm stuck on is defining the two month groups, Months 1 through 3 and Months 9 through 11, over which value should be summed per year.
df.groupby(['Product','Year']).value.sum().loc[lambda p: p > 10].to_frame()
This isn't right though because it needs to sum based on the month groups.
First create a new column with numpy.select and DataFrame.assign, then aggregate by MonthGroups as well. Because groupby by default removes rows with missing values in the grouping columns (here MonthGroups), the non-matched rows are omitted:
import numpy as np

df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                          df['Month'].between(9,11)],
                                         ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
       )
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
If you also need 0 sums for the non-matched rows:
df2 = df[['Product','Year']].drop_duplicates().assign(MonthGroups='Month1:3', SummedValue=0)
df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                          df['Month'].between(9,11)],
                                         ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
       )
# DataFrame.append was removed in pandas 2.0, so concatenate and de-duplicate instead
df1 = (pd.concat([df1, df2])
         .drop_duplicates(['Product','MonthGroups','Year']))
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
6 B Month1:3 2017 0
8 B Month1:3 1988 0
A little different approach using pd.cut:
bins = [0, 3, 8, 11]
s = pd.cut(df['Month'], bins, labels=['1:3', 'irrelevant', '9:11'])
(df[s.isin(['1:3', '9:11'])].assign(MonthGroups=s.astype(str))
   .groupby(['Product', 'MonthGroups', 'Year'])['value'].sum().reset_index())
Product MonthGroups Year value
0 A 1:3 1999 2260
1 A 9:11 1999 800
2 B 9:11 1988 1050
3 C 1:3 2002 1500
4 C 1:3 2003 800
5 C 9:11 2003 2700
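If you also want the SummedValue column name from the question, the same chain works with a named reset_index (a minimal variant of the code above):

(df[s.isin(['1:3', '9:11'])].assign(MonthGroups=s.astype(str))
   .groupby(['Product', 'MonthGroups', 'Year'])['value'].sum()
   .reset_index(name='SummedValue'))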
I have a dataframe like this:
df = pd.DataFrame({
'User':['101','101','102','102','102'],
'Product':['x','x','x','z','z'],
'Country':['India,Brazil','India','India,Brazil,Japan','India,Brazil','Brazil']
})
and I want to get the count of each country and product combination per user: first split the countries, then combine each with the product and take the count.
Here is one way combining other answers on SO (which just shows the power of searching :D)
import pandas as pd
df = pd.DataFrame({
'User':['101','101','102','102','102'],
'Product':['x','x','x','z','z'],
'Country':['India,Brazil','India','India,Brazil,Japan','India,Brazil','Brazil']
})
# Making use of: https://stackoverflow.com/a/37592047/7386332
j = (df.Country.str.split(',', expand=True).stack()
.reset_index(drop=True, level=1)
.rename('Country'))
df = df.drop('Country', axis=1).join(j)
# Reformat to get desired Country_Product
df = (df.drop(['Country', 'Product'], axis=1)
        .assign(Country_Product=['_'.join(i) for i in zip(df['Country'], df['Product'])]))
df2 = df.groupby(['User','Country_Product'])['User'].count().rename('Count').reset_index()
print(df2)
Returns:
User Country_Product Count
0 101 Brazil_x 1
1 101 India_x 2
2 102 Brazil_x 1
3 102 Brazil_z 2
4 102 India_x 1
5 102 India_z 1
6 102 Japan_x 1
How about get_dummies:
import numpy as np

(df.set_index(['User', 'Product']).Country.str.get_dummies(sep=',')
   .replace(0, np.nan).stack().groupby(level=[0, 1, 2]).sum())
Out[658]:
User Product
101 x Brazil 1.0
India 2.0
102 x Brazil 1.0
India 1.0
Japan 1.0
z Brazil 2.0
India 1.0
dtype: float64
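If you would rather have a flat DataFrame like the first answer, name the Series and reset the index (a small follow-up; the Count and Country labels are my choice):

out = (df.set_index(['User', 'Product']).Country.str.get_dummies(sep=',')
         .replace(0, np.nan).stack().groupby(level=[0, 1, 2]).sum()
         .rename('Count')
         .rename_axis(['User', 'Product', 'Country'])  # name the unnamed dummies level
         .reset_index())
print(out)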
I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017', 10],
                   ['John', '2/2/2017', 15],
                   ['John', '2/2/2017', 20],
                   ['John', '3/3/2017', 30],
                   ['Sue', '1/1/2017', 10],
                   ['Sue', '2/2/2017', 15],
                   ['Sue', '3/2/2017', 20],
                   ['Sue', '3/3/2017', 7],
                   ['Sue', '4/4/2017', 20]],
                  columns=['Customer', 'Deposit_Date', 'DPD'])
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])  # parse dates so comparisons work
What is the best way to calculate the PreviousMean column described below? The column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to but not including rows that match the current deposit date. If no previous records exist, it is null or 0.
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
Instead of grouping and expanding the mean, filter the dataframe on the conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all rows in the dataframe:
df['PreviousMean'] = df.apply(
lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
Here's one way to exclude repeated days from the mean calculation:
import numpy as np

# create a helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() == 1
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (pd.expanding_mean was removed; use .expanding().mean())
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())
# drop the helper column
df = df.drop('DPD2', axis=1)
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
OK, here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer and deposit-date level containing a shifted mean. To calculate this mean you have to calculate the sum and the count first.
s = df.groupby(['Customer Name', 'Deposit_Date'])[['DPD']].agg(['count', 'sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
s['DPD_CumSum'] = s.groupby(['Customer Name'])['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby(['Customer Name'])['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
s['DPD_PrevMean'] = s.groupby(['Customer Name'])['DPD_CumMean'].shift(1)
df = df.merge(s[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']],
              how='left', on=['Customer Name', 'Deposit_Date'])
I hope the title is accurate enough, I wasn't quite sure how to phrase it.
Anyhow, my problem is that I have a Pandas df which looks like the following:
Customer Source CustomerSource
0 Apple A 141
1 Apple B 36
2 Microsoft A 143
3 Oracle C 225
4 Sun C 151
This is a df derived from a larger dataset, and the value of CustomerSource is the accumulated count of all occurrences of that Customer and Source; for example, in this case there are 141 occurrences of Apple with Source A, 225 of Customer Oracle with Source C, and so on.
What I want to do with this is a stacked barplot that gives me all Customers on the x-axis and the values of CustomerSource stacked on top of each other on the y-axis, similar to the example below. Any hints as to how I would proceed with this?
You can use pivot or unstack to reshape and then DataFrame.plot.bar:
df.pivot(index='Customer', columns='Source', values='CustomerSource').plot.bar(stacked=True)
df.set_index(['Customer','Source'])['CustomerSource'].unstack().plot.bar(stacked=True)
Or, if there are duplicates in the Customer, Source pairs, use pivot_table or groupby with sum aggregation:
print (df)
Customer Source CustomerSource
0 Apple A 141 <-same Apple, A
1 Apple A 200 <-same Apple, A
2 Apple B 36
3 Microsoft A 143
4 Oracle C 225
5 Sun C 151
df = df.pivot_table(index='Customer',columns='Source',values='CustomerSource', aggfunc='sum')
print (df)
Source A B C
Customer
Apple 341.0 36.0 NaN <-141 + 200 = 341
Microsoft 143.0 NaN NaN
Oracle NaN NaN 225.0
Sun NaN NaN 151.0
(df.pivot_table(index='Customer', columns='Source', values='CustomerSource', aggfunc='sum')
   .plot.bar(stacked=True))
df.groupby(['Customer','Source'])['CustomerSource'].sum().unstack().plot.bar(stacked=True)
It is also possible to swap the columns:
df.pivot(index='Customer', columns='Source', values='CustomerSource').plot.bar(stacked=True)
df.pivot(index='Source', columns='Customer', values='CustomerSource').plot.bar(stacked=True)
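In a plain script the chart will not render by itself; here is a minimal, self-contained sketch assuming matplotlib is installed (the data is the question's table, and the axis label is my choice):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Customer': ['Apple', 'Apple', 'Microsoft', 'Oracle', 'Sun'],
                   'Source': ['A', 'B', 'A', 'C', 'C'],
                   'CustomerSource': [141, 36, 143, 225, 151]})

# reshape to one column per Source, then stack the bars per Customer
ax = df.pivot(index='Customer', columns='Source',
              values='CustomerSource').plot.bar(stacked=True)
ax.set_ylabel('CustomerSource')
plt.show()  # render the figure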