Assume that I have a Pandas data frame that looks like this:
import pandas as pd

df = pd.DataFrame({
    "YEAR": [2000, 2000, 2001, 2001, 2002],
    "VISITORS": [100, 2000, 200, 300, 250],
    "SALES": [5000, 2500, 23500, 1512, 3510],
    "MONTH": [1, 2, 1, 2, 1],
    "LOCATION": ["Loc1", "Loc2", "Loc1", "Loc2", "Loc1"]})
I want to join this data frame, on the MONTH and LOCATION columns, with the previous year's data from the same data frame.
I tried this:
def calculate(df):
    result_all_years = []
    for current_year in df["YEAR"].unique():
        df_previous = df.copy()
        df_previous = df_previous[df_previous["YEAR"] == current_year - 1]
        df_previous.rename(
            columns={
                "VISITORS": "VISITORS_LAST_YEAR",
                "SALES": "SALES_LAST_YEAR",
                "YEAR": "PREVIOUS_YEAR",
            },
            inplace=True,
        )
        df_current = df[df["YEAR"] == current_year]
        df_current = df_current.merge(
            df_previous,
            how="left",
            on=["MONTH", "LOCATION"]
        )
        # There are many similar calculations and additional columns to be added, like the following:
        df_current["SALES_DIFF"] = df_current["SALES"] - df_current["SALES_LAST_YEAR"]
        result_all_years.append(df_current)
    return pd.concat(result_all_years, ignore_index=True).round(3)
The code in the calculate function works fine, but is there a faster way to do this?
Try merging the dataframe with itself and manipulating the result accordingly:
diff_df = pd.merge(df, df, left_on = [df['YEAR'], df['MONTH'], df['LOCATION']], suffixes=('', '_PREV'),
right_on = [df['YEAR']+1, df['MONTH'], df['LOCATION']])
diff_df = diff_df[['YEAR', 'YEAR_PREV', 'MONTH', 'LOCATION','VISITORS','VISITORS_PREV','SALES','SALES_PREV']]
diff_df = diff_df.assign(VISITORS_DIFF = (diff_df['VISITORS_PREV'] - diff_df['VISITORS']),
SALES_DIFF = (diff_df['SALES_PREV'] - diff_df['SALES']))
Output
YEAR YEAR_PREV MONTH LOCATION VISITORS VISITORS_PREV SALES SALES_PREV VISITORS_DIFF SALES_DIFF
2001 2000 1 Loc1 200 100 23500 5000 -100 -18500
2001 2000 2 Loc2 300 2000 1512 2500 1700 988
2002 2001 1 Loc1 250 200 3510 23500 -50 19990
IIUC, you can use merge on the dataframe itself with the incremented YEAR:
(df.merge(df.assign(YEAR=df['YEAR']+1).drop(columns=['MONTH']),
          on=['YEAR', 'LOCATION'],
          how='left',
          suffixes=('', '_LAST_YEAR'))
   .assign(SALES_DIFF=lambda d: d['SALES']-d['SALES_LAST_YEAR'],
           LAST_YEAR=lambda d: d['YEAR'].sub(1).mask(d['SALES_DIFF'].isna())
           )
)
output:
YEAR VISITORS SALES MONTH LOCATION VISITORS_LAST_YEAR SALES_LAST_YEAR SALES_DIFF LAST_YEAR
0 2000 100 5000 1 Loc1 NaN NaN NaN NaN
1 2000 2000 2500 2 Loc2 NaN NaN NaN NaN
2 2001 200 23500 1 Loc1 100.0 5000.0 18500.0 2000.0
3 2001 300 1512 2 Loc2 2000.0 2500.0 -988.0 2000.0
4 2002 250 3510 1 Loc1 200.0 23500.0 -19990.0 2001.0
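Since the question mentions that several similar difference columns need to be added, they can all go into the same assign call; a minimal sketch building on the merge above (the out and VISITORS_DIFF names are just illustrative):
out = (df.merge(df.assign(YEAR=df['YEAR'] + 1).drop(columns=['MONTH']),
                on=['YEAR', 'LOCATION'],
                how='left',
                suffixes=('', '_LAST_YEAR'))
         .assign(SALES_DIFF=lambda d: d['SALES'] - d['SALES_LAST_YEAR'],
                 VISITORS_DIFF=lambda d: d['VISITORS'] - d['VISITORS_LAST_YEAR']))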
I have two data frames; the first looks like this:
df1:
MONEY Value
0 EUR 850
1 USD 750
2 CLP 1
3 DCN 1
df2:
Money
0 USD
1 USD
2 USD
3 USD
4 EGP
... ...
25984 USD
25985 DCN
25986 USD
25987 CLP
25988 USD
I want to remove the "Money" values of df2 that are not present in df1, and add a column with the corresponding values from the "Value" column of df1:
Money Value
0 USD 720
1 USD 720
2 USD 720
3 USD 720
... ...
25984 USD 720
25985 DCN 1
25986 USD 720
25987 CLP 1
25988 USD 720
Step by step:
df1.set_index("MONEY")["Value"]
This transforms the MONEY column into the index and selects the Value column, which gives a lookup Series:
print(df1.set_index("MONEY")["Value"])
MONEY
EUR 850
USD 150
DCN 1
df2["Money"].map(df1.set_index("MONEY")["Value"])
This maps each value of df2["Money"] to its corresponding Value in the lookup Series built from df1; values not found in df1 become NaN. It returns the following:
0 150.0
1 NaN
2 850.0
3 NaN
Now we assign that mapped Series to a new column in df2 called Value. Putting it all together:
df2["Value"] = df2["Money"].map(df1.set_index("MONEY")["Value"])
df2 now looks like:
Money Value
0 USD 150.0
1 GBP NaN
2 EUR 850.0
3 CLP NaN
Only one thing is left to do: drop any rows that have a NaN value:
df2.dropna(inplace=True)
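If df2 ever carries additional columns that may themselves contain NaN, restricting the drop to the new column keeps those rows; a small sketch under that assumption:
df2.dropna(subset=["Value"], inplace=True)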
Entire code sample:
import pandas as pd
# Create df1
x_1 = ["EUR", 850], ["USD", 150], ["DCN", 1]
df1 = pd.DataFrame(x_1, columns=["MONEY", "Value"])
# Create df2
x_2 = "USD", "GBP", "EUR", "CLP"
df2 = pd.DataFrame(x_2, columns=["Money"])
# Create new column in df2 called 'Value'
df2["Value"] = df2["Money"].map(df1.set_index("MONEY")["Value"])
# Drops any rows that have 'NaN' in column 'Value'
df2.dropna(inplace=True)
print(df2)
Outputs:
Money Value
0 USD 150.0
2 EUR 850.0
In my dataframe, df, I am trying to sum the values from the value column for each Product and Year for two periods of the year (Month), specifically Months 1 through 3 and Months 9 through 11. I know I need to use groupby to group Products and Years, and possibly use a lambda function (or an if statement) to separate the two periods of time.
Here's my data frame df:
import pandas as pd
products = {'Product': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C',
'C','C','C'],
'Month': [1,1,3,4,5,10,4,5,10,11,2,3,5,3,9,
10,11,12],
'Year': [1999,1999,1999,1999,1999,1999,2017,2017,1988,1988,2002,2002,2002,2003,2003,
2003,2003,2003],
'value': [250,810,1200,340,250,800,1200,400,250,800,1200,300,290,800,1200,300, 1200, 300]
}
df = pd.DataFrame(products, columns= ['Product', 'Month','Year','value'])
df
And I want a table that looks something like this:
products = {'Product': ['A','A','B','B','C','C','C'],
'MonthGroups': ['Month1:3','Month9:11','Month1:3','Month9:11','Month1:3','Month1:3','Month9:11'],
'Year': [1999,1999,2017,1988,2002, 2003, 2003],
'SummedValue': [2260, 800, 0, 1050, 1500, 800, 2700]
}
new_df = pd.DataFrame(products, columns= ['Product', 'MonthGroups','Year','SummedValue'])
new_df
What I have so far is that I should use groupby to group by Product and Year. What I'm stuck on is defining the two "Month Groups" (Months 1 through 3 and Months 9 through 11) whose values should be summed per year.
df.groupby(['Product','Year']).value.sum().loc[lambda p: p > 10].to_frame()
This isn't right though because it needs to sum based on the month groups.
First create a new column with numpy.select inside DataFrame.assign, then aggregate by MonthGroups as well. Because groupby by default drops rows whose grouping column (here MonthGroups) has missing values, the non-matching months are omitted:
import numpy as np

df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                          df['Month'].between(9,11)],
                                         ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
       )
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
If you also need 0 sums for the non-matching rows:
df2 = df[['Product','Year']].drop_duplicates().assign(MonthGroups='Month1:3', SummedValue=0)
df1 = (df.assign(MonthGroups = np.select([df['Month'].between(1,3),
                                          df['Month'].between(9,11)],
                                         ['Month1:3','Month9:11'], default=None))
         .groupby(['Product','MonthGroups','Year']).value
         .sum()
         .reset_index(name='SummedValue')
       )
# DataFrame.append is deprecated, so concatenate the 0-sum rows and keep the first match per group
df1 = (pd.concat([df1, df2])
         .drop_duplicates(['Product','MonthGroups','Year']))
print (df1)
Product MonthGroups Year SummedValue
0 A Month1:3 1999 2260
1 A Month9:11 1999 800
2 B Month9:11 1988 1050
3 C Month1:3 2002 1500
4 C Month1:3 2003 800
5 C Month9:11 2003 2700
6 B Month1:3 2017 0
8 B Month1:3 1988 0
A little different approach using pd.cut:
bins = [0,3,8,11]
s = pd.cut(df['Month'],bins,labels=['1:3','irrelevant','9:11'])
(df[s.isin(['1:3','9:11'])].assign(MonthGroups=s.astype(str))
.groupby(['Product','MonthGroups','Year'])['value'].sum().reset_index())
Product MonthGroups Year value
0 A 1:3 1999 2260
1 A 9:11 1999 800
2 B 9:11 1988 1050
3 C 1:3 2002 1500
4 C 1:3 2003 800
5 C 9:11 2003 2700
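If the summed column should carry the SummedValue name used in the target frame, the pd.cut variant can finish with a named reset_index; a sketch reusing the s series from above (the out name is just for illustration):
out = (df[s.isin(['1:3','9:11'])]
         .assign(MonthGroups=s.astype(str))
         .groupby(['Product','MonthGroups','Year'])['value']
         .sum()
         .reset_index(name='SummedValue'))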
I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017','10'],
['John', '2/2/2017','15'],
['John', '2/2/2017','20'],
['John', '3/3/2017','30'],
['Sue', '1/1/2017','10'],
['Sue', '2/2/2017','15'],
['Sue', '3/2/2017','20'],
['Sue', '3/3/2017','7'],
['Sue', '4/4/2017','20']
],
columns=['Customer', 'Deposit_Date','DPD'])
What is the best way to calculate the PreviousMean column described in the notes below?
The column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to, but not including, rows that match the current deposit date. If no previous records exist, it is null or 0.
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
Instead of grouping and expanding the mean, filter the dataframe on the following conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all rows in the dataframe:
# DPD and Deposit_Date are strings in the sample frame, so convert them first
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = df['DPD'].astype(int)
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
outputs:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
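Since the question allows either null or 0 for a customer's first deposit, the leading NaN values can be filled afterwards if 0 is preferred; a one-line sketch:
df['PreviousMean'] = df['PreviousMean'].fillna(0)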
Here's one way to exclude repeated days from mean calculation:
import numpy as np

# create helper series which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (pd.expanding_mean was removed; use .expanding().mean())
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())
# drop helper column
df = df.drop(columns='DPD2')
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
OK, here is the best solution I've come up with so far.
The trick is to first create an aggregated table at the customer and deposit-date level containing a shifted mean. To calculate this mean, you have to compute the cumulative sum and count first.
# aggregate DPD count and sum per customer and deposit date
s = df.groupby(['Customer Name','Deposit_Date'])[['DPD']].agg(['count','sum'])
s.columns = [' '.join(col) for col in s.columns]
s.reset_index(inplace=True)
# cumulative sum and count per customer give the running mean
s['DPD_CumSum'] = s.groupby(['Customer Name'])['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby(['Customer Name'])['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
# shift by one deposit date so the current date is excluded
s['DPD_PrevMean'] = s.groupby(['Customer Name'])['DPD_CumMean'].shift(1)
df = df.merge(s[['Customer Name','Deposit_Date','DPD_PrevMean']], how='left', on=['Customer Name','Deposit_Date'])
Assume file1 is:
State Date
0 NSW 01/02/16
1 NSW 01/03/16
3 VIC 01/04/16
...
100 TAS 01/12/17
File 2 is:
State 01/02/16 01/03/16 01/04/16 .... 01/12/17
0 VIC 10000 12000 14000 .... 17600
1 NSW 50000
....
Now I would like to join these two files based on Date.
In other words, I want to match file1's Date column against file2's date columns.
I believe you need melt with merge; the on parameter can be omitted here because the merge uses all columns that are common to both DataFrames:
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df = df2.melt('State', var_name='Date', value_name='col').merge(df1, how='right')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
Solution with left join:
df = df1.merge(df2.melt('State', var_name='Date', value_name='col'), how='left')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
You can melt the second data frame to a long format, then merge it with the first data frame to get the values.
import pandas as pd
df1 = pd.DataFrame({'State': ['NSW','NSW','VIC'],
'Date': ['01/02/16', '01/03/16', '01/04/16']})
df2 = pd.DataFrame([['VIC',10000,12000,14000],
['NSW',50000,60000,62000]],
columns=['State', '01/02/16', '01/03/16', '01/04/16'])
df1.merge(pd.melt(df2, id_vars=['State'], var_name='Date'), on=['State', 'Date'])
# returns:
Date State value
0 01/02/16 NSW 50000
1 01/03/16 NSW 60000
2 01/04/16 VIC 14000
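Both merges above match the date strings literally, so they only work if file1's Date column and file2's column headers use exactly the same format; if that is not guaranteed, parsing both sides with pd.to_datetime first is safer. A sketch, assuming day-first dates:
long2 = pd.melt(df2, id_vars=['State'], var_name='Date')
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
long2['Date'] = pd.to_datetime(long2['Date'], dayfirst=True)
merged = df1.merge(long2, on=['State', 'Date'])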
I have data that contains prices, volumes and other data about various financial securities. My input data looks like the following:
import numpy as np
import pandas
prices = np.random.rand(15) * 100
volumes = np.random.randint(15, size=15) * 10
idx = pandas.Series([2007, 2007, 2007, 2007, 2007, 2008,
2008, 2008, 2008, 2008, 2009, 2009,
2009, 2009, 2009], name='year')
df = pandas.DataFrame({'price': prices, 'volume': volumes})
df.index = idx
# BELOW IS AN EXAMPLE OF WHAT THE INPUT MIGHT LOOK LIKE
# IT WON'T BE EXACT BECAUSE OF THE USE OF RANDOM
# price volume
# year
# 2007 0.121002 30
# 2007 15.256424 70
# 2007 44.479590 50
# 2007 29.096013 0
# 2007 21.424690 0
# 2008 23.019548 40
# 2008 90.011295 0
# 2008 88.487664 30
# 2008 51.609119 70
# 2008 4.265726 80
# 2009 34.402065 140
# 2009 10.259064 100
# 2009 47.024574 110
# 2009 57.614977 140
# 2009 54.718016 50
I want to produce a data frame that looks like:
year 2007 2008 2009
0 0.121002 23.019548 34.402065
1 15.256424 90.011295 10.259064
2 44.479590 88.487664 47.024574
3 29.096013 51.609119 57.614977
4 21.424690 4.265726 54.718016
I know of one way to produce the output above using groupby:
df = df.reset_index()
grouper = df.groupby('year')
df2 = None
for group, data in grouper:
    series = data['price'].copy()
    series.index = range(len(series))
    series.name = group
    df2 = pandas.DataFrame(series) if df2 is None else pandas.concat([df2, series], axis=1)
And I also know that you can do pivot to get a DataFrame that has NaNs for the missing indices on the pivot:
# df = df.reset_index()
df.pivot(columns='year', values='price')
# Output
# year 2007 2008 2009
# 0 0.121002 NaN NaN
# 1 15.256424 NaN NaN
# 2 44.479590 NaN NaN
# 3 29.096013 NaN NaN
# 4 21.424690 NaN NaN
# 5 NaN 23.019548 NaN
# 6 NaN 90.011295 NaN
# 7 NaN 88.487664 NaN
# 8 NaN 51.609119 NaN
# 9 NaN 4.265726 NaN
# 10 NaN NaN 34.402065
# 11 NaN NaN 10.259064
# 12 NaN NaN 47.024574
# 13 NaN NaN 57.614977
# 14 NaN NaN 54.718016
My question is the following:
Is there a way that I can create my output DataFrame in the groupby without creating the series, or is there a way I can re-index my input DataFrame so that I get the desired output using pivot?
You need to label the rows within each year 0-4. To do this, use cumcount after grouping. Then you can pivot correctly using that new column as the index.
df['year_count'] = df.groupby(level='year').cumcount()
df.reset_index().pivot(index='year_count', columns='year', values='price')
year 2007 2008 2009
year_count
0 61.682275 32.729113 54.859700
1 44.231296 4.453897 45.325802
2 65.850231 82.023960 28.325119
3 29.098607 86.046499 71.329594
4 67.864723 43.499762 19.255214
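The same cumcount-and-pivot idea can also be written as a single expression that leaves df untouched; a sketch of that variant:
out = (df.assign(year_count=df.groupby(level='year').cumcount())
         .reset_index()
         .pivot(index='year_count', columns='year', values='price'))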
You can use groupby with apply, creating a new Series from the underlying numpy array of values, and then reshape with unstack:
print (df.groupby(level='year')['price'].apply(lambda x: pandas.Series(x.values)).unstack(0))
year 2007 2008 2009
0 55.360804 68.671626 78.809139
1 50.246485 55.639250 84.483814
2 17.646684 14.386347 87.185550
3 54.824732 91.846018 60.793002
4 24.303751 50.908714 22.084445