I have a dataframe with a time index. I can resample the data to get (e.g.) the mean per day, but I would also like to get the counts per day. Here is a sample:
import datetime
import pandas as pd
import numpy as np
dates = pd.date_range(datetime.datetime(2012, 4, 5, 11, 0),
                      datetime.datetime(2012, 4, 7, 7, 0), freq='5H')
var1 = np.random.sample(dates.size) * 10.0
var2 = np.random.sample(dates.size) * 10.0
df = pd.DataFrame(data={'var1': var1, 'var2': var2}, index=dates)
df1 = df.resample('D').mean()
I would also like a third column 'count' that gives the number of rows per day:
count
3
5
7
Thank you very much!
Use Resampler.agg and then flatten the MultiIndex in the columns:
df1 = df.resample('D').agg({'var1': 'mean','var2': ['mean', 'size']})
df1.columns = df1.columns.map('_'.join)
df1 = df1.rename(columns={'var2_size':'count'})
print (df1)
var1_mean var2_mean count
2012-04-05 3.992166 4.968410 3
2012-04-06 6.843105 6.193568 5
2012-04-07 4.568436 3.135089 1
Alternative solution with Grouper:
df1 = df.groupby(pd.Grouper(freq='D')).agg({'var1': 'mean','var2': ['mean', 'size']})
df1.columns = df1.columns.map('_'.join)
df1 = df1.rename(columns={'var2_size':'count'})
print (df1)
var1_mean var2_mean count
2012-04-05 3.992166 4.968410 3
2012-04-06 6.843105 6.193568 5
2012-04-07 4.568436 3.135089 1
EDIT:
r = df.resample('D')
df1 = r.mean().add_suffix('_mean').join(r.size().rename('count'))
print (df1)
var1_mean var2_mean count
2012-04-05 7.840487 6.885030 3
2012-04-06 4.762477 5.091455 5
2012-04-07 2.702414 6.046200 1
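Another option: on pandas 0.25 or newer, named aggregation produces the flat column names directly, so the MultiIndex flattening step is not needed. A sketch using the Grouper variant with the question's sample data:

```python
import datetime

import numpy as np
import pandas as pd

dates = pd.date_range(datetime.datetime(2012, 4, 5, 11, 0),
                      datetime.datetime(2012, 4, 7, 7, 0), freq='5H')
df = pd.DataFrame({'var1': np.random.sample(dates.size) * 10.0,
                   'var2': np.random.sample(dates.size) * 10.0}, index=dates)

# named aggregation: output_column=(input_column, function)
df1 = df.groupby(pd.Grouper(freq='D')).agg(var1_mean=('var1', 'mean'),
                                           var2_mean=('var2', 'mean'),
                                           count=('var2', 'size'))
```

The column names come straight from the keyword arguments, so no join/rename is required afterwards.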
I have two dataframes:
df1:
date score perf
0 2021-08-01 2 4
1 2021-08-02 4 5
2 2021-08-03 6 7
df2:
date score perf
0 2021-08-01 2 7
1 2021-08-02 4 8
2 2021-08-03 6 7
I want to return df1, df2, and the difference in perf between df1 and df2 as a third dataframe, displayed together as shown in this picture:
The illustration of the tables you shared does not match the values of df1 and df2.
But anyway, you can use pandas.merge to bring in the column(s) you need for the subtraction.
import pandas as pd
df1 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
'score' : [2, 4, 6],
'perf': [4, 5, 7]})
df2 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
'score' : [2, 4, 6],
'perf': [7, 8, 7]})
out = df1.merge(df2[['date','perf']], on='date', how='left')
out['perf'] = out['perf_x'] - out['perf_y']
out = out.drop(['perf_x','perf_y'], axis=1)
print(out)
         date  score  perf
0  01/08/2021      2    -3
1  02/08/2021      4    -3
2  03/08/2021      6     0
Note: In case you want to subtract all the columns of both dataframes, you can use pandas.DataFrame.sub instead to subtract one dataframe from another.
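A minimal sketch of that DataFrame.sub variant, using 'date' as the alignment index so every remaining column is subtracted at once (same column names as in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [4, 5, 7]})
df2 = pd.DataFrame({'date': ['01/08/2021', '02/08/2021', '03/08/2021'],
                    'score': [2, 4, 6],
                    'perf': [7, 8, 7]})

# align rows on 'date', then subtract all shared columns element-wise
diff = df1.set_index('date').sub(df2.set_index('date'))
```

Here both score and perf are subtracted; if you only want perf, the merge approach above is the simpler choice.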
Edit:
Now, if you want to display the three dataframes in an Excel sheet (as shown in your illustration), you can use this function:
df3 = out  # the difference dataframe computed above
list_df = [df1, df2, df3]
df1.name = 'Table 1'
df2.name = 'Table 2'
df3.name = 'Table 3 (tabl1perf - tabl2perf)'

def display_results(dfs):
    sr = 1
    # note: worksheet.write assumes the xlsxwriter engine
    with pd.ExcelWriter('your_excel_name.xlsx') as writer:
        for df in dfs:
            df.to_excel(writer, startrow=sr, index=False)
            worksheet = writer.sheets['Sheet1']
            worksheet.write(sr - 1, 0, df.name)  # title row above each table
            sr += df.shape[0] + 3

display_results(list_df)
Output (in Excel):
I have 2 dataframes df1 and df2 (same index and number of rows), and I would like to create a new dataframe whose columns are the sums of all combinations of one column from df1 and one column from df2. Example:
Input:
import pandas as pd
df1 = pd.DataFrame([[10,20]])
df2 = pd.DataFrame([[1,2]])
Output:
import pandas as pd
df3 = pd.DataFrame([[11,12,21,22]])
Use MultiIndex.from_product for all combinations, then sum the DataFrames after repeating their values with DataFrame.reindex:
mux = pd.MultiIndex.from_product([df1.columns, df2.columns])
df = df1.reindex(mux, level=0, axis=1) + df2.reindex(mux, level=1, axis=1)
df.columns = range(len(df.columns))
IIUC you can do this with numpy.
>>> import numpy as np
>>> n = df1.shape[1]
>>> pd.DataFrame(df1.values.repeat(n) + np.tile(df2.values, n))
0 1 2 3
0 11 12 21 22
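Note that values.repeat(n) flattens to 1-D, which broadcasts correctly only because there is a single row. For frames with multiple rows, repeat along the columns axis instead; a sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]])
df2 = pd.DataFrame([[1, 2], [3, 4]])

n = df1.shape[1]
# repeat each df1 column n times side by side, and tile df2's columns to match
out = pd.DataFrame(df1.values.repeat(n, axis=1) + np.tile(df2.values, n))
```

Each row of the result holds the sums for the corresponding rows of df1 and df2.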
I have a dataframe with [Year] and [Week] columns in which some weeks are missing. I have another dataframe that is a reference calendar from which I can get these missing values. How can I fill in the missing rows using pandas?
I have tried using reindex to set them up, but I am getting the following error
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week':[1,2,4,6,7],
      'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7,8,9,10]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
df1 = df1.reindex(df2, fill_value=0)
print(df1)
You should pass the index instead, i.e. df2.index:
df1.reindex(df2.index,fill_value=0)
Out[851]:
Value
Year Week
2019 1 20
2 40
3 0
4 60
5 0
6 75
7 90
To check which entries are missing:
df2.index.difference(df1.index)
Out[854]:
MultiIndex(levels=[[2019], [3, 5]],
labels=[[0, 0], [0, 1]],
names=['Year', 'Week'],
sortorder=0)
Update: to fill interior gaps with 0 but drop the trailing weeks (8-10 here) that come after the last observed row:
s=df1.reindex(df2.index)
s[s.bfill().notnull().values].fillna(0)
Out[877]:
Value
Year Week
2019 1 20.0
2 40.0
3 0.0
4 60.0
5 0.0
6 75.0
7 90.0
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week':[1,2,4,6,7],
      'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
fill_value = df1['Value'].mean()  # value to fill NaN rows with - choose other logic if you do not want the mean
df1 = df1.join(df2, how='right')
df1 = df1.fillna(value=fill_value)  # fill the missing rows here
print(df1)
I have a df:
date amount code id
2018-01-01 50 12 1
2018-02-03 100 12 1
2017-12-30 1 13 2
2017-11-30 2 14 2
I want to group by id while keeping each group sorted by date (ascending or descending), so that I can do the following:
grouped = df.groupby('id')
a = np.where(grouped['code'].transform('nunique') == 1, 20, 0)
b = np.where(grouped['amount'].transform('max') > 100, 20, 0)
c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
You can sort the data within each group by using apply and sort_values:
grouped = df.groupby('id').apply(lambda g: g.sort_values('date', ascending=True))
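Note that apply here returns an ordinary DataFrame rather than a groupby object. If the goal is to run the transforms from the question on date-ordered groups, sorting before grouping is usually simpler, since groupby preserves the row order within each group; a sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2018-01-01', '2018-02-03',
                                           '2017-12-30', '2017-11-30']),
                   'amount': [50, 100, 1, 2],
                   'code': [12, 12, 13, 14],
                   'id': [1, 1, 2, 2]})

# sort once up front; each group then yields its rows in date order
grouped = df.sort_values('date').groupby('id')
```

The transform calls from the question (nunique, max, the date diff) can then be run on grouped unchanged.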
Adding to the previous answer, if you wish the indexes to remain as they were, you might consider the following:
import pandas as pd
df = {'a':[1,2,3,0,5], 'b':[2,2,3,2,5], 'c':[22,11,11,42,12]}
df = pd.DataFrame(df)
e = (df.groupby(['c','b', 'a']).size()).reset_index()
e = e[['a', 'b', 'c']]
e = e.sort_values(['c','a'])
print(e)
I have 2 dataframes - players (only has playerid) and dates (only has date). I want a new dataframe that contains each date for each player. In my case, the players df has about 2600 rows and the dates df has 1100 rows. I used 2 for loops to do this, but it is really slow. Is there a way to do it faster via some function? Thanks!
My loop:
player_elo = pd.DataFrame(columns=['PlayerID', 'Date'])
for row in players.itertuples():
    idx = row.Index
    pl = players.at[idx, 'PlayerID']
    for i in dates.itertuples():
        idd = i.Index
        dt = dates.at[idd, 0]
        new = pd.DataFrame({'PlayerID': [pl], 'Date': [dt]})
        player_elo = player_elo.append(new)
If you have a key that is repeated for each df, you can come up with the cartesian product you are looking for using pd.merge().
import pandas as pd
players = pd.DataFrame([['A'], ['B'], ['C']], columns=['PlayerID'])
dates = pd.DataFrame([['12/12/2012'],['12/13/2012'],['12/14/2012']], columns=['Date'])
dates['Date'] = pd.to_datetime(dates['Date'])
players['key'] = 1
dates['key'] = 1
print(pd.merge(players, dates,on='key')[['PlayerID', 'Date']])
Output
PlayerID Date
0 A 2012-12-12
1 A 2012-12-13
2 A 2012-12-14
3 B 2012-12-12
4 B 2012-12-13
5 B 2012-12-14
6 C 2012-12-12
7 C 2012-12-13
8 C 2012-12-14
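On pandas 1.2+, merge supports how='cross', which builds the same cartesian product without the dummy key column; a sketch:

```python
import pandas as pd

players = pd.DataFrame({'PlayerID': ['A', 'B', 'C']})
dates = pd.DataFrame({'Date': pd.to_datetime(['2012-12-12', '2012-12-13',
                                              '2012-12-14'])})

# cross join: every player paired with every date
out = players.merge(dates, how='cross')
```

With 2600 players and 1100 dates this produces the full 2,860,000-row frame in a single vectorized call, avoiding both the loops and the key bookkeeping.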