I am having trouble making the transition from using R data.table to using Pandas for data munging.
Specifically, I am trying to assign the results of aggregations back into the original df as a new column. Note that the aggregations are functions of two columns, so I don't think df.transform() is the right approach.
To illustrate, I'm trying to replicate what I would do in R with:
library(data.table)
df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))
df[term == 'qtr' , `:=`(vwap_ish = sum(hours * price),
avg_id = mean(id) ),
.(node, term)]
df
# id term node hours price vwap_ish avg_id
# 1: 1 qtr A 300 107 90600 2
# 2: 2 qtr A 300 104 90600 2
# 3: 3 qtr A 300 91 90600 2
# 4: 4 qtr B 300 89 95400 5
# 5: 5 qtr B 300 113 95400 5
# 6: 6 qtr B 300 116 95400 5
# 7: 7 mth A 50 110 NA NA
# 8: 8 mth A 100 119 NA NA
# 9: 9 mth A 150 99 NA NA
# 10: 10 mth B 50 111 NA NA
# 11: 11 mth B 100 106 NA NA
# 12: 12 mth B 150 108 NA NA
Using Pandas, I can create an object from df that contains all rows of the original df, with the aggregations:
import io
import numpy as np
import pandas as pd
data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)
df1 = df.groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
df1
"""
id term node hours price vwap_ish avg_id
node term
B mth 9 10 mth B 50 111 32350 11.0
10 11 mth B 100 106 32350 11.0
11 12 mth B 150 108 32350 11.0
qtr 3 4 qtr B 300 89 95400 5.0
4 5 qtr B 300 113 95400 5.0
5 6 qtr B 300 116 95400 5.0
A mth 6 7 mth A 50 110 32250 8.0
7 8 mth A 100 119 32250 8.0
8 9 mth A 150 99 32250 8.0
qtr 0 1 qtr A 300 107 90600 2.0
1 2 qtr A 300 104 90600 2.0
2 3 qtr A 300 91 90600 2.0
"""
This doesn't really get me what I want because a) it re-orders and creates indices, and b) it has calculated the aggregation for all rows.
I can get the subset easily enough with
df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
df2
"""
id term node hours price vwap_ish avg_id
node term
A qtr 0 1 qtr A 300 107 90600 2.0
1 2 qtr A 300 104 90600 2.0
2 3 qtr A 300 91 90600 2.0
B qtr 3 4 qtr B 300 89 95400 5.0
4 5 qtr B 300 113 95400 5.0
5 6 qtr B 300 116 95400 5.0
"""
but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.
I've tried:
df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
And also a few variations of .merge and .join. For example:
df.merge(df2, how='left')
ValueError: 'term' is both an index level and a column label, which is ambiguous.
and
df.merge(df2, how='left', on=df.columns)
KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')
In writing this I realised I could take my first approach and then just do
df1.loc[df1.term != 'qtr', ['vwap_ish','avg_id']] = np.nan
but this seems quite hacky. It means I have to use a new column, rather than overwriting an existing one on the filtered rows, and if the aggregation function were to break say if term='mth' then that would be problematic too.
I'd really appreciate any help with this as it's been a very steep learning curve to try to make the transition from data.table to Pandas and there's so much I would do in a one-liner that is taking me hours to figure out.
You can add the group_keys=False parameter to avoid creating the MultiIndex, so that the left join works:
df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
Solution without left join:
m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
.apply(lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
))
An improved solution uses named aggregation; creating the vwap_ish column before the groupby can improve performance:
df2 = (df[df.term == 'qtr']
.assign(vwap_ish = lambda x: x.hours * x.price)
.groupby(['node','term'], as_index=False)
.agg(vwap_ish=('vwap_ish','sum'),
avg_id=('id','mean')))
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
If performance matters and you are willing to avoid apply, one option is to break it into individual steps:
Compute the product of hours and price before grouping:
temp = df.assign(vwap_ish = df.hours * df.price, avg_id = df.id)
Get the groupby object after filtering term:
temp = (temp
.loc[temp.term.eq('qtr'), ['vwap_ish', 'avg_id']]
.groupby([df.node, df.term])
)
Assign back the aggregated values with transform; pandas will take care of the index alignment:
(df
.assign(vwap_ish = temp.vwap_ish.transform('sum'),
avg_id = temp.avg_id.transform('mean'))
)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
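The same idea can be condensed into a mask plus two grouped transforms. This is only a sketch on the question's sample data; the names `mask`, `sub` and `keys` are illustrative and not taken from the answers above.

```python
import io

import pandas as pd

csv_text = """id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"""
df = pd.read_csv(io.StringIO(csv_text))

# Rows that should receive the aggregated values.
mask = df["term"].eq("qtr")
sub = df[mask]
keys = [sub["node"], sub["term"]]

# transform returns one value per input row, indexed like `sub`,
# so assigning through .loc aligns on the index and leaves the
# unselected rows as NaN.
df.loc[mask, "vwap_ish"] = (sub["hours"] * sub["price"]).groupby(keys).transform("sum")
df.loc[mask, "avg_id"] = sub["id"].groupby(keys).transform("mean")

print(df)
```

Because nothing is merged or re-indexed, the original row order and index are preserved, which is close to what `:=` does in data.table.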
This is just an aside, and you can totally ignore it: pydatatable attempts to mimic R's data.table as much as it can. This is one solution with pydatatable:
from datatable import dt, f, by, ifelse, update
DT = dt.Frame(df)
query = f.term == 'qtr'
agg = {'vwap_ish': ifelse(query, (f.hours * f.price), np.nan).sum(),
'avg_id' : ifelse(query, f.id, np.nan).mean()}
# update is a near equivalent to `:=`
DT[:, update(**agg), by('node', 'term')]
DT
| id term node hours price vwap_ish avg_id
| int64 str32 str32 int64 int64 float64 float64
-- + ----- ----- ----- ----- ----- -------- -------
0 | 1 qtr A 300 107 90600 2
1 | 2 qtr A 300 104 90600 2
2 | 3 qtr A 300 91 90600 2
3 | 4 qtr B 300 89 95400 5
4 | 5 qtr B 300 113 95400 5
5 | 6 qtr B 300 116 95400 5
6 | 7 mth A 50 110 NA NA
7 | 8 mth A 100 119 NA NA
8 | 9 mth A 150 99 NA NA
9 | 10 mth B 50 111 NA NA
10 | 11 mth B 100 106 NA NA
11 | 12 mth B 150 108 NA NA
[12 rows x 7 columns]
I have a DataFrame and I want to calculate, for each row, the running mean and variance of Points for that person. There is a Date column, and chronological order must be respected when calculating the mean and the variance; the DataFrame is already sorted by date. The dates are just the number of days after the earliest date. For a person's earliest date, the mean is simply the value in the Points column and the variance should be NaN or 0. For the second date, the mean should be the mean of the Points values for that date and the previous one, and so on. Here is my code to generate the DataFrame:
import pandas as pd
import numpy as np
data=[["Al",0, 12],["Bob",2, 10],["Carl",5, 12],["Al",5, 5],["Bob",9, 2]
,["Al",22, 4],["Bob",22, 16],["Carl",33, 2],["Al",45, 7],["Bob",68, 4]
,["Al",72, 11],["Bob",79, 5]]
df= pd.DataFrame(data, columns=["Name", "Date", "Points"])
print(df)
Name Date Points
0 Al 0 12
1 Bob 2 10
2 Carl 5 12
3 Al 5 5
4 Bob 9 2
5 Al 22 4
6 Bob 22 16
7 Carl 33 2
8 Al 45 7
9 Bob 68 4
10 Al 72 11
11 Bob 79 5
Here is my code to obtain the mean and the variance:
df['Mean'] = df.apply(
lambda x: df[(df.Name == x.Name) & (df.Date < x.Date)].Points.mean(),
axis=1)
df['Variance'] = df.apply(
lambda x: df[(df.Name == x.Name)& (df.Date < x.Date)].Points.var(),
axis=1)
However, the mean is shifted by one row and the variance by two rows. The dataframe obtained when sorted by Name and Date is:
Name Date Points Mean Variance
0 Al 0 12 NaN NaN
3 Al 5 5 12.000000 NaN
5 Al 22 4 8.50000 24.500000
8 Al 45 7 7.000000 19.000000
10 Al 72 11 7.000000 12.666667
1 Bob 2 10 NaN NaN
4 Bob 9 2 10.000000 NaN
6 Bob 22 16 6.000000 32.000000
9 Bob 68 4 9.333333 49.333333
11 Bob 79 5 8.000000 40.000000
2 Carl 5 12 NaN NaN
7 Carl 33 2 12.000000 NaN
Instead, the dataframe should be as below:
Name Date Points Mean Variance
0 Al 0 12 12 NaN
3 Al 5 5 8.5 24.5
5 Al 22 4 7 19
8 Al 45 7 7 12.67
10 Al 72 11 7.8 ...
1 Bob 2 10 10 NaN
4 Bob 9 2 6 32
6 Bob 22 16 9.33 49.33
9 Bob 68 4 8 40
11 Bob 79 5 7.4 ...
2 Carl 5 12 12 NaN
7 Carl 33 2 7 50
What should I change?
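One possible fix, sketched under the assumption that each person has at most one row per date and that the frame is already sorted by Date: the filter `df.Date < x.Date` excludes the current row, which would explain why the results lag behind. Changing it to `<=` should address that. A grouped expanding window is an alternative that avoids the row-wise apply; the variable name `grouped` is illustrative.

```python
import pandas as pd

data = [["Al", 0, 12], ["Bob", 2, 10], ["Carl", 5, 12], ["Al", 5, 5], ["Bob", 9, 2],
        ["Al", 22, 4], ["Bob", 22, 16], ["Carl", 33, 2], ["Al", 45, 7], ["Bob", 68, 4],
        ["Al", 72, 11], ["Bob", 79, 5]]
df = pd.DataFrame(data, columns=["Name", "Date", "Points"])

grouped = df.groupby("Name")["Points"]

# expanding() includes the current row and every earlier row of the
# same person, in the existing (date-sorted) row order.
df["Mean"] = grouped.transform(lambda s: s.expanding().mean())
# Sample variance (ddof=1), so the first row of each person is NaN.
df["Variance"] = grouped.transform(lambda s: s.expanding().var())

print(df.sort_values(["Name", "Date"]))
```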
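I have a data frame like the one below. For each Id, I need to find the missing quarterly values and count how many quarters are missing between the existing ones.
The data frame is shown first, followed by the expected output (the first and last missing quarter of each gap, plus the count):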
year Data Id
2019Q4 57170 A
2019Q3 55150 A
2019Q2 51109 A
2019Q1 51109 A
2018Q1 57170 B
2018Q4 55150 B
2017Q4 51109 C
2017Q2 51109 C
2017Q1 51109 C
Id start-year end-year count
B 2018Q2 2018Q3 2
C 2017Q3 2017Q3 1
How can I achieve this using pandas in Python?
Use:
#data changed for a more general solution - multiple missing years per group
print (df)
year Data Id
0 2015 57170 A
1 2016 55150 A
2 2019 51109 A
3 2023 51109 A
4 2000 47740 B
5 2002 44563 B
6 2003 43643 C
7 2004 42050 C
8 2007 37312 C
#add rows for the missing years by reindexing each group over its full year range
df1 = (df.set_index('year')
.groupby('Id')['Id']
.apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)))
.reset_index(name='val'))
print (df1)
Id year val
0 A 2015 A
1 A 2016 A
2 A 2017 NaN
3 A 2018 NaN
4 A 2019 A
5 A 2020 NaN
6 A 2021 NaN
7 A 2022 NaN
8 A 2023 A
9 B 2000 B
10 B 2001 NaN
11 B 2002 B
12 C 2003 C
13 C 2004 C
14 C 2005 NaN
15 C 2006 NaN
16 C 2007 C
#boolean mask of the non-NaN rows, stored in a variable for reuse
m = df1['val'].notnull().rename('g')
#use the cumulative sum as index, so each run of consecutive NaNs gets a unique group id
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2017 2018 2
1 A 2020 2022 3
2 B 2001 2001 1
3 C 2005 2006 2
EDIT:
#convert to datetimes
df['year'] = pd.to_datetime(df['year'], format='%Y%m')
#resample by start of months with asfreq
df1 = df.set_index('year').groupby('Id')['Id'].resample('MS').asfreq().rename('val').reset_index()
print (df1)
Id year val
0 A 2015-05-01 A
1 A 2015-06-01 NaN
2 A 2015-07-01 A
3 A 2015-08-01 NaN
4 A 2015-09-01 A
5 B 2000-01-01 B
6 B 2000-02-01 NaN
7 B 2000-03-01 B
8 C 2003-01-01 C
9 C 2003-02-01 C
10 C 2003-03-01 NaN
11 C 2003-04-01 NaN
12 C 2003-05-01 C
m = df1['val'].notnull().rename('g')
#use the cumulative sum as index, so each run of consecutive NaNs gets a unique group id
df1.index = m.cumsum()
#keep only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2015-06-01 2015-06-01 1
1 A 2015-08-01 2015-08-01 1
2 B 2000-02-01 2000-02-01 1
3 C 2003-03-01 2003-04-01 2
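The answer above works on changed (yearly, then monthly) data. For the quarterly strings in the question, a minimal sketch could rely on `pd.Period` arithmetic instead. The function name `missing_quarter_runs` and the output column names are hypothetical, and the sketch assumes every value in `year` parses as a quarter such as `2018Q2`.

```python
import pandas as pd

df = pd.DataFrame({
    "year": ["2019Q4", "2019Q3", "2019Q2", "2019Q1", "2018Q1",
             "2018Q4", "2017Q4", "2017Q2", "2017Q1"],
    "Data": [57170, 55150, 51109, 51109, 57170, 55150, 51109, 51109, 51109],
    "Id": ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
})


def missing_quarter_runs(frame):
    """Return one row per gap of missing quarters within each Id."""
    records = []
    for id_, grp in frame.groupby("Id"):
        quarters = sorted(pd.Period(q, freq="Q") for q in grp["year"])
        # Compare each quarter with the next one present for this Id.
        for prev, nxt in zip(quarters, quarters[1:]):
            gap = nxt.ordinal - prev.ordinal - 1
            if gap > 0:
                records.append({"Id": id_,
                                "start": str(prev + 1),
                                "end": str(nxt - 1),
                                "count": gap})
    return pd.DataFrame(records, columns=["Id", "start", "end", "count"])


result = missing_quarter_runs(df)
print(result)
```

The loop only looks at gaps between quarters that are present, so nothing is reported before the first or after the last quarter of an Id.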
I currently have the following sample dataframe:
No FlNo DATE Loc Type
20 1826 6/1/2017 AAA O
20 1112 6/4/2017 BBB O
20 1234 6/6/2017 CCC O
20 43 6/7/2017 DDD O
20 1840 6/8/2017 EEE O
I want to fill in the missing dates between consecutive rows. I also want to fill in the values of the non-date columns with the values from the row above, BUT leave the 'Type' column blank for the filled-in rows.
Please see desired output:
No FlNo DATE Loc Type
20 1826 6/1/2017 AAA O
20 1826 6/2/2017 AAA
20 1826 6/3/2017 AAA
20 1112 6/4/2017 BBB O
20 1112 6/5/2017 BBB
20 1234 6/6/2017 CCC O
20 43 6/7/2017 DDD O
20 1840 6/8/2017 EEE O
I have searched all around Google and stackoverflow but did not find any date fill in answers for pandas dataframe.
First, convert DATE to a datetime column using pd.to_datetime,
df.DATE = pd.to_datetime(df.DATE)
Option 1
Use resample + ffill, and then reset the Type column later. First, store the unique dates in some list:
dates = df.DATE.unique()
Now,
df = df.set_index('DATE').resample('1D').ffill().reset_index()
df.Type = df.Type.where(df.DATE.isin(dates), '')
df
DATE No FlNo Loc Type
0 2017-06-01 20 1826 AAA O
1 2017-06-02 20 1826 AAA
2 2017-06-03 20 1826 AAA
3 2017-06-04 20 1112 BBB O
4 2017-06-05 20 1112 BBB
5 2017-06-06 20 1234 CCC O
6 2017-06-07 20 43 DDD O
7 2017-06-08 20 1840 EEE O
If needed, you may bring DATE back to its original state;
df.DATE = df.DATE.dt.strftime('%m/%d/%Y')
Option 2
Another option would be asfreq + ffill + fillna:
df = df.set_index('DATE').asfreq('1D').reset_index()
c = df.columns.difference(['Type'])
df[c] = df[c].ffill()
df['Type'] = df['Type'].fillna('')
df
DATE No FlNo Loc Type
0 2017-06-01 20.0 1826.0 AAA O
1 2017-06-02 20.0 1826.0 AAA
2 2017-06-03 20.0 1826.0 AAA
3 2017-06-04 20.0 1112.0 BBB O
4 2017-06-05 20.0 1112.0 BBB
5 2017-06-06 20.0 1234.0 CCC O
6 2017-06-07 20.0 43.0 DDD O
7 2017-06-08 20.0 1840.0 EEE O
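If the float conversion in Option 2 is unwanted, one possible variation (a sketch, not part of the answer above) is to reindex against an explicit daily range with `method='ffill'`, which introduces no NaN and so keeps the integer dtypes. The names `full_range` and `out` are illustrative, and the sketch assumes DATE is unique and sorted in ascending order.

```python
import pandas as pd

df = pd.DataFrame({
    "No": [20, 20, 20, 20, 20],
    "FlNo": [1826, 1112, 1234, 43, 1840],
    "DATE": ["6/1/2017", "6/4/2017", "6/6/2017", "6/7/2017", "6/8/2017"],
    "Loc": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Type": ["O", "O", "O", "O", "O"],
})
df["DATE"] = pd.to_datetime(df["DATE"], format="%m/%d/%Y")

# Every calendar day between the first and last date.
full_range = pd.date_range(df["DATE"].min(), df["DATE"].max(), freq="D")

out = (df.set_index("DATE")
         .reindex(full_range, method="ffill")  # copy the previous row forward
         .rename_axis("DATE")
         .reset_index())

# Blank out Type on the rows that were not in the original data.
out["Type"] = out["Type"].where(out["DATE"].isin(df["DATE"]), "")

print(out)
```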
I have two dataframes and need to divide the values in one by the values in the other. Both contain string and numeric columns; the division should skip the string columns and operate only on the numeric ones.
DF1
Col1 Val11 Val12
0 A 1 9
1 B 3 1
2 C 5 4
3 D 1 3
4 E 7 6
DF2
Col2 Val21 Val22
0 A 20 19
1 B 35 11
2 C 46 42
3 D 31 53
4 E 28 55
I wrote the below line of code
df2.iloc['Percent'] = df1.iloc[4]/df2.iloc[4]
But I get the below error message.
TypeError: unsupported operand type(s) for /: 'str' and 'str'
Final DF should look like this
Col2 Val21 Val22
0 A 20 19
1 B 35 11
2 C 46 42
3 D 31 53
4 E 28 55
0.25 0.10
Thanks in advance for the support.
You need to move all string columns into the index with set_index and then divide:
df2 = df2.set_index('Col2')
df2.loc['Percent'] = df1.set_index('Col1').iloc[4].values / df2.iloc[4]
print (df2)
Val21 Val22
Col2
A 20.00 19.000000
B 35.00 11.000000
C 46.00 42.000000
D 31.00 53.000000
E 28.00 55.000000
Percent 0.25 0.109091
If there are multiple string columns, select only the numeric columns for the division:
df2.loc['Percent'] = df1[['Val11','Val12']].iloc[4].values / df2[['Val21','Val22']].iloc[4]
print (df2)
Col2 Val21 Val22
0 A 20.00 19.000000
1 B 35.00 11.000000
2 C 46.00 42.000000
3 D 31.00 53.000000
4 E 28.00 55.000000
Percent NaN 0.25 0.109091
More generic solution:
str_cols1 = ['Col1']
str_cols2 = ['Col2']
df2.loc['Percent'] = (df1.drop(str_cols1, axis=1).iloc[4].values /
                      df2.drop(str_cols2, axis=1).iloc[4])
print (df2)
Col2 Val21 Val22
0 A 20.00 19.000000
1 B 35.00 11.000000
2 C 46.00 42.000000
3 D 31.00 53.000000
4 E 28.00 55.000000
Percent NaN 0.25 0.109091
And better solution with select_dtypes:
df2.loc['Percent'] = (df1.select_dtypes(['number']).iloc[4].values /
                      df2.select_dtypes(['number']).iloc[4])
print (df2)
Col2 Val21 Val22
0 A 20.00 19.000000
1 B 35.00 11.000000
2 C 46.00 42.000000
3 D 31.00 53.000000
4 E 28.00 55.000000
Percent NaN 0.25 0.109091
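For reference, here is a self-contained sketch of the select_dtypes approach on the question's sample data. The intermediate names `numerator`, `denominator` and `percent` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": list("ABCDE"),
                    "Val11": [1, 3, 5, 1, 7],
                    "Val12": [9, 1, 4, 3, 6]})
df2 = pd.DataFrame({"Col2": list("ABCDE"),
                    "Val21": [20, 35, 46, 31, 28],
                    "Val22": [19, 11, 42, 53, 55]})

# Row 4 of each frame, numeric columns only. Converting the numerator
# to a NumPy array makes the division positional; otherwise pandas
# would align on the differing labels (Val11 vs Val21) and give NaN.
numerator = df1.select_dtypes("number").iloc[4].to_numpy()
denominator = df2.select_dtypes("number").iloc[4]
percent = numerator / denominator

# The new row aligns on column names; Col2 has no value and becomes NaN.
df2.loc["Percent"] = percent

print(df2)
```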
EDIT by comment:
Use to_numeric to replace non-numeric values with NaN:
df1_numeric = df1.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2_numeric = df2.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2.loc['Percent'] = df1_numeric.iloc[4].values / df2_numeric.iloc[4]
print (df2)
Col2 Val21 Val22
0 A 20.00 19
1 B 35.00 a
2 C 46.00 42
3 D 31.00 53
4 E 28.00 55
Percent NaN 0.25 0.109091
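Try this out (.values on the numerator avoids label alignment between the differently named columns, which would otherwise give NaN):
df2.loc['Percent'] = df1.iloc[4, 1:].values / df2.iloc[4, 1:]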