Rolling temporal window on a pandas dataframe by group - python

Consider this example dataframe (code for construction below):
              t    p
o
2007-01-01  0.0  1.0
2007-01-02  0.0  1.0
2007-01-03  0.0  1.0
2007-01-10  0.0  1.0
2007-01-11  0.0  1.0
2007-01-20  1.0  0.0
2007-01-21  1.0  0.0
2007-01-22  1.0  0.0
2007-01-23  1.0  0.0
2007-01-27  1.0  0.0
I would like a rolling sum over a two-day forward-looking window, computed separately for each group in 't'. To do this I implemented:
df.iloc[::-1].groupby('t').rolling(window='2D').sum()
However, this returns:
                    t    p
t   o
0.0 2007-01-11    0.0  1.0
    2007-01-10    0.0  2.0
    2007-01-03    0.0  3.0
    2007-01-02    0.0  4.0
    2007-01-01    0.0  5.0
1.0 2007-01-27    1.0  0.0
    2007-01-23    2.0  0.0
    2007-01-22    3.0  0.0
    2007-01-21    4.0  0.0
    2007-01-20    5.0  0.0
which is not a two-day rolling window sum. I believe the issue is that when I group by 't', the temporal information in 'o' is lost, since it is set as the dataframe's index.
Resampling the rows to constant one-day intervals per group will not work due to the size of my dataframe. I have also tried grouping by 't' and then 'o', but this does not work.
The solution I would like is:
              t    p
o
2007-01-01  0.0  2.0
2007-01-02  0.0  1.0
2007-01-03  0.0  0.0
2007-01-10  0.0  1.0
2007-01-11  0.0  0.0
2007-01-20  2.0  0.0
2007-01-21  2.0  0.0
2007-01-22  1.0  0.0
2007-01-23  0.0  0.0
2007-01-27  0.0  0.0
Supplementary code:
# code to construct the df used in this example
import numpy as np
import pandas as pd

o = ['2007-01-01','2007-01-02','2007-01-03','2007-01-10','2007-01-11',
     '2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-27']
t = np.zeros(10)
p = np.ones(10)
p[5:] = 0
t[5:] = 1
df = pd.DataFrame({'o': o, 't': t, 'p': p})
df['o'] = pd.to_datetime(df['o'], format='%Y-%m-%d')
df = df.set_index('o')

As a workaround (here for a two-day window):
def day_shift(x, days=2):
    # start from a frame of zeros aligned to the group's index
    ret = pd.DataFrame(0, index=x.index, columns=x.columns)
    # add the values dated 1..`days` days ahead onto the current row
    for day in range(-days, 0):
        ret = ret.add(x.shift(day, freq='D'), fill_value=0)
    return ret.reindex(x.index)

df.groupby('t', as_index=False).apply(day_shift, days=2)
Output:
              t    p
o
2007-01-01  0.0  2.0
2007-01-02  0.0  1.0
2007-01-03  0.0  0.0
2007-01-10  0.0  1.0
2007-01-11  0.0  0.0
2007-01-20  2.0  0.0
2007-01-21  2.0  0.0
2007-01-22  1.0  0.0
2007-01-23  0.0  0.0
2007-01-27  0.0  0.0
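The same idea generalizes by reversing time instead of looping over shift calls. The sketch below is mine rather than part of the original question (forward_sum is a made-up helper name): mapping each date d to anchor - d turns a backward-looking window over the mapped dates into a forward-looking one over the real dates. It assumes daily-resolution timestamps, since it shifts the result back one day to exclude the current row:
def forward_sum(g, window='2D'):
    # reverse time: later real dates become earlier synthetic dates
    anchor = pd.Timestamp('2100-01-01')
    base = pd.Timestamp('2000-01-01')
    rev = g.copy()
    rev.index = (anchor - g.index) + base      # reversed, still a DatetimeIndex
    res = rev.sort_index().rolling(window).sum()
    res.index = anchor - (res.index - base)    # map back to the real dates
    # rolling includes the current row; shifting the result back one day
    # yields the (d, d + 2 days] window used in the desired output
    res.index = res.index - pd.Timedelta('1D')
    return res.reindex(g.index, fill_value=0)

# note: recent pandas may exclude the grouping column 't' from g (include_groups)
df.groupby('t', group_keys=False).apply(forward_sum)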
Edit: Another way to exploit the rolling machinery is to reverse the date index; backward rolling over the reversed dates is forward rolling in terms of the original dates:
future_date = pd.to_datetime('2100-01-01')
ancient_date = pd.to_datetime('2000-01-01')

# instead of setting 'o' alone as the index, set ['o', 't'] as the index
df = df.set_index(['o', 't'])

# here comes the crazy code
(df
 .assign(r_dates=(future_date - df.index.get_level_values('o')) + ancient_date)  # reversed dates
 .sort_values('r_dates')
 .groupby('t')
 .rolling('2D', on='r_dates').sum()   # change 2 to the actual number of days
 .reset_index(level=0, drop=True)     # drop the extra index level added by groupby
 .assign(r_dates=lambda x: x.index.get_level_values('o') - pd.to_timedelta('1D'))
     # shift the dates back by one day, since rolling includes the current date
 .reset_index()
 .drop('o', axis=1)
 .set_index(['r_dates', 't'])
 .reindex(df.index, fill_value=0)
)
Output:
                  p
o          t
2007-01-01 0.0  2.0
2007-01-02 0.0  1.0
2007-01-03 0.0  0.0
2007-01-10 0.0  1.0
2007-01-11 0.0  0.0
2007-01-20 1.0  0.0
2007-01-21 1.0  0.0
2007-01-22 1.0  0.0
2007-01-23 1.0  0.0
2007-01-27 1.0  0.0

Related

Add columns in dataframe together if colnames are in list

I am trying to add together power plant hourly output data at different locations.
I have a series of the generators at each location
genLocations (a pd.Series):
MDN                SL1
HEN          WF34, SL2
OTA          WF26, SL3
HLY    WF16, WF27, SL4
i.e. locations are on the left and generators on the right.
I then need to add together the columns of another dataframe which contains the hourly output of different generators. I need to sum each column of generators to a single location.
gen (a pd.DataFrame):
WF1 WF2 WF3 WF4 WF5 ... SL15 SL16 SL17 SL18 SL19
2007_1_1_p1 9.0 0.0 6.0 8.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
2007_1_1_p2 8.0 0.0 7.0 8.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
2007_1_1_p3 0.0 8.0 7.0 8.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
2007_1_1_p4 4.0 0.0 6.0 8.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
2007_1_1_p5 0.0 0.0 7.0 8.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
My final output should look something like this
nodes (a pd.DataFrame):
MDN HEN OTA HLY ....
2007_1_1_p1 7.0 5.0 4.0 6.0 ....
2007_1_1_p2 0.0 0.0 7.0 8.0 ....
So far I have tried
for index, i in genLocations.iteritems():
    nodes[index] = gen[[i]].sum(axis='columns')
You can try splitting genLocations and explode:
s = genLocations.str.split(', ').explode()
d = {v: k for k, v in s.iteritems()}
nodes = gen.groupby(gen.columns.map(d), axis=1).sum()
Note: explode is available in Pandas 0.25+.
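As a self-contained sketch of that pipeline (the locations, generators, and values below are made up for illustration; the transposed groupby avoids the axis=1 form, which is deprecated in recent pandas):
import pandas as pd

genLocations = pd.Series({'MDN': 'SL1', 'HEN': 'WF34, SL2'})
gen = pd.DataFrame({'SL1': [1.0, 2.0], 'WF34': [3.0, 4.0], 'SL2': [5.0, 6.0]},
                   index=['2007_1_1_p1', '2007_1_1_p2'])

s = genLocations.str.split(', ').explode()  # one generator per row
d = {v: k for k, v in s.items()}            # generator -> location
nodes = gen.T.groupby(gen.columns.map(d)).sum().T
print(nodes)
#               HEN  MDN
# 2007_1_1_p1   8.0  1.0
# 2007_1_1_p2  10.0  2.0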

Slice multi-index pandas dataframe by date

Say I have the following multi-index dataframe:
arrays = [np.array(['bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo', 'foo']),
pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'])]
df = pd.DataFrame(np.zeros((8, 4)), index=arrays)
0 1 2 3
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
How do I select only the part of this dataframe where the first index level is 'bar' and the date is greater than 2020-01-02, so that I can add 1 to that part?
To be clearer, the expected output would be:
0 1 2 3
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 1.0 1.0 1.0 1.0
2020-01-04 1.0 1.0 1.0 1.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
I managed slicing it according to the first index:
df.loc['bar']
But then I am not able to apply the condition on the date.
It is possible to compare each index level separately and then set 1, using : to select all columns in DataFrame.loc:
m1 = df.index.get_level_values(0) =='bar'
m2 = df.index.get_level_values(1) > '2020-01-02'
df.loc[m1 & m2, :] = 1
print (df)
0 1 2 3
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 1.0 1.0 1.0 1.0
2020-01-04 1.0 1.0 1.0 1.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
# give your index levels names
df.index = df.index.set_names(["names", "dates"])
# get the indices that match the condition
index = df.query('names == "bar" and dates > "2020-01-02"').index
# assign 1 to the relevant rows
# IndexSlice makes slicing MultiIndexes easier; here, though, it might be overkill
idx = pd.IndexSlice
df.loc[idx[index], :] = 1
0 1 2 3
names dates
bar 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 1.0 1.0 1.0 1.0
2020-01-04 1.0 1.0 1.0 1.0
foo 2020-01-01 0.0 0.0 0.0 0.0
2020-01-02 0.0 0.0 0.0 0.0
2020-01-03 0.0 0.0 0.0 0.0
2020-01-04 0.0 0.0 0.0 0.0
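If the MultiIndex is sorted, a label-based slice is another option (a sketch; it relies on partial string indexing on the date level, and uses >= '2020-01-03', which is equivalent to > '2020-01-02' for daily data):
idx = pd.IndexSlice
df.loc[idx['bar', '2020-01-03':], :] = 1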

Pandas: extract values from column, according to value of another column, and separate into separate dataframes

I have a dataframe indexed by date, with columns of flood size (0-3), and precipitation (ppt):
Size ppt
date
2017-09-11 0.0 0.000000
2017-09-12 0.0 0.000000
2017-09-13 0.0 0.000000
2017-09-14 1.0 34.709998
2017-09-15 0.0 0.000000
2017-09-16 0.0 0.000000
2017-09-17 0.0 0.000000
2017-09-18 0.0 0.600000
2017-09-19 3.0 157.439998
I need to separate the data according to whether a flood occurred ('Size'=1,2 or 3), or no flood occurred ('Size'=0), to give me two separate sets of precipitation data associated with flood or no flood.
I appreciate this is probably quite basic, but I can't seem to find the right answers...
Thanks!
Use boolean indexing, inverting the boolean mask with ~:
mask = df['Size'].eq(0)
#alternative
#mask = df['Size'] == 0
df1 = df[~mask]
df2 = df[mask]
EDIT:
For multiple boolean masks use:
m1 = df['Size'].eq(0)
m2 = df['ppt'].eq(0)
#alternative
#m1 = df['Size'] == 0
#m2 = df['ppt'] == 0
SizeZero_PptZero = df[m1 & m2]
SizeZero_PptPos = df[m1 & ~m2]
SizePos = df[~m1]
print (SizeZero_PptZero)
            Size  ppt
date
2017-09-11   0.0  0.0
2017-09-12   0.0  0.0
2017-09-13   0.0  0.0
2017-09-15   0.0  0.0
2017-09-16   0.0  0.0
2017-09-17   0.0  0.0
print (SizeZero_PptPos)
            Size  ppt
date
2017-09-18   0.0  0.6
print (SizePos)
            Size         ppt
date
2017-09-14   1.0   34.709998
2017-09-19   3.0  157.439998
groupby
We can iterate through the groupby object after grouping by the boolean evaluation of Size being 0 or not. When we assign this to other names (df1, df2 = ...) the resulting iterable is split into its two parts.
df1, df2 = (d for _, d in df.groupby(df.Size.eq(0)))
Print them to see
print(df1, df2, sep='\n\n')
Size ppt
date
2017-09-14 1.0 34.709998
2017-09-19 3.0 157.439998
Size ppt
date
2017-09-11 0.0 0.0
2017-09-12 0.0 0.0
2017-09-13 0.0 0.0
2017-09-15 0.0 0.0
2017-09-16 0.0 0.0
2017-09-17 0.0 0.0
2017-09-18 0.0 0.6
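Note that the two-target unpacking above assumes both groups exist; if every Size were zero (or none were), it would raise a ValueError. A guarded sketch:
groups = {flag: g for flag, g in df.groupby(df.Size.eq(0))}
df_flood = groups.get(False)     # rows with Size > 0, or None if absent
df_no_flood = groups.get(True)   # rows with Size == 0, or None if absent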
For the purposes of explanation
for name, d in df.groupby(df.Size.eq(0)):
print(name, d, '=' * 40, sep='\n\n')
False
Size ppt
date
2017-09-14 1.0 34.709998
2017-09-19 3.0 157.439998
========================================
True
Size ppt
date
2017-09-11 0.0 0.0
2017-09-12 0.0 0.0
2017-09-13 0.0 0.0
2017-09-15 0.0 0.0
2017-09-16 0.0 0.0
2017-09-17 0.0 0.0
2017-09-18 0.0 0.6
========================================
You can create a dictionary of dataframes:
dfs = dict(tuple(df.groupby(np.where(df['Size'].eq(0), 'ppt_negative', 'ppt_positive'))))
The benefit of this approach is you are explicitly linking related data structures, which may aid subsequent manipulations, transportability, etc.
Result:
{'ppt_negative': date Size ppt
0 2017-09-11 0.0 0.0
1 2017-09-12 0.0 0.0
2 2017-09-13 0.0 0.0
4 2017-09-15 0.0 0.0
5 2017-09-16 0.0 0.0
6 2017-09-17 0.0 0.0
7 2017-09-18 0.0 0.6,
'ppt_positive': date Size ppt
3 2017-09-14 1.0 34.709998
8 2017-09-19 3.0 157.439998}
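Each piece is then available by its key, e.g.:
df_flood = dfs['ppt_positive']     # rows where a flood occurred
df_no_flood = dfs['ppt_negative']  # rows where no flood occurred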
More elaborate differentiation is possible via np.select:
m1 = df['Size'].eq(0)
m2 = df['ppt'].eq(0)
conds = [m1 & m2, m1 & ~m2, ~m1]
choices = ['SizeZero_PptZero', 'SizeZero_PptPos', 'SizePos']
dfs = dict(tuple(df.groupby(np.select(conds, choices))))
Result:
{'SizePos': date Size ppt
3 2017-09-14 1.0 34.709998
8 2017-09-19 3.0 157.439998,
'SizeZero_PptPos': date Size ppt
7 2017-09-18 0.0 0.6,
'SizeZero_PptZero': date Size ppt
0 2017-09-11 0.0 0.0
1 2017-09-12 0.0 0.0
2 2017-09-13 0.0 0.0
4 2017-09-15 0.0 0.0
5 2017-09-16 0.0 0.0
6 2017-09-17 0.0 0.0}

Finding the difference between rows of columns using shift

I've been coming here for almost two years now and have always been able to figure things out but I'm stumped now. Hopefully this is a quick answer.
https://github.com/MPhillips55/Capstone-Project-2---League-of-Legends/blob/master/EDA/test_case.csv
The link there is what my data looks like. 'min_0', 'min_1' and so on are gold values for League of Legends games at 1 minute intervals, that continue on to 'min_80'. The csv should be available to download.
I want to subtract the red values from the blue values and store that number on the blue rows for each minute.
Then I want to subtract the blue values from the red values and store that number on the red rows for each minute.
For clarity, I am only interested in the comparison for matching 'match_id's.
Here is an image of my desired output:
[image: desired output]
I think the right answer is likely something like this:
gold_df.loc[gold_df['red_or_blue_side'] == 'blue', :] = \
    BLUE_VALUES - BLUE_VALUES.shifted_down
gold_df.loc[gold_df['red_or_blue_side'] == 'red', :] = \
    RED_VALUES - RED_VALUES.shifted_up
I'm not clear on two things with that code. I need to select all the columns except the first two to calculate the differences. I also don't know how to select the values and the shifted values across all the relevant columns.
Thank you for the help. Please let me know if more information is needed.
-Mike
You could groupby match_id and then find the difference in each direction using .diff and then add the two components.
g = df.groupby('match_id', sort=False)[df.columns[2:]]
df = g.diff().fillna(0) + g.diff(-1).fillna(0)
df
min_0 min_1 min_2 min_3 min_4 min_5 min_6 min_7 min_8 min_9 \
0 0.0 15.0 46.0 -133.0 -60.0 -904.0 -505.0 -852.0 -763.0 -1224.0
1 0.0 -15.0 -46.0 133.0 60.0 904.0 505.0 852.0 763.0 1224.0
2 0.0 0.0 0.0 89.0 -92.0 -174.0 191.0 69.0 253.0 362.0
3 0.0 0.0 0.0 -89.0 92.0 174.0 -191.0 -69.0 -253.0 -362.0
4 0.0 0.0 17.0 -106.0 -136.0 400.0 363.0 829.0 1532.0 1862.0
5 0.0 0.0 -17.0 106.0 136.0 -400.0 -363.0 -829.0 -1532.0 -1862.0
... min_71 min_72 min_73 min_74 min_75 min_76 min_77 min_78 \
0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
min_79 min_80
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
5 0.0 0.0
To select all columns except the first two:
df[df.columns[2:]]
Or, equivalently:
df.iloc[:,2:]
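To see how the two diffs combine, here is a tiny made-up example (one match, blue row first; values are illustrative):
import pandas as pd

df = pd.DataFrame({'match_id': [1, 1],
                   'red_or_blue_side': ['blue', 'red'],
                   'min_0': [100, 90],
                   'min_1': [250, 300]})

g = df.groupby('match_id', sort=False)[df.columns[2:]]
out = g.diff().fillna(0) + g.diff(-1).fillna(0)
print(out)
#    min_0  min_1
# 0   10.0  -50.0   <- blue minus red
# 1  -10.0   50.0   <- red minus blue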

"Cannot reindex from a duplicate axis" when groupby.apply() on MultiIndex columns

I'm playing around with computing subtotals within a DataFrame that looks like this (note the MultiIndex):
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
I can successfully add the subtotals with the following code:
(
    df
    .groupby(level=0)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name), df.sum().to_frame('Total').T]
        )
    )
)
And it looks like this:
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
However, when I work with the transposed DataFrame, it does not work. The DataFrame looks like:
A B
1 2 1 2
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
And I use the following code:
(
    df2
    .groupby(level=0, axis=1)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name, axis=1), df.sum(axis=1).to_frame('Total')],
            axis=1
        )
    )
)
I have specified axis=1 everywhere I can think of, but I get an error:
ValueError: cannot reindex from a duplicate axis
I would expect the output to be:
A B
1 2 Total 1 2 Total
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0
Is this a bug? Or have I not specified the axis correctly everywhere? As a workaround, I can obviously transpose the DataFrame, produce the totals, and transpose back, but I'd like to know why it's not working here, and submit a bug report if necessary.
The problem DataFrame can be generated with:
df2 = pd.DataFrame(
    np.zeros([6, 4]),
    columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
)
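As a concrete version of the transpose workaround mentioned above (a sketch that simply reuses the row-wise code which does work):
result = (
    df2.T
    .groupby(level=0)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name), df.sum().to_frame('Total').T]
        )
    )
    .T
)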
