Using shift and rolling in pandas with groupBy - python

df = pd.DataFrame(dict(
list(
zip(["A", "B", "C"],
[np.array(["id %02d" % i for i in range(1, 11)]).repeat(10),
pd.date_range("2018-01-01", periods=100).strftime("%Y-%m-%d"),
[i for i in range(10, 110)]])
)
))
df = df.groupby(["A", "B"]).sum()
df["D"] = df["C"].shift(1).rolling(2).mean()
df
This code generates the following:
I want the rolling logic to start over for every new ID. Right now, ID 02 is using the last two values from ID 01 to calculate the mean.
How can this be achieved?

I believe you need groupby:
df['D'] = df["C"].shift(1).groupby(df['A'], group_keys=False).rolling(2).mean()
print (df.head(20))
C D
A B
id 01 2018-01-01 10 NaN
2018-01-02 11 NaN
2018-01-03 12 10.5
2018-01-04 13 11.5
2018-01-05 14 12.5
2018-01-06 15 13.5
2018-01-07 16 14.5
2018-01-08 17 15.5
2018-01-09 18 16.5
2018-01-10 19 17.5
id 02 2018-01-11 20 NaN
2018-01-12 21 19.5
2018-01-13 22 20.5
2018-01-14 23 21.5
2018-01-15 24 22.5
2018-01-16 25 23.5
2018-01-17 26 24.5
2018-01-18 27 25.5
2018-01-19 28 26.5
2018-01-20 29 27.5
Or:
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
print (df.head(20))
C D
A B
id 01 2018-01-01 10 NaN
2018-01-02 11 NaN
2018-01-03 12 10.5
2018-01-04 13 11.5
2018-01-05 14 12.5
2018-01-06 15 13.5
2018-01-07 16 14.5
2018-01-08 17 15.5
2018-01-09 18 16.5
2018-01-10 19 17.5
id 02 2018-01-11 20 NaN
2018-01-12 21 NaN
2018-01-13 22 20.5
2018-01-14 23 21.5
2018-01-15 24 22.5
2018-01-16 25 23.5
2018-01-17 26 24.5
2018-01-18 27 25.5
2018-01-19 28 26.5
2018-01-20 29 27.5

While the accepted answer by #jezrael works correctly for positive shifts, it gives incorrect result (partially) for negative shifts. Please check the following
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
df['E'] = df["C"].groupby(df['A']).rolling(2).mean().shift(1).values
df['F'] = df["C"].groupby(df['A']).shift(-1).rolling(2).mean()
df['G'] = df["C"].groupby(df['A']).rolling(2).mean().shift(-1).values
df.set_index(['A', 'B'], inplace=True)
print(df.head(20))
C D E F G
A B
id 01 2018-01-01 10 NaN NaN NaN 10.5
2018-01-02 11 NaN NaN 11.5 11.5
2018-01-03 12 10.5 10.5 12.5 12.5
2018-01-04 13 11.5 11.5 13.5 13.5
2018-01-05 14 12.5 12.5 14.5 14.5
2018-01-06 15 13.5 13.5 15.5 15.5
2018-01-07 16 14.5 14.5 16.5 16.5
2018-01-08 17 15.5 15.5 17.5 17.5
2018-01-09 18 16.5 16.5 18.5 18.5
2018-01-10 19 17.5 17.5 NaN NaN
id 02 2018-01-11 20 NaN 18.5 NaN 20.5
2018-01-12 21 NaN NaN 21.5 21.5
2018-01-13 22 20.5 20.5 22.5 22.5
2018-01-14 23 21.5 21.5 23.5 23.5
2018-01-15 24 22.5 22.5 24.5 24.5
2018-01-16 25 23.5 23.5 25.5 25.5
2018-01-17 26 24.5 24.5 26.5 26.5
2018-01-18 27 25.5 25.5 27.5 27.5
2018-01-19 28 26.5 26.5 28.5 28.5
2018-01-20 29 27.5 27.5 NaN NaN
Note that columns D and E are computed for .shift(1) and columns F and G are computed for .shift(-1). Column E is incorrect, since the first value of id 02 uses last two values of id 01. Column F is incorrect since first values are NaNs for both id 01 and id 02. Columns D and G give correct results. So, the full answer should be like this. If shift period is non-negative, use the following
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
If shift period is negative, use the following
df['G'] = df["C"].groupby(df['A']).rolling(2).mean().shift(-1).values
Hope it helps!

Related

How to sort multiindex column month names?

I have this multiindex df:
YEARS_TMAX TMAX YEARS_TMAX TMAX YEARS_TMAX
MONTH April April August August December .....
CODE NAME
000130 RICA PLAYA 21.0 31.5 21.0 21.5 22.0
000132 PUERTO PIZARRO 12.0 33.8 12.0 32.4 11.0
000134 PAPAYAL 23.0 33.2 22.0 22.4 21.0
000135 EL SALTO 22.0 33.6 23.0 22.8 22.0
000136 CAÑAVERAL 16.0 32.7 15.0 33.1 11.0
... ... ... ... ...
158317 SUSAPAYA 19.0 17.6 19.0 17.3 21.0
158321 PALCA 16.0 19.3 17.0 19.8 16.0
158323 TALABAYA 12.0 17.6 13.0 17.5 13.0
158326 CAPAZO 17.0 13.6 17.0 13.0 19.0
158328 PAUCARANI 14.0 13.3 13.0 11.9 15.0
I want to sort columns by month name (and TMAX columns first) like this:
TMAX YEARS_TMAX TMAX YEARS_TMAX TMAX
MONTH January January February February March .....
CODE NAME
000130 RICA PLAYA 22.0 31.5 23.0 27.5 23.0
000132 PUERTO PIZARRO 17.0 32.8 18.0 30.4 18.0
000134 PAPAYAL 25.0 32.2 26.0 28.4 25.0
000135 EL SALTO 26.0 31.6 26.0 26.8 26.0
000136 CAÑAVERAL 16.0 32.7 18.0 31.1 15.0
... ... ... ... ...
158317 SUSAPAYA 19.0 17.6 19.0 17.3 21.0
158321 PALCA 16.0 19.3 17.0 19.8 16.0
158323 TALABAYA 12.0 17.6 13.0 17.5 13.0
158326 CAPAZO 17.0 13.6 17.0 13.0 19.0
158328 PAUCARANI 14.0 13.3 13.0 11.9 15.0
So i wrote this code:
source: Sort "Date" in Multi-Index
dates = pd.to_datetime(df.columns.get_level_values(1), format='%B')
df.columns = [df.columns.get_level_values(0), dates]
df = df.sort_index(axis=1, level=1)
To sort columns by month but dates is not creating month names, dates is creating random dates.
How can i solve this?
Thanks in advance.
Use a CategoricalDtype by creating an ordered dtype from calendar.month_name this will ensure the correct ordering by sort.
month_dtype = pd.CategoricalDtype(categories=list(month_name), ordered=True)
df.columns = [df.columns.get_level_values(0),
df.columns.get_level_values(1).astype(month_dtype)]
df = df.sort_index(axis=1, level=[1, 0])
Sample Data and Imports:
from calendar import month_name
import pandas as pd
df = pd.DataFrame(
[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]],
columns=pd.MultiIndex.from_product([
['YEARS_TMAX', 'TMAX'],
['March', 'January', 'February']
])
)
df before sort:
YEARS_TMAX TMAX
March January February March January February
0 1 2 3 4 5 6
1 7 8 9 10 11 12
df after sort:
TMAX YEARS_TMAX TMAX YEARS_TMAX TMAX YEARS_TMAX
January January February February March March
0 5 2 6 3 4 1
1 11 8 12 9 10 7
The datetime approach would also work, but converting back to strings would be necessary with DatetimeIndex.strftime:
df.columns = [df.columns.get_level_values(0),
pd.to_datetime(df.columns.get_level_values(1), format='%B')]
df = df.sort_index(axis=1, level=[1, 0])
# convert back to strings
df.columns = [df.columns.get_level_values(0),
df.columns.get_level_values(1).strftime('%B')]
df:
TMAX YEARS_TMAX TMAX YEARS_TMAX TMAX YEARS_TMAX
January January February February March March
0 5 2 6 3 4 1
1 11 8 12 9 10 7
The drawback of this approach is level 1 is once again a string type which would need to be converted any time ordering needed to be changed as lexicographic ordering is not expected.

Create well structured pandas dataframe using dataframe

I have a Panda DataFreme data from 2018 to 2020. I want to structure these data as follows.
Month | 2018 | 2019
Jan 115 73
Feb 112 63
....
up to December.
How can I solve this issue using panda data frame syntax?
Date
2018-01-01 115.0
2018-02-01 112.0
2018-03-01 104.5
2018-04-01 91.1
2018-05-01 85.5
2018-06-01 76.5
2018-07-01 86.5
2018-08-01 77.9
2018-09-01 65.0
2018-10-01 71.0
2018-11-01 76.0
2018-12-01 72.5
2019-01-01 73.0
2019-02-01 63.0
2019-03-01 63.0
2019-04-01 61.0
2019-05-01 58.3
2019-06-01 59.0
2019-07-01 67.0
2019-08-01 64.0
2019-09-01 59.9
2019-10-01 70.4
2019-11-01 78.9
2019-12-01 75.0
2020-01-01 73.9
Name: Close, dtype: float64
This is more like pivot but with crosstab
s = pd.crosstab(df.index.strftime('%b'),df.index.year,df.values,aggfunc='sum')
Out[87]:
col_0 2018 2019 2020
row_0
Apr 91.1 61.0 NaN
Aug 77.9 64.0 NaN
Dec 72.5 75.0 NaN
Feb 112.0 63.0 NaN
Jan 115.0 73.0 73.9
Jul 86.5 67.0 NaN
Jun 76.5 59.0 NaN
Mar 104.5 63.0 NaN
May 85.5 58.3 NaN
Nov 76.0 78.9 NaN
Oct 71.0 70.4 NaN
Sep 65.0 59.9 NaN
You can use groupby and unstack:
(s.groupby([s.index.month, s.index.year]).first().unstack()
.rename_axis(columns='Year',index='Month')
)
Output:
Year 2018 2019 2020
Month
1 115.0 73.0 73.9
2 112.0 63.0 NaN
3 104.5 63.0 NaN
4 91.1 61.0 NaN
5 85.5 58.3 NaN
6 76.5 59.0 NaN
7 86.5 67.0 NaN
8 77.9 64.0 NaN
9 65.0 59.9 NaN
10 71.0 70.4 NaN
11 76.0 78.9 NaN
12 72.5 75.0 NaN

Pandas/Python: interpolation of multiple columns based on values specified for one reference column

df
Out[1]:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
0 978.0 345 17.0 16.5 97 12.22 0 0 292.0 326.8 294.1
1 977.0 354 17.8 16.7 93 12.39 1 0 292.9 328.3 295.1
2 970.0 416 23.4 15.4 61 11.47 4 2 299.1 332.9 301.2
3 963.0 479 24.0 14.0 54 10.54 8 3 300.4 331.6 302.3
4 948.7 610 23.0 13.4 55 10.28 15 6 300.7 331.2 302.5
5 925.0 830 21.4 12.4 56 9.87 20 5 301.2 330.6 303.0
6 916.0 914 20.7 11.7 56 9.51 20 4 301.3 329.7 303.0
7 884.0 1219 18.2 9.2 56 8.31 60 4 301.8 326.7 303.3
8 853.1 1524 15.7 6.7 55 7.24 35 3 302.2 324.1 303.5
9 850.0 1555 15.4 6.4 55 7.14 20 2 302.3 323.9 303.6
10 822.8 1829 13.3 5.6 60 6.98 300 4 302.9 324.0 304.1
How do I interpolate the values of all the columns on specified PRES (pressure) values at say PRES=[950, 900, 875]? Is there an elegant pandas type of way to do this?
The only way I can think of doing this is to first start with making empty NaN values for the entire row for each specified PRES values in a loop, then set PRES as index and then use the pandas native interpolate option:
df.interpolate(method='index', inplace=True)
Is there a more elegant solution?
Use your solution with no loop - reindex by union original index values with PRES list, but working only if all values are unique:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = df.reindex(df.index.union(PRES)).sort_index(ascending=False).interpolate(method='index')
print (df)
HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
978.0 345.0 17.0 16.5 97.0 12.22 0.0 0.0 292.0 326.8 294.1
977.0 354.0 17.8 16.7 93.0 12.39 1.0 0.0 292.9 328.3 295.1
970.0 416.0 23.4 15.4 61.0 11.47 4.0 2.0 299.1 332.9 301.2
963.0 479.0 24.0 14.0 54.0 10.54 8.0 3.0 300.4 331.6 302.3
950.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
948.7 610.0 23.0 13.4 55.0 10.28 15.0 6.0 300.7 331.2 302.5
925.0 830.0 21.4 12.4 56.0 9.87 20.0 5.0 301.2 330.6 303.0
916.0 914.0 20.7 11.7 56.0 9.51 20.0 4.0 301.3 329.7 303.0
900.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
884.0 1219.0 18.2 9.2 56.0 8.31 60.0 4.0 301.8 326.7 303.3
875.0 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
853.1 1524.0 15.7 6.7 55.0 7.24 35.0 3.0 302.2 324.1 303.5
850.0 1555.0 15.4 6.4 55.0 7.14 20.0 2.0 302.3 323.9 303.6
822.8 1829.0 13.3 5.6 60.0 6.98 300.0 4.0 302.9 324.0 304.1
If possible not unique values in PRES column, then use concat with sort_index:
PRES=[950, 900, 875]
df = df.set_index('PRES')
df = (pd.concat([df, pd.DataFrame(index=PRES)])
.sort_index(ascending=False)
.interpolate(method='index'))

How to assign a values to dataframe's column by comparing values in another dataframe

I have two data frames. One has rows for every five minutes in a day:
df
TIMESTAMP TEMP
1 2011-06-01 00:05:00 24.5
200 2011-06-01 16:40:00 32.0
1000 2011-06-04 11:20:00 30.2
5000 2011-06-18 08:40:00 28.4
10000 2011-07-05 17:20:00 39.4
15000 2011-07-23 02:00:00 29.3
20000 2011-08-09 10:40:00 29.5
30656 2011-09-15 10:40:00 13.8
I have another dataframe that ranks the days
ranked
TEMP DATE RANK
62 43.3 2011-08-02 1.0
63 43.1 2011-08-03 2.0
65 43.1 2011-08-05 3.0
38 43.0 2011-07-09 4.0
66 42.8 2011-08-06 5.0
64 42.5 2011-08-04 6.0
84 42.2 2011-08-24 7.0
56 42.1 2011-07-27 8.0
61 42.1 2011-08-01 9.0
68 42.0 2011-08-08 10.0
Both the columns TIMESTAMP and DATE are datetime datatypes (dtype returns dtype('M8[ns]').
What I want to be able to do is add a column to the dataframe df and then put the rank of the row based on the TIMESTAMP and corresponding day's rank from ranked (so in a day all the 5 minute timesteps will have the same rank).
So, the final result would look something like this:
df
TIMESTAMP TEMP RANK
1 2011-06-01 00:05:00 24.5 98.0
200 2011-06-01 16:40:00 32.0 98.0
1000 2011-06-04 11:20:00 30.2 96.0
5000 2011-06-18 08:40:00 28.4 50.0
10000 2011-07-05 17:20:00 39.4 9.0
15000 2011-07-23 02:00:00 29.3 45.0
20000 2011-08-09 10:40:00 29.5 40.0
30656 2011-09-15 10:40:00 13.8 100.0
What I have done so far:
# Separate the date and times.
df['DATE'] = df['YYYYMMDDHHmm'].dt.normalize()
df['TIME'] = df['YYYYMMDDHHmm'].dt.time
df = df[['DATE', 'TIME', 'TAIR']]
df['RANK'] = 0
for index, row in df.iterrows():
df.loc[index, 'RANK'] = ranked[ranked['DATE']==row['DATE']]['RANK'].values
But I think I am going in a very wrong direction because this takes ages to complete.
How do I improve this code?
IIUC, you can play with indexes to match the values
df = df.set_index(df.TIMESTAMP.dt.date)\
.assign(RANK=ranked.set_index('DATE').RANK)\
.set_index(df.index)

df.plot.density() returns an empty plot despite values actually being there

I currently have this data set
WSPD GST WD WVHT DPD APD MWD BAR ATMP WTMP
Date
2005-06-06 03:00:00 8.2 9.8 86 NaN NaN NaN 77 1011.1 28.8 29
2005-06-06 04:00:00 9.4 11.2 96 NaN NaN NaN 79 1011.6 29 29
2005-06-06 05:00:00 9.4 10.9 103 NaN NaN NaN 78 1011.6 29 28.9
2005-06-06 06:00:00 7.5 9 114 NaN NaN NaN 84 1011.4 27.7 28.9
2005-06-06 07:00:00 7 10.4 118 NaN NaN NaN 85 1011.1 25.4 28.9
I am attempting to do a probability density plot for the column with WVHT. However, when I type:
df['WVHT'].plot.density()
I receive an empty plot despite the column actually having values. Please note this is just a sample of the data.

Categories