Maximum Monthly Values whilst retaining the Date on which those values occurred - python

I have daily rainfall data that looks like the following:
Date Rainfall (mm)
1922-01-01 0.0
1922-01-02 0.0
1922-01-03 0.0
1922-01-04 0.0
1922-01-05 31.5
1922-01-06 0.0
1922-01-07 0.0
1922-01-08 0.0
1922-01-09 0.0
1922-01-10 0.0
1922-01-11 0.0
1922-01-12 9.1
1922-01-13 6.4
.
.
.
I am trying to work out the maximum value for each month for each year, and also what date the maximum value occurred on. I have been using the code:
rain_data.groupby(pd.Grouper(freq = 'M'))['Rainfall (mm)'].max()
This returns the correct maximum values, but it gives the end date of each month rather than the date on which the maximum event occurred.
1974-11-30 0.0
1974-12-31 0.0
1975-01-31 0.0
1975-02-28 65.0
1975-03-31 129.5
1975-11-30 59.9
1975-12-31 7.1
1976-01-31 10.0
1976-11-30 0.0
1976-12-31 0.0
1977-01-31 4.3
Any suggestions on how I could get the correct date?

I'm new to this, but I think pd.Grouper(freq = 'M') groups all the values within each month and labels every group with the month-end date. I think this is why your groupby isn't returning the dates you're looking for.
I think your question is answered here. Alexander suggests using:
df.groupby(pd.TimeGrouper('M')).Close.agg({'max date': 'idxmax', 'max rainfall': np.max})
The agg works without the Close, I think, so if it's problematic (as I found) you might want to leave it out.
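On newer pandas versions, where pd.TimeGrouper has been removed and dict-renaming inside agg is no longer supported, a roughly equivalent sketch using named aggregation might look like this (with a small made-up sample in place of your full series):
import pandas as pd

# Small sample shaped like the question's data (dates as the index).
rain_data = pd.DataFrame(
    {'Rainfall (mm)': [0.0, 31.5, 9.1, 6.4, 12.0, 2.3]},
    index=pd.to_datetime(['1922-01-01', '1922-01-05', '1922-01-12',
                          '1922-01-13', '1922-02-03', '1922-02-20']),
)

# Named aggregation keeps both the monthly maximum and the date it fell on.
monthly_max = rain_data.groupby(pd.Grouper(freq='M'))['Rainfall (mm)'].agg(
    max_date='idxmax',      # index label (date) of the monthly maximum
    max_rainfall='max',     # the maximum value itself
)
print(monthly_max)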

Related

Retrieve next row in pandas dataframe / multiple list comprehension outputs

I have a Pandas dataframe, wt, with a datetime index and three columns, as well as a dataframe t with the same datetime index and three other columns, shown below:
wt
date 0 1 2
2004-11-19 0.2 0.3 0.5
2004-11-22 0.0 0.0 0.0
2004-11-23 0.0 0.0 0.0
2004-11-24 0.0 0.0 0.0
2004-11-26 0.0 0.0 0.0
2004-11-29 0.0 0.0 0.0
2004-11-30 0.0 0.0 0.0
t
date GLD SPY TLT
2004-11-19 0.009013068949977443 -0.011116725618999457 -0.007980218051028332
2004-11-22 0.0037963376507370583 0.004769204564810003 0.005211874008610895
2004-11-23 -0.00444938820912133 0.0015256823190370472 0.0012398557258792575
2004-11-24 0.006703910614525022 0.0023696682464455776 0.0
2004-11-26 0.005327413984461682 -0.0007598784194529085 -0.00652932567826181
2004-11-29 0.002428792227864962 -0.004562737642585524 -0.010651558073654366
2004-11-30 -0.006167400881057272 0.0006790595025889523 -0.004237773450922022
2004-12-01 0.005762411347517871 0.011366528119433505 -0.0015527950310557648
I'm currently using the Pandas iterrows method to run through each row for processing, and as a first step I check whether the row entries are non-zero, as below:
for dt, row in t.iterrows():
    if sum(wt.loc[dt]) <= 0:
        ...
Based on this, I'd like to assign values to dataframe wt if non-zero values don't currently exist. How can I retrieve the next row for a given dt entry (e.g., '11/22/2004' for dt = '11/19/2004')?
Part 2
As an addendum, I'm setting this up using a for loop for testing but would like to use list comprehension once complete. Processing will return the wt dataframe described above, as well as an intermediate, secondary dataframe again with datetime index and a single column (sample below):
r
date r
2004-11-19 0.030202
2004-11-22 -0.01047
2004-11-23 0.002456
2004-11-24 -0.01274
2004-11-26 0.00928
Is there a way to use list comprehensions to return both the above wt and this r dataframes without simply creating two separate comprehensions?
Edit
I was able to get the desired results by changing my approach, so I'm adding it here for clarification (the referenced dataframes are as described above). I wonder if there's any way to apply list comprehensions to this.
r = pd.DataFrame(columns=['ret'], index=wt.index.copy())
dts = wt.reset_index().date
for i, dt in enumerate(dts):
    row = t.loc[dt]
    dt_1 = dts.shift(-1).iloc[i]  # the next date in the index
    try:
        wt.loc[dt_1] = ((wt.loc[dt].tolist() * (1 + row)).transpose()
                        / np.dot(wt.loc[dt].tolist(), (1 + row))).tolist()
        r.loc[dt] = np.dot(wt.loc[dt], row)
    except Exception:
        print(f'Error calculating for date {dt}')
        continue
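For the "next row" part specifically, a hedged alternative (my own sketch, not from the post) is a positional lookup on the index, assuming it is sorted and unique:
import pandas as pd

# Tiny illustrative frame shaped like wt.
wt = pd.DataFrame(
    {0: [0.2, 0.0, 0.0], 1: [0.3, 0.0, 0.0], 2: [0.5, 0.0, 0.0]},
    index=pd.to_datetime(['2004-11-19', '2004-11-22', '2004-11-23']),
)

# Find the label's position, then take the row after it (if there is one).
pos = wt.index.get_loc(pd.Timestamp('2004-11-19'))
next_row = wt.iloc[pos + 1] if pos + 1 < len(wt) else None
print(next_row)  # the 2004-11-22 row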

Count All Occurrences of a Specific Value in a Dask Dataframe

I have a dask dataframe with thousands of columns and rows as follows:
pprint(daskdf.head())
grid lat lon ... 2014-12-29 2014-12-30 2014-12-31
0 0 48.125 -124.625 ... 0.0 0.0 -17.034216
1 0 48.625 -124.625 ... 0.0 0.0 -19.904214
4 0 42.375 -124.375 ... 0.0 0.0 -8.380443
5 0 42.625 -124.375 ... 0.0 0.0 -8.796803
6 0 42.875 -124.375 ... 0.0 0.0 -7.683688
I want to count all occurrences in the entire dataframe where a certain value appears. In pandas, this can be done as follows:
pddf[pddf==500].count().sum()
I'm aware that you can't translate all pandas functions/syntax with dask, but how would I do this with a dask dataframe? I tried doing:
daskdf[daskdf==500].count().sum().compute()
but this yielded a "Not Implemented" error.
As in many cases where a row-wise pandas method is not yet explicitly implemented in dask, you can use map_partitions. In this case it might look like:
daskdf.map_partitions(lambda df: df[df==500].count()).sum().compute()
You can experiment with whether also doing a .sum() within the lambda helps (it would produce smaller intermediates), and with what the meta= argument to map_partitions should look like.
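A minimal, self-contained sketch of that pattern with illustrative data (not the asker's):
import dask.dataframe as dd
import pandas as pd

# Build a tiny dask dataframe just to demonstrate the pattern.
pdf = pd.DataFrame({'a': [500, 1, 500], 'b': [2, 500, 3]})
daskdf = dd.from_pandas(pdf, npartitions=2)

# Per partition: count cells equal to the target in each column, then sum everything.
target = 500
total = daskdf.map_partitions(lambda df: df[df == target].count()).sum().compute()
print(total)  # 3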

Selecting one date at a time from a resampled dataframe

I have a dataframe of tick data, which I have resampled into minute data. Doing a vanilla
df.resample('1Min').ohlc().fillna(method='ffill')
is super easy.
I now need to iterate over that resampled dataframe one day at a time, but I can't figure out the best way to do it.
I've tried taking my 1-minute resampled dataframe and then resampling that to '1D', then converting the result to a list to iterate over and filter, but that gives me a list of:
Timestamp('2011-09-13 00:00:00', freq='D')
objects, and it won't let me slice the dataframe based on those.
This seems like it would be easy, but I just can't find the answer. Thanks!
#sample data_1m dataframe
data_1m.head()
open high low close
timestamp
2011-09-13 13:53:00 5.8 6.0 5.8 6.0
2011-09-13 13:54:00 5.8 6.0 5.8 6.0
2011-09-13 13:55:00 5.8 6.0 5.8 6.0
2011-09-13 13:56:00 5.8 6.0 5.8 6.0
2011-09-13 13:57:00 5.8 6.0 5.8 6.0
...
# I want to get everything for date 2011-09-13; I'm trying
days_in_df = data_1m.resample('1D').ohlc().fillna(method='ffill').index.to_list()
data_1m.loc[days_in_df[0]]
KeyError: Timestamp('2011-09-13 00:00:00', freq='D')
Here's my two cents. I don't resample the data so much as add another index level to the frame:
data_1m = data_1m.reset_index()
data_1m['date'] = data_1m['timestamp'].dt.normalize()  # date-only part of each timestamp
data_1m = data_1m.set_index(['date', 'timestamp'])
And to select an entire day:
data_1m.loc['2011-09-13']
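Another hedged option (my own sketch, not part of the answer) is to skip the re-indexing and group the original 1-minute frame, with its plain DatetimeIndex, by the normalized date:
import pandas as pd

# Tiny illustrative frame shaped like data_1m (1-minute bars, DatetimeIndex).
idx = pd.to_datetime(['2011-09-13 13:53', '2011-09-13 13:54', '2011-09-14 09:30'])
bars = pd.DataFrame({'open': [5.8, 5.8, 6.0], 'high': [6.0, 6.0, 6.1],
                     'low': [5.8, 5.8, 6.0], 'close': [6.0, 6.0, 6.1]}, index=idx)

# Iterate one calendar day at a time by grouping on the normalized date.
for day, daily_bars in bars.groupby(bars.index.normalize()):
    print(day.date(), len(daily_bars))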

Calculating date difference for pandas dataframe rows with changing baseline dates

Hi, I am using the date difference as a machine learning feature, analyzing how the weight of a patient changed over time.
I successfully tested a method to do that, as shown below, but the question is how to extend this to a dataframe where I have to compute the date difference for each patient, as shown in the figure above. The encircled column is what I'm aiming to get. Basically, the baseline date from which the date difference is calculated changes for each new patient name, so that we can track the weight progress over time for that patient. Thanks!
s='17/6/2016'
s1='22/6/16'
a=pd.to_datetime(s,infer_datetime_format=True)
b=pd.to_datetime(s1,infer_datetime_format=True)
e=b.date()-a.date()
str(e)
str(e)[0:2]
I think it would be something like this (but I'm not sure how to do it exactly):
def f(row):
    # some logic here
    return val
df['Datediff'] = df.apply(f, axis=1)
You can use transform with 'first':
df['Datediff'] = df['Date'] - df.groupby('Name')['Date'].transform('first')
Another solution is to use cumsum:
df['Datediff'] = df.groupby('Name')['Date'].apply(lambda x:x.diff().cumsum().fillna(0))
df["Datediff"] = df.groupby("Name")["Date"].diff().fillna(0)/ np.timedelta64(1, 'D')
df["Datediff"]
0 0.0
1 12.0
2 14.0
3 66.0
4 23.0
5 0.0
6 10.0
7 15.0
8 14.0
9 0.0
10 14.0
Name: Datediff, dtype: float64
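For reference, here is a minimal, self-contained sketch of the transform('first') approach, using made-up patient names and dates (not the asker's data):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob'],
    'Date': pd.to_datetime(['2016-06-17', '2016-06-22', '2016-07-01',
                            '2016-06-20', '2016-06-25']),
})

# Days elapsed since each patient's first (baseline) recorded date.
df['Datediff'] = (df['Date'] - df.groupby('Name')['Date'].transform('first')).dt.days
print(df)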

Replace NaN or missing values with rolling mean or other interpolation

I have a pandas dataframe with monthly data that I want to compute a 12-month moving average for. Data for every month of January is missing, however (NaN), so I am using
pd.rolling_mean(data["variable"], 12, center=True)
but it just gives me all NaN values.
Is there a simple way that I can ignore the NaN values? I understand that in practice this would become an 11-month moving average.
The dataframe has other variables which do have January data, so I don't want to just throw out the January columns and do an 11-month moving average.
There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.
import numpy as np
import pandas as pd

df = pd.DataFrame({'month': [10, 11, 12, 1, 2, 3],
                   'temp': [65, 50, 45, np.nan, 40, 43]}).set_index('month')
You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in #user394430's answer.)
df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3'] = df['temp'].rolling( 3,center=True,min_periods=1).mean()
Those are improvements but still have the problem of overwriting existing values with rolling means. To avoid this you could combine with the update() method (see documentation here).
df['update'] = df['rollmean3']
df['update'].update( df['temp'] ) # note: this is an inplace operation
There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.
df['ffill'] = df['temp'].ffill() # previous month
df['bfill'] = df['temp'].bfill() # next month
df['interp'] = df['temp'].interpolate() # mean of prev/next
In this case, interpolate() defaults to simple linear interpolation, but you have several other interpolation options as well. See the pandas documentation on interpolate for more info, or this Stack Overflow question:
Interpolation on DataFrame in pandas
Here is the sample data with all the results:
temp rollmean12 rollmean3 update ffill bfill interp
month
10 65.0 48.6 57.500000 65.0 65.0 65.0 65.0
11 50.0 48.6 53.333333 50.0 50.0 50.0 50.0
12 45.0 48.6 47.500000 45.0 45.0 45.0 45.0
1 NaN 48.6 42.500000 42.5 45.0 40.0 42.5
2 40.0 48.6 41.500000 40.0 40.0 40.0 40.0
3 43.0 48.6 41.500000 43.0 43.0 43.0 43.0
In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.
The real key is having min_periods=1. Also, as of version 0.18, the proper way to call this is on a Rolling object. Therefore, your code should be
data["variable"].rolling(min_periods=1, center=True, window=12).mean()
