How to do a rolling groupby using a MultiIndex - python

I have a MultiIndex series. One of the index levels is the day, and I am trying to iterate through it and pull out the data in a day range. I have looked into using plain rolling with a time window given as a string, but that returns a result of the same length as the input, whereas I only need one result per unique date in the index.
This is my current code, is there a simpler way to do this:
import numpy as np
import pandas as pd

idx = pd.IndexSlice
result = {}
for date in df.index.levels[2]:  # this goes through all of the days
    pre_date = date - np.timedelta64(window, 'D')  # find the date `window` days ago
    cur_df = df.loc[idx[:, :, pre_date:date], :]  # get all data in that window-day range
    result[date] = f(cur_df)
result = pd.Series(result)
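One possible simplification (a sketch, not from the original post): build each window with a boolean mask on the date level instead of pd.IndexSlice. The mask below is exclusive on the left edge, and f, window and the level position are assumed to be as above.
dates = df.index.get_level_values(2)
result = pd.Series({
    d: f(df[(dates > d - pd.Timedelta(days=window)) & (dates <= d)])
    for d in dates.unique()
})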

Related

Resampling a time series

I have a 40-year time series in the format stn;yyyymmddhh;rainfall, where yyyy = year, mm = month, dd = day, hh = hour. The series is at an hourly resolution. I extracted the maximum value for each year with the following groupby method:
import pandas as pd

df = pd.read_csv('data.txt', delimiter=";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now I am trying to extract the maximum value for a 3-hour duration in each year. I tried the sliding-maxima approach below, but it is not working; k is the duration I am interested in. In simple words, I need the maximum precipitation sum for multiple durations in every year (e.g. 3 h, 6 h, etc.).
import numpy as np

class AMS:
    def sliding_max(self, k, data):
        tp = data.values
        period = 24 * 365
        agg_values = []
        start_j = 1
        end_j = k * int(np.floor(period / k))
        for j in range(start_j, end_j + 1):
            start_i = j - 1
            end_i = j + k + 1
            agg_values.append(np.nansum(tp[start_i:end_i]))
        self.sliding_max = max(agg_values)
        return self.sliding_max
Any suggestions or improvements to my code, or is there a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert the column containing the datetimes to a Series of datetime type. You can do that by parsing it with the format of your datetimes.
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%m%d%H")
Once you have the correct data type, set that column as your index; you can then use pandas' time-series functionality (resampling, in your case).
First resample the data to 3-hour windows and sum the values. From that, resample to yearly data and take the maximum over all 3-hour windows in each year.
df.set_index("yyyymmddhh")[["rainfall"]].resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
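If you actually need overlapping (sliding) 3-hour windows rather than fixed 3-hour bins, a rolling sum is closer to the sliding-maxima idea from the question. A minimal sketch, assuming the datetime parsing above has already been done:
s = df.set_index("yyyymmddhh")["rainfall"]
sliding_3h = s.rolling("3H").sum()           # sum over every 3-hour window ending at each hour
annual_max = sliding_3h.resample("Y").max()  # yearly maximum of those sliding sums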

How to group dates by day and find min and max values in a pandas data frame or python

There are two event columns: dtb (start time) and dte (stop time).
In the image there are two columns; I want to group the values by day, taking min(time) as the start of the event on that day and max(time) as the stop of the event on that day. I want it like this.
I will do my best to answer it as I understood it.
Supposing your columns dtb and dte are in datetime format:
import numpy as np

df['date'] = df.dtb.dt.date
df['dtb'] = df.dtb.dt.time
df['dte'] = df.dte.dt.time
result = df.groupby('date').agg({'dtb': np.min,
                                 'dte': np.max})
print(result)
What I did is create a new column with the date, reformat the dtb and dte columns to keep only the time, and then group by the date, taking the min of dtb and the max of dte.
You can also group directly per day (or even per week) using the following syntax:
dg_bydate = df.groupby(pd.Grouper(key='dtb', freq='1D')).agg({'dte': [np.min, np.max]})
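If you prefer flat column names, named aggregation (pandas 0.25+) can do the same thing; a minimal sketch, assuming dtb and dte are still datetime columns:
daily = (df.groupby(df['dtb'].dt.date)
           .agg(start=('dtb', 'min'), stop=('dte', 'max')))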

lagging parameters in pandas?

I'm new to pandas. I have a DataFrame of TTI values sorted by hour of day over many years. I want to add a new column showing last year's TTI value for each row. I wrote this code:
import pandas as pd

tti = pd.read_csv("c:\\users\\Mehrdad\\desktop\\Hourly_TTI.csv")
tti['new_date'] = pd.to_datetime(tti['Date'])
tti['last_year'] = tti['TTI'].shift(1, freq='1-Jan-2009')  # this is the part I can't get right
print(tti.head(10))
but I don't know how to define the frequency value for shift so that it shifts my data to one year behind my first date, which is 01-01-2010.
# shift each date back one year (DateOffset handles leap years)
tti['last_year'] = tti['new_date'] - pd.DateOffset(years=1)
# look up last year's TTI by the shifted date (assumes the dates are unique)
tti['last_year_TTI'] = tti.set_index('new_date')['TTI'].reindex(tti['last_year']).values
df.shift can only move the data by a fixed distance.
Use an offset to create a new datetime index and retrieve the values using that new index. Be aware that the dates in the first year have nothing to look up, so you have to truncate (or accept missing values for) the first year.

Dataset statistics with a custom beginning of the year

I would like to compute some annual statistics (a cumulative sum) on a daily time series in an xarray dataset. The tricky part is that the day on which my year begins must be flexible, and the time series contains leap years.
I tried e.g. the following:
import numpy as np
import pandas as pd
import xarray as xr

rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so the get an offset of one day per leap year and
(2) the beginning of the first year (until end of June) is appended to the end of the rolled time series, which creates some "fake year" where the cumulative sums doesn't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use an xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()
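As an illustrative sanity check (not part of the original answer), you can count how many days fall into each custom "year"; apart from the partial first and last years, each group should contain 365 or 366 days:
print(my_years.to_series().value_counts().sort_index())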

Mapping Values in a pandas DataFrame column?

I am trying to filter some data and seem to be running into errors.
Below is a replica of the code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the maximum of the Start Date column of election_data. I would like to keep the rows in which the difference between the max and the row's Start Date is less than or equal to 5 days.
I have tried using for loops and various combinations of list comprehensions.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work; however, Python 3 gives me the following:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
Series objects do not have a days attribute, only TimedeltaIndex objects do. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
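For completeness, the map-based attempt from the question also works in Python 3 if the map object is materialized into a list first (this is an alternative, not part of the answer above):
mask = list(map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]))
filtered = election_data[mask]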
If I understand your question correctly, you just want to filter your data to the rows whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing can easily handle, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data  # your frame
last_day = max(election_data["Start Date"])
date = ...  # your date within 5 days of the last day
new_df = election_data.loc[election_data["Start Date"] >= date]
Or if you just want the Start Date column post-filtering:
last_day = max(election_data["Start Date"])
date = ...  # your date within 5 days of the last day
filtered_dates = election_data.loc[election_data["Start Date"] >= date, "Start Date"]
Note that your date variable needs to be in the same format as Start Date (possibly YYYYmmdd?). If you don't know what this variable should be, just print(last_day) and count 5 days back.
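One way to compute that cut-off date programmatically (an illustration, assuming Start Date is already a datetime column):
date = last_day - pd.Timedelta(days=5)
new_df = election_data.loc[election_data["Start Date"] >= date]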
