pandas dataframe group year index by decade - python

Suppose I have a dataframe with the index as a monthly timestep. I know I can use dataframe.groupby(lambda x: x.year) to group monthly data into yearly and apply other operations. Is there some way I could quickly group them, say, by decade?
Thanks for any hints.

To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
             A
2001-01-31   2
2001-02-28   7
2001-03-31  12
2001-04-30  17
2001-05-31  22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
         A
2001   354
2002  1074
2003  1794
2004  2514
2005  3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
           A
2000   29106
2010  100740
2020  172740
2030  244740
2040   77424
If you don't have something on which you can use .year, you could still group with lambda x: (x.year//10)*10.
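For example, with the df above, the lambda form (applied to each index label) gives the same decade totals:
>>> df.groupby(lambda ts: (ts.year//10)*10).sum()
           A
2000   29106
2010  100740
2020  172740
2030  244740
2040   77424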

If your DataFrame has columns, say ['Population', 'Salary', 'Vehicle count'], first make 'Year' your index:
dataframe = dataframe.set_index('Year')
Then use the code below to resample the data into 10-year bins; it also gives you the sum of all the other columns within each decade:
dataframe = dataframe.resample('10AS').sum()
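Note that resample needs a datetime-like index, so if 'Year' is stored as plain integers you would first convert it (a sketch, assuming a column of integer years):
# hypothetical integer 'Year' column; convert to datetimes before resampling
dataframe['Year'] = pd.to_datetime(dataframe['Year'].astype(str), format='%Y')
dataframe = dataframe.set_index('Year').resample('10AS').sum()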

Use the year attribute of the index:
df.groupby(df.index.year)

Let's say your date column goes by the name Date; then you can group like this:
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: the iloc[:, 0] here chooses the first column in your dataframe (iloc and .count() are the modern replacements for the deprecated .ix indexer and resample's how='count' keyword).
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Related

A simple way of selecting the previous row in a column and performing an operation?

I'm trying to create a forecast which takes the previous day's 'Forecast' total and adds it to the current day's 'Appt'. Something which is straightforward in Excel but I'm struggling in pandas. At the moment all I can get in pandas using .loc is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
              'Appt': [12, 10, 5, 4, 13],
              'Forecast': [37, 0, 0, 0, 0]})
What I'm looking for it to do is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
              'Appt': [12, 10, 5, 4, 13],
              'Forecast': [37, 47, 52, 56, 69]})
E.g. the 'Forecast' total on the 1st December is 37. On the 2nd December the value in the 'Appt' column is 10. I want it to take 37 + 10, put this in the 'Forecast' column for the 2nd December, and then iterate over the rest of the column.
I've tried using .loc with the index, and experimented with .shift(), but neither seems to work for what I'd like. I also looked into .rolling() but I think that's not appropriate either.
I'm sure there must be a simple way to do this?
Apologies, the original df has 'Date' as a datetime column.
You can use mask to replace the zero placeholders with that day's 'Appt' value, then take a cumulative sum with cumsum:
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
# or
df['Forecast'] = np.where(df['Forecast'].eq(0), df['Appt'], df['Forecast']).cumsum()
Output:
         Date  Appt  Forecast
0  2022-12-01    12        37
1  2022-12-02    10        47
2  2022-12-03     5        52
3  2022-12-04     4        56
4  2022-12-05    13        69
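To see why this works, here is the intermediate series before the cumulative sum (illustrative output):
>>> df['Forecast'].mask(df['Forecast'].eq(0), df['Appt'])
0    37
1    10
2     5
3     4
4    13
Name: Forecast, dtype: int64
cumsum then turns [37, 10, 5, 4, 13] into the running total [37, 47, 52, 56, 69].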
Make sure that your column has a datetime/date type; then you can filter the df like this:
# previous code & imports
from datetime import datetime, timedelta
yesterday = datetime.now().date() - timedelta(days=1)
df[df["date"] == yesterday]["your_column"].sum()

Find max by year and return date on which max occurred in Pandas with dates as index

I have this dataframe
date,AA
1980-01-01, 77.7
1980-01-02, 86
1980-01-03, 92.3
1980-01-04, 96.4
1980-01-05, 85.7
1980-01-06, 75.7
1980-01-07, 86.8
1980-01-08, 93.2
1985-08-13, 224.6
1985-08-14, 213.9
1985-08-15, 205.7
1985-08-16, 207.3
1985-08-17, 202.1
I would like to compute the max for each year and the date on which it occurs. I am struggling because I would like to keep the date as the index.
Indeed, I read it in as:
dfr = pd.read_csv(fnamed, sep=',', header = 0, index_col=0, parse_dates=True)
I know that I could resample as
dfr_D = dfr.resample('Y').max()
but in this case I would lose the information about the location of the maximum value within the year.
I have found this:
idx = dfr.groupby(lambda x: dfr['date'][x].year)["A"].idxmax()
However, dfr['date'] refers to a column, while in my case date is the index, and '.year' does not seem to be one of its properties.
I have the feeling that I should work with groupby and idxmax. However, all the attempts I made failed.
Thanks in advance
Assuming "date" is of datetime type and a column, you can use the following to slice your data with the maximum per group:
df.loc[df.groupby(df['date'].dt.year)['AA'].idxmax().values]
output:
        date     AA
3 1980-01-04   96.4
8 1985-08-13  224.6
If "date" is the index:
df.loc[df.groupby(df.index.year)['AA'].idxmax().values]
output:
              AA
date
1980-01-04   96.4
1985-08-13  224.6
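For reference, the intermediate groupby(...).idxmax() result is a Series of dates keyed by year, which is why passing its .values to .loc returns the original rows (illustrative output):
>>> dfr.groupby(dfr.index.year)['AA'].idxmax()
date
1980   1980-01-04
1985   1985-08-13
Name: AA, dtype: datetime64[ns]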

Pandas replace NaN values with zeros after pivot operation

I have a DataFrame like this:
Year  Month  Day  Rain (mm)
2021      1    1         15
2021      1    2        NaN
2021      1    3         12
And so on (there are multiple years). I have used the pivot_table function to convert the DataFrame into this:
Year        2021  2020  2019  2018  2017
Month Day
1     1       15
      2      NaN
      3       12
I used:
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first')
Now I would like to replace all NaN values, and also possible -1 values, with zeros in every column (by columns I mean years), but I have not been able to do so. I have tried:
df = df.fillna(0)
And also:
df.loc[df['Rain (mm)'] == NaN, 'Rain (mm)'] = 0
But neither works: no error message or exception, the dataframe just remains unchanged. What am I doing wrong? Any advice is highly appreciated.
I think the problem is that the NaN values are strings, so they cannot be replaced; first try converting the values to numeric:
df['Rain (mm)'] = pd.to_numeric(df['Rain (mm)'], errors='coerce')
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first').fillna(0)
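The question also asks about replacing possible -1 values; assuming -1 is just another missing-data marker in your file, a simple follow-up would be:
df = df.replace(-1, 0)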

Merging two dataframes with different structure using pandas

I need to merge data from one dataframe onto another.
The main dataframe consists of survey answers with a year, month, and region variable.
The data I need to merge onto this is the weather data for that specific month. This data is stored in my second dataframe for weather stations, with a year variable, a temperature average variable for each month (e.g. value1, value2, ..., value12), and a region variable.
I've tried to merge the two dataframes on region and year, and my plan was then afterwards to select the average temperature variable which coincides with the survey.
df1
---------------------------
year  month     regions
2002  january   Pais Vasco
2002  february  Pais Vasco
2003  march     Pais Vasco
2002  november  Florida
2003  december  Florida
 ...       ...         ...
---------------------------
df2
-----------------------------------------------
year  value1  value2  ...  value12  regions
2002      10      11  ...        9  Pais Vasco
2003      11      11  ...       10  Pais Vasco
2004      12      11  ...       10  Pais Vasco
2002      11      11  ...        9  Florida
2003      10      11  ...        9  Florida
-----------------------------------------------
So in this example, for my first survey observation I need to get the corresponding temperature (value1) for the region Pais Vasco and year 2002.
When I tried to merge with
df_merged = pd.merge(df1, df2, how = "left", on =["regions", "year"])
I just get a dataframe with way more observations than my original survey dataframe.
I'd convert this data to tidy format. Assuming value1, value2, etc. correspond to the value for month 1, month 2, and so on, use pd.wide_to_long to turn it into long (tidy) format, then merge:
tidy = pd.wide_to_long(df2, stubnames=['value'], i=['year', 'region'], j='month', sep='') \
    .reset_index()
You need to normalise your months so that both dataframes represent them the same way, e.g. as month numbers; one option is sketched at the end of this answer.
Then,
df1.merge(tidy, on=['year', 'month', 'region'], how='left', validate='1:1')
If this raises an error, then you have multiple observations for the same ['year', 'month', 'region'] key. Fix that by dropping duplicates. How you do so is almost certainly based heavily on your data.
sobek noticed that you have a typo, saying 'regions' rather than 'region' in your merge command. Make sure you're referring to columns that actually exist.
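For completeness, a minimal sketch of the month normalisation plus the final merge (assuming df1 stores lowercase English month names, and that both frames use a matching 'region' column per the typo note above):
# hypothetical mapping from month names to the numbers used by wide_to_long
month_map = {'january': 1, 'february': 2, 'march': 3, 'april': 4,
             'may': 5, 'june': 6, 'july': 7, 'august': 8,
             'september': 9, 'october': 10, 'november': 11, 'december': 12}
df1['month'] = df1['month'].map(month_map)
df_merged = df1.merge(tidy, on=['year', 'month', 'region'], how='left', validate='1:1')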

Map dataframe column value by another column's value

My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column like a string and split at the dot.
Then apply a function that takes the number after the dot and turns it into a new year value if possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        value += int(entry[-1])   # entries like 'Apr' have no suffix -> ValueError
    finally:
        # the return inside finally swallows the ValueError, leaving 2015
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
   Month  Year
0    Apr  2015
1  Apr.1  2016
2  Apr.2  2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1],errors='coerce').fillna(0) + 2015
# Sample DataFrame from Mike Muller's answer above
   Month  Year     new
0    Apr  2015  2015.0
1  Apr.1  2016  2016.0
2  Apr.2  2017  2017.0
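If you'd rather have plain integer years than the floats produced by fillna, you could cast the result:
df['new'] = df['new'].astype(int)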
