Find annual average of pandas dataframe with date column

Find annual average of pandas dataframe with date column - python

id vi dates f_id
0 5532714 0.549501 2015-07-07 ff_22
1 5532715 0.540969 2015-07-08 ff_22
2 5532716 0.531477 2015-07-09 ff_22
3 5532717 0.521029 2015-07-10 ff_22
4 5532718 0.509694 2015-07-11 ff_22
In the dataframe above, I want to find average yearly value for each year. This does not work:
df.groupby(df.dates.year)['vi'].transform(mean)
I get this error: *** AttributeError: 'Series' object has no attribute 'year'
How to fix this?

Let's make sure that dates is datetime dtype, then use the .dt accessor as .dt.year:
df['dates'] = pd.to_datetime(df.dates)
df.groupby(df.dates.dt.year)['vi'].transform('mean')
Output:
0 0.530534
1 0.530534
2 0.530534
3 0.530534
4 0.530534
Name: vi, dtype: float64

Updating and completing #piRsquared's example below for recent versions of pandas (e.g. v1.1.0), using the Grouper function instead of TimeGrouper which was deprecated:
import pandas as pd
import numpy as np
tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)
df.groupby(pd.Grouper(freq='1Y')).mean()

You can also use pd.TimeGrouper with the frequency A
Consider the dataframe df consisting of four years of daily data
tidx = pd.date_range('2010-01-01', '2013-12-31', name='dates')
np.random.seed([3,1415])
df = pd.DataFrame(dict(vi=np.random.rand(tidx.size)), tidx)
df.groupby(pd.TimeGrouper('A')).mean()
vi
dates
2010-12-31 0.465121
2011-12-31 0.511640
2012-12-31 0.491363
2013-12-31 0.516614

Related

Cannot use Pandas pct_change with date

I have a data frame:
date value
0 2017-11-30 13:58:57 901
1 2017-11-30 13:59:41 905
2 2017-11-30 13:59:41 925
That was generated by:
import pandas as pd
df = pd.DataFrame.from_items( [('date', ['2017-11-30 13:58:57', '2017-11-30 13:59:41', '2017-11-30 13:59:41']),("value", [901, 905, 925])])
df['date'] = pd.to_datetime(df['date'])
I want to calculate the percentage change between two consecutive rows, but when I use:
df.pct_change()
I get the error:
ufunc true_divide cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')
How do I make it ignore the date column?

How do I make it ignore the date column?
Here's a solution with select_dtypes that should generalise to any dataframe by ignoring non-numeric columns -
df.select_dtypes(include=['number']).pct_change()
value
0 NaN
1 0.004440
2 0.022099

I’d try specifying the value column.
df[‘pctcng’]=df[‘value’].pct_change()

Difference of a date and int column in Pandas

I have a DataFrame that looks like this:
raw_data = {'SeriesDate':['2017-03-10','2017-03-13','2017-03-14','2017-03-15'],'Test':['1','2','3','4']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['SeriesDate','Test'])
df['SeriesDate'] = pd.to_datetime(df['SeriesDate'])
I want to subtract the number field from the date field to arrive at a date, however when I do this:
df['TestDate'] = df['SeriesDate'] - pd.Series((dd) for dd in df['Test'])
I get the following TypeError:
TypeError: incompatible type [object] for a datetime/timedelta operation
Any idea how I can workaround this?

You can use to_timedelta:
df['TestDate'] = df['SeriesDate'] - pd.to_timedelta(df['Test'].astype(int), unit="d")
print(df)
SeriesDate Test TestDate
0 2017-03-10 1 2017-03-09
1 2017-03-13 2 2017-03-11
2 2017-03-14 3 2017-03-11
3 2017-03-15 4 2017-03-11

python pandas extract year from datetime: df['year'] = df['date'].year is not working

I import a dataframe via read_csv, but for some reason can't extract the year or month from the series df['date'], trying that gives AttributeError: 'Series' object has no attribute 'year':
date Count
6/30/2010 525
7/30/2010 136
8/31/2010 125
9/30/2010 84
10/29/2010 4469
df = pd.read_csv('sample_data.csv', parse_dates=True)
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].year
df['month'] = df['date'].month
UPDATE:
and when I try solutions with df['date'].dt on my pandas version 0.14.1, I get "AttributeError: 'Series' object has no attribute 'dt' ":
df = pd.read_csv('sample_data.csv',parse_dates=True)
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
Sorry for this question that seems repetitive - I expect the answer will make me feel like a bonehead... but I have not had any luck using answers to the similar questions on SO.
FOLLOWUP: I can't seem to update my pandas 0.14.1 to a newer release in my Anaconda environment, each of the attempts below generates an invalid syntax error. I'm using Python 3.4.1 64bit.
conda update pandas
conda install pandas==0.15.2
conda install -f pandas
Any ideas?

If you're running a recent-ish version of pandas then you can use the datetime accessor dt to access the datetime components:
In [6]:
df['date'] = pd.to_datetime(df['date'])
df['year'], df['month'] = df['date'].dt.year, df['date'].dt.month
df
Out[6]:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
EDIT
It looks like you're running an older version of pandas in which case the following would work:
In [18]:
df['date'] = pd.to_datetime(df['date'])
df['year'], df['month'] = df['date'].apply(lambda x: x.year), df['date'].apply(lambda x: x.month)
df
Out[18]:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
Regarding why it didn't parse this into a datetime in read_csv you need to pass the ordinal position of your column ([0]) because when True it tries to parse columns [1,2,3] see the docs
In [20]:
t="""date Count
6/30/2010 525
7/30/2010 136
8/31/2010 125
9/30/2010 84
10/29/2010 4469"""
df = pd.read_csv(io.StringIO(t), sep='\s+', parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
date 5 non-null datetime64[ns]
Count 5 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 120.0 bytes
So if you pass param parse_dates=[0] to read_csv there shouldn't be any need to call to_datetime on the 'date' column after loading.

This works:
df['date'].dt.year
Now:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
gives this data frame:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10

When to use dt accessor
A common source of confusion revolves around when to use .year and when to use .dt.year.
The former is an attribute for pd.DatetimeIndex objects; the latter for pd.Series objects. Consider this dataframe:
df = pd.DataFrame({'Dates': pd.to_datetime(['2018-01-01', '2018-10-20', '2018-12-25'])},
index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03']))
The definition of the series and index look similar, but the pd.DataFrame constructor converts them to different types:
type(df.index) # pandas.tseries.index.DatetimeIndex
type(df['Dates']) # pandas.core.series.Series
The DatetimeIndex object has a direct year attribute, while the Series object must use the dt accessor. Similarly for month:
df.index.month # array([1, 1, 1])
df['Dates'].dt.month.values # array([ 1, 10, 12], dtype=int64)
A subtle but important difference worth noting is that df.index.month gives a NumPy array, while df['Dates'].dt.month gives a Pandas series. Above, we use pd.Series.values to extract the NumPy array representation.

Probably already too late to answer but since you have already parse the dates while loading the data, you can just do this to get the day
df['date'] = pd.DatetimeIndex(df['date']).year

What worked for me was upgrading pandas to latest version:
From Command Line do:
conda update pandas

Pandas - converting datetime64 to Period

I have a Series of dates in datetime64 format.
I want to convert them to a series of Period with a monthly frequency. (Essentially, I want to group dates into months for analytical purposes).
There must be a way of doing this - I just cannot find it quickly.
Note: these dates are not the index of the data frame - they are just a column of data in the data frame.
Example input data (as a Series)
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-12-01']))
print (data)
My current kludge/work around looks like
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.DatetimeIndex(data).to_period('M')
data = pd.Series(data.year).astype('str') + '-' + pd.Series((data.month).astype('int')).map('{:0>2d}'.format)
data = data.where(data != '2262-04', other='No Date')
print (data)

Their are some issues currently (even in master) dealing with NaT in PeriodIndex, so your approach won't work like that. But seems that you simply want to resample; so do this. You can of course specify a function for how if you want.
In [57]: data
Out[57]:
0 2014-10-01
1 2014-10-01
2 2014-10-31
3 2014-11-15
4 2014-11-30
5 NaT
6 2014-12-01
dtype: datetime64[ns]
In [58]: df = DataFrame(dict(A = data, B = np.arange(len(data))))
In [59]: df.dropna(how='any',subset=['A']).set_index('A').resample('M',how='count')
Out[59]:
B
A
2014-10-31 3
2014-11-30 2
2014-12-31 1

import pandas as pd
import numpy as np
datetime import datetime
data = pd.to_datetime(
pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data=pd.Series(['{}-{:02d}'.format(x.year,x.month) if isinstance(x, datetime) else "Nat" for x in pd.DatetimeIndex(data).to_pydatetime()])
0 2014-10
1 2014-10
2 2014-10
3 2014-11
4 2014-11
5 Nat
6 2014-01
dtype: object
Best I could come up with, if the only non datetimes objects possible are floats you can change if isinstance(x, datetime) to if not isinstance(x, float)

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns 'entry_date' and 'dob' which i have converted to Timestamps using pd.to_timestamp.I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob' and to do this i need to get the difference in days between the two columns ( so that i can then do somehting like round(days/365.25). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob i get the following :
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.

Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64

You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns], coming in 0.12)

Not sure if you still need it, but in Pandas 0.14 i usually use .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help

Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.

To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Find annual average of pandas dataframe with date column - python

Let's make sure that dates is datetime dtype, then use the .dt accessor as .dt.year: df['dates'] = pd.to_datetime(df.dates) df.groupby(df.dates.dt.year)['vi'].transform('mean') Output: 0 0.530534 1 0.530534 2 0.530534 3 0.530534 4 0.530534 Name: vi, dtype: float64

Related

Cannot use Pandas pct_change with date

Difference of a date and int column in Pandas

python pandas extract year from datetime: df['year'] = df['date'].year is not working

Pandas - converting datetime64 to Period

Pandas Timedelta in Days

Categories

Resources