Correcting dates with apply function pandas - python

I have a DataFrame with dates in the following format:
12/31/2000 20:00 (month/day/year hours:minutes)
The issue is that there are some dates that are wrong in the data set, for instance:
10/12/2003 24:00 should be 10/13/2003 00:00
This is what I get when I run dfUFO[wrongFormat] (the offending rows; shown as a screenshot in the original post).
So I have the following code in a pandas notebook to reformat these dates:
def convert2400ToTimestamp(x):
    date = pd.to_datetime(x.datetime.split(" ")[0], format='%m/%d/%Y')
    return date + pd.Timedelta(days=1)
wrongFormat = dfUFO.datetime.str.endswith("24:00", na=False)
dfUFO[wrongFormat] = dfUFO[wrongFormat].apply(convert2400ToTimestamp, axis=1)
This code results in
ValueError: Must have equal len keys and value when setting with an iterable
I don't really get what this error means. Something I'm missing?
EDIT: Changed to
dfUFO.loc[wrongFormat, 'datetime'] = dfUFO[wrongFormat].apply(convert2400ToTimestamp, axis=1)
But datetime now shows values like 1160611200000000000 for date 10/11/2006
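(A likely explanation, not from the original thread: the big integers are nanosecond epoch values, which show up when Timestamps land in a column pandas still treats as object dtype. A minimal sketch of one way around it, assuming a hypothetical dfUFO like the one described:)
import pandas as pd

# Hypothetical reproduction of the question's frame
dfUFO = pd.DataFrame({'datetime': ['10/11/2006 24:00', '10/12/2003 20:30']})
wrongFormat = dfUFO.datetime.str.endswith('24:00', na=False)

# Parse the well-formed rows directly...
parsed = pd.to_datetime(dfUFO.loc[~wrongFormat, 'datetime'],
                        format='%m/%d/%Y %H:%M')
# ...and roll the 24:00 rows over to midnight of the next day
rolled = (pd.to_datetime(dfUFO.loc[wrongFormat, 'datetime'].str.split(' ').str[0],
                         format='%m/%d/%Y')
          + pd.Timedelta(days=1))

# Rebuilding the column in one piece keeps it datetime64 throughout
dfUFO['datetime'] = pd.concat([parsed, rolled]).sort_index()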

You can parse your datetime column to "correctly named" parts and use pd.to_datetime():
Source DF:
In [14]: df
Out[14]:
datetime
388 10/11/2006 24:00:00
693 10/1/2001 24:00:00
111 10/1/2001 23:59:59
Vectorized solution:
In [11]: pat = r'(?P<month>\d{1,2})\/(?P<day>\d{1,2})\/(?P<year>\d{4}) (?P<hour>\d{1,2})\:(?P<minute>\d{1,2})\:(?P<second>\d{1,2})'
In [12]: df.datetime.str.extract(pat, expand=True)
Out[12]:
month day year hour minute second
388 10 11 2006 24 00 00
693 10 1 2001 24 00 00
111 10 1 2001 23 59 59
In [13]: pd.to_datetime(df.datetime.str.extract(pat, expand=True))
Out[13]:
388 2006-10-12 00:00:00
693 2001-10-02 00:00:00
111 2001-10-01 23:59:59
dtype: datetime64[ns]
from docs:
Assembling a datetime from multiple columns of a DataFrame. The keys
can be common abbreviations like:
['year', 'month', 'day', 'minute', 'second','ms', 'us', 'ns']
or plurals of the same
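Putting the answer's pieces together, a self-contained sketch (index values and column name copied from the example frame above):
import pandas as pd

df = pd.DataFrame({'datetime': ['10/11/2006 24:00:00',
                                '10/1/2001 24:00:00',
                                '10/1/2001 23:59:59']},
                  index=[388, 693, 111])

pat = (r'(?P<month>\d{1,2})\/(?P<day>\d{1,2})\/(?P<year>\d{4}) '
       r'(?P<hour>\d{1,2})\:(?P<minute>\d{1,2})\:(?P<second>\d{1,2})')

# extract() yields one column per named group; to_datetime() assembles
# the parts and carries hour 24 over into the next day
df['datetime'] = pd.to_datetime(df['datetime'].str.extract(pat, expand=True))
print(df)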

Related

How to split DateTime into Year and Month while summing values?

I have a dataframe whose index is a Date column of DateTime type, with a value attached to each entry.
The dates are in yyyy-mm-dd format, with each row being the next day.
Example:
Date: x:
2012-01-01 44
2012-01-02 75
2012-01-03 62
How would I split the Date column into Year and Month columns, using those two as indexes while also summing the values of all the days in a month?
Example of expected output:
Year: Month: x:
2012 1 745
2 402
3 453
...
2013 1 4353
Use Series.dt.year and Series.dt.month, aggregate with GroupBy.sum, and rename to set the new column names:
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby([df['Date'].dt.year.rename('Year'),
                  df['Date'].dt.month.rename('Month')])['x'].sum().reset_index()
print(df1)
Year Month x
0 2012 1 181
Use groupby and sum:
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
   .sum())
Year Month
2012 1 181
Name: x, dtype: int64
Note that if "Date" isn't a datetime dtype column, use
df.Date = pd.to_datetime(df.Date, errors='coerce')
to convert it first.
(df.groupby([df.Date.dt.year.rename('Year'), df.Date.dt.month.rename('Month')])['x']
   .sum()
   .reset_index())
Year Month x
0 2012 1 181
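An alternative not shown in the original answers: group by a monthly Period and split it back out afterwards (a sketch assuming the same df):
# Group by calendar month via a Period, then split into Year/Month columns
monthly = df.groupby(df['Date'].dt.to_period('M'))['x'].sum()
out = monthly.reset_index()
out['Year'] = out['Date'].dt.year
out['Month'] = out['Date'].dt.month
print(out[['Year', 'Month', 'x']])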

Iterate over pd df with date column by week python

I have a one month DataFrame with a datetime object column and a bunch of functions I want to apply to it - by week. So I want to loop over the DataFrame and apply the functions to each week. How do I iterate over weekly time periods?
My DataFrame looks like the one produced by this random-data snippet:
np.random.seed(123)
n = 500
df = pd.DataFrame(
    {'date': pd.to_datetime(
        pd.DataFrame({'year': np.random.choice(range(2017, 2019), size=n),
                      'month': np.random.choice(range(1, 2), size=n),
                      'day': np.random.choice(range(1, 28), size=n)})
    )}
)
df['random_num'] = np.random.choice(range(0, 1000), size=n)
My week length is inconsistent (sometimes I have 1,000 tweets per week, sometimes 100,000). Could someone please give me an example of how to loop over this dataframe by week? (I don't need aggregation or groupby functions.)
If you really don't want to use groupby and aggregations then:
for week in df['date'].dt.week.unique():  # note: .dt.week is deprecated in newer pandas; use .dt.isocalendar().week
    this_weeks_data = df[df['date'].dt.week == week]
This will, of course, go wrong if you have data from more than one year.
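A version that stays safe across year boundaries (a sketch, not from the original answer; Series.dt.isocalendar() needs pandas >= 1.1):
# Pair the ISO year with the ISO week so that week 1 of 2018 never
# collides with week 1 of 2017
iso = df['date'].dt.isocalendar()
for (year, week), this_weeks_data in df.groupby([iso['year'], iso['week']]):
    pass  # apply your per-week functions to this_weeks_data here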
Given your sample dataframe
date random_num
0 2017-01-01 214
1 2018-01-19 655
2 2017-01-24 663
3 2017-01-26 723
4 2017-01-01 974
First, you can try to set the index to datetime object as follows
df.set_index(df.date, inplace=True)
df.drop('date', axis=1, inplace=True)
This sets the index to the date column and drops the original column. You will get
>>> df.head()
            random_num
date
2017-01-01         214
2018-01-19         655
2017-01-24         663
2017-01-26         723
2017-01-01         974
Then you can use the pandas groupby function to group the data as per your frequency and apply any function of your choice.
# To group by week and count the number of occurrences
>>> df.groupby(pd.Grouper(freq='W')).count().head()
            random_num
date
2017-01-01          11
2017-01-08          65
2017-01-15          55
2017-01-22          66
2017-01-29          45
# To group by week and sum the random numbers per week
>>> df.groupby(pd.Grouper(freq='W')).sum().head()
            random_num
date
2017-01-01        7132
2017-01-08       33916
2017-01-15       31028
2017-01-22       31509
2017-01-29       22129
You can also apply any generic function myFunction by using the apply method of pandas
df.groupby(pd.Grouper(freq='W')).apply(myFunction)
If you want to apply a function myFunction to any specific column columnName after grouping, you can also do that as follows
df.groupby(pd.Grouper(freq='W'))[columnName].apply(myFunction)
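For instance, a toy myFunction (a hypothetical stand-in) that measures the weekly spread of the random numbers:
def myFunction(week_df):
    # Spread of the values observed within one week
    return week_df['random_num'].max() - week_df['random_num'].min()

weekly_spread = df.groupby(pd.Grouper(freq='W')).apply(myFunction)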
[SOLVED FOR MULTIPLE YEARS]
pd.Grouper(freq='W') works fine, but I have sometimes come across undesired behavior in how weeks are split when they don't divide evenly, which is why I sometimes prefer to do the week split by hand, as in this example.
So, given a dataset that spans multiple years
import numpy as np
import pandas as pd
import datetime

# Create dataset
np.random.seed(123)
n = 100000
date = pd.to_datetime({
    'year': np.random.choice(range(2017, 2020), size=n),
    'month': np.random.choice(range(1, 13), size=n),
    'day': np.random.choice(range(1, 28), size=n)
})
random_num = np.random.choice(range(0, 1000), size=n)
df = pd.DataFrame({'date': date, 'random_num': random_num})
Such as:
print(df.head())
date random_num
0 2019-12-11 413
1 2018-06-08 594
2 2019-08-06 983
3 2019-10-11 73
4 2017-09-19 32
First create a helper index that allows you to iterate by week (considering the year as well):
df['grp_idx'] = df['date'].apply(
    lambda x: '%s-%s' % (x.year, '{:02d}'.format(x.week)))
print(df.head())
date random_num grp_idx
0 2019-12-11 413 2019-50
1 2018-06-08 594 2018-23
2 2019-08-06 983 2019-32
3 2019-10-11 73 2019-41
4 2017-09-19 32 2017-38
Then just apply the function that computes on each weekly subset, something like this:
def something_to_do_by_week(week_data):
    """Computes the mean random value."""
    return week_data['random_num'].mean()
weekly_mean = df.groupby('grp_idx').apply(something_to_do_by_week)
print(weekly_mean.head())
grp_idx
2017-01 515.875668
2017-02 487.226704
2017-03 503.371681
2017-04 497.717647
2017-05 475.323420
Once you have your weekly metrics, you'll probably want to get back to actual dates, which are more useful than year-week indices:
def from_year_week_to_date(year_week):
    """Approximate a 'YYYY-WW' label with a date in that week."""
    year, week = year_week.split('-')
    year, week = int(year), int(week)
    date = pd.to_datetime('%s-01-01' % year)
    date += datetime.timedelta(days=week * 7)
    return date
weekly_mean.index = [from_year_week_to_date(x) for x in weekly_mean.index]
print(weekly_mean.head())
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647
2017-02-05 475.323420
dtype: float64
Finally, you can now make nice plots with interpretable dates.
Just as a sanity check, the computation using pd.Grouper(freq='W') gives me almost the same results (it adds an extra week at the beginning of the pd.Series because 'W' bins end on Sundays, and 2017-01-01 was itself a Sunday, so it gets a bin of its own):
df.set_index('date').groupby(
    pd.Grouper(freq='W')
).mean().head()
Out[27]:
random_num
date
2017-01-01 532.736364
2017-01-08 515.875668
2017-01-15 487.226704
2017-01-22 503.371681
2017-01-29 497.717647

python pandas extract year from datetime: df['year'] = df['date'].year is not working

I import a dataframe via read_csv, but for some reason can't extract the year or month from the series df['date'], trying that gives AttributeError: 'Series' object has no attribute 'year':
date Count
6/30/2010 525
7/30/2010 136
8/31/2010 125
9/30/2010 84
10/29/2010 4469
df = pd.read_csv('sample_data.csv', parse_dates=True)
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].year
df['month'] = df['date'].month
UPDATE:
and when I try solutions with df['date'].dt on my pandas version 0.14.1, I get "AttributeError: 'Series' object has no attribute 'dt' ":
df = pd.read_csv('sample_data.csv',parse_dates=True)
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
Sorry for this question that seems repetitive - I expect the answer will make me feel like a bonehead... but I have not had any luck using answers to the similar questions on SO.
FOLLOWUP: I can't seem to update my pandas 0.14.1 to a newer release in my Anaconda environment, each of the attempts below generates an invalid syntax error. I'm using Python 3.4.1 64bit.
conda update pandas
conda install pandas==0.15.2
conda install -f pandas
Any ideas?
If you're running a recent-ish version of pandas then you can use the datetime accessor dt to access the datetime components:
In [6]:
df['date'] = pd.to_datetime(df['date'])
df['year'], df['month'] = df['date'].dt.year, df['date'].dt.month
df
Out[6]:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
EDIT
It looks like you're running an older version of pandas in which case the following would work:
In [18]:
df['date'] = pd.to_datetime(df['date'])
df['year'], df['month'] = df['date'].apply(lambda x: x.year), df['date'].apply(lambda x: x.month)
df
Out[18]:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
Regarding why read_csv didn't parse this into a datetime: you need to pass the ordinal position of your column (parse_dates=[0]), because parse_dates=True only attempts to parse the index; see the docs.
In [20]:
import io  # needed for io.StringIO below
t = """date Count
6/30/2010 525
7/30/2010 136
8/31/2010 125
9/30/2010 84
10/29/2010 4469"""
df = pd.read_csv(io.StringIO(t), sep='\s+', parse_dates=[0])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
date 5 non-null datetime64[ns]
Count 5 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 120.0 bytes
So if you pass param parse_dates=[0] to read_csv there shouldn't be any need to call to_datetime on the 'date' column after loading.
This works:
df['date'].dt.year
Now:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
gives this data frame:
date Count year month
0 2010-06-30 525 2010 6
1 2010-07-30 136 2010 7
2 2010-08-31 125 2010 8
3 2010-09-30 84 2010 9
4 2010-10-29 4469 2010 10
When to use dt accessor
A common source of confusion revolves around when to use .year and when to use .dt.year.
The former is an attribute for pd.DatetimeIndex objects; the latter for pd.Series objects. Consider this dataframe:
df = pd.DataFrame({'Dates': pd.to_datetime(['2018-01-01', '2018-10-20', '2018-12-25'])},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03']))
The definition of the series and index look similar, but the pd.DataFrame constructor converts them to different types:
type(df.index) # pandas.tseries.index.DatetimeIndex
type(df['Dates']) # pandas.core.series.Series
The DatetimeIndex object has a direct year attribute, while the Series object must use the dt accessor. Similarly for month:
df.index.month # array([1, 1, 1])
df['Dates'].dt.month.values # array([ 1, 10, 12], dtype=int64)
A subtle but important difference worth noting is that df.index.month gives a NumPy array, while df['Dates'].dt.month gives a Pandas series. Above, we use pd.Series.values to extract the NumPy array representation.
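To make the symmetry concrete, a short sketch reusing the df defined above:
# DatetimeIndex: the attribute is available directly
df.index.year           # array([2000, 2000, 2000])

# Series: go through the .dt accessor...
df['Dates'].dt.year     # a Series holding 2018, 2018, 2018
# ...since plain attribute access reproduces the question's error:
# df['Dates'].year  ->  AttributeError: 'Series' object has no attribute 'year'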
Probably already too late to answer, but since you have already parsed the dates while loading the data, you can just do this to get the year:
df['date'] = pd.DatetimeIndex(df['date']).year
What worked for me was upgrading pandas to the latest version. From the command line, run:
conda update pandas

Getting the average of a certain hour on weekdays over several years in a pandas dataframe

I have an hourly dataframe in the following format over several years:
Date/Time Value
01.03.2010 00:00:00 60
01.03.2010 01:00:00 50
01.03.2010 02:00:00 52
01.03.2010 03:00:00 49
.
.
.
31.12.2013 23:00:00 77
I would like to average the data so I can get the average of hour 0, hour 1, ..., hour 23 for each of the years.
So the output should look somehow like this:
Year Hour Avg
2010 00 63
2010 01 55
2010 02 50
.
.
.
2013 22 71
2013 23 80
Does anyone know how to obtain this in pandas?
Note: Now that Series have the dt accessor it's less important that date is the index, though Date/Time still needs to be a datetime64.
Update: You can do the groupby more directly (without the lambda):
In [21]: df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
Out[21]:
Value
Date/Time Date/Time
2010 0 60
1 50
2 52
3 49
In [22]: res = df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
In [23]: res.index.names = ["year", "hour"]
In [24]: res
Out[24]:
Value
year hour
2010 0 60
1 50
2 52
3 49
If it's a datetime64 index you can do:
In [31]: df1.groupby([df1.index.year, df1.index.hour]).mean()
Out[31]:
Value
2010 0 60
1 50
2 52
3 49
Old answer (will be slower):
Assuming Date/Time was the index* you can use a mapping function in the groupby:
In [11]: year_hour_means = df1.groupby(lambda x: (x.year, x.hour)).mean()
In [12]: year_hour_means
Out[12]:
Value
(2010, 0) 60
(2010, 1) 50
(2010, 2) 52
(2010, 3) 49
For a more useful index, you could then create a MultiIndex from the tuples:
In [13]: year_hour_means.index = pd.MultiIndex.from_tuples(year_hour_means.index,
                                                           names=['year', 'hour'])
In [14]: year_hour_means
Out[14]:
Value
year hour
2010 0 60
1 50
2 52
3 49
* if not, then first use set_index:
df1 = df.set_index('Date/Time')
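Pulling the pieces together, a self-contained sketch of the groupby approach (the dayfirst flag matching the question's DD.MM.YYYY format is an assumption on my part):
import pandas as pd

df = pd.DataFrame({
    'Date/Time': ['01.03.2010 00:00:00', '01.03.2010 01:00:00',
                  '02.03.2010 00:00:00', '02.03.2010 01:00:00'],
    'Value': [60, 50, 52, 49],
})
df['Date/Time'] = pd.to_datetime(df['Date/Time'], dayfirst=True)

res = df.groupby([df['Date/Time'].dt.year,
                  df['Date/Time'].dt.hour])['Value'].mean()
res.index.names = ['Year', 'Hour']
print(res)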
If your date/time column is in datetime format and set as the index (see dateutil.parser for automatic parsing options), you can use pandas resample as below:
year_hour_means = df.resample('H', how='mean')
(in modern pandas the equivalent is df.resample('H').mean()), which will keep your data in the datetime format. This may help you with whatever you are going to be doing with your data down the line.

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate ages based on the time difference between 'entry_date' and 'dob'; to do this I need the difference in days between the two columns (so that I can then do something like round(days / 365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
                           pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
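Applied to the question's own columns (names taken from the question; a sketch assuming both columns are already datetime64):
# Whole days between the two columns, then approximate years
days = (munged_data['entry_date'] - munged_data['dob']).dt.days
munged_data['age'] = (days / 365.25).round().astype(int)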
You need 0.11 for this (0.11rc1 is out, final probably next week). The session below assumes from pandas import DataFrame, Timestamp.
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
                          Timestamp('20040601') ], columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0] - df.ix[1]  # .ix was removed in pandas 1.0; use df.iloc[0] - df.iloc[1] today
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's say that you have a pandas Series named time_difference whose dtype is timedelta64[ns].
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
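In current pandas the tslib detour is unnecessary; pd.Timedelta is exposed at the top level, and the dt accessor does the same in one line (assuming time_difference really is timedelta64[ns]):
# .dt.days returns the integer day component of each timedelta
just_day = time_difference.dt.days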
To convert any timedelta into days, just use pd.Timedelta(...).days. Note that the ambiguous 'Y' and 'M' units were deprecated and later removed from pd.Timedelta, so stick to unambiguous units:
pd.Timedelta('1985 days').days
1985
