Given a Pandas dataframe created as follows:
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6),index=dates,columns=list('A'))
A
2013-01-01 0.847528
2013-01-02 0.204139
2013-01-03 0.888526
2013-01-04 0.769775
2013-01-05 0.175165
2013-01-06 -1.564826
I want to add 15 days to the index.
This does not work:
#from pandas.tseries.offsets import *
df.index+relativedelta(days=15)
#df.index + DateOffset(days=5)
TypeError: relativedelta(days=+15)
I seem to be incapable of doing anything right with indexes....
You can use DateOffset:
>>> df = pd.DataFrame(np.random.randn(6),index=dates,columns=list('A'))
>>> df.index = df.index + pd.DateOffset(days=15)
>>> df
A
2013-01-16 0.015282
2013-01-17 1.214255
2013-01-18 1.023534
2013-01-19 1.355001
2013-01-20 1.289749
2013-01-21 1.484291
Marginally shorter/more direct is tshift:
df = df.tshift(15, freq='D')
See the offset aliases section of the pandas timeseries docs for the full list of freq strings.
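Note that tshift has since been deprecated (and was removed in pandas 2.0); shift with a freq argument behaves the same way. A minimal sketch:
# shift(freq=...) moves the index rather than the data, matching the old tshift
df = df.shift(15, freq='D')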
If you need to convert the index to a DatetimeIndex first and then add the days, use:
df.index = pd.to_datetime(df.index) + pd.Timedelta('15 days')
If it is already a DatetimeIndex:
df.index += pd.Timedelta('15 days')
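Timedelta is the right tool for fixed-length spans such as days or hours; for calendar-aware shifts like months, reach for DateOffset instead. A small sketch:
# a month is not a fixed-length span, so Timedelta cannot express it, but DateOffset can
df.index = df.index + pd.DateOffset(months=1)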
Related
I have a dataframe with data for each minute; it also contains a date column, which keeps track of the date in timestamp format.
Here I'm trying to aggregate the data by hour instead of by minute.
I tried the following code, which works, but it needs the dataframe to be indexed on the date column, which I don't want because then I cannot loop through the dataframe using df.loc.
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='T')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0,100,size=(len(date_rng)))
df = df.set_index('date')  # note: set_index is not in-place, so assign the result back
df = df.resample('H').sum()
df.head(15)
I also tried groupby, but it's not working; the following is the code.
df.groupby([df.date.dt.hour]).data.sum()
print(df.head(15))
How can I group by date without indexing it?
Thanks.
Try pd.Grouper and specify the freq parameter:
df.groupby([pd.Grouper(key='date', freq='1H')]).sum()
Full code:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='T')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))
print(df.groupby([pd.Grouper(key='date', freq='1H')]).sum())
# data
# date
# 2018-01-01 00:00:00 2958
# 2018-01-01 01:00:00 3084
# 2018-01-01 02:00:00 2991
# 2018-01-01 03:00:00 3021
# 2018-01-01 04:00:00 2894
# ... ...
# 2018-01-07 20:00:00 2863
# 2018-01-07 21:00:00 2850
# 2018-01-07 22:00:00 2823
# 2018-01-07 23:00:00 2805
# 2018-01-08 00:00:00 25
# [169 rows x 1 columns]
Hope that helps!
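As a side note, resample also accepts an on= keyword, so you can aggregate by hour without ever setting the index. A sketch, assuming the same df as above:
# resample directly on the 'date' column; the result comes back indexed by hour
hourly = df.resample('H', on='date').sum()
print(hourly.head())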
My problem is that I have a big dataframe with over 40000 rows, and now I want to select the rows from 2013-01-01 00:00:00 until 2013-12-31 00:00:00.
print(df.loc[df['localhour'] == '2013-01-01 00:00:00'])
That's my code now, but I cannot choose an interval to print out... any ideas?
One way is to set your index as datetime and then use pd.DataFrame.loc with string indexers:
df = pd.DataFrame({'Date': ['2013-01-01', '2014-03-01', '2011-10-01', '2013-05-01'],
'Var': [1, 2, 3, 4]})
df['Date'] = pd.to_datetime(df['Date'])
res = df.set_index('Date').loc['2010-01-01':'2013-01-01']
print(res)
Var
Date
2013-01-01 1
2011-10-01 3
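One caveat: string slicing like this can raise a KeyError when the DatetimeIndex is not monotonic, so it is safest to sort first. A sketch:
# sorting the index makes label slices unambiguous
res = df.set_index('Date').sort_index().loc['2010-01-01':'2013-01-01']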
Convert the column to datetime and then apply the condition:
print(df)
date
0 2013-01-01
1 2014-03-01
2 2011-10-01
3 2013-05-01
df['date']=pd.to_datetime(df['date'])
df['date'].loc[(df['date']<='2013-12-31 00:00:00') & (df['date']>='2013-01-01 00:00:00')]
Output:
0 2013-01-01
3 2013-05-01
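The same condition can be written more compactly with Series.between, which is inclusive on both ends by default. A sketch:
# equivalent boolean mask using between()
df.loc[df['date'].between('2013-01-01', '2013-12-31')]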
How to select multiple rows of a dataframe by list of dates
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
In[1]: df
Out[1]:
A B C D
2013-01-01 0.084393 -2.460860 -0.118468 0.543618
2013-01-02 -0.024358 -1.012406 -0.222457 1.906462
2013-01-03 -0.305999 -0.858261 0.320587 0.302837
2013-01-04 0.527321 0.425767 -0.994142 0.556027
2013-01-05 0.411410 -1.810460 -1.172034 -1.142847
2013-01-06 -0.969854 0.469045 -0.042532 0.699582
myDates = ["2013-01-02", "2013-01-04", "2013-01-06"]
So the output should be
A B C D
2013-01-02 -0.024358 -1.012406 -0.222457 1.906462
2013-01-04 0.527321 0.425767 -0.994142 0.556027
2013-01-06 -0.969854 0.469045 -0.042532 0.699582
You can use the index.isin() method to create a logical index for subsetting:
df[df.index.isin(myDates)]
Convert your list into a DatetimeIndex:
df.loc[pd.to_datetime(myDates)]
A B C D
2013-01-02 -0.047710 -1.827593 -0.944548 -0.149460
2013-01-04 1.437924 0.126788 0.641870 0.198664
2013-01-06 0.408820 -1.842112 -0.287346 0.071397
If you have a timeseries containing hours and minutes in the index (e.g. 2022-03-07 09:03:00+00:00 instead of 2022-03-07), and you want to filter by dates (without hours, minutes, etc.), you can use the following:
df.loc[np.isin(df.index.date, myDates)]
If you try df.loc[df.index.date.isin(myDates)] it will not work, because df.index.date returns a plain NumPy array and Python throws AttributeError: 'numpy.ndarray' object has no attribute 'isin'; this is why we use np.isin.
This is an old post but I think this can be useful to a lot of people (such as myself).
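An alternative that stays inside pandas is to normalize the index to midnight, which keeps .isin available on the resulting DatetimeIndex. A sketch:
# normalize() zeroes out the time component, so whole-day comparisons work
# (for a tz-aware index, localize myDates to the same timezone first)
df.loc[df.index.normalize().isin(pd.to_datetime(myDates))]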
I have a dataset I'm analyzing in pandas where all data is binned monthly. The data originates from a MySQL database where all dates are in the format 'YYYY-MM-01', such that, for example, all rows for October 2013 would have "2013-10-01" in the month column.
I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with
data = pd.read_table(filename, header=None, names=('uid','iid','artist','tag','date'), index_col=indexes, parse_dates=['date'])
This is all fine, except for the fact that any subsequent analyses I run in which I do monthly resampling always represents dates using the end-of-month convention (i.e. data from October becomes '2013-10-31' instead of '2013-10-01'), but this can lead to inconsistencies where the original data has months labeled as 'YYYY-MM-01', while any resampled data will have the months labeled as 'YYYY-MM-31' (or '-30' or '-28', as appropriate).
My question is this: What is the easiest and/or fastest way I can convert all the dates in my dataframe to the end-of-month format from the outset? Keep in mind that the date is one of several indexes in a multi-index, not a column. I think my best bet is to use a modified date_parser in my pd.read_table call that always converts the month to the end-of-month convention, but I'm not sure how to approach it.
Read your dates in exactly like you are doing.
Create some test data. I am setting the dates to the start of month, but it doesn't matter.
In [39]: df = pd.DataFrame(np.random.randn(10,2), columns=list('AB'),
   ....:                   index=pd.date_range('20130101', periods=10, freq='MS'))
In [40]: df
Out[40]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
Force-convert them to end-of-month timestamps, regardless of the day:
In [41]: df.index = df.index.to_period().to_timestamp('M')
In [42]: df
Out[42]:
A B
2013-01-31 -0.553482 0.049128
2013-02-28 0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638 1.903388
2013-05-31 -0.087752 1.551916
2013-06-30 1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643 1.661696
2013-09-30 0.501061 -1.455171
2013-10-31 1.343630 -2.008060
Back to the start of the month:
In [43]: df.index = df.index.to_period().to_timestamp('MS')
In [44]: df
Out[44]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
You can also work with (and resample) the data as periods:
In [45]: df.index = df.index.to_period()
In [46]: df
Out[46]:
A B
2013-01 -0.553482 0.049128
2013-02 0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638 1.903388
2013-05 -0.087752 1.551916
2013-06 1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643 1.661696
2013-09 0.501061 -1.455171
2013-10 1.343630 -2.008060
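Once the index is a PeriodIndex, converting between frequencies is just asfreq. For example, aggregating the monthly periods up to quarters (a sketch):
# asfreq('Q') maps each monthly period to the quarter containing it
quarterly = df.groupby(df.index.asfreq('Q')).sum()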
Use replace() to change the day value; you can get the last day of the month using calendar.monthrange:
from datetime import date
import calendar
d = date(2000,1,1)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])
UPDATE
I've added an example for pandas.
Sample file date.csv:
2013-01-01, 1
2013-02-01, 2
IPython shell log:
In [27]: import pandas as pd
In [28]: from datetime import datetime, date
In [29]: import calendar
In [30]: def parse(dt):
   ....:     dt = datetime.strptime(dt, '%Y-%m-%d')
   ....:     dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
   ....:     return dt.date()
   ....:
In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)
In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'), parse_dates=['date'], date_parser=parse)
In [33]: r
Out[33]:
date value
0 2013-01-31 1
1 2013-02-28 2
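A vectorized alternative to a custom date_parser (not from the original answer) is to parse normally and then roll every date forward to its month end with an anchored offset:
# MonthEnd(0) rolls a date forward to the end of its own month;
# dates already at month end are left unchanged
s = pd.to_datetime(pd.Series(['2013-01-01', '2013-02-01']))
print(s + pd.offsets.MonthEnd(0))  # 2013-01-31, 2013-02-28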
I have a Pandas data frame where one of the columns contains date strings in the format YYYY-MM-DD,
e.g. '2013-10-28'.
At the moment the dtype of the column is object.
How do I convert the column values to Pandas date format?
Essentially equivalent to @waitingkuo's answer, but I would use pd.to_datetime here (it seems a little cleaner, and offers some additional functionality, e.g. dayfirst):
In [11]: df
Out[11]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [12]: pd.to_datetime(df['time'])
Out[12]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
Name: time, dtype: datetime64[ns]
In [13]: df['time'] = pd.to_datetime(df['time'])
In [14]: df
Out[14]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
Handling ValueErrors
If you run into a situation where doing
df['time'] = pd.to_datetime(df['time'])
throws a
ValueError: Unknown string format
that means you have invalid (non-coercible) values. If you are okay with having them converted to pd.NaT, you can add an errors='coerce' argument to to_datetime:
df['time'] = pd.to_datetime(df['time'], errors='coerce')
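To see which values actually failed to parse before deciding to coerce them, you can use the coerced result as a mask. A sketch:
# rows where parsing produced NaT are the invalid ones
# (pre-existing missing values will show up here too)
bad = pd.to_datetime(df['time'], errors='coerce').isna()
print(df.loc[bad, 'time'])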
Use astype
In [31]: df
Out[31]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [32]: df['time'] = df['time'].astype('datetime64[ns]')
In [33]: df
Out[33]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
I imagine a lot of data comes into Pandas from CSV files, in which case you can simply convert the date during the initial CSV read:
dfcsv = pd.read_csv('xyz.csv', parse_dates=[0])
where the 0 refers to the column the date is in.
You could also add index_col=0 in there if you want the date to be your index.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Now you can do df['column'].dt.date
Note that for datetime objects, if you don't see the hour when they're all 00:00:00, that's not pandas; that's the IPython notebook trying to make things look pretty.
If you want to get the DATE and not DATETIME format:
df["id_date"] = pd.to_datetime(df["id_date"]).dt.date
Another way to do this, which works well if you have multiple columns to convert to datetime:
cols = ['date1','date2']
df[cols] = df[cols].apply(pd.to_datetime)
It may be the case that dates need to be converted to a different frequency. In this case, I would suggest setting an index by dates.
#set an index by dates
df.set_index(['time'], drop=True, inplace=True)
After this, you can more easily convert to the type of date format you will need most. Below, I sequentially convert to a number of date formats, ultimately ending up with a set of daily dates at the beginning of the month.
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
#Convert to monthly dates
df.index = df.index.to_period(freq='M')
#Convert to strings
df.index = df.index.strftime('%Y-%m')
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
For brevity, I don't show that I run the following code after each line above:
print(df.index)
print(df.index.dtype)
print(type(df.index))
This gives me the following output:
Index(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='object', name='time')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', name='time', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
PeriodIndex(['2013-01', '2013-01', '2013-01'], dtype='period[M]', name='time', freq='M')
period[M]
<class 'pandas.core.indexes.period.PeriodIndex'>
Index(['2013-01', '2013-01', '2013-01'], dtype='object')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
For the sake of completeness, another option, which might not be the most straightforward one, similar to the one proposed by @SSS but using the datetime library instead, is:
import datetime
df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%d-%m').date())
Suppose df.info() shows two date columns stored as plain object dtype:
#  #  Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   startDay  110526 non-null  object
#  1   endDay    110526 non-null  object
import pandas as pd
df['startDay'] = pd.to_datetime(df.startDay)
df['endDay'] = pd.to_datetime(df.endDay)
After the conversion, df.info() shows datetime64 dtypes:
#  #  Column    Non-Null Count   Dtype
# ---  ------    --------------   -----
#  0   startDay  110526 non-null  datetime64[ns]
#  1   endDay    110526 non-null  datetime64[ns]
Try converting one of the rows into a timestamp using the pd.to_datetime function, and then use .map to apply the formula to the entire column.
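A minimal sketch of that approach, assuming the column is named 'date':
# .map applies pd.to_datetime element-wise across the column
df['date'] = df['date'].map(pd.to_datetime)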