I have a Series of dates in datetime64 format.
I want to convert them to a series of Period with a monthly frequency. (Essentially, I want to group dates into months for analytical purposes).
There must be a way of doing this - I just cannot find it quickly.
Note: these dates are not the index of the data frame - they are just a column of data in the data frame.
Example input data (as a Series):
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-12-01']))
print (data)
My current kludge/workaround looks like:
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.DatetimeIndex(data).to_period('M')
data = pd.Series(data.year).astype('str') + '-' + pd.Series((data.month).astype('int')).map('{:0>2d}'.format)
data = data.where(data != '2262-04', other='No Date')
print (data)
There are some issues currently (even in master) dealing with NaT in PeriodIndex, so your approach won't work like that. But it seems that you simply want to resample, so do this. You can of course specify a function for the how argument if you want.
In [57]: data
Out[57]:
0 2014-10-01
1 2014-10-01
2 2014-10-31
3 2014-11-15
4 2014-11-30
5 NaT
6 2014-12-01
dtype: datetime64[ns]
In [58]: df = DataFrame(dict(A = data, B = np.arange(len(data))))
In [59]: df.dropna(how='any',subset=['A']).set_index('A').resample('M',how='count')
Out[59]:
B
A
2014-10-31 3
2014-11-30 2
2014-12-31 1
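For readers on a recent pandas: the NaT-in-PeriodIndex issues have long since been fixed, so the direct conversion now works, and resample('M', how='count') became resample('M').count(). A minimal sketch of the direct conversion:

import numpy as np
import pandas as pd

data = pd.to_datetime(pd.Series(['2014-10-01', '2014-11-15', np.nan]))
# NaT survives the conversion to monthly periods in modern pandas
print(data.dt.to_period('M'))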
import pandas as pd
import numpy as np
from datetime import datetime
data = pd.to_datetime(
    pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.Series(['{}-{:02d}'.format(x.year, x.month) if isinstance(x, datetime) else 'NaT'
                  for x in pd.DatetimeIndex(data).to_pydatetime()])
0 2014-10
1 2014-10
2 2014-10
3 2014-11
4 2014-11
5 NaT
6 2014-01
dtype: object
This is the best I could come up with. If the only non-datetime objects possible are floats, you can change if isinstance(x, datetime) to if not isinstance(x, float).
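For what it's worth, a vectorized alternative under the same assumption (NaN floats are the only non-dates) is the .dt.strftime accessor, which leaves NaN where the input was NaT; a sketch:

data = pd.to_datetime(pd.Series(['2014-10-01', np.NaN, '2014-01-01']))
# strftime skips NaT, leaving NaN, which we relabel for readability
labels = data.dt.strftime('%Y-%m').fillna('NaT')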
Related
I have a huge dataframe with many columns, many of which are of type datetime.datetime. The problem is that many also have mixed types, including for instance datetime.datetime values and None values (and potentially other invalid values):
0 2017-07-06 00:00:00
1 2018-02-27 21:30:05
2 2017-04-12 00:00:00
3 2017-05-21 22:05:00
4 2018-01-22 00:00:00
...
352867 2019-10-04 00:00:00
352868 None
352869 some_string
Name: colx, Length: 352872, dtype: object
Hence resulting in an object type column. This can be solved with df.colx.fillna(pd.NaT). The problem is that the dataframe is too big to search for individual columns.
Another approach is to use pd.to_datetime(col, errors='coerce'), however this will cast to datetime many columns that contain numerical values.
I could also do df.fillna(float('nan'), inplace=True), though the columns containing dates are still of object type, and would still have the same problem.
What approach could I follow to cast to datetime those columns whose values really do contain datetime values, but may also contain None and potentially some invalid values? (Mentioning the invalid values since otherwise a pd.to_datetime in a try/except clause would do.) Something like a flexible version of pd.to_datetime(col).
This function will set the data type of a column to datetime if any value in the column matches the regex pattern (\d{4}-\d{2}-\d{2})+ (e.g. 2019-01-01). Credit to this answer on how to Search for String in all Pandas DataFrame columns and filter, which helped with setting and applying the mask.
def presume_date(dataframe):
    """Set datetime by presuming any date values in the column
    indicate that the column data type should be datetime.

    Args:
        dataframe: Pandas dataframe.

    Returns:
        Pandas dataframe.

    Raises:
        None
    """
    df = dataframe.copy()
    # A column qualifies if any of its values (as strings) look like YYYY-MM-DD
    mask = dataframe.astype(str).apply(lambda x: x.str.match(
        r'(\d{4}-\d{2}-\d{2})+').any())
    df_dates = df.loc[:, mask].apply(pd.to_datetime, errors='coerce')
    for col in df_dates.columns:
        df[col] = df_dates[col]
    return df
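A hypothetical usage example (the column names here are invented for illustration):

import pandas as pd

df = pd.DataFrame({'colx': ['2017-07-06 00:00:00', None, 'some_string'],
                   'coly': [1, 2, 3]})
fixed = presume_date(df)
# colx becomes datetime64[ns] with NaT for the bad rows; coly is untouched
print(fixed.dtypes)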
Working from the suggestion to use dateutil, this may help. It is still working on the presumption that if there are any date-like values in a column, the column should be a datetime. I tried to consider different dataframe iteration methods that are faster. I think this answer on How to iterate over rows in a DataFrame in Pandas did a good job describing them.
Note that dateutil.parser will use the current day or year for any strings like 'December' or 'November 2019' with no year or day values.
import pandas as pd
import datetime
from dateutil.parser import parse
df = pd.DataFrame(columns=['are_you_a_date','no_dates_here'])
df = df.append(pd.Series({'are_you_a_date':'December 2015','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'February 27 2018','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'May 2017 12','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'2017-05-21','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':None,'no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'some_string','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'Processed: 2019/01/25','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'December','no_dates_here':'just a string'}), ignore_index=True)
def parse_dates(x):
    try:
        return parse(x, fuzzy=True)
    except ValueError:
        return ''
    except TypeError:
        return ''

list_of_datetime_columns = []
# Iterating over a dataframe yields column names; the inner comprehension
# then scans the values of that column
for col in df:
    if any([isinstance(parse_dates(value[0]), datetime.datetime)
            for value in df[[col]].values]):
        list_of_datetime_columns.append(col)

df_dates = df.loc[:, list_of_datetime_columns].apply(pd.to_datetime, errors='coerce')

for col in list_of_datetime_columns:
    df[col] = df_dates[col]
In case you would also like to use the datetime values from dateutil.parser, you can add this:
for col in list_of_datetime_columns:
    df[col] = df[col].apply(lambda x: parse_dates(x))
The main problem I see is when parsing numerical values.
I'd propose converting them to strings first.
Setup
dat = {
    'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
    'columns': ['Mixed', 'Numeric Values', 'Strings'],
    'data': [
        ['2017-07-06 00:00:00', 1, 'HI'],
        ['2018-02-27 21:30:05', 1, 'HI'],
        ['2017-04-12 00:00:00', 1, 'HI'],
        ['2017-05-21 22:05:00', 1, 'HI'],
        ['2018-01-22 00:00:00', 1, 'HI'],
        ['2019-10-04 00:00:00', 1, 'HI'],
        ['None', 1, 'HI'],
        ['some_string', 1, 'HI']
    ]
}
df = pd.DataFrame(**dat)
df
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 1 HI
1 2018-02-27 21:30:05 1 HI
2 2017-04-12 00:00:00 1 HI
3 2017-05-21 22:05:00 1 HI
4 2018-01-22 00:00:00 1 HI
352867 2019-10-04 00:00:00 1 HI
352868 None 1 HI
352869 some_string 1 HI
Solution
df.astype(str).apply(pd.to_datetime, errors='coerce')
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 NaT NaT
1 2018-02-27 21:30:05 NaT NaT
2 2017-04-12 00:00:00 NaT NaT
3 2017-05-21 22:05:00 NaT NaT
4 2018-01-22 00:00:00 NaT NaT
352867 2019-10-04 00:00:00 NaT NaT
352868 NaT NaT NaT
352869 NaT NaT NaT
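One caveat: applied to the whole frame, this wipes the numeric and string columns to NaT, as the output shows. A follow-up sketch that only overwrites columns where coercion mostly succeeded (the 0.5 threshold is an arbitrary assumption):

parsed = df.astype(str).apply(pd.to_datetime, errors='coerce')
mostly_dates = parsed.notna().mean() > 0.5  # fraction parsed per column
for col in df.columns[mostly_dates]:
    df[col] = parsed[col]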
I am working with Python 3.5.2, pandas 0.18.1 and sqlite3.
In my database, I have a column unix_time with INT for seconds since 1970. Ideally I want to read my dataframe from sqlite, and then create a time column corresponding to the datetime or pandas.tslib.Timestamp conversion of the unix_time column, which I would only use for some processing and then drop before saving the dataframe back.
The issue is that when parsing the unix_time column using:
df = pd.read_sql_query("SELECT * FROM test", con, parse_dates=['unix_time'])
I obtain pandas.tslib.Timestamp types, which is fine for my processing, but then I have to recreate my original unix_time column using:
df['unix_time'][i] = (df['unix_time'][i] - datetime(1970,1,1)).total_seconds()
which is really 'dirty'
First question : Do you have a better way?
I thought about giving up the unix time format and only using the datetime format, but the to_datetime method from pandas in fact returns pandas.tslib.Timestamp... And anyway, doing so would force me to iterate over all rows, which is a bad solution. (It seems impossible to apply to_datetime on anything other than a view over a single cell of the dataframe.)
Second question : Is it possible to apply it on a series?
My last try was to directly use df['time'] = datetime.datetime.fromtimestamp(df['unix_time']), but surprisingly, it also returns pandas.tslib.Timestamp.
In the end, knowing that I can only save unix timestamps or datetimes, my only choices for the moment are:
parse, but then have to convert them back to unix timestamps one by one;
or not parse, but have to convert them to pandas.tslib.Timestamp one by one.
It would be great if I could convert a whole series.
Last question : Is there a way to convert a unix timestamps series to datetime (or at least pandas.tslib.Timestamp), or a pandas.tslib.Timestamp (or datetime) series to unix timestamps?
Thanks
EDIT:
During my processing, I extract a row that I want to append to my dataset. Apparently, the conversion to pandas.tslib.Timestamp happens implicitly when passing from dataframe to series:
df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
df['Date'] = pd.to_datetime(df.UNX, unit='s')
print(df.Date.dtypes)
print(type(df['Date'][0]))
test = df.iloc[0]
print(type(test.Date))
new_df = test.to_frame().transpose() #from here, impossible to do : new_df.to_sql("test", con) because the type for 'Date' is not supported
print(new_df.Date.dtypes)
returns
datetime64[ns]
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
object
Is there a way to convert the 'Date' in new_df from pandas.tslib.Timestamp to datetime64[ns] or datetime.datetime (or simply str)?
IIUC you can do it this way:
In [96]: df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
In [97]: df
Out[97]:
UNX
0 1451606400
1 1451616399
2 1451626398
3 1451636397
4 1451646396
5 1451656395
6 1451666394
7 1451676393
8 1451686392
9 1451696391
Convert UNIX epoch to Python datetime:
In [98]: df['Date'] = pd.to_datetime(df.UNX, unit='s')
In [99]: df
Out[99]:
UNX Date
0 1451606400 2016-01-01 00:00:00
1 1451616399 2016-01-01 02:46:39
2 1451626398 2016-01-01 05:33:18
3 1451636397 2016-01-01 08:19:57
4 1451646396 2016-01-01 11:06:36
5 1451656395 2016-01-01 13:53:15
6 1451666394 2016-01-01 16:39:54
7 1451676393 2016-01-01 19:26:33
8 1451686392 2016-01-01 22:13:12
9 1451696391 2016-01-02 00:59:51
Convert datetime to UNIX epoch:
In [100]: df['UNX2'] = df.Date.astype('int64')//10**9
In [101]: df
Out[101]:
UNX Date UNX2
0 1451606400 2016-01-01 00:00:00 1451606400
1 1451616399 2016-01-01 02:46:39 1451616399
2 1451626398 2016-01-01 05:33:18 1451626398
3 1451636397 2016-01-01 08:19:57 1451636397
4 1451646396 2016-01-01 11:06:36 1451646396
5 1451656395 2016-01-01 13:53:15 1451656395
6 1451666394 2016-01-01 16:39:54 1451666394
7 1451676393 2016-01-01 19:26:33 1451676393
8 1451686392 2016-01-01 22:13:12 1451686392
9 1451696391 2016-01-02 00:59:51 1451696391
Check:
In [102]: df.UNX.eq(df.UNX2).all()
Out[102]: True
Round trip between Pandas Timestamp and Unix Seconds (since 1970-01-01):
date_in = pd.to_datetime("2021-04-07")
# type(date_in) is: pandas._libs.tslibs.timestamps.Timestamp
unix_seconds = date_in.value//10**9
date_out = pd.to_datetime(unix_seconds, unit="s")
Output:
date_in
Out[1]: Timestamp('2021-04-07 00:00:00')
unix_seconds
Out[2]: 1617753600
date_out
Out[3]: Timestamp('2021-04-07 00:00:00')
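The same round trip also vectorizes over a whole Series, which is what the original question was after; a sketch:

s = pd.to_datetime(pd.Series(['2021-04-07', '2021-04-08']))
unix_seconds = s.astype('int64') // 10**9      # datetime64[ns] -> seconds
back = pd.to_datetime(unix_seconds, unit='s')  # and back again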
I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps, with median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 is a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case since I want use other columns in the Dataframe to groupby.
Pandas has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is that you can use pandas series functions like mean() to quickly compute summary statistics on the gaps series object.
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.iteritems():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
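The question also asked about runs of consecutive days; a sketch of that half using the same diff() idea (the exactly-one-day spacing is an assumption about the data's granularity):

# A new run starts wherever the gap to the previous date is not one day
run_id = (df['date'].diff() != pd.Timedelta(days=1)).cumsum()
for _, run in df.groupby(run_id):
    print(run['date'].dt.strftime('%Y-%m-%d').tolist())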
here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df.cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df to, say, a spreadsheet it will make sense:
print df.head()
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print df.tail()
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby(['ones']).aggregate({'ones':{'gaps':'count'},'index':{'first_spotted':'min'}})
df.columns = df.columns.droplevel()
which gives:
first_spotted gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
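A note for later pandas: the nested-dict aggregation spec used above was removed; named aggregation (available since pandas 0.25) expresses the same thing, starting from the df right after reset_index():

out = df.groupby('ones').agg(first_spotted=('index', 'min'),
                             gaps=('index', 'count'))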
Follow up from Summing across rows of Pandas Dataframe and Pandas Dataframe object types fillna exception over different datatypes
I am aggregating one of the columns using
df.groupby(['stock', 'same1', 'same2'], as_index=False)['positions'].sum()
This method is not very forgiving if there is missing data. If any data is missing in same1, same2, etc., it pads totally unrelated values. The workaround is to do a fillna loop over the columns, replacing missing strings with '' and missing numbers with zero, which solves the problem.
I do however have one column with missing dates as well. The column type is 'object', with NaN of type float in the missing cells and datetime objects in the existing data fields. It is important that I know that the data is missing, i.e. the missing indicator must survive the groupby transformation.
Dataset outlining the problem:
The csv file that I use as input is:
Date,Stock,Position,Expiry,same
2012/12/01,A,100,2013/06/01,AA
2012/12/01,A,200,2013/06/01,AA
2012/12/01,B,300,,BB
2012/6/01,C,400,2013/06/01,CC
2012/6/01,C,500,2013/06/01,CC
I then read in the file:
import datetime

import numpy as np
import pandas as pd

df = pd.read_csv('example', parse_dates=[0])

def convert_date(d):
    '''Converts YYYY/mm/dd to a datetime object'''
    if type(d) != str or len(d) != 10:
        return np.nan
    dd = d[8:]
    mm = d[5:7]
    YYYY = d[:4]
    return datetime.datetime(int(YYYY), int(mm), int(dd))

df['Expiry'] = df.Expiry.map(convert_date)
df
df looks like:
Date Stock Position Expiry same
0 2012-12-01 00:00:00 A 100 2013-06-01 00:00:00 AA
1 2012-12-01 00:00:00 A 200 2013-06-01 00:00:00 AA
2 2012-12-01 00:00:00 B 300 NaN BB
3 2012-06-01 00:00:00 C 400 2013-06-01 00:00:00 CC
4 2012-06-01 00:00:00 C 500 2013-06-01 00:00:00 CC
I can quite easily change the convert_date function to pop anything else in for missing data in the Expiry column.
Then using:
df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
to aggregate the Position column, I get a TypeError: can't compare datetime.datetime to str with any non-date that I plug into the missing date data. It is important for later functionality to know if Expiry is missing.
You need to convert your dates to the datetime64[ns] dtype (which manages how datetimes work). An object column is neither efficient nor does it deal well with datelikes. datetime64[ns] allows missing values using NaT (not-a-time), see here: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#datetimes
In [6]: df['Expiry'] = pd.to_datetime(df['Expiry'])
# alternative way of reading in the data (in 0.11.1, as ``NaT`` will be set
# for missing values in a datelike column)
In [4]: df = pd.read_csv('example',parse_dates=['Date','Expiry'])
In [9]: df.dtypes
Out[9]:
Date datetime64[ns]
Stock object
Position int64
Expiry datetime64[ns]
same object
dtype: object
In [7]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
Out[7]:
Stock Expiry same Position
0 A 2013-06-01 00:00:00 AA 300
1 B NaT BB 300
2 C 2013-06-01 00:00:00 CC 900
In [8]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum().dtypes
Out[8]:
Stock object
Expiry datetime64[ns]
same object
Position int64
dtype: object
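One more note for current pandas: groupby drops NaN/NaT keys by default, so the NaT group above would vanish; the dropna flag (added in pandas 1.1) keeps it:

df.groupby(['Stock', 'Expiry', 'same'], as_index=False, dropna=False)['Position'].sum()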
I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob', and to do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
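Applied back to the original age question, this becomes a one-liner (a sketch; munged_data, entry_date and dob are the names from the question):

ages = ((munged_data['entry_date'] - munged_data['dob']).dt.days / 365.25).round()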
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12)
Not sure if you still need it, but in Pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
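A caveat for newer readers: unit casts like .astype('timedelta64[Y]') (and .ix indexing) were removed in pandas 2.x; a sketch of an equivalent there divides by a Timedelta instead, with 365.25 days approximating a year:

df = pd.DataFrame([pd.Timestamp('20010101'), pd.Timestamp('20040605')])
# dividing a timedelta by a Timedelta yields a plain float ratio
years = (df.loc[0] - df.loc[1]) / pd.Timedelta(days=365.25)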
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
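In current pandas the .dt accessor covers this directly, so the apply is unnecessary; a one-line sketch:

just_day = time_difference.dt.days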
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494
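Note that the 'Y' and 'M' units were later deprecated and removed from pd.Timedelta because calendar years and months have no fixed length; in recent pandas an equivalent has to spell out an explicit day count, for example (assuming 365.2425 days per year):

days = pd.Timedelta(days=1985 * 365.2425).days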