I have a time series in a dataframe with DatetimeIndex like that:
import pandas as pd
dates= ["2015-10-01 00:00:00",
"2015-10-01 01:00:00",
"2015-10-01 02:00:00",
"2015-10-01 03:00:00",
"2015-10-01 04:00:00"]
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0,5)
Out[]:
values
2015-10-01 00:00:00 0
2015-10-01 01:00:00 1
2015-10-01 02:00:00 2
2015-10-01 03:00:00 3
2015-10-01 04:00:00 4
I would like to as simple clean as possible select a row looking like that, based on the date being the key, e.g. "2015-10-01 02:00:00":
Out[]:
values
2015-10-01 02:00:00 2
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
These surprisingly (?) result in the same series as follows:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
df.loc["2015-10-01 02:00:00",:]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
print(type(df.loc["2015-10-01 02:00:00"]))
print(type(df.loc["2015-10-01 02:00:00",:]))
print(df.loc["2015-10-01 02:00:00"].shape)
print(df.loc["2015-10-01 02:00:00",:].shape)
Out[]:
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
(1,)
(1,)
I could wrap any of those in DataFrame like that:
slize = pd.DataFrame(df.loc["2015-10-01 02:00:00",:])
Out[]:
2015-10-01 02:00:00
values 2
Of course I could do this to reach my result:
slize.T
Out[]:
values
2015-10-01 02:00:00 2
But as at this point, I could also expect a column as a series it is kinda hard to test if it is a row or columns series to add the T automatically.
Did I miss a way of selecting what I want?
I recommend to generate your index using pd.date_range for convenience, and then to use .loc with a Timestamp or datetime object.
from datetime import datetime
import pandas as pd
start = datetime(2015, 10, 1, 0, 0, 0)
end = datetime(2015, 10, 1, 4, 0, 0)
dates = pd.date_range(start, end, freq='H')
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0,5)
Then you can use .loc with a Timestamp or datetime object.
In [2]: df.loc[[start]]
Out[2]:
values
2015-10-01 0
Further details
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
KeyError occurs because you try to return a view of the DataFrame by looking for a column named "2015-10-01 02:00:00"
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
Your second option cannot work with str indexing, you should use exact indexing as mentioned instead.
These surprisingly (?) result in the same series as follows:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
If you use .loc on a single row you will have a coercion to Series type as you noticed. Hence you shall cast to DataFrame and then transpose the result.
You can convert string to datetime - using exact indexing:
print (df.loc[[pd.to_datetime("2015-10-01 02:00:00")]])
values
2015-10-01 02:00:00 2
Or convert Series to DataFrame and transpose:
print (df.loc["2015-10-01 02:00:00"].to_frame().T)
values
2015-10-01 02:00:00 2
df[df[time_series_row] == “data_to_match”]
Sorry for the formatting. On my phone, will update when I’m back at a computer.
Edit:
I would generally write it like this:
bitmask = df[time_seried_row] == "data_to_match"
row = df[bitmask]
Related
I have a huge dataframe with many columns, many of which are of type datetime.datetime. The problem is that many also have mixed types, including for instance datetime.datetime values and None values (and potentially other invalid values):
0 2017-07-06 00:00:00
1 2018-02-27 21:30:05
2 2017-04-12 00:00:00
3 2017-05-21 22:05:00
4 2018-01-22 00:00:00
...
352867 2019-10-04 00:00:00
352868 None
352869 some_string
Name: colx, Length: 352872, dtype: object
Hence resulting in an object type column. This can be solved with df.colx.fillna(pd.NaT). The problem is that the dataframe is too big to search for individual columns.
Another approach is to use pd.to_datetime(col, errors='coerce'), however this will cast to datetime many columns that contain numerical values.
I could also do df.fillna(float('nan'), inplace=True), though the columns containing dates are still of object type, and would still have the same problem.
What approach could I follow to cast to datetime those columns whose values really do contain datetime values, but could also contain None, and potentially some invalid values (mentioning since otherwise a pd.to_datetime in a try/except clause would do)? Something like a flexible version of pd.to_datetime(col)
This function will set the data type of a column to datetime, if any value in the column matches the regex pattern(\d{4}-\d{2}-\d{2})+ (e.g. 2019-01-01). Credit to this answer on how to Search for String in all Pandas DataFrame columns and filter that helped with setting and applying the mask.
def presume_date(dataframe):
""" Set datetime by presuming any date values in the column
indicates that the column data type should be datetime.
Args:
dataframe: Pandas dataframe.
Returns:
Pandas dataframe.
Raises:
None
"""
df = dataframe.copy()
mask = dataframe.astype(str).apply(lambda x: x.str.match(
r'(\d{4}-\d{2}-\d{2})+').any())
df_dates = df.loc[:, mask].apply(pd.to_datetime, errors='coerce')
for col in df_dates.columns:
df[col] = df_dates[col]
return df
Working from the suggestion to use dateutil, this may help. It is still working on the presumption that if there are any date-like values in a column, that the column should be a datetime. I tried to consider different dataframe iterations methods that are faster. I think this answer on How to iterate over rows in a DataFrame in Pandas did a good job describing them.
Note that dateutil.parser will use the current day or year for any strings like 'December' or 'November 2019' with no year or day values.
import pandas as pd
import datetime
from dateutil.parser import parse
df = pd.DataFrame(columns=['are_you_a_date','no_dates_here'])
df = df.append(pd.Series({'are_you_a_date':'December 2015','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'February 27 2018','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'May 2017 12','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'2017-05-21','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':None,'no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'some_string','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'Processed: 2019/01/25','no_dates_here':'just a string'}), ignore_index=True)
df = df.append(pd.Series({'are_you_a_date':'December','no_dates_here':'just a string'}), ignore_index=True)
def parse_dates(x):
try:
return parse(x,fuzzy=True)
except ValueError:
return ''
except TypeError:
return ''
list_of_datetime_columns = []
for row in df:
if any([isinstance(parse_dates(row[0]),
datetime.datetime) for row in df[[row]].values]):
list_of_datetime_columns.append(row)
df_dates = df.loc[:, list_of_datetime_columns].apply(pd.to_datetime, errors='coerce')
for col in list_of_datetime_columns:
df[col] = df_dates[col]
In case you would also like to use the datatime values from dateutil.parser, you can add this:
for col in list_of_datetime_columns:
df[col] = df[col].apply(lambda x: parse_dates(x))
The main problem I see is when parsing numerical values.
I'd propose converting them to strings first
Setup
dat = {
'index': [0, 1, 2, 3, 4, 352867, 352868, 352869],
'columns': ['Mixed', 'Numeric Values', 'Strings'],
'data': [
['2017-07-06 00:00:00', 1, 'HI'],
['2018-02-27 21:30:05', 1, 'HI'],
['2017-04-12 00:00:00', 1, 'HI'],
['2017-05-21 22:05:00', 1, 'HI'],
['2018-01-22 00:00:00', 1, 'HI'],
['2019-10-04 00:00:00', 1, 'HI'],
['None', 1, 'HI'],
['some_string', 1, 'HI']
]
}
df = pd.DataFrame(**dat)
df
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 1 HI
1 2018-02-27 21:30:05 1 HI
2 2017-04-12 00:00:00 1 HI
3 2017-05-21 22:05:00 1 HI
4 2018-01-22 00:00:00 1 HI
352867 2019-10-04 00:00:00 1 HI
352868 None 1 HI
352869 some_string 1 HI
Solution
df.astype(str).apply(pd.to_datetime, errors='coerce')
Mixed Numeric Values Strings
0 2017-07-06 00:00:00 NaT NaT
1 2018-02-27 21:30:05 NaT NaT
2 2017-04-12 00:00:00 NaT NaT
3 2017-05-21 22:05:00 NaT NaT
4 2018-01-22 00:00:00 NaT NaT
352867 2019-10-04 00:00:00 NaT NaT
352868 NaT NaT NaT
352869 NaT NaT NaT
I have 2 dataframes, taken from a larger frame (with df.head(x) ), both with the same index:
print df
val
DT
2017-03-06 00:00:00 1.06207
2017-03-06 00:02:00 1.06180
2017-03-06 00:04:00 1.06167
2017-03-06 00:06:00 1.06141
2017-03-06 00:08:00 1.06122
... ...
2017-03-10 21:50:00 1.06719
2017-03-10 21:52:00 1.06719
2017-03-10 21:54:00 1.06697
2017-03-10 21:56:00 1.06713
2017-03-10 21:58:00 1.06740
a and b are then taken from df
print a.index
print b.index
DatetimeIndex(['2017-03-06 00:32:00'], dtype='datetime64[ns]', name=u'DT', freq=None)
DatetimeIndex(['2017-03-06 00:18:00'], dtype='datetime64[ns]', name=u'DT', freq=None)
But, when I use a.index.item(), I get it in the format 1488759480000000000. That means when
I go to take a slice from df based on a and b , I get an empty dataframe
>>> df[a.index.item() : b.index.item()]
Empty DataFrame
and further to that, when I try to convert them both:
df[a.index.to_pydatetime() : b.index.to_pydatetime()]
TypeError: Cannot convert input [[datetime.datetime(2017, 3, 6, 0, 18)]] of type <type 'numpy.ndarray'> to Timestamp
This is infuriating, surely there should be continuity of objects when using item(). Can anyone give me some pointers?
You can use loc with first value of a and b:
df.loc[a.index[0] : b.index[0]]
Your solution working if convert to Timestamp:
print (df.loc[pd.Timestamp(a.index.item()): pd.Timestamp(b.index.item())])
I am working with python 3.5.2, pandas 0.18.1 and sqlite3.
In my data base, I have a column unix_time with INT for seconds since 1970. Ideally I want to read my dataframe from sqlite, and then create a time column which would correspond to the datetime or pandas.tslib.Timestamp conversion of the unix_time column that I woul only use for some processing and then drop before saving the dataframe back.
The issue is that when parsing the unix_time column using :
df = pd.read_from_sql_query("SELECT * FROM test", con, parse_dates=['unix_time'])
I obtain pandas.tslib.Timestamp types which is fine for my processing, but then I have to recreate my original unix_time column using :
df['unix_time'][i] = (df['unix_time'][i] - datetime(1970,1,1)).total_seconds()
which is really 'dirty'
First question : Do you have a better way?
I thought about giving up the unix time format and only use datetime format but the to_datetime method from pandas returns in fact pandas.tslib.Timestamp ... And anyway, doing so would force me to iterate over all rows which is a bad solution. (It is impossible to apply to_datetime on something else than a view over a single cell of the dataframe
Second question : Is it possible to apply it on a series?
My last try was with directly using df['time'] = datetime.datetime.fromtimestamp(df['unix_time']) but surprisingly, it also returns pandas.tslib.Timestamp.
In the end, knowing that I can only save unix timestamps or datetimes, my only choices for the moment are :
parsing but then having to convert them back to unix timestamp one by
one.
Or not parse it but have to convert them to pandas.tslib.Timestamp
one by one.
It would be great if I could convert a whole series.
Last question : Is there a way to convert a unix timestamps series to datetime (or at least pandas.tslib.Timestamp), or a pandas.tslib.Timestamp (or datetime) series to unix timestamps?
Thanks
EDIT:
During my processing, I extract a row that I want to append to my dataset. Apparently, the coversion to pandas.tslib.Timestamp appends implicitly when passing from dataframe to serie :
df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
df['Date'] = pd.to_datetime(df.UNX, unit='s')
print(df.Date.dtypes)
print(type(df['Date'][0]))
test = df.iloc[0]
print(type(test.Date))
new_df = test.to_frame().transpose() #from here, impossible to do : new_df.to_sql("test", con) because the type for 'Date' is not supported
print(new_df.Date.dtypes)
returns
datetime64[ns]
<class 'pandas.tslib.Timestamp'>
<class 'pandas.tslib.Timestamp'>
object
Is there a way to convert the 'Date' in new_df from pandas.tslib.Timestamp to datetime64[ns] or datetime.datetime (or simply str) ?
IIUC you can do it this way:
In [96]: df = pd.DataFrame({'UNX':pd.date_range('2016-01-01', freq='9999S', periods=10).astype(np.int64)//10**9})
In [97]: df
Out[97]:
UNX
0 1451606400
1 1451616399
2 1451626398
3 1451636397
4 1451646396
5 1451656395
6 1451666394
7 1451676393
8 1451686392
9 1451696391
Convert UNIX epoch to Python datetime:
In [98]: df['Date'] = pd.to_datetime(df.UNX, unit='s')
In [99]: df
Out[99]:
UNX Date
0 1451606400 2016-01-01 00:00:00
1 1451616399 2016-01-01 02:46:39
2 1451626398 2016-01-01 05:33:18
3 1451636397 2016-01-01 08:19:57
4 1451646396 2016-01-01 11:06:36
5 1451656395 2016-01-01 13:53:15
6 1451666394 2016-01-01 16:39:54
7 1451676393 2016-01-01 19:26:33
8 1451686392 2016-01-01 22:13:12
9 1451696391 2016-01-02 00:59:51
Convert datetime to UNIX epoch:
In [100]: df['UNX2'] = df.Date.astype('int64')//10**9
In [101]: df
Out[101]:
UNX Date UNX2
0 1451606400 2016-01-01 00:00:00 1451606400
1 1451616399 2016-01-01 02:46:39 1451616399
2 1451626398 2016-01-01 05:33:18 1451626398
3 1451636397 2016-01-01 08:19:57 1451636397
4 1451646396 2016-01-01 11:06:36 1451646396
5 1451656395 2016-01-01 13:53:15 1451656395
6 1451666394 2016-01-01 16:39:54 1451666394
7 1451676393 2016-01-01 19:26:33 1451676393
8 1451686392 2016-01-01 22:13:12 1451686392
9 1451696391 2016-01-02 00:59:51 1451696391
Check:
In [102]: df.UNX.eq(df.UNX2).all()
Out[102]: True
Round trip between Pandas Timestamp and Unix Seconds (since 1970-01-01):
date_in = pd.to_datetime("2022-04-07")
# type(date_in) is: pandas._libs.tslibs.timestamps.Timestamp
unix_seconds = date_in.value//10**9
date_out = pd.to_datetime(unix_seconds, unit="s")
Output:
date_in
Out[1]: Timestamp('2021-04-07 00:00:00')
unix_seconds
Out[2]: 1617753600
date_out
Out[3]: Timestamp('2021-04-07 00:00:00')
Follow up from Summing across rows of Pandas Dataframe and Pandas Dataframe object types fillna exception over different datatypes
One of the columns that I am aggregating using
df.groupby(['stock', 'same1', 'same2'], as_index=False)['positions'].sum()
this method is not very forgiving if there are missing data. If there are any missing data in same1, same2, etc it pads totally unrelated values. Workaround is to do a fillna loop over the columns to replace missing strings with '' and missing numbers with zero solves the problem.
I do however have one column with missing dates as well. column type is 'object' with nan of type float and in the missing cells and datetime objects in the existing data fields. important that I know that the data is missing, i.e. the missing indicator must survive the groupby transformation.
Dataset outlining the problem:
csv file that I use as input is:
Date,Stock,Position,Expiry,same
2012/12/01,A,100,2013/06/01,AA
2012/12/01,A,200,2013/06/01,AA
2012/12/01,B,300,,BB
2012/6/01,C,400,2013/06/01,CC
2012/6/01,C,500,2013/06/01,CC
I then read in file:
df = pd.read_csv('example', parse_dates=[0])
def convert_date(d):
'''Converts YYYY/mm/dd to datetime object'''
if type(d) != str or len(d) != 10: return np.nan
dd = d[8:]
mm = d[5:7]
YYYY = d[:4]
return datetime.datetime(int(YYYY), int(mm), int(dd))
df['Expiry'] = df.Expiry.map(convert_date)
df
df looks like:
Date Stock Position Expiry same
0 2012-12-01 00:00:00 A 100 2013-06-01 00:00:00 AA
1 2012-12-01 00:00:00 A 200 2013-06-01 00:00:00 AA
2 2012-12-01 00:00:00 B 300 NaN BB
3 2012-06-01 00:00:00 C 400 2013-06-01 00:00:00 CC
4 2012-06-01 00:00:00 C 500 2013-06-01 00:00:00 CC
can quite easily change the convert_date function to pop anything else for missing data in Expiry column.
Then using:
df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
to aggregate the Position column. Get a TypeError: can't compare datetime.datetime to str with any non date that I plug into missing date data. Important for later functionality to know if Expiry is missing.
You need to convert your dates to the datetime64[ns] dtype (which manages how datetimes work). An object column is not efficient nor does it deal well with datelikes. datetime64[ns] allow missing values usingNaT (not-a-time), see here: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#datetimes
In [6]: df['Expiry'] = pd.to_datetime(df['Expiry'])
# alternative way of reading in the data (in 0.11.1, as ``NaT`` will be set
# for missing values in a datelike column)
In [4]: df = pd.read_csv('example',parse_dates=['Date','Expiry'])
In [9]: df.dtypes
Out[9]:
Date datetime64[ns]
Stock object
Position int64
Expiry datetime64[ns]
same object
dtype: object
In [7]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
Out[7]:
Stock Expiry same Position
0 A 2013-06-01 00:00:00 AA 300
1 B NaT BB 300
2 C 2013-06-01 00:00:00 CC 900
In [8]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum().dtypes
Out[8]:
Stock object
Expiry datetime64[ns]
same object
Position int64
dtype: object
I have a Pandas data frame, one of the column contains date strings in the format YYYY-MM-DD
For e.g. '2013-10-28'
At the moment the dtype of the column is object.
How do I convert the column values to Pandas date format?
Essentially equivalent to #waitingkuo, but I would use pd.to_datetime here (it seems a little cleaner, and offers some additional functionality e.g. dayfirst):
In [11]: df
Out[11]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [12]: pd.to_datetime(df['time'])
Out[12]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
Name: time, dtype: datetime64[ns]
In [13]: df['time'] = pd.to_datetime(df['time'])
In [14]: df
Out[14]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
Handling ValueErrors
If you run into a situation where doing
df['time'] = pd.to_datetime(df['time'])
Throws a
ValueError: Unknown string format
That means you have invalid (non-coercible) values. If you are okay with having them converted to pd.NaT, you can add an errors='coerce' argument to to_datetime:
df['time'] = pd.to_datetime(df['time'], errors='coerce')
Use astype
In [31]: df
Out[31]:
a time
0 1 2013-01-01
1 2 2013-01-02
2 3 2013-01-03
In [32]: df['time'] = df['time'].astype('datetime64[ns]')
In [33]: df
Out[33]:
a time
0 1 2013-01-01 00:00:00
1 2 2013-01-02 00:00:00
2 3 2013-01-03 00:00:00
I imagine a lot of data comes into Pandas from CSV files, in which case you can simply convert the date during the initial CSV read:
dfcsv = pd.read_csv('xyz.csv', parse_dates=[0]) where the 0 refers to the column the date is in.
You could also add , index_col=0 in there if you want the date to be your index.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Now you can do df['column'].dt.date
Note that for datetime objects, if you don't see the hour when they're all 00:00:00, that's not pandas. That's iPython notebook trying to make things look pretty.
If you want to get the DATE and not DATETIME format:
df["id_date"] = pd.to_datetime(df["id_date"]).dt.date
Another way to do this and this works well if you have multiple columns to convert to datetime.
cols = ['date1','date2']
df[cols] = df[cols].apply(pd.to_datetime)
It may be the case that dates need to be converted to a different frequency. In this case, I would suggest setting an index by dates.
#set an index by dates
df.set_index(['time'], drop=True, inplace=True)
After this, you can more easily convert to the type of date format you will need most. Below, I sequentially convert to a number of date formats, ultimately ending up with a set of daily dates at the beginning of the month.
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
#Convert to monthly dates
df.index = df.index.to_period(freq='M')
#Convert to strings
df.index = df.index.strftime('%Y-%m')
#Convert to daily dates
df.index = pd.DatetimeIndex(data=df.index)
For brevity, I don't show that I run the following code after each line above:
print(df.index)
print(df.index.dtype)
print(type(df.index))
This gives me the following output:
Index(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='object', name='time')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', name='time', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
PeriodIndex(['2013-01', '2013-01', '2013-01'], dtype='period[M]', name='time', freq='M')
period[M]
<class 'pandas.core.indexes.period.PeriodIndex'>
Index(['2013-01', '2013-01', '2013-01'], dtype='object')
object
<class 'pandas.core.indexes.base.Index'>
DatetimeIndex(['2013-01-01', '2013-01-01', '2013-01-01'], dtype='datetime64[ns]', freq=None)
datetime64[ns]
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
For the sake of completeness, another option, which might not be the most straightforward one, a bit similar to the one proposed by #SSS, but using rather the datetime library is:
import datetime
df["Date"] = df["Date"].apply(lambda x: datetime.datetime.strptime(x, '%Y-%d-%m').date())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null object
1 endDay 110526 non-null object
import pandas as pd
df['startDay'] = pd.to_datetime(df.startDay)
df['endDay'] = pd.to_datetime(df.endDay)
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 startDay 110526 non-null datetime64[ns]
1 endDay 110526 non-null datetime64[ns]
Try to convert one of the rows into timestamp using the pd.to_datetime function and then use .map to map the formular to the entire column