I've got a column of timestamps (in ms) in a pandas DataFrame. From the timestamps, I'm trying to derive the hour, minute, day of week, and month of the timestamp in separate columns.
I've tried using the apply function across the column, but to no avail. So, I took a very naive (but not very concise) approach to create these columns:
import pandas as pd

df = pd.DataFrame({'time': [1401811621559, 1402673694105, 1402673749561, 1401811615479, 1402673708254],
                   'person': ['Harry', 'Ann', 'Sue', 'Jeremy', 'Anne']})
df['time'] = pd.to_datetime(df.time, unit='ms')
days = []
tod = []
month = []
minutes = []
for row in df['time']:
    days.append(row.strftime('%w'))
    tod.append(row.strftime('%H'))
    month.append(row.strftime('%m'))
    minutes.append(row.strftime('%M'))
df['dayOfWeek'] = days
df['timeOfDay'] = tod
df['month'] = month
df['minutes'] = minutes
Is there a way to do this that is more like this?
df['dayOfWeek'] = df['time'].apply(strftime('%w'),axis = 1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'strftime' is not defined
At the moment you have to wrap the column in a DatetimeIndex:
In [11]: dti = pd.DatetimeIndex(df['time'])
In [12]: dti.dayofweek
Out[12]: array([1, 4, 4, 1, 4])
In [13]: dti.time
Out[13]:
array([datetime.time(16, 7, 1, 559000), datetime.time(15, 34, 54, 105000),
datetime.time(15, 35, 49, 561000), datetime.time(16, 6, 55, 479000),
datetime.time(15, 35, 8, 254000)], dtype=object)
In [14]: dti.month
Out[14]: array([6, 6, 6, 6, 6])
In [15]: dti.minute
Out[15]: array([ 7, 34, 35, 6, 35])
etc.
See this issue for making these methods directly available from a datetime series.
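Note: newer pandas versions expose these directly on a datetime Series through the .dt accessor, so the DatetimeIndex wrapper should no longer be necessary. A minimal sketch using the columns from the question:
df['dayOfWeek'] = df['time'].dt.dayofweek  # Monday=0..Sunday=6, unlike %w where Sunday=0
df['timeOfDay'] = df['time'].dt.hour
df['month'] = df['time'].dt.month
df['minutes'] = df['time'].dt.minute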
You might also make it a lambda function:
df['dayOfWeek2'] = df.time.apply(lambda x: x.strftime('%w'))
Now typing
df.dayOfWeek2 == df.dayOfWeek
yields
0 True
1 True
2 True
3 True
4 True
dtype: bool
Yes there is, modifying your code slightly...
def timeGroups(row):
    row['days'] = row['time'].strftime('%w')
    # do the same for month, minutes, etc.
    return row

df = df.apply(timeGroups, axis=1)
Related
I have a question about selecting a range in a pandas DataFrame in Python. I have a column with times and a column with values. I would like to select all the rows with times between 6 a.m. and 6 p.m. (so from 6:00:00 to 18:00:00). I've succeeded at selecting all the night times (between 18:00:00 and 6:00:00), but if I apply the same to the day times, it doesn't work. Is there something wrong with my syntax? Below is a minimal working example. timeslice2 returns an empty DataFrame in my case.
import pandas as pd
times = ("1:00:00", "2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00", "7:00:00", "8:00:00", "9:00:00", \
"10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", "15:00:00", "16:00:00", "17:00:00", \
"18:00:00", "19:00:00", "20:00:00", "21:00:00", "22:00:00", "23:00:00")
values = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
data = zip(times, values)
colnames = ["Time", "values"]
df = pd.DataFrame(data=data, columns=colnames)
print(df)
# selecting only night times
timeslice1 = df[(df['Time'] > '18:00:00') & (df['Time'] <= '6:00:00')]
# selecting only day times
timeslice2 = df[(df['Time'] > '6:00:00') & (df['Time'] <= '18:00:00')]
print(timeslice1)
print(timeslice2)
I've been able to select the right range with this answer, but it seems strange to me that the above doesn't work. Moreover, if I convert my 'Time' column to datetime, as needed, it uses today's date, and I don't want that.
This way it works. The first range, if treated as datetimes, has no results, because 18:00:00-to-6:00:00 would have to span two different dates (days) in chronological order, while every parsed time falls on the same date.
import pandas as pd
times = ("1:00:00", "2:00:00", "3:00:00", "4:00:00", "5:00:00", "6:00:00", "7:00:00", "8:00:00", "9:00:00", \
"10:00:00", "11:00:00", "12:00:00", "13:00:00", "14:00:00", "15:00:00", "16:00:00", "17:00:00", \
"18:00:00", "19:00:00", "20:00:00", "21:00:00", "22:00:00", "23:00:00")
values = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
data = zip(times, values)
colnames = ["Time", "values"]
df = pd.DataFrame(data=data, columns=colnames)
print('Original df \n',df)
# selecting only night times (lexicographic string comparison)
timeslice1 = df[(df['Time'] > '18:00:00') & (df['Time'] <= '6:00:00')]
# selecting only day times
# convert Time column to datetime first
df['Time'] = pd.to_datetime(df['Time'])
timeslice2 = df[(df['Time'] > '6:00:00') & (df['Time'] <= '18:00:00')].copy()
# convert the slice back to strings
timeslice2["Time"] = timeslice2["Time"].dt.strftime('%H:%M:%S')
print('Slice 1 \n', timeslice1)
print('Slice 2 \n', timeslice2)
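A different sketch, not from the answer above: comparing plain datetime.time objects sidesteps both the lexicographic string ordering and the implied date. This works whether the 'Time' column still holds strings or has been converted to datetimes:
import datetime

t = pd.to_datetime(df['Time']).dt.time
day = df[(t > datetime.time(6)) & (t <= datetime.time(18))]
# note the night slice needs | (OR), since it wraps around midnight
night = df[(t > datetime.time(18)) | (t <= datetime.time(6))]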
I'm trying to figure out the easiest way to automate the conversion of an array of seconds into datetime. I'm very familiar with converting seconds since 1970 into datetime, but the values that I have here are the seconds elapsed in a given day. For example, 14084 is the number of seconds that have passed on 2011-11-11, and I was able to generate the datetime below.
# assumes: import datetime as dt; from datetime import date, time
str(dt.timedelta(seconds=14084))
Out[245]: '3:54:44'
dt.datetime.combine(date(2011, 11, 11), time(3, 54, 44))
Out[250]: datetime.datetime(2011, 11, 11, 3, 54, 44)
Is there a faster way to do this conversion for an array?
numpy has support for arrays of datetimes with a timedelta type for manipulating them:
https://numpy.org/doc/stable/reference/arrays.datetime.html
e.g. you can do this:
import numpy as np
date_array = np.arange('2005-02', '2005-03', dtype='datetime64[D]')
date_array += np.timedelta64(4, 's') # Add 4 seconds
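Applied to the question's seconds-in-a-day case, a sketch of this in plain numpy (the anchor date 2011-11-11 is taken from the question):
import numpy as np

seconds = np.array([14084, 14085, 15003])
# anchor at midnight of the given day, then add the elapsed seconds
result = np.datetime64('2011-11-11') + seconds.astype('timedelta64[s]')
# result: array(['2011-11-11T03:54:44', '2011-11-11T03:54:45',
#                '2011-11-11T04:10:03'], dtype='datetime64[s]')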
Alternatively, if you have an array of seconds, you could convert it into an array of timedeltas and add that to a fixed datetime.
Say you have
seconds = [14084, 14085, 15003]
You can use pandas
import pandas as pd
series = pd.to_timedelta(seconds, unit='s') + pd.to_datetime('2011-11-11')
series = series.to_series().reset_index(drop=True)
print(series)
0 2011-11-11 03:54:44
1 2011-11-11 03:54:45
2 2011-11-11 04:10:03
dtype: datetime64[ns]
Or a list comprehension:
import datetime

list_comp = [datetime.datetime(2011, 11, 11) + datetime.timedelta(seconds=s)
             for s in seconds]
print(list_comp)
[datetime.datetime(2011, 11, 11, 3, 54, 44), datetime.datetime(2011, 11, 11, 3, 54, 45), datetime.datetime(2011, 11, 11, 4, 10, 3)]
I have
In [1]: from datetime import datetime
In [2]: datetime.now().isoformat()
Out[2]: '2019-11-05T14:55:58.267650'
and I want to get the isoformat of 10 seconds from now + change the format to yyyymmddThhmmss.
The format change can be done by:
In [6]: datetime.now().isoformat()
Out[6]: '2019-11-05T14:58:36.572646'
In [7]: datetime.now().isoformat().split('.')[0].replace('-', '').replace(':', '')
Out[7]: '20191105T145923'
But how can I add time?
Maybe use datetime.timedelta(), like this:
>>> import datetime
>>> now = datetime.datetime.now()
>>> now
datetime.datetime(2019, 11, 5, 10, 9, 16, 129672)
>>> new_date = now + datetime.timedelta(seconds=30)
>>> new_date
datetime.datetime(2019, 11, 5, 10, 9, 46, 129672)
Now format the new date as string:
>>> new_date.isoformat().split('.')[0].replace('-', '').replace(':', '')
'20191105T100946'
Or way cleaner using .strftime():
>>> new_date.strftime("%Y%m%dT%H%M%S")
'20191105T100946'
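Putting the two steps together for the 10 seconds asked about (the output is illustrative; it depends on the current time):
>>> (datetime.datetime.now() + datetime.timedelta(seconds=10)).strftime("%Y%m%dT%H%M%S")
'20191105T100926'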
I have dates that I'm pulling into a dataframe at regular intervals.
The data is generally well-formed, but sometimes there are bad data in an otherwise date column.
I would always expect a date in the parsed 9-field (struct_time) form:
(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)
How should I check and fix this?
What I would like to do is replace whatever is not a date, with a date based on a variable that represents the last_update + 1/2 the update interval, so the items are not filtered out by later functions.
Data as shown is published_parsed from feedparser.
import pandas as pd
import datetime
# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
# date is fine
df_date = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
Pseudocode
if the original_date is valid
return original_date
else
return substitute_date
import calendar
import numpy as np
import pandas as pd
def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x)  # 1
    except (TypeError, ValueError):
        return np.nan
df = pd.DataFrame({'orig': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 30, 23, 59, 12, 0, 362, 0)]})
ts = df['orig'].apply(tuple_to_timestamp) # 2
# 0 1451349050
# 1 1451347152
# 2 NaN
# 3 NaN
# 4 1451519952
# Name: orig, dtype: float64
ts = ts.interpolate() # 3
# 0 1451349050
# 1 1451347152
# 2 1451404752
# 3 1451462352
# 4 1451519952
# Name: orig, dtype: float64
df['fixed'] = pd.to_datetime(ts, unit='s') # 4
print(df)
yields
orig fixed
0 (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1 (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2 None 2015-12-29 15:59:12
3 2015-12-30 07:59:12
4 (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12
Explanation:
calendar.timegm converts each time-tuple to a timestamp. Unlike
time.mktime, it interprets the time-tuple as being in UTC, not local time.
apply calls tuple_to_timestamp for each row of df['orig'].
The nice thing about timestamps is that they are numeric, so you can then use
numerical methods such as Series.interpolate to fill in NaNs with interpolated
values. Note that the two NaNs do not get filled with same interpolated value; their values are linearly interpolated based on their position as given by ts.index.
pd.to_datetime converts the timestamps to datetimes.
When working with dates and times in pandas, convert them to pandas timestamps using pandas.to_datetime. To use this function, we first convert each tuple into a string with just the date and time elements. For your case, values that are not sequences of length 9 are considered bad and replaced with an empty string ''.
# convert each tuple into a "Y/m/d H:M:S" string;
# only elements that are sequences of length 9 will be parsed
dates_df = df_date_ugly.applymap(
    lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(x[0], x[1], x[2], x[3], x[4], x[5]) if len(x) == 9 else '')
# convert to pandas timestamps; unparseable strings become NaT
dates = pd.to_datetime(dates_df['date'], errors='coerce')
date
0 2015-12-29 00:30:50
1 2015-12-28 23:59:12
2 NaT
3 NaT
4 2015-12-28 23:59:12
Find the indices where the dates are missing using pd.isnull():
>>> missing = dates.index[pd.isnull(dates)]
>>> missing
Int64Index([2, 3], dtype='int64')
To set the missing dates to the midpoint between two known dates:
start_date = dates.iloc[0]
end_date = dates.iloc[4]
missing_date = start_date + (end_date - start_date) / 2
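To finish, a short sketch (assuming the dates Series and missing index built above) that writes the midpoint back, so later functions don't filter these rows out:
dates.loc[missing] = missing_date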
I am using pandas groupby and was wondering how to implement the following:
Dataframes A and B have the same variable to index on, but A has 20 unique index values and B has 5.
I want to create a dataframe C that contains rows whose indices are present in A and not in B.
Assume that the 5 unique index values in B are all present in A. C in this case would have only those rows associated with index values in A and not in B (i.e. 15).
Using inner, outer, left and right do not do this (unless I misread something).
In SQL I might do this as where A.index <> (not equal) B.index
My left-handed solution:
a) get the respective index columns from each data set, say x and y.
import pandas as pd

def match(x, y, compareCol):
    """x and y are Series.
    compareCol is the name of the Series being returned; it is the same
    as the name of x and y in their respective dataframes."""
    x = x.unique()  # need to compare arrays; unique() returns arrays
    y = y.unique()
    new = []
    for item in x:
        if item not in y:
            new.append(item)
    return pd.DataFrame(pd.Series(new, name=compareCol))
b) now do a left join of this onto data set A.
I am reasonably confident that my elementwise comparison is slow as a tortoise on weed with no
motivation.
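A vectorized sketch of the same idea using Series.isin (the column name 'key' is illustrative, standing in for whatever the shared variable is actually called):
# keep only rows of A whose key value never appears in B
C = A[~A['key'].isin(B['key'])]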
What about something like:
A.ix[A.index - B.index]
A.index - B.index is a set difference:
In [30]: A.index
Out[30]: Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int64)
In [31]: B.index
Out[31]: Int64Index([ 0, 1, 2, 3, 999], dtype=int64)
In [32]: A.index - B.index
Out[32]: Int64Index([ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int64)
In [33]: B.index - A.index
Out[33]: Int64Index([999], dtype=int64)
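Note that in current pandas both .ix and Index subtraction have been removed; as far as I know the equivalent spelling today is:
C = A.loc[A.index.difference(B.index)]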