I am reading from an Excel file that has a column with times. Since I can't upload the actual file, I created the variable timeIntervals to illustrate.
When I run this code...
import pandas as pd
import datetime
from pyPython import *
def main():
    timeIntervals = pd.date_range("11:00", "21:30", freq="30min").time
    df = pd.DataFrame({"Times": timeIntervals})
    grp = pd.Grouper(key="Times", freq="3H")
    value = df.groupby(grp).count()
    print(value)

if __name__ == '__main__':
    main()
I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
How can I use pandas.Grouper in combination with DataFrame.groupby to group the dataframe df into discrete 3-hour time ranges? Are there any alternatives?
A few issues:
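Calling .time on a date_range turns the values into plain datetime.time objects, which no longer have the datetime dtype that pd.Grouper needs to resample on a time window.
count counts the non-NaN values in each remaining column, so a column must be selected explicitly: once Times is used as the grouping key, the sample frame has no other columns left to count.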
We can fix the first issue by turning the time column into a datetime:
timeIntervals = pd.date_range("11:00", "21:30", freq="30min") # remove time here
df = pd.DataFrame({"Times": timeIntervals})
If we are not creating these values from a date_range we can simply convert the column to_datetime:
df['Times'] = pd.to_datetime(df['Times'], format='%H:%M:%S')
Then we can groupby and count:
value = df.groupby(pd.Grouper(key="Times", freq="3H"))['Times'].count()
If needed we can update the index to only reflect the time after grouping:
value.index = value.index.time
As a result value becomes:
09:00:00 2
12:00:00 6
15:00:00 6
18:00:00 6
21:00:00 2
Name: Times, dtype: int64
All together with to_datetime:
def main():
    time_intervals = pd.date_range("11:00", "21:30", freq="30min").time
    df = pd.DataFrame({"Times": time_intervals})
    # Convert to DateTime
    df['Times'] = pd.to_datetime(df['Times'], format='%H:%M:%S')
    # Group and count specific column
    value = df.groupby(pd.Grouper(key="Times", freq="3H"))['Times'].count()
    # Retrieve only Time information
    value.index = value.index.time
    print(value)
Or without retrieving time before DataFrame creation:
def main():
    time_intervals = pd.date_range("11:00", "21:30", freq="30min")
    df = pd.DataFrame({"Times": time_intervals})
    value = df.groupby(pd.Grouper(key="Times", freq="3H"))['Times'].count()
    value.index = value.index.time
    print(value)
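As for the "other alternatives" part of the question, here is a minimal sketch that avoids pd.Grouper by flooring each timestamp to the start of its 3-hour window and grouping on that. It goes through strings before calling to_datetime, on the assumption that some pandas versions do not accept datetime.time objects directly; the names as_datetime, window_start and counts are made up for this illustration.

```python
import pandas as pd

# Same sample data as in the question: plain datetime.time objects.
times = pd.date_range("11:00", "21:30", freq="30min").time
df = pd.DataFrame({"Times": times})

# Go through strings so the values become real datetimes on a dummy date
# (1900-01-01); only the time-of-day part matters here.
as_datetime = pd.to_datetime(df["Times"].astype(str), format="%H:%M:%S")

# Floor every timestamp to the start of its 3-hour window, then drop the date.
window_start = as_datetime.dt.floor("3h").dt.time

# Count the rows that fall into each window.
counts = df.groupby(window_start).size()
print(counts)
```

The windows start at multiples of three hours from midnight, so the bins should line up with the ones pd.Grouper produces above.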
Related
There is another question that is eleven years old with a similar title.
I have a pandas dataframe with a column of datetime.time values.
val time
a 12:30:01.323
b 12:48:04.583
c 14:38:29.162
I want to convert the time column from UTC to EST.
I tried to do dataframe.tz_localize('utc').tz_convert('US/Eastern') but it gave me the following error: RangeIndex Object has no attribute tz_localize
tz_localize and tz_convert work on the index of the DataFrame. So you can do the following:
convert the "time" to Timestamp format
set the "time" column as index and use the conversion functions
reset_index()
keep only the time
Try:
dataframe["time"] = pd.to_datetime(dataframe["time"],format="%H:%M:%S.%f")
output = (dataframe.set_index("time")
.tz_localize("utc")
.tz_convert("US/Eastern")
.reset_index()
)
output["time"] = output["time"].dt.time
>>> output
time val
0 15:13:12.349211 a
1 15:13:13.435233 b
2 15:13:14.345233 c
to_datetime accepts an argument utc (bool) which, when True, makes the resulting timestamps timezone-aware in UTC.
Applied to a column, to_datetime returns a datetime Series whose .dt accessor has a tz_convert method. This method converts tz-aware timestamps from one timezone to another.
So, this transformation could be concisely written as
df = pd.DataFrame(
[['a', '12:30:01.323'],
['b', '12:48:04.583'],
['c', '14:38:29.162']],
columns=['val', 'time']
)
df['time'] = pd.to_datetime(df.time, utc=True, format='%H:%M:%S.%f')
# convert string to timezone aware field ^^^
df['time'] = df.time.dt.tz_convert('EST').dt.time
# convert timezone, discarding the date part ^^^
This produces the following dataframe:
val time
0 a 07:30:01.323000
1 b 07:48:04.583000
2 c 09:38:29.162000
This could also be a 1-liner as below:
pd.to_datetime(df.time, utc=True, format='%H:%M:%S.%f').dt.tz_convert('EST').dt.time
list_temp = []
for row in df['time_UTC']:
    list_temp.append(pd.Timestamp(row, tz='UTC').tz_convert('US/Eastern'))
df['time_EST'] = list_temp
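One caveat with time-only values: 'EST' is a fixed UTC-5 offset, whereas a zone such as US/Eastern observes daylight saving time, so the converted time depends on the date. A rough sketch, assuming the date of the readings is known (assumed_date below is a made-up value, and time_eastern is a hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame(
    {"val": ["a", "b", "c"],
     "time": ["12:30:01.323", "12:48:04.583", "14:38:29.162"]}
)

# Hypothetical date on which the times were recorded (an assumption;
# the question only provides the time of day).
assumed_date = "2021-07-01"

# Build full UTC timestamps so the DST rules for that date can be applied.
stamps = pd.to_datetime(assumed_date + " " + df["time"], utc=True)

# Convert to US Eastern time and keep only the time of day.
df["time_eastern"] = stamps.dt.tz_convert("America/New_York").dt.time
print(df)
```

With a July date the offset is UTC-4 rather than UTC-5, so the results differ by one hour from the fixed 'EST' conversion.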
I have a half-hourly dataframe with two columns. I would like to take all the hours of a day, then do some calculation which returns one number and assign that to all half-hours of that day. Below is an example code:
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
print(df)
DATA1 DATA2
2003-01-01 08:30:00 NaN 79.990866
2003-01-01 09:00:00 NaN 5.461791
2003-01-01 09:30:00 NaN 68.892447
2003-01-01 10:00:00 NaN 44.823338
2003-01-01 10:30:00 NaN 57.860309
... ... ...
2003-01-04 22:00:00 0.394574 31.943657
2003-01-04 22:30:00 0.140950 78.275981
Then I would like to apply the following function which returns one number:
def my_f(data1,data2):
    y = data1[data2>20]
    return np.median(y)
This function selects all data in DATA1 based on a condition (DATA2>20) then takes the median of all these data.
How can I create a third column (let's say result) and assign back this fixed number (y) for all half-hours data of that day?
My guess is I should use something like this:
daily_tmp = df.resample('D').apply(my_f)
df['results'] = daily_tmp.reindex(df.index,method='ffill')
If this approach is correct, how can I pass my_f with two arguments to resample.apply()?
Or is there any other way to do the similar task?
My solution assumes that you have a fairly small dataset. Please let me know if it is not the case.
I would decompose your goal as follows:
(1) group data by day
(2) for each day, compute some complicated function
(3) assign the resulting value to the half-hours.
# specify the day for each datapoint
df['day'] = df.index.map(lambda x: x.strftime('%Y-%m-%d'))
# compute a complicated function for each day and store the result
mapping = {}
for day, data_for_the_day in df.groupby(by='day'):
    # assign to mapping[day] the result of a complicated function
    # (column names must match the frame exactly: 'DATA1' and 'DATA2')
    mapping[day] = np.mean(data_for_the_day[data_for_the_day['DATA2'] > 20]['DATA1'])
# assign the values to half-hours
df['result'] = df.index.map(lambda x: mapping.get(x.strftime('%Y-%m-%d'), np.nan) if x.strftime('%M')=='30' else np.nan)
That's not the neatest solution, but it is straight-forward, easy-to-understand, and works well on small datasets.
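For comparison, a vectorized sketch of the same idea that writes the daily value to every row of that day: mask DATA1 where the DATA2 > 20 condition fails, then broadcast the per-day median with groupby(...).transform. Note that the pandas median skips NaN, which is an assumption about the desired behaviour (np.median would return NaN if any selected value is NaN). The seeded random data and the names days and masked are made up for the example.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the half-hourly frame from the question.
rng = np.random.default_rng(0)
dates = pd.date_range("2003-01-01 08:30:00", "2003-01-05", freq="30min")
df = pd.DataFrame(
    {"DATA1": rng.random(len(dates)), "DATA2": rng.random(len(dates)) * 100},
    index=dates,
)
df.iloc[:50, 0] = np.nan  # the real data contains NaN as well

# One label per row: midnight of the day the row belongs to.
days = df.index.normalize()

# Keep DATA1 only where the condition holds; everything else becomes NaN.
masked = df["DATA1"].where(df["DATA2"] > 20)

# Per-day median (NaN-skipping), repeated for every half-hour of that day.
df["result"] = masked.groupby(days).transform("median")
print(df.tail())
```

Because transform returns a result aligned with the original index, no reindex or forward fill is needed afterwards.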
Here is a fast way to do it.
First, import libraries :
import time
import pandas as pd
import numpy as np
import datetime as dt
Second, the code to achieve it:
%%time
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
#### Create a unique marker per hour
df['Date'] = df.index
df['Date'] = df['Date'].dt.strftime(date_format='%Y-%m-%d %H')
#### Then Stipulate some conditions
_condition_1 = df.Date == df.Date.shift(-1) # if full hour
_condition_2 = df.DATA2 > 20 # yours
_condition_3 = df.Date == df.Date.shift(1) # if half an hour
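#### Now, report the median where conditions 1 and 2 are fulfilled
#### (the median of two values is their mean, so the sum must be parenthesized before dividing)
df['result'] = np.where(_condition_1 & _condition_2,(df.DATA1+df.DATA1.shift(-1))/2,0)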
#### Fill the hours with median
df['result'] = np.where(_condition_3,df.result.shift(1),df.result)
#### Drop useless column
df = df.drop(['Date'],axis=1)
df[df.DATA2>20].tail(20)
Third, the output (posted as an image in the original answer, not reproduced here).
I have to calculate mean() of time column, but this column type is string, how can I do it?
id time
1 1h:2m
2 1h:58m
3 35m
4 2h
...
You can use a regex to extract the hours and minutes. To calculate the mean time in minutes:
h = df['time'].str.extract(r'(\d{1,2})h').fillna(0).astype(int)
m = df['time'].str.extract(r'(\d{1,2})m').fillna(0).astype(int)
(h * 60 + m).mean()
Result:
0 83.75
dtype: float64
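The same extraction can also feed real Timedelta values, so that the mean comes back as a duration rather than a number of minutes. This is only a sketch: the named groups h and m and the variable names parts and durations are made up here.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "time": ["1h:2m", "1h:58m", "35m", "2h"]})

# Pull optional hour and minute parts into columns "h" and "m";
# a missing part comes back as NaN and is treated as zero.
parts = (
    df["time"]
    .str.extract(r"(?:(?P<h>\d+)h)?:?(?:(?P<m>\d+)m)?")
    .astype(float)
    .fillna(0)
)

# Combine both parts into one Timedelta per row.
durations = pd.to_timedelta(parts["h"], unit="h") + pd.to_timedelta(parts["m"], unit="min")

print(durations.mean())  # 0 days 01:23:45
```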
This is largely inspired by "How to construct a timedelta object from a simple string", but you can do it as below:
import re
import datetime
import pandas as pd

def convertToSecond(time_str):
    # optional hours, minutes and seconds parts, separated by optional colons
    regex = re.compile(r'((?P<hours>\d+?)h)?:*((?P<minutes>\d+?)m)?:*((?P<seconds>\d+?)s)?')
    parts = regex.match(time_str)
    if not parts:
        return
    parts = parts.groupdict()
    time_params = {}
    for (name, param) in parts.items():
        if param:
            time_params[name] = int(param)
    return datetime.timedelta(**time_params).total_seconds()

df = pd.DataFrame({
    'time': ['1h:2m', '1h:58m', '35m', '2h'],})
df['inSecond'] = df['time'].apply(convertToSecond)
mean_inSecond = df['inSecond'].mean()
print(f"Mean of Time Column: {datetime.timedelta(seconds=mean_inSecond)}")
Result:
Mean of Time Column: 1:23:45
Another possibility is to convert your string column into timedelta values (since they seem to be durations rather than times of day).
Since your strings are not all formatted in the same way, you unfortunately cannot use pandas' to_timedelta function. However, parser from dateutil has an option fuzzy that you can use to convert your column to datetime. If you subtract today's midnight from that, you get the value as a timedelta.
import pandas as pd
from dateutil import parser
from datetime import date
from datetime import datetime
df = pd.DataFrame([[1,'1h:2m'],[2,'1h:58m'],[3,'35m'],[4,'2h']],columns=['id','time'])
today = date.today()
midnight = datetime.combine(today, datetime.min.time())
df['time'] = df['time'].apply(lambda x: (parser.parse(x, fuzzy=True)) - midnight)
This will convert your dataframe like this (print(df)):
id time
0 1 01:02:00
1 2 01:58:00
2 3 00:35:00
3 4 02:00:00
from which you can calculate the mean using print(df['time'].mean()):
0 days 01:23:45
Full example: https://ideone.com/Aze9mR
I am using Python to do some data cleaning. I've used the datetime module to split the date-time values and tried to create another column with just the time.
My script works but it just takes the last value of the data frame.
Here is the code:
import datetime
i = 0
for index, row in df.iterrows():
    date = datetime.datetime.strptime(df.iloc[i, 0], "%Y-%m-%dT%H:%M:%SZ")
    df['minutes'] = date.minute
    i = i + 1
This is the dataframe (posted as an image in the original question, not reproduced here).
df['minutes'] = date.minute reassigns the entire 'minutes' column with the scalar value date.minute from the last iteration.
You don't need a loop here, as is true in the vast majority of cases when using pandas.
You can use vectorized assignment instead; just replace 'source_column_name' with the name of the column holding the source data.
df['minutes'] = pd.to_datetime(df['source_column_name'], format='%Y-%m-%dT%H:%M:%SZ').dt.minute
It is also most likely that you won't need to specify format as pd.to_datetime is fairly smart.
Quick example:
df = pd.DataFrame({'a': ['2020.1.13', '2019.1.13']})
df['year'] = pd.to_datetime(df['a']).dt.year
print(df)
outputs
a year
0 2020.1.13 2020
1 2019.1.13 2019
Seems like you're trying to get the time column from the datetime which is in string format. That's what I understood from your post.
Could you give this a shot?
from datetime import datetime
import pandas as pd

def get_time(date_cell):
    dt = datetime.strptime(date_cell, "%Y-%m-%dT%H:%M:%SZ")
    return datetime.strftime(dt, "%H:%M:%SZ")

df['time'] = df['date_time'].apply(get_time)
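The same result can be reached without a Python-level function by parsing the column once and using the .dt accessor. A small sketch, assuming the source column is called date_time as in the snippet above; the two sample rows are invented for the illustration.

```python
import pandas as pd

# Made-up sample rows in the same string format as the question.
df = pd.DataFrame({"date_time": ["2020-01-13T08:15:30Z", "2020-01-13T21:45:05Z"]})

# Parse the whole column in one vectorized call.
parsed = pd.to_datetime(df["date_time"], format="%Y-%m-%dT%H:%M:%SZ")

# Each row gets its own value, unlike the scalar assignment inside the loop.
df["minutes"] = parsed.dt.minute
df["time"] = parsed.dt.strftime("%H:%M:%S")
print(df)
```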
I am attempting to find records in my dataframe that are 30 days old or older. I pretty much have everything working but I need to correct the format of the Age column. Most everything in the program is stuff I found on stack overflow, but I can't figure out how to change the format of the delta that is returned.
import pandas as pd
import datetime as dt
file_name = '/Aging_SRs.xls'
sheet = 'All'
df = pd.read_excel(io=file_name, sheet_name=sheet)
df.rename(columns={'SR Create Date': 'Create_Date', 'SR Number': 'SR'}, inplace=True)
tday = dt.date.today()
tdelta = dt.timedelta(days=30)
aged = tday - tdelta
df = df.loc[df.Create_Date <= aged, :]
# Sets the SR as the index.
df = df.set_index('SR', drop = True)
# Created the Age column.
df.insert(2, 'Age', 0)
# Calculates the days between the Create Date and Today.
df['Age'] = df['Create_Date'].subtract(tday)
The calculation in the last line above gives me the result, but it looks like -197 days +09:39:12 and I need it to just be a positive number 197. I have also tried to search using the python, pandas, and datetime keywords.
df.rename(columns={'Create_Date': 'SR Create Date'}, inplace=True)
writer = pd.ExcelWriter('output_test.xlsx')
df.to_excel(writer)
writer.save()
I can't see your example data, but IIUC and you're just trying to get the absolute value of the number of days of a timedelta, this should work:
df['Age'] = abs(df['Create_Date'].subtract(tday).dt.days)
Explanation:
Given a dataframe with a timedelta column:
>>> df
delta
0 26523 days 01:57:59
1 -1601 days +01:57:59
You can extract just the number of days as an int using dt.days:
>>> df['delta'].dt.days
0 26523
1 -1601
Name: delta, dtype: int64
Then, all you need to do is wrap that in a call to abs to get the absolute value of that int:
>>> abs(df.delta.dt.days)
0 26523
1 1601
Name: delta, dtype: int64
Here is what I worked out for basically the same issue.
# create timestamp for today, normalize to 00:00:00
today = pd.to_datetime('today').normalize()
# match timezone with datetimes in df so subtraction works
today = today.tz_localize(df['posted'].dt.tz)
# create 'age' column for days old
df['age'] = (today - df['posted']).dt.days
Pretty much the same as the answer above, but without the call to abs().
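To tie this back to the 30-day filter, here is a small self-contained sketch. The reference date is fixed and the two rows are invented so that the numbers are reproducible; in real use the reference would come from pd.Timestamp.today().normalize(). Subtracting the creation date from today (rather than the other way round) yields positive ages, so abs() is not needed.

```python
import pandas as pd

# Fixed reference date for reproducibility (an assumption for this example).
today = pd.Timestamp("2021-06-30")

# Invented sample records; only the column name follows the question.
df = pd.DataFrame(
    {"Create_Date": pd.to_datetime(["2020-12-15 14:20:48", "2021-06-01 09:00:00"])},
    index=pd.Index(["SR-1", "SR-2"], name="SR"),
)

# Drop the time of day, then count whole days between creation and today.
df["Age"] = (today - df["Create_Date"].dt.normalize()).dt.days

# Keep only the records that are 30 days old or older.
aged = df[df["Age"] >= 30]
print(aged)
```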