I have the following pandas dataframe. The dates are with time:
from pandas.tseries.holiday import USFederalHolidayCalendar
import pandas as pd
df = pd.DataFrame([[6,0,"2016-01-02 01:00:00",0.0],
[7,0,"2016-07-04 02:00:00",0.0]])
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2014-01-01', end='2018-12-31')
I want to add a new boolean column with True/False if the date is holiday or not.
Tried df["hd"] = df[2].isin(holidays), but it doesn't work because of the time part of the values.
Use Series.dt.floor or Series.dt.normalize to remove the time component:
df[2] = pd.to_datetime(df[2])
df["hd"] = df[2].dt.floor('d').isin(holidays)
#alternative
df["hd"] = df[2].dt.normalize().isin(holidays)
print (df)
0 1 2 3 hd
0 6 0 2016-01-02 01:00:00 0.0 False
1 7 0 2016-07-04 02:00:00 0.0 True
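For reference, here is the same approach as one self-contained snippet; the column names `id` and `ts` are assumptions for readability (the original frame uses positional columns 0-3):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Named columns assumed for readability; the question's frame uses 0-3.
df = pd.DataFrame({"id": [6, 7],
                   "ts": ["2016-01-02 01:00:00", "2016-07-04 02:00:00"]})
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start="2014-01-01", end="2018-12-31")

# Strip the time-of-day before testing membership in the holiday index.
df["ts"] = pd.to_datetime(df["ts"])
df["hd"] = df["ts"].dt.normalize().isin(holidays)
```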
Related
I'd like to add 1 if date_ is more than 12 months after buy_date, else 0
example df
customer_id date_ buy_date
34555 2019-01-01 2017-02-01
24252 2019-01-01 2018-02-10
96477 2019-01-01 2017-02-18
output df
customer_id date_ buy_date buy_date>_than_12_months
34555 2019-01-01 2017-02-01 1
24252 2019-01-01 2018-02-10 0
96477 2019-01-01 2017-02-18 1
Based on what I understand, you can try adding a year to buy_date, subtracting that from date_, and then checking whether the resulting number of days is positive or negative.
df['buy_date>_than_12_months'] = ((df['date_'] -
(df['buy_date']+pd.offsets.DateOffset(years=1)))
.dt.days.gt(0).astype(int))
print(df)
customer_id date_ buy_date buy_date>_than_12_months
0 34555 2019-01-01 2017-02-01 1
1 24252 2019-01-01 2018-02-10 0
2 96477 2019-01-01 2017-02-18 1
import pandas as pd
import numpy as np
values = {'customer_id': [34555,24252,96477],
'date_': ['2019-01-01','2019-01-01','2019-01-01'],
'buy_date': ['2017-02-01','2018-02-10','2017-02-18'],
}
df = pd.DataFrame(values, columns = ['customer_id', 'date_', 'buy_date'])
df['date_'] = pd.to_datetime(df['date_'], format='%Y-%m-%d')
df['buy_date'] = pd.to_datetime(df['buy_date'], format='%Y-%m-%d')
print(df['date_'] - df['buy_date'])
# np.timedelta64 cannot compare a year ('Y') unit against nanosecond-resolution
# data, so approximate twelve months with 365 days
df['buy_date>_than_12_months'] = pd.Series([1 if ((df['date_'] - df['buy_date'])[i] > np.timedelta64(365, 'D')) else 0 for i in range(len(df))])
print (df)
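The loop above can also be written as a vectorized sketch using a calendar-year shift (how leap years should count toward "more than 12 months" is an assumption here):

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [34555, 24252, 96477],
                   "date_": pd.to_datetime(["2019-01-01"] * 3),
                   "buy_date": pd.to_datetime(["2017-02-01",
                                               "2018-02-10",
                                               "2017-02-18"])})

# Shift buy_date forward by one calendar year and compare with date_.
df["buy_date>_than_12_months"] = (
    (df["buy_date"] + pd.DateOffset(years=1) < df["date_"]).astype(int))
```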
I am receiving data which consists of a 'StartTime' and a 'Duration' of time active. This is hard to work with when I need to do calculations on a specified time range over multiple days. I would like to break this data down to minutely data to make future calculations easier. Please see the example to get a better understanding.
Data which I currently have:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime']).dt.tz_convert('Australia/Melbourne')
What I would like to have:
data_expected = {'Time':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 04:37:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00','2019-01-02 05:14:00+11:00'],
'Duration':[1,1,1,1,1,1,1],
'Site':['1','2','3','3','4','5','5']
}
df_expected = pd.DataFrame(data_expected)
df_expected['Time'] = pd.to_datetime(df_expected['Time']).dt.tz_convert('Australia/Melbourne')
I would like to see if anyone has a good solution for this problem. Effectively, I need rows with Duration > 1 to be duplicated, with the time incremented by one minute for each extra minute of duration. Is there a way to do this without creating a whole new dataframe?
******** EDIT ********
In response to @DavidErickson's answer. Putting this here because I can't put images in comments. I ran into a bit of trouble: df1 is a subset of the original dataframe, and df2 is df1 after applying the code provided. You can see that the time added to index 635 is incorrect.
I think you might want to address the case where Duration > 2 as well.
For the modified given input:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
This code should do the trick:
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
df = df.explode('offset')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='T'))
df['StartTime'] += df['offset']
df["Duration"] = 1
Basically, it works as follows:
create a list of integer based on Duration value;
replicate row (explode) with consecutive integer offset;
transform integer offset into timedelta offset;
perform datetime arithmetics and reset Duration field.
The result is about:
StartTime Duration Site offset
0 2018-12-30 12:45:00+11:00 1 1 00:00:00
1 2018-12-31 16:48:00+11:00 1 2 00:00:00
2 2019-01-01 04:36:00+11:00 1 3 00:00:00
2 2019-01-01 04:37:00+11:00 1 3 00:01:00
2 2019-01-01 04:38:00+11:00 1 3 00:02:00
3 2019-01-01 19:27:00+11:00 1 4 00:00:00
4 2019-01-02 05:13:00+11:00 1 5 00:00:00
4 2019-01-02 05:14:00+11:00 1 5 00:01:00
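The steps above can be run end-to-end as a small self-contained sketch (two sample rows only; dropping the helper column at the end is an extra cleanup step, not part of the original answer):

```python
import pandas as pd

data = {"StartTime": pd.to_datetime(["2018-12-30 12:45:00+11:00",
                                     "2019-01-01 04:36:00+11:00"]),
        "Duration": [1, 3],
        "Site": ["1", "3"]}
df = pd.DataFrame(data)

# One integer per minute of Duration, then one row per integer (explode).
df["offset"] = df["Duration"].apply(lambda x: list(range(x)))
df = df.explode("offset").reset_index(drop=True)
# Turn the integer offsets into minute timedeltas and shift each row.
df["StartTime"] += pd.to_timedelta(df["offset"].astype(int), unit="m")
df["Duration"] = 1
df = df.drop(columns="offset")
```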
Use df.index.repeat according to the Duration column to add the relevant number of rows. Then create a mask with .groupby and cumcount that adds the appropriate number of minutes on top of the base time.
input:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,2,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
code:
df = df.loc[df.index.repeat(df['Duration'])]
mask = df.groupby('Site').cumcount()
df['StartTime'] = df['StartTime'] + pd.to_timedelta(mask, unit='m')
df = df.sort_values('StartTime').assign(Duration=1)
df
output:
StartTime Duration Site
0 2018-12-30 12:45:00+11:00 1 1
1 2018-12-31 16:48:00+11:00 1 2
2 2019-01-01 04:36:00+11:00 1 3
2 2019-01-01 04:37:00+11:00 1 3
2 2019-01-01 04:38:00+11:00 1 3
3 2019-01-01 19:27:00+11:00 1 4
4 2019-01-02 05:13:00+11:00 1 5
4 2019-01-02 05:14:00+11:00 1 5
If you are running into memory issues, you can also try dask. I have included @jlandercy's pandas answer and converted it to dask syntax, as I'm not sure the pandas index.repeat operation would work with dask. The relevant functions and operations are documented here: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table
import dask.dataframe as dd
import pandas as pd
#read as a dask dataframe from csv or SQL or other
df = dd.read_csv(files) #df = dd.read_sql_table(table, uri, index_col='StartTime')
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
df = df.explode('offset')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='T'))
df['StartTime'] += df['offset']
df["Duration"] = 1
I have a pandas timeline table containing dates objects and scores:
datetime score
2018-11-23 08:33:02 4
2018-11-24 09:43:30 2
2018-11-25 08:21:34 5
2018-11-26 19:33:01 4
2018-11-23 08:50:40 1
2018-11-23 09:03:10 3
I want to aggregate the score by hour without taking into consideration the date, the result desired is :
08:00:00 10
09:00:00 5
19:00:00 4
So basically I have to remove the year-month-day part and then group the scores by hour.
I tried this command
monthagg = df['score'].resample('H').sum().to_frame()
which does work but still takes the year-month-day into consideration. How can I drop the date part and aggregate by hour only?
One possible solution is to use DatetimeIndex.floor to set minutes and seconds to 0, convert the DatetimeIndex to strings with DatetimeIndex.strftime, and then aggregate with sum:
a = df['score'].groupby(df.index.floor('H').strftime('%H:%M:%S')).sum()
#if column datetime
#a = df['score'].groupby(df['datetime'].dt.floor('H').dt.strftime('%H:%M:%S')).sum()
print (a)
08:00:00 10
09:00:00 5
19:00:00 4
Name: score, dtype: int64
Or use DatetimeIndex.hour and aggregate sum:
a = df.groupby(df.index.hour)['score'].sum()
#if column datetime
#a = df.groupby(df['datetime'].dt.hour)['score'].sum()
print (a)
datetime
8 10
9 5
19 4
Name: score, dtype: int64
Setup to generate a frame with datetime objects:
import datetime
import pandas as pd
rows = [datetime.datetime.now() + datetime.timedelta(hours=i) for i in range(100)]
df = pd.DataFrame(rows, columns=["date"])
df["score"] = 1  # a value column to aggregate
You can now add an hour column like this, and then group by it:
df["hour"] = df["date"].dt.hour
df.groupby("hour")["score"].sum()
import pandas as pd
df = pd.DataFrame({'datetime':['2018-11-23 08:33:02 ','2018-11-24 09:43:30',
'2018-11-25 08:21:34',
'2018-11-26 19:33:01','2018-11-23 08:50:40',
'2018-11-23 09:03:10'],'score':[4,2,5,4,1,3]})
df['datetime']=pd.to_datetime(df['datetime'], errors='coerce')
df["hour"] = df["datetime"].dt.hour
df.groupby("hour")["score"].sum()
Output:
8 10
9 5
19 4
With pd.Grouper we can group by time, for example using 10-second bins:
Time Count
10:05:03 2
10:05:04 3
10:05:05 4
10:05:11 3
10:05:12 4
This will produce:
Time Count
10:05:10 9
10:05:20 7
I'm looking for the other way around: can I group the time by count, for example using 5?
Count Time (s)
5 (4-3)=1s
5 (11-5)=6s
5 (12-11)=1s
Thanks a bunch!
Maybe this is what you have in mind. Start with a pandas Series df:
2018-03-14 06:38:46.308425+00:00 2
2018-03-14 06:38:47.308425+00:00 3
2018-03-14 06:38:48.308425+00:00 4
2018-03-14 06:38:54.308425+00:00 3
2018-03-14 06:38:55.308425+00:00 4
dtype: int64
Find the indices where the cumulative sum crosses a multiple of 5:
df[:] = df.values.cumsum() // 5 * 5
hit5 = (df.diff() == 5).to_numpy().nonzero()[0]
In this case it's array([1, 3, 4]). Then iterate over those indices and take the difference with the previous index:
for i in hit5:
print(df.index[i] - df.index[i-1])
Giving:
0 days 00:00:01
0 days 00:00:06
0 days 00:00:01
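The whole approach fits in a self-contained sketch, using the timestamps and counts from the example above:

```python
import pandas as pd

idx = pd.to_datetime(["2018-03-14 06:38:46", "2018-03-14 06:38:47",
                      "2018-03-14 06:38:48", "2018-03-14 06:38:54",
                      "2018-03-14 06:38:55"])
s = pd.Series([2, 3, 4, 3, 4], index=idx)

# Round the running total down to the nearest multiple of 5 ...
floored = pd.Series(s.values.cumsum() // 5 * 5, index=s.index)
# ... and find the positions where that floored total jumps by 5.
hit5 = (floored.diff() == 5).to_numpy().nonzero()[0]
# Time elapsed between each crossing and the previous observation.
gaps = [floored.index[i] - floored.index[i - 1] for i in hit5]
```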
If I understand your question correctly, you can try
import io
import numpy as np
import pandas as pd
df_txt = """
Time Count
10:05:03 2
10:05:04 3
10:05:05 4
10:05:11 3
10:05:12 4"""
df = pd.read_csv(io.StringIO(df_txt), sep='\t')
df['Time'] = pd.to_datetime(df.Time)
df['CumCount'] = df.Count.cumsum()
df['Ind1'] = df.CumCount // 5
df['Ind2'] = df.Ind1.shift()
df['LagTime'] = df.Time.shift()
df.loc[df.Ind1 == df.Ind2, 'LagTime'] = np.nan
df['StartTime'] = df.LagTime.bfill()
out = df.groupby(['StartTime'], as_index=False).last()
out['Time (s)'] = out.Time.values - out.StartTime.values
Output:
print(out['Time (s)'])
# 0 00:00:01
# 1 00:00:06
# 2 00:00:01
# Name: Time (s), dtype: timedelta64[ns]
I have a csv file that I am trying to import into pandas.
There are two columns of intrest. date and hour and are the first two cols.
E.g.
date,hour,...
10-1-2013,0,
10-1-2013,0,
10-1-2013,0,
10-1-2013,1,
10-1-2013,1,
How do I import with pandas so that the hour and date are combined, or is that best done after the initial import?
df = pd.read_csv('bingads.csv')
If I do the initial import how do I combine the two as a date and then delete the hour?
Thanks
Define your own date_parser:
In [291]: from dateutil.parser import parse
In [292]: import datetime as dt
In [293]: def date_parser(x):
.....: date, hour = x.split(' ')
.....: return parse(date) + dt.timedelta(0, 3600*int(hour))
In [298]: pd.read_csv('test.csv', parse_dates=[[0,1]], date_parser=date_parser)
Out[298]:
date_hour a b c
0 2013-10-01 00:00:00 1 1 1
1 2013-10-01 00:00:00 2 2 2
2 2013-10-01 00:00:00 3 3 3
3 2013-10-01 01:00:00 4 4 4
4 2013-10-01 01:00:00 5 5 5
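In recent pandas versions the date_parser argument is deprecated, so the same combination can be done after the read instead; a sketch, with an inline sample standing in for the real file:

```python
import io
import pandas as pd

# Hypothetical inline sample in the question's month-day-year format.
csv = "date,hour,a\n10-1-2013,0,1\n10-1-2013,1,4\n"
df = pd.read_csv(io.StringIO(csv))

# Parse the date column, then add the hour column as a timedelta.
df["date_hour"] = (pd.to_datetime(df["date"], format="%m-%d-%Y")
                   + pd.to_timedelta(df["hour"], unit="h"))
df = df.drop(columns=["date", "hour"])
```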
Apply read_csv instead of read_clipboard to handle your actual data:
>>> df = pd.read_clipboard(sep=',')
>>> df['date'] = pd.to_datetime(df.date) + pd.to_timedelta(df.hour, unit='h')
>>> del df['hour']
>>> df
date ...
0 2013-10-01 00:00:00 NaN
1 2013-10-01 00:00:00 NaN
2 2013-10-01 00:00:00 NaN
3 2013-10-01 01:00:00 NaN
4 2013-10-01 01:00:00 NaN
[5 rows x 2 columns]
Take a look at the parse_dates argument which pandas.read_csv accepts.
You can do something like:
df = pandas.read_csv('some.csv', parse_dates=True)
# in which case pandas will parse all columns where it finds dates
df = pandas.read_csv('some.csv', parse_dates=[i,j,k])
# in which case pandas will parse the i, j and kth columns for dates
Since you are only using the two columns from the csv file and combining them into one, I would squeeze it into a series of datetime objects like so:
import pandas as pd
from io import StringIO
import datetime as dt
txt='''\
date,hour,A,B
10-1-2013,0,1,6
10-1-2013,0,2,7
10-1-2013,0,3,8
10-1-2013,1,4,9
10-1-2013,1,5,10'''
def date_parser(date, hour):
dates=[]
for ed, eh in zip(date, hour):
month, day, year=list(map(int, ed.split('-')))
hour=int(eh)
dates.append(dt.datetime(year, month, day, hour))
return dates
p=pd.read_csv(StringIO(txt), usecols=[0,1],
parse_dates=[[0,1]], date_parser=date_parser, squeeze=True)
print(p)
Prints:
0 2013-10-01 00:00:00
1 2013-10-01 00:00:00
2 2013-10-01 00:00:00
3 2013-10-01 01:00:00
4 2013-10-01 01:00:00
Name: date_hour, dtype: datetime64[ns]