I have the following dataframe and would like to create a new column based on a condition. The new column should contain 'Night' for the hours between 20:00 and 06:00, 'Morning' for the time between 06:00 and 14:30, and 'Afternoon' for the time between 14:30 and 20:00. What is the best way to formulate and apply such a condition?
import pandas as pd
df = pd.DataFrame({'A': ['test', '2222', '1111', '3333', '1111'],
                   'B': ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
                   'Date': ['15.07.2018 06:23:56', '15.07.2018 01:23:56', '15.07.2018 06:40:06', '15.07.2018 11:38:27', '15.07.2018 21:38:27'],
                   'Defect': [0, 1, 0, 1, 0]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
You can use np.select:
import numpy as np
from datetime import time

condlist = [df['Date'].dt.time.between(time(6), time(14, 30)),
            df['Date'].dt.time.between(time(14, 30), time(20))]
df['Time'] = np.select(condlist, ['Morning', 'Afternoon'], default='Night')
Output:
>>> df
A B Date Defect Time
0 test aaa 2018-07-15 06:23:56 0 Morning
1 2222 aaa 2018-07-15 01:23:56 1 Night
2 1111 bbbb 2018-07-15 06:40:06 0 Morning
3 3333 ccccc 2018-07-15 11:38:27 1 Morning
4 1111 aaa 2018-07-15 21:38:27 0 Night
Note that you don't need an explicit condition for 'Night':
df['Date'].dt.time.between(time(20), time(23, 59, 59)) \
    | df['Date'].dt.time.between(time(0), time(6))
because np.select can take a default value as an argument.
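As a sketch of an alternative (not part of the original answer), the same bucketing can be done with pd.cut on the fractional hour of day, using duplicate labels with ordered=False (supported on pandas ≥ 1.1); the bin edges below are assumptions matching the stated ranges:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['15.07.2018 06:23:56',
                                           '15.07.2018 01:23:56',
                                           '15.07.2018 21:38:27'],
                                          dayfirst=True)})

# Fractional hour of day, e.g. 14:30 -> 14.5
frac_hour = df['Date'].dt.hour + df['Date'].dt.minute / 60

# Left-closed bins: [0, 6) Night, [6, 14.5) Morning, [14.5, 20) Afternoon, [20, 24) Night
df['Time'] = pd.cut(frac_hour, bins=[0, 6, 14.5, 20, 24],
                    labels=['Night', 'Morning', 'Afternoon', 'Night'],
                    right=False, ordered=False)
print(df['Time'].tolist())  # ['Morning', 'Night', 'Night']
```

This keeps the whole classification in one vectorised call, at the cost of ignoring seconds at the bin boundaries.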
You can create a DatetimeIndex from the date column and then use indexer_between_time. Note that indexer_between_time handles ranges that wrap past midnight, such as "20:00"–"06:00":
idx = pd.DatetimeIndex(df["Date"])
conditions = [
    ("20:00", "06:00", "Night"),
    ("06:00", "14:30", "Morning"),
    ("14:30", "20:00", "Afternoon"),
]
for start, end, val in conditions:
    df.loc[idx.indexer_between_time(start, end, include_end=False), "Time_of_Day"] = val
A B Date Defect Time_of_Day
0 test aaa 2018-07-15 06:23:56 0 Morning
1 2222 aaa 2018-07-15 01:23:56 1 Night
2 1111 bbbb 2018-07-15 06:40:06 0 Morning
3 3333 ccccc 2018-07-15 11:38:27 1 Morning
4 1111 aaa 2018-07-15 21:38:27 0 Night
Related
I have some columns in a dataset that contain date and time and my goal is to obtain two separate columns that contain date and time separately.
Example:
Name Dataset: A
Starting
Name column: Cat
12/01/2021 20:15:06
02/01/2021 12:15:07
01/01/2021 15:05:03
01/01/2021 15:05:03
Goal
Name column: Cat1
12/01/2021
02/01/2021
01/01/2021
01/01/2021
Name Column: Cat2
20:15:06
12:15:07
15:05:03
15:05:03
I assume that you're using pandas and that you want to add the columns to the same dataframe.
# df = A (?)
df['Cat1'] = [d.date() for d in df['Cat']]
df['Cat2'] = [d.time() for d in df['Cat']]
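A more idiomatic, vectorised alternative uses the .dt accessor instead of a list comprehension (the day-first date format below is an assumption about the sample dates):

```python
import pandas as pd

df = pd.DataFrame({'Cat': ['12/01/2021 20:15:06', '02/01/2021 12:15:07']})
df['Cat'] = pd.to_datetime(df['Cat'], format='%d/%m/%Y %H:%M:%S')

df['Cat1'] = df['Cat'].dt.date  # datetime.date objects
df['Cat2'] = df['Cat'].dt.time  # datetime.time objects
print(df[['Cat1', 'Cat2']])
```

Both resulting columns have object dtype; if you only need the date part but want to keep a datetime64 dtype, df['Cat'].dt.normalize() is another option.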
Working example:
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame.from_dict(
{'A': [1, 2, 3],
'B': [4, 5, 6],
'Datetime': [datetime.strftime(datetime.now()-timedelta(days=_),
"%m/%d/%Y, %H:%M:%S") for _ in range(3)]},
orient='index',
columns=['A', 'B', 'C']).T
df['Datetime'] = pd.to_datetime(df['Datetime'], format="%m/%d/%Y, %H:%M:%S")
# A B Datetime
# A 1 4 2021-03-05 14:07:59
# B 2 5 2021-03-04 14:07:59
# C 3 6 2021-03-03 14:07:59
df['Cat1'] = [d.date() for d in df['Datetime']]
df['Cat2'] = [d.time() for d in df['Datetime']]
# A B Datetime Cat1 Cat2
# A 1 4 2021-03-05 14:07:59 2021-03-05 14:07:59
# B 2 5 2021-03-04 14:07:59 2021-03-04 14:07:59
# C 3 6 2021-03-03 14:07:59 2021-03-03 14:07:59
I have a pandas dataframe like this -
ColA ColB ColC
Apple 2019-03-02 18:00:00 Saturday
Orange 2019-03-03 10:00:00 Sunday
Mango 2019-03-04 09:00:00 Monday
I am trying to remove rows from my dataframe based on certain conditions:
Remove the row if the time is between 9 AM and 5 PM.
Do not remove it if it falls on a weekend (Saturday or Sunday).
The expected output will not have Mango in the dataframe.
Seems it is harder than what I thought:
s1 = ~df.ColB.dt.hour.between(9, 17)  # True outside business hours
df.loc[s1 | df.ColC.isin(['Saturday', 'Sunday'])]
ColA ColB ColC
0 Apple 2019-03-02 18:00:00 Saturday
1 Orange 2019-03-03 10:00:00 Sunday
Or using indexer_between_time:
s1 = pd.Index(df.ColB).indexer_between_time('09:00:00', '17:00:00')
s1 = df.index.isin(s1)  # True for rows inside business hours
df.loc[~s1 | df.ColC.isin(['Saturday', 'Sunday'])]
To give another alternative you could write it like this:
cond1 = df.ColB.dt.hour >= 9   # From 09:00
cond2 = df.ColB.dt.hour <= 16  # Before 17:00
cond3 = df.ColB.dt.weekday < 5 # Mon-Fri
df = df[~(cond1 & cond2 & cond3)]
Full example:
import pandas as pd
df = pd.DataFrame({
'ColA': ['Apple','Orange','Mango'],
'ColB': pd.to_datetime([
'2019-03-02 18:00:00',
'2019-03-03 10:00:00',
'2019-03-04 09:00:00'
]),
'ColC': ['Saturday', 'Sunday', 'Monday']
})
cond1 = df.ColB.dt.hour >= 9   # From 09:00
cond2 = df.ColB.dt.hour <= 16  # Before 17:00
cond3 = df.ColB.dt.weekday < 5 # Mon-Fri
df = df[~(cond1 & cond2 & cond3)]  # conditions mark the rows to drop, hence ~
print(df)
Returns:
ColA ColB ColC
0 Apple 2019-03-02 18:00:00 Saturday
1 Orange 2019-03-03 10:00:00 Sunday
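If you need the exact 09:00–17:00 window rather than whole-hour granularity, comparing datetime.time values is one option (a sketch, not part of the original answers):

```python
import pandas as pd
from datetime import time

df = pd.DataFrame({
    'ColA': ['Apple', 'Orange', 'Mango'],
    'ColB': pd.to_datetime(['2019-03-02 18:00:00',
                            '2019-03-03 10:00:00',
                            '2019-03-04 09:00:00']),
    'ColC': ['Saturday', 'Sunday', 'Monday']
})

in_hours = df['ColB'].dt.time.between(time(9), time(17))  # 09:00:00 <= t <= 17:00:00
weekend = df['ColB'].dt.weekday >= 5                      # Saturday/Sunday
df = df[~in_hours | weekend]
print(df['ColA'].tolist())  # ['Apple', 'Orange']
```

Deriving the weekend flag from ColB itself also avoids trusting the ColC strings to stay in sync with the dates.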
When I run the following code, the results appear to add the non-business day data to the result.
Code
import pandas as pd
df = pd.DataFrame({'id': [30820864, 32295510, 30913444, 30913445],
'ticket_id': [100, 101, 102, 103],
'date_time': [
'6/1/17 9:48',
'6/2/17 13:11',
'6/3/17 13:15',
'6/5/17 13:15'],
})
df['date_time'] = pd.to_datetime(df['date_time'])
df.index = df['date_time']
x = df.resample('B').count()
print(x)
Result
id ticket_id date_time
date_time
2017-06-01 1 0 1
2017-06-02 2 0 2
2017-06-05 1 0 1
I would expect that the count for 2017-06-02 would be 1 and not 2. Shouldn't the data from a non-business day (6/3/17) be ignored?
This is standard behaviour: with business-day ('B') resampling, events that fall on a weekend are grouped into Friday's bin, which is the documented convention.
One solution, drop the weekends:
df = df[df['date_time'].dt.weekday < 5]  # 5 = Saturday, 6 = Sunday
Output:
date_time id ticket_id
date_time
2017-06-01 1 1 1
2017-06-02 1 1 1
2017-06-05 1 1 1
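Putting it together, a minimal sketch of the drop-then-resample pipeline (indexing by date_time first):

```python
import pandas as pd

df = pd.DataFrame({'id': [30820864, 32295510, 30913444, 30913445],
                   'ticket_id': [100, 101, 102, 103],
                   'date_time': pd.to_datetime(['6/1/17 9:48', '6/2/17 13:11',
                                                '6/3/17 13:15', '6/5/17 13:15'])})
df = df.set_index('date_time')
df = df[df.index.weekday < 5]       # drop Saturday/Sunday rows before resampling
x = df.resample('B')['id'].count()  # each business day now counts only its own events
print(x.tolist())  # [1, 1, 1]
```

With the Saturday row removed up front, the 2017-06-02 bin no longer absorbs the weekend event.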
I have a dataframe that looks like this:
> dt
text timestamp
0 a 2016-06-13 18:00
1 b 2016-06-20 14:08
2 c 2016-07-01 07:41
3 d 2016-07-11 19:07
4 e 2016-08-01 16:00
And I want to summarise every month's data like:
> dt_month
count timestamp
0 2 2016-06
1 2 2016-07
2 1 2016-08
The original dataset (dt) can be generated by:
import pandas as pd
data = {'text': ['a', 'b', 'c', 'd', 'e'],
'timestamp': ['2016-06-13 18:00', '2016-06-20 14:08', '2016-07-01 07:41', '2016-07-11 19:07', '2016-08-01 16:00']}
dt = pd.DataFrame(data)
And is there a way to plot a time-frequency chart from dt_month?
You can group by the timestamp column converted to_period and aggregate with size (converting the strings to datetimes first):
dt['timestamp'] = pd.to_datetime(dt['timestamp'])
print (dt.text.groupby(dt.timestamp.dt.to_period('m'))
         .size()
         .rename('count')
         .reset_index())
timestamp count
0 2016-06 2
1 2016-07 2
2 2016-08 1
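For the plotting part of the question, the same grouped Series can be plotted directly (assuming matplotlib is available):

```python
import pandas as pd

data = {'text': ['a', 'b', 'c', 'd', 'e'],
        'timestamp': ['2016-06-13 18:00', '2016-06-20 14:08', '2016-07-01 07:41',
                      '2016-07-11 19:07', '2016-08-01 16:00']}
dt = pd.DataFrame(data)
dt['timestamp'] = pd.to_datetime(dt['timestamp'])

monthly = dt['text'].groupby(dt['timestamp'].dt.to_period('M')).size()
print(monthly.tolist())  # [2, 2, 1]
# monthly.plot(kind='bar')  # draws the month-vs-count frequency plot
```

The PeriodIndex produced by to_period('M') gives clean month labels ('2016-06', …) on the x-axis.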
I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col 1, the x days before and x days after that date.
That means the resulting DataFrame will contain more rows (specifically n*(1 + 2*x), where n is the original number of dates in col1, assuming the windows don't overlap).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
You can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]:
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
x + pd.Timedelta(days=N))
)
)
.stack()
.drop_duplicates()
.reset_index(level=[0,1], drop=True)
.to_frame(name='col1')
)
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
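On newer pandas (≥ 0.25) the same idea reads a bit more directly with Series.explode — a sketch, converting each date_range to a list first:

```python
import pandas as pd

N = 1
df = pd.DataFrame({'col1': pd.to_datetime(['2015-02-02', '2015-04-05', '2016-07-02'])})

out = (df['col1']
       .apply(lambda d: list(pd.date_range(d - pd.Timedelta(days=N),
                                           d + pd.Timedelta(days=N))))
       .explode()                 # one row per generated date
       .drop_duplicates()
       .reset_index(drop=True)
       .to_frame('col1'))
print(len(out))  # 9 dates: 3 originals, each with one day before and after
```

drop_duplicates matters when the ±N windows of neighbouring dates overlap; without overlaps you get exactly n*(1 + 2*N) rows.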
Something like this takes a dataframe with a datetime.date column and stacks another Series underneath it, shifted from the original data by a timedelta (shown here for a single +1 day shift):
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
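To generalise that snippet to ±x days, one option (a sketch, not the original code) is to concatenate one shifted copy of the column per offset:

```python
import datetime
import pandas as pd

x = 1
df = pd.DataFrame({'date': [datetime.date(2016, 1, 2), datetime.date(2016, 1, 1)]})

# One shifted copy per offset in [-x, x], then de-duplicate and sort
out = (pd.concat([df['date'] + datetime.timedelta(days=k) for k in range(-x, x + 1)],
                 ignore_index=True)
         .drop_duplicates()
         .sort_values()
         .reset_index(drop=True)
         .to_frame('date'))
print(out['date'].tolist())
```

Here the two input dates are adjacent, so their ±1 windows overlap and the six shifted values collapse to four unique dates.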