Max and Min date in pandas groupby - python

I have a dataframe that looks like:
import numpy as np
import pandas as pd

data = {'index': ['2014-06-22 10:46:00', '2014-06-24 19:52:00', '2014-06-25 17:02:00', '2014-06-25 17:55:00', '2014-07-02 11:36:00', '2014-07-06 12:40:00', '2014-07-05 12:46:00', '2014-07-27 15:12:00'],
        'type': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C'],
        'sum_col': [1, 2, 3, 1, 1, 3, 2, 1]}
df = pd.DataFrame(data, columns=['index', 'type', 'sum_col'])
df['index'] = pd.to_datetime(df['index'])
df = df.set_index('index')
df['weekofyear'] = df.index.weekofyear  # removed in pandas 2.x; use df.index.isocalendar().week there
df['date'] = df.index.date
df['date'] = pd.to_datetime(df['date'])
                    type  sum_col  weekofyear       date
index
2014-06-22 10:46:00    A        1          25 2014-06-22
2014-06-24 19:52:00    B        2          26 2014-06-24
2014-06-25 17:02:00    C        3          26 2014-06-25
2014-06-25 17:55:00    A        1          26 2014-06-25
2014-07-02 11:36:00    B        1          27 2014-07-02
2014-07-06 12:40:00    C        3          27 2014-07-06
2014-07-05 12:46:00    A        2          27 2014-07-05
2014-07-27 15:12:00    C        1          30 2014-07-27
I'm looking to group by weekofyear, then sum up sum_col. In addition, I need to find the earliest and the latest date for each week. The first part is pretty easy:
gb = df.groupby(['type', 'weekofyear'])
gb['sum_col'].agg({'sum_col' : np.sum})
I've tried to find the min/max date with this, but haven't been successful:
gb = df.groupby(['type', 'weekofyear'])
gb.agg({'sum_col' : np.sum,
        'date' : np.min,
        'date' : np.max})
How would one find the earliest/latest date that appears?

You need to combine the functions that apply to the same column, like this:
In [116]: gb.agg({'sum_col' : np.sum,
     ...:         'date' : [np.min, np.max]})
Out[116]:
                       date             sum_col
                       amin        amax     sum
type weekofyear
A    25          2014-06-22  2014-06-22       1
     26          2014-06-25  2014-06-25       1
     27          2014-07-05  2014-07-05       2
B    26          2014-06-24  2014-06-24       2
     27          2014-07-02  2014-07-02       1
C    26          2014-06-25  2014-06-25       3
     27          2014-07-06  2014-07-06       3
     30          2014-07-27  2014-07-27       1
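Note that np.min and np.max produce the amin/amax column labels seen above; on newer pandas the plain string names give the same result with friendlier labels:

gb.agg({'sum_col': 'sum',
        'date': ['min', 'max']})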

A simpler version is:
df.groupby([key_field]).agg({'time_field': [np.min, np.max]})
where key_field and time_field are placeholders: key_field might be an event_id column and time_field a timestamp column.
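For instance, with hypothetical event_id and timestamp columns, flattening the resulting MultiIndex columns afterwards:

out = df.groupby(['event_id']).agg({'timestamp': ['min', 'max']})
out.columns = ['_'.join(col) for col in out.columns]  # timestamp_min, timestamp_max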

Another possible solution, named aggregation, gives you more control over the resulting column names (requires pandas >= 0.25):
gb = df.groupby(['type', 'weekofyear'])
gb.agg(
    sum_col=('sum_col', np.sum),
    first_date=('date', np.min),
    last_date=('date', np.max)
).reset_index()
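On recent pandas (2.1+), passing raw NumPy functions to agg emits a FutureWarning; the string names 'sum', 'min' and 'max' behave identically:

gb.agg(
    sum_col=('sum_col', 'sum'),
    first_date=('date', 'min'),
    last_date=('date', 'max')
).reset_index()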

Related

Insert rows to fill gaps in year column in Pandas DataFrame

I have the following DataFrame:
import pandas as pd

data = {'id': ['A', 'A', 'B', 'C'],
        'location': ['loc1', 'loc2', 'loc1', 'loc3'],
        'year_data': [2013, 2015, 2014, 2015],
        'c': [10.5, 13.5, 12.3, 9.75]}
data = pd.DataFrame(data)
For each groupby(['id','location']) group, I want to insert rows in the DataFrame covering every year from min(year_data) to 2015.
The desired output:
data = {'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
        'location': ['loc1', 'loc1', 'loc1', 'loc2', 'loc1', 'loc1', 'loc3'],
        'year_data': [2013, 2014, 2015, 2015, 2014, 2015, 2015],
        'c': [10.5, 10.5, 10.5, 13.5, 12.3, 12.3, 9.75]}
data = pd.DataFrame(data)
Use a lambda function that reindexes each group over range(min_year, 2016) with method='ffill', after moving year_data into the index with DataFrame.set_index:
f = lambda x: x.reindex(range(x.index.min(), 2016), method='ffill')
df = data.set_index('year_data').groupby(['id', 'location'])['c'].apply(f).reset_index()
print(df)
id location year_data c
0 A loc1 2013 10.50
1 A loc1 2014 10.50
2 A loc1 2015 10.50
3 A loc2 2015 13.50
4 B loc1 2014 12.30
5 B loc1 2015 12.30
6 C loc3 2015 9.75
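If the cutoff should follow the data rather than the hard-coded 2015, one variation (an assumption, not part of the original question) takes the maximum year present anywhere in the data:

# assumes the fill should run to the latest year found in the data
end = data['year_data'].max()
f = lambda x: x.reindex(range(x.index.min(), end + 1), method='ffill')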

Pandas Dataframe datetime condition

I have the following dataframe and would like to create a new column based on a condition. The new column should contain 'Night' for the hours between 20:00 and 06:00, 'Morning' for the time between 06:00 and 14:30, and 'Afternoon' for the time between 14:30 and 20:00. What is the best way to formulate and apply such a condition?
import pandas as pd
df = {'A': ['test', '2222', '1111', '3333', '1111'],
      'B': ['aaa', 'aaa', 'bbbb', 'ccccc', 'aaa'],
      'Date': ['15.07.2018 06:23:56', '15.07.2018 01:23:56', '15.07.2018 06:40:06', '15.07.2018 11:38:27', '15.07.2018 21:38:27'],
      'Defect': [0, 1, 0, 1, 0]}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'])
You can use np.select:
import numpy as np
from datetime import time

condlist = [df['Date'].dt.time.between(time(6), time(14, 30)),
            df['Date'].dt.time.between(time(14, 30), time(20))]
df['Time'] = np.select(condlist, ['Morning', 'Afternoon'], default='Night')
Output:
>>> df
A B Date Defect Time
0 test aaa 2018-07-15 06:23:56 0 Morning
1 2222 aaa 2018-07-15 01:23:56 1 Night
2 1111 bbbb 2018-07-15 06:40:06 0 Morning
3 3333 ccccc 2018-07-15 11:38:27 1 Morning
4 1111 aaa 2018-07-15 21:38:27 0 Night
Note that you don't need a condition for 'Night':
df['Date'].dt.time.between(time(20), time(23, 59, 59)) \
    | df['Date'].dt.time.between(time(0), time(6))
because np.select takes a default value as an argument.
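One caveat (not from the original answer): Series.between is inclusive on both ends by default, so 14:30 satisfies both conditions and np.select assigns the first match, 'Morning'. On pandas 1.3+ you can make the boundaries half-open explicitly:

condlist = [df['Date'].dt.time.between(time(6), time(14, 30), inclusive='left'),
            df['Date'].dt.time.between(time(14, 30), time(20), inclusive='left')]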
You can build a DatetimeIndex from the date field and then use indexer_between_time:
idx = pd.DatetimeIndex(df["Date"])
conditions = [
    ("20:00", "06:00", "Night"),
    ("06:00", "14:30", "Morning"),
    ("14:30", "20:00", "Afternoon"),
]
for cond in conditions:
    start, end, val = cond
    df.loc[idx.indexer_between_time(start, end, include_end=False), "Time_of_Day"] = val
A B Date Defect Time_of_Day
0 test aaa 2018-07-15 06:23:56 0 Morning
1 2222 aaa 2018-07-15 01:23:56 1 Night
2 1111 bbbb 2018-07-15 06:40:06 0 Morning
3 3333 ccccc 2018-07-15 11:38:27 1 Morning
4 1111 aaa 2018-07-15 21:38:27 0 Night

Change a column that contains date and time into two columns containing date and time separately

I have a column in a dataset that contains date and time, and my goal is to obtain two separate columns, one with the date and one with the time.
Example:
Dataset name: A
Starting column, Cat:
12/01/2021 20:15:06
02/01/2021 12:15:07
01/01/2021 15:05:03
01/01/2021 15:05:03
Goal, column Cat1:
12/01/2021
02/01/2021
01/01/2021
01/01/2021
and column Cat2:
20:15:06
12:15:07
15:05:03
15:05:03
I assume that you're using pandas and that you want to add the new columns to the same dataframe.
# df = A (?)
df['Cat1'] = [d.date() for d in df['Cat']]
df['Cat2'] = [d.time() for d in df['Cat']]
Working example:
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame.from_dict(
    {'A': [1, 2, 3],
     'B': [4, 5, 6],
     'Datetime': [datetime.strftime(datetime.now() - timedelta(days=_),
                                    "%m/%d/%Y, %H:%M:%S") for _ in range(3)]},
    orient='index',
    columns=['A', 'B', 'C']).T
df['Datetime'] = pd.to_datetime(df['Datetime'], format="%m/%d/%Y, %H:%M:%S")
#    A  B            Datetime
# A  1  4 2021-03-05 14:07:59
# B  2  5 2021-03-04 14:07:59
# C  3  6 2021-03-03 14:07:59
df['Cat1'] = [d.date() for d in df['Datetime']]
df['Cat2'] = [d.time() for d in df['Datetime']]
#    A  B            Datetime        Cat1      Cat2
# A  1  4 2021-03-05 14:07:59  2021-03-05  14:07:59
# B  2  5 2021-03-04 14:07:59  2021-03-04  14:07:59
# C  3  6 2021-03-03 14:07:59  2021-03-03  14:07:59
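A vectorized alternative, usually faster than the list comprehensions, uses the .dt accessor (assuming the column already holds datetime64 values, as above):

df['Cat1'] = df['Datetime'].dt.date
df['Cat2'] = df['Datetime'].dt.time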

pandas time-series data preprocessing

I have a dataframe that looks like this:
> dt
text timestamp
0 a 2016-06-13 18:00
1 b 2016-06-20 14:08
2 c 2016-07-01 07:41
3 d 2016-07-11 19:07
4 e 2016-08-01 16:00
And I want to summarise every month's data like:
> dt_month
count timestamp
0 2 2016-06
1 2 2016-07
2 1 2016-08
The original dataset (dt) can be generated by:
import pandas as pd

data = {'text': ['a', 'b', 'c', 'd', 'e'],
        'timestamp': ['2016-06-13 18:00', '2016-06-20 14:08', '2016-07-01 07:41', '2016-07-11 19:07', '2016-08-01 16:00']}
dt = pd.DataFrame(data)
Also, is there any way to plot a time-frequency plot from dt_month?
You can group by the timestamp column converted with to_period and aggregate with size (convert the strings to datetimes first):
dt['timestamp'] = pd.to_datetime(dt['timestamp'])
print(dt.text.groupby(dt.timestamp.dt.to_period('m'))
        .size()
        .rename('count')
        .reset_index())
timestamp count
0 2016-06 2
1 2016-07 2
2 2016-08 1
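As for the plotting part of the question, a minimal sketch with matplotlib (an assumption, building on the conversion shown above):

import matplotlib.pyplot as plt

dt_month = dt.text.groupby(dt.timestamp.dt.to_period('m')).size()
dt_month.plot(kind='bar')  # monthly counts as a bar chart
plt.show()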

how to get the datetimes before and after some specific dates in Pandas?

I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col1, the x days before and x days after that date.
That means the resulting DataFrame will contain more rows (specifically, n(1 + 2x), where n is the original number of dates in col1).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
You can do it this way, though I'm not sure it's the best or fastest way to do it (here N plays the role of x):
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
   .stack()
   .drop_duplicates()
   .reset_index(level=[0, 1], drop=True)
   .to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
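On pandas 0.25+ the same stack-and-dedupe can be written a bit more directly with Series.explode (a sketch, not part of the original answer):

N = 2
out = (df['col1']
       .apply(lambda d: pd.date_range(d - pd.Timedelta(days=N),
                                      d + pd.Timedelta(days=N)))
       .explode()
       .drop_duplicates()
       .reset_index(drop=True)
       .to_frame(name='col1'))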
Something like this takes a DataFrame with a datetime.date column and stacks shifted copies of that column above and below the original dates (here for x=1):
import datetime
import pandas as pd

df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
# stack the dates shifted one day back, the originals, and one day forward
df = pd.concat([df.date - datetime.timedelta(days=1),
                df.date,
                df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
