Pandas DataFrame groupby two columns and get first and last - python

I have a DataFrame like the following.
df = pd.DataFrame({'id': [1, 1, 2, 3, 2],
                   'value': ["a", "b", "a", "a", "c"],
                   'Time': ['6/Nov/2012 23:59:59 -0600', '6/Nov/2012 00:00:05 -0600',
                            '7/Nov/2012 00:00:09 -0600', '27/Nov/2012 00:00:13 -0600',
                            '27/Nov/2012 00:00:17 -0600']})
I need to get an output like the following.
combined_id | enter time | exit time | time difference
combined_id should be created by grouping on 'id' and 'value':
g = df.groupby(['id', 'value'])
The following doesn't work when grouping by two columns. (How can I use first() and last() here as the enter and exit times?)
df['enter'] = g.apply(lambda x: x.first())
To get the difference, would the following work?
df['delta'] = (df['exit']-df['enter'].shift()).fillna(0)

First, ensure your 'Time' column is a proper datetime column:
In [11]: df['Time'] = pd.to_datetime(df['Time'])
Now, you can do the groupby and use agg with the first and last groupby methods:
In [12]: g = df.groupby(['id', 'value'])
In [13]: res = g['Time'].agg(enter='first', exit='last')
In [14]: res['time_diff'] = res['exit'] - res['enter']
In [15]: res
Out[15]:
                        enter                exit time_diff
id value
1 a 2012-11-06 23:59:59 2012-11-06 23:59:59 0 days
b 2012-11-06 00:00:05 2012-11-06 00:00:05 0 days
2 a 2012-11-07 00:00:09 2012-11-07 00:00:09 0 days
c 2012-11-27 00:00:17 2012-11-27 00:00:17 0 days
3 a 2012-11-27 00:00:13 2012-11-27 00:00:13 0 days
Note: this is a bit of a boring example since there is only one item in each group...
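To get exactly the requested combined_id | enter time | exit time | time difference layout, you can flatten the MultiIndex afterwards. A minimal sketch (the underscore-joined combined_id format is my assumption):
res = res.reset_index()
res['combined_id'] = res['id'].astype(str) + '_' + res['value']
res = res[['combined_id', 'enter', 'exit', 'time_diff']]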

Related

Fixing dataframe with two data formats for rows

I have data that is in this inconvenient format. Simple reproducible example below:
26/9/21 26/9/21
10:00 Paul
12:00 John
27/9/21 27/9/21
1:00 Ringo
As you can see, the dates have not been entered as a column. Instead, the dates repeat across rows as a "header" row for the rows below it. Each date then has a variable number of data rows beneath it, before the next date "header" row.
The output I would like would be:
26/9/21 10:00 Paul
26/9/21 12:00 John
27/9/21 1:00 Ringo
How can I do this in Python and Pandas?
Code for data entry below:
import pandas as pd
df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
                   'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
df
Convert your column a to datetime with errors='coerce', forward-fill the dates, then add the time-of-day rows as offsets:
sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
msk = sra.isnull()
sra = sra.ffill() + pd.to_timedelta(df.loc[msk, 'a'] + ':00')
out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
>>> out
a b
1 2021-09-26 10:00:00   Paul
2 2021-09-26 12:00:00   John
4 2021-09-27 01:00:00 Ringo
Step by step:
>>> sra = pd.to_datetime(df['a'], format='%d/%m/%y', errors='coerce')
0 2021-09-26
1 NaT
2 NaT
3 2021-09-27
4 NaT
Name: a, dtype: datetime64[ns]
>>> msk = sra.isnull()
0 False
1 True
2 True
3 False
4 True
Name: a, dtype: bool
>>> sra = sra.ffill() + pd.to_timedelta(df.loc[msk, 'a'] + ':00')
0 NaT
1 2021-09-26 10:00:00
2 2021-09-26 12:00:00
3 NaT
4 2021-09-27 01:00:00
Name: a, dtype: datetime64[ns]
>>> out = pd.merge(sra[msk], df['b'], left_index=True, right_index=True)
a b
1 2021-09-26 10:00:00   Paul
2 2021-09-26 12:00:00   John
4 2021-09-27 01:00:00 Ringo
The following is simple-to-understand code that reads the original dataframe row by row and builds a new dataframe:
import pandas as pd

df = pd.DataFrame({'a': ['26/9/21', '10:00', '12:00', '27/9/21', '1:00'],
                   'b': ['26/9/21', 'Paul', 'John', '27/9/21', 'Ringo']})
newrow = []
newdata = []
for i in range(len(df)):             # read each row one by one
    if '/' in df.iloc[i, 0]:         # if a date is found
        item0 = df.iloc[i, 0]        # remember the new date
        newrow = [item0]             # put the date as the first entry of a new row
        continue                     # go to the next row
    newrow.append(df.iloc[i, 0])     # add time
    newrow.append(df.iloc[i, 1])     # add name
    newdata.append(newrow)           # add the completed row to the new data
    newrow = [item0]                 # start a new row with the same date
newdf = pd.DataFrame(newdata, columns=['Date', 'Time', 'Name'])  # create the new dataframe
print(newdf)
Output:
Date Time Name
0 26/9/21 10:00 Paul
1 26/9/21 12:00 John
2 27/9/21 1:00 Ringo
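If you also want a single datetime column like the first answer produces, a small follow-up sketch (the format string assumes day/month/year dates and 24-hour times, matching the sample data):
newdf['Datetime'] = pd.to_datetime(newdf['Date'] + ' ' + newdf['Time'], format='%d/%m/%y %H:%M')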

Python - Pandas, count time diff from first record in a group

Following up on this question.
Having the following DF:
group_id timestamp
A 2020-09-29 06:00:00 UTC
A 2020-09-29 08:00:00 UTC
A 2020-09-30 09:00:00 UTC
B 2020-09-01 04:00:00 UTC
B 2020-09-01 06:00:00 UTC
I would like to count the deltas between each record and the first record in its group, aggregated across all groups, without counting deltas between different groups. Result for the above example:
delta count
2 2
27 1
Explanation: In group A the deltas are
06:00:00 -> 08:00:00 (2 hours)
08:00:00 -> 09:00:00 on the next day (27 hours from the first event)
And in group B:
04:00:00 -> 06:00:00 (2 hours)
How can I achieve this using Python Pandas?
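For reference, a minimal construction of this sample frame (the exact literal strings are my assumption from the table above):
import pandas as pd

df = pd.DataFrame({'group_id': ['A', 'A', 'A', 'B', 'B'],
                   'timestamp': ['2020-09-29 06:00:00 UTC', '2020-09-29 08:00:00 UTC',
                                 '2020-09-30 09:00:00 UTC', '2020-09-01 04:00:00 UTC',
                                 '2020-09-01 06:00:00 UTC']})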
The first idea is to use a custom lambda function with Series.cumsum for the cumulative sum:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df1 = (df.groupby("group_id")['timestamp']
         .apply(lambda x: x.diff().dt.total_seconds().cumsum())
         .div(3600)
         .value_counts()
         .rename_axis('delta')
         .reset_index(name='count')
       )
print (df1)
delta count
0 2.0 2
1 27.0 1
Or add another groupby with GroupBy.cumsum:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df1 = (df.groupby("group_id")['timestamp']
         .diff()
         .dt.total_seconds()
         .div(3600)
         .groupby(df['group_id'])
         .cumsum()
         .value_counts()
         .rename_axis('delta')
         .reset_index(name='count')
       )
print (df1)
delta count
0 2.0 2
1 27.0 1
Another idea is to subtract the first value per group using GroupBy.transform with GroupBy.first; to remove each group's first row (which would give a delta of 0), a filter with Series.duplicated is added:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df1 = (df['timestamp'].sub(df.groupby("group_id")['timestamp'].transform('first'))
         .loc[df['group_id'].duplicated()]
         .dt.total_seconds()
         .div(3600)
         .value_counts()
         .rename_axis('delta')
         .reset_index(name='count')
       )
print (df1)
delta count
0 2.0 2
1 27.0 1
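If you want integer deltas exactly as in the expected output, a small final tweak (assuming every delta lands on a whole number of hours, as it does here):
df1['delta'] = df1['delta'].astype(int)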

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate's reindex() solution if your dates don't include times; otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns, and the same thing happens with asfreq('D') or resample():
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame holding all the required datetimes and then left join your data frame onto it. A working code example follows:
import datetime

import numpy as np
import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()              # existing datetimes - the missing ones are added below
dates = [x.date() for x in datetimes]            # converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create a new dataframe, merge and fill the 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'] = df['Count'].fillna(0)
# optionally sort so the filler rows interleave with the originals:
# df = df.sort_values('Datetime').reset_index(drop=True)
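A loop-free sketch of the same idea, assuming the df from the question: build the full span of days with pd.date_range, keep only the days that have no row yet, and concatenate:
all_days = pd.date_range(df['Datetime'].dt.date.min(), df['Datetime'].dt.date.max(), freq='D')
missing = all_days.difference(pd.to_datetime(df['Datetime'].dt.date.unique()))
filler = pd.DataFrame({'Datetime': missing, 'Sender': None, 'Count': 0})
df = pd.concat([df, filler]).sort_values('Datetime').reset_index(drop=True)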

how to get the datetimes before and after some specific dates in Pandas?

I have a Pandas DataFrame that looks like
col1
2015-02-02
2015-04-05
2016-07-02
I would like to add, for each date in col 1, the x days before and x days after that date.
That means that the resulting DataFrame will contain more rows (specifically, n(1 + 2x), where n is the original number of dates in col1).
How can I do that in a proper Pandonic way?
Output would be (for x=1)
col1
2015-02-01
2015-02-02
2015-02-03
2015-04-04
etc
Thanks!
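For reference, a minimal construction of this sample frame (assuming col1 already holds datetimes):
import pandas as pd

df = pd.DataFrame({'col1': pd.to_datetime(['2015-02-02', '2015-04-05', '2016-07-02'])})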
you can do it this way, but I'm not sure that it's the best / fastest way to do it:
In [143]: df
Out[143]:
col1
0 2015-02-02
1 2015-04-05
2 2016-07-02
In [144]: %paste
N = 2
(df.col1.apply(lambda x: pd.Series(pd.date_range(x - pd.Timedelta(days=N),
                                                 x + pd.Timedelta(days=N))))
   .stack()
   .drop_duplicates()
   .reset_index(level=[0,1], drop=True)
   .to_frame(name='col1')
)
## -- End pasted text --
Out[144]:
col1
0 2015-01-31
1 2015-02-01
2 2015-02-02
3 2015-02-03
4 2015-02-04
5 2015-04-03
6 2015-04-04
7 2015-04-05
8 2015-04-06
9 2015-04-07
10 2016-06-30
11 2016-07-01
12 2016-07-02
13 2016-07-03
14 2016-07-04
Something like this takes a dataframe with a datetime.date column and stacks another Series underneath it with timedelta shifts of the original data (shown here for a single one-day forward shift; a generalized sketch follows below).
import datetime
import pandas as pd
df = pd.DataFrame([{'date': datetime.date(2016, 1, 2)}, {'date': datetime.date(2016, 1, 1)}], columns=['date'])
df = pd.concat([df.date, df.date + datetime.timedelta(days=1)], ignore_index=True).to_frame()
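A generalized version for arbitrary x, as a sketch (my extension of the same shift-and-stack idea; drop_duplicates guards against overlapping windows):
import datetime
import pandas as pd

x = 1
df = pd.DataFrame({'date': [datetime.date(2016, 1, 2), datetime.date(2016, 1, 1)]})
shifted = [df['date'] + datetime.timedelta(days=d) for d in range(-x, x + 1)]
out = (pd.concat(shifted, ignore_index=True)
         .drop_duplicates()
         .sort_values(ignore_index=True)
         .to_frame(name='date'))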

Pandas Timedelta in Days

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate ages based on the time difference between 'entry_date' and 'dob'; to do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I cannot seem to find a way to do this with a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
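Tying this back to the question's age calculation, a sketch using its own round(days/365.25) convention (munged_data, entry_date and dob are the names from the question):
ages = ((munged_data['entry_date'] - munged_data['dob']).dt.days / 365.25).round()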
You need 0.11 for this (0.11rc1 is out, final probably next week):
In [11]: df = pd.DataFrame([pd.Timestamp('20010101'), pd.Timestamp('20040601')], columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = pd.Timestamp('20130419')
In [14]: df['diff'] = df['today'] - df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there was not yet full support for timedelta64[ns] scalars (analogous to how Timestamps wrap datetime64[ns]; that came in 0.12).
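On any reasonably recent pandas (0.15+), the apply is unnecessary; a sketch of the modern vectorized equivalent:
df['years'] = (df['today'] - df['age']).dt.days / 365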
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method for frequency conversion (note that modern pandas restricts astype to the 's', 'ms', 'us' and 'ns' resolutions, so prefer .dt.days or division by a pd.Timedelta there):
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.iloc[0] - df.iloc[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.iloc[0] - df.iloc[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Say you have a pandas Series named time_difference whose dtype is timedelta64[ns]. One way of extracting just the days (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
The wrapper is used because a raw numpy.timedelta64 scalar does not have a 'days' attribute, while pd.Timedelta does (pd.tslib.Timedelta was the old private name for the same type).
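On a whole Series like this, a simpler vectorized sketch avoids the Python-level apply:
just_day = time_difference.dt.days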
To convert a plain number into days, wrap it in pd.Timedelta and read its days attribute (note that calendar-ambiguous units such as 'Y' and 'M' are not accepted):
pd.Timedelta(1985, unit='h').days
82
