I have a pandas DataFrame with three columns: id, start_time, end_time, and I would like to transform it into a DataFrame with two columns: id, time.
e.g. from [001, 1, 3], [002, 3, 4] to [001, 1], [001, 2], [001, 3], [002, 3], [002, 4]
Currently, I am using a for loop and appending to the DataFrame in each iteration, but it is very slow. Is there another method I can use to save time?
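For context, the loop-based version is presumably something along these lines (a sketch, not the actual code from the question; it assumes integer start/end times and uses pd.concat since DataFrame.append is deprecated — the repeated concatenation in the loop is what makes it slow):
import pandas as pd

df = pd.DataFrame([['001', 1, 3], ['002', 3, 4]],
                  columns=['id', 'start_time', 'end_time'])

# build one small frame per row and concatenate on every iteration (slow)
out = pd.DataFrame(columns=['id', 'time'])
for row in df.itertuples(index=False):
    expanded = pd.DataFrame({'id': row.id,
                             'time': range(row.start_time, row.end_time + 1)})
    out = pd.concat([out, expanded], ignore_index=True)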
If start_time and end_time are timedeltas, use:
df = pd.DataFrame([['001', 1, 3], ['002', 3, 4]],
                  columns=['id', 'start_time', 'end_time'])
print (df)
id start_time end_time
0 001 1 3
1 002 3 4
#stack columns
df1 = pd.melt(df, id_vars='id', value_name='time').drop('variable', axis=1)
#convert int to timedelta
df1['time'] = pd.to_timedelta(df1.time, unit='s')
df1.set_index('time', inplace=True)
print (df1)
id
time
00:00:01 001
00:00:03 002
00:00:03 001
00:00:04 002
# group by id and resample by one second
print (df1.groupby('id')
          .resample('1S')
          .ffill()
          .reset_index(drop=True, level=0)
          .reset_index())
time id
0 00:00:01 001
1 00:00:02 001
2 00:00:03 001
3 00:00:03 002
4 00:00:04 002
If start_time and end_time are datetimes, use:
df = pd.DataFrame([['001', '2016-01-01', '2016-01-03'],
                   ['002', '2016-01-03', '2016-01-04']],
                  columns=['id', 'start_time', 'end_time'])
print (df)
id start_time end_time
0 001 2016-01-01 2016-01-03
1 002 2016-01-03 2016-01-04
df1 = pd.melt(df, id_vars='id', value_name='time').drop('variable', axis=1)
#convert to datetime
df1['time'] = pd.to_datetime(df1.time)
df1.set_index('time', inplace=True)
print (df1)
id
time
2016-01-01 001
2016-01-03 002
2016-01-03 001
2016-01-04 002
# group by id and resample by one day
print (df1.groupby('id')
          .resample('1D')
          .ffill()
          .reset_index(drop=True, level=0)
          .reset_index())
time id
0 2016-01-01 001
1 2016-01-02 001
2 2016-01-03 001
3 2016-01-03 002
4 2016-01-04 002
Here is my take on your question:
df.set_index('id', inplace=True)
reshaped = (df.apply(lambda x: pd.Series(range(x['start time'], x['end time'] + 1)), axis=1)
              .stack().reset_index().drop('level_1', axis=1))
reshaped.columns = ['id', 'time']
reshaped
Test
Input:
import pandas as pd
from io import StringIO
data = StringIO("""id,start time,end time
001, 1, 3
002, 3, 4""")
df = pd.read_csv(data, dtype={'id':'object'})
df.set_index('id', inplace=True)
print("In\n", df)
reshaped = (df.apply(lambda x: pd.Series(range(x['start time'], x['end time'] + 1)), axis=1)
              .stack().reset_index().drop('level_1', axis=1))
reshaped.columns = ['id', 'time']
print("Out\n", reshaped)
Output:
In
start time end time
id
001 1 3
002 3 4
Out
id time
0 001 1
1 001 2
2 001 3
3 002 3
4 002 4
This is my dataframe:
ID number  Date purchase
1          2022-05-01
1          2021-03-03
1          2020-01-03
2          2019-01-03
2          2018-01-03
I want to get a horizontal dataframe with all the dates in separate columns per ID number.
So like this:
ID number  Date 1      Date 2      Date 3
1          2022-05-01  2021-03-03  2020-01-03
2          2019-01-03  2018-01-03
After that, I want to calculate the difference between these dates.
The first step is GroupBy.cumcount with DataFrame.pivot:
df['Date purchase'] = pd.to_datetime(df['Date purchase'])
df1 = (df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
         .assign(g=lambda x: x.groupby('ID number').cumcount())
         .pivot(index='ID number', columns='g', values='Date purchase')
         .rename(columns=lambda x: f'Date {x + 1}'))
print (df1)
g Date 1 Date 2 Date 3
ID number
1 2022-05-01 2021-03-03 2020-01-03
2 2019-01-03 2018-01-03 NaT
Then for differences between columns use DataFrame.diff:
df2 = df1.diff(-1, axis=1)
print (df2)
g Date 1 Date 2 Date 3
ID number
1 424 days 425 days NaT
2 365 days NaT NaT
If you need averages:
df3 = df1.apply(pd.Series.mean, axis=1).reset_index(name='Avg Dates').rename_axis(None, axis=1)
print (df3)
ID number Avg Dates
0 1 2021-03-02 16:00:00
1 2 2018-07-04 12:00:00
Could you do something like this?
def format_dataframe(df):
    """
    Function formats the dataframe to the following:
    | ID number| Date 1 | Date 2 | Date 3 |
    | -------- | -------------- | -------------- | -------------- |
    | 1 | 2022-05-01 | 2021-03-03 | 2020-01-03 |
    | 2 | 2019-01-03 | 2018-01-03 | |
    """
    # newest date first within each ID, then number the dates per ID
    df = df.sort_values(by=['ID number', 'Date purchase'], ascending=[True, False])
    df['g'] = df.groupby('ID number').cumcount() + 1
    # one column per date position
    df = df.pivot(index='ID number', columns='g', values='Date purchase')
    df.columns = [f'Date {c}' for c in df.columns]
    return df.reset_index()
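A quick check with the sample data from the question (a sketch; it assumes pandas is imported as pd and the dates are parsed with pd.to_datetime):
df = pd.DataFrame({'ID number': [1, 1, 1, 2, 2],
                   'Date purchase': pd.to_datetime(['2022-05-01', '2021-03-03',
                                                    '2020-01-03', '2019-01-03',
                                                    '2018-01-03'])})
print(format_dataframe(df))
#    ID number     Date 1     Date 2     Date 3
# 0          1 2022-05-01 2021-03-03 2020-01-03
# 1          2 2019-01-03 2018-01-03        NaT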
initial situation:
d = {'IdNumber': [1,1,1,2,2], 'Date': ['2022-05-01', '2021-03-03','2020-01-03','2019-01-03','2018-01-03']}
df = pd.DataFrame(data=d)
date conversion:
df['Date'] = pd.to_datetime(df['Date'])
creating new column:
df1=df.assign(Col=lambda x: x.groupby('IdNumber').cumcount())
pivoting:
df1=df1.pivot(index=["IdNumber"],columns=["Col"],values="Date")
reset index:
df1 = df1.reset_index(level=0)
rename column:
for i in range(1, len(df1.columns)):
    df1.columns.values[i] = 'Date{0}'.format(i)
final result:
Col IdNumber Date1 Date2 Date3
0 1 2022-05-01 2021-03-03 2020-01-03
1 2 2019-01-03 2018-01-03 NaT
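The same steps can also be chained into a single expression (a sketch of the same logic, nothing new added):
df1 = (df.assign(Col=lambda x: x.groupby('IdNumber').cumcount())
         .pivot(index='IdNumber', columns='Col', values='Date')
         .rename(columns=lambda c: f'Date{c + 1}')
         .reset_index())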
I would like to combine rows of the same id with consecutive dates and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying other posts' answers, but they don't really match my use case.
Thanks in advance!
You can approach it by:
Getting the day diff of consecutive entries within the same group by subtracting the previous End (via GroupBy.shift()) from the current Start.
Setting a group number group_no so that a new group number is issued whenever the day diff from the previous entry within the group is greater than 1.
Then grouping by Id and group_no and aggregating the Start and End dates for each group with .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within a group, we use x.iloc[-1] instead of 'last'.
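For reference, the sample frame from the question can be built like this before running the steps below (a minimal sketch; None becomes NaT for the open-ended End):
import pandas as pd

df = pd.DataFrame({'Id': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'Start': ['2020-01-01', '2020-01-16', '2020-01-31',
                             '2020-07-01', '2020-01-31', '2020-02-16'],
                   'End': ['2020-01-15', '2020-01-30', '2020-02-15',
                           '2020-07-15', '2020-02-15', None],
                   'Feature1': [1, 1, 0, 0, 0, 0],
                   'Feature2': [1, 1, 1, 1, 0, 0]})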
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()
df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
Extract months from both date columns:
df['sMonth'] = pd.to_datetime(df['Start']).dt.month
df['eMonth'] = pd.to_datetime(df['End']).dt.month
Now group the data frame by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
df.groupby(['Id','Feature1','Feature2','sMonth','eMonth']).agg({'Start':'min','End':'max'}).reset_index().drop(['sMonth','eMonth'],axis=1)
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15
I'm looking at counting the number of interactions grouped by ID in the last 12 months for each unique ID. The 12-month window is counted back from the latest date within each ID group.
ID date
001 2022-02-01
002 2018-03-26
001 2021-08-05
001 2019-05-01
002 2019-02-01
003 2018-07-01
Output is something like the below.
ID Last_12_Months_Count
001 2
002 2
003 1
How can I achieve this in Pandas? Is there a function that would count the rows based on the dates, starting from the latest date per group?
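The solutions below compare each date with the per-ID maximum, so the date column must already be datetime; a minimal setup for the sample data (a sketch, assuming pandas is imported as pd, with plain integer IDs to match the printed output):
df = pd.DataFrame({'ID': [1, 2, 1, 1, 2, 3],
                   'date': pd.to_datetime(['2022-02-01', '2018-03-26', '2021-08-05',
                                           '2019-05-01', '2019-02-01', '2018-07-01'])})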
Use:
m = df['date'].gt(df.groupby('ID')['date'].transform('max')
                    .sub(pd.offsets.DateOffset(years=1)))
df1 = df[m]
df1 = df1.groupby('ID').size().reset_index(name='Last_12_Months_Count')
print (df1)
ID Last_12_Months_Count
0 1 2
1 2 2
2 3 1
Or:
df1 = (df.groupby('ID')['date']
         .agg(lambda x: x.gt(x.max() - pd.offsets.DateOffset(years=1)).sum())
         .reset_index(name='Last_12_Months_Count'))
print (df1)
ID Last_12_Months_Count
0 1 2
1 2 2
2 3 1
For counting multiple columns, use named aggregation:
df['date1'] = df['date']
f = lambda x: x.gt(x.max() - pd.offsets.DateOffset(years=1)).sum()
df1 = (df.groupby('ID')
         .agg(Last_12_Months_Count_date=('date', f),
              Last_12_Months_Count_date1=('date1', f))
         .reset_index())
print (df1)
ID Last_12_Months_Count_date Last_12_Months_Count_date1
0 1 2 2
1 2 2 2
2 3 1 1
I have dataframe like this:
Date Location_ID Problem_ID
---------------------+------------+----------
2013-01-02 10:00:00 | 1 | 43
2012-08-09 23:03:01 | 5 | 2
...
How can I count how often a Problem occurs per day and per Location?
Use groupby, either converting the Date column to dates or using Grouper, and aggregate with size:
print (df)
Date Location_ID Problem_ID
0 2013-01-02 10:00:00 1 43
1 2012-08-09 23:03:01 5 2
#if necessary convert column to datetimes
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.groupby([df['Date'].dt.date, 'Location_ID']).size().reset_index(name='count')
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
Or:
df1 = (df.groupby([pd.Grouper(key='Date', freq='D'), 'Location_ID'])
         .size()
         .reset_index(name='count'))
If the first column is the index:
print (df)
Location_ID Problem_ID
Date
2013-01-02 10:00:00 1 43
2012-08-09 23:03:01 5 2
df.index = pd.to_datetime(df.index)
df1 = (df.groupby([df.index.date, 'Location_ID'])
         .size()
         .reset_index(name='count')
         .rename(columns={'level_0':'Date'}))
print (df1)
Date Location_ID count
0 2012-08-09 5 1
1 2013-01-02 1 1
Or:
df1 = (df.groupby([pd.Grouper(level='Date', freq='D'), 'Location_ID'])
         .size()
         .reset_index(name='count'))
I have the below dataframe. Dates are in DD/MM/YYYY format.
Date id
1/5/2017 2:00 PM 100
1/5/2017 3:00 PM 101
2/5/2017 10:00 AM 102
3/5/2017 09:00 AM 103
3/5/2017 10:00 AM 104
4/5/2017 09:00 AM 105
I need the output grouped by date with a count of the number of ids per day, ignoring the time. The output data frame should be as below:
DATE Count
1/5/2017 2 -> count 100,101
2/5/2017 1
3/5/2017 2
4/5/2017 1
I need an efficient way to achieve the above.
Use:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = df['Date'].dt.date.value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
df1 = df.groupby(df['Date'].dt.date).size().reset_index(name='Count')
print (df1)
DATE Count
0 2017-05-01 2
1 2017-05-02 1
2 2017-05-03 2
3 2017-05-04 1
If you need to keep the original date format, work with the strings before converting to datetime:
df1 = df['Date'].str.split().str[0].value_counts().sort_index().reset_index()
df1.columns = ['DATE','Count']
Alternative solution:
new = df['Date'].str.split().str[0]
df1 = df.groupby(new).size().reset_index(name='Count')
print (df1)
Date Count
0 1/5/2017 2
1 2/5/2017 1
2 3/5/2017 2
3 4/5/2017 1