I have this dataframe (type could be 1 or 2):
user_id | timestamp | type
1 | 2015-5-5 12:30 | 1
1 | 2015-5-5 14:00 | 2
1 | 2015-5-5 15:00 | 1
I want to group my data into six-hour windows, and when doing so I want to keep type as:
1 (if only type 1 occurs within that 6-hour frame),
2 (if only type 2 occurs within that 6-hour frame), or
3 (if both 1 and 2 occur within that 6-hour frame).
Here is my code:
df = df.groupby(['user_id', pd.TimeGrouper(freq='6H')]).mean()
which produces:
user_id | timestamp | type
1 | 2015-5-5 12:00 | 4
However, I want to get 3 instead of 4. How can I replace the mean() in my groupby code to produce the desired output?
Try this:
In [54]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
.agg({'type':lambda x: x.unique().sum()})
Out[54]:
type
user_id timestamp
1 2015-05-05 12:00:00 3
PS: this works only with the given types (1, 2), since the sum of their unique values is 3.
Another data set:
In [56]: df
Out[56]:
user_id timestamp type
0 1 2015-05-05 12:30:00 1
1 1 2015-05-05 14:00:00 1
2 1 2015-05-05 15:00:00 1
3 1 2015-05-05 20:00:00 1
In [57]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type':lambda x: x.unique().sum()})
Out[57]:
type
user_id timestamp
1 2015-05-05 12:00:00 1
2015-05-05 18:00:00 1
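If the type codes were ever anything other than (1, 2), a mapping over the set of distinct values would be more robust than summing. A minimal sketch (the combine_types helper is hypothetical, not part of the original answer):

def combine_types(s):
    # 3 when both codes occur in the window, otherwise the single code present
    u = set(s.unique())
    return 3 if u == {1, 2} else u.pop()

df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type': combine_types})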
I have a dataset like this:
date        Condition
20-01-2015  1
20-02-2015  1
20-03-2015  2
20-04-2015  2
20-05-2015  2
20-06-2015  1
20-07-2015  1
20-08-2015  2
20-09-2015  2
20-09-2015  1
I want a new column date_new based on the Condition column: if the condition is 1, do nothing; if the condition is 2, add a day to the date and store it in date_new.
Additional condition: there must be three consecutive 2's for this to apply.
The final output should look like this.
date        Condition  date_new
20-01-2015  1
20-02-2015  1
20-03-2015  2          21-02-2015
20-04-2015  2
20-05-2015  2
20-06-2015  1
20-07-2015  1
20-08-2015  2
20-09-2015  2
20-09-2015  1
Any help is appreciated. Thank you.
This solution is a little different: if the condition is 1 it puts None, otherwise it adds (condition - 1) days to the date.
import numpy as np

df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # ensure datetime dtype
df['date_new'] = np.where(df['Condition'] == 1, None, (df['date'] + pd.to_timedelta(df['Condition'] - 1, 'd')).dt.strftime('%d-%m-%Y'))
OK, so I've edited my answer and transformed it into a function:
def newdate(df):
    L = df.Condition
    # triples of consecutive equal values; a run of three 2's puts 2 in res
    res = [i for i, j, k in zip(L, L[1:], L[2:]) if i == j == k]
    if 2 in res:
        df['date'] = pd.to_datetime(df['date'])
        df['new_date'] = df.apply(lambda x: x["date"] + pd.DateOffset(days=2) if x["Condition"] == 2 else pd.NA, axis=1)
        df['new_date'] = pd.to_datetime(df['new_date'])
    df1 = df
    return df1
Output:

   date                 Condition  new_date
0  2015-01-20 00:00:00  1          NaT
1  2015-02-20 00:00:00  1          NaT
2  2015-03-20 00:00:00  2          2015-03-22 00:00:00
3  2015-04-20 00:00:00  2          2015-04-22 00:00:00
4  2015-05-20 00:00:00  2          2015-05-22 00:00:00
5  2015-06-20 00:00:00  1          NaT
6  2015-07-20 00:00:00  1          NaT
7  2015-08-20 00:00:00  2          2015-08-22 00:00:00
8  2015-09-20 00:00:00  2          2015-09-22 00:00:00
9  2015-09-20 00:00:00  1          NaT
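A quick usage sketch with the question's data (assuming pandas is imported as pd; pd.to_datetime parses these DD-MM-YYYY strings day-first):

df = pd.DataFrame({
    'date': ['20-01-2015', '20-02-2015', '20-03-2015', '20-04-2015', '20-05-2015',
             '20-06-2015', '20-07-2015', '20-08-2015', '20-09-2015', '20-09-2015'],
    'Condition': [1, 1, 2, 2, 2, 1, 1, 2, 2, 1],
})
print(newdate(df))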
I want to group nearby dates together, using a rolling window (?) of three-week periods.
See example and attempt below:
import pandas as pd
d = {'id':[1, 1, 1, 1, 2, 3],
'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])
#   id  datefield
#0   1 2021-01-01
#1   1 2021-01-15
#2   1 2021-01-30
#3   1 2021-02-05
#4   2 2020-02-10
#5   3 2020-02-20
df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()
# id datefield event
#0 1 2021-01-01 0
#1 1 2021-01-15 0
#2 1 2021-01-30 1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3 1 2021-02-05 1 #Should be 0
#4 2 2020-02-10 2
#5 3 2020-02-20 3 #Correct, within 3 weeks of another but since the ids are not the same the event is different
We can compute a few intermediate columns to make the logic easy to follow.
df
id datefield
0 1 2021-01-01
1 1 2021-01-15
2 1 2021-01-30
3 1 2021-02-05
4 2 2020-02-10
5 2 2020-03-20
Calculate difference between dates in number of days
df['diff'] = df['datefield'].diff().dt.days
Get previous ID
df['prevId'] = df['id'].shift()
Decide whether to increment or not
df['increment'] = np.where((df['diff']>21) | (df['prevId'] != df['id']), 1, 0)
Lastly, just get the cumulative sum
df['event'] = df['increment'].cumsum()
Output
id datefield diff prevId increment event
0 1 2021-01-01 NaN NaN 1 1
1 1 2021-01-15 14.0 1.0 0 1
2 1 2021-01-30 15.0 1.0 0 1
3 1 2021-02-05 6.0 1.0 0 1
4 2 2020-02-10 -361.0 1.0 1 2
5 2 2020-03-20 39.0 2.0 1 3
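For reference, the same steps collected into one self-contained snippet (same logic as above, nothing new):

import numpy as np

df['diff'] = df['datefield'].diff().dt.days    # days since the previous row
df['prevId'] = df['id'].shift()                # id of the previous row
df['increment'] = np.where((df['diff'] > 21) | (df['prevId'] != df['id']), 1, 0)
df['event'] = df['increment'].cumsum()         # running event counter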
Let's try a different approach using a boolean series instead:
df['group'] = ((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift()))).cumsum()
Output:
id datefield group
0 1 2021-01-01 1
1 1 2021-01-15 1
2 1 2021-01-30 1
3 1 2021-02-05 1
4 2 2020-02-10 2
5 2 2020-03-20 3
Is the difference from the previous row greater than 3 weeks?
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))))
0 False
1 False
2 False
3 False
4 False
5 True
Name: datefield, dtype: bool
Or is the current id not equal to the previous id:
print((df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 False
Name: id, dtype: bool
OR (|) the conditions together:
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 True
dtype: bool
Then use cumsum, which increments at every True value, to delimit the groups.
*Assumes the id and datefield columns are appropriately ordered.
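If they might not be, sorting first restores that assumption (a one-line sketch):

df = df.sort_values(['id', 'datefield']).reset_index(drop=True)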
It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0:
df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()
output:
id datefield delta_t group
0 2 2020-02-10 0 days 0
1 2 2020-03-20 39 days 1
2 1 2021-01-01 326 days 2
3 1 2021-01-15 340 days 2
4 1 2021-01-30 355 days 2
5 1 2021-02-05 361 days 2
Note that your original dataframe is not sorted properly.
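Note that this approach groups purely by time and ignores id; to also start a new group whenever the id changes (as the question's expected output suggests), the id check from the other answers can be OR'ed in, a sketch:

df["group"] = ((df.delta_t.dt.days.diff() > 21) | df["id"].ne(df["id"].shift())).cumsum()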
I want to convert my datetime column to be pandas dataframe index. This is my dataframe
Date Observed Min Max Sum Count
0 09/15/2018 12:00:00 AM 2 0 2 10 5
1 09/15/2018 01:00:00 AM 1 0 2 25 20
2 09/15/2018 02:00:00 AM 1 0 1 21 21
3 09/15/2018 03:00:00 AM 1 0 2 23 22
4 09/15/2018 04:00:00 AM 1 0 1 21 21
And I want the Date to be the index for the dataframe.
I've looked for answers and have tried this code
dateparse = lambda dates: pd.datetime.strptime(dates, '%m/%d/%Y %I:%M:%S').strftime('%m/%d/%Y %I:%M:%S %p')
data = pd.read_csv('mandol.csv', sep=';', parse_dates=['Date'], index_col = 'Date', date_parser=dateparse)
data.head()
but the result is still an error -> ValueError: unconverted data remains: AM
How can I solve this?
Use pd.to_datetime() to convert the Date column and set_index() to set it as your dataframe index.
import pandas as pd
>>>df
Date Observed Min Max Sum Count
0 09/15/2018 12:00:00 AM 2 0 2 10 5
1 09/15/2018 01:00:00 AM 1 0 2 25 20
2 09/15/2018 02:00:00 AM 1 0 1 21 21
3 09/15/2018 03:00:00 AM 1 0 2 23 22
4 09/15/2018 04:00:00 AM 1 0 1 21 21
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
>>>df
Unnamed: 0 Observed Min Max Sum Count
Date
2018-09-15 00:00:00 0 2 0 2 10 5
2018-09-15 01:00:00 1 1 0 2 25 20
2018-09-15 02:00:00 2 1 0 1 21 21
2018-09-15 03:00:00 3 1 0 2 23 22
2018-09-15 04:00:00 4 1 0 1 21 21
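Equivalently, without mutating in place:

df = df.set_index('Date')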
We can set the index to be the Date column values converted with to_datetime (I'm using pop to get values of the Date column and remove it from the DataFrame at the same time):
df.index = pd.to_datetime(df.pop('Date'))
print(df)
Output:
Observed Min Max Sum Count
Date
2018-09-15 00:00:00 2 0 2 10 5
2018-09-15 01:00:00 1 0 2 25 20
2018-09-15 02:00:00 1 0 1 21 21
2018-09-15 03:00:00 1 0 2 23 22
2018-09-15 04:00:00 1 0 1 21 21
Have a look at the set_index() method.
If you use this code, it sets the second column (Date) as the index and parses it with the standard datetime parser provided by pandas.to_datetime:
ds = pd.read_csv('mandol.csv', sep=';', index_col=1, parse_dates=True)
parse_dates=True automatically transforms the index to a pandas Datetime object.
I have a data set that looks like this:
Date | ID | Task | Description
2016-01-06 00:00:00 | 1 | 010 | This is text
2016-01-06 00:10:00 | 1 | 020 | This is text
2016-01-06 00:20:00 | 1 | 010 | This is text
2016-01-06 01:00:00 | 1 | 020 | This is text
2016-01-06 01:10:00 | 1 | 030 | This is text
2016-02-06 00:00:00 | 2 | 010 | This is text
2016-02-06 00:10:00 | 2 | 020 | This is text
2016-02-06 00:20:00 | 2 | 010 | This is text
2016-02-06 01:00:00 | 2 | 020 | This is text
2016-02-06 01:01:00 | 2 | 030 | This is text
Task 020 usually occurs after task 010, which means that when task 020 starts, task 010 ends; likewise, when any other task follows task 020, it means task 020 has stopped.
I need to group by Task calculating the average duration, total sum and count of each type of task in each ID, so I am looking for something like this:
ID | Task | Average | Sum | Count
1 | 010 | 25 | 50 | 2
1 | 020 | 10 | 20 | 2
etc | etc | etc | etc | etc
There are more IDs but I only care about 010 and 020, so whatever number is returned from them is acceptable.
Can someone help me on how to do this in Python?
I think it's a simple .groupby() that you need. Your sample output doesn't show any complicated linking between timestamps and Task or ID.
counts = df.groupby(['ID', 'Task']).size()
will give you the count of each unique ID/Task in your data. To do a sum or average, it's similar, but you need a column with something to sum.
See here for more details.
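For example, assuming you have already computed a numeric duration column (the name duration is hypothetical), one agg call gives all three statistics:

out = df.groupby(['ID', 'Task'])['duration'].agg(['mean', 'sum', 'size'])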
It seems you need agg with groupby, but the sample has no numeric column, so col was added:
print (df)
Date ID Task Description col
0 2016-01-06 00:00:00 1 010 This is text 1
1 2016-01-06 00:10:00 1 020 This is text 2
2 2016-01-06 00:20:00 1 010 This is text 6
3 2016-01-06 01:00:00 1 020 This is text 1
4 2016-01-06 01:10:00 1 030 This is text 3
5 2016-02-06 00:00:00 2 010 This is text 1
6 2016-02-06 00:10:00 2 020 This is text 8
7 2016-02-06 00:20:00 2 010 This is text 9
8 2016-02-06 01:00:00 2 020 This is text 1
df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
ID Task sum size mean
0 1 010 7 2 3.5
1 1 020 3 2 1.5
2 1 030 3 1 3.0
3 2 010 10 2 5.0
4 2 020 9 2 4.5
If you need to aggregate the datetimes, it is a bit more complicated, because you need timedeltas:
import numpy as np

df.Date = pd.to_timedelta(df.Date).dt.total_seconds()
df = (df.groupby(['ID','Task'])['Date']
        .agg(['sum','size', 'mean']).astype(np.int64).reset_index())
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])
print (df)
ID Task sum size mean
0 1 010 00:00:02.904078 2 00:00:01.452039
1 1 020 00:00:02.904081 2 00:00:01.452040
2 1 030 00:00:01.452042 1 00:00:01.452042
3 2 010 00:00:02.909434 2 00:00:01.454717
4 2 020 00:00:02.909437 2 00:00:01.454718
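Note the durations above come out in the sub-second range because pd.to_timedelta interprets bare integers as nanoseconds, while the values were built from total_seconds(); passing unit='s' keeps them on the original scale (a sketch):

df['sum'] = pd.to_timedelta(df['sum'], unit='s')
df['mean'] = pd.to_timedelta(df['mean'], unit='s')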
For finding the differences in the Date column:
print (df.Date.dtypes)
object
#if dtype of column is not datetime, first convert
df.Date = pd.to_datetime(df.Date )
print (df.Date.diff())
0 NaT
1 0 days 00:10:00
2 0 days 00:10:00
3 0 days 00:40:00
4 0 days 00:10:00
5 30 days 22:50:00
6 0 days 00:10:00
7 0 days 00:10:00
8 0 days 00:40:00
9 0 days 00:01:00
Name: Date, dtype: timedelta64[ns]
I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
And would like to generate the interval column - the minutes between rows, but only for the same id and the same day, just like in the example. In SQL I would partition by id and date and use LAG to get the time interval from the previous row. How can I do it in Pandas?
You can convert the datetime column with to_datetime, then use groupby with diff and convert the timedelta to minutes with astype:
print(df)
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
df['datetime'] = pd.to_datetime(df['datetime'])
# group by calendar date (dt.date), not day-of-month, so different months stay separate
df['new'] = df.groupby(['id', df['datetime'].dt.date])['datetime'].diff().astype('timedelta64[m]')
print(df)
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1
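On recent pandas versions .astype('timedelta64[m]') may no longer perform this conversion; dividing total seconds by 60 is an equivalent sketch:

df['new'] = df.groupby(['id', df['datetime'].dt.date])['datetime'].diff().dt.total_seconds() / 60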