I have a data set that looks like this:
Date | ID | Task | Description
2016-01-06 00:00:00 | 1 | 010 | This is text
2016-01-06 00:10:00 | 1 | 020 | This is text
2016-01-06 00:20:00 | 1 | 010 | This is text
2016-01-06 01:00:00 | 1 | 020 | This is text
2016-01-06 01:10:00 | 1 | 030 | This is text
2016-02-06 00:00:00 | 2 | 010 | This is text
2016-02-06 00:10:00 | 2 | 020 | This is text
2016-02-06 00:20:00 | 2 | 010 | This is text
2016-02-06 01:00:00 | 2 | 020 | This is text
2016-02-06 01:01:00 | 2 | 030 | This is text
Task 020 usually occurs after task 010, which means that task 010 ends when task 020 starts. The same applies to task 020: if any other task comes after it, task 020 has stopped.
I need to group by Task and calculate the average duration, total duration, and count of each type of task within each ID, so I am looking for something like this:
ID | Task | Average | Sum | Count
1 | 010 | 25 | 50 | 2
1 | 020 | 10 | 20 | 2
etc | etc | etc | etc | etc
There are more IDs but I only care about 010 and 020, so whatever number is returned from them is acceptable.
Can someone help me on how to do this in Python?
I think it's a simple .groupby() that you need. Your sample output doesn't show any complicated linking between timestamps and Task or ID.
counts = df.groupby(['ID','Task']).size()
will give you the count of each unique ID/Task in your data. To do a sum or average, it's similar, but you need a column with something to sum.
See the pandas groupby documentation for more details.
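For example, once you have derived a numeric duration column (the name duration_min below is purely hypothetical), the mean, sum and count can come from a single agg call:
stats = df.groupby(['ID', 'Task'])['duration_min'].agg(['mean', 'sum', 'count']).reset_index()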
It seems you need agg with groupby, but the sample has no numeric column, so a column col was added:
print (df)
Date ID Task Description col
0 2016-01-06 00:00:00 1 010 This is text 1
1 2016-01-06 00:10:00 1 020 This is text 2
2 2016-01-06 00:20:00 1 010 This is text 6
3 2016-01-06 01:00:00 1 020 This is text 1
4 2016-01-06 01:10:00 1 030 This is text 3
5 2016-02-06 00:00:00 2 010 This is text 1
6 2016-02-06 00:10:00 2 020 This is text 8
7 2016-02-06 00:20:00 2 010 This is text 9
8 2016-02-06 01:00:00 2 020 This is text 1
df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
ID Task sum size mean
0 1 010 7 2 3.5
1 1 020 3 2 1.5
2 1 030 3 1 3.0
3 2 010 10 2 5.0
4 2 020 9 2 4.5
If you need to aggregate the datetimes, it is a bit more complicated, because you need timedeltas:
# convert the datetime strings to a numeric value (seconds since the epoch); needs numpy imported as np
df.Date = pd.to_datetime(df.Date).astype(np.int64) // 10**9
df = (df.groupby(['ID','Task'])['Date']
        .agg(['sum','size', 'mean']).astype(np.int64).reset_index())
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])
print (df)
ID Task sum size mean
0 1 010 00:00:02.904078 2 00:00:01.452039
1 1 020 00:00:02.904081 2 00:00:01.452040
2 1 030 00:00:01.452042 1 00:00:01.452042
3 2 010 00:00:02.909434 2 00:00:01.454717
4 2 020 00:00:02.909437 2 00:00:01.454718
To find the difference between consecutive values in the Date column:
print (df.Date.dtypes)
object
#if dtype of column is not datetime, first convert
df.Date = pd.to_datetime(df.Date)
print (df.Date.diff())
0 NaT
1 0 days 00:10:00
2 0 days 00:10:00
3 0 days 00:40:00
4 0 days 00:10:00
5 30 days 22:50:00
6 0 days 00:10:00
7 0 days 00:10:00
8 0 days 00:40:00
9 0 days 00:01:00
Name: Date, dtype: timedelta64[ns]
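Putting the diff together with the groupby, here is a minimal sketch (not taken from the answers above) that treats each task as running until the next row of the same ID starts and then aggregates only tasks 010 and 020; the duration column name, the minutes unit, and Task being stored as strings are assumptions:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date'])
# duration = time until the next row within the same ID, in minutes (NaN for each ID's last row)
df['duration'] = df.groupby('ID')['Date'].diff(-1).abs().dt.total_seconds() / 60
result = (df[df['Task'].isin(['010', '020'])]
            .groupby(['ID', 'Task'])['duration']
            .agg(['mean', 'sum', 'count'])
            .reset_index())
print (result)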
Related
I have a dataframe with subjects and dates for a certain measurement. For each subject I want to find if the date in each row of the group corresponds to the first (1), second (2), third (3)... unique date value for that subject.
To clarify this is what I am looking for:
|subject | date | order|
|A | 01.01.2020 | 1|
|A | 01.01.2020 | 1|
|A | 02.01.2020 | 2|
|B | 01.01.2020 | 1|
|B | 02.01.2020 | 2|
|B | 02.01.2020 | 2|
I thought about something like the line below, but a for loop is not admissible inside the apply function:
df['order']=df.groupby(['subject']).apply(lambda x: i if x['date']=value for i, value in enumerate(x['date'].unique()))
Is there a straightforward way to do this?
Use factorize in GroupBy.transform:
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
print (df)
subject date order order1
0 A 01.01.2020 1 1
1 A 01.01.2020 1 1
2 A 02.01.2020 2 2
3 B 01.01.2020 1 1
4 B 02.01.2020 2 2
5 B 02.01.2020 2 2
Or you can use GroupBy.rank, but it is necessary to convert the date column to datetimes first:
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
print (df)
subject date order order2
0 A 2020-01-01 1 1.0
1 A 2020-01-01 1 1.0
2 A 2020-02-01 2 2.0
3 B 2020-01-01 1 1.0
4 B 2020-02-01 2 2.0
5 B 2020-02-01 2 2.0
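The conversion mentioned above could look like this (the day-first format string is an assumption based on the sample dates):
df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')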
The difference between the two solutions shows up if the order of the datetimes is changed:
print (df)
subject date order (disregarding temporal order of date)
0 A 2020-01-01 1
1 A 2020-03-01 2 <- changed datetime for sample
2 A 2020-02-01 3
3 B 2020-01-01 1
4 B 2020-02-01 2
5 B 2020-02-01 2
df['order1']=df.groupby(['subject'])['date'].transform(lambda x: pd.factorize(x)[0]) + 1
df['order2']=df.groupby(['subject'])['date'].rank(method='dense')
print (df)
subject date order order1 order2
0 A 2020-01-01 1 1 1.0
1 A 2020-03-01 1 2 3.0
2 A 2020-02-01 2 3 2.0
3 B 2020-01-01 1 1 1.0
4 B 2020-02-01 2 2 2.0
5 B 2020-02-01 2 2 2.0
In summary: use the first method (factorize) if the order column should follow the order of appearance regardless of the dates, and the second method (rank) if the temporal order of date should be reflected in the output.
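As a small usage note, rank returns floats; if integer output is preferred for the rank-based version, it can be cast:
df['order2'] = df.groupby(['subject'])['date'].rank(method='dense').astype(int)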
I want to get a count and sum of values over a +/- 7 day period of a column, after the dataframe has been grouped by a certain column.
Example data (edited to reflect my real dataset):
group | date | amount
-------------------------------------------
A | 2017-12-26 04:20:20 | 50000.0
A | 2018-01-17 00:54:15 | 60000.0
A | 2018-01-27 06:10:12 | 150000.0
A | 2018-02-01 01:15:06 | 100000.0
A | 2018-02-11 05:05:34 | 150000.0
A | 2018-03-01 11:20:04 | 150000.0
A | 2018-03-16 12:14:01 | 150000.0
A | 2018-03-23 05:15:07 | 150000.0
A | 2018-04-02 10:40:35 | 150000.0
Group by group, then sum amount where date-7 < date < date+7.
Results that I want:
group | date | amount | grouped_sum
-----------------------------------------------------------
A | 2017-12-26 04:00:00 | 50000.0 | 50000.0
A | 2018-01-17 00:00:00 | 60000.0 | 60000.0
A | 2018-01-27 06:00:00 | 150000.0 | 250000.0
A | 2018-02-01 01:00:00 | 100000.0 | 250000.0
A | 2018-02-11 05:05:00 | 150000.0 | 150000.0
A | 2018-03-01 11:00:04 | 150000.0 | 150000.0
A | 2018-03-16 12:00:01 | 150000.0 | 150000.0
A | 2018-03-23 05:00:07 | 100000.0 | 100000.0
A | 2018-04-02 10:00:00 | 100000.0 | 100000.0
Quick snippet to achieve the dataset:
group = 9 * ['A']
date = pd.to_datetime(['2017-12-26 04:20:20', '2018-01-17 00:54:15',
'2018-01-27 06:10:12', '2018-02-01 01:15:06',
'2018-02-11 05:05:34', '2018-03-01 11:20:04',
'2018-03-16 12:14:01', '2018-03-23 05:15:07',
'2018-04-02 10:40:35'])
amount = [50000.0, 60000.0, 150000.0, 100000.0, 150000.0,
150000.0, 150000.0, 150000.0, 150000.0]
df = pd.DataFrame({'group':group, 'date':date, 'amount':amount})
A bit of explanation (these numbers refer to the original, smaller sample, which the answers below still use, with groups A and B and amounts of 10 and 20):
The 2nd row is 40 because it sums the data for A in the period 2018-01-14 and 2018-01-15.
The 4th row is 30 because it sums the data for B in the period 2018-01-03 plus the next 7 days.
The 6th row is 30 because it sums the data for B in the period 2018-01-03 plus the previous 7 days.
I don't have any idea how to do a sum over a date range. I might be able to do it this way:
1. Create extra columns that show date-7 and date+7 for each row:
group | date | amount | date-7 | date+7
-------------------------------------------------------------
A | 2017-12-26 | 50000.0 | 2017-12-19 | 2018-01-02
A | 2018-01-17 | 60000.0 | 2018-01-10 | 2018-01-24
2. Calculate the amount within the date range: df[(df.group == 'A') & (df.date > df['date-7']) & (df.date < df['date+7'])].amount.sum()
3. But this method is quite tedious.
EDIT (2018-09-01):
I found the method below, based on @jezrael's answer, which works for me but only for a single group:
t = pd.Timedelta(7, unit='d')

def g(row):
    res = df[(df.created > row.created - t) & (df.created < row.created + t)].amount.sum()
    return res

df['new'] = df.apply(g, axis=1)
Here the problem is that you need a loop over each row within each group:
t = pd.Timedelta(7, unit='d')

def f(x):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f)
print (df)
group date amount new
0 A 2018-01-01 10 10.0
1 A 2018-01-14 20 40.0
2 A 2018-01-15 20 40.0
3 B 2018-02-03 10 30.0
4 B 2018-02-04 10 30.0
5 B 2018-02-05 10 30.0
Thanks to @jpp for this improvement:
def f(x, t):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f, pd.Timedelta(7, unit='d'))
To verify the solution (here restricted to group A):
t = pd.Timedelta(7, unit='d')
df = df[df['group'] == 'A']

def test(y):
    a = df.loc[df['date'].between(y['date'] - t, y['date'] + t, inclusive=False)]
    print (a)
    print (a['amount'])
    return a['amount'].sum()
group date amount
0 A 2018-01-01 10
0 10
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
df['new'] = df.apply(test,axis=1)
print (df)
group date amount new
0 A 2018-01-01 10 10
1 A 2018-01-14 20 40
2 A 2018-01-15 20 40
Add a column with the first day of each week:
df['week_start'] = df['date'].dt.to_period('W').apply(lambda x: x.start_time)
Result:
group date amount week_start
0 A 2018-01-01 10 2017-12-26
1 A 2018-01-14 20 2018-01-09
2 A 2018-01-15 20 2018-01-09
3 B 2018-02-03 10 2018-01-30
4 B 2018-02-04 10 2018-01-30
5 B 2018-02-05 10 2018-01-30
Group by new column and find weekly total amount:
grouped_sum = df.groupby('week_start')['amount'].sum().reset_index()
Result:
week_start amount
0 2017-12-26 10
1 2018-01-09 40
2 2018-01-30 30
Merge dataframes on week_start:
pd.merge(df.drop('amount', axis=1), grouped_sum, on='week_start').drop('week_start', axis=1)
Result:
group date amount
0 A 2018-01-01 10
1 A 2018-01-14 40
2 A 2018-01-15 40
3 B 2018-02-03 30
4 B 2018-02-04 30
5 B 2018-02-05 30
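As a sketch, the merge step could also be replaced with transform, which writes the weekly total back onto each row directly (the column name weekly_sum is an assumption):
df['weekly_sum'] = df.groupby('week_start')['amount'].transform('sum')
Note that, like the merge above, this sums per calendar week rather than over a rolling +/- 7 day window.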
I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This is the end result I would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
Logic
To calculate the cumulative occurrences of values in id, but within a specified time window, for example 4 months (i.e., every 5th month the counter effectively resets to one).
To get the cumulative occurrences we can use df.groupby('id').cumcount() + 1
Focusing on id = B, we see that the 2nd occurrence of B is after 2 months, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    # count of rows in the trailing 4-month window for each date in this ID group
    y = pd.Series([x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index],
                  index=x.index)
    gr = x.groupby('date')
    # rank within each same-day group minus that group's size, broadcast per row
    adjust = gr.rank(method='first') - gr.transform('size')
    y += adjust
    return y
df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
id id_code cum_count_desired cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
The need for adjust
If the same ID occurs multiple times on the same day, the slicing approach I use will overcount each of the same-day IDs, because the date-based slice grabs all of the same-day values as soon as the list comprehension reaches a date on which multiple IDs show up. Fix:
Group the current DataFrame by date.
Rank each row in each date group.
Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
Add these non-positive integer adjustments to y.
This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
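A tiny, self-contained illustration of the adjustment for two rows that share the same date (the values 1, 1 stand in for the id codes):
import pandas as pd
s = pd.Series([1, 1], index=pd.to_datetime(['2017-09-01', '2017-09-01']).rename('date'))
gr = s.groupby('date')
adjust = gr.rank(method='first') - gr.transform('size')
# adjust is -1.0 for the first same-day row and 0.0 for the second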
Including or excluding the left endpoint of the time interval
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
You can extend the groupby with a grouper:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
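To match the 1-based cum_count in the question, add 1; note that the Grouper bins dates into fixed 4-month calendar periods rather than looking back a rolling 4 months from each row:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount() + 1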
We can make use of .apply row-wise to work on a sliced df as well. The slice is based on relativedelta from dateutil.
from dateutil.relativedelta import relativedelta

def get_cum_sum(slice, row):
    if slice.shape[0] == 0:
        return 1
    return slice[slice['id'] == row.id].shape[0]

d = {'dd_mm_yy': ['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17','01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],
     'id': ['A','B','C','B','D','A','D','C','B','B']}
df = pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(df[(df.index <= current_row.name) & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))], current_row), axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I was also thinking about whether it is feasible to use .rolling, but months are not a fixed period, so it might not work.
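As a rough sketch of that .rolling idea, the 4 months could be approximated by a fixed 120-day window (an approximation, since calendar months vary in length; the helper column one is hypothetical):
tmp = df.sort_values('dd_mm_yy').set_index('dd_mm_yy')
tmp['one'] = 1  # dummy numeric column to count
# count of rows per id within the trailing 120 days, indexed by (id, dd_mm_yy)
approx_counts = tmp.groupby('id')['one'].rolling('120D').count()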
I have this dataframe (type could be 1 or 2):
user_id | timestamp | type
1 | 2015-5-5 12:30 | 1
1 | 2015-5-5 14:00 | 2
1 | 2015-5-5 15:00 | 1
I want to group my data by six hours and when doing this I want to keep type as:
1 (if there is only 1 within that 6 hour frame)
2 (if there is only 2 within that 6 hour frame) or
3 (if there was both 1 and 2 within that 6 hour frame)
Here is my code:
df = df.groupby(['user_id', pd.TimeGrouper(freq='6H')]).mean()
which produces:
user_id | timestamp | type
1 | 2015-5-5 12:00 | 4
However, I want to get 3 instead of 4. How can I replace the mean() in my groupby code to produce the desired output?
Try this:
In [54]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
.agg({'type':lambda x: x.unique().sum()})
Out[54]:
type
user_id timestamp
1 2015-05-05 12:00:00 3
PS: it'll work only with the given types (1, 2), as their sum is 3
Another data set:
In [56]: df
Out[56]:
user_id timestamp type
0 1 2015-05-05 12:30:00 1
1 1 2015-05-05 14:00:00 1
2 1 2015-05-05 15:00:00 1
3 1 2015-05-05 20:00:00 1
In [57]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type':lambda x: x.unique().sum()})
Out[57]:
type
user_id timestamp
1 2015-05-05 12:00:00 1
2015-05-05 18:00:00 1
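If the type codes were arbitrary values rather than 1 and 2, the sum trick above would break; a hedged alternative is to encode "both present" explicitly, keeping 3 for "both" as in the question:
df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
  .agg({'type': lambda x: 3 if {1, 2} <= set(x) else x.iloc[0]})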
There are a number of things I would typically do in SQL and Excel that I'm trying to do with Pandas. There are a few different wrangling problems here, combined into one question because they all have the same goal.
I have a data frame df in python with three columns:
| EventID | PictureID | Date
0 | 1 | A | 2010-01-01
1 | 2 | A | 2010-02-01
2 | 3 | A | 2010-02-15
3 | 4 | B | 2010-01-01
4 | 5 | C | 2010-02-01
5 | 6 | C | 2010-02-15
EventIDs are unique. PictureIDs are not unique, although PictureID + Date are distinct.
I. First I would like to add a new column:
df['period'] = the month and year that the event falls into beginning 2010-01.
II. Second, I would like to 'melt' the data into some new dataframe that counts the number of events for a given PictureID in a given period. I'll use examples with just two periods.
| PictureID | Period | Count
0 | A | 2010-01 | 1
1 | A | 2010-02 | 2
2 | B | 2010-01 | 1
3 | C | 2010-02 | 2
So that I can then stack (?) this new data frame into something that provides period counts for all unique PictureIDs:
| PictureID | 2010-01 | 2010-02
0 | A | 1 | 2
1 | B | 1 | 0
2 | C | 0 | 2
My sense is that pandas is built to do this sort of thing easily, is that correct?
[Edit: Removed a confused third part.]
For the first two parts you can do:
>>> df['Period'] = df['Date'].map(lambda d: d.strftime('%Y-%m'))
>>> df
EventID PictureID Date Period
0 1 A 2010-01-01 00:00:00 2010-01
1 2 A 2010-02-01 00:00:00 2010-02
2 3 A 2010-02-15 00:00:00 2010-02
3 4 B 2010-01-01 00:00:00 2010-01
4 5 C 2010-02-01 00:00:00 2010-02
5 6 C 2010-02-15 00:00:00 2010-02
>>> grouped = df[['Period', 'PictureID']].groupby('Period')
>>> grouped['PictureID'].value_counts().unstack(0).fillna(0)
Period 2010-01 2010-02
A 1 2
B 1 0
C 0 2
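An equivalent way to get the same wide table of counts (column order may differ) is pd.crosstab:
>>> pd.crosstab(df['PictureID'], df['Period'])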
For the third part, either I haven't understood the question well, or you haven't posted the correct numbers in the example, since the count for the A in the 3rd row should be 2, and for the C in the 6th row should be 1, if the period is six months...
Either way you should do something like this:
>>> ts = df.set_index('Date')
>>> ts.resample('6M', ...)
Update: This is a pretty ugly way to do it, I think I saw a better way to do it, but I can't find the SO question. But, this will also get the job done...
def for_half_year(row, data):
    date = row['Date']
    pid = row['PictureID']
    # Do this 6 month checking better
    if '__start' not in data or (date - data['__start']).days > 6*30:
        # Reset values
        for key in data:
            data[key] = 0
        data['__start'] = date
    data[pid] = data.get(pid, -1) + 1
    return data[pid]
df['PastSix'] = df.apply(for_half_year, args=({},), axis=1)