Pandas: get the first occurrence grouping by keys - python

If I have the following dataframe:
| id | timestamp | code | id2
| 10 | 2017-07-12 13:37:00 | 206 | a1
| 10 | 2017-07-12 13:40:00 | 206 | a1
| 10 | 2017-07-12 13:55:00 | 206 | a1
| 10 | 2017-07-12 19:00:00 | 206 | a2
| 11 | 2017-07-12 13:37:00 | 206 | a1
...
I need to group by id, id2 columns and get the first occurrence of timestamp value, e.g. for id=10, id2=a1, timestamp=2017-07-12 13:37:00.
I googled it and found some possible solutions, but I can't figure out how to apply them properly. It should probably be something like:
df.groupby(["id", "id2"])["timestamp"].apply(lambda x: ....)

I think you need GroupBy.first:
df.groupby(["id", "id2"])["timestamp"].first()
Or drop_duplicates:
df.drop_duplicates(subset=['id','id2'])
To get the same output from both approaches:
df1 = df.groupby(["id", "id2"], as_index=False)["timestamp"].first()
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
df1 = df.drop_duplicates(subset=['id','id2'])[['id','id2','timestamp']]
print (df1)
id id2 timestamp
0 10 a1 2017-07-12 13:37:00
1 10 a2 2017-07-12 19:00:00
2 11 a1 2017-07-12 13:37:00
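Both approaches above return the first row of each group in the frame's existing order. A minimal sketch of what happens when the rows are not already sorted by timestamp follows; the shuffled sample rows are an assumption made for illustration, not part of the original data. Here "first" and "earliest" differ, so either aggregate with min() or sort before dropping duplicates.

```python
import pandas as pd

# Sample data from the question, with the first two rows swapped on purpose
# (illustrative assumption) so that the frame is NOT sorted by timestamp.
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 11],
    "timestamp": pd.to_datetime([
        "2017-07-12 13:40:00", "2017-07-12 13:37:00", "2017-07-12 13:55:00",
        "2017-07-12 19:00:00", "2017-07-12 13:37:00",
    ]),
    "code": [206] * 5,
    "id2": ["a1", "a1", "a1", "a2", "a1"],
})

# Earliest timestamp per (id, id2), independent of row order.
earliest = df.groupby(["id", "id2"], as_index=False)["timestamp"].min()

# Full rows holding the earliest timestamp: sort first, then keep the first row per key.
first_rows = (df.sort_values("timestamp")
                .drop_duplicates(subset=["id", "id2"])
                .sort_index())

print(earliest)
print(first_rows)
```

If the frame is known to be sorted by timestamp already, first() and drop_duplicates() as shown above give the same rows without the extra sort.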

One can create a new column after merging id and id2 strings, then remove rows where it is duplicated:
df['newcol'] = df.apply(lambda x: str(x.id) + str(x.id2), axis=1)
df = df[~df.newcol.duplicated()].iloc[:,:4] # iloc used to remove new column.
print(df)
Output:
id timestamp code id2
0 10 2017-07-12 13:37:00 206 a1
3 10 2017-07-12 19:00:00 206 a2
4 11 2017-07-12 13:37:00 206 a1
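The helper column is not strictly needed. As a hedged sketch, DataFrame.duplicated accepts a subset of columns directly, which also avoids a possible pitfall of string concatenation: two different key pairs can concatenate to the same string (for example id=1, id2="0a" and id=10, id2="a"). The sample frame below is rebuilt from the question's table.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [10, 10, 10, 10, 11],
    "timestamp": pd.to_datetime([
        "2017-07-12 13:37:00", "2017-07-12 13:40:00", "2017-07-12 13:55:00",
        "2017-07-12 19:00:00", "2017-07-12 13:37:00",
    ]),
    "code": [206] * 5,
    "id2": ["a1", "a1", "a1", "a2", "a1"],
})

# Mark every row whose (id, id2) pair has already appeared, then keep the rest.
result = df[~df.duplicated(subset=["id", "id2"])]
print(result)
```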

Related

How can I find the nearest date after another date in a different column in grouping by ID using R?

I am looking for a method that will look at each date in "Date A" and find the next nearest date after that value in "Date B" by ID (group_by). I then want to calculate the difference in days. Below is the table that I would like.
ID | Date A | Date B | Difference|
11111 | 09/01/21 | 09/03/21 | 2 |
22222 | 09/06/21 | 09/20/21 | 11 |
11111 | 09/08/21 | 09/18/21 | 10 |
44444 | 09/04/21 | NA | 11 |
44444 | 09/10/21 | 09/15/21 | 5 |
22222 | NA | 09/17/21 | NA |
77777 | NA | 10/16/21 | NA |
77777 | 09/04/21 | 10/17/21 | 24 |
77777 | 09/01/21 | 09/28/21 | 27 |
If you could please help me out with this, I would greatly appreciate it!
Cheers
A dplyr solution via group_by is not obvious to me here, but here is a relatively straightforward sqldf solution. Presumably it could be translated into dplyr if you really wanted.
First mock up the data within R
df <- dplyr::tribble(
~'ID', ~'Date A', ~'Date B',
11111, '09/01/21', '09/03/21',
22222, '09/06/21', '09/20/21',
11111, '09/08/21', '09/18/21',
44444, '09/04/21', NA ,
44444, '09/10/21', '09/15/21',
22222, NA , '09/17/21',
77777, NA , '10/16/21',
77777, '09/04/21', '10/17/21',
77777, '09/01/21', '09/28/21'
)
df$`Date A` <- lubridate::mdy(df$`Date A`)
df$`Date B` <- lubridate::mdy(df$`Date B`)
df
Which looks like
# A tibble: 9 x 3
ID `Date A` `Date B`
<dbl> <date> <date>
1 11111 2021-09-01 2021-09-03
2 22222 2021-09-06 2021-09-20
3 11111 2021-09-08 2021-09-18
4 44444 2021-09-04 NA
5 44444 2021-09-10 2021-09-15
6 22222 NA 2021-09-17
7 77777 NA 2021-10-16
8 77777 2021-09-04 2021-10-17
9 77777 2021-09-01 2021-09-28
Then do an inequality join combined with a group by. The column I is added as a row identifier so that nuances of the data, such as the same Date A appearing more than once within an ID, are handled correctly.
df$I <- 1:nrow(df)
df <- sqldf::sqldf('
SELECT a.I, a.ID, a."Date A", a."Date B",
MIN(b."Date B") AS NextB
FROM df a
LEFT JOIN df b
ON a.ID = b.ID
AND a."Date A" < b."Date B"
GROUP BY a.I, a.ID, a."Date A", a."Date B"
ORDER BY a.I
')
df$Difference = df$NextB - as.integer(df$`Date A`)
df$I <- NULL
df$NextB <- NULL
df
Which matches your example data (and should generalize well for edge cases not in your example data). Unclear how well it might scale up to non-trivial data.
ID Date A Date B Difference
1 11111 2021-09-01 2021-09-03 2
2 22222 2021-09-06 2021-09-20 11
3 11111 2021-09-08 2021-09-18 10
4 44444 2021-09-04 <NA> 11
5 44444 2021-09-10 2021-09-15 5
6 22222 <NA> 2021-09-17 NA
7 77777 <NA> 2021-10-16 NA
8 77777 2021-09-04 2021-10-17 24
9 77777 2021-09-01 2021-09-28 27
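For readers working in pandas rather than R, the same inequality-join idea can be sketched as a self-merge followed by a filter and a group-wise minimum. This is only a translation of the SQL above under the assumption that the data fits comfortably in memory (the merge is quadratic within each ID); the helper column name I mirrors the R code, and the variable names pairs and next_b are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [11111, 22222, 11111, 44444, 44444, 22222, 77777, 77777, 77777],
    "Date A": ["09/01/21", "09/06/21", "09/08/21", "09/04/21", "09/10/21",
               None, None, "09/04/21", "09/01/21"],
    "Date B": ["09/03/21", "09/20/21", "09/18/21", None, "09/15/21",
               "09/17/21", "10/16/21", "10/17/21", "09/28/21"],
})
for col in ["Date A", "Date B"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%y")

# Row identifier, so repeated Date A values within an ID stay distinct.
df["I"] = range(len(df))

# Pair every row with every Date B of the same ID, keep only later Date B values.
pairs = df[["I", "ID", "Date A"]].merge(df[["ID", "Date B"]], on="ID", how="left")
pairs = pairs[pairs["Date A"] < pairs["Date B"]]

# Nearest following Date B per original row, then the gap in days.
next_b = pairs.groupby("I")["Date B"].min()
df["Difference"] = (df["I"].map(next_b) - df["Date A"]).dt.days
df = df.drop(columns="I")
print(df)
```

Rows with a missing Date A never satisfy the inequality, so their Difference stays missing, as in the table above.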

Downsample to quarter level and get quarter end date value in Pandas

My data frame has daily values from 2005-01-01 to 2021-10-31.
| C1 | C2
-----------------------------
2005-01-01 | 2.7859 | -7.790
2005-01-02 |-0.7756 | -0.97
2005-01-03 |-6.892 | 2.770
2005-01-04 | 2.785 | -0.97
. . .
. . .
2021-10-28 | 6.892 | 2.785
2021-10-29 | 2.785 | -6.892
2021-10-30 |-6.892 | -0.97
2021-10-31 |-0.7756 | 2.34
I want to downsample this data frame to get quarter value as follows.
| C1 | C2
------------------------------
2005-03-31 | 2.7859 | -7.790
2005-06-30 |-0.7756 | -0.97
2005-09-30 |-6.892 | 2.770
2005-12-31 | 2.785 | -0.97
I tried to do it with Pandas resample method but it requires an aggregation method.
df = df.resample('Q').mean()
I don't want an aggregated value; I want the value at the quarter-end date as it is.
Your code works except you are not using the right function. Replace mean by last:
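import numpy as np
import pandas as pd
dti = pd.date_range('2005-01-01', '2021-10-31', freq='D')
df = pd.DataFrame(np.random.random((len(dti), 2)), columns=['C1', 'C2'], index=dti)
# 'Q' is the quarter-end alias; pandas >= 2.2 deprecates it in favour of 'QE'
dfQ = df.resample('Q').last()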
print(dfQ)
# Output:
C1 C2
2005-03-31 0.653733 0.334182
2005-06-30 0.425229 0.316189
2005-09-30 0.055675 0.746406
2005-12-31 0.394051 0.541684
2006-03-31 0.525208 0.413624
... ... ...
2020-12-31 0.662081 0.887147
2021-03-31 0.824541 0.363729
2021-06-30 0.064824 0.621555
2021-09-30 0.126891 0.549009
2021-12-31 0.126217 0.044822
[68 rows x 2 columns]
You can also filter on the index directly:
df = df[df.index.is_quarter_end]
This keeps only the rows whose date is the last day of a quarter.
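The two answers differ at the edges. A minimal sketch, using a deterministic stand-in column instead of random data (an assumption for illustration): filtering on is_quarter_end keeps only rows that actually fall on a quarter end, while taking the last row per quarter also keeps the last available day of an incomplete final quarter, under its real date rather than a quarter-end label.

```python
import numpy as np
import pandas as pd

dti = pd.date_range("2005-01-01", "2021-10-31", freq="D")
df = pd.DataFrame({"C1": np.arange(len(dti), dtype=float)}, index=dti)

# Only rows dated exactly on a quarter end (2021-Q4 is dropped: data stops on 10-31).
exact = df[df.index.is_quarter_end]

# Last available row of every quarter, keeping its true date.
last_available = df.groupby(df.index.to_period("Q")).tail(1)

print(len(exact), exact.index[-1])
print(len(last_available), last_available.index[-1])
```

Which one is appropriate depends on whether a partial quarter should appear in the result at all.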

Pandas combining sparse columns in dataframe

I am using Python and Pandas for data analysis. I have data sparsely distributed across different columns, like the following:
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data into tightly packed columns, like the following:
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
This is what I have tried, but I am not able to get it working properly:
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
I think there must be a more concise and efficient way of doing this. How can I do it better?
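You can use df.stack(), assuming 'id' is your index (otherwise set 'id' as the index first). Then use pd.pivot_table.
df = df.stack().reset_index(name='val', level=1)
# group number comes from the digits in the column name, e.g. 'col1a' -> 'g1'
df['group'] = 'g' + df['level_1'].str.extract(r'col(\d+)', expand=False)
# drop the digits so 'col1a' becomes 'cola'
df['level_1'] = df['level_1'].str.replace(r'\d+', '', regex=True)
df.pivot_table(index=['id','group'], columns='level_1', values='val')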
level_1 cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
Another alternative with pd.wide_to_long
m = pd.wide_to_long(df,['col'],'id','j',suffix=r'\d+\w+').reset_index()
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
.assign(group=lambda x:x[0].radd('g'))
.set_index(['id','group',1])['col'].unstack().dropna()
.rename_axis(None,axis=1).add_prefix('col').reset_index())
id group cola colb
0 1 g1 11 12
1 2 g2 21 86
2 3 g1 22 87
3 4 g3 545 32
Use:
import re
def fx(s):
    s = s.dropna()
    group = 'g' + re.search(r'\d+', s.index[0])[0]
    return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])
df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g1 22.0 87.0
3 4 g3 545.0 32.0
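This would be a way of doing it (note that it derives group from id rather than from the column number, so the group labels differ from the requested output):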
df = pd.DataFrame({'id':[1,2,3,4],
'col1a':[11,np.nan,22,np.nan],
'col1b':[12,np.nan,87,np.nan],
'col2a':[np.nan,21,np.nan,np.nan],
'col2b':[np.nan,86,np.nan,np.nan],
'col3a':[np.nan,np.nan,np.nan,545],
'col3b':[np.nan,np.nan,np.nan,32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output:
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g3 22.0 87.0
3 4 g4 545.0 32.0
So if you have more suffixes (colc, cold, cole, colf, etc...) you can create a loop and then use:
suffixes = ['a','b','c','d','e','f']
cols = ['id','group'] + ['col'+x for x in suffixes]
for i in suffixes:
    df_new['col'+i] = df_new[[x for x in df_new.columns if x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]
Thanks to #CeliusStingher for providing the code for the dataframe.
One suggestion is to set id as the index, then rebuild the column names from the numbers extracted from the text, create a MultiIndex, and stack to get the final result:
#set id as index
df = df.set_index("id")
#pull out the numbers from each column
#so that you have (cola,1), (colb,1) ...
#add g to the numbers ... (cola, g1),(colb,g1), ...
#create a MultiIndex
#and reassign to the columns
df.columns = pd.MultiIndex.from_tuples(
    [("".join((first, last)), f"g{second}")
     for first, second, last in df.columns.str.split(r"(\d)")],
    names=[None, "group"])
#stack the data
#to get your result
df.stack()
cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0
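The reshaping answers above share one idea: go to long form, split each column name into a group number and a suffix, then go wide again. A minimal sketch of that idea with melt and pivot follows, assuming column names always look like col<digits><letters> and pandas 1.1 or later (for list-valued index in pivot); the intermediate names long, parts, and name are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "col1a": [11, np.nan, 22, np.nan],
    "col1b": [12, np.nan, 87, np.nan],
    "col2a": [np.nan, 21, np.nan, np.nan],
    "col2b": [np.nan, 86, np.nan, np.nan],
    "col3a": [np.nan, np.nan, np.nan, 545],
    "col3b": [np.nan, np.nan, np.nan, 32],
})

# Long form: one row per (id, original column), empty cells dropped.
long = df.melt(id_vars="id", var_name="column", value_name="val").dropna(subset=["val"])

# 'col1a' -> group 'g1', target column 'cola'.
parts = long["column"].str.extract(r"col(?P<num>\d+)(?P<suffix>[a-z]+)")
long["group"] = "g" + parts["num"]
long["name"] = "col" + parts["suffix"]

# Back to wide form: one row per (id, group).
out = (long.pivot(index=["id", "group"], columns="name", values="val")
           .reset_index()
           .rename_axis(None, axis=1))
print(out)
```

Unlike the row-wise sum approach, this keeps the group label tied to the column number, so ids 1 and 3 both land in g1.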

Pandas group by then count & sum based on date range +/- x-days

I want to get the count and sum of a column's values over a +/- 7 day window around each row's date, after grouping the dataframe by a certain column.
Example data (edited to reflect my real dataset):
group | date | amount
-------------------------------------------
A | 2017-12-26 04:20:20 | 50000.0
A | 2018-01-17 00:54:15 | 60000.0
A | 2018-01-27 06:10:12 | 150000.0
A | 2018-02-01 01:15:06 | 100000.0
A | 2018-02-11 05:05:34 | 150000.0
A | 2018-03-01 11:20:04 | 150000.0
A | 2018-03-16 12:14:01 | 150000.0
A | 2018-03-23 05:15:07 | 150000.0
A | 2018-04-02 10:40:35 | 150000.0
group by group then sum based on date-7 < date < date+7
Results that I want:
group | date | amount | grouped_sum
-----------------------------------------------------------
A | 2017-12-26 04:00:00 | 50000.0 | 50000.0
A | 2018-01-17 00:00:00 | 60000.0 | 60000.0
A | 2018-01-27 06:00:00 | 150000.0 | 250000.0
A | 2018-02-01 01:00:00 | 100000.0 | 250000.0
A | 2018-02-11 05:05:00 | 150000.0 | 150000.0
A | 2018-03-01 11:00:04 | 150000.0 | 150000.0
A | 2018-03-16 12:00:01 | 150000.0 | 150000.0
A | 2018-03-23 05:00:07 | 100000.0 | 100000.0
A | 2018-04-02 10:00:00 | 100000.0 | 100000.0
Quick snippet to achieve the dataset:
group = 9 * ['A']
date = pd.to_datetime(['2017-12-26 04:20:20', '2018-01-17 00:54:15',
'2018-01-27 06:10:12', '2018-02-01 01:15:06',
'2018-02-11 05:05:34', '2018-03-01 11:20:04',
'2018-03-16 12:14:01', '2018-03-23 05:15:07',
'2018-04-02 10:40:35'])
amount = [50000.0, 60000.0, 150000.0, 100000.0, 150000.0,
150000.0, 150000.0, 150000.0, 150000.0]
df = pd.DataFrame({'group':group, 'date':date, 'amount':amount})
Bit of explanation:
2nd row is 40 because it sums data for A in period 2018-01-14 and 2018-01-15
4th row is 30 because it sums data for B in period 2018-01-03 + next 7 days
6th row is 30 because it sums data for B in period 2018-01-03 + prev 7 days.
I don't know how to sum over a date range. I might be able to do it this way:
1. Create additional columns that show date-7 and date+7 for each row
group | date | amount | date-7 | date+7
-------------------------------------------------------------
A | 2017-12-26 | 50000.0 | 2017-12-19 | 2018-01-02
A | 2018-01-17 | 60000.0 | 2018-01-10 | 2018-01-24
2.calculate amount between the date range: df[df.group == 'A' & df.date > df.date-7 & df.date < df.date+7].amount.sum()
3.But this method is quite tedious.
EDIT (2018-09-01):
I found the method below, based on #jezrael's answer. It works for me, but only for a single group:
t = pd.Timedelta(7, unit='d')
def g(row):
    # sum the amounts whose date lies strictly within 7 days of this row's date
    res = df[(df['date'] > row['date'] - t) & (df['date'] < row['date'] + t)].amount.sum()
    return res
df['new'] = df.apply(g, axis=1)
The problem here is that you need to loop over each row within each group:
t = pd.Timedelta(7, unit='d')
def f(x):
    # inclusive='neither' (pandas >= 1.3) excludes both endpoints; older versions used inclusive=False
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive='neither'), 'amount'].sum(), axis=1)
df['new'] = df.groupby('group', group_keys=False).apply(f)
print (df)
group date amount new
0 A 2018-01-01 10 10.0
1 A 2018-01-14 20 40.0
2 A 2018-01-15 20 40.0
3 B 2018-02-03 10 30.0
4 B 2018-02-04 10 30.0
5 B 2018-02-05 10 30.0
Thanks to #jpp for the improvement:
def f(x, t):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive='neither'), 'amount'].sum(), axis=1)
df['new'] = df.groupby('group', group_keys=False).apply(f, pd.Timedelta(7, unit='d'))
Verify solution:
t = pd.Timedelta(7, unit='d')
df = df[df['group'] == 'A']
def test(y):
    a = df.loc[df['date'].between(y['date'] - t, y['date'] + t, inclusive='neither')]
    print (a)
    print (a['amount'])
    return a['amount'].sum()
group date amount
0 A 2018-01-01 10
0 10
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
df['new'] = df.apply(test,axis=1)
print (df)
group date amount new
0 A 2018-01-01 10 10
1 A 2018-01-14 20 40
2 A 2018-01-15 20 40
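The row-by-row apply above does one filter per row. As a hedged alternative sketch (a different technique, not the answer's method), the same +/- 7 day sums can be computed with a self-merge on group followed by a window filter. It assumes a default RangeIndex and that each group is small enough for the quadratic merge; the helper names row, pairs, and in_window are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2018-01-01", "2018-01-14", "2018-01-15",
                            "2018-02-03", "2018-02-04", "2018-02-05"]),
    "amount": [10, 20, 20, 10, 10, 10],
})
t = pd.Timedelta(7, unit="d")

# Keep the original row label so results can be aligned back to df.
left = df.reset_index().rename(columns={"index": "row"})

# Pair every row with every other row of the same group.
pairs = left.merge(df, on="group", suffixes=("", "_other"))

# Strict window: other date lies within 7 days on either side.
in_window = (pairs["date_other"] > pairs["date"] - t) & (pairs["date_other"] < pairs["date"] + t)

# Sum the neighbours' amounts per original row; assignment aligns on the row label.
df["new"] = pairs[in_window].groupby("row")["amount_other"].sum()
print(df)
```

Every row matches itself, so no row is left without a sum.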
Add column with first days of the week:
df['week_start'] = df['date'].dt.to_period('W').apply(lambda x: x.start_time)
Result:
group date amount week_start
0 A 2018-01-01 10 2017-12-26
1 A 2018-01-14 20 2018-01-09
2 A 2018-01-15 20 2018-01-09
3 B 2018-02-03 10 2018-01-30
4 B 2018-02-04 10 2018-01-30
5 B 2018-02-05 10 2018-01-30
Group by new column and find weekly total amount:
grouped_sum = df.groupby('week_start')['amount'].sum().reset_index()
Result:
week_start amount
0 2017-12-26 10
1 2018-01-09 40
2 2018-01-30 30
Merge dataframes on week_start:
pd.merge(df.drop('amount', axis=1), grouped_sum, on='week_start').drop('week_start', axis=1)
Result:
group date amount
0 A 2018-01-01 10
1 A 2018-01-14 40
2 A 2018-01-15 40
3 B 2018-02-03 30
4 B 2018-02-04 30
5 B 2018-02-05 30
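Note that fixed calendar weeks are not the same thing as a sliding +/- 7 day window. A small sketch, assuming pandas' default weekly period ('W' is anchored to weeks ending on Sunday): two dates one day apart can fall into different weeks, and the exact week boundaries depend on that anchoring.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-14", "2018-01-15"]))

# Start of the calendar week (Monday, with the default Sunday-ending period).
week_start = dates.dt.to_period("W").dt.start_time

# 2018-01-14 (a Sunday) and 2018-01-15 (a Monday) are one day apart...
within_7_days = (dates[2] - dates[1]) < pd.Timedelta(7, unit="d")

print(week_start.tolist())
print(within_7_days)
```

Weekly bucketing is therefore only a substitute for the windowed sum when calendar-week totals are what is actually wanted.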

