I need to select the last row for each user_id and date, but when the last value in the metric column is 'leave' I want to select the last 2 rows (if they exist).
My data:
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ['enter', 'stay', 'leave', 'enter', 'leave', 'enter'],
    "date": ['2020-01-01', '2020-01-01', '2020-03-01', '2020-01-01', '2020-01-01', '2020-01-02'],
})
#result
user_id subscription metric date
0 1 1 enter 2020-01-01
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
Expected output:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01 # kept because the last metric is 'leave' within the [user_id, date] group
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
What I've tried: drop_duplicates and groupby; both give the same result, keeping only the last row per group:
df.drop_duplicates(['user_id', 'date'], keep='last')
#or
df.groupby(['user_id', 'date']).tail(1)
You can use boolean masking: build three boolean conditions in variables a, b and c, then keep the rows where any of the three is True by combining them with the or operator |:
a = df.groupby(['user_id', 'date', df.groupby(['user_id', 'date']).cumcount()])['metric'].transform('last') == 'leave'
b = df.groupby(['user_id', 'date'])['metric'].transform('count') == 1
c = a.shift(-1) & (b == False)
df = df[a | b | c]
print(a, b, c)
df
#a groups by the two required keys plus the within-group cumulative count, which is necessary in order to return True for the last "metric" within the group.
0 False
1 False
2 True
3 False
4 True
5 False
Name: metric, dtype: bool
#b if a group has a count of one, then you want to keep that single row.
0 False
1 False
2 True
3 False
4 False
5 True
Name: metric, dtype: bool
#c simply uses .shift(-1) to flag the row immediately before a 'leave' row. For the condition to be satisfied, the count for that group must be > 1.
0 False
1 True
2 False
3 True
4 False
5 False
Name: metric, dtype: bool
Out[18]:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
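For comparison, the same filtering can be expressed without the extra cumcount grouping. This is only a sketch of an equivalent mask, assuming the rule is "always keep the last row of each [user_id, date] group, and also the second-to-last row when that group's final metric is 'leave'" (out is just an illustrative name):

g = df.groupby(['user_id', 'date'])

# position counted from the end of each group: 0 = last row, 1 = second-to-last
pos_from_end = g.cumcount(ascending=False)

# True on every row of a group whose final metric is 'leave'
last_is_leave = g['metric'].transform('last').eq('leave')

out = df[(pos_from_end == 0) | ((pos_from_end == 1) & last_is_leave)]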
This is one way, but in my opinion it is slow, since we are iterating through the groups:
df["date"] = pd.to_datetime(df["date"])
df = df.assign(metric_is_leave=df.metric.eq("leave"))
pd.concat(
    [
        value.iloc[-2:, :-1] if value.metric_is_leave.any() else value.iloc[-1:, :-1]
        for key, value in df.groupby(["user_id", "date"])
    ]
)
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
Related
I want to group nearby dates together, using a rolling window (?) of three week periods.
See example and attempt below:
import pandas as pd
d = {'id':[1, 1, 1, 1, 2, 3],
'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])
# id datefield
#0 1 2021-01-01
#1 1 2021-01-15
#2 1 2021-01-30
#3 1 2021-02-05
#4 2 2020-02-10
#5 3 2020-02-20
df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()
# id datefield event
#0 1 2021-01-01 0
#1 1 2021-01-15 0
#2 1 2021-01-30 1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3 1 2021-02-05 1 #Should be 0
#4 2 2020-02-10 2
#5 3 2020-02-20 3 #Correct, within 3 weeks of another but since the ids are not the same the event is different
You can compute a few intermediate columns to make the logic easy to understand.
df
id datefield
0 1 2021-01-01
1 1 2021-01-15
2 1 2021-01-30
3 1 2021-02-05
4 2 2020-02-10
5 2 2020-03-20
Calculate the difference between consecutive dates in days (numpy is needed for np.where below)
import numpy as np
df['diff'] = df['datefield'].diff().dt.days
Get previous ID
df['prevId'] = df['id'].shift()
Decide whether to increment or not
df['increment'] = np.where((df['diff']>21) | (df['prevId'] != df['id']), 1, 0)
Lastly, just get the cumulative sum
df['event'] = df['increment'].cumsum()
Output
id datefield diff prevId increment event
0 1 2021-01-01 NaN NaN 1 1
1 1 2021-01-15 14.0 1.0 0 1
2 1 2021-01-30 15.0 1.0 0 1
3 1 2021-02-05 6.0 1.0 0 1
4 2 2020-02-10 -361.0 1.0 1 2
5 2 2020-03-20 39.0 2.0 1 3
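If the helper columns were only added to make the logic easy to follow, they can optionally be dropped at the end; a small cleanup sketch:

df = df.drop(columns=['diff', 'prevId', 'increment'])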
Let's try a different approach using a boolean series instead:
df['group'] = ((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift()))).cumsum()
Output:
id datefield group
0 1 2021-01-01 1
1 1 2021-01-15 1
2 1 2021-01-30 1
3 1 2021-02-05 1
4 2 2020-02-10 2
5 2 2020-03-20 3
Is the difference from the previous row greater than 3 weeks?
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))))
0 False
1 False
2 False
3 False
4 False
5 True
Name: datefield, dtype: bool
Or is the current id not equal to the previous id:
print((df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 False
Name: id, dtype: bool
Then or (|) the conditions together:
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 True
dtype: bool
Then use cumsum to increment wherever there is a True value, which delimits the groups.
*Assumes the id and datefield columns are appropriately ordered.
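As a minimal sketch of that last step, here is the combined boolean series from the printout above fed through cumsum (values copied from that output):

import pandas as pd

# True marks the start of a new group (the or-ed condition printed above)
starts = pd.Series([True, False, False, False, True, True])

# the cumulative sum increments at every True, producing the group labels
print(starts.cumsum().tolist())  # [1, 1, 1, 1, 2, 3]

If the frame is not already ordered by id and datefield, sorting first, e.g. df.sort_values(['id', 'datefield']), satisfies that assumption.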
It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0:
df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()
output:
id datefield delta_t group
0 2 2020-02-10 0 days 0
1 2 2020-03-20 39 days 1
2 1 2021-01-01 326 days 2
3 1 2021-01-15 340 days 2
4 1 2021-01-30 355 days 2
5 1 2021-02-05 361 days 2
Note that your original dataframe is not sorted properly.
My table looks like this:
no type 2020-01-01 2020-01-02 2020-01-03 ...................
1 x 1 2 3
2 b 4 3 0
and what I want to do is melt the date columns down so that the dates and their values end up in separate new columns. I have done it, but I had to specify which columns to melt, like in the script below:
cols_dict = dict(zip(df.iloc[:, 3:100].columns, df.iloc[:, 3:100].values[0]))
id_vars = [col for col in df.columns if isinstance(col, str)]
df = df.melt(id_vars = [col for col in df.columns if isinstance(col, str)], var_name = "date", value_name = 'value')
The expected result I want is:
no type date value
1 x 2020-01-01 1
1 x 2020-01-02 2
1 x 2020-01-03 3
2 b 2020-01-01 4
2 b 2020-01-02 3
2 b 2020-01-03 0
I assume that date columns will keep being added to the data frame as time goes by, so my script would no longer work once there are more than 100 date columns.
How should I write my script so it can handle any number of date columns in the future, given that my current script only reaches column number 100?
Thanks in advance.
>>> df.set_index(["no", "type"]) \
.rename_axis(columns="date") \
.stack() \
.rename("value") \
.reset_index()
no type date value
0 1 x 2020-01-01 1
1 1 x 2020-01-02 2
2 1 x 2020-01-03 3
3 2 b 2020-01-01 4
4 2 b 2020-01-02 3
5 2 b 2020-01-03 0
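As an aside, the same reshape can also be written with melt; a sketch assuming no and type are the only non-date columns (out is just an illustrative name):

out = df.melt(id_vars=["no", "type"], var_name="date", value_name="value")
out = out.sort_values(["no", "date"]).reset_index(drop=True)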
**Edit at bottom**
I have a data frame with inventory data that looks like the following:
d = {'product': ['a', 'b', 'a', 'b', 'c'],
     'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
df
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 a 3 2020-6-7
3 b 5 2020-6-7
4 c 2 2020-6-7
I would like to know what the inventory difference is month to month. The output would look like this:
df
product diff isnew date
0 a nan nan 2020-6-6
1 b nan nan 2020-6-6
2 a 2 False 2020-6-7
3 b 3 False 2020-6-7
4 c 2 True 2020-6-7
Sorry if I was not clear in the first example. In reality I have many months of data, so I am not just looking at the difference of one period versus another. It needs to be the general case, comparing month n with month n-1, then n-1 with n-2, and so on.
What's the best way to do this in Pandas?
I guess the key here is to find the isnew:
# rows that occur after the first (earliest) date
new_prods = df['date'] != df.date.min()
# rows whose product has already appeared before
duplicated = df.duplicated('product')
# first appearance of new products
# or duplicated *old* products
valids = new_prods | duplicated
df.loc[valids, 'is_new'] = ~duplicated
# then the difference:
df['diff'] = (df.groupby('product')['amount'].diff()  # normal differences
                .fillna(df['amount'])                  # fill the first value for each product
                .where(df['is_new'].notna())           # remove the first month
              )
Output:
product amount date is_new diff
0 a 1 2020-6-6 NaN NaN
1 b 2 2020-6-6 NaN NaN
2 a 3 2020-6-7 False 2.0
3 b 5 2020-6-7 False 3.0
4 c 2 2020-6-7 True 2.0
You can try groupby on the product column and diff the amount column for 'diff', then use duplicated (negated) for 'isnew':
df['diff'] = df.groupby('product')['amount'].diff()
df['isnew'] = ~df['product'].duplicated()
print (df)
product amount date diff isnew
0 a 1 2020-6-6 NaN True
1 b 2 2020-6-6 NaN True
2 a 3 2020-6-7 2.0 False
3 b 5 2020-6-7 3.0 False
4 c 2 2020-6-7 NaN True
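This leaves diff as NaN for the newly appearing product c and isnew as True for the first month's rows, whereas the question's expected output shows 2 for c and NaN for the first month. If you want to match that exactly, a possible follow-up sketch, assuming the date column compares correctly (e.g. after pd.to_datetime):

first_date = df['date'].min()

# a new product counts its full amount as the difference,
# but the very first month stays NaN
df['diff'] = df['diff'].fillna(df['amount']).where(df['date'] != first_date)

# products in the first month are neither new nor old
df['isnew'] = df['isnew'].where(df['date'] != first_date)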
I have a dataframe that looks like the following:
ID1 ID2 Date
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018
3 4 09/07/2018
etc.
What I need to do is to flag the first time that an ID in ID1 appears in ID2. In the above example this would look like
ID1 ID2 Date Flag
1 2 01/01/2018
1 2 03/01/2018
1 2 04/05/2018
2 1 06/06/2018
1 2 08/06/2018 Y
3 4 09/07/2018
I've used the following code to tell me if ID1 ever occurs in ID2:
import numpy as np

ID2List = df['ID2'].tolist()
ID2List = list(set(ID2List))  # dedupe list
df['ID1 is in ID2List'] = np.where(df['ID1'].isin(ID2List), 'Yes', 'No')
But this only tells me that ID1 appears in ID2 at some point, not the row at which this first occurs.
Any help?
One idea is to use next with a generator expression to calculate the indices of matches in ID1. Then compare with index and use argmax to get the index of the first True value:
idx = df.apply(lambda row: next((idx for idx, val in enumerate(df['ID1']) \
if row['ID2'] == val), 0), axis=1)
df.loc[(df.index > idx).argmax(), 'Flag'] = 'Y'
print(df)
ID1 ID2 Date Flag
0 1 2 01/01/2018 NaN
1 1 2 03/01/2018 NaN
2 1 2 04/05/2018 NaN
3 2 1 06/06/2018 Y
4 1 2 08/06/2018 NaN
5 3 4 09/07/2018 NaN
Imagine that I've got the following DataFrame
A | B | C | D
-------------------------------
2000-01-01 00:00:00 | 1 | 1 | 1
2000-01-01 00:04:30 | 1 | 2 | 2
2000-01-01 00:04:30 | 2 | 3 | 3
2000-01-02 00:00:00 | 1 | 4 | 4
And I want to drop rows where the values in B are equal and the values in A are "close", say within five minutes of each other. So in this case drop the first two rows, but keep the last two.
So, instead of doing df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False), I'd like something that's more like df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False, func={'A': some_func}), with
def some_func(ts1, ts2):
    delta = ts1 - ts2
    return abs(delta.total_seconds()) >= 5 * 60
Is there a way to do this in Pandas?
m = df.groupby('B').A.apply(lambda x: x.diff().dt.seconds < 300)
m2 = df.B.duplicated(keep=False) & (m | m.shift(-1))
df[~m2]
A B C D
2 2000-01-01 00:04:30 2 3 3
3 2000-01-02 00:00:00 1 4 4
Details
m is a mask of rows that are within 5 minutes of the previous row of the same B group.
m
0 False
1 True
2 False
3 False
Name: A, dtype: bool
m2 is the final mask of all items that must be dropped.
m2
0 True
1 True
2 False
3 False
dtype: bool
I break it down into steps; you can test with your real data to see whether it works or not:
# minutes from each row to the next row
df['dropme'] = df.A.diff().shift(-1).dt.seconds / 60
# copy A, then replace it with the sentinel value 1 wherever the next row
# is within 5 minutes, so those rows become duplicates of each other
df['dropme2'] = df.A
df.loc[df.dropme <= 5, 'dropme2'] = 1
# drop all rows whose dropme2 value is duplicated, then remove the helper columns
df.drop_duplicates(['dropme2'], keep=False).drop(['dropme', 'dropme2'], axis=1)
Out[553]:
A B C D
2 2000-01-01 00:04:30 2 3 3
3 2000-01-02 00:00:00 1 4 4
Write a function that accepts a data frame, calculates the delta between two successive timestamps, and returns the filtered dataframe. Then groupby & apply.
import pandas as pd
import datetime
# this one preserves 1 row from two or more closeby rows.
def filter_window(df):
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    df['filt'] = df.filt.fillna(10.0)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]
df2 = df.groupby('B').apply(filter_window).reset_index()
# With your sample dataset, this is the output of df2
A B C D
0 2000-01-01 00:00:00 1 1 1
1 2000-01-02 00:00:00 1 4 4
2 2000-01-01 00:04:30 2 3 3
# this one drops all closeby rows.
def filter_window2(df):
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    df['filt2'] = (df.A.shift(-1) - df.A) / datetime.timedelta(minutes=1)
    df['filt'] = df.filt.fillna(df.filt2)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]
df3 = df.groupby('B').apply(filter_window2).reset_index()
# With your sample dataset, this is the output of df3
A B C D
0 2000-01-02 00:00:00 1 4 4
1 2000-01-01 00:04:30 2 3 3