Group by first occurrence of each value in a pandas dataframe - python

I have a pandas dataframe that looks like this:
id  user     action  timestamp
1   Jim      start   12/10/2022
2   Jim      start   12/10/2022
3   Jim      end     2/2/2022
4   Linette  start   8/18/2022
5   Linette  start   3/24/2022
6   Linette  end     8/27/2022
7   Rachel   start   2/7/2022
8   Rachel   end     1/4/2023
9   James    start   6/12/2022
10  James    end     5/14/2022
11  James    start   11/28/2022
12  James    start   8/9/2022
13  James    end     2/15/2022
For each user, there can be more than one start event, but only one end. Imagine that they sometimes need to start a book over again, but only finish it once.
What I want is to calculate the time difference between the first start and the end, so I need to keep, for each user, the first occurrence of "start" and of "end".
Any hint?

>>> (df.groupby(["user", "action"], sort=False)["timestamp"]
       .first()
       .droplevel("action")
       .diff()
       .iloc[1::2])
user
James 29 days
Jim 311 days
Linette -9 days
Rachel -331 days
Name: timestamp, dtype: timedelta64[ns]
for the "timestamp" of each "user" & "action" pair, get the first occurrences
this will effectively take the first start, and the (only) end
then drop the carried-over "action" level of the groupers
take the difference between ends and starts
take every other value to get the per-user difference
(sort=False ensures that groupby keeps groups in order of appearance, so starts and ends stay paired.)
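
For completeness, here is a minimal, self-contained sketch of the whole pipeline (the frame below is a hypothetical reconstruction of the sample data, and the timestamps are assumed to be month/day/year strings that need parsing before diff() can produce timedeltas):

import pandas as pd

df = pd.DataFrame({
    "user": ["Jim", "Jim", "Jim", "Linette", "Linette", "Linette",
             "Rachel", "Rachel", "James", "James", "James", "James", "James"],
    "action": ["start", "start", "end", "start", "start", "end",
               "start", "end", "start", "end", "start", "start", "end"],
    "timestamp": ["12/10/2022", "12/10/2022", "2/2/2022", "8/18/2022",
                  "3/24/2022", "8/27/2022", "2/7/2022", "1/4/2023",
                  "6/12/2022", "5/14/2022", "11/28/2022", "8/9/2022", "2/15/2022"],
})

# diff() only yields timedeltas on real datetimes, so parse the strings first
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%m/%d/%Y")

result = (df.groupby(["user", "action"], sort=False)["timestamp"]
            .first()
            .droplevel("action")
            .diff()
            .iloc[1::2])
print(result)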

Related

Finding earliest date after groupby a specific column

I have a dataframe that look like below.
id name tag location date
1 John 34 FL 01/12/1990
1 Peter 32 NC 01/12/1990
1 Dave 66 SC 11/25/1990
1 Mary 12 CA 03/09/1990
1 Sue 29 NY 07/10/1990
1 Eve 89 MA 06/12/1990
: : : : :
n John 34 FL 01/12/2000
n Peter 32 NC 01/12/2000
n Dave 66 SC 11/25/1999
n Mary 12 CA 03/09/1999
n Sue 29 NY 07/10/1998
n Eve 89 MA 06/12/1997
I need to find the location information based on the id column, but with one condition: I only need the earliest date. For example, the earliest date for the id=1 group is 01/12/1990, which means the locations are FL and NC. Then I apply this to all the different id groups to get the top 3 locations. I have written the code to do this.
#Get the earliest date base on id group
df_ear = df.loc[df.groupby('id')['date'].idxmin()]
#Count the occurrences of the location
df_ear['location'].value_counts()
The code works fine, but it cannot return more than one location (using my first line of code) if several rows share the same earliest date; for example, the id=1 group will only return FL instead of FL and NC. How can I fix my code to handle the case where more than one row has the earliest date?
Thanks!
Use GroupBy.transform to get the minimal date per group as a Series aligned with the original rows, so it can be compared against the date column with boolean indexing:
df['date'] = pd.to_datetime(df['date'])
df_ear = df[df.groupby('id')['date'].transform('min').eq(df['date'])]
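
A small self-contained run (on a hypothetical three-row frame) shows why transform is preferable to idxmin here: ties on the earliest date are kept rather than collapsed to a single row.

import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1],
    "location": ["FL", "NC", "SC"],
    "date": ["01/12/1990", "01/12/1990", "11/25/1990"],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# transform('min') broadcasts the group minimum to every row, so both
# FL and NC survive the filter; idxmin would have kept only one of them
earliest = df[df.groupby("id")["date"].transform("min").eq(df["date"])]
print(earliest["location"].value_counts().head(3))  # top 3 locations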

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence the newby = 1 - ... below to flag new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    df.loc[i, "newby"] = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as its condition is static. I would like to avoid introducing "dummy columns", such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates - given there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could do:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
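
To make the lookup idea concrete, here is a small self-contained run on a hypothetical reconstruction of the sample data:

import pandas as pd

df = pd.DataFrame({
    "Day": [3230, 3230, 3231, 3232, 3232],
    "Guest ID": ["Tom", "Peter", "Tom", "Peter", "Peter"],
})

# each key of value_counts().to_dict() is a (Day, Guest ID) tuple; shifting
# Day by +1 builds the set of (day, guest) pairs that visited the day before
lookup = {(day + 1, guest) for day, guest in df[["Day", "Guest ID"]].value_counts().to_dict()}

# a row is "returning" exactly when its own (Day, Guest ID) pair is in the set
df["newby"] = (~pd.MultiIndex.from_arrays([df["Day"], df["Guest ID"]]).isin(lookup)).astype(int)
print(df)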

How do I use pandas to count the number of times a name and type occur together within a 60 period from the first instance?

My dataframe is this:
Date Name Type Description Number
2020-07-24 John Doe Type1 NaN NaN
2020-08-10 Jo Doehn Type1 NaN NaN
2020-08-15 John Doe Type1 NaN NaN
2020-09-10 John Doe Type2 NaN NaN
2020-11-24 John Doe Type1 NaN NaN
I want the Number column to have the instance number with the 60 day period. So for entry 1, the Number should just be 1 since it's the first instance - same with entry 2 since it's a different name. Entry 3 however, should have 2 in the Number column since it's the second instance of John Doe and Type 1 in the 60 day period starting 7/24 (the first instance date). Entry 4 would be 1 as well since the Type is different. Entry 5 would also be 1 since it's outside the 60 day period from 7/24. However, any entries after this with John Doe, Type 1 would have a new 60 day period starting 11/24.
Sorry, I know this is a pretty loaded question with a lot of aspects to it, but I'm trying to get up to speed on dataframes again and I'm not sure where to begin.
As a starting point, you could create a pivot table. (The assign statement just creates a temporary column of ones, to support counting.) In the example below, each row is a date, and each column is a (name, type) pair.
Then, use the resample function (to get one row for every calendar day), and the rolling function (to sum the numbers in the 60-day window).
x = (df.assign(temp=1)  # temporary column of ones, to support counting
       .pivot_table(index='Date',
                    columns=['Name', 'Type'],
                    values='temp',
                    aggfunc='count',
                    fill_value=0))
# one row per calendar day, then a 60-day rolling total per (Name, Type) column
# (sum, not count, so the actual per-day event counts are accumulated)
x.resample('1d').sum().rolling(60).sum()
Can you post sample data in text format (for copy/paste)?
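
Since the thread never got copy/pasteable data, here is a sketch on a hypothetical reconstruction of the table above. Note that a rolling 60-day window is an approximation of the question's "restart the period at the first instance" rule, not an exact implementation of it (min_periods=1 is used so that early dates get a count instead of NaN):

import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-07-24", "2020-08-10", "2020-08-15",
                            "2020-09-10", "2020-11-24"]),
    "Name": ["John Doe", "Jo Doehn", "John Doe", "John Doe", "John Doe"],
    "Type": ["Type1", "Type1", "Type1", "Type2", "Type1"],
})

x = (df.assign(temp=1)
       .pivot_table(index="Date", columns=["Name", "Type"],
                    values="temp", aggfunc="count", fill_value=0))

# one row per calendar day, then a trailing 60-day sum per (Name, Type) column
rolling = x.resample("1D").sum().rolling(60, min_periods=1).sum()

# look the instance number back up by (Date, Name, Type) for each original row
df["Number"] = [rolling.loc[d, (n, t)] for d, n, t in zip(df["Date"], df["Name"], df["Type"])]
print(df)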

How can I remove the current and next instance in a dataframe (Python)?

Say I have this dataframe, df. It's structured like this:
index date animal park_visits
0 Jan cat 1
1 Jan dog 2
2 Feb cat 1
3 Feb dog 1
4 Feb pig 4
5 March cat 3
6 March dog 2
7 March pig 3
8 April cat 2
How can I create a new dataframe such that, if in the current month an animal has no more than a single park visit, that row is excluded as well as the same animal's row for the next month?
For example, at index 0, the cat had only one park visit in January, so then I would exclude entries at index 0, and 2. Additionally, since the cat visited the park in February one time, I would also exclude the entry at index 5 when the cat visited the park 3 times in March. But since the cat attended the park 3 times in March, I would include the entry for April.
As a result, the ending, sample dataframe I would ultimately want is going to look something like this:
index date animal park_visits
0 Jan dog 2
1 Feb pig 4
2 March pig 3
3 April cat 2
Is there any way to do this efficiently without a loop? My best guess is to create a new dataframe where park_visits == 1, and from that, try to remove the next instance where the date and animal are the same. However, I'm not sure how to remove ONLY the next instance, not all instances (so I would need to keep the entry where the date is April, the animal is cat, and park_visits is 2). Any help would be appreciated.
We want to identify the rows where park_visits was greater than one both this month and in the prior month; shift is used to check the prior month:
f = lambda x: (lambda y: y & y.shift().fillna(True))(x > 1)
df[df.groupby('animal').park_visits.transform(f)]
date animal park_visits
index
1 Jan dog 2
4 Feb pig 4
7 March pig 3
8 April cat 2
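
If the nested lambda is hard to read, an equivalent spelled-out version (run here on a hypothetical reconstruction of the sample frame) may help; shift(fill_value=True) keeps each animal's first month, since there is no prior month to disqualify it:

import pandas as pd

df = pd.DataFrame({
    "date": ["Jan", "Jan", "Feb", "Feb", "Feb", "March", "March", "March", "April"],
    "animal": ["cat", "dog", "cat", "dog", "pig", "cat", "dog", "pig", "cat"],
    "park_visits": [1, 2, 1, 1, 4, 3, 2, 3, 2],
})

def keep(visits):
    # keep a row when this month's visits exceed 1 AND the same animal's
    # previous month's visits did too (first month defaults to True)
    more_than_one = visits > 1
    return more_than_one & more_than_one.shift(fill_value=True)

print(df[df.groupby("animal")["park_visits"].transform(keep)])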

Python Pandas: Change value associated with each first day entry in every month

I'd like to change the value associated with the first day in every month for a pandas.Series I have. For example, given something like this:
Date
1984-01-03 0.992701
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 1.009894
1984-02-02 0.996608
1984-02-03 0.996595
...
I'd like to change the values associated with 1984-01-03, 1984-02-01 and so on. I've racked my brain for hours on this one and have looked around Stack Overflow a fair bit. Some solutions have come close. For example, using:
[In]: series.groupby((m_ret.index.year, m_ret.index.month)).first()
[Out]:
Date Date
1984 1 0.992701
2 1.009894
3 1.005963
4 0.997899
5 1.000342
6 0.995429
7 0.994620
8 1.019377
9 0.993209
10 1.000992
11 1.009786
12 0.999069
1985 1 0.981220
2 1.011928
3 0.993042
4 1.015153
...
is almost there, but I'm struggling to proceed further.
What I'd like to do is set the values associated with the first day present in each month of every year to 1.
series[m_ret.index.is_month_start] = 1 comes close, but the problem is that is_month_start only selects rows where the day value is 1, which isn't always the case in my series, as you can see: the first day present in January, for example, is 1984-01-03.
series.groupby(pd.TimeGrouper('BM')).nth(0) doesn't appear to return the first day either; instead I get the last day:
Date
1984-01-31 0.992701
1984-02-29 1.009894
1984-03-30 1.005963
1984-04-30 0.997899
1984-05-31 1.000342
1984-06-29 0.995429
1984-07-31 0.994620
1984-08-31 1.019377
...
I'm completely stumped. Your help is as always, greatly appreciated! Thank you.
One way would to be to use your .groupby((m_ret.index.year, m_ret.index.month)) idea, but use idxmin instead on the index itself converted into a Series:
In [74]: s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
Out[74]:
Date Date
1984 1 1984-01-03
2 1984-02-01
Name: Date, dtype: datetime64[ns]
In [75]: start = s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
In [76]: s.loc[start] = 999
In [77]: s
Out[77]:
Date
1984-01-03 999.000000
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 999.000000
1984-02-02 0.996608
1984-02-03 0.996595
dtype: float64
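
Since the question's actual goal is to set those first-of-month entries to 1 (999 above is just a marker to make the change visible), the same index works directly; a minimal follow-up using the names from the answer:

start = s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
s.loc[start] = 1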
