comparing columns within groupby objects in pandas - python

My dataframe looks like this:
id month spent limit
1 1 2.6 10
1 2 4 10
1 3 6 10
2 1 3 100
2 2 89 100
2 3 101 100
3 1 239 500
3 2 432 500
3 3 100 500
I want to group by id and then get the ids for which the spent column is less than or equal to the limit column for every row in the grouped object.
For my example above, I should get ids 1 and 3 as my result, because id 2 spends 101 in the 3rd month and hence exceeds the limit of 100.
How can I do this in pandas efficiently?
Thanks in advance!

You can create a mask by finding the ids where spent is greater than limit, then filter out the ids in the mask:
mask = df.loc[df['spent'] > df['limit'], 'id'].values.tolist()
df.id[~df['id'].isin(mask)].unique()
gives you
array([1, 3])

This should give you something like what you want
df.groupby('id').apply(lambda g: (g.spent <= g.limit).all()).to_frame('not_exceeded').query('not_exceeded == True')

Reverse logic! Check for unique ids where spent is greater than limit. Then filter out those.
df[~df.id.isin(df.set_index('id').query('limit < spent').index.unique())]
id month spent limit
0 1 1 2.6 10
1 1 2 4.0 10
2 1 3 6.0 10
6 3 1 239.0 500
7 3 2 432.0 500
8 3 3 100.0 500
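For completeness, a minimal vectorized sketch along the same lines, using the example data above: compute the row-wise comparison once and reduce it with a grouped all().
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'spent': [2.6, 4, 6, 3, 89, 101, 239, 432, 100],
                   'limit': [10, 10, 10, 100, 100, 100, 500, 500, 500]})

# one boolean per id: True only if every row satisfies spent <= limit
within = (df['spent'] <= df['limit']).groupby(df['id']).all()
print(within[within].index.tolist())  # [1, 3]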

Related

how to delete columns with a certain count condition

I'm trying to delete ids that don't contain all 3 months in the month column.
For example, we have df as:
id month
100 1
100 2
100 3
101 2
102 3
Then I would like to have the new df as just with the id 100 like this:
id month
100 1
100 2
100 3
So what I've done is
df.groupby(['id']).month.count() == 3
which gives me
id month
100 True
101 False
102 False
I'm currently stuck on how to continue.
You can use groupby+transform('nunique') and slice on the boolean output after comparison with 3:
df[df.groupby('id')['month'].transform('nunique').eq(3)]
output:
id month
0 100 1
1 100 2
2 100 3
NB. if you are sure there are no duplicated months, transform('count') will also work
I think you are close, but you need to modify your code a bit. Use your code but swap count for nunique, which will return a series showing your IDs as True or False depending on whether they have all the months. Then you can filter:
t = (df.groupby(['id']).month.nunique() == 3)
print(df.loc[df.id.isin(t[t].index)])
id month
0 100 1
1 100 2
2 100 3
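Another option along the same lines is groupby().filter(), which keeps or drops whole groups based on a boolean; a minimal sketch with the example data:
import pandas as pd

df = pd.DataFrame({'id':    [100, 100, 100, 101, 102],
                   'month': [1, 2, 3, 2, 3]})

# keep only the ids whose group contains all 3 distinct months
print(df.groupby('id').filter(lambda g: g['month'].nunique() == 3))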

Pandas Dataframe fill column with sequence_id based on multiple columns ids and timestamp

*I'm editing the df since it contained a typo in ne1_id.
I'm having a really hard time trying to solve the following; I'd really appreciate any assistance or light on it.
I have a DataFrame df that looks like this:
   timestamp        user_id  ne1_id  ne2_id  attempt_no
0  18:11:42.838363  1        100             1
1  18:11:42.838364           100     123456
2  18:11:42.838365           100     123456
3  18:11:42.83836            100     123456
4  18:11:45.838365  1        100             2
5  18:11:45.838366           100     321234
6  18:11:45.838369           100     321234
7  18:11:46.838363  3        12              3
8  18:11:46.838364           12      9832
9  18:11:47.838363  2        12              4
10 18:11:47.838369           100
What I want to do is fill the attempt_no of the empty cells (empties are truly empty, not NaN) in the following rows, based on timestamp (or index), with the proper attempt_no by associating user_id, ne1_id and ne2_id.
I'm not seeing the logic of it, nor a way to do it.
The result should be something like this:
   timestamp        user_id  ne1_id  ne2_id  attempt_no
0  18:11:42.838363  1        100             1
1  18:11:42.838364           100     123456  1
2  18:11:42.838365           100     123456
3  18:11:42.838369           100     123456
4  18:11:45.838365  1        100             2
5  18:11:45.838366           100     321234  2
6  18:11:45.838369           100     321234
7  18:11:46.838363  3        12              3
8  18:11:46.838364           12      9832    3
9  18:11:47.838363  2        12              4
10 18:11:47.838369           100             4
Something that says the following:
"find all the rows where there is a user_id, find the next row with the same ne1_id that has an empty user_id and attempt_no, and fill attempt_no with the attempt_no of the previous row"
I tried with groupby, which I believe is the way to do it, but got kind of stuck there.
I appreciate any suggestion.
import numpy as np
import pandas as pd

def f(x):
    # walk down the column, remembering the last non-NaN value
    # and writing it into any NaN cell that follows
    last = None
    for i in range(len(x)):
        if np.isnan(x[i]):
            x[i] = last
        else:
            last = x[i]
    return x

df = pd.DataFrame({'x': [1, None, None, 2, None, None, None, 3, None]})
df[['x']].apply(f)
By applying the function on axis=0 you are able to jointly process the entire column.
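For the original question, a minimal sketch under two assumptions: the blank cells are first converted to NaN, and a plain forward fill within each ne1_id group (in row order) is acceptable. This is a simplification, not the asker's exact rule, which also looks at user_id.
import numpy as np
import pandas as pd

# simplified, hypothetical frame: blanks are empty strings, as in the question
df = pd.DataFrame({'ne1_id':     [100, 100, 100, 12, 12],
                   'attempt_no': [1, '', '', 3, '']})

# assumption: turn blanks into NaN, then forward-fill within each ne1_id group
df['attempt_no'] = df['attempt_no'].replace('', np.nan)
df['attempt_no'] = df.groupby('ne1_id')['attempt_no'].ffill()
print(df)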

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have a pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with the top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant approach to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data set ahead of time is very time consuming.
We can groupby first and do top-k for each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here sort_values with ascending=False behaves like nlargest, and ascending=True like nsmallest.
The value inside head is the same as the value we give nlargest: the number of rows to keep for each group.
reset_index is optional and not strictly necessary.
This works for duplicated values.
If you have duplicates among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed they were 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
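On the second part of the question, the usual analogue of SQL's row_number() is groupby().cumcount(); a minimal sketch:
import pandas as pd

df = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# 0-based position of each row within its id group, like row_number() - 1
df['rn'] = df.groupby('id').cumcount()
print(df[df['rn'] <= 1][['id', 'value']])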

Python Pandas if column id value greater

I have what I think is a simple question, but I am not sure how to implement it.
I have the following dataframe:
ID Value
1 100
2 250
3 300
4 400
5 600
7 800
I would like to look at two IDs, 3 and 5, and drop the one with the lower value. I am assuming I would use something like the following code, but again, I am not sure how to implement it, nor am I sure how to use the inequality to point at the value while directing my function at a very specific pair of IDs.
def ChooseGreater(x):
    if df['id'] == 3 > df['id'] == 5:
        return del df['id'] == 5
    else:
        return del df['id'] == 3
Thank you!
I think you can do:
df.drop(df.loc[df.ID.isin([3,5]),'Value'].idxmin(), inplace=True)
Using Python's min
df.drop(min(df.query('ID in [3, 5]').index, key=df.Value.get))
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
groupby and tail
df.sort_values('Value').groupby(df.ID.replace({3: 5})).tail(1)
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
You can calculate idxmin and then use np.in1d with pd.DataFrame.loc:
idx = df.loc[df['ID'].isin([3,5]), 'Value'].idxmin()
res = df.loc[~np.in1d(df.index, idx)]
print(res)
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
This is a method using groupby:
df.loc[df.Value.groupby((~df.ID.isin([3,5])).sort_values().cumsum()).idxmax()].sort_index()
Out[167]:
ID Value
0 1 100
1 2 250
3 4 400
4 5 600
5 7 800
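For what it's worth, the same idea can also be written with idxmax instead of idxmin, explicitly keeping the larger of the two rows; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'ID':    [1, 2, 3, 4, 5, 7],
                   'Value': [100, 250, 300, 400, 600, 800]})

# among IDs 3 and 5, keep only the row with the larger Value
pair = df[df['ID'].isin([3, 5])]
print(df.drop(pair.index.difference([pair['Value'].idxmax()])))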

Is it possible to use vectorization for a conditionnal count of rows in a Pandas Dataframe?

I have a Pandas Dataframe with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple Calls). A third column gives a day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID Day PersonID ContactsIN7Days
6 2 3 2
3 14 2 2
1 8 1 1
5 1 3 2
2 12 2 2
7 100 3 1
This works; however, it is going to be applied to a big data set. Would there be a way to make this more efficient, e.g. through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the order of your dataframe is modified as a result, and that your Day column is converted to a timedelta dtype:
Starting from your dataframe df:
CallID Day PersonID
0 6 2 3
1 3 14 2
2 1 8 1
3 5 1 3
4 2 12 2
5 7 100 3
Start by modifying Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then use pd.merge_asof to merge your dataframe with the count of calls by each individual in a period of 7 days. To get this count, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = (pd.merge_asof(df.sort_values(['Day']),
                        df.sort_values(['Day'])
                          .groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
                          .size()
                          .to_frame('ContactsIN7Days')
                          .reset_index(),
                        left_on='Day', right_on='Day',
                        left_by='PersonID', right_by='PersonID',
                        direction='nearest'))
Your resulting new_df will look like this:
CallID Day PersonID ContactsIN7Days
0 5 1 days 3 2
1 6 2 days 3 2
2 1 8 days 1 1
3 2 12 days 2 2
4 3 14 days 2 2
5 7 100 days 3 1
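As a rough sketch under an assumption (a trailing 7-day window rather than the +/-7 day window used by the apply in the question): converting Day to real dates lets groupby().rolling() count each person's calls over 7 days, and the per-person maximum then answers the "maximum calls in a period of 7 days" part of the question.
import pandas as pd

df = pd.DataFrame({'CallID':   [6, 3, 1, 5, 2, 7],
                   'Day':      [2, 14, 8, 1, 12, 100],
                   'PersonID': [3, 2, 1, 3, 2, 3]})

# map the integer Day to real dates so a time-based rolling window can be used
df['Date'] = pd.Timestamp('2000-01-01') + pd.to_timedelta(df['Day'], unit='d')

# trailing 7-day call count per person, then each person's maximum
counts = (df.sort_values('Date')
            .set_index('Date')
            .groupby('PersonID')['CallID']
            .rolling('7d')
            .count())
print(counts.groupby(level='PersonID').max())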
