Create new column based on multiple groupby conditions - python

I want a new column in this df based on the following conditions. The column education is a categorical value that goes from 1 to 5 (1 is the lowest level of education, 5 the highest). I want to create a function with the following logic (so as to create a new column in the df).
First, for each id, check whether there is at least one graduated education level; if so, the new column must hold the highest graduated level.
Second, if a particular id has no graduated education level (all its education levels are "In course"), take the maximum education level and subtract one.
df
id  education      stage
 1          2  Graduated
 1          3  Graduated
 1          4  In course
 2          3  In course
 3          2  Graduated
 3          3  In course
 4          2  In course
expected output:
id  education      stage  new_column
 1          2  Graduated           3
 1          3  Graduated           3
 1          4  In course           3
 2          3  In course           2
 3          2  Graduated           2
 3          3  In course           2
 4          2  In course           1

You can do it like this:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4],
                   'education': [2, 3, 4, 3, 2, 3, 2],
                   'stage': ['Graduated', 'Graduated', 'In course',
                             'In course', 'Graduated', 'In course', 'In course']})
max_gr = df[df.stage == 'Graduated'].groupby('id').education.max()
max_ic = df[df.stage == 'In course'].groupby('id').education.max()
# fill every row with the highest graduated level for its id (NaN where none exists)
df['new_col'] = df.id.map(max_gr)
# rows still empty belong to ids with no graduated level: use the max 'In course' level minus 1
df.loc[df.new_col.isna(), 'new_col'] = df.id.map(max_ic - 1)
series.map(other_series) returns a new Series in which each value of series has been replaced by the value of other_series at the matching index label.
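A minimal (hypothetical) illustration of that behaviour:

import pandas as pd

s = pd.Series([1, 2, 1])                      # values act as lookup keys
lookup = pd.Series(['a', 'b'], index=[1, 2])  # index -> replacement value
print(s.map(lookup).tolist())                 # ['a', 'b', 'a']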

This is one way.
df['new'] = df.loc[df['stage'] == 'Graduated']\
              .groupby('id')['education']\
              .transform('max')
df['new'] = df['new'].fillna(df.loc[df['stage'] == 'In course']\
                               .groupby('id')['education']\
                               .transform('max').sub(1)).astype(int)
Result:

   id  education      stage  new
0   1          2  Graduated    3
1   1          3  Graduated    3
2   1          4  In course    3
3   2          3  In course    2
4   3          2  Graduated    2
5   3          3  In course    2
6   4          2  In course    1
Explanation
First, fill the column from the "Graduated" rows, grouped by id, with the max education.
Second, fill the remaining rows from the "In course" rows, grouped by id, with the max education minus 1.

Alternative solution, building on Markus Löffler's answer:
max_ic = df[df.stage.eq('In course')].groupby('id').education.max() - 1
max_gr = df[df.stage.eq('Graduated')].groupby('id').education.max()
# Update with max_gr
max_ic.update(max_gr)
df['new_col'] = df.id.map(max_ic)
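As a quick sanity check, printing the frame afterwards should reproduce the expected column:

print(df)
#    id  education      stage  new_col
# 0   1          2  Graduated        3
# 1   1          3  Graduated        3
# 2   1          4  In course        3
# 3   2          3  In course        2
# 4   3          2  Graduated        2
# 5   3          3  In course        2
# 6   4          2  In course        1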

Related

Get column name where value match with multiple condition python

I've been looking for a solution to my problem for an entire day and cannot find the answer. I'm trying to follow the example in this topic: Get column name where value is something in pandas dataframe,
to make a version with multiple conditions.
I want to extract the column names (as a list) where:
value == 4 and/or value == 3
and, only if there is no 4 or 3 in the row, the column names where value == 2.
Example:
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'],
        'acne': [1, 4, 1, 2],
        'wrinkles': [1, 3, 4, 4],
        'darkspot': [2, 2, 3, 4]}
df1 = pd.DataFrame(data)
df1
'''
     Name  acne  wrinkles  darkspot
0     Tom     1         1         2
1  Joseph     4         3         2
2   Krish     1         4         3
3    John     2         4         4
'''
The result I'm looking for:
df2
'''
     Name  acne  wrinkles  darkspot               problem
0     Tom     1         1         2            [darkspot]
1  Joseph     4         3         2      [acne, wrinkles]
2   Krish     1         4         3  [wrinkles, darkspot]
3    John     2         4         4  [wrinkles, darkspot]
'''
I tried the apply function with a lambda, as detailed in the topic I mentioned above, but it can only take one argument.
Many thanks for your answers if somebody can help me :)
You can use boolean masks:
problems = ['acne', 'wrinkles', 'darkspot']

m1 = df1[problems].isin([3, 4])  # main condition
m2 = df1[problems].eq(2)         # fallback condition

# keep m1; fall back to m2 only on rows where m1 is all False
mask = m1 | (m2 & ~m1.any(axis=1).to_numpy()[:, None])

df1['problem'] = mask.mul(problems).apply(lambda x: [i for i in x if i], axis=1)
Output:
>>> df1
     Name  acne  wrinkles  darkspot               problem
0     Tom     1         1         2            [darkspot]
1  Joseph     4         3         2      [acne, wrinkles]
2   Krish     1         4         3  [wrinkles, darkspot]
3    John     2         4         4  [wrinkles, darkspot]
You can use a boolean mask to figure out which columns you need.
First check if any of the values are 3 or 4; if not, check if any of the values are 2. Form the composite mask (variable m below) with an | (or) between those two conditions.
Finally, you can turn the False values into NaN; that way, after you stack and groupby.agg(list), you're left with just the column labels for the Trues.
cols = ['acne', 'wrinkles', 'darkspot']

m1 = df1[cols].isin([3, 4])
# If there is no `3` or `4` in the row, check if there is a `2`
m2 = pd.DataFrame((~m1.any(axis=1)).to_numpy()[:, None] & df1[cols].eq(2).to_numpy(),
                  index=m1.index, columns=m1.columns)
m = m1 | m2
#     acne  wrinkles  darkspot
# 0  False     False      True
# 1   True      True     False
# 2  False      True      True
# 3  False      True      True

# Assignment aligns on the original DataFrame index, i.e. `'level_0'`
df1['problem'] = m.where(m).stack().reset_index().groupby('level_0')['level_1'].agg(list)
print(df1)
     Name  acne  wrinkles  darkspot               problem
0     Tom     1         1         2            [darkspot]
1  Joseph     4         3         2      [acne, wrinkles]
2   Krish     1         4         3  [wrinkles, darkspot]
3    John     2         4         4  [wrinkles, darkspot]

How do i get only the new unique values per group?

import pandas as pd

df = pd.DataFrame({'Month': [2, 2, 3, 3],
                   'user': ['Michael', 'Michael', 'Lea', 'Michael']})

I have a dataframe like this; it is already the result of a groupby on Month.

   Month     user
0      2  Michael
1      2  Michael
2      3      Lea
3      3  Michael
What I want is to count the total unique AND the new unique users compared to the month before.
Total is no problem, can just use:
df.groupby(['Month'])['user'].nunique()
Month
2    1
3    2
But what I want are only the new unique ones, I do not want to count the ones that already were there in Month 2 when I count in Month 3.
In my minimal example, "Lea" is a new user in month 3; "Michael" is not, because he was already a user in month 2. So my expected result is, per month, the count of new unique users, like this:

   Month  Unique_Count_New_Users
0      2                       1
1      3                       1   <- Lea is new compared to February, Michael isn't

How can I achieve this in Python? Do I need some sort of element-wise comparison between the groups?
Edit, to make it clearer: I need to compare against all previous months to check whether the user was already there.
import pandas as pd
df = pd.DataFrame({'Month': [2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4],
                   'user': ['Michael', 'Michael', 'Markus', 'Moritz', 'Lea',
                            'Michael', 'Stefan', 'Dora', 'Erika',
                            'Dora', 'Markus']})
df

    Month     user
0       2  Michael
1       2  Michael
2       2   Markus
3       2   Moritz
4       2      Lea
5       3  Michael
6       3   Stefan
7       3     Dora
8       3    Erika
9       4     Dora
10      4   Markus
df.groupby(['Month'])['user'].nunique()
# Solution
# Sort the dataframe first
df.sort_values(by='Month', inplace=True)
# Duplicated trick
(~df['user'].duplicated()).groupby(df['Month']).sum()
# Result
Month
2    4
3    3
4    0
IIUC, you can use
(~df['user'].duplicated()).groupby(df['Month']).sum()
Demo:
>>> df
   Month     user
0      2  Michael
1      2  Michael
2      3      Lea
3      3  Michael
>>> (~df['user'].duplicated()).groupby(df['Month']).sum()
Month
2    1
3    1
I'm assuming that the 'Month' column is sorted, otherwise the duplicated trick won't work.
Edit: your exact output can be produced with
(~df['user'].duplicated()).groupby(df['Month']).sum().reset_index().rename({'user': 'Unique_Count_New_Users'}, axis=1)
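If the frame might not be sorted by Month yet, a defensive variant of the same trick (my sketch, not from the original answer) is to sort first:

df_sorted = df.sort_values('Month', kind='stable')  # stable sort keeps within-month order
new_per_month = (~df_sorted['user'].duplicated()).groupby(df_sorted['Month']).sum()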

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

which looks like:

   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1
I want to get a new DataFrame with top 2 records for each id, like this:
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x: x['value'].reset_index()).reset_index()

which looks like:

   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1
But is there a more effective/elegant approach to do this? Also, is there a more elegant approach to numbering records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

(Keep in mind that you might need to order/sort before, depending on your data.)
EDIT: As mentioned by the questioner, use

df.groupby('id').head(2).reset_index(drop=True)

if you also want a fresh RangeIndex instead of the original row labels (very old pandas versions also prepended the group keys to the index; reset_index(drop=True) cleans that up as well):

   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1  2    3
   1    2
2  6    4
   5    3
3  7    1
4  8    1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
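For illustration, with the frame from the question that should give:

df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)
# id
# 1    3
# 1    2
# 2    4
# 2    3
# 3    1
# 4    1
# dtype: int64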
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting all of the data ahead of time is very time consuming.
We can group first and do the top-k per group instead:

topk = 2  # number of rows to keep per group
g = df.groupby('id').apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))

Here, sort_values with ascending=False behaves like nlargest, and ascending=True like nsmallest.
The value inside head is the same as the value we would give to nlargest: the number of rows to keep per group.
reset_index is optional and not necessary.
This works for duplicated values
If you have duplicates among the top-n values and want only unique values, you can do it like this:

import pandas as pd

ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile, delimiter='\t')
print(df.query("department == 'Audit'")[['id', 'first_name', 'last_name', 'department', 'salary']])

    id first_name last_name department  salary
24  12   Shandler      Bing      Audit  110000
25  14      Jason       Tom      Audit  100000
26  16     Celine    Anston      Audit  100000
27  15    Michale   Jackson      Audit   70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries per department, we can do this:

(df.groupby('department')['salary']
   .apply(lambda ser: ser.drop_duplicates().nlargest(3))
   .droplevel(level=1)
   .sort_index()
   .reset_index()
)
This gives
   department  salary
0       Audit  110000
1       Audit  100000
2       Audit   70000
3  Management  250000
4  Management  200000
5  Management  150000
6       Sales  220000
7       Sales  200000
8       Sales  150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# keep entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]

# keep only a specific column
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
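For reference, a rough sketch of how such a comparison could be reproduced (synthetic data and names of my own choosing; exact speedups will vary by machine and pandas version):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({'id': rng.integers(0, 8000, 100_000),
                    'value': rng.random(100_000)})

# in IPython/Jupyter:
# %timeit big[big.groupby('id')['value'].rank(method='first', ascending=False) <= 2]
# %timeit big.groupby('id')['value'].nlargest(2)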
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
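As for the question's aside about SQL's row_number(): groupby().cumcount() is the usual pandas counterpart (this assumes the frame is already in the desired order):

df['rn'] = df.groupby('id').cumcount() + 1  # 1-based row number within each id
df[df['rn'] <= 2]                           # same rows as groupby('id').head(2), plus the rn column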

Pandas multi index Dataframe - Select and remove

I need some help with cleaning a DataFrame that has a MultiIndex.
It looks something like this:

                         cost
location        season
Thorp park      autumn    £12
                spring    £13
                summer    £22
Sea life centre summer    £34
                spring    £43
Alton towers    ...       and so on

location and season are index columns. I want to go through the data and remove any location that doesn't have "season" values for all three seasons, so "Sea life centre" should be removed.
Can anyone help me with this?
Also, another question: my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df

     col
a 1    0
  2    1
b 1    3
  2    4
  3    5
c 2    7
  3    8

# transform('count') broadcasts each group's row count back onto every row
v = df.groupby(level=0)['col'].transform('count')
df = df[v == 3]
df

     col
b 1    3
  2    4
  3    5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g['col'].count() == 3)

     col
b 1    3
  2    4
  3    5
Option 1
Thinking outside the box...
# DataFrame.count(level=...) was removed in pandas 2.0; group on the level instead
df.drop(df.groupby(level=0).count().col[lambda x: x < 3].index)

     col
b 1    3
  2    4
  3    5

Same thing with a little more robustness, because I'm not depending on values in a column.

df.drop(df.index.to_series().groupby(level=0).count().loc[lambda x: x < 3].index)

     col
b 1    3
  2    4
  3    5
Option 2
Robustify for general case with undetermined number of seasons.
This uses Pandas version 0.21's groupby.pipe method
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))

     col
b 1    3
  2    4
  3    5
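On the second part of the question (the missing header on the cost column): that is normal when the groupby aggregation returned a Series whose name is unset; the values then print without a column header. A hypothetical sketch of getting a named column back (the actual aggregation isn't shown in the question):

costs = df.groupby(['location', 'season'])['cost'].sum()  # hypothetical aggregation -> a Series
costs_df = costs.to_frame('cost')                         # one-column DataFrame with header 'cost'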

Pandas indexing behavior after grouping: do I see an "extra row"?

This might be a very simple question, but I am trying to understand how grouping and indexing work in pandas.
Let's say I have a DataFrame with the following data:
df = pd.DataFrame(data={
    'p_id': [1, 1, 1, 2, 3, 3, 3, 4, 4],
    'rating': [5, 3, 2, 2, 5, 1, 3, 4, 5]
})

Now, the index would be assigned automatically, so the DataFrame looks like:

   p_id  rating
0     1       5
1     1       3
2     1       2
3     2       2
4     3       5
5     3       1
6     3       3
7     4       4
8     4       5
When I try to group it by p_id, I get:
>>> df[['p_id', 'rating']].groupby('p_id').count()
      rating
p_id
1          3
2          1
3          3
4          2

I noticed that p_id now becomes the index of the grouped DataFrame, but the first row looks weird to me: why does it contain p_id with an empty rating?
I know how to fix it, kind of, if I do this:

>>> df[['p_id', 'rating']].groupby('p_id', as_index=False).count()
   p_id  rating
0     1       3
1     2       1
2     3       3
3     4       2

Now I don't have this weird first row, but I have both an index and p_id.
So my question is: where does this extra row come from when I don't use as_index=False, and is there a way to group the DataFrame and keep p_id as the index without having to deal with this extra row? If there are any docs I can read on this, that would also be greatly appreciated.
It's just an index name...
Demo:
In [46]: df
Out[46]:
   p_id  rating
0     1       5
1     1       3
2     1       2
3     2       2
4     3       5
5     3       1
6     3       3
7     4       4
8     4       5

In [47]: df.index.name = 'AAA'

Pay attention to the index name, AAA:

In [48]: df
Out[48]:
     p_id  rating
AAA
0       1       5
1       1       3
2       1       2
3       2       2
4       3       5
5       3       1
6       3       3
7       4       4
8       4       5

You can get rid of it using the rename_axis() method:

In [42]: df[['p_id', 'rating']].groupby('p_id').count().rename_axis(None)
Out[42]:
   rating
1       3
2       1
3       3
4       2
There is no "extra row", it's simply how pandas visually renders a GroupBy object, i.e. how pandas.core.groupby.generic.DataFrameGroupBy.__str__ method renders a grouped dataframe object: rating is the column, but now p_id has now gone from being a column to being the (row) index.
Another reason they stagger them (i.e. the row with the column names, and the row with the index/multi-index name) is because the index can be a MultiIndex (if you grouped-by multiple columns).
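For example, grouping the question's frame by two columns gives a MultiIndex, which makes the two staggered header lines easier to see (output abbreviated):

df.groupby(['p_id', 'rating']).size().to_frame('n').head()
#              n
# p_id rating
# 1    2       1
#      3       1
#      5       1
# 2    2       1
# 3    1       1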
