Extract rows with conditions and a newly created column in Python

I have data like this:
id name sub marks
 1    a   m    52
 1    a   s    69
 1    a   p    63
 2    b   m    36
 2    b   s    52
 2    b   p    56
 3    c   m    85
 3    c   s    62
 3    c   p    56
And I want an output table containing the columns id, name, and a new column result (criterion: if the marks in all subjects are greater than 40, the student passes):
id name result
 1    a   pass
 2    b   fail
 3    c   pass
I would like to do this in Python.

Create a boolean mask from marks, and then use groupby (on id and name) + all:
import pandas as pd
import numpy as np

df = pd.read_csv('file.csv')

v = df.assign(result=df.marks.gt(40))\
      .groupby(['id', 'name'])\
      .result\
      .all()\
      .reset_index()
v['result'] = np.where(v['result'], 'pass', 'fail')
v
id name result
0 1 a pass
1 2 b fail
2 3 c pass
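If you don't have file.csv handy, the sample data can be built inline (a minimal sketch matching the table above):
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'name':  list('aaabbbccc'),
    'sub':   ['m', 's', 'p'] * 3,
    'marks': [52, 69, 63, 36, 52, 56, 85, 62, 56],
})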

Here's one way
In [127]: df.groupby(['id', 'name']).marks.agg(
              lambda x: 'pass' if x.ge(40).all() else 'fail'
          ).reset_index(name='result')
Out[127]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Another way, inspired by jpp's solution, is to use replace or map:
In [132]: df.groupby(['id', 'name']).marks.min().ge(40).replace(
              {True: 'pass', False: 'fail'}
          ).reset_index(name='result')
Out[132]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
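The map variant mentioned above works the same way, mapping the booleans to labels:
df.groupby(['id', 'name']).marks.min().ge(40).map(
    {True: 'pass', False: 'fail'}
).reset_index(name='result')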

Here is one way via pandas. Note that your criterion is equivalent to the minimum mark being above 40, which makes this approach computationally more efficient.
import pandas as pd
import numpy as np

df = pd.read_csv('file.csv')

df = df.groupby(['id', 'name'])['marks'].min().reset_index()
df['result'] = np.where(df['marks'] > 40, 'pass', 'fail')
df = df[['id', 'name', 'result']]
Result
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Explanation
First take the groupby minimum of marks per id and name.
Then assign the result column a string depending on whether that minimum exceeds 40.
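As a compact sketch, the two steps can also be written as one chain (starting again from the raw df, with numpy imported as above):
out = (df.groupby(['id', 'name'], as_index=False)['marks'].min()
         .assign(result=lambda d: np.where(d['marks'] > 40, 'pass', 'fail'))
         [['id', 'name', 'result']])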

Related

Pandas lambda function works only on single column not multiple

I'm trying to apply a simple function (eliminating spaces) across multiple columns of a pandas DataFrame. However, while the .apply() method works properly on a single column, it doesn't work properly over multiple columns. Example:
#Weird Pandas behavior
######
#Input
import copy
import pandas as pd

df = pd.DataFrame({'a': ["7 7", "5 3"],
                   'b': ['f o', 'b r'],
                   'c': ["77", "53"]})
print(df)
     a    b   c
0  7 7  f o  77
1  5 3  b r  53
df[["a","b"]]=df[["a","b"]].apply(lambda x: x.replace(" ",""))
print(df)
     a    b   c
0  7 7  f o  77
1  5 3  b r  53
df2=copy.deepcopy(df)
print(df2)
     a    b   c
0  7 7  f o  77
1  5 3  b r  53
df2["a"]=df2["a"].apply(lambda x: x.replace(" ",""))
print(df2)
    a    b   c
0  77  f o  77
1  53  b r  53
As you can see, df doesn't change at all when I try to apply the "replace" operation to two columns, but the same dataset (or rather a copy of it) does change when I run the same operation on a single column. How can I remove spaces from two or more columns at once using the .apply() syntax?
I tried passing in the arguments '[a]' (nothing happens) and 'list(a)' (nothing happens) to df[].
You can use the replace function directly:
df[['a','b']] = df[['a','b']].replace(' ','', regex=True)
Output:
    a   b   c
0  77  fo  77
1  53  br  53
When you pass multiple columns, each x in the lambda is a pandas Series (a whole column), not an individual cell value. Series.replace only matches entire cell values, which is why nothing changed; use .str.replace() to substitute within each string:
df[["a","b"]]=df[["a","b"]].apply(lambda x: x.str.replace(" ",""))
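To see the difference directly, here is a minimal sketch comparing the two calls on a single column:
import pandas as pd

s = pd.Series(["7 7", "5 3"])

# Series.replace matches whole cell values, so nothing changes here
print(s.replace(" ", ""))      # still '7 7', '5 3'

# Series.str.replace substitutes within each string
print(s.str.replace(" ", ""))  # '77', '53'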

Is there a way to make custom function in pandas aggregation function?

I want to apply a custom function in a DataFrame, e.g.:
   index City  Age
0      1    A   50
1      2    A   24
2      3    B   65
3      4    A   40
4      5    B   68
5      6    B   48
Function to apply:
def count_people_above_60(age):
    ...  # I don't know if age can be passed as a Series or list to operate on later
    return count_people_above_60

I'm expecting to do something like:
df.groupby(['City']).agg({"Age": ["mean", count_people_above_60]})
Expected output:
City   Mean  People_Above_60
A     38.00                0
B     60.33                2
If performance is important, create a new column from the comparison converted to integers, so the count can be done with the sum aggregation:
df = (df.assign(new = df['Age'].gt(60).astype(int))
        .groupby(['City'])
        .agg(Mean=("Age", "mean"), People_Above_60=('new', "sum")))
print (df)
           Mean  People_Above_60
City
A     38.000000                0
B     60.333333                2
Your function can be made to work by comparing the values and summing, but it is slow with many groups or a large DataFrame:
def count_people_above_60(age):
    return (age > 60).sum()

df = (df.groupby(['City']).agg(Mean=("Age", "mean"),
                               People_Above_60=('Age', count_people_above_60)))
print (df)
           Mean  People_Above_60
City
A     38.000000                0
B     60.333333                2
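As a side note, booleans sum directly (True counts as 1), so the astype(int) step is optional; a minimal runnable sketch of the fast aggregation:
import pandas as pd

df = pd.DataFrame({'City': list('AABABB'),
                   'Age': [50, 24, 65, 40, 68, 48]})

out = (df.assign(above_60=df['Age'].gt(60))
         .groupby('City')
         .agg(Mean=('Age', 'mean'), People_Above_60=('above_60', 'sum')))
print(out)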

Create a function to extract specific columns and rename them in pandas

I have a target table structure (3 columns). I have multiple sources, each with its own nuances, but ultimately I want to use each table to populate the target table (appending entries).
I want to use a function (I know I can do it without one, but being able to use a function will help me out in the long run).
I have the following source table
id col1 col2 col3 col4
1 a b c g
1 a b d h
1 c d e i
I want this final structure
id num group
1 a b
1 a b
1 c d
So all I am doing is returning id, col1 and col2 from the source table (but note the column names change; for different source tables a different set of 3 columns will be extracted, hence the function).
The function I am using currently returns only 1 column (instead of 3).
Defining the function:
def func(x, col1='id', col2='num', col3='group'):
    d = [{'id': x[col1], 'num': x[col2], 'group': x[col3]}]
    return pd.DataFrame(d)
Applying the function to a source table:
target = source.apply(func, axis=1)
Here's a flexible way to write this function:
def func(dframe, **kwargs):
    return dframe.filter(items=kwargs.keys()).rename(columns=kwargs)
func(df, id="id", col1="num", col2="group")
# group id num
# 0 b 1 a
# 1 b 1 a
# 2 d 1 c
To ensure that your new dataframe preserves the column order of the original, you can sort the argument keys first:
def func(dframe, **kwargs):
    keys = sorted(kwargs.keys(), key=lambda x: list(dframe).index(x))
    return dframe.filter(items=keys).rename(columns=kwargs)
func(df, id="id", col1="num", col2="group")
# id num group
# 0 1 a b
# 1 1 a b
# 2 1 c d
You can also do:
def func(df, *l):
    d = pd.DataFrame(df, columns=l)
    d.rename(columns={'col1': 'num', 'col2': 'group'}, inplace=True)
    return d
df2 = func(df, 'id','col1','col2')
print(df2)
id num group
0 1 a b
1 1 a b
2 1 c d
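Since the goal is to append entries from multiple sources into one target, here is a sketch of reusing the flexible func above (source2 and its column names are hypothetical):
import pandas as pd

def func(dframe, **kwargs):
    return dframe.filter(items=kwargs.keys()).rename(columns=kwargs)

source = pd.DataFrame({'id': [1, 1, 1], 'col1': list('aac'),
                       'col2': list('bbd'), 'col3': list('cde'),
                       'col4': list('ghi')})
# hypothetical second source with differently named columns
source2 = pd.DataFrame({'key': [2], 'x1': ['e'], 'x2': ['f']})

target = pd.concat([func(source, id='id', col1='num', col2='group'),
                    func(source2, key='id', x1='num', x2='group')],
                   ignore_index=True)
print(target)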

Set column in data frame to value when string is found in separate column

I am using the pandas module in Python. I have a table x with columns a, b, c similar to the following:
a b c
z 4 ''
s 5 ''
u 4 ''
y 3 ''
I need to loop through column a and search for "z". When "z" is found, c should be set to "123" until "y" is found in column a, at which point c should be set to "321".
The data in the first column will not remain constant, so indexes will not work. I have tried many things and can't seem to find a solution. Any suggestions?
Notice the difference between replace and map:
map returns NaN for items with no match; ffill then fills each NaN from the previous row's value.
df.assign(c=df.a.map({'z':'123','y':'321'}).ffill())
a b c
0 z 4 123
1 s 5 123
2 u 4 123
3 y 3 321
Replace all non-y or z values by NaN:
df['c'] = df['a'].where(df['a'].isin(['y', 'z']))
Forward fill:
df['c'] = df['c'].ffill()
Replace:
df['c'] = df['c'].map({'y': '321', 'z': '123'})
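As a sketch, the three steps above can be chained into a single expression:
df['c'] = (df['a'].where(df['a'].isin(['y', 'z']))
                  .ffill()
                  .map({'z': '123', 'y': '321'}))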
A NumPy where-based approach:
df['n'] = np.where(df['a'].isin(['z', 'y']), df['a'], np.nan)
df['n'] = df['n'].ffill()
df['c'] = np.where(df['n'] == 'z', 123, 321)
df.drop(columns='n', inplace=True)
Output:
a b c
0 z 4 123
1 s 5 123
2 u 4 123
3 y 3 321

How do I specify a column header for pandas groupby result?

I need to group by and then return the values of a column in concatenated form. While I have managed to do this, the returned dataframe has the column name 0. Just 0. Is there a way to specify what the resulting column will be called?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting groupby object has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using
.rename(index=str, columns={0: "variant"}), this seems very inelegant.
Is there a way to provide a header for the column? Thanks
The simplest fix is to remove as_index=False so that a Series is returned, and add the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE': ['a','s','d','ss','t','y'],
                            'URL': [5,5,4,4,4,4],
                            'INDEX': list('aaabbb')})
print (all_columns)
INDEX URL VALUE
0 a 5 a
1 a 5 s
2 a 4 d
3 b 4 ss
4 b 4 t
5 b 4 y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
.apply(' '.join) \
.reset_index(name='variant')
print (all_columns_grouped)
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
You can use agg on the grouped column (VALUE in this case) to assign a column name to the result of a function. Note that this dict-of-renamings form of agg was removed in pandas 1.0; a named-aggregation equivalent is sketched after the output.
# Sample data (thanks #jezrael)
all_columns = pd.DataFrame({'VALUE': ['a','s','d','ss','t','y'],
                            'URL': [5,5,4,4,4,4],
                            'INDEX': list('aaabbb')})
# Solution
>>> all_columns.groupby(['INDEX','URL'], as_index=False)['VALUE'].agg(
...     {'variant': lambda x: ' '.join(x)})
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
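On pandas 1.0+, where the dict form above raises a SpecificationError, named aggregation gives the same result (a sketch, assuming the same sample data):
all_columns.groupby(['INDEX', 'URL'], as_index=False)['VALUE'].agg(
    variant=' '.join)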
