Pandas group_by multiple columns with conditional count - python

I have a dataframe, for instance:
import pandas as pd

df = pd.DataFrame({'Host': {0: 'N', 1: 'B', 2: 'N', 3: 'N', 4: 'N', 5: 'V', 6: 'B'},
                   'Registration': {0: 'Registered', 1: 'MR', 2: 'Registered', 3: 'Registered',
                                    4: '', 5: 'Registered', 6: 'Registered'},
                   'Val': {0: 'N', 1: 'B', 2: 'N', 3: 'N', 4: '', 5: 'V', 6: 'B'},
                   'Sum': {0: 100.0, 1: 0.0, 2: 300.0, 3: 150.0, 4: 0.0, 5: 0.0, 6: 20.0}})
I want to get the count for each Host. Something like:
df.groupby("Host").count()
"""
Host  Registration  Val  Sum
B                2    2    2
N                4    4    4
V                1    1    1
"""
But I want the count to be conditional as a function of each column. For example, in Sum I want to count only the rows greater than 0.0, and in the other columns only the values that are not empty. So my expected output would be:
"""
Host  Registration  Val  Sum
B                2    2    1
N                3    3    3
V                1    1    0
"""
Not sure how to do that. My best attempt has been:
df.groupby("Host").agg({'Registration': lambda x: (x != "").count(),
'Val':lambda x: (x != "").count(),
'Sum': lambda x: (x != 0).count()})
But this produces the same output as df.groupby("Host").count().
Any suggestions?

First, fixing your solution: to count True values use sum instead of count (count counts all non-missing values, whether True or False, which is why you got the same output as before):
df = df.groupby("Host").agg({'Registration': lambda x: (x != "").sum(),
                             'Val': lambda x: (x != "").sum(),
                             'Sum': lambda x: (x != 0).sum()})
print (df)
      Registration  Val  Sum
Host
B                2    2  1.0
N                3    3  3.0
V                1    1  0.0
Improved solution: create boolean columns before aggregating with sum:
df = df.assign(Registration = df['Registration'].ne(""),
               Val = df['Val'].ne(""),
               Sum = df['Sum'].ne(0)).groupby("Host").sum()
print (df)
      Registration  Val  Sum
Host
B                2    2    1
N                3    3    3
V                1    1    0
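As a side note (my sketch, not part of the original answer, assuming the original df from the question since both snippets above reassign df): if several columns share the same rule, a dict comprehension builds the aggregation spec once instead of repeating the lambda per column:
# empty-string check for the text columns, nonzero check for 'Sum'
spec = {c: (lambda x: (x != '').sum()) for c in ['Registration', 'Val']}
spec['Sum'] = lambda x: (x != 0).sum()
print(df.groupby('Host').agg(spec))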

Related

Filter for column value and set its other column to an array

I have a df such as
Letter | Stats
B      | 0
B      | 1
C      | 22
B      | 0
C      | 0
B      | 3
How can I filter for a value in the Letter column and then convert the Stats column for those rows into an array?
Basically I want to filter for B and convert the Stats column to an array. Thanks!
Here is one way to do it:
# function receives a dataframe and a letter as parameters
# and returns the Stats values as a list for the passed Letter
def grp(df, letter):
    return df.loc[df['Letter'].eq(letter)]['Stats'].values.tolist()

# pass the dataframe and the letter
result = grp(df, 'B')
print(result)
[0, 1, 0, 3]
Data used:
data = {'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
        'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)
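Not part of the original answer, but worth knowing: groupby can build these lists for every letter at once, which helps if you need more than one of them:
# one list per Letter, returned as a Series indexed by Letter
lists = df.groupby('Letter')['Stats'].agg(list)
print(lists['B'])
[0, 1, 0, 3]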
Although I believe the solution proposed by @Naveed is enough for this problem, one little extension could be suggested.
If you would like to get the result as a pandas Series and obtain some statistics for it:
data = {'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
        'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)

letter = 'B'
ser = pd.Series(name=letter, data=df.loc[df['Letter'].eq(letter)]['Stats'].values)
print(f"Max value: {ser.max()} | Min value: {ser.min()} | Median value: {ser.median()}")
Output:
Max value: 3 | Min value: 0 | Median value: 0.5
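A slightly shorter variant (my sketch, same result): slicing the column with .loc already returns a Series, so the explicit pd.Series constructor can be skipped:
ser = df.loc[df['Letter'].eq(letter), 'Stats'].rename(letter).reset_index(drop=True)
print(f"Max value: {ser.max()} | Min value: {ser.min()} | Median value: {ser.median()}")
Max value: 3 | Min value: 0 | Median value: 0.5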

Python Pandas group by mean() for a certain count of rows

I need to group by and take the mean() of only the first 2 values of each category. How do I define that?
df is like:
   category  value
-> a         2
-> a         5
   a         4
   a         8
-> b         6
-> b         3
   b         1
-> c         2
-> c         2
   c         7
Reading only the arrowed rows, the output should be like:
category  mean
a         3.5
b         4.5
c         2
How can I do this? I am trying the following, but do not know where to specify that only the first 2 observations of each category should be used:
output = df.groupby(['category'])['value'].mean().reset_index()
Your help is appreciated, thanks in advance.
You can also do this via groupby() and agg():
out = df.groupby('category', as_index=False)['value'].agg(lambda x: x.head(2).mean())
Try apply on each group of values and use head(2) to get just the first 2 values, then mean:
import pandas as pd

df = pd.DataFrame({
    'category': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b',
                 6: 'b', 7: 'c', 8: 'c', 9: 'c'},
    'value': {0: 2, 1: 5, 2: 4, 3: 8, 4: 6, 5: 3, 6: 1, 7: 2,
              8: 2, 9: 7}
})

output = df.groupby('category', as_index=False)['value'] \
           .apply(lambda a: a.head(2).mean())
print(output)
Output:
  category  value
0        a    3.5
1        b    4.5
2        c    2.0
Or create a boolean index to filter df with:
m = df.groupby('category').cumcount().lt(2)
output = df[m].groupby('category')['value'].mean().reset_index()
print(output)
  category  value
0        a    3.5
1        b    4.5
2        c    2.0
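Yet another option along the same lines (my addition, not from the answers above): GroupBy.head keeps the first n rows of each group directly, so you can trim first and aggregate after:
output = (df.groupby('category').head(2)
            .groupby('category', as_index=False)['value'].mean())
print(output)
  category  value
0        a    3.5
1        b    4.5
2        c    2.0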

How to count occurrences based on multiple criteria in a DataFrame

I'm trying to figure out how to count a number of occurrences in the DataFrame using multiple criteria.
In this particular example, I'd like to know the number of female passengers in Pclass 3.
   PassengerId  Pclass     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  female  22.0      1      1  3101298  12.2875   NaN        S
Here are a few of my failed attempts:
len(test[test["Sex"] == "female", test["Pclass"] == 3])
sum(test.Pclass == 3 & test.Sex == "female")
test.[test["Sex"] == "female", test["Pclass"] == 3].count()
None of them seem to be working.
In the end I created my own function, but there must be a simpler way to calculate that.
def countif(sex, pclass):
    x = 0
    for i in range(0, len(test)):
        s = test.iloc[i]['Sex']
        c = test.iloc[i]['Pclass']
        if s == sex and c == pclass:
            x = x + 1
    return x
Thank you in advance
There are a few ways to do this:
import numpy as np
import pandas as pd

test = pd.DataFrame({'PassengerId': {0: 892, 1: 893, 2: 894, 3: 895, 4: 896},
                     'Pclass': {0: 3, 1: 3, 2: 2, 3: 3, 4: 3},
                     'Sex': {0: 'male', 1: 'female', 2: 'male', 3: 'male', 4: 'female'},
                     'Age': {0: 34.5, 1: 47.0, 2: 62.0, 3: 27.0, 4: 22.0},
                     'SibSp': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1},
                     'Parch': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
                     'Ticket': {0: 330911, 1: 363272, 2: 240276, 3: 315154, 4: 3101298},
                     'Fare': {0: 7.8292, 1: 7.0, 2: 9.6875, 3: 8.6625, 4: 12.2875},
                     'Cabin': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
                     'Embarked': {0: 'Q', 1: 'S', 2: 'Q', 3: 'S', 4: 'S'}})
You need to put each boolean condition in round brackets and join them with &:
sum((test.Pclass == 3) & (test.Sex == "female"))
len(test[(test.Pclass == 3) & (test.Sex == "female")])
test[(test["Sex"] == "female") & (test["Pclass"] == 3)].shape[0]
Or you can do:
tab = pd.crosstab(test.Pclass, test.Sex)
Sex     female  male
Pclass
2            0     1
3            2     2
tab.iloc[tab.index == 3]['female']
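A related trick (my addition, not in the original answers): groupby(...).size() computes the counts for all combinations at once, and the pair you want can then be looked up by its MultiIndex label:
counts = test.groupby(['Pclass', 'Sex']).size()
print(counts[(3, 'female')])
2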

Pandas concatenate levels in multiindex

I have the following Excel file (shown here as a dict dump of the raw sheet):
{0: {0: nan, 1: nan, 2: nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
 1: {0: nan, 1: nan, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
 2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 3: {0: 'AA2', 1: 'a', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 4: {0: 'BB1', 1: 'b', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 6: {0: 'CC1', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 7: {0: 'CC2', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
I would like to create the following dataframe:
level_0      AA1    AA2    BB1   BB2    CC1    CC2
new      a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
0 1
A 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
B 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
C 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
What I tried:
# read the column index separately to avoid pandas inserting "Unnamed: ..."
# for the nans
df = pd.read_excel(file_path, skiprows=3, index_col=None, header=None)
df.set_index([0, 1], inplace=True)
# the column index
cols = pd.read_excel(file_path, nrows=3, index_col=None, header=None).loc[:, 2:]
cols = cols.fillna('N/A')
idx = pd.MultiIndex.from_arrays(cols.values)
df.columns = idx
The new dataframe:
     AA1  AA2  BB1  BB2  CC1  CC2
       a    a    b    b    c    c
   ng/mL  N/A  N/A   mL  N/A  N/A
0 1
A 1    1    1    1    1    1    1
  2    1    1    1    1    1    1
B 1    1    1    1    1    1    1
  2    1    1    1    1    1    1
C 1    1    1    1    1    1    1
  2    1    1    1    1    1    1
This approach works but is kind of tedious:
df1 = df.T.reset_index()
df1['new'] = df1.loc[:, 'level_1'] + ' ' + df1.loc[:, 'level_2']
df1.set_index(['level_0', 'new']).drop(['level_1', 'level_2'], axis=1).T
Which gives me:
level_0      AA1    AA2    BB1   BB2    CC1    CC2
new      a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
0 1
A 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
B 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
C 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
Is there a simpler solution available?
Use:
import numpy as np
import pandas as pd

# file from sample data
d = {0: {0: np.nan, 1: np.nan, 2: np.nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
     1: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
     2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
     3: {0: 'AA2', 1: 'a', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
     4: {0: 'BB1', 1: 'b', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
     5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
     6: {0: 'CC1', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
     7: {0: 'CC2', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
df = pd.DataFrame(d)
df.to_excel('file.xlsx', header=False, index=False)
First create a MultiIndex DataFrame with header=[0,1,2], then build the row MultiIndex from the first 2 columns with DataFrame.set_index and remove the index names with DataFrame.rename_axis:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
Then loop over the columns in a list comprehension and join the second level with the third level if it is not Unnamed; last, use MultiIndex.from_tuples:
tuples = [(a, f'{b} N/A') if c.startswith('Unnamed')
          else (a, f'{b} {c}')
          for a, b, c in df.columns]
print (tuples)
[('AA1', 'a ng/mL'), ('AA2', 'a N/A'),
 ('BB1', 'b N/A'), ('BB2', 'b mL'),
 ('CC1', 'c N/A'), ('CC2', 'c N/A')]
df.columns = pd.MultiIndex.from_tuples(tuples)
print (df)
print (df)
         AA1    AA2    BB1   BB2    CC1    CC2
     a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
A 1        1      1      1     1      1      1
  2        1      1      1     1      1      1
B 1        1      1      1     1      1      1
  2        1      1      1     1      1      1
C 1        1      1      1     1      1      1
  2        1      1      1     1      1      1
Another idea is to use:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
lv1 = df.columns.get_level_values(0)
lv2 = df.columns.get_level_values(1)
lv3 = df.columns.get_level_values(2)
lv3 = lv3.where(~lv3.str.startswith('Unnamed'),'N/A')
df.columns = [lv1, lv2.to_series() + ' ' + lv3]
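A quick sanity check (my addition): both variants should produce the same flattened second level, matching the tuples printed above:
print(df.columns.tolist())
[('AA1', 'a ng/mL'), ('AA2', 'a N/A'), ('BB1', 'b N/A'), ('BB2', 'b mL'), ('CC1', 'c N/A'), ('CC2', 'c N/A')]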

pandas slicing multiindex dataframe

I want to slice a multi-index pandas dataframe.
Here is the code to obtain my test data:
import pandas as pd
testdf = {
    'Name': {0: 'H', 1: 'H', 2: 'H', 3: 'H', 4: 'H'},
    'Division': {0: 'C', 1: 'C', 2: 'C', 3: 'C', 4: 'C'},
    'EmployeeId': {0: 14, 1: 14, 2: 14, 3: 14, 4: 14},
    'Amt1': {0: 124.39, 1: 186.78, 2: 127.94, 3: 258.35000000000002, 4: 284.77999999999997},
    'Amt2': {0: 30.0, 1: 30.0, 2: 30.0, 3: 30.0, 4: 60.0},
    'Employer': {0: 'Z', 1: 'Z', 2: 'Z', 3: 'Z', 4: 'Z'},
    'PersonId': {0: 14, 1: 14, 2: 14, 3: 14, 4: 15},
    'Provider': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B'},
    'Year': {0: 2012, 1: 2012, 2: 2013, 3: 2013, 4: 2012}}
testdf = pd.DataFrame(testdf)
testdf
grouper_keys = [
    'Employer',
    'Year',
    'Division',
    'Name',
    'EmployeeId',
    'PersonId']
testdf2 = pd.pivot_table(data=testdf,
                         values='Amt1',
                         index=grouper_keys,
                         columns='Provider',
                         fill_value=None,
                         margins=False,
                         dropna=True,
                         aggfunc=('sum', 'count'))
print(testdf2)
This gives a frame whose columns are a MultiIndex: 'sum' and 'count' on the first level and Provider ('A', 'B') on the second.
Now I can get only the sum for A or B using:
testdf2.loc[:, slice(None, ('sum', 'A'))]
How can I get both sum and count for only A or B?
Use xs for a cross-section:
testdf2.xs('A', axis=1, level=1)
Or keep the column level with drop_level=False
testdf2.xs('A', axis=1, level=1, drop_level=False)
You can use:
idx = pd.IndexSlice
df = testdf2.loc[:, idx[['sum', 'count'], 'A']]
print (df)
                                                     sum  count
Provider                                               A      A
Employer Year Division Name EmployeeId PersonId
Z        2012 C        H    14         14         311.17    2.0
                                       15            NaN    NaN
         2013 C        H    14         14         386.29    2.0
Another solution:
df = testdf2.loc[:, (slice('sum','count'), ['A'])]
print (df)
                                                     sum  count
Provider                                               A      A
Employer Year Division Name EmployeeId PersonId
Z        2012 C        H    14         14         311.17    2.0
                                       15            NaN    NaN
         2013 C        H    14         14         386.29    2.0
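For completeness (my sketch, not in the original answers): a plain tuple of label lists also works here, without needing pd.IndexSlice:
df = testdf2.loc[:, (['sum', 'count'], 'A')]
This selects the same sum and count columns for provider A as the idx[['sum', 'count'], 'A'] version above.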
