How to count occurrences based on multiple criteria in a DataFrame - python

I'm trying to figure out how to count a number of occurrences in the DataFrame using multiple criteria.
In this particular example, I'd like to know the number of female passengers in Pclass 3.
   PassengerId  Pclass     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  female  22.0      1      1  3101298  12.2875   NaN        S
Here are my failed attempts:
len(test[test["Sex"] == "female", test["Pclass"] == 3])
sum(test.Pclass == 3 & test.Sex == "female")
test.[test["Sex"] == "female", test["Pclass"] == 3].count()
None of them seem to be working.
In the end I wrote my own function, but there must be a simpler way to calculate this.
def countif(sex, pclass):
    x = 0
    for i in range(0, len(test)):
        s = test.iloc[i]['Sex']
        c = test.iloc[i]['Pclass']
        if s == sex and c == pclass:
            x = x + 1
    return x
Thank you in advance

There are a few ways to do this:
test = pd.DataFrame({'PassengerId': {0: 892, 1: 893, 2: 894, 3: 895, 4: 896},
'Pclass': {0: 3, 1: 3, 2: 2, 3: 3, 4: 3},
'Sex': {0: 'male', 1: 'female', 2: 'male', 3: 'male', 4: 'female'},
'Age': {0: 34.5, 1: 47.0, 2: 62.0, 3: 27.0, 4: 22.0},
'SibSp': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1},
'Parch': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
'Ticket': {0: 330911, 1: 363272, 2: 240276, 3: 315154, 4: 3101298},
'Fare': {0: 7.8292, 1: 7.0, 2: 9.6875, 3: 8.6625, 4: 12.2875},
'Cabin': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'Embarked': {0: 'Q', 1: 'S', 2: 'Q', 3: 'S', 4: 'S'}})
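You need to wrap each boolean condition in parentheses and join them with &. Without the parentheses, & is evaluated before ==, so the comparison is not grouped the way you intended: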
sum((test.Pclass == 3) & (test.Sex == "female"))
len(test[(test.Pclass == 3) & (test.Sex == "female")])
test[(test["Sex"] == "female") & (test["Pclass"] == 3)].shape[0]
Or you can build a table of counts for every combination and look up the cell you need:
tab = pd.crosstab(test.Pclass, test.Sex)
Sex     female  male
Pclass
2            0     1
3            2     2
tab.loc[3, 'female']
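As a self-contained illustration of the approaches above, here is a minimal sketch. It rebuilds only the Pclass and Sex columns of the sample, and the names mask, n_mask, counts and n_group are made up for this example.

```python
import pandas as pd

# Only the two columns needed for the count, copied from the sample rows.
test = pd.DataFrame({
    "Pclass": [3, 3, 2, 3, 3],
    "Sex": ["male", "female", "male", "male", "female"],
})

# Each comparison is parenthesised because & binds more tightly than ==.
mask = (test["Pclass"] == 3) & (test["Sex"] == "female")
n_mask = int(mask.sum())  # True counts as 1, False as 0

# Counts for every (Pclass, Sex) pair at once; look up one pair by label
# and fall back to 0 if that pair never occurs.
counts = test.groupby(["Pclass", "Sex"]).size()
n_group = int(counts.get((3, "female"), 0))

print(n_mask, n_group)
```

The mask version is enough for a single question; the grouped version is probably more convenient if you need the counts for several combinations.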

Related

How to form multiple subsets of dataframes and compare and contrast

I have 2 dataframes, something like this:
data1 = pd.DataFrame({'transaction_id': {0: 'abc', 1: 'bcd', 2: 'efg'},
                      'store_number': {0: '1048', 1: '1048', 2: '1048'},
                      'activity_code': {0: 'deposit-check',
                                        1: 'deposit-check',
                                        2: 'deposit-check'},
                      'amount': {0: 10, 1: 11, 2: 12}})
data2 = pd.DataFrame({'transaction_id': {0: 'pqr', 1: 'qrs', 2: 'rst'},
                      'store_number': {0: '1048', 1: '1048', 2: '1048'},
                      'activity_code': {0: 'deposit-check',
                                        1: 'deposit-check',
                                        2: 'deposit-check'},
                      'amount': {0: 100, 1: 200, 2: 300}})
with more rows.
I want to take multiple subsets from each dataset and do a comparison of the total amount in each.
For example, take out 2 rows from data1 and data2:
data1_subset1 = pd.DataFrame({'transaction_id': {0: 'abc', 1: 'bcd'},
                              'store_number': {0: '1048', 1: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                1: 'deposit-check'},
                              'amount': {0: 10, 1: 11}})
data1_subset2 = pd.DataFrame({'transaction_id': {0: 'abc', 2: 'efg'},
                              'store_number': {0: '1048', 2: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                2: 'deposit-check'},
                              'amount': {0: 10, 2: 12}})
and so on till I have all possible 2 row combinations of data1.
data2_subset1 = pd.DataFrame({'transaction_id': {0: 'pqr', 1: 'qrs'},
                              'store_number': {0: '1048', 1: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                1: 'deposit-check'},
                              'amount': {0: 100, 1: 200}})
data2_subset2 = pd.DataFrame({'transaction_id': {0: 'pqr', 2: 'rst'},
                              'store_number': {0: '1048', 2: '1048'},
                              'activity_code': {0: 'deposit-check',
                                                2: 'deposit-check'},
                              'amount': {0: 100, 2: 300}})
and so on till I have all possible 2 row combinations of data2.
Now for each of these subsets, say data1_subset1 vs data2_subset1, I would like to compare if the store_number and activity_code are matching using inner join and then check the difference between the total amount from data1_subset1 vs data2_subset1.
Further I would also like to extend this to all possible size combinations. In the above example we compared all 2 row combinations. But I would like to extend this to 2 row combinations vs 3 row combinations, 2 rows vs 4, 3 vs 5, and so on till all the possibilities are checked.
Is there an efficient way of doing this in Python / Pandas? The first approach I had in mind was just a nested loop using indexes.
Use itertools.combinations:
from itertools import combinations
for comb in combinations(data1.index, r=2):
    print(f'combination {comb}')
    print(data1.loc[list(comb)])
As a function:
def subset(df, r=2):
    for comb in combinations(df.index, r=r):
        yield df.loc[list(comb)]

for df in subset(data1, r=2):
    print(df)
output:
combination (0, 1)
  transaction_id store_number  activity_code  amount
0            abc         1048  deposit-check      10
1            bcd         1048  deposit-check      11
combination (0, 2)
  transaction_id store_number  activity_code  amount
0            abc         1048  deposit-check      10
2            efg         1048  deposit-check      12
combination (1, 2)
  transaction_id store_number  activity_code  amount
1            bcd         1048  deposit-check      11
2            efg         1048  deposit-check      12
If you want more rows in the combination change the r=2 parameter to the number of wanted rows.
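To go from the subsets to the comparison of totals the question asks about, something like the following might work. It is only a sketch: subset_totals and compare_totals are hypothetical helper names, and it skips the inner join on store_number and activity_code because every sample row shares the same values for those columns.

```python
from itertools import combinations, product

import pandas as pd

data1 = pd.DataFrame({"transaction_id": ["abc", "bcd", "efg"],
                      "store_number": ["1048"] * 3,
                      "activity_code": ["deposit-check"] * 3,
                      "amount": [10, 11, 12]})
data2 = pd.DataFrame({"transaction_id": ["pqr", "qrs", "rst"],
                      "store_number": ["1048"] * 3,
                      "activity_code": ["deposit-check"] * 3,
                      "amount": [100, 200, 300]})


def subset_totals(df, r):
    """Yield (index labels, total amount) for every r-row combination of df."""
    for comb in combinations(df.index, r):
        yield comb, df.loc[list(comb), "amount"].sum()


def compare_totals(left, right, r_left, r_right):
    """Pair each r_left-row subset of left with each r_right-row subset of right."""
    pairs = product(subset_totals(left, r_left), subset_totals(right, r_right))
    rows = [(a, b, total_b - total_a) for (a, total_a), (b, total_b) in pairs]
    return pd.DataFrame(rows, columns=["rows_left", "rows_right", "difference"])


# All 2-row subsets of data1 against all 2-row subsets of data2.
result = compare_totals(data1, data2, r_left=2, r_right=2)
print(result)
```

Changing r_left and r_right gives the 2 vs 3, 2 vs 4, ... comparisons. Keep in mind that the number of combinations grows very quickly with the number of rows, so this is only practical for small frames.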

Pandas group_by multiple columns with conditional count

I have a dataframe, for instance:
df = pd.DataFrame({'Host': {0: 'N',
1: 'B',
2: 'N',
3: 'N',
4: 'N',
5: 'V',
6: 'B'},
'Registration': {0: 'Registered',
1: 'MR',
2: 'Registered',
3: 'Registered',
4: '',
5: 'Registered',
6: 'Registered'},
'Val': {0: 'N',
1: 'B',
2: 'N',
3: 'N',
4: '',
5: 'V',
6: 'B'},
'Sum': {0: 100.0,
1: 0.0,
2: 300.0,
3: 150.0,
4: 0.0,
5: 0.0,
6: 20.0}})
I want to get the count, for each Host. Something like:
df.groupby("Host").count()
"""
Host  Registration  Val  Sum
B                2    2    2
N                4    4    4
V                1    1    1
"""
But I want it conditional as a function of each column. For example, I want to count in Sum, only those rows that have more than 0.0, and in the others the ones that are not empty. So my expected output would be:
"""
Host  Registration  Val  Sum
B                2    2    1
N                3    3    3
V                1    1    0
"""
Not sure how to do that. My best attempt has been:
df.groupby("Host").agg({'Registration': lambda x: (x != "").count(),
                        'Val': lambda x: (x != "").count(),
                        'Sum': lambda x: (x != 0).count()})
But this produces the same output as df.groupby("Host").count()
Any suggestion?
First, fixing your solution: to count the True values, use sum instead of count (count returns the number of non-missing values, and both True and False are non-missing):
out = df.groupby("Host").agg({'Registration': lambda x: (x != "").sum(),
                              'Val': lambda x: (x != "").sum(),
                              'Sum': lambda x: (x != 0).sum()})
print(out)
      Registration  Val  Sum
Host
B                2    2  1.0
N                3    3  3.0
V                1    1  0.0
Improved solution: convert the columns to boolean flags first, then aggregate with sum:
out = df.assign(Registration=df['Registration'].ne(""),
                Val=df['Val'].ne(""),
                Sum=df['Sum'].ne(0)).groupby("Host").sum()
print(out)
      Registration  Val  Sum
Host
B                2    2    1
N                3    3    3
V                1    1    0
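For reference, here is a self-contained variant of the boolean-flag idea. It is a sketch rather than the only way to do it: the names flags and counts are made up here, and it uses gt(0) for Sum because the question asks for values "more than 0.0" (on this sample ne(0) gives the same result).

```python
import pandas as pd

df = pd.DataFrame({
    "Host": ["N", "B", "N", "N", "N", "V", "B"],
    "Registration": ["Registered", "MR", "Registered", "Registered",
                     "", "Registered", "Registered"],
    "Val": ["N", "B", "N", "N", "", "V", "B"],
    "Sum": [100.0, 0.0, 300.0, 150.0, 0.0, 0.0, 20.0],
})

# One boolean column per condition; the original frame is left untouched.
flags = pd.DataFrame({
    "Registration": df["Registration"].ne(""),  # non-empty string
    "Val": df["Val"].ne(""),                    # non-empty string
    "Sum": df["Sum"].gt(0),                     # strictly positive
})

# Summing booleans per group counts the rows where the condition holds.
counts = flags.groupby(df["Host"]).sum()
print(counts)
```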

How to convert two columns to list of values?

I have a dataframe like this below,
    A        B   C    D
0  A1    Egypt  10  Yes
1  A1  Morocco   5   No
2  A2  Algeria   4  Yes
3  A3    Egypt  45   No
4  A3    Egypt  17  Yes
5  A3  Tunisia   4  Yes
6  A3  Algeria  32   No
7  A4  Tunisia   7   No
8  A5    Egypt   6   No
9  A5  Morocco   1   No
I want to get the count of Yes and No values in column D for each value of column B. The expected output needs to be in lists like the ones below, which I can use to plot a multivariable chart.
Expected output:
yes = [1,2,0,1]
no = [1,2,2,1]
country = ['Algeria', 'Egypt', 'Morocco','Tunisia']
I am not sure how to achieve this from the above dataframe. Any help will be appreciated.
Here is the minimum reproducible dataframe sample:
import pandas as pd
df = pd.DataFrame({'A': {0: 'A1',
1: 'A1',
2: 'A2',
3: 'A3',
4: 'A3',
5: 'A3',
6: 'A3',
7: 'A4',
8: 'A5',
9: 'A5'},
'B': {0: 'Egypt',
1: 'Morocco',
2: 'Algeria',
3: 'Egypt',
4: 'Egypt',
5: 'Tunisia',
6: 'Algeria',
7: 'Tunisia',
8: 'Egypt',
9: 'Morocco'},
'C': {0: 10, 1: 5, 2: 4, 3: 45, 4: 17, 5: 4, 6: 32, 7: 7, 8: 6, 9: 1},
'D': {0: 'Yes',
1: 'No',
2: 'Yes',
3: 'No',
4: 'Yes',
5: 'Yes',
6: 'No',
7: 'No',
8: 'No',
9: 'No'}}
)
Use crosstab:
df1 = pd.crosstab(df.B, df.D)
print (df1)
D        No  Yes
B
Algeria   1    1
Egypt     2    2
Morocco   2    0
Tunisia   1    1
Then, to plot, use DataFrame.plot.bar:
df1.plot.bar()
If you need lists:
yes = df1['Yes'].tolist()
no = df1['No'].tolist()
country = df1.index.tolist()
Alternatively, create boolean columns flagging "Yes" and "No", then group by "B" and sum the newly created columns:
country, yes, no = df.assign(Yes=df['D']=='Yes', No=df['D']=='No').groupby('B')[['Yes','No']].sum().reset_index().T.to_numpy().tolist()
Output:
['Algeria', 'Egypt', 'Morocco', 'Tunisia']
[1, 2, 0, 1]
[1, 2, 2, 1]
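Putting the crosstab approach together as one runnable snippet (a sketch that keeps only columns B and D from the sample). The reindex call is an addition of mine, not part of the answers above: it fixes the column order and would fill in zeros if either "Yes" or "No" never appeared in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "B": ["Egypt", "Morocco", "Algeria", "Egypt", "Egypt",
          "Tunisia", "Algeria", "Tunisia", "Egypt", "Morocco"],
    "D": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No"],
})

# Rows are countries (sorted), columns are the Yes/No counts.
table = pd.crosstab(df["B"], df["D"]).reindex(columns=["Yes", "No"], fill_value=0)

country = table.index.tolist()
yes = table["Yes"].tolist()
no = table["No"].tolist()

print(country)
print(yes)
print(no)
```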

groupby + apply results in a series appearing both in index and column - how to prevent it?

I've got the following data frame:
dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
'var1': {0: 20.272108843537413,
1: 21.088435374149658,
2: 20.68027210884354,
3: 23.945578231292515,
4: 22.857142857142854,
5: 21.496598639455787,
6: 39.18367346938776,
7: 36.46258503401361,
8: 34.965986394557824},
'var2': {0: 27.731092436974773,
1: 43.907563025210074,
2: 55.67226890756303,
3: 62.81512605042017,
4: 71.63865546218487,
5: 83.40336134453781,
6: 43.48739495798319,
7: 59.243697478991606,
8: 67.22689075630252},
'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
I wanted to sort within groups according to var1, so I wrote the following:
ex.groupby('var3').apply(lambda x: x.sort_values('var1'))
However, it results in a data frame which has var3 both in index and in column. How to prevent that and leave it only in a column?
Pass the optional parameter as_index=False to groupby:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1'))
Or, if you don't want a MultiIndex:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1')) \
.reset_index(level=0, drop=True)
You could use (note that drop expects a boolean, and that this discards the id index):
df_sorted = ex.groupby('var3').apply(lambda x: x.sort_values('var1')).reset_index(drop=True)
print(df_sorted)
        var1       var2  var3
0  20.272109  27.731092     1
1  20.680272  55.672269     1
2  21.088435  43.907563     1
3  21.496599  83.403361     1
4  22.857143  71.638655     1
5  23.945578  62.815126     1
6  34.965986  67.226891     2
7  36.462585  59.243697     2
8  39.183673  43.487395     2
But you only need DataFrame.sort_values, sorting first by var3 and then by var1:
df_sort = ex.sort_values(['var3', 'var1'])
print(df_sort)
         var1       var2  var3
id
11  20.272109  27.731092     1
13  20.680272  55.672269     1
12  21.088435  43.907563     1
16  21.496599  83.403361     1
15  22.857143  71.638655     1
14  23.945578  62.815126     1
17  34.965986  67.226891     2
18  36.462585  59.243697     2
19  39.183673  43.487395     2
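If you want to convince yourself that a plain two-key sort really is sorted within each group, a quick check along these lines should do. This is a sketch: the var1 values are rounded copies of the sample, var2 is left out, and sorted_within is a name I made up.

```python
import pandas as pd

# var1 rounded to two decimals; var2 omitted because it plays no role here.
ex = pd.DataFrame({
    "id": [11, 12, 13, 14, 15, 16, 19, 18, 17],
    "var1": [20.27, 21.09, 20.68, 23.95, 22.86, 21.50, 39.18, 36.46, 34.97],
    "var3": [1, 1, 1, 1, 1, 1, 2, 2, 2],
}).set_index("id")

# Sort by group key first, then by the value inside each group.
out = ex.sort_values(["var3", "var1"])

# One boolean per group: is var1 ascending inside that group?
sorted_within = out.groupby("var3")["var1"].apply(lambda s: s.is_monotonic_increasing)

print(out)
print(sorted_within)
```

Because no groupby result is used for the frame itself, var3 stays an ordinary column and id stays the index.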

Pandas concatenate levels in multiindex

I have the following Excel file:
{0: {0: nan, 1: nan, 2: nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
1: {0: nan, 1: nan, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
3: {0: 'AA2', 1: 'a', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
4: {0: 'BB1', 1: 'b', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
6: {0: 'CC1', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
7: {0: 'CC2', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
I would like to create following dataframe:
level_0      AA1    AA2    BB1   BB2    CC1    CC2
new      a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
0 1
A 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
B 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
C 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
What I tried:
# read the column index separately to avoid pandas inputting "Unnamed: ..."
# for the nans
df = pd.read_excel(file_path, skiprows=3, index_col=None, header=None)
df.set_index([0, 1], inplace=True)
# the column index
cols = pd.read_excel(file_path, nrows=3, index_col=None, header=None).loc[:, 2:]
cols = cols.fillna('N/A')
idx = pd.MultiIndex.from_arrays(cols.values)
df.columns = idx
The new dataframe:
      AA1  AA2  BB1  BB2  CC1  CC2
        a    a    b    b    c    c
    ng/mL  N/A  N/A   mL  N/A  N/A
0 1
A 1     1    1    1    1    1    1
  2     1    1    1    1    1    1
B 1     1    1    1    1    1    1
  2     1    1    1    1    1    1
C 1     1    1    1    1    1    1
  2     1    1    1    1    1    1
This approach works but is kind of tedious:
df1 = df.T.reset_index()
df1['new'] = df1.loc[:, 'level_1'] + ' ' + df1.loc[:, 'level_2']
df1.set_index(['level_0', 'new']).drop(['level_1', 'level_2'], axis=1).T
Which gives me:
level_0      AA1    AA2    BB1   BB2    CC1    CC2
new      a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
0 1
A 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
B 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
C 1            1      1      1     1      1      1
  2            1      1      1     1      1      1
Is there a simpler solution available?
Use:
#file from sample data
d = {0: {0: np.nan, 1: np.nan, 2: np.nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
1: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
3: {0: 'AA2', 1: 'a', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
4: {0: 'BB1', 1: 'b', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
6: {0: 'CC1', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
7: {0: 'CC2', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
df = pd.DataFrame(d)
df.to_excel('file.xlsx', header=False, index=False)
First read the file with header=[0,1,2] to get a DataFrame with MultiIndex columns, then move the first 2 columns into a MultiIndex with DataFrame.set_index and remove the index names with DataFrame.rename_axis:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
Then loop over the column tuples in a list comprehension, joining the second level with the third (or with N/A if the third starts with Unnamed), and finally rebuild the columns with MultiIndex.from_tuples:
tuples = [(a, f'{b} N/A') if c.startswith('Unnamed')
          else (a, f'{b} {c}')
          for a, b, c in df.columns]
print (tuples)
[('AA1', 'a ng/mL'), ('AA2', 'a N/A'),
('BB1', 'b N/A'), ('BB2', 'b mL'),
('CC1', 'c N/A'), ('CC2', 'c N/A')]
df.columns = pd.MultiIndex.from_tuples(tuples)
print (df)
        AA1    AA2    BB1   BB2    CC1    CC2
    a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
A 1       1      1      1     1      1      1
  2       1      1      1     1      1      1
B 1       1      1      1     1      1      1
  2       1      1      1     1      1      1
C 1       1      1      1     1      1      1
  2       1      1      1     1      1      1
Another idea is to use:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
lv1 = df.columns.get_level_values(0)
lv2 = df.columns.get_level_values(1)
lv3 = df.columns.get_level_values(2)
lv3 = lv3.where(~lv3.str.startswith('Unnamed'),'N/A')
df.columns = [lv1, lv2.to_series() + ' ' + lv3]
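The level-joining step can also be tried without an Excel file. The sketch below builds the three-level columns by hand; the "Unnamed: ..." labels are hand-written stand-ins for what read_excel puts in blank header cells, and merge_last_levels is a hypothetical helper name.

```python
import pandas as pd

# Three-level column index, shortened to four columns for brevity.
cols = pd.MultiIndex.from_arrays([
    ["AA1", "AA2", "BB1", "BB2"],
    ["a", "a", "b", "b"],
    ["ng/mL", "Unnamed: 3_level_2", "Unnamed: 4_level_2", "mL"],
])
df = pd.DataFrame([[1, 1, 1, 1], [1, 1, 1, 1]], columns=cols)


def merge_last_levels(columns, placeholder="N/A"):
    """Join levels 1 and 2 of a three-level column index into one label."""
    tuples = []
    for top, middle, unit in columns:
        # Blank header cells are replaced by the placeholder.
        unit = placeholder if str(unit).startswith("Unnamed") else unit
        tuples.append((top, f"{middle} {unit}"))
    return pd.MultiIndex.from_tuples(tuples)


df.columns = merge_last_levels(df.columns)
print(df.columns.tolist())
```

This is the same idea as the list comprehension above, just wrapped in a function so the placeholder text can be changed in one place.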
