Dataframe filtering in pandas - python

How can I filter or subset a particular group within a dataframe (e.g., admitted females from the dataframe below)?
I am trying to sum up admission/rejection rates by gender. This dataframe is small, but what if it were much larger, say tens of thousands of lines, where indexing individual values is impractical?
Admit Gender Dept Freq
0 Admitted Male A 512
1 Rejected Male A 313
2 Admitted Female A 89
3 Rejected Female A 19
4 Admitted Male B 353
5 Rejected Male B 207
6 Admitted Female B 17
7 Rejected Female B 8
8 Admitted Male C 120
9 Rejected Male C 205
10 Admitted Female C 202
11 Rejected Female C 391
12 Admitted Male D 138
13 Rejected Male D 279
14 Admitted Female D 131
15 Rejected Female D 244
16 Admitted Male E 53
17 Rejected Male E 138
18 Admitted Female E 94
19 Rejected Female E 299
20 Admitted Male F 22
21 Rejected Male F 351
22 Admitted Female F 24
23 Rejected Female F 317

To filter the data you can use the very comprehensive query function.
import pandas as pd

# Test data
df = pd.DataFrame({'Admit': ['Admitted', 'Rejected', 'Admitted', 'Rejected', 'Admitted', 'Rejected', 'Admitted'],
'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
'Freq': [512, 313, 89, 19, 353, 207, 17],
'Gender Dept': ['A', 'A', 'A', 'A', 'B', 'B', 'B']})
df.query('Admit == "Admitted" and Gender == "Female"')
Admit Freq Gender Gender Dept
2 Admitted 89 Female A
6 Admitted 17 Female B
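The same filter can also be written as a boolean mask, and query() can read the filter values from Python variables. The following is a minimal sketch on the first four rows of the question's table; the variable names status and gender are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Admit':  ['Admitted', 'Rejected', 'Admitted', 'Rejected'],
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Dept':   ['A', 'A', 'A', 'A'],
    'Freq':   [512, 313, 89, 19],
})

status, gender = 'Admitted', 'Female'   # hypothetical filter values

# Boolean mask: combine conditions with & and wrap each one in parentheses.
by_mask = df[(df['Admit'] == status) & (df['Gender'] == gender)]

# Same filter with query(); @name refers to a local Python variable.
by_query = df.query('Admit == @status and Gender == @gender')

print(by_mask)
```

Both forms work on the whole column at once, so they scale to large frames without indexing individual rows.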
To summarize data use groupby.
group = df.groupby(['Admit', 'Gender'])[['Freq']].sum()
print(group)
                 Freq
Admit    Gender
Admitted Female   106
         Male     865
Rejected Female    19
         Male     520
You can then filter the result simply by subsetting on the created MultiIndex.
group.loc[('Admitted', 'Female')]
Freq 106
Name: (Admitted, Female), dtype: int64
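Since the question asks for admission and rejection rates by gender, the grouped totals can be turned into rates by pivoting Admit into columns. This is a sketch under the assumption that a "rate" means admitted applicants divided by all applicants of that gender; it uses the Dept A and B rows from the question's table.

```python
import pandas as pd

df = pd.DataFrame({
    'Admit':  ['Admitted', 'Rejected'] * 4,
    'Gender': ['Male', 'Male', 'Female', 'Female'] * 2,
    'Dept':   ['A'] * 4 + ['B'] * 4,
    'Freq':   [512, 313, 89, 19, 353, 207, 17, 8],
})

# Total Freq per (Gender, Admit), with one column per Admit value.
totals = df.groupby(['Gender', 'Admit'])['Freq'].sum().unstack('Admit')

# Share of each gender's applicants that was admitted.
admit_rate = totals['Admitted'] / totals.sum(axis=1)

print(totals)
print(admit_rate)
```

The rejection rate is then simply 1 - admit_rate.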

Related

Comparing two DataFrames and retrieving modified values

Two separate similar DataFrames with different lengths
df2 = pd.DataFrame([('James', 25, 'Male', 155),
                    ('John', 27, 'Male', 175),
                    ('Patricia', 23, 'Female', 135),
                    ('Mary', 22, 'Female', 125),
                    ('Martin', 30, 'Male', 185),
                    ('Margaret', 29, 'Female', 141),
                    ('Kevin', 22, 'Male', 198)],
                   columns=['First Name', 'Age', 'Gender', 'Weight'])
Index  First Name  Age  Gender  Weight
0      James       25   Male    155
1      John        27   Male    175
2      Patricia    23   Female  135
3      Mary        22   Female  125
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198
df1 = pd.DataFrame([('James', 25, 'Male', 165, "5'10"),
                    ('John', 27, 'Male', 175, "5'9"),
                    ('Matthew', 29, 'Male', 183, "6'0"),
                    ('Patricia', 23, 'Female', 135, "5'3"),
                    ('Mary', 22, 'Female', 125, "5'4"),
                    ('Rachel', 29, 'Female', 123, "5'3"),
                    ('Jose', 20, 'Male', 175, "5'11"),
                    ('Kevin', 22, 'Male', 192, "6'2")],
                   columns=['First Name', 'Age', 'Gender', 'Weight', 'Height'])
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    165     5'10
1      John        27   Male    175     5'9
2      Matthew     29   Male    183     6'0
3      Patricia    23   Female  135     5'3
4      Mary        22   Female  125     5'4
5      Rachel      29   Female  123     5'3
6      Jose        20   Male    175     5'11
7      Kevin       22   Male    192     6'2
df2 has some rows which are not in df1 and df1 has some values which are not in df2.
I need to find the modified values: where the First Name is the same in both frames, I need to check whether any value changed. For example, the weight of James is 155 in df2 but 165 in df1, so I need to store the modified weight of James (165) and its index (0) in a new dataframe. I want to do this without iteration, which takes a long time because this is only a sample of a much bigger dataframe with many more rows and columns.
Desired output:
df2=
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    155     5'10
1      John        27   Male    175     5'9
2      Patricia    23   Female  135     5'3
3      Mary        22   Female  125     5'4
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198     6'2
Martin's and Margaret's heights are not there in df1, so their heights are not updated in df2
Desired Output
modval=
Index  First Name  Age  Gender  Weight  Height
0      James                    165
7      Kevin                    192
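One loop-free way to build modval is to merge the two frames on First Name and compare the paired Weight columns. The sketch below is only an illustration: it assumes First Name is unique in both frames, uses trimmed copies of the data above, and the '_df2' suffix is a name chosen for this example.

```python
import pandas as pd

# Trimmed versions of the frames above (First Name assumed unique).
df1 = pd.DataFrame({'First Name': ['James', 'John', 'Matthew', 'Kevin'],
                    'Weight': [165, 175, 183, 192]},
                   index=[0, 1, 2, 7])
df2 = pd.DataFrame({'First Name': ['James', 'John', 'Martin', 'Kevin'],
                    'Weight': [155, 175, 185, 198]})

# reset_index() keeps df1's row labels as a column named 'index'.
merged = df1.reset_index().merge(df2, on='First Name', suffixes=('', '_df2'))

# Rows present in both frames whose Weight differs.
changed = merged[merged['Weight'] != merged['Weight_df2']]

# df1's value and original row label for every changed row.
modval = changed.set_index('index')[['First Name', 'Weight']]
print(modval)
```

The default inner merge drops names that appear in only one frame, so only people present in both are compared.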

Copy contents from one Dataframe to another based on column values in Pandas

Two separate similar DataFrames with different lengths
df2=
Index  First Name  Age  Gender  Weight
0      James       25   Male    155
1      John        27   Male    175
2      Patricia    23   Female  135
3      Mary        22   Female  125
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198
df1=
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    165     5'10
1      John        27   Male    175     5'9
2      Matthew     29   Male    183     6'0
3      Patricia    23   Female  135     5'3
4      Mary        22   Female  125     5'4
5      Rachel      29   Female  123     5'3
6      Jose        20   Male    175     5'11
7      Kevin       22   Male    192     6'2
df2 has some rows which are not in df1 and df1 has some values which are not in df2.
I am comparing df1 against df2. I have computed the new and deleted entries with the following code:
newentries = df2.loc[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1.loc[~df1['First Name'].isin(df2['First Name'])]
where newentries denotes the rows that are in df2 but not in df1, and deletedentries denotes the rows that are in df1 but not in df2. The above code works perfectly fine.
I need to copy the height from df1 to df2 when the first names are equal.
df2.loc[df2['First Name'].isin(df1['First Name']),"Height"] = df1.loc[df1['First Name'].isin(df2['First Name']),"Height"]
The above code copies values, but because of indexing they do not land in the corresponding rows. I tried promoting First Name to the index, but that doesn't solve the issue. Please help me with a solution.
Also, I need to find the modified values: where the First Name is the same in both frames, I need to check whether any value changed. For example, the weight of James is 155 in df2 but 165 in df1, so I need to store the modified weight of James (165) and its index (0) in a new dataframe. I want to do this without iteration, which takes a long time because this is only a sample of a much bigger dataframe with many more rows and columns.
Desired output:
df2=
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    155     5'10
1      John        27   Male    175     5'9
2      Patricia    23   Female  135     5'3
3      Mary        22   Female  125     5'4
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198     6'2
Martin's and Margaret's heights are not there in df1, so their heights are not updated in df2
newentries=
Index  First Name  Age  Gender  Weight  Height
4      Martin      30   Male    185
5      Margaret    29   Female  141
deletedentries=
Index  First Name  Age  Gender  Weight  Height
2      Matthew     29   Male    183     6'0
5      Rachel      29   Male    123     5'3
6      Jose        20   Male    175     5'11
modval=
Index  First Name  Age  Gender  Weight  Height
0      James                    165
7      Kevin                    192
Building off of Rabinzel's answer:
output = df2.merge(df1, how='left', on='First Name', suffixes=[None, '_old'])
df3 = output[['First Name', 'Age', 'Gender', 'Weight', 'Height']]
cols = df1.columns[1:-1]
modval = pd.DataFrame()
for col in cols:
    modval = pd.concat([modval, output[['First Name', col + '_old']][output[col] != output[col + '_old']].dropna()])
    modval.rename(columns={col + '_old': col}, inplace=True)
newentries = df2[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1[~df1['First Name'].isin(df2['First Name'])]
print(df3, newentries, deletedentries, modval, sep='\n\n')
Output:
  First Name  Age  Gender  Weight Height
0      James   25    Male     155   5'10
1       John   27    Male     175    5'9
2   Patricia   23  Female     135    5'3
3       Mary   22  Female     125    5'4
4     Martin   30    Male     185    NaN
5   Margaret   29  Female     141    NaN
6      Kevin   22    Male     198    6'2

  First Name  Age  Gender  Weight
4     Martin   30    Male     185
5   Margaret   29  Female     141

  First Name  Age  Gender  Weight Height
2    Matthew   29    Male     183    6'0
5     Rachel   29    Male     123    5'3
6       Jose   20    Male     175   5'11

  First Name  Age Gender  Weight
0      James  NaN    NaN   165.0
6      Kevin  NaN    NaN   192.0
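The per-column loop can also be expressed as one column-wise comparison. The following is a rough sketch, not a drop-in replacement: it assumes First Name is unique in both frames, indexes the result by name rather than by row number, and uses shortened sample data.

```python
import pandas as pd

df1 = pd.DataFrame({'First Name': ['James', 'John', 'Kevin'],
                    'Age': [25, 27, 22],
                    'Weight': [165, 175, 192]})
df2 = pd.DataFrame({'First Name': ['James', 'John', 'Martin', 'Kevin'],
                    'Age': [25, 27, 30, 22],
                    'Weight': [155, 175, 185, 198]})

cols = ['Age', 'Weight']

# Align df1 to df2's rows by name; names missing from df1 become NaN rows.
old = df1.set_index('First Name').reindex(df2['First Name'])[cols]
new = df2.set_index('First Name')[cols]

# True where df1 has the person and the two values differ.
changed = old.ne(new) & old.notna()

# Keep df1's values only where they differ, then drop unchanged people.
modval = old.where(changed).dropna(how='all')
print(modval)
```

Unchanged cells come out as NaN, which matches the shape of the modval output above.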
For your desired output for df2 you can try this:
desired_df2 = df2.merge(df1[['First Name','Height']], on='First Name', how='left')
# if you want to replace the NaN values, add e.g. ".fillna(0)" after the merge
print(desired_df2)
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
Your new and deleted entries are already right. For the moment I'm a bit stuck on how to get the modval dataframe. I'll update my answer if I find a solution.
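The assignment in the question misplaces values because pandas aligns the right-hand Series on its index labels, not on First Name. As an alternative to the merge, a name-to-height lookup with map() avoids that alignment. This is a minimal sketch, assuming First Name is unique in df1; height_by_name is a name made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'First Name': ['James', 'John', 'Matthew', 'Kevin'],
                    'Height': ["5'10", "5'9", "6'0", "6'2"]})
df2 = pd.DataFrame({'First Name': ['James', 'John', 'Martin', 'Kevin'],
                    'Weight': [155, 175, 185, 198]})

# Lookup table: name -> height (requires unique names in df1).
height_by_name = df1.set_index('First Name')['Height']

# map() looks up each df2 name; names missing from df1 become NaN.
df2['Height'] = df2['First Name'].map(height_by_name)
print(df2)
```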

How to build a data frame using pandas where attributes are arranged?

I want to make a data frame in pandas that look like this:
Id Name Gender Math Science English
1 Ram Male 98 92 80
2 Hari Male 30 40 23
3 Gita Female 60 65 77
4 Sita Female 50 45 55
5 Shyam Male 80 88 82
I wrote code in Python like this:
import pandas as pd
d = {'Id':[1,2,3,4,5], 'Name':['Ram','Hari','Gita','Sita','Shyam'],'Gender':['Male','Male','Female','Female','Male'],'Math':[98,30,60,50,80],'Science':[92,40,65,45,88],'English':[80,23,77,55,82]}
df = pd.DataFrame(data=d)
print (df)
It gave me output like this:
English Gender Id Math Name Science
0 80 Male 1 98 Ram 92
1 23 Male 2 30 Hari 40
2 77 Female 3 60 Gita 65
3 55 Female 4 50 Sita 45
4 82 Male 5 80 Shyam 88
How do I remove the unnamed first column and arrange the columns in the order shown at the top of the question?
I want Id, Name, Gender, Math, Science, English. Thanks
If you don't want the default index, you can use a unique column such as Id as the index instead.
import pandas as pd
d = {'Id':[1,2,3,4,5], 'Name':['Ram','Hari','Gita','Sita','Shyam'],'Gender':['Male','Male','Female','Female','Male'],'Math':[98,30,60,50,80],'Science':[92,40,65,45,88],'English':[80,23,77,55,82]}
df = pd.DataFrame(data=d)
df.set_index('Id', inplace=True)
print (df)
Output:
Name Gender Math Science English
Id
1 Ram Male 98 92 80
2 Hari Male 30 40 23
3 Gita Female 60 65 77
4 Sita Female 50 45 55
5 Shyam Male 80 88 82
Try creating the DataFrame directly instead of going through "d":
df = pd.DataFrame({'Id': [1, 4, 7, 10], etc...})
Then use set_index to make Id the index:
df.set_index('Id')
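The column order itself can be pinned with the columns argument of the DataFrame constructor, and the row labels can be left out when printing. A small sketch with two of the question's rows follows; the alphabetical order in the question's output is likely a side effect of unordered dicts in older Python and pandas versions, which is an assumption here.

```python
import pandas as pd

d = {'Id': [1, 2], 'Name': ['Ram', 'Hari'], 'Gender': ['Male', 'Male'],
     'Math': [98, 30], 'Science': [92, 40], 'English': [80, 23]}

# columns= fixes the column order regardless of how the dict is ordered.
order = ['Id', 'Name', 'Gender', 'Math', 'Science', 'English']
df = pd.DataFrame(d, columns=order)

# index=False leaves the row labels out of the printed text only;
# the DataFrame itself still has its default index.
text = df.to_string(index=False)
print(text)
```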

How to separate female and male without groupby() in python?

It's a DataFrame in which each row has an Id, sex, age and so on.
First I extracted the age together with the id (the index) and sex:
import numpy as np
import pandas as pd
age_distinct = titanic_df[['Sex','Age']].dropna()
print(age_distinct)
and got a result like this:
Sex Age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
6 male 54.0
7 male 2.0
8 female 27.0
9 female 14.0
10 female 4.0
11 female 58.0
12 male 20.0
13 male 39.0
14 female 14.0
15 female 55.0
16 male 2.0
18 female 31.0
20 male 35.0
21 male 34.0
22 female 15.0
23 male 28.0
24 female 8.0
25 female 38.0
27 male 19.0
30 male 40.0
33 male 66.0
34 male 28.0
35 male 42.0
37 male 21.0
38 female 18.0
.. ... ...
856 female 45.0
857 male 51.0
But I don't know the next step.
How can I get two sets of data, one containing only males and one containing only females?
What you're looking for is:
titanic_df[titanic_df['Sex'] == 'male']
This is basically a SELECT * FROM titanic_df WHERE Sex = 'male', if you're familiar with SQL.
Edit: If you want to create two different pandas.DataFrame objects from each level of Sex, you can store each DataFrame in a dictionary, like this:
distinct_dfs = {}
for level in set(titanic_df['Sex']):
    level_df = titanic_df[titanic_df['Sex'] == level]
    distinct_dfs[level] = level_df
That's just one approach you could take, and would be advantageous with many different values for Sex. But, since you only have two values, this would be easiest:
female_df = titanic_df[titanic_df['Sex'] == 'female']
male_df = titanic_df[titanic_df['Sex'] == 'male']
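One detail worth handling in the dictionary approach is missing values: if Sex contains NaN, comparing against NaN never matches, so that key would map to an empty frame. The sketch below is a hedged variant that skips missing values; the sample data is invented for illustration.

```python
import numpy as np
import pandas as pd

titanic_df = pd.DataFrame({'Sex': ['male', 'female', np.nan, 'female'],
                           'Age': [40, 50, 60, 30]})

# dropna() skips missing Sex values before collecting the distinct levels.
distinct_dfs = {
    level: titanic_df[titanic_df['Sex'] == level]
    for level in titanic_df['Sex'].dropna().unique()
}

print(distinct_dfs['female'])
```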
I think you need boolean indexing or query:
print(age_distinct[age_distinct.Sex == 'male'])
print(age_distinct.query('Sex == "male"'))
Sample:
titanic_df = pd.DataFrame({'Sex':['male','female',np.nan],
'Age':[40,50,60]})
print (titanic_df)
Age Sex
0 40 male
1 50 female
2 60 NaN
age_distinct = titanic_df[['Sex','Age']].dropna()
print (age_distinct[age_distinct.Sex == 'male'])
Sex Age
0 male 40
print (age_distinct.query('Sex == "male"') )
Sex Age
0 male 40

Pandas: Dict of data frames to unbalanced Panel

I have a dictionary of DataFrame objects:
dictDF={0:df0,1:df1,2:df2}
Each DataFrame df0, df1, df2 represents a table at a specific point in time. The first column identifies a person (like a social security number), and the other columns are characteristics of that person, such as:
DataFrame df0
id Name Age Gender Job Income
10 Daniel 40 Male Scientist 100
5 Anna 39 Female Doctor 250
DataFrame df1
id Name Age Gender Job Income
67 Guto 35 Male Engineer 100
7 Anna 39 Female Doctor 300
9 Melissa 26 Female Student 36
DataFrame df2
id Name Age Gender Job Income
77 Patricia 30 Female Dentist 300
9 Melissa 27 Female Dentist 250
Note that the id (social security number) identifies exactly the person. For instance, the same "Melissa" arises in two different DataFrames. However, there are two different "Annas".
In these DataFrames, both the number of people and the people themselves vary over time. Some people are represented at all dates and others only at a specific date.
Is there a simple way to transform this dictionary of data frames into an (unbalanced) Panel object, where every id appears at all dates and, if the data for a given id is not available, it is replaced by NaN?
Of course, I can do that by making a list of all ids and then checking at each date whether a given id is represented. If it is, I copy the data. Otherwise, I just write NaN.
I wonder if there is an easy way using pandas tools.
I would recommend using a MultiIndex instead of a Panel.
First, add the period to each dataframe:
for n, df in dictDF.items():
    df['period'] = n
Then concatenate into a big dataframe:
big_df = pd.concat([df for df in dictDF.values()], ignore_index=True)
Now set your index to period and id and you are guaranteed to have a unique index:
>>> big_df.set_index(['period', 'id'])
               Name  Age  Gender        Job  Income
period id
0      10    Daniel   40    Male  Scientist     100
       5       Anna   39  Female     Doctor     250
1      67      Guto   35    Male   Engineer     100
       7       Anna   39  Female     Doctor     300
       9    Melissa   26  Female    Student      36
2      77  Patricia   30  Female    Dentist     300
       9    Melissa   27  Female    Dentist     250
You can also reverse that order:
>>> big_df.set_index(['id', 'period']).sort_index()
               Name  Age  Gender        Job  Income
id period
5  0           Anna   39  Female     Doctor     250
7  1           Anna   39  Female     Doctor     300
9  1        Melissa   26  Female    Student      36
   2        Melissa   27  Female    Dentist     250
10 0         Daniel   40    Male  Scientist     100
67 1           Guto   35    Male   Engineer     100
77 2       Patricia   30  Female    Dentist     300
You can even unstack the data quite easily:
big_df.set_index(['id', 'period'])[['Income']].unstack('period')
       Income
period      0    1    2
id
5         250  NaN  NaN
7         NaN  300  NaN
9         NaN   36  250
10        100  NaN  NaN
67        NaN  100  NaN
77        NaN  NaN  300
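If every id should appear in every period, with NaN where no data exists, the (period, id) index can be reindexed against the full product of periods and ids. The following is a sketch under that reading of the question; it uses a reduced set of columns and assign() so the original frames are not modified.

```python
import pandas as pd

df0 = pd.DataFrame({'id': [10, 5], 'Name': ['Daniel', 'Anna'],
                    'Income': [100, 250]})
df1 = pd.DataFrame({'id': [67, 7, 9], 'Name': ['Guto', 'Anna', 'Melissa'],
                    'Income': [100, 300, 36]})
df2 = pd.DataFrame({'id': [77, 9], 'Name': ['Patricia', 'Melissa'],
                    'Income': [300, 250]})
dictDF = {0: df0, 1: df1, 2: df2}

# Tag each frame with its period without mutating the originals.
frames = [df.assign(period=n) for n, df in dictDF.items()]
big_df = pd.concat(frames, ignore_index=True)
indexed = big_df.set_index(['period', 'id'])

# Every (period, id) combination, whether observed or not.
full_index = pd.MultiIndex.from_product(
    [sorted(dictDF), sorted(big_df['id'].unique())],
    names=['period', 'id'],
)

# Combinations with no data become rows of NaN.
balanced = indexed.reindex(full_index)
print(balanced)
```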
