How to separate female and male without groupby() in python? - python

It's a DataFrame in which each row has an Id, Sex, Age and so on.
I first selected the Sex and Age columns and dropped the rows with missing values:
import numpy as np
import pandas as pd
age_distinct = titanic_df[['Sex','Age']].dropna()
print(age_distinct)
The result looks like this:
Sex Age
0 male 22.0
1 female 38.0
2 female 26.0
3 female 35.0
4 male 35.0
6 male 54.0
7 male 2.0
8 female 27.0
9 female 14.0
10 female 4.0
11 female 58.0
12 male 20.0
13 male 39.0
14 female 14.0
15 female 55.0
16 male 2.0
18 female 31.0
20 male 35.0
21 male 34.0
22 female 15.0
23 male 28.0
24 female 8.0
25 female 38.0
27 male 19.0
30 male 40.0
33 male 66.0
34 male 28.0
35 male 42.0
37 male 21.0
38 female 18.0
.. ... ...
856 female 45.0
857 male 51.0
But I don't know the next step.
How can I get two sets of data, one containing only the male rows and one containing only the female rows?

What you're looking for is:
titanic_df[titanic_df['Sex'] == 'male']
This is basically a SELECT * FROM titanic_df WHERE Sex = 'male', if you're familiar with SQL.
Edit: If you want to create two different pandas.DataFrame objects from each level of Sex, you can store each DataFrame in a dictionary, like this:
distinct_dfs = {}
for level in set(titanic_df['Sex']):
    level_df = titanic_df[titanic_df['Sex'] == level]
    distinct_dfs[level] = level_df
That's just one approach you could take, and would be advantageous with many different values for Sex. But, since you only have two values, this would be easiest:
female_df = titanic_df[titanic_df['Sex'] == 'female']
male_df = titanic_df[titanic_df['Sex'] == 'male']
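To try both variants without the real Titanic data, here is a minimal sketch. The five-row titanic_df is a made-up stand-in (an assumption, not the asker's data); the mask and dictionary logic follow the answer above.

```python
import pandas as pd

# Hypothetical stand-in for titanic_df; only the two columns used above.
titanic_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 14.0],
})

# Build a boolean mask and use it to select the matching rows.
is_male = titanic_df["Sex"] == "male"
male_df = titanic_df[is_male]
female_df = titanic_df[titanic_df["Sex"] == "female"]

# Dictionary variant: one DataFrame per distinct (non-missing) value of Sex.
distinct_dfs = {
    level: titanic_df[titanic_df["Sex"] == level]
    for level in titanic_df["Sex"].dropna().unique()
}

print(male_df)
print(female_df)
```

Each subset keeps the row labels of the original DataFrame, so the rows can still be traced back to titanic_df.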

I think you need boolean indexing or query:
print(age_distinct[age_distinct.Sex == 'male'])
print(age_distinct.query('Sex == "male"'))
Sample:
titanic_df = pd.DataFrame({'Sex':['male','female',np.nan],
'Age':[40,50,60]})
print (titanic_df)
Age Sex
0 40 male
1 50 female
2 60 NaN
age_distinct = titanic_df[['Sex','Age']].dropna()
print (age_distinct[age_distinct.Sex == 'male'])
Sex Age
0 male 40
print (age_distinct.query('Sex == "male"') )
Sex Age
0 male 40
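One detail worth checking on the same sample: a missing Sex compares as False, so the row with NaN lands in neither the male nor the female subset even without dropna(). A small sketch (the variable names by_mask, by_query and unknown are only illustrative):

```python
import numpy as np
import pandas as pd

titanic_df = pd.DataFrame({"Sex": ["male", "female", np.nan],
                           "Age": [40, 50, 60]})

# NaN == 'male' evaluates to False, so row 2 is excluded by the mask.
mask = titanic_df["Sex"] == "male"
by_mask = titanic_df[mask]
by_query = titanic_df.query('Sex == "male"')

# Rows whose Sex is missing can be collected separately if they matter.
unknown = titanic_df[titanic_df["Sex"].isna()]

print(by_mask)
print(unknown)
```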

Related

Update columns with duplicate values from the DataFrame in Pandas

I have a data set which has values for different columns as different entries with first name to identify the respective columns.
For instance James's gender is in first row and James's age is in 5th row.
DataFrame
df1=
Index  First Name  Age  Gender  Weight in lb  Height in cm
0      James            Male
1      John                     175
2      Patricia    23
5      James       22
4      James                    185
5      John        29
6      John                                   176
I am trying to make it combined into one DataFrame as below
df1=
Index  First Name  Age  Gender  Weight  Height
0      James       22   Male    185
1      John        29           175     176
2      Patricia    23
I tried to do groupby but it is not working.
Assuming NaN in the empty cells, you can use groupby.first:
df.groupby('First Name', as_index=False).first()
output:
First Name Age Gender Weight in lb Height in cm
0 James 22.0 Male 185.0 NaN
1 John 29.0 None 175.0 176.0
2 Patricia 23.0 None NaN NaN
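For anyone who wants to reproduce this, a minimal sketch follows. It rebuilds the question's rows under the stated assumption that the empty cells are NaN/None; the name combined is only illustrative.

```python
import numpy as np
import pandas as pd

# Rows from the question, with NaN/None assumed in the empty cells.
df1 = pd.DataFrame({
    "First Name": ["James", "John", "Patricia", "James", "James", "John", "John"],
    "Age": [np.nan, np.nan, 23, 22, np.nan, 29, np.nan],
    "Gender": ["Male", None, None, None, None, None, None],
    "Weight in lb": [np.nan, 175, np.nan, np.nan, 185, np.nan, np.nan],
    "Height in cm": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 176],
})

# first() takes the first non-null value of every column within each group.
combined = df1.groupby("First Name", as_index=False).first()
print(combined)
```

If the blanks are actually empty strings rather than NaN, first() would pick them up as values, so they would need converting first, for example with df1.replace('', np.nan).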

Copy contents from one Dataframe to another based on column values in Pandas

I have two separate, similar DataFrames with different lengths:
df2=
Index  First Name  Age  Gender  Weight
0      James       25   Male    155
1      John        27   Male    175
2      Patricia    23   Female  135
3      Mary        22   Female  125
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198
df1=
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    165     5'10
1      John        27   Male    175     5'9
2      Matthew     29   Male    183     6'0
3      Patricia    23   Female  135     5'3
4      Mary        22   Female  125     5'4
5      Rachel      29   Female  123     5'3
6      Jose        20   Male    175     5'11
7      Kevin       22   Male    192     6'2
df2 has some rows which are not in df1 and df1 has some values which are not in df2.
I am comparing df1 against df2. I have calculated the newentries with the following code
newentries = df2.loc[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1.loc[~df1['First Name'].isin(df2['First Name'])]
where newentries denote the rows/entries that are there in df2 but not in df1; deletedentries denote the rows/entries that are there in df1 but not in df2. The above code works perfectly fine.
I need to copy the height from df1 to df2 when the first names are equal.
df2.loc[df2['First Name'].isin(df1['First Name']),"Height"] = df1.loc[df1['First Name'].isin(df2['First Name']),"Height"]
The above code copies values, but indexing causes a problem: the values are not copied to the corresponding rows. I tried promoting First Name to the index, but that doesn't solve the issue. Please help me with a solution.
Also, I need to find the modified values: when the First Name is the same in both DataFrames, I need to check whether any value differs. For example, James's weight is 165 in df1 but 155 in df2, so I need to store James's modified weight (165) and index (0) in a new DataFrame, without iteration. Iterating takes a long time because this is only a sample of a much bigger DataFrame with many more rows and columns.
Desired output:
df2=
Index  First Name  Age  Gender  Weight  Height
0      James       25   Male    155     5'10
1      John        27   Male    175     5'9
2      Patricia    23   Female  135     5'3
3      Mary        22   Female  125     5'4
4      Martin      30   Male    185
5      Margaret    29   Female  141
6      Kevin       22   Male    198     6'2
Martin's and Margaret's heights are not there in df1, so their heights are not updated in df2
newentries=
Index  First Name  Age  Gender  Weight  Height
4      Martin      30   Male    185
5      Margaret    29   Female  141
deletedentries=
Index  First Name  Age  Gender  Weight  Height
2      Matthew     29   Male    183     6'0
5      Rachel      29   Male    123     5'3
6      Jose        20   Male    175     5'11
modval=
Index  First Name  Age  Gender  Weight  Height
0      James                    165
7      Kevin                    192
Building off of Rabinzel's answer:
output = df2.merge(df1, how='left', on='First Name', suffixes=[None, '_old'])
df3 = output[['First Name', 'Age', 'Gender', 'Weight', 'Height']]
cols = df1.columns[1:-1]
modval = pd.DataFrame()
for col in cols:
    modval = pd.concat([modval, output[['First Name', col + '_old']][output[col] != output[col + '_old']].dropna()])
    modval.rename(columns={col +'_old':col}, inplace=True)
newentries = df2[~df2['First Name'].isin(df1['First Name'])]
deletedentries = df1[~df1['First Name'].isin(df2['First Name'])]
print(df3, newentries, deletedentries, modval, sep='\n\n')
Output:
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
First Name Age Gender Weight
4 Martin 30 Male 185
5 Margaret 29 Female 141
First Name Age Gender Weight Height
2 Matthew 29 Male 183 6'0
5 Rachel 29 Male 123 5'3
6 Jose 20 Male 175 5'11
First Name Age Gender Weight
0 James NaN NaN 165.0
6 Kevin NaN NaN 192.0
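Since the question asks to avoid iteration on a large frame, here is a sketch of a comparison done on all shared columns at once. It is one possible variant, not the answer's method: the helper names (both, shared, changed) are illustrative, the data is retyped from the question, and it assumes First Name is unique in each DataFrame. The row labels come from df2, as in the output above.

```python
import pandas as pd

df2 = pd.DataFrame({
    "First Name": ["James", "John", "Patricia", "Mary", "Martin", "Margaret", "Kevin"],
    "Age": [25, 27, 23, 22, 30, 29, 22],
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female", "Male"],
    "Weight": [155, 175, 135, 125, 185, 141, 198],
})
df1 = pd.DataFrame({
    "First Name": ["James", "John", "Matthew", "Patricia", "Mary", "Rachel", "Jose", "Kevin"],
    "Age": [25, 27, 29, 23, 22, 29, 20, 22],
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female", "Male", "Male"],
    "Weight": [165, 175, 183, 135, 125, 123, 175, 192],
    "Height": ["5'10", "5'9", "6'0", "5'3", "5'4", "5'3", "5'11", "6'2"],
})

# Inner merge keeps only names present in both frames; reset_index carries
# df2's original row labels through the merge as a column named "index".
both = df2.reset_index().merge(df1, on="First Name", suffixes=("", "_old"))

shared = ["Age", "Gender", "Weight"]
new = both[shared]
old = both[[c + "_old" for c in shared]].copy()
old.columns = shared

changed = new.ne(old)                 # True where df2 and df1 disagree
modval = old.where(changed)           # keep df1's value only where it differs
modval.insert(0, "First Name", both["First Name"])
modval.index = both["index"]          # restore df2's row labels
modval = modval[changed.any(axis=1).to_numpy()]
print(modval)
```

Unchanged cells come out as NaN, which matches the shape of the modval output shown above.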
For your desired output for df2, you can try this:
desired_df2 = df2.merge(df1[['First Name','Height']], on='First Name', how='left')
#if you want to replace the NaN values, add .fillna(0) (or any other fill value) after the merge
print(desired_df2)
First Name Age Gender Weight Height
0 James 25 Male 155 5'10
1 John 27 Male 175 5'9
2 Patricia 23 Female 135 5'3
3 Mary 22 Female 125 5'4
4 Martin 30 Male 185 NaN
5 Margaret 29 Female 141 NaN
6 Kevin 22 Male 198 6'2
Your new and deleted entries are already right. For the moment I'm a bit stuck on how to get the modval DataFrame. I'll update my answer if I find a solution.
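The original .loc assignment misplaces values because pandas aligns the right-hand side on row labels, not on First Name. As an alternative to the merge, a name-keyed lookup with Series.map sidesteps that. This is only a sketch: it assumes First Name is unique in df1, uses a trimmed copy of the question's data, and height_by_name is an illustrative name.

```python
import pandas as pd

df2 = pd.DataFrame({
    "First Name": ["James", "John", "Patricia", "Mary", "Martin", "Margaret", "Kevin"],
    "Weight": [155, 175, 135, 125, 185, 141, 198],
})
df1 = pd.DataFrame({
    "First Name": ["James", "John", "Matthew", "Patricia", "Mary", "Rachel", "Jose", "Kevin"],
    "Height": ["5'10", "5'9", "6'0", "5'3", "5'4", "5'3", "5'11", "6'2"],
})

# Lookup table keyed by name; assumes each First Name occurs once in df1.
height_by_name = df1.set_index("First Name")["Height"]

# map() matches on the name, not on row position, so df2's index is untouched.
df2["Height"] = df2["First Name"].map(height_by_name)
print(df2)
```

Names that are missing from df1 (Martin and Margaret here) simply get NaN.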

Select rows from a Pandas DataFrame with exactly the same column values in another DataFrame

Say I have the first pandas DataFrame below:
A B ID
0 22.0 male 12
1 38.0 female 34
2 26.0 female 44
3 35.0 female 04
4 35.0 male 78
The second pandas DataFrame is:
C D ID
0 xx xx 12
2 xx xx 44
4 xx xx 78
I want the output to be like:
A B ID
0 22.0 male 12
2 26.0 female 44
4 35.0 male 78
That is, I only want to select the rows of the first DataFrame whose ID also appears in the second DataFrame.
What is the most efficient way to do this?
Just use isin:
>>> df1[df1['ID'].isin(df2['ID'])]
A B ID
0 22.0 male 12
2 26.0 female 44
4 35.0 male 78
Or merge (isin is preferable):
>>> df1.merge(df2['ID'])
A B ID
0 22.0 male 12
1 26.0 female 44
2 35.0 male 78
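A short sketch of why isin is usually the safer choice here: it keeps df1's row labels, and it is not affected by repeated IDs in df2, whereas merge builds a fresh index and repeats rows for every duplicate match. The IDs are assumed to be strings (because of the leading zero in 04), and the names kept, merged and dup are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [22.0, 38.0, 26.0, 35.0, 35.0],
                    "B": ["male", "female", "female", "female", "male"],
                    "ID": ["12", "34", "44", "04", "78"]})
df2 = pd.DataFrame({"C": ["xx"] * 3, "D": ["xx"] * 3,
                    "ID": ["12", "44", "78"]}, index=[0, 2, 4])

kept = df1[df1["ID"].isin(df2["ID"])]       # row labels 0, 2, 4 survive
merged = df1.merge(df2[["ID"]], on="ID")    # fresh 0..n-1 index

# With duplicated IDs in df2, merge repeats rows while isin does not.
dup = pd.concat([df2, df2])
rows_via_merge = len(df1.merge(dup[["ID"]], on="ID"))
rows_via_isin = len(df1[df1["ID"].isin(dup["ID"])])
print(kept, merged, rows_via_merge, rows_via_isin, sep="\n\n")
```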

Assigning value to pandas dataframe values for unique values in another column

I have the following dataframe:
df = pd.DataFrame({"marks": [40, 60, 90, 20, 100, 10, 30, 70],
                   "students": ["Jack", "Jack", "Jack", "Jack", "John", "John", "John", "John"]})
marks students
0 40 Jack
1 60 Jack
2 90 Jack
3 20 Jack
4 100 John
5 10 John
6 30 John
7 70 John
I am attempting to assign a student's average to his marks below 40 (the average will include the lowest mark).
I am aware of assigning a mark based on the < 40 condition (in this case I assigned the lowest mark of the df to all marks below 40), like so:
df.loc[df["marks"] < 40, "marks"] = df["marks"].min()
But I am confused on how to potentially apply a lambda function on unique student names. Any help would be appreciated.
Try with np.where
df['marks'] = np.where(df['marks'] < 40,
                       df.groupby('students')['marks'].transform('mean'),
                       df['marks'])
df
Out[18]:
marks students
0 40.0 Jack
1 60.0 Jack
2 90.0 Jack
3 52.5 Jack
4 100.0 John
5 52.5 John
6 52.5 John
7 70.0 John
You can combine a groupby and where:
df['corrected_marks'] = df['marks'].where(df['marks'] >= 40,
                                          df.groupby('students')['marks'].transform('mean'))
output:
marks students corrected_marks
0 40 Jack 40.0
1 60 Jack 60.0
2 90 Jack 90.0
3 20 Jack 52.5
4 100 John 100.0
5 10 John 52.5
6 30 John 52.5
7 70 John 70.0
@mozway's answer is correct. You could also do it in two steps:
df['mean'] = df.groupby('students')['marks'].transform('mean')
df['final_marks'] = df.apply(lambda x: x['mean'] if (x['marks'] < 40) else x['marks'], axis=1)
print(df)
output:
marks students mean final_marks
0 40 Jack 52.5 40.0
1 60 Jack 52.5 60.0
2 90 Jack 52.5 90.0
3 20 Jack 52.5 52.5
4 100 John 52.5 100.0
5 10 John 52.5 52.5
6 30 John 52.5 52.5
7 70 John 52.5 70.0
To apply your logic per student, group by student name with .groupby() and get each student's (each group's) average with transform('mean'). Then assign the mean values to marks using the same mechanism as in the code you tried:
df.loc[df["marks"] < 40, "marks"] = df.groupby('students')['marks'].transform('mean')
Result:
print(df)
marks students
0 40.0 Jack
1 60.0 Jack
2 90.0 Jack
3 52.5 Jack
4 100.0 John
5 52.5 John
6 52.5 John
7 70.0 John
If you actually want to assign the lowest mark (instead of 'mean' mark) of each student to all marks below 40 for that student, you can use transform() on 'min' instead:
df.loc[df["marks"] < 40, "marks"] = df.groupby('students')['marks'].transform('min')
Result:
print(df)
marks students
0 40 Jack
1 60 Jack
2 90 Jack
3 20 Jack
4 100 John
5 10 John
6 10 John
7 70 John
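The per-student replacement can also be written with Series.mask, which is the mirror image of where: it replaces values where the condition is True. A minimal sketch on the question's data (student_mean is an illustrative name); note that in this sample both students happen to have the same average, 52.5.

```python
import pandas as pd

df = pd.DataFrame({"marks": [40, 60, 90, 20, 100, 10, 30, 70],
                   "students": ["Jack"] * 4 + ["John"] * 4})

# Per-row copy of each student's average, aligned with df's index.
student_mean = df.groupby("students")["marks"].transform("mean")

# Replace only the marks below 40; everything else is kept as-is.
df["corrected_marks"] = df["marks"].mask(df["marks"] < 40, student_mean)
print(df)
```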

float object has no attribute __getitem__ [Looked elsewhere but haven't been able to find anything applicable]

Here is the data frame I'm working with:
patient_id marker_1 marker_2 subtype patient_age patient_gender
0 619681 21.640523 144.001572 0.0 3 female
1 619711 13.787380 162.408932 0.0 15 female
2 619595 22.675580 130.227221 0.0 6 female
3 619990 13.500884 138.486428 0.0 17 male
4 619157 2.967811 144.105985 0.0 6 female
5 619320 5.440436 154.542735 0.0 9 female
6 619663 11.610377 141.216750 0.0 7 female
7 619910 8.438632 143.336743 0.0 5 female
8 619199 18.940791 137.948417 0.0 7 male
9 619430 7.130677 131.459043 0.0 17 female
10 619766 -21.529898 146.536186 0.0 17 female
11 619018 12.644362 132.578350 0.0 12 female
12 619864 26.697546 125.456343 0.0 4 male
13 619273 4.457585 138.128162 0.0 8 female
14 619846 19.327792 154.693588 0.0 12 male
15 619487 5.549474 143.781625 0.0 8 male
16 619311 -4.877857 120.192035 0.0 7 female
17 619804 0.520879 141.563490 0.0 12 female
18 619331 16.302907 152.023798 0.0 16 female
19 619880 0.126732 136.976972 0.0 15 male
20 619428 -6.485530 125.799821 0.0 4 female
21 619554 -13.062702 159.507754 0.0 6 male
22 619072 -1.096522 135.619257 0.0 6 female
23 619095 -8.527954 147.774904 0.0 6 male
24 619706 -12.138978 137.872597 0.0 14 male
25 619708 -4.954666 143.869025 0.0 7 male
26 619693 -1.108051 128.193678 0.0 13 male
27 619975 3.718178 144.283319 0.0 7 female
28 619289 4.665172 143.024719 0.0 9 male
29 619911 -2.343221 136.372588 0.0 7 female
.. ... ... ... ... ...
Now, I'm calculating basic statistics of the entire data frame and plan to extract specific values later on
#mean, median, sd of subset data
mean_children = np.mean(children)
med_children = np.median(children)
sd_children = np.std(children)
children_mark1 = [mean_children['marker_1'], med_children['marker_1'], sd_children['marker_1']]
children_mark2 = [mean_children['marker_2'], med_children['marker_2'], sd_children['marker_2']]
children_age = [mean_children['patient_age'], med_children['patient_age'], sd_children['patient_age']]
This is where I get the error. When I print out mean_children['marker_2'] I get 121.396907126, so I don't quite understand why it won't let me add it to this list.
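np.mean
reduces the entire DataFrame to a single float, and a float cannot be indexed by column name (hence the __getitem__ error).
Use the DataFrame's own method instead:
children.mean()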
With children being your dataframe, I think if you use:
children.describe()
or
children.describe().transpose()
You will save yourself some time.
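One way to avoid depending on how the NumPy functions treat a DataFrame is to compute all three statistics with pandas itself. The sketch below uses DataFrame.agg on a four-row stand-in for children (retyped from the first rows of the question, so the numbers will not match the asker's); the names numeric and stats are illustrative.

```python
import pandas as pd

# Hypothetical subset standing in for `children`.
children = pd.DataFrame({
    "marker_1": [21.640523, 13.787380, 22.675580, 13.500884],
    "marker_2": [144.001572, 162.408932, 130.227221, 138.486428],
    "patient_age": [3, 15, 6, 17],
    "patient_gender": ["female", "female", "female", "male"],
})

# Select the numeric columns explicitly so patient_gender stays out of it.
numeric = children[["marker_1", "marker_2", "patient_age"]]
stats = numeric.agg(["mean", "median", "std"])  # rows: statistic, columns: variable

children_mark1 = stats["marker_1"].tolist()     # [mean, median, sd]
children_mark2 = stats["marker_2"].tolist()
children_age = stats["patient_age"].tolist()
print(stats)
```

Be aware that pandas' std defaults to the sample standard deviation (ddof=1), while np.std defaults to the population version (ddof=0); numeric.std(ddof=0) gives the NumPy-style value if that is what you need.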
