How can I create a new column based on an odd/even flag in Pandas?
This is my data:
id Flag
001 1
002 2
003 3
004 4
I would like this output: if Flag is an even number then Female, if Flag is odd then Male:
id Flag Gender
001 1 Male
002 2 Female
003 3 Male
004 4 Female
Use numpy.where with modulo 2 to check for even and odd numbers:
import numpy as np

df['Gender'] = np.where(df['Flag'] % 2, 'Male', 'Female')
print (df)
id Flag Gender
0 1 1 Male
1 2 2 Female
2 3 3 Male
3 4 4 Female
Try apply:
import pandas as pd

Id = ['001', '002', '003', '004']
Flag = [1, 2, 3, 4]
df = pd.DataFrame({'id': Id, 'flag': Flag})
df['gender'] = df['flag'].apply(lambda x: 'Male' if x % 2 else 'Female')
output:
id flag gender
0 001 1 Male
1 002 2 Female
2 003 3 Male
3 004 4 Female
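For completeness, the same result can come from mapping the modulo result directly; a minimal self-contained sketch (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'id': ['001', '002', '003', '004'], 'Flag': [1, 2, 3, 4]})

# Flag % 2 is 1 for odd and 0 for even; map those two values to the labels
df['Gender'] = df['Flag'].mod(2).map({1: 'Male', 0: 'Female'})
print(df)
```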
A df read from an xlsx with df = pd.read_excel('file.xlsx') arrives like this:
Age Male Female Male.1 Female.1
0 NaN Big Small Small Big
1 1.0 2 3 2 3
2 2.0 3 4 3 4
3 3.0 4 5 4 5
df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
Note that Pandas suffixed the duplicate columns with .1, which was not desired. I'd like to unstack / melt to get this or something similar:
Age Gender Size [measure]
1 1 Male Big 2
2 2 Male Big 3
3 3 Male Big 4
4 1 Female Big 3
5 2 Female Big 4
6 3 Female Big 5
7 1 Male Small 2
8 2 Male Small 3
9 3 Male Small 4
10 1 Female Small 3
11 2 Female Small 4
12 3 Female Small 5
Renaming columns and unstacking gets close but no cigar:
df= df.rename(columns={'Male.1': 'Male', 'Female.1':'Female'})
df= df.set_index(['Age']).unstack()
How can I set the 1st row to be the 2nd index level of columns as shown here? What am I missing?
Instead of .unstack(), another approach would be .melt().
You can transpose the dataframe with .T and take everything after the first row with .iloc[1:]. Then, .rename the columns, .replace the .1 with some regex, .melt the dataframe and .sort_values.
df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
df = (df.T.reset_index().iloc[1:]
.rename({'index' : 'Gender', 0 : 'Size'}, axis=1)
.replace(r'\.\d+$', '', regex=True)
.melt(id_vars=['Gender', 'Size'], value_name='[measure]', var_name='Age')
.sort_values(['Size', 'Gender', 'Age'], ascending=[True,False,True])
.reset_index(drop=True))
df = df[['Age', 'Gender', 'Size', '[measure]']]
df
Out[41]:
Age Gender Size [measure]
0 1 Male Big 2
1 2 Male Big 3
2 3 Male Big 4
3 1 Female Big 3
4 2 Female Big 4
5 3 Female Big 5
6 1 Male Small 2
7 2 Male Small 3
8 3 Male Small 4
9 1 Female Small 3
10 2 Female Small 4
11 3 Female Small 5
If possible, create a MultiIndex from the first 2 rows and also set the first column as the index, using the header and index_col parameters in read_excel:
df = pd.read_excel('file.xlsx',header=[0,1], index_col=[0])
print (df)
Age Male Female Male Female
Big Small Small Big
1.0 2 3 2 3
2.0 3 4 3 4
3.0 4 5 4 5
print (df.columns)
MultiIndex([( 'Male', 'Big'),
('Female', 'Small'),
( 'Male', 'Small'),
('Female', 'Big')],
names=['Age', None])
print (df.index)
Float64Index([1.0, 2.0, 3.0], dtype='float64')
So it is possible to use DataFrame.unstack:
df = (df.unstack()
.rename_axis(['Gender', 'Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
If that is not possible, you can create a MultiIndex with MultiIndex.from_arrays, removing the trailing . plus digit with replace, then drop the first row with DataFrame.iloc, reshape with DataFrame.melt on the first column, and finally set new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns=['Age','Gender','Size','measure']
print (df)
Age Gender Size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
Or a solution with DataFrame.unstack is possible: set the first column as the index with DataFrame.set_index, then rename the MultiIndex levels with Series.rename_axis for the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
.unstack()
.rename_axis(['Gender', 'Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
Create a MultiIndex column by combining row 0 with the existing columns:
df.columns = pd.MultiIndex.from_arrays((df.columns, df.iloc[0]))
df.columns.names = ['gender', 'size']
df.columns
MultiIndex([( 'Age', nan),
( 'Male', 'Big'),
( 'Female', 'Small'),
( 'Male.1', 'Small'),
('Female.1', 'Big')],
names=['gender', 'size'])
Now you can reshape and rename:
(df
 .dropna()
 .melt([('Age', np.nan)], value_name='measure')
 .replace(r'\.\d+$', '', regex=True)
 .rename(columns={('Age', np.nan): 'Age'}))
Age gender size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
Removing NaN rows from a Pandas DataFrame based on a condition.
I'm trying to remove the rows where gender == male and status == NaN.
Sample df:
name status gender leaves
0 tom NaN male 5
1 tom True male 6
2 tom True male 7
3 mary True female 1
4 mary NaN female 10
5 mary True female 15
6 john NaN male 2
7 mark True male 3
Expected Output:
name status gender leaves
0 tom True male 6
1 tom True male 7
2 mary True female 1
3 mary NaN female 10
4 mary True female 15
5 mark True male 3
You can use the isna (or isnull) function to find the rows with a value of NaN.
With this knowledge, you can filter your dataframe using something like:
conditions = (df.gender == 'male')&(df.status.isna())
filtered_df = df[~conditions]
Good one given by @Derlin; another way I tried is using fillna() to fill NaN with -1 and filter on that, like below:
>>> df[~((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
Just for reference, the ~ operator is the same as numpy's np.logical_not(). So this means the same (don't forget to import numpy as np):
df[np.logical_not((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
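Putting the isna-based answer together as a runnable sketch with the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['tom', 'tom', 'tom', 'mary', 'mary', 'mary', 'john', 'mark'],
    'status': [np.nan, True, True, True, np.nan, True, np.nan, True],
    'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'male', 'male'],
    'leaves': [5, 6, 7, 1, 10, 15, 2, 3],
})

# keep everything EXCEPT rows where gender is male AND status is NaN
conditions = (df.gender == 'male') & (df.status.isna())
filtered_df = df[~conditions].reset_index(drop=True)
print(filtered_df)
```

Note mary's NaN row survives because her gender is female, matching the expected output.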
Suppose I have two dataframes:
df1:
Person Number Type
0 Kyle 12 Male
1 Jacob 15 Male
2 Jacob 15 Male
df2:
A much larger dataset with a similar format, except there is a Count column that needs to be incremented based on df1:
Person Number Type Count
0 Kyle 12 Male 0
1 Jacob 15 Male 0
3 Sally 43 Female 0
4 Mary 15 Female 5
What I am looking to do is increase the Count column based on the number of occurrences of the same person in df1.
Expected output for this example:
Person Number Type Count
0 Kyle 12 Male 1
1 Jacob 15 Male 2
3 Sally 43 Female 0
4 Mary 15 Female 5
Increase Count to 1 for Kyle because there is one instance, and to 2 for Jacob because there are two instances. Leave the values for Sally and Mary unchanged.
How do I do this? I have tried using .loc but I can't figure out how to account for two instances of the same row. Meaning that I can only get count to increase by one for Jacob even though there are two Jacobs in df1.
I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However this does not account for duplicates.
df1 = df1.groupby(list(df1.columns)).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person','Number','Type'])
df2 = df2.set_index(['Person','Number','Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(list(df1.columns)).size().to_frame('Count').reset_index()
df2.merge(df1, on=['Person','Number','Type'], how='left').set_index(['Person','Number','Type']).sum(axis=1).to_frame('Count').reset_index()
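A runnable sketch of the groupby + add idea above, with the groupby keys spelled out explicitly (data as in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Person': ['Kyle', 'Jacob', 'Jacob'],
                    'Number': [12, 15, 15],
                    'Type': ['Male', 'Male', 'Male']})
df2 = pd.DataFrame({'Person': ['Kyle', 'Jacob', 'Sally', 'Mary'],
                    'Number': [12, 15, 43, 15],
                    'Type': ['Male', 'Male', 'Female', 'Female'],
                    'Count': [0, 0, 0, 5]})

# count occurrences of each (Person, Number, Type) row in df1
counts = df1.groupby(['Person', 'Number', 'Type']).size().to_frame('Count')

# align on the key columns and add, treating people missing from df1 as 0
out = (df2.set_index(['Person', 'Number', 'Type'])
          .add(counts, fill_value=0)
          .reset_index())
print(out)
```

Note the addition promotes Count to float where alignment introduced a fill value.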
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))
Number Type Count
Person
Kyle 12 Male 1.0
Jacob 15 Male 2.0
Sally 43 Female 0.0
Mary 15 Female 5.0
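A self-contained version of the value_counts idea; note it keys only on Person, so it assumes Person alone identifies a row:

```python
import pandas as pd

df1 = pd.DataFrame({'Person': ['Kyle', 'Jacob', 'Jacob'],
                    'Number': [12, 15, 15],
                    'Type': ['Male', 'Male', 'Male']})
df2 = pd.DataFrame({'Person': ['Kyle', 'Jacob', 'Sally', 'Mary'],
                    'Number': [12, 15, 43, 15],
                    'Type': ['Male', 'Male', 'Female', 'Female'],
                    'Count': [0, 0, 0, 5]})

u = df2.set_index('Person')
# value_counts gives occurrences per Person in df1; index alignment adds them
# to the existing Count, with 0 for people absent from df1
result = u.assign(Count=df1['Person'].value_counts().add(u['Count'], fill_value=0))
print(result)
```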
I have two dataframes:
df1:
Gender Registered
female 1
male 0
female 0
female 1
male 1
male 0
df2:
Gender
female
female
male
male
I want to modify df2, so that there is a new column 'Count' with the count of registered = 1 for corresponding gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform the df2 so that the output is as follows:
output:
Gender Count
female 2
female 2
male 1
male 1
I tried many things and got close but couldn't make it fully work.
sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))
Gender Count
0 female 2
1 female 2
2 male 1
3 male 1
pd.merge
pd.merge(df2, df1.groupby('Gender', as_index=False).sum())
Gender Registered
0 female 2
1 female 2
2 male 1
3 male 1
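Note the merge variant above keeps the summed column's original Registered name; a small sketch with a rename added to get the Count header (data as in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Gender': ['female', 'male', 'female', 'female', 'male', 'male'],
                    'Registered': [1, 0, 0, 1, 1, 0]})
df2 = pd.DataFrame({'Gender': ['female', 'female', 'male', 'male']})

# sum Registered per Gender, merge onto df2, then rename to Count
counts = df1.groupby('Gender', as_index=False)['Registered'].sum()
out = pd.merge(df2, counts).rename(columns={'Registered': 'Count'})
print(out)
```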