How to add multiple calculated columns to dataframe at once - python

I have a dataframe as below:
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y
0 1 A 27 Male A 32 Male
1 2 B 28 Female B 28 Female
2 3 C 8 Female C 1 Female
3 4 D 28 Male D 72 Male
4 5 E 25 Female E 64 Female
I need to create calculated columns, such as the difference between the ages and a gender-match check, and to achieve this in one go I am using:
DF3.loc[:,["Gendermatch","Agematch"]]= pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)])
and the resultant dataframe looks as below:
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male NaN NaN
1 2 B 28 Female B 28 Female NaN NaN
2 3 C 8 Female C 1 Female NaN NaN
3 4 D 28 Male D 72 Male NaN NaN
4 5 E 25 Female E 64 Female NaN NaN
The resultant columns show NaN. What am I doing wrong here?

DF3[["Gendermatch","Agematch"]]= np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)

DF3[["Gendermatch","Agematch"]] = pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)]).T

np.where is unnecessary here; comparing two Series already returns a boolean Series:
DF3["Gendermatch"] = DF3["Name_x"]==DF3["Name_y"]
DF3["Agematch"] = DF3["Age_x"]==DF3["Age_y"]
# or in one line
DF3["Gendermatch"], DF3["Agematch"] = (DF3["Name_x"]==DF3["Name_y"]), (DF3["Age_x"]==DF3["Age_y"])
print(DF3)
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male True False
1 2 B 28 Female B 28 Female True True
2 3 C 8 Female C 1 Female True False
3 4 D 28 Male D 72 Male True False
4 5 E 25 Female E 64 Female True False
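For completeness, both columns can also be added in a single chained call with DataFrame.assign; a minimal sketch reusing the question's column names:
DF3 = DF3.assign(
    Gendermatch=DF3["Name_x"] == DF3["Name_y"],
    Agematch=DF3["Age_x"] == DF3["Age_y"],
)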


Comparing multiple columns of a massive DataFrame with complex duplicate rows

I have a massive dataframe df with around 10 million rows:
df.sort_values(['pair','x1','x2'])
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
-------------------------------------------------------------------------------
A male H female a male d male 0
A male W male a male d male 0 (*)
A male KK female a male d male 0 (**)
B female C male a male d male 0 (-)
B female W male a male d male 0 (*)
B female BB female a male d male 0
B female KK female a male d male 0 (**)
F male W male a male d male 0 (*)
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1 (-)
D male E male b female d male 1
A male C male b female e female 2
...
Each column can be explained by the following:
x1gen is the gender of x1, x2gen is the gender of x2, and so on.
x1 cites y1 and x2 cites y2.
Each pair of y1 and y2 is assigned a unique pair value.
My objective is to find four values per unique pair:
male citing male
male citing female
female citing male
female citing female
where each citation should not be counted more than once per pair.
For example, in the given sample, x2 = W appears three times in pair = 0 (see (*)), so it should be counted once, not three times. The same applies to x2 = KK in pair = 0 (see (**)). However, we can count the same reference again if it appears in a new pair (C -> d in (-) is counted once for pair = 0 and once for pair = 1).
Hence, for the first pair pair = 0, the objective values are:
male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
male citing female = 0
female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
female citing female = 0
What I initially did was use a for loop with nested if statements, creating four lists separately for x1 and x2:
mm = [1]
mf = [0]
fm = [0]
ff = [0]
mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0
for i in range(1, len(df)):
    if df['pair'][i] == df['pair'][i-1]:
        if df['x1'][i] != df['x1'][i-1]:
            if df['x1gen'][i] == 'male':
                if df['y1gen'][i] == 'male':
                    mm1 += 1
                else:
                    mf1 += 1
            else:
                if df['y1gen'][i] == 'male':
                    fm1 += 1
                else:
                    ff1 += 1
    ...
and the rest is analogous (the full code is MANY lines long, but just a repetition of those lines). As one can tell, this is HIGHLY inefficient (it takes around 120 minutes).
What is the optimal way to find such values without having to do a highly inefficient string-matching?
You can try the following:
import io
import re
import pandas as pd
# this just recreates the dataframe
s = '''
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
A male H female a male d male 0
A male W male a male d male 0
A male KK female a male d male 0
B female C male a male d male 0
B female W male a male d male 0
B female BB female a male d male 0
B female KK female a male d male 0
F male W male a male d male 0
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1
D male E male b female d male 1
A male C male b female e female 2
'''
s = re.sub(r" +", " ", s)
df = pd.read_csv(io.StringIO(s), sep=" ")
print(df)
It gives:
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
0 A male H female a male d male 0
1 A male W male a male d male 0
2 A male KK female a male d male 0
3 B female C male a male d male 0
4 B female W male a male d male 0
5 B female BB female a male d male 0
6 B female KK female a male d male 0
7 F male W male a male d male 0
8 A male T female b female d male 1
9 A male BB female b female d male 1
10 B female C male b female d male 1
11 D male E male b female d male 1
12 A male C male b female e female 2
Counting citation pairs:
# count x1-> y1 pairs
df1 = df.drop_duplicates(subset=['x1', 'y1', 'pair'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
# count x2-> y2 pairs
df2 = df.drop_duplicates(subset=['x2', 'y2', 'pair'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
# add results
c1.add(c2, fill_value=0).astype(int)
This gives:
female_female 1
female_male 6
male_female 4
male_male 6
Computing results for each pair separately:
def cit_count(g):
    # count x1 -> y1 pairs
    df1 = g.drop_duplicates(subset=['x1', 'y1'])
    c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
    # count x2 -> y2 pairs
    df2 = g.drop_duplicates(subset=['x2', 'y2'])
    c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
    # add results
    return c1.add(c2, fill_value=0)

print(df.groupby('pair').apply(cit_count).unstack().fillna(0).astype(int))
It gives:
female_female female_male male_female male_male
pair
0 0 4 0 4
1 1 2 2 2
2 0 0 2 0
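For ~10 million rows, the groupby().apply() above may still be slow because it calls a Python function once per group. A variant that deduplicates once and counts in a single pass may be faster; this is a sketch, assuming the same column names and that a citation appearing in both channels should be counted only once per pair:
import pandas as pd

# stack the x1 -> y1 and x2 -> y2 channels into one long frame
cols = ['x', 'xgen', 'y', 'ygen', 'pair']
left = df[['x1', 'x1gen', 'y1', 'y1gen', 'pair']].set_axis(cols, axis=1)
right = df[['x2', 'x2gen', 'y2', 'y2gen', 'pair']].set_axis(cols, axis=1)
long_df = pd.concat([left, right], ignore_index=True)

# keep one row per distinct citation per pair, then count gender combinations
long_df = long_df.drop_duplicates(subset=['x', 'y', 'pair'])
counts = (long_df.groupby(['pair', 'xgen', 'ygen']).size()
          .unstack(['xgen', 'ygen'], fill_value=0))
print(counts)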

How to unstack a df from excel table with multiple levels of duplicating columns? Set multi index?

df read from an xlsx: df = pd.read_excel('file.xlsx') arrives like this:
Age Male Female Male.1 Female.1
0 NaN Big Small Small Big
1 1.0 2 3 2 3
2 2.0 3 4 3 4
3 3.0 4 5 4 5
df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
Note that pandas suffixed the duplicate columns with .1, which was not desired. I'd like to unstack / melt to get this or similar:
Age Gender Size [measure]
1 1 Male Big 2
2 2 Male Big 3
3 3 Male Big 4
4 1 Female Big 3
5 2 Female Big 4
6 3 Female Big 5
7 1 Male Small 2
8 2 Male Small 3
9 3 Male Small 4
10 1 Female Small 3
11 2 Female Small 4
12 3 Female Small 5
Renaming columns and unstacking gets close but no cigar:
df= df.rename(columns={'Male.1': 'Male', 'Female.1':'Female'})
df= df.set_index(['Age']).unstack()
How can I set the 1st row to be the 2nd index level of columns as shown here? What am I missing?
Instead of .unstack(), another approach would be .melt().
You can transpose the dataframe with .T and take everything after the first row with .iloc[1:]. Then, .rename the columns, .replace the .1 with some regex, .melt the dataframe and .sort_values.
df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
df = (df.T.reset_index().iloc[1:]
        .rename({'index': 'Gender', 0: 'Size'}, axis=1)
        .replace(r'\.\d+$', '', regex=True)
        .melt(id_vars=['Gender', 'Size'], value_name='[measure]', var_name='Age')
        .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
        .reset_index(drop=True))
df = df[['Age', 'Gender', 'Size', '[measure]']]
df
Out[41]:
Age Gender Size [measure]
0 1 Male Big 2
1 2 Male Big 3
2 3 Male Big 4
3 1 Female Big 3
4 2 Female Big 4
5 3 Female Big 5
6 1 Male Small 2
7 2 Male Small 3
8 3 Male Small 4
9 1 Female Small 3
10 2 Female Small 4
11 3 Female Small 5
If possible, create a MultiIndex from the first two rows and use the first column as the index, via the header and index_col parameters of read_excel:
df = pd.read_excel('file.xlsx',header=[0,1], index_col=[0])
print (df)
Age Male Female Male Female
Big Small Small Big
1.0 2 3 2 3
2.0 3 4 3 4
3.0 4 5 4 5
print (df.columns)
MultiIndex([( 'Male', 'Big'),
('Female', 'Small'),
( 'Male', 'Small'),
('Female', 'Big')],
names=['Age', None])
print (df.index)
Float64Index([1.0, 2.0, 3.0], dtype='float64')
Then it is possible to use DataFrame.unstack:
df = (df.unstack()
.rename_axis(['Gender', 'Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
If that is not possible:
Create a MultiIndex with MultiIndex.from_arrays, removing the trailing . and digit with replace; then filter out the first row with DataFrame.iloc, reshape with DataFrame.melt on the first column, and finally set the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', ''),
                                        df.iloc[0]])
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns=['Age','Gender','Size','measure']
print (df)
Age Gender Size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
Or a solution with DataFrame.unstack is possible: set the first column as the index with DataFrame.set_index, and rename the MultiIndex levels with Series.rename_axis for the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', ''),
                                        df.iloc[0]])
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
        .unstack()
        .rename_axis(['Gender', 'Size', 'Age'])
        .reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
Create MultiIndex columns by combining row 0 with the existing column names:
df.columns = pd.MultiIndex.from_arrays((df.columns, df.iloc[0]))
df.columns.names = ['gender', 'size']
df.columns
MultiIndex([( 'Age', nan),
( 'Male', 'Big'),
( 'Female', 'Small'),
( 'Male.1', 'Small'),
('Female.1', 'Big')],
names=['gender', 'size'])
Now you can reshape and rename:
(df
 .dropna()
 .melt([('Age', np.nan)], value_name='measure')
 .replace(r'\.\d+$', '', regex=True)
 .rename(columns={('Age', np.nan): 'Age'}))
Age gender size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
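A small follow-up that applies to the unstack-based answers: because row 0 held the size labels, pandas read Age as float (1.0, 2.0, ...), so the reshaped Age column stays float. If integer ages are wanted and none are missing, a one-line cast fixes it (a hypothetical follow-up, not part of the original answers):
df['Age'] = df['Age'].astype(int)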

Pandas DataFrame removing NaN rows based on condition?

I'm trying to remove the rows whose gender==male and status == NaN.
Sample df:
name status gender leaves
0 tom NaN male 5
1 tom True male 6
2 tom True male 7
3 mary True female 1
4 mary NaN female 10
5 mary True female 15
6 john NaN male 2
7 mark True male 3
Expected Output:
name status gender leaves
0 tom True male 6
1 tom True male 7
2 mary True female 1
3 mary NaN female 10
4 mary True female 15
5 mark True male 3
You can use isna (or isnull) function to get the rows with a value of NaN.
With this knowledge, you can filter your dataframe using something like:
conditions = (df.gender == 'male')&(df.status.isna())
filtered_df = df[~conditions]
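If you also want the clean 0..5 index shown in the expected output, reset it after filtering:
filtered_df = filtered_df.reset_index(drop=True)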
A good one was given by @Derlin. Another way I tried is using fillna() to fill the NaNs with -1 and filtering on that, just like below:
>>> df[~((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
Just for reference, the ~ operator is the same as numpy's np.logical_not(). So this:
df[np.logical_not((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
(don't forget to import numpy as np) means the same.
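As one more option, the same filter can be written with DataFrame.query; a sketch, noting that engine='python' is needed so that isnull() can be called inside the expression:
filtered_df = df.query("not (gender == 'male' and status.isnull())", engine='python')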

Pandas groupby on one column without losing the other columns?

I have a problem with groupby in pandas. At the beginning I have this dataframe:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nicko B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now with groupby:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I get this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 3
2 2 Nicko B 7
3 3 Krish C 14
4 3 Krishx C 12
5 4 Jack kr D 12
6 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing important for me, together with the sum of T.
Thanks!
There are two ways. To avoid losing columns, add an aggregation function for each of them: first, last, or ', '.join (obviously for string columns), and aggregation functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are duplicated per group, like the Cat values here, add those columns to the groupby; only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
If you don't care about the other variable then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
See https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer.
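On pandas 0.25 or newer, the same aggregation can also be written with named aggregation, which keeps the output column names explicit; a minimal sketch:
gb = df.groupby('Code_Name', as_index=False).agg(
    Name=('Name', 'first'),
    Cat=('Cat', 'first'),
    T=('T', 'sum'),
)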

Compare two pandas dataframe with different size

I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second, smaller one like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching key. (In this example, Patrik from df1 does not exist in df2 because their ages differ, and therefore his row will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[12,12,11]})
df2
Name Age
0 Sara 12
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This keeps both key columns (A and G) alongside B and H, so drop the duplicate key and rename H to get the column names you want:
df = df.drop(columns="G").rename(columns={"H": "C"})
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H':'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying arrays via .values for performance, as done earlier.
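One caveat: searchsorted assumes df2.G is sorted (it is here) and that every value of df1.A actually occurs in df2.G; otherwise the looked-up positions are wrong. A hedged sketch that sorts first, for the general case:
import numpy as np

order = np.argsort(df2.G.values)           # sort the keys once
g_sorted = df2.G.values[order]
idx = np.searchsorted(g_sorted, df1.A.values)
df1['C'] = df2.H.values[order][idx]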
