How to add multiple calculated columns to dataframe at once - python

I have a dataframe as below:
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y
0 1 A 27 Male A 32 Male
1 2 B 28 Female B 28 Female
2 3 C 8 Female C 1 Female
3 4 D 28 Male D 72 Male
4 5 E 25 Female E 64 Female
I need to create calculated columns, such as the difference between the ages and a gender-match check, and to achieve this in one go I am using:
DF3.loc[:,["Gendermatch","Agematch"]]= pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)])
and the resultant dataframe looks as below:
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male NaN NaN
1 2 B 28 Female B 28 Female NaN NaN
2 3 C 8 Female C 1 Female NaN NaN
3 4 D 28 Male D 72 Male NaN NaN
4 5 E 25 Female E 64 Female NaN NaN
The resultant columns show NaN. What am I doing wrong here?

DF3[["Gendermatch","Agematch"]]= np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)

DF3[["Gendermatch","Agematch"]] = pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)]).T

np.where is unnecessary here; comparing two Series already returns a boolean Series:
DF3["Gendermatch"] = DF3["Name_x"]==DF3["Name_y"]
DF3["Agematch"] = DF3["Age_x"]==DF3["Age_y"]
# or in one line
DF3["Gendermatch"], DF3["Agematch"] = (DF3["Name_x"]==DF3["Name_y"]), (DF3["Age_x"]==DF3["Age_y"])
print(DF3)
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male True False
1 2 B 28 Female B 28 Female True True
2 3 C 8 Female C 1 Female True False
3 4 D 28 Male D 72 Male True False
4 5 E 25 Female E 64 Female True False
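For completeness, both columns can also be added in a single chained call with DataFrame.assign; a minimal sketch reusing the question's column names:
DF3 = DF3.assign(
    Gendermatch=DF3["Name_x"] == DF3["Name_y"],
    Agematch=DF3["Age_x"] == DF3["Age_y"],
)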


Comparing multiple columns of a massive DataFrame with complex duplicate rows

I have a massive dataframe df with around 10 million rows:
df.sort_values(['pair','x1','x2'])
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
-------------------------------------------------------------------------------
A male H female a male d male 0
A male W male a male d male 0 (*)
A male KK female a male d male 0 (**)
B female C male a male d male 0 (-)
B female W male a male d male 0 (*)
B female BB female a male d male 0
B female KK female a male d male 0 (**)
F male W male a male d male 0 (*)
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1 (-)
D male E male b female d male 1
A male C male b female e female 2
...
Each column can be explained by the following:
x1gen is the gender of x1, x2gen is the gender of x2, and so on.
x1 cites y1 and x2 cites y2.
Each pair of y1 and y2 is assigned a unique pair value.
My objective is to find four values per unique pair:
male citing male
male citing female
female citing male
female citing female
where each citation should not be counted more than once per pair.
For example, in the given sample, x2 = W appears three times in pair = 0 (see (*)), so it should be counted once, not three times. The same applies to x2 = KK in pair = 0 (see (**)). However, we can count the same reference again if it appears in a new pair (C -> d in (-) is counted once for pair = 0 and once for pair = 1).
Hence, for the first pair pair = 0, the objective values are:
male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
male citing female = 0
female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
female citing female = 0
What I initially did was use a for loop with nested if statements, creating four lists separately for x1 and x2:
mm = [1]
mf = [0]
fm = [0]
ff = [0]
mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0
for i in range(1, len(df)):
    if df['pair'][i] == df['pair'][i-1]:
        if df['x1'][i] != df['x1'][i-1]:
            if df['x1gen'][i] == 'male':
                if df['y1gen'][i] == 'male':
                    mm1 += 1
                else:
                    mf1 += 1
            else:
                if df['y1gen'][i] == 'male':
                    fm1 += 1
                else:
                    ff1 += 1
    ...
and the rest is analogous (the full code is MANY lines long, but just a repetition of those lines). As one can tell, this is HIGHLY inefficient (it takes around 120 minutes).
What is the optimal way to find such values without having to do a highly inefficient string-matching?
You can try the following:
import io
import re
import pandas as pd
# this just recreates the dataframe
s = '''
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
A male H female a male d male 0
A male W male a male d male 0
A male KK female a male d male 0
B female C male a male d male 0
B female W male a male d male 0
B female BB female a male d male 0
B female KK female a male d male 0
F male W male a male d male 0
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1
D male E male b female d male 1
A male C male b female e female 2
'''
s = re.sub(r" +", " ", s)
df = pd.read_csv(io.StringIO(s), sep=" ")
print(df)
It gives:
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
0 A male H female a male d male 0
1 A male W male a male d male 0
2 A male KK female a male d male 0
3 B female C male a male d male 0
4 B female W male a male d male 0
5 B female BB female a male d male 0
6 B female KK female a male d male 0
7 F male W male a male d male 0
8 A male T female b female d male 1
9 A male BB female b female d male 1
10 B female C male b female d male 1
11 D male E male b female d male 1
12 A male C male b female e female 2
Counting citation pairs:
# count x1-> y1 pairs
df1 = df.drop_duplicates(subset=['x1', 'y1', 'pair'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
# count x2-> y2 pairs
df2 = df.drop_duplicates(subset=['x2', 'y2', 'pair'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
# add results
c1.add(c2, fill_value=0).astype(int)
This gives:
female_female 1
female_male 6
male_female 4
male_male 6
Computing results for each pair separately:
def cit_count(g):
    # count x1 -> y1 pairs
    df1 = g.drop_duplicates(subset=['x1', 'y1'])
    c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
    # count x2 -> y2 pairs
    df2 = g.drop_duplicates(subset=['x2', 'y2'])
    c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
    # add results
    return c1.add(c2, fill_value=0)

print(df.groupby('pair').apply(cit_count).unstack().fillna(0).astype(int))
It gives:
female_female female_male male_female male_male
pair
0 0 4 0 4
1 1 2 2 2
2 0 0 2 0
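For ~10 million rows, the groupby().apply() above may still be slow because it calls a Python function once per group. A variant that deduplicates once and counts in a single pass may be faster; this is a sketch, assuming the same column names and that a citation appearing in both channels should be counted only once per pair:
import pandas as pd

# stack the x1 -> y1 and x2 -> y2 channels into one long frame
cols = ['x', 'xgen', 'y', 'ygen', 'pair']
left = df[['x1', 'x1gen', 'y1', 'y1gen', 'pair']].set_axis(cols, axis=1)
right = df[['x2', 'x2gen', 'y2', 'y2gen', 'pair']].set_axis(cols, axis=1)
long_df = pd.concat([left, right], ignore_index=True)

# keep one row per distinct citation per pair, then count gender combinations
long_df = long_df.drop_duplicates(subset=['x', 'y', 'pair'])
counts = (long_df.groupby(['pair', 'xgen', 'ygen']).size()
          .unstack(['xgen', 'ygen'], fill_value=0))
print(counts)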

How to unstack a df from excel table with multiple levels of duplicating columns? Set multi index?

df read from an xlsx: df = pd.read_excel('file.xlsx') arrives like this:
Age Male Female Male.1 Female.1
0 NaN Big Small Small Big
1 1.0 2 3 2 3
2 2.0 3 4 3 4
3 3.0 4 5 4 5
df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
Note that pandas suffixed the duplicate columns with .1, which was not desired. I'd like to unstack / melt to get this or similar:
Age Gender Size [measure]
1 1 Male Big 2
2 2 Male Big 3
3 3 Male Big 4
4 1 Female Big 3
5 2 Female Big 4
6 3 Female Big 5
7 1 Male Small 2
8 2 Male Small 3
9 3 Male Small 4
10 1 Female Small 3
11 2 Female Small 4
12 3 Female Small 5
Renaming columns and unstacking gets close but no cigar:
df= df.rename(columns={'Male.1': 'Male', 'Female.1':'Female'})
df= df.set_index(['Age']).unstack()
How can I set the 1st row to be the 2nd index level of columns as shown here? What am I missing?
Instead of .unstack(), another approach would be .melt().
You can transpose the dataframe with .T and take everything after the first row with .iloc[1:]. Then, .rename the columns, .replace the .1 with some regex, .melt the dataframe and .sort_values.
df = pd.DataFrame({'Age':[np.nan, 1,2,3],'Male':['Big',2,3,4],'Female':['Small',3,4,5],'Male.1':['Small',2,3,4],'Female.1':['Big',3,4,5]})
df = (df.T.reset_index().iloc[1:]
        .rename({'index': 'Gender', 0: 'Size'}, axis=1)
        .replace(r'\.\d+$', '', regex=True)
        .melt(id_vars=['Gender', 'Size'], value_name='[measure]', var_name='Age')
        .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
        .reset_index(drop=True))
df = df[['Age', 'Gender', 'Size', '[measure]']]
df
Out[41]:
Age Gender Size [measure]
0 1 Male Big 2
1 2 Male Big 3
2 3 Male Big 4
3 1 Female Big 3
4 2 Female Big 4
5 3 Female Big 5
6 1 Male Small 2
7 2 Male Small 3
8 3 Male Small 4
9 1 Female Small 3
10 2 Female Small 4
11 3 Female Small 5
If possible, create a MultiIndex from the first two rows and use the first column as the index, via the header and index_col parameters of read_excel:
df = pd.read_excel('file.xlsx',header=[0,1], index_col=[0])
print (df)
Age Male Female Male Female
Big Small Small Big
1.0 2 3 2 3
2.0 3 4 3 4
3.0 4 5 4 5
print (df.columns)
MultiIndex([( 'Male', 'Big'),
('Female', 'Small'),
( 'Male', 'Small'),
('Female', 'Big')],
names=['Age', None])
print (df.index)
Float64Index([1.0, 2.0, 3.0], dtype='float64')
Then it is possible to use DataFrame.unstack:
df = (df.unstack()
.rename_axis(['Gender', 'Size','Age'])
.reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
If that is not possible:
Create a MultiIndex with MultiIndex.from_arrays, removing the trailing . and digit with replace; then filter out the first row with DataFrame.iloc, reshape with DataFrame.melt on the first column, and finally set the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', ''),
                                        df.iloc[0]])
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns=['Age','Gender','Size','measure']
print (df)
Age Gender Size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
Or a solution with DataFrame.unstack is possible: set the first column as the index with DataFrame.set_index, and rename the MultiIndex levels with Series.rename_axis for the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', ''),
                                        df.iloc[0]])
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
        .unstack()
        .rename_axis(['Gender', 'Size', 'Age'])
        .reset_index(name='measure'))
print (df)
Gender Size Age measure
0 Male Big 1.0 2
1 Male Big 2.0 3
2 Male Big 3.0 4
3 Female Small 1.0 3
4 Female Small 2.0 4
5 Female Small 3.0 5
6 Male Small 1.0 2
7 Male Small 2.0 3
8 Male Small 3.0 4
9 Female Big 1.0 3
10 Female Big 2.0 4
11 Female Big 3.0 5
Create MultiIndex columns by combining row 0 with the existing column names:
df.columns = pd.MultiIndex.from_arrays((df.columns, df.iloc[0]))
df.columns.names = ['gender', 'size']
df.columns
MultiIndex([( 'Age', nan),
( 'Male', 'Big'),
( 'Female', 'Small'),
( 'Male.1', 'Small'),
('Female.1', 'Big')],
names=['gender', 'size'])
Now you can reshape and rename:
(df
 .dropna()
 .melt([('Age', np.nan)], value_name='measure')
 .replace(r'\.\d+$', '', regex=True)
 .rename(columns={('Age', np.nan): 'Age'}))
Age gender size measure
0 1.0 Male Big 2
1 2.0 Male Big 3
2 3.0 Male Big 4
3 1.0 Female Small 3
4 2.0 Female Small 4
5 3.0 Female Small 5
6 1.0 Male Small 2
7 2.0 Male Small 3
8 3.0 Male Small 4
9 1.0 Female Big 3
10 2.0 Female Big 4
11 3.0 Female Big 5
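A small follow-up that applies to the unstack-based answers: because row 0 held the size labels, pandas read Age as float (1.0, 2.0, ...), so the reshaped Age column stays float. If integer ages are wanted and none are missing, a one-line cast fixes it (a hypothetical follow-up, not part of the original answers):
df['Age'] = df['Age'].astype(int)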

Pandas DataFrame removing NaN rows based on condition?

I'm trying to remove the rows whose gender==male and status == NaN.
Sample df:
name status gender leaves
0 tom NaN male 5
1 tom True male 6
2 tom True male 7
3 mary True female 1
4 mary NaN female 10
5 mary True female 15
6 john NaN male 2
7 mark True male 3
Expected Output:
name status gender leaves
0 tom True male 6
1 tom True male 7
2 mary True female 1
3 mary NaN female 10
4 mary True female 15
5 mark True male 3
You can use isna (or isnull) function to get the rows with a value of NaN.
With this knowledge, you can filter your dataframe using something like:
conditions = (df.gender == 'male')&(df.status.isna())
filtered_df = df[~conditions]
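If you also want the clean 0..5 index shown in the expected output, reset it after filtering:
filtered_df = filtered_df.reset_index(drop=True)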
A good one was given by @Derlin. Another way I tried is using fillna() to fill the NaNs with -1 and filtering on that, just like below:
>>> df[~((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
Just for reference, the ~ operator is the same as numpy's np.logical_not(). So this:
df[np.logical_not((df.fillna(-1)['status']==-1)&(df['gender']=='male'))]
(don't forget to import numpy as np) means the same.
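As one more option, the same filter can be written with DataFrame.query; a sketch, noting that engine='python' is needed so that isnull() can be called inside the expression:
filtered_df = df.query("not (gender == 'male' and status.isnull())", engine='python')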

Pandas groupby on one column without losing the other columns?

I have a problem with groupby in pandas. At the beginning I have this dataframe:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nicko B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now with groupby:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I get this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 3
2 2 Nicko B 7
3 3 Krish C 14
4 3 Krishx C 12
5 4 Jack kr D 12
6 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing important for me, together with the sum of T.
Thanks!
There are two ways. To avoid losing columns, add an aggregation function for each of them: first, last, or ', '.join (obviously for string columns), and aggregation functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are duplicated per group, like the Cat values here, add those columns to the groupby; only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
If you don't care about the other variable then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
See https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer.
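On pandas 0.25 or newer, the same aggregation can also be written with named aggregation, which keeps the output column names explicit; a minimal sketch:
gb = df.groupby('Code_Name', as_index=False).agg(
    Name=('Name', 'first'),
    Cat=('Cat', 'first'),
    T=('T', 'sum'),
)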

Compare two pandas dataframe with different size

I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second, smaller one like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching key. (In this example, Patrik from df1 does not exist in df2 because their ages differ, and therefore his row will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[12,12,11]})
df2
Name Age
0 Sara 12
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This keeps both key columns (A and G) alongside B and H, so drop the duplicate key and rename H to get the column names you want:
df = df.drop(columns="G").rename(columns={"H": "C"})
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H':'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying arrays via .values for performance, as done earlier.
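One caveat: searchsorted assumes df2.G is sorted (it is here) and that every value of df1.A actually occurs in df2.G; otherwise the looked-up positions are wrong. A hedged sketch that sorts first, for the general case:
import numpy as np

order = np.argsort(df2.G.values)           # sort the keys once
g_sorted = df2.G.values[order]
idx = np.searchsorted(g_sorted, df1.A.values)
df1['C'] = df2.H.values[order][idx]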
