create another column based on matching of two data frames columns - python

I have two data frames df1 and df2 with a common column ID. The two data frames have different numbers of rows and columns.
I want to compare the IDs of these two data frames. I want to create another column y in df1: for all IDs present in both df1 and df2 the value of y should be 0, else 1.
For example df1 is:
Id   col1   col2
1    Abc    def
2    Geh    ghk
3    Abd    fg
1    Dfg    abc
And df2 is:
Id   col3   col4
1    Dgh    gjs
2    Gsj    aie
The final dataframe should be:
Id   col1   col2   y
1    Abc    def    0
2    Geh    ghk    0
3    Abd    fg     1
1    Dfg    abc    0

Let's create df1 and df2 first:
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 1], 'col1': ['A', 'B', 'C', 'D'], 'col2': ['C', 'D', 'E', 'F']})
df2 = pd.DataFrame({'ID': [1, 2], 'col3': ['AA', 'BB'], 'col4': ['CC', 'DD']})
Here, Series.map with a lambda function comes in handy:
df1['y'] = df1['ID'].map(lambda x: 0 if x in df2['ID'].values else 1)
df1
ID col1 col2 y
0 1 A C 0
1 2 B D 0
2 3 C E 1
3 1 D F 0
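If the frames are large, a vectorized alternative is Series.isin with numpy.where; a sketch equivalent to the map above:
import numpy as np

# 0 where the ID also appears in df2, 1 otherwise; isin avoids the per-element lambda
df1['y'] = np.where(df1['ID'].isin(df2['ID']), 0, 1)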

How to add interleaving rows as a result of sort / groups?

I have the following sample input data:
import pandas as pd
df = pd.DataFrame({'col1': ['x', 'y', 'z'], 'col2': [1, 2, 3], 'col3': ['a', 'a', 'b']})
I would like to sort and group by col3 while interleaving the summaries on top of the corresponding group in col1 and get the following output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
I can of course do this part:
df.sort_values(by=['col3']).groupby(by=['col3']).sum()
col2
col3
a 3
b 3
but I am not sure how to interleave the group labels on top of col1.
Use a custom function that prepends a summary row to each group:
def f(x):
    # build a one-row summary (group label + sum), then put the group's rows under it
    summary = pd.DataFrame({'col1': x.name, 'col2': x['col2'].sum()}, index=[0])
    return pd.concat([summary, x])  # DataFrame.append was removed in pandas 2.0

df = (df.sort_values(by=['col3'])
        .groupby(by=['col3'], group_keys=False)
        .apply(f)
        .drop(columns='col3')
        .reset_index(drop=True))
print(df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
A more performant solution is to use GroupBy.ngroup for the indices, aggregate with sum, and then join the values by concat, sorting only with the stable mergesort:
df = df.sort_values(by=['col3'])
# summary rows: one per group, indexed 0..n_groups-1 (the same numbers ngroup assigns)
df1 = df.groupby(by=['col3'])['col2'].sum().rename_axis('col1').reset_index()
# give each original row its group number as the index
df2 = df.set_index(df.groupby(by=['col3']).ngroup())
# stable mergesort keeps each summary (concatenated first) above its group's rows
df = pd.concat([df1, df2]).sort_index(kind='mergesort', ignore_index=True).drop(columns='col3')
print(df)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
What about:
(df.melt(id_vars='col2')
   .rename(columns={'value': 'col1'})
   .groupby('col1')['col2'].sum()
   .reset_index()
)
output:
col1 col2
0 a 3
1 b 3
2 x 1
3 y 2
4 z 3
def function1(dd: pd.DataFrame):
    # write the group's summary row just above its first row, via a fractional index
    df.loc[dd.index.min() - 0.5, ['col1', 'col2']] = [dd.name, dd.col2.sum()]

df.groupby('col3').apply(function1).pipe(lambda dd: df.sort_index(ignore_index=True)).drop('col3', axis=1)
output:
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3
or use the pandasql library:
def function1(dd: pd.DataFrame):
    return dd.sql("select '{}' as col1,{} as col2 union select col1,col2 from self".format(dd.name, dd.col2.sum()))

df.groupby('col3').apply(function1).reset_index(drop=False)
col1 col2
0 a 3
1 x 1
2 y 2
3 b 3
4 z 3

Adding new rows with default value based on dataframe values into dataframe

I have data with a large number of columns:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
4 a z 1 0
...
98 a z 1 1
100 a x 1 0
I want to fill in the missing ID values with a default row that indicates the data is missing here. For example, that would be IDs 3 and 99, and hypothetically speaking let's say the missing row data looks like the row for ID 100:
ID col1 col2 col3 ... col100
3 a x 1 0
99 a x 1 0
Expected output:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
3 a x 1 0
4 a z 1 0
...
98 a z 1 1
99 a x 1 0
100 a x 1 0
I'm also ok with the 3 and 99 being at the bottom.
I have tried several ways of appending new rows:
noresponse = df[filterfornoresponse].head(1).copy()  # assume that this will net us row 100
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index=True)
This method doesn't seem to append anything.
I have also tried
pd.concat([df, temp], ignore_index = True)
instead of df.append
I have also tried adding the rows to a list noresponserows with the intention of concatenating the list with df:
noresponserows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0:  # IDs with no rows, i.e. missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)
But here the list always ends up with only one row, when in my data I know there is more than one row that needs to be appended.
I'm not sure why I am having trouble appending more than one instance of noresponse into the list, and why I can't directly append to a dataframe. I feel like I am missing something here.
I think it might have to do with me taking a copy of a row in the df vs constructing a new one. The reason why I take a copy of a row to get noresponse is that there are a large number of columns, so it is easier to just take an existing row.
Say you have a dataframe like this:
>>> df
col1 col2 col100 ID
0 a x 0 1
1 a y 3 2
2 a z 1 4
First, set the ID column to be the index:
>>> df = df.set_index('ID')
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
Now you can use df.loc to easily add rows.
Let's select the last row as the default row:
>>> default_row = df.iloc[-1]
>>> default_row
col1 a
col2 z
col100 1
Name: 4, dtype: object
We can add it right into the dataframe at ID 3:
>>> df.loc[3] = default_row
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
3 a z 1
Then use sort_index to sort the rows by index:
>>> df = df.sort_index()
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
3 a z 1
4 a z 1
And, optionally, reset the index:
>>> df = df.reset_index()
>>> df
ID col1 col2 col100
0 1 a x 0
1 2 a y 3
2 3 a z 1
3 4 a z 1
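If many IDs are missing, a loop-free sketch (assuming IDs are unique and should cover the range 1 through df['ID'].max()) is to reindex over the full range and fill the resulting NaN rows from the default row:
default_row = df.set_index('ID').iloc[-1]  # values to reuse for missing IDs
out = (df.set_index('ID')
         .reindex(range(1, df['ID'].max() + 1))  # inserts all-NaN rows for missing IDs
         .fillna(default_row)                    # fills each column from the default row
         .reset_index())
Note that numeric columns pass through NaN here and may come back as floats.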

Fetch row data from known index in pandas

df1:
col1 col2
0 a 5
1 b 2
2 c 1
df2:
col1
0 qa0
1 qa1
2 qa2
3 qa3
4 qa4
5 qa5
final output:
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
Basically, in df1 I have row indices into another dataframe's data stored in col2. I have to fetch the corresponding data from df2 and append it to df1. I don't know how to fetch data by index number.
Use Series.map by another Series:
df1['col3'] = df1['col2'].map(df2['col1'])
Or use DataFrame.join after renaming the column:
df1 = df1.join(df2.rename(columns={'col1':'col3'})['col3'], on='col2')
print (df1)
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
You can use iloc to select the rows by position and then to_numpy to assign the values without index alignment:
df1["col3"] = df2.iloc[df1.col2].to_numpy()  # to_numpy drops df2's index so values assign by position
df1
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1

Transform multiple categorical columns

In my dataset I have two categorical columns which I would like to encode numerically. The two columns both contain countries, and some overlap (appear in both columns). I would like to give the same number in column1 and column2 for the same country.
My data looks somewhat like:
import pandas as pd
d = {'col1': ['NL', 'BE', 'FR', 'BE'], 'col2': ['BE', 'NL', 'ES', 'ES']}
df = pd.DataFrame(data=d)
df
Currently I am transforming the data like:
from sklearn.preprocessing import LabelEncoder
df.apply(LabelEncoder().fit_transform)
However, this encodes each column independently, so it makes no distinction between FR and ES. Is there another simple way to come to the following output?
o = {'col1': [2,0,1,0], 'col2': [0,2,4,4]}
output = pd.DataFrame(data=o)
output
Here is one way
df.stack().astype('category').cat.codes.unstack()
Out[190]:
col1 col2
0 3 0
1 0 3
2 2 1
3 0 1
Or
s = df.stack()
s[:] = s.factorize()[0]
s.unstack()
Out[196]:
col1 col2
0 0 1
1 1 0
2 2 3
3 1 3
You can fit the LabelEncoder() with the unique values in your dataframe first and then transform.
le = LabelEncoder()
le.fit(pd.concat([df.col1, df.col2]).unique()) # or np.unique(df.values.reshape(-1,1))
df.apply(le.transform)
Out[28]:
col1 col2
0 3 0
1 0 3
2 2 1
3 0 1
np.unique with return_inverse. Though you then need to reconstruct the DataFrame:
pd.DataFrame(np.unique(df, return_inverse=True)[1].reshape(df.shape),
             index=df.index,
             columns=df.columns)
col1 col2
0 3 0
1 0 3
2 2 1
3 0 1
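Another way to get the same order-of-appearance codes as the stack/factorize answer above, as a sketch: factorize the flattened values and reshape back:
codes, uniques = pd.factorize(df.to_numpy().ravel())  # flatten row-major, code by first appearance
pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)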

keep the same factorizing between two data sets

We have two data sets with one variable col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
this is not what I want. I want to keep the same factorizing between the data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since performance of vectorized operations improve if you can apply them once to large DataFrames instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
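Note that if df1 contains duplicate col1 values, as in the question, the left merge would multiply the matching rows in df2; dropping duplicates first avoids that:
df2 = pd.merge(df2, df1.drop_duplicates('col1'), how='left')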
You could reuse the f_col1 column of df1 and map the values of df2.col1 by setting the index to df1.col1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u) # this is factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
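A related option, as a sketch: save the category list from df1 and reuse pandas' own categorical codes, which also flag values unseen in df1 with -1:
cats = pd.unique(df1['col1'])  # order of appearance: ['A', 'B', 'C', 'D', 'E']; store for later data sets
df1['f_col1'] = pd.Categorical(df1['col1'], categories=cats).codes
df2['f_col1'] = pd.Categorical(df2['col1'], categories=cats).codes  # values not in cats become -1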
