Check if value from one dataframe exists in another dataframe - python

I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should return 1 if the name is in there, 0 if it is not like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.

Use isin
Df1.name.isin(Df2.IDs).astype(int)
0 1
1 1
2 0
3 0
Name: name, dtype: int32
To show the result in a dataframe:
Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))
name InDf2
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
In a Series object
pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)
Marc 1
Jake 1
Sam 0
Brad 0
dtype: int32

This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))

By using merge
s=Df1.merge(Df2,left_on='name',right_on='IDs',how='left')
s.IDs=s.IDs.notnull().astype(int)
s
Out[68]:
name IDs
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0

This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
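Putting it together, a minimal self-contained sketch of the isin approach, using the sample frames from the question:

```python
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})

# isin checks each name against the values in Df2['IDs'];
# astype(int) turns the resulting booleans into 1/0 flags
Df1['Match'] = Df1['name'].isin(Df2['IDs']).astype(int)
print(Df1)
```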

Related

Get count of particular values and the total based on another column value in dataframe using Pandas

I have the following dataframe df:
names   status
John    Completed
James   To Do
Jill    To Do
Robert  In Progress
Jill    To Do
Jill    To Do
Marina  Completed
Evy     Completed
Evy     Completed
Now I want the count of each type of status for each user. I can get it like this for all types of statuses.
df = pd.crosstab(df.names,df.status).reset_index("names")
So now the resulting df is
status   names  Completed  In Progress  To Do
0        James          0            0      1
1       Robert          0            1      0
2         John          1            0      0
3       Marina          1            0      0
4         Jill          0            0      3
5          Evy          2            0      0
So my problem is: how can I specify that only particular status values be counted? For example, I want only the In Progress and Completed values, not To Do. And how can I add an extra column called Total Statuses, which is the total number of rows for each name in the original dataframe?
Desired Dataframe:
status   names  Completed  In Progress  Total
0        James          0            0      1
1       Robert          0            1      1
2         John          1            0      1
3       Marina          1            0      1
4         Jill          0            0      3
5          Evy          2            0      2
Another way:
pass margins and margins_name parameters in pd.crosstab():
df=(pd.crosstab(df.names,df.status,margins=True,margins_name='Total').iloc[:-1]
    .reset_index().drop(columns='To Do'))
OR
via crosstab()+assign()
df=(pd.crosstab(df.names,df.status).assign(Total=lambda x:x.sum(axis=1))
    .reset_index().drop(columns='To Do'))
OR
In 2 steps:
df=pd.crosstab(df.names,df.status)
df=df.assign(Total=df.sum(axis=1)).drop(columns='To Do').reset_index()
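A runnable sketch of the margins variant, with the sample data from the question; `margins=True` appends a grand-total row and a per-row total column, both named by `margins_name`, and `iloc[:-1]` drops the grand-total row:

```python
import pandas as pd

df = pd.DataFrame({'names': ['John', 'James', 'Jill', 'Robert', 'Jill',
                             'Jill', 'Marina', 'Evy', 'Evy'],
                   'status': ['Completed', 'To Do', 'To Do', 'In Progress',
                              'To Do', 'To Do', 'Completed', 'Completed', 'Completed']})

# the Total column sums across all statuses (including To Do),
# so it equals the row count per name even after To Do is dropped
out = (pd.crosstab(df.names, df.status, margins=True, margins_name='Total')
         .iloc[:-1]
         .reset_index()
         .drop(columns='To Do'))
print(out)
```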
You can create the total from the addition of the three previous columns:
df['Total'] = (df['Completed'] + df['In Progress'] + df['To Do'])
Then you can drop the 'To Do' column from your new dataframe as follows:
df = df.drop(columns=['To Do'])
df = pd.DataFrame({'names': ['John', 'James', 'Jill', 'Robert', 'Jill', 'Jill', 'Marina', 'Evy', 'Evy'],
'status':['Completed', 'To Do', 'To Do', 'In Progress', 'To Do', 'To Do', 'Completed', 'Completed', 'Completed']})
df = pd.crosstab(df.names,df.status).reset_index("names")
df['Total'] = df['Completed'] + df['In Progress'] + df['To Do']
df = df.drop(columns=['To Do'])
print(df)
Output:
status names Completed In Progress Total
0 Evy 2 0 2
1 James 0 0 1
2 Jill 0 0 3
3 John 1 0 1
4 Marina 1 0 1
5 Robert 0 1 1
I can't tell what sort order you want for the rows, but I think you can manage that part yourself.

Python Transpose data and change to 0,1's

I have some data in the following format, thousands of rows.
I want to transpose the data and also change the format to 1 and 0's
Name Codes
Dave DSFFS
Dave SDFDF
stu SDFDS
stu DSGDSG
I want to retain the Name column in row format, but have the codes column go into Column format instead and have 1 and 0's
You can use df.transpose() in pandas.
pd.get_dummies() should help here.
df = pd.DataFrame({'Name': ['Dave','Dave','stu','stu'],
'Codes': ['DSFFS','SDFDF','SDFDS','DSGDSG']})
print(df)
Codes Name
0 DSFFS Dave
1 SDFDF Dave
2 SDFDS stu
3 DSGDSG stu
print(pd.get_dummies(df, columns=['Codes']))
Name Codes_DSFFS Codes_DSGDSG Codes_SDFDF Codes_SDFDS
0 Dave 1 0 0 0
1 Dave 0 0 1 0
2 stu 0 0 0 1
3 stu 0 1 0 0
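If you also want one row per Name rather than one row per original record, a follow-up groupby can collapse the dummy rows. A sketch, assuming a code seen in any of a name's rows should become 1:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Dave', 'Dave', 'stu', 'stu'],
                   'Codes': ['DSFFS', 'SDFDF', 'SDFDS', 'DSGDSG']})

# get_dummies one-hot encodes Codes; groupby(...).max() merges the
# rows for each Name, so a code present in any row becomes 1
wide = pd.get_dummies(df, columns=['Codes']).groupby('Name').max().astype(int)
print(wide)
```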

How can I generate a new column to group by membership in Pandas?

I have a dataframe:
df = pd.DataFrame({'name':['John','Fred','John','George','Fred']})
How can I transform this to generate a new column giving me group membership by value? Such that:
new_df = pd.DataFrame({'name':['John','Fred','John','George','Fred'], 'group':[1,2,1,3,2]})
Use factorize:
df['group'] = pd.factorize(df['name'])[0] + 1
print (df)
name group
0 John 1
1 Fred 2
2 John 1
3 George 3
4 Fred 2
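An alternative sketch uses groupby().ngroup(), which also numbers groups by order of first appearance when sort=False:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Fred', 'John', 'George', 'Fred']})

# ngroup() numbers each group 0..n-1; sort=False keeps first-appearance
# order, and +1 shifts the labels to start at 1 like the factorize answer
df['group'] = df.groupby('name', sort=False).ngroup() + 1
print(df)
```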

Split single column into two based on column values

I have a dataframe that looks like this:
Supervisor Score
Bill Pass
Bill Pass
Susan Fail
Susan Fail
Susan Fail
I would like to do some aggregates (such as getting the % of pass by supervisor) and would like to split up the Score column so all the Pass are in one column and all the Fail are in another column. Like this:
Supervisor Pass Fail
Bill 0 1
Bill 0 1
Susan 1 0
Susan 1 0
Susan 1 0
Any ideas? Would a simple groupby work by grouping both the supervisor and score columns and getting a count of Score?
pd.get_dummies
Removes any columns you specify from your DataFrame in favor of N dummy columns with the default naming convention 'OrigName_UniqueVal'. Specifying empty strings for the prefix and separator gives you column headers of only the unique values.
pd.get_dummies(df, columns=['Score'], prefix_sep='', prefix='')
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
If in the end you just want % of each category by supervisor then you don't really need the dummies. You can groupby. I use a reindex to ensure the resulting DataFrame has each category represented for each Supervisor.
(df.groupby(['Supervisor']).Score.value_counts(normalize=True)
.reindex(pd.MultiIndex.from_product([df.Supervisor.unique(), df.Score.unique()]))
.fillna(0))
#Bill Pass 1.0
# Fail 0.0
#Susan Pass 0.0
# Fail 1.0
#Name: Score, dtype: float64
IIUC, you want DataFrame.pivot_table + DataFrame.join
new_df = df[['Supervisor']].join(df.pivot_table(columns = 'Score',
index = df.index,
values ='Supervisor',
aggfunc='count',
fill_value=0))
print(new_df)
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
For the exact output in your example (with the flags inverted via .eq(0)):
new_df = df[['Supervisor']].join(df.pivot_table(columns = 'Score',
index = df.index,
values ='Supervisor',
aggfunc='count',
fill_value=0)
.eq(0)
.astype(int))
print(new_df)
Supervisor Fail Pass
0 Bill 1 0
1 Bill 1 0
2 Susan 0 1
3 Susan 0 1
4 Susan 0 1
Let's try this one:
df=pd.DataFrame({'Supervisor':['Bill','Bill','Susan','Susan','Susan'],
'Score':['Pass','Pass','Fail','Fail','Fail']}).set_index('Supervisor')
pd.get_dummies(df['Score'])
Pandas 100 tricks
For more Pandas tricks, see: https://www.kaggle.com/python10pm/pandas-100-tricks
To get the df you want you can do it like this:
df["Pass"] = df["Score"].apply(lambda x: 0 if x == "Pass" else 1)
df["Fail"] = df["Score"].apply(lambda x: 0 if x == "Fail" else 1)

Python pandas: map and return Nan

I have two data frame, the first one is:
id code
1 2
2 3
3 3
4 1
and the second one is:
id code name
1 1 Mary
2 2 Ben
3 3 John
I would like to map the data frame 1 so that it looks like:
id code name
1 2 Ben
2 3 John
3 3 John
4 1 Mary
I try to use this code:
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
My mapping is correct, but the mapping value are all NAN:
mapping = {1:"Mary", 2:"Ben", 3:"John"}
id code name
1 2 NaN
2 3 NaN
3 3 NaN
4 1 NaN
Does anyone know why, and how to solve it?
The problem is that the values in the code column have different types, so it's necessary to convert them to integers or strings with astype so both columns share the same type:
print (df1['code'].dtype)
object
print (df2['code'].dtype)
int64
print (type(df1.loc[0, 'code']))
<class 'str'>
print (type(df2.loc[0, 'code']))
<class 'numpy.int64'>
mapping = dict(df2[['code','name']].values)
#same dtypes - integers
df1['name'] = df1['code'].astype(int).map(mapping)
#same dtypes - object (obviously strings)
df2['code'] = df2['code'].astype(str)
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
print (df1)
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
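An equivalent one-liner (a sketch, assuming the dtypes have already been aligned as above) maps through a Series built from df2, avoiding the intermediate dict:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'code': [2, 3, 3, 1]})
df2 = pd.DataFrame({'id': [1, 2, 3], 'code': [1, 2, 3],
                    'name': ['Mary', 'Ben', 'John']})

# set_index('code')['name'] builds a code -> name Series; map looks up
# each value of df1['code'] in that Series (NaN for codes not found)
df1['name'] = df1['code'].map(df2.set_index('code')['name'])
print(df1)
```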
An alternative is to use DataFrame.merge:
df1.merge(df2.drop(columns=['id']), how='left', on=['code'])
Output:
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
