Series Mismatch (Find value in "mismatched" Series) - python

Feel like I should know this. I am attempting to compare two DataFrames and find those individuals who are not included:
First df:
import pandas as pd

data_x = {'Num': [321654, 654987, 654321], 'Name': ['Tim', 'Jake', 'Sam']}
x = pd.DataFrame(data_x)
x =
Num Name
0 321654 Tim
1 654987 Jake
2 654321 Sam
Second df:
data_z = {'Num': [321654, 123456, 654987, 894523], 'Name': ['Tim', 'Jim', 'Jake', 'Bob']}
z = pd.DataFrame(data_z)
z =
Num Name
0 321654 Tim
1 123456 Jim
2 654987 Jake
3 894523 Bob
Requested Results =
Num Name
0 123456 Jim
1 894523 Bob

You can do an outer .merge() on x and z with the parameter indicator=True, then keep the rows whose merge result is right_only, as follows:
out = x.merge(z, how='outer', indicator=True)
# You can also specify the 2 columns if there are other columns in real situation
# out = x.merge(z, on=['Num', 'Name'], how='outer', indicator=True)
Result:
print(out)
Num Name _merge
0 321654 Tim both
1 654987 Jake both
2 654321 Sam left_only
3 123456 Jim right_only
4 894523 Bob right_only
and then filter the result by:
out.loc[out['_merge'] == 'right_only']
Output:
Num Name _merge
3 123456 Jim right_only
4 894523 Bob right_only
Of course you can remove the merge result column and reset the index, if you like:
out_filtered = out.loc[out['_merge'] == 'right_only']
out_filtered = out_filtered.drop(columns='_merge').reset_index(drop=True)
print(out_filtered)
Num Name
0 123456 Jim
1 894523 Bob
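An alternative that avoids the indicator column is an isin-based anti-join. A minimal sketch (it compares on both columns by packing each row into a tuple, assuming Num and Name together identify a person):
# Keep rows of z whose (Num, Name) pair does not appear in x.
in_x = z[['Num', 'Name']].apply(tuple, axis=1).isin(x[['Num', 'Name']].apply(tuple, axis=1))
print(z[~in_x].reset_index(drop=True))
# Num Name
# 0 123456 Jim
# 1 894523 Bob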

Pandas Concat dataframes with Duplicates

I am facing issues concatenating two DataFrames of different lengths. Below is the issue:
df1 =
emp_id emp_name counts
1 sam 0
2 joe 0
3 john 0
df2 =
emp_id emp_name counts
1 sam 0
2 joe 0
2 joe 1
3 john 0
My Expected output is:
Please note that my expectation is not to merge the two DataFrames into one; I would like to concat the two DataFrames side by side and highlight the differences, such that if there is a duplicate row in one df (for example df2), the corresponding row of df1 shows NaN/blank/None or any null-like value.
Expected_output_df =
df1 df2
empId emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
NaN NaN NaN 2 joe 1
3 john 0 3 john 0
whereas I am getting the output below:
actual_output_df = pd.concat([df1, df2], axis='columns', keys=['df1','df2'])
The above code gives me the DataFrame shown below, but how can I get the DataFrame shown in the expected output?
actual_output_df =
df1 df2
empId emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
3 john 0 2 joe 1
NaN NaN NaN 3 john 0
I tried pd.concat with different parameters but did not get the expected result.
The main issue I have with concat is that I am not able to move the duplicate rows one row down.
Can anyone please help me with this? Thanks in advance.
This does not give the exact output you asked for, but it could solve your problem anyway:
df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
Output:
emp_id emp_name counts _merge
0 1 sam 0 both
1 2 joe 0 both
2 3 john 0 both
3 2 joe 1 right_only
You don't have rows with NaNs as you wanted, but this way you can check whether a row is in the left df, right df, or both by looking at the _merge column. You can also give that column a custom name by passing a string, e.g. indicator='name'.
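For instance, a minimal sketch naming the indicator column 'source' (same columns as above):
# The merge-result column is named 'source' instead of the default '_merge'.
out = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator='source')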
Update
To get the exact output you want you can do the following:
import numpy as np

output_df = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
output_df[['emp_id2', 'emp_name2', 'counts2']] = output_df[['emp_id', 'emp_name', 'counts']]
output_df.loc[output_df._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
output_df.loc[output_df._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
output_df = output_df.drop('_merge', axis=1)
output_df.columns = pd.MultiIndex.from_tuples([('df1', 'emp_id'), ('df1', 'emp_name'), ('df1', 'counts'),
                                               ('df2', 'emp_id'), ('df2', 'emp_name'), ('df2', 'counts')])
Output:
df1 df2
emp_id emp_name counts emp_id emp_name counts
0 1.0 sam 0.0 1.0 sam 0.0
1 2.0 joe 0.0 2.0 joe 0.0
2 3.0 john 0.0 3.0 john 0.0
3 NaN NaN NaN 2.0 joe 1.0

Comparing two dataframes on strings: I would like to assign the matched strings to one of the dataframes in the right column and row

import re

# Code as posted; 'street' and 'TR' come from the asker's fuller script
# (df2 below has 'country' and 'City' columns).
Matches = 0
for country in df1['country']:
    for street, City in zip(df2.street, df2.City):
        if re.match(r'[A-Za-z]+\:' + street + r'\.' + City, country):
            s = re.match(r'[A-Za-z]+\:' + street + r'\.' + TR +
                         r'\_(VS).+', country)
            Matches += 1
            print(s)
print(Matches)
df1:
UID country
0 1 Gervais Philippon:France.PARISPenthièvre25
1 2 Jed Turner:England.LONDONQueensway69
2 3 Lino Jimenez:Spain.MADRIDChavela33
df2:
UID country City
0 1 France PARIS
1 2 Spain MADRID
2 3 England LONDON
Expected output:
UID country UID_df2
0 1 Gervais Philippon:France.PARISPenthièvre25 1
1 2 Jed Turner:England.LONDONQueensway69 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 2
The matches are shown correctly. How can I link the dataframes by assigning the matched string to the other dataframe? The ideal format is the expected output above. Thank you.
First, I would rename country in df1 to data or something else so it doesn't get confused with country in df2:
df1 = df1.rename(columns={'country': 'data'})
Get the country and City data
df1[['country', 'City']] = df1['data'].str.extract('(:([A-Z]+[a-z]*)).([A-Z]+)', expand=True)[[1, 2]]
Trim the extra trailing character captured in the City name (this step can be removed by updating the regex above):
df1['City'] = df1['City'].map(lambda x: x[:-1])
Finally merge with df2
df1.merge(df2, on=['country', 'City'])
UID_x data country City UID_y
0 1 Gervais Philippon:France.PARISPenthièvre25 France PARIS 1
1 2 Jed Turner:England.LONDONQueensway69 England LONDON 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 Spain MADRID 2
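As noted in the extraction step above, the trim can be avoided by refining the regex. A sketch, assuming the city is always fully uppercase and the street name that follows starts with an uppercase letter followed by a lowercase one:
# The lookahead stops the City match right before the street name begins.
df1[['country', 'City']] = df1['data'].str.extract(r':(?P<country>[A-Za-z]+)\.(?P<City>[A-Z]+)(?=[A-Z][a-z])')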

Efficient method for generating a list of values from a column in a data frame based on common secondary columns

I have a data frame (df) in Python with 4 columns (ID, Status, Person, Output). Each ID is repeated 6 times and the Output is the same for each ID. For each ID, the Status will be On/Off (3 of each).
I need to generate a new column with a list of people for each unique ID/Status combination. I also need a second new column with a group ID for each unique list of people.
This is my current code which works but is very slow when working with a large data frame due to the apply(list) function. Is there a more efficient way to do this?
people = df.groupby(['ID','Status'])['Person'].apply(list).reset_index(name='Names_ID')
people['Group_ID'] = people['Names_ID'].rank(method='dense')
df = df.drop_duplicates(subset=['ID','Status'])
df = df.merge(people, on = ('ID', 'Status'))
Here is an example input data frame:
df=
ID Status Person Output
0 On John 1
0 On Mark 1
0 On Michael 1
0 Off Peter 1
0 Off Tim 1
0 Off Jake 1
1 On Peter 0.5
1 On Dennis 0.5
1 On Jasper 0.5
1 Off John 0.5
1 Off Mark 0.5
1 Off Michael 0.5
2 On John 2
2 On Mark 2
2 On Larry 2
2 Off Peter 2
2 Off Dennis 2
2 Off Jasper 2
The desired output is:
df =
ID Status People Group_ID Output
0 On [John, Mark, Michael] 0 1
0 Off [Peter, Tim, Jake] 1 1
1 On [Peter, Dennis, Jasper] 2 0.5
1 Off [John, Mark, Michael] 0 0.5
2 On [John, Mark, Larry] 3 2
2 Off [Peter, Dennis, Jasper] 2 2
Try this:
df_out = df.groupby(['ID', 'Status'])['Person'].apply(list).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'].apply(tuple))[0]
df_out
Output:
ID Status Person Group_ID
0 0 Off [Peter, Tim, Jake] 0
1 0 On [John, Mark, Michael] 1
2 1 Off [John, Mark, Michael] 1
3 1 On [Peter, Dennis, Jasper] 2
4 2 Off [Peter, Dennis, Jasper] 2
5 2 On [John, Mark, Larry] 3
OR
df_out = df.groupby(['ID', 'Status'])['Person'].apply(', '.join).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'])[0]
df_out
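Either way, if you also want to carry the Output column through to the result (it is stated to be constant per ID), a sketch reusing the drop_duplicates + merge step from the question:
# Attach Output by taking one row per (ID, Status) from the original frame.
df_out = df_out.merge(df.drop_duplicates(subset=['ID', 'Status'])[['ID', 'Status', 'Output']],
                      on=['ID', 'Status'])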
import pandas as pd
df = pd.read_clipboard()  # load the example table copied from the question
df
One method is to use shift twice and join the three columns into a list. Then use groupby to figure out the Group_ID and merge it back into the dataframe.
df['Person1'] = df['Person'].shift(-1)
df['Person2'] = df['Person'].shift(-2)
df['People'] = '[' + df['Person'] + ',' + df['Person1'] + ',' + df['Person2'] + ']'
mult_3 = []
for i in df.index:
    if i % 3 == 0:  # i == 0 is already covered, since 0 % 3 == 0
        mult_3.append(i)
df = df.loc[df.index.isin(mult_3)].drop(['Person', 'Person1', 'Person2'], axis=1)
df_people = df.groupby('People').Status.count().reset_index().drop(['Status'], axis=1).reset_index()
df = df.merge(df_people, how='left', on='People').rename(columns={'index':'Group_ID'})
df = df[['ID', 'Status', 'People', 'Group_ID', 'Output']]
df
Python 3.7.6 and Pandas 1.0.3: the bottleneck here is probably the apply calls. (Note that cat.codes numbers groups in sorted order, which is why the Group_ID values below differ from the factorize-based answer, which numbers groups in order of appearance.)
people = df.groupby(['ID', 'Status', 'Output'])['Person'].apply(list).reset_index(name='People')
people['Group_ID'] = people['People'].apply(str).astype('category').cat.codes
Output:
ID Status Output People Group_ID
0 0 Off 1 [Peter, Tim, Jake] 3
1 0 On 1 [John, Mark, Michael] 1
2 1 Off 0.5 [John, Mark, Michael] 1
3 1 On 0.5 [Peter, Dennis, Jasper] 2
4 2 Off 2 [Peter, Dennis, Jasper] 2
5 2 On 2 [John, Mark, Larry] 0

Split single column into two based on column values

I have a dataframe that looks like this:
Supervisor Score
Bill Pass
Bill Pass
Susan Fail
Susan Fail
Susan Fail
I would like to do some aggregates (such as getting the % of pass by supervisor) and would like to split up the Score column so all the Pass are in one column and all the Fail are in another column. Like this:
Supervisor Pass Fail
Bill 0 1
Bill 0 1
Susan 1 0
Susan 1 0
Susan 1 0
Any ideas? Would a simple groupby work by grouping both the supervisor and score columns and getting a count of Score?
pd.get_dummies
Removes any columns you specify from your DataFrame in favor of N dummy columns with the default naming convention 'OrigName_UniqueVal'. Specifying empty strings for the prefix and separator gives you column headers of only the unique values.
pd.get_dummies(df, columns=['Score'], prefix_sep='', prefix='')
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
If in the end you just want the % of each category by supervisor, then you don't really need the dummies; you can use groupby. I use a reindex to ensure the resulting Series has each category represented for each Supervisor.
(df.groupby(['Supervisor']).Score.value_counts(normalize=True)
   .reindex(pd.MultiIndex.from_product([df.Supervisor.unique(), df.Score.unique()]))
   .fillna(0))
#Bill Pass 1.0
# Fail 0.0
#Susan Pass 0.0
# Fail 1.0
#Name: Score, dtype: float64
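If you'd rather see those percentages as a Supervisor-by-Score table, a small sketch appending unstack to the same chain (fill_value=0 covers the missing categories, so the reindex is not needed here):
(df.groupby(['Supervisor']).Score.value_counts(normalize=True)
   .unstack(fill_value=0))
#Score       Fail  Pass
#Supervisor
#Bill         0.0   1.0
#Susan        1.0   0.0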
IIUC, you want DataFrame.pivot_table + DataFrame.join:
new_df = df[['Supervisor']].join(df.pivot_table(columns='Score',
                                                index=df.index,
                                                values='Supervisor',
                                                aggfunc='count',
                                                fill_value=0))
print(new_df)
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
To match the exact output you posted (note its 0/1 values are inverted), flip the counts:
new_df = df[['Supervisor']].join(df.pivot_table(columns='Score',
                                                index=df.index,
                                                values='Supervisor',
                                                aggfunc='count',
                                                fill_value=0)
                                   .eq(0)
                                   .astype(int))
print(new_df)
Supervisor Fail Pass
0 Bill 1 0
1 Bill 1 0
2 Susan 0 1
3 Susan 0 1
4 Susan 0 1
Let's try this one:
df = pd.DataFrame({'Supervisor': ['Bill', 'Bill', 'Susan', 'Susan', 'Susan'],
                   'Score': ['Pass', 'Pass', 'Fail', 'Fail', 'Fail']}).set_index('Supervisor')
pd.get_dummies(df['Score'])
PANDAS 100 tricks
For more Pandas tricks, refer to the following: https://www.kaggle.com/python10pm/pandas-100-tricks
To get the df you want, you can do it like this:
df["Pass"] = df["Score"].apply(lambda x: 0 if x == "Pass" else 1)
df["Fail"] = df["Score"].apply(lambda x: 0 if x == "Fail" else 1)

Check if value from one dataframe exists in another dataframe

I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should return 1 if the name is in there and 0 if it is not, like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.
Use isin
Df1.name.isin(Df2.IDs).astype(int)
0 1
1 1
2 0
3 0
Name: name, dtype: int32
Show result in data frame
Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))
name InDf2
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
In a Series object
pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)
Marc 1
Jake 1
Sam 0
Brad 0
dtype: int32
This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))
By using merge:
s = Df1.merge(Df2, left_on='name', right_on='IDs', how='left')
s.IDs = s.IDs.notnull().astype(int)
s
Out[68]:
name IDs
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
