Find pairs in pandas DataFrames and perform operations on them - python

I have a dataframe like this, and I want to find each pair and perform operations on it.
fname   lname   time of entry  ...  other columns
Adrian  Peter    1
John    Adrian   3
Peter   Rusk     4
Rusk    Anton   10
Gile    John    12
Angela  Gomes   13
Gomes   Angela  14
Now I want something like this, where the culprit is a value that appears in both fname and lname. If, as in the Angela Gomes case below, both values appear in fname and lname, then the culprit column has one row with Angela and another with Gomes.
pair  fname   lname   Culprit  time diff  ...  other columns
   1  Adrian  Peter   Adrian   -2
   1  John    Adrian  Adrian    2
   2  Peter   Rusk    Rusk     -6
   2  Rusk    Anton   Rusk      6
   3  Angela  Gomes   Angela   -1
   3  Gomes   Angela  Gomes     1
From the above I know that in pair number 3 both Angela and Gomes are culprits. Time should also be sorted in ascending order.

I'm not in love with this, there's probably a better way, but it works and doesn't use any python iteration / lists.
Code:
# Find and number the pairs and filter out the rows that don't belong
df = df.loc[df['fname'].isin(df['lname'].shift())
            | df['fname'].isin(df['lname'].shift(-1))].reset_index(drop=True)
df['pair'] = df.index // 2 + 1
# Find the culprit
df['culprit'] = df.loc[(df['fname'] == df['lname'].shift(-1))
                       | (df['fname'] == df['lname'].shift(1)), 'fname']
df.sort_values(by=['pair', 'culprit'], inplace=True)
df.ffill(inplace=True)
# Calculate the time difference ('ToE' is the time-of-entry column)
df['time_diff'] = df.loc[df['pair'] == df['pair'].shift(1), 'ToE'] - df['ToE'].shift(1)
df['time_diff'] = df['time_diff'].fillna(df['time_diff'].shift(-1) * -1).astype(int)
# Sort
df.sort_values(by=['pair', 'time_diff'], inplace=True)
print(df[['pair', 'fname', 'lname', 'culprit', 'time_diff']].to_string(index=False))
Output:
pair  fname   lname   culprit  time_diff
   1  Adrian  Peter   Adrian          -2
   1  John    Adrian  Adrian           2
   2  Peter   Rusk    Rusk            -6
   2  Rusk    Anton   Rusk             6
   3  Angela  Gomes   Angela          -1
   3  Gomes   Angela  Gomes            1

Related

How to slice pandas column with index list?

I'm trying to extract the first two words from a string in a dataframe.
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second " " (space) I tried this, but find returns NaN instead of the number of characters up to the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names. I tried this code, but it doesn't work either.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
expected return
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know!
df['temp'] = df.Name.str.rpartition().get(0)
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only the first two elements are required in the output:
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find; it returns the smallest index at which the substring (such as " ", a space) occurs in the string. You can then split the string at that index and repeat the process on the second part.
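As a runnable check of the split-and-join approach from the edit above, using the question's sample names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anthony Frank Hawk', 'John Rodney Mullen',
                            'Robert Dean Silva Burnquis', 'Geoffrey Joseph Rowley']})
# Split each name into a list of words, keep the first two, rejoin with a space
df['temp'] = df['Name'].str.split().str[:2].str.join(' ')
print(df['temp'].tolist())
# ['Anthony Frank', 'John Rodney', 'Robert Dean', 'Geoffrey Joseph']
```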

How to get the SQL SELF JOIN equivalent in pandas? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have a simple data frame with first and last names. I would like the equivalent of an SQL SELF JOIN in pandas.
Here goes the full example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'first_name': ['Rose','Summer','Jane','Kim','Jack'],
'last_name': ['Howard','Solstice','Kim','Cruz','Rose'],
'customer_id': [1,2,3,4,5]})
df
first_name last_name customer_id
0 Rose Howard 1
1 Summer Solstice 2
2 Jane Kim 3
3 Kim Cruz 4
4 Jack Rose 5
REQUIRED OUTPUT
customer_id first_name last_name customer_id_1 first_name_1 last_name_1
1 Rose Howard 5 Jack Rose
4 Kim Cruz 3 Jane Kim
Using SQL
select a.first_name, a.last_name, b.first_name, b.last_name
from df as a, df as b
where a.first_name = b.last_name
My attempt
(pd.concat([df[['first_name', 'last_name']],
            df[['first_name', 'last_name']].add_suffix('_1')],
           axis=1, ignore_index=False))
first_name last_name first_name_1 last_name_1
0 Rose Howard Rose Howard
1 Summer Solstice Summer Solstice
2 Jane Rose Jane Rose
But,
(pd.concat( [ df,df.add_suffix('_1')], axis=1)
.query(" first_name == last_name_1 ")
)
This gives empty output, to my surprise! I want two rows and four columns, as given by the SQL.
Use left_on and right_on
df.merge(df, left_on='first_name', right_on='last_name')
Result:
  first_name_x last_name_x  customer_id_x first_name_y last_name_y  customer_id_y
0         Rose      Howard              1         Jack        Rose              5
1          Kim        Cruz              4         Jane         Kim              3
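A self-contained sketch of that merge; the suffixes parameter (an addition, since the question asked for _1-style names) controls how the two sides of the self join are labelled:

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Rose', 'Summer', 'Jane', 'Kim', 'Jack'],
                   'last_name': ['Howard', 'Solstice', 'Kim', 'Cruz', 'Rose'],
                   'customer_id': [1, 2, 3, 4, 5]})

# Self join: match each row's first_name against every row's last_name.
# suffixes=('', '_1') keeps the left side's names and adds _1 on the right.
out = df.merge(df, left_on='first_name', right_on='last_name',
               suffixes=('', '_1'))
print(out)
```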

Amend row in a data-frame if it exists in another data-frame

I have two dataframes, DfMaster and DfError.
DfMaster looks like:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans A
4 3643 Kevin Franks S
5 244 Stella Howard D
and DfError looks like
Id Name Building
0 4567 John Evans A
1 244 Stella Howard D
In DfMaster I would like to change the Building value for a record to DD if it appears in the DfError data-frame. So my desired output would be:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans DD
4 3643 Kevin Franks S
5 244 Stella Howard DD
I am trying to use the following:
DfMaster.loc[DfError['Id'], 'Building'] = 'DD'
however I get an error:
KeyError: "None of [Int64Index([4567,244], dtype='int64')] are in the [index]"
What have I done wrong?
Try this using np.where:
import numpy as np
errors = DfError['Id'].unique()
DfMaster['Building'] = np.where(DfMaster['Id'].isin(errors), 'DD', DfMaster['Building'])
DataFrame.loc expects that you input an index or a Boolean series, not a value from a column.
I believe this should do the trick:
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
Basically, it says:
for all rows where the Id value is present in DfError['Id'], set the value of 'Building' to 'DD'.
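A runnable sketch of that fix with the question's data:

```python
import pandas as pd

DfMaster = pd.DataFrame({'Id': [4653, 3467, 34, 4567, 3643, 244],
                         'Name': ['Jane Smith', 'Steve Jones', 'Kim Lee',
                                  'John Evans', 'Kevin Franks', 'Stella Howard'],
                         'Building': ['A', 'B', 'F', 'A', 'S', 'D']})
DfError = pd.DataFrame({'Id': [4567, 244],
                        'Name': ['John Evans', 'Stella Howard'],
                        'Building': ['A', 'D']})

# Boolean mask selects the rows whose Id appears in DfError
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
print(DfMaster)
```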

Adding a function to a string split command in Pandas

I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great, however it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks whether the value is not null, then runs the command listed above, and otherwise enters 'NA' for First_Name and Last_Name.
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure NULL is the issue. I have some names that are 3-4 words long, e.g.:
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is a way: split the name and take, say, the first two values as the first and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (with no parameter, because it splits on whitespace by default) together with str indexing to select list elements by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
#data borrow from A-Za-z answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n so that only the first split is made and the last name keeps everything after the first word:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
This should fix your problem
Setup
data= pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
#use a lambda function to check nan before splitting the column.
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)
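As a vectorized alternative (a sketch, not taken from the answers above): passing n=1 to str.split with expand=True caps the result at two columns and passes NaN through, which sidesteps the 'Columns must be same length as key' error for three-plus-word names without any explicit null check:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'director_name': ['John Doe', np.nan, 'Alan Smith',
                                       'John Allen Doe']})
# n=1 makes at most one split, so expand=True always yields exactly two
# columns; NaN rows come out as NaN in both columns.
data[['First_Name', 'Last_Name']] = data['director_name'].str.split(n=1, expand=True)
print(data)
```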

Pandas: Concatenate two dataframes with different column names

I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
however, I want an easy way to specify that df1.actorName and df2.directorName should be joined together, and likewise actorID / directorID. How can I do this?
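One way (a sketch, assuming a pandas version with DataFrame.set_axis) is to give both frames the same column names before concatenating; the shared names 'ID' and 'Name' here are arbitrary choices:

```python
import pandas as pd

df1 = pd.DataFrame({'actorID': ['annie_potts', 'bill_farmer'],
                    'actorName': ['Annie Potts', 'Bill Farmer']})
df2 = pd.DataFrame({'directorID': ['john_lasseter', 'joe_johnston'],
                    'directorName': ['John Lasseter', 'Joe Johnston']})

# Rename both frames to a common schema, then stack them;
# set_axis returns a renamed copy, leaving df1/df2 untouched.
combined = pd.concat([df1.set_axis(['ID', 'Name'], axis=1),
                      df2.set_axis(['ID', 'Name'], axis=1)],
                     ignore_index=True)
print(combined)
```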
