Partial match string in two dataframe columns - python

I have two dataframes. df1 contains partial names of persons and df2 contains names of persons, their dob, etc. I want to partially match df1['Partial_names'] column with df2['Full_names'] column. For example, to match Martin Lues all rows in Full_names Having Martin in them should be fetched. In Resulting dataframe should be
df1 = pd.DataFrame()
df2 = pd.Dataframe()
df1['Partial_names'] = ['John Smith','Leo Lauana' ,'Adam Marry', 'Martin Lues']
df2['Full_names'] = ['John Smith Wade', 'Adam Blake Marry', 'Riley
Leo Lauana', 'Martin Smith', 'Martin Flex Leo']
Partial_names
1 John Smith
2 Leo Lauana
3 Adam Marry
4 Martin Lues
5 Martin Author
Full_names
1 Martin Smith
2 Riley Leo Lauana
3 Adam Blake Marry
4 Jeff Hard Jin
5 Martin Flex Leo
Partial_names Resulting_Column_with_full_names
1 John Smith John Smith Wade
2 Leo Lauana Riley Leo Lauana
3 Adam Marry Adam Blake Marry
4 Martin Lues Martin Smith
Martin Flex Leo
In actual, both dataframe has more rows

Related

Relationship based on time

I am trying to create a relationship between two data frames that are related, but there is no key that creates a relationship. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id
name
date
0
John Smith
07-01-2020 10:13:24
1
John Smith
07-22-2020 09:47:04
4
Jane Doe
07-22-2020 09:47:04
2
Jane Doe
06-13-2020 13:27:53
3
Thomas Wallace
07-08-2020 11:15:28
So people may visit the park once, or multiple times (assume that name is a unique identifier for people). For the other data frame, the data is what rides they went on during their time at the park. So the structure of this data frame is
name
ride
date
John Smith
Insanity
07-01-2020 13:53:07
John Smith
Bumper Cars
07-01-2020 16:37:29
John Smith
Tilt-A-Whirl
07-02-2020 08:21:18
John Smith
Insanity
07-22-2020 11:44:32
Jane Doe
Bumper Cars
06-13-2020 14:14:41
Jane Doe
Teacups
06-13-2020 17:31:56
Thomas Wallace
Insanity
07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id
name
ride
date
0
John Smith
Insanity
07-01-2020 13:53:07
0
John Smith
Bumper Cars
07-01-2020 16:37:29
0
John Smith
Tilt-A-Whirl
07-02-2020 08:21:18
1
John Smith
Insanity
07-22-2020 11:44:32
2
Jane Doe
Bumper Cars
06-13-2020 14:14:41
2
Jane Doe
Teacups
06-13-2020 17:31:56
3
Thomas Wallace
Insanity
07-08-2020 13:20:23
The way how I had thought about approaching this problem is by iterating over the visits and then adding the id to the ride if the name matched, the ride occurred during/after the visit, and the time delta is the smallest difference (using a large initial time delta and then updating the smallest different to that difference). If those conditions are not met, then just keep the same value. With this process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
rides['id'], rides['min_diff'] = np.where((rides['name'] == row['name']) & (
rides['date'] >= visits['date']) & (
(rides['date'] - row['date']) < rides['min_diff']),
(row['id'], rides['date'] - row['date']),
(rides['id'], rides['min_diff'))
This unfortunately does not execute because of the shapes not matching (as well as trying to assign values across multiple columns, which I am not sure how to do), but this is the general idea. I am not sure how this could be accomplished exactly, so if anyone has a solution, I would appreciate it.
Try with apply() and asof():
df1 = df1.set_index("date").sort_index() #asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["Name"]==x["Name"]]["id"].asof(x["date"]), axis=1)
>>> df2
Name ride date id
0 John Smith Insanity 2020-07-01 13:53:07 0
1 John Smith Bumper Cars 2020-07-01 16:37:29 0
2 John Smith Tilt-A-Whirl 2020-07-02 08:21:18 0
3 John Smith Insanity 2020-07-22 11:44:32 1
4 Jane Doe Bumper Cars 2020-06-13 14:14:41 2
5 Jane Doe Teacups 2020-06-13 17:31:56 2
6 Thomas Wallace Insanity 2020-07-08 13:20:23 3
I think this does what you need. The ids aren't in the order you specified but they do represent visit ids with the logic you requested.
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
merged['ymd'] = pd.to_datetime(merged.date_y).apply(lambda x: x.strftime('%Y-%m-%d'))
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4

How to slice pandas column with index list?

I'm try extract the first two words from a string in dataframe
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get index of the second " "(Space) I try this but find return NaN instead return number of characters until second Space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Status"].str.find(" ", start=df["Status"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
and the last step is slice those names, I try this code but don't work to.
df["Status"] = df["Status"].str.slice(0,df["temp"])
df["Status"]
0 NaN
1 NaN
2 NaN
3 NaN
expected return
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
if you have a more efficient way to do this please let me know!?
df['temp'] = df.Name.str.rpartition().get(0)
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only first two elements are required in output.
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find, it returns the smallest index of the substring(such as " ", empty space) found in the string. Then you can split the string by that index and reapply the above to the second string in the resulting list.

Can you make value_counts on a specific interval of characters with pandas?

So, I have a column "Names". If I do:
df['Names'].value_counts()
I get this:
Mr. Richard Vance 1
Mrs. Angela Bell 1
Mr. Stewart Randall 1
Mr. Andrew Ogden 1
Mrs. Maria Berry 1
..
Mrs. Lillian Wallace 1
Mr. William Bailey 1
Mr. Paul Ball 1
Miss Pippa Bond 1
Miss Caroline Gray 1
It's ok... Thera are lots of DISTINCT names. But what I want is to do this value_counts() only for the first characters until it get's to the empty character (i.e. space that devides, for instance Miss or Mrs. from Lillian Wallace) So that the output would be, for example:
Mrs. 1000
Mr. 2000
Miss 2000
Just to know how many distinct variants there are in the column names so that, in a 2nd stage create another variable (namely gender) based on those variants.
You can use value_counts(dropna=False) on str[0] after a str.split():
df = pd.DataFrame({'Names': ['Mr. Richard Vance','Mrs. Angela Bell','Mr. Stewart Randall','Mr. Andrew Ogden','Mrs. Maria Berry','Mrs. Lillian Wallace','Mr. William Bailey','Mr. Paul Ball','Miss Pippa Bond','Miss Caroline Gray','']})
df.Names.str.split().str[0].value_counts(dropna=False)
# Mr. 5
# Mrs. 3
# Miss 2
# NaN 1
# Name: Names, dtype: int64
If you want to know the unique values and if there's always a space you can do this.
df = pd.DataFrame(['Mr. Richard Vance',
'Mrs. Angela Bell',
'Mr. Stewart Randall',
'Mr. Andrew Ogden',
'Mrs. Maria Berry',
'Mrs. Lillian Wallace',
'Mr. William Bailey',
'Mr. Paul Ball',
'Miss Pippa Bond',
'Miss Caroline Gray'], columns=['names'])
df['names'].str.split(' ').str[0].unique().tolist()
Output is a list:
['Mr.', 'Mrs.', 'Miss']
Here is a solution. You can use regex:
#Dataset
Names
0 Mr. Richard Vance
1 Mrs. Angela Bell
2 Mr. Stewart Randall
3 Mr. Andrew Ogden
4 Mrs. Maria Berry
5 Mrs. Lillian Wallace
df['Names'].str.extract(r'(\w+\.\s)').value_counts()
#Output:
Mr. 3
Mrs. 3
Note : (\w+\.\s) will extract Mr. and Mrs. parts (or any title like Dr.) from the names

How to take row values from one pandas dataframe and use them as reference to get values from another dataframe

I have two dataframes. One contains contact information for constituents. The other was created to pair up constituents that might be part of the same household.
Sample:
data1 = {'Household_0':['1234567','2345678','3456789','4567890'],
'Individual_0':['1111111','2222222','3333333','4444444'],
'Individual_1':['5555555','6666666','7777777','']}
df1=pd.DataFrame(data1)
data2 = {'Constituent Id':['1234567','2345678','3456789','4567890',
'1111111','2222222','3333333','4444444',
'5555555','6666666','7777777'],
'Display Name':['Clark Kent and Lois Lane','Bruce Banner and Betty Ross',
'Tony Stark and Pepper Pots','Steve Rogers','Clark Kent','Bruce Banner',
'Tony Stark','Steve Rogers','Lois Lane','Betty Ross','Pepper Pots']}
df2=pd.DataFrame(data2)
Resulting in:
df1
Household_0 Individual_0 Individual_1
0 1234567 1111111 5555555
1 2345678 2222222 6666666
2 3456789 3333333 7777777
3 4567890 4444444
df2
Constituent Id Display Name
0 1234567 Clark Kent and Lois Lane
1 2345678 Bruce Banner and Betty Ross
2 3456789 Tony Stark and Pepper Pots
3 4567890 Steve Rogers
4 1111111 Clark Kent
5 2222222 Bruce Banner
6 3333333 Tony Stark
7 4444444 Steve Rogers
8 5555555 Lois Lane
9 6666666 Betty Ross
10 7777777 Pepper Pots
I would like to take df1, reference the Constituent Id out of df2, and create a new dataframe that has the names of the constituents instead of their IDs, so that we can ensure they are truly family/household members.
I believe I can do this by iterating, but that seems like the wrong approach. Is there a straightforward way to do this?
you can map each column from df1 with a series based on df2 once set_index Constituent Id and select the column Display Name. Use apply to repeat the operation on each column.
print (df1.apply(lambda x: x.map(df2.set_index('Constituent Id')['Display Name'])))
Household_0 Individual_0 Individual_1
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN
You can pipeline melt, merge and pivot_table.
df3 = (
df1
.reset_index()
.melt('index')
.merge(df2, left_on='value', right_on='Constituent Id')
.pivot_table(values='Display Name', index='index', columns='variable', aggfunc='last')
)
print(df3)
outputs
variable Household_0 Individual_0 Individual_1
index
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN
You can also try using .applymap() to link the two together.
reference = df2.set_index('Constituent Id')['Display Name'].to_dict()
df1[df1.columns] = df1[df1.columns].applymap(reference.get)

startswith() function help needed in Pandas Dataframe

I have a Name Column in Dataframe in which there are Multiple names.
DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
"Mr. Roderick Robert Crispin",
"Cunningham"," Mr. Alfred Fleming"]})`
OUTPUT
Name
0 Brailey, Mr. William Theodore Ronald
1 Roger Marie Bricoux
2 Mr. Roderick Robert Crispin
3 Cunningham
4 Mr. Alfred Fleming
I wrote a row classification function, like if I pass a row/name it should return output class
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux', 'John Frederick Preston Clarke']
def classify_role(row):
if row.loc['name'] in mus:
return 'musician'
Calling a function
is_brailey = df['name'].str.startswith('Brailey')
print(classify_role(df[is_brailey].iloc[0]))
Should show 'musician'
But output is showing different class I think I am writing something wrong here in classify_role()
Must be this row
if row.loc['name'] in mus:
Summary:
I am in need of a solution if I put first name of a person in startswith() who is in musi it should return musician
EDIT: If want test if values exist in lists you can create dictionary and test membership by Series.isin:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
'John Frederick Preston Clarke']
cat1 = ['Mr. Alfred Fleming','Cunningham']
d = {'musician':mus, 'category':cat1}
for k, v in d.items():
df.loc[df['Name'].isin(v), 'type'] = k
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin NaN
3 Cunningham category
4 Mr. Alfred Fleming category
Your solution should be changed:
mus = ['Brailey, Mr. William Theodore Ronald', 'Roger Marie Bricoux',
'John Frederick Preston Clarke']
def classify_role(row):
if row in mus:
return 'musician'
df['type'] = df['Name'].apply(classify_role)
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux musician
2 Mr. Roderick Robert Crispin None
3 Cunningham None
4 Mr. Alfred Fleming None
You can pass values in tuple to Series.str.startswith, solution should be expand to match more categories by dictionary:
d = {'musician': ['Brailey, Mr. William Theodore Ronald'],
'cat1':['Roger Marie Bricoux', 'Cunningham']}
for k, v in d.items():
df.loc[df['Name'].str.startswith(tuple(v)), 'type'] = k
print (df)
Name type
0 Brailey, Mr. William Theodore Ronald musician
1 Roger Marie Bricoux cat1
2 Mr. Roderick Robert Crispin NaN
3 Cunningham cat1
4 Mr. Alfred Fleming NaN

Categories