fetching substring with a condition from another df - python

I have 2 data sets, 1 with only address like this
import pandas as pd
import numpy as np
df = pd.DataFrame({"Address": ["36 omar st, pal, galambo","33 pom kopd malan", "15 kop st,dogg, ghog", "23 malo st, pal, kola"]})
Address
0 36 omar st, pal, galambo
1 33 pom kopd malan
2 15 kop st,dogg, ghog
3 23 malo st, pal, kola
and the other is a dataset with every state and the cities inside of it
df2 = pd.DataFrame({"State": ["galambo", "ghog", "ghog", "kola", "malan", "malan"], "City": ["pal", "dogg", "kopd", "kop", "pal", "kold"]})
State City
0 galambo pal
1 ghog dogg
2 ghog kopd
3 kola kop
4 malan pal
5 malan kold
I'm trying to fetch state name and city name out of each address, so I tried this
df["State"] = df['Address'].apply(lambda x: next((a for a in df2["State"].to_list() if a in x), np.nan))
df["City"] = df['Address'].apply(lambda x: next((a for a in df2["City"].to_list() if a in x), np.nan))
Address State City
0 36 omar st, pal, galambo galambo pal
1 33 pom kopd malan malan kopd
2 15 kop st,dogg, ghog ghog dogg
3 23 malo st, pal, kola kola pal
but as you see, the rows 1,3 are incorrect because according to df2 the State malan has no City called kopd, and State kola has no City called pal
So how can I make the output show only the cities that belong to the matched State, as listed in df2?
Update:
Expected output
Address State City
0 36 omar st, pal, galambo galambo pal
1 33 pom kopd malan malan NaN
2 15 kop st,dogg, ghog ghog dogg
3 23 malo st, pal, kola kola NaN

You can extract the last matching state/city name, then perform a merge to replace the invalid cities by NaN:
# craft regexes
regex_state = f"({'|'.join(df2['State'].unique())})"
regex_city = f"({'|'.join(df2['City'].unique())})"
# extract state/city (last match)
df['State'] = df['Address'].str.findall(regex_state).str[-1]
df['City'] = df['Address'].str.findall(regex_city).str[-1]
# fix city
df['City'] = df.merge(df2.assign(c=df2['City']), on=['City', 'State'], how='left')['c']
Output:
Address State City
0 36 omar st, pal, galambo galambo pal
1 33 pom kopd malan malan NaN
2 15 kop st,dogg, ghog ghog dogg
3 23 malo st, pal, kola kola NaN
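A variation on the same idea, as a minimal sketch rather than a definitive version: the names are escaped and wrapped in word boundaries so that a short name such as kop cannot match inside kopd, and the (State, City) pair is validated against a set built from df2 instead of a merge. The helper name last_match is hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({"Address": ["36 omar st, pal, galambo", "33 pom kopd malan",
                               "15 kop st,dogg, ghog", "23 malo st, pal, kola"]})
df2 = pd.DataFrame({"State": ["galambo", "ghog", "ghog", "kola", "malan", "malan"],
                    "City": ["pal", "dogg", "kopd", "kop", "pal", "kold"]})

def last_match(series, words):
    """Return the last whole-word occurrence of any of `words` in each string."""
    alternatives = "|".join(re.escape(w) for w in pd.unique(words))
    pattern = r"\b(?:" + alternatives + r")\b"
    return series.str.findall(pattern).str[-1]

df["State"] = last_match(df["Address"], df2["State"])
df["City"] = last_match(df["Address"], df2["City"])

# keep a city only if the (State, City) pair exists in df2
valid_pairs = set(zip(df2["State"], df2["City"]))
is_valid = pd.Series(
    [(s, c) in valid_pairs for s, c in zip(df["State"], df["City"])],
    index=df.index,
)
df["City"] = df["City"].where(is_valid)
print(df)
```

The set lookup avoids relying on the merge preserving row order, which matters if df does not have a default 0..n-1 index.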

Related

Change Typo Column Values with Right Word based on Columns in Other Dataframe

I have two dataframe, the first one is location ,
location = pd.DataFrame({'city': ['RIYADH','SEOUL','BUSAN','TOKYO','OSAKA'],
'country': ['Saudi Arabia','South Korea','South Korea','Japan','Japan']})
the other one is customer,
customer = pd.DataFrame({'id': [1001,2002,3003,4004,5005,6006,7007,8008,9009],
'city': ['tokio','Sorth KOREA','riadh','JAPANN','tokyo','osako','Arab Saudi','SEOUL','buSN']})
I want to replace each misspelled value in the city column of the customer dataframe with the correct city or country name from the location dataframe. The output should look like this:
id location
1001 TOKYO
2002 South Korea
3003 RIYADH
4004 Japan
5005 TOKYO
6006 OSAKA
7007 Saudi Arabia
8008 SEOUL
9009 BUSAN
A possible solution, based on RapidFuzz:
from rapidfuzz import process
out = (customer.assign(
    aux=customer['city']
    .map(lambda x:
         process.extractOne(x, location['city']+'*'+location['country'])[0])))
out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
out['city'] = out.apply(lambda x:
    process.extractOne(x['city'], x.loc['aux1':'aux2'])[0], axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])
Output:
id city
0 1001 TOKYO
1 2002 South Korea
2 3003 RIYADH
3 4004 Japan
4 5005 TOKYO
5 6006 OSAKA
6 7007 Saudi Arabia
7 8008 SEOUL
8 9009 BUSAN
EDIT
This tries to offer a solution for the OP's below comment:
from rapidfuzz import process
import numpy as np

def get_match(x, y, score):
    # return the best match only if its score reaches the threshold
    match = process.extractOne(x, y)
    return np.nan if match[1] < score else match[0]

out = (customer.assign(
    aux=customer['city']
    .map(lambda x:
         process.extractOne(x, location['city']+'*'+location['country'])[0])))
out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
out['city'] = out.apply(lambda x: get_match(
    x['city'], x.loc['aux1':'aux2'], 92), axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])
Output:
id city
0 1001 NaN
1 2002 NaN
2 3003 NaN
3 4004 NaN
4 5005 TOKYO
5 6006 NaN
6 7007 NaN
7 8008 SEOUL
8 9009 NaN
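If RapidFuzz is not available, the same cutoff idea can be sketched with the standard library's difflib. This is a swapped-in technique, not the answer's method: difflib scores on a 0 to 1 scale and ranks matches differently from RapidFuzz, so the results will not be identical. The helper name closest and the new location column are hypothetical.

```python
import difflib
import numpy as np
import pandas as pd

location = pd.DataFrame({'city': ['RIYADH', 'SEOUL', 'BUSAN', 'TOKYO', 'OSAKA'],
                         'country': ['Saudi Arabia', 'South Korea', 'South Korea', 'Japan', 'Japan']})
customer = pd.DataFrame({'id': [1001, 2002, 3003, 4004, 5005, 6006, 7007, 8008, 9009],
                         'city': ['tokio', 'Sorth KOREA', 'riadh', 'JAPANN', 'tokyo',
                                  'osako', 'Arab Saudi', 'SEOUL', 'buSN']})

# map lowercase spellings back to the canonical city/country names
names = pd.concat([location['city'], location['country']]).unique()
canonical = {name.lower(): name for name in names}

def closest(value, cutoff=0.6):
    """Return the canonical name closest to `value`, or NaN below the cutoff."""
    hits = difflib.get_close_matches(value.lower(), list(canonical), n=1, cutoff=cutoff)
    return canonical[hits[0]] if hits else np.nan

customer['location'] = customer['city'].map(closest)
print(customer[['id', 'location']])
```

Comparing in lowercase keeps case differences from lowering the score. Values whose words are reordered, such as 'Arab Saudi', are likely to fall below the default cutoff and come back as NaN, so the cutoff may need tuning for real data.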

How to only extract the full words of a string in Python?

I want to extract only the full words of a string.
I have this df:
Students Age
0 Boston Terry Emma 23
1 Tommy Julien Cambridge 20
2 London 21
3 New York Liu 30
4 Anna-Madrid+ Pauline 26
5 Mozart Cambridge 27
6 Gigi Tokyo Lily 18
7 Paris Diane Marie Dive 22
And I want to extract only FULL words from the string, NOT parts of words. For example, I want Liu only if Liu is in the names list; I do not want iu to match inside Liu, because Liu is not iu.
cities = ['Boston', 'Cambridge', 'Bruxelles', 'New York', 'London', 'Amsterdam', 'Madrid', 'Tokyo', 'Paris']
liked_names = ['Emma', 'Pauline', 'Tommy Julien', 'iu']
Desired df:
Students Age Cities Liked Names
0 Boston Terry Emma 23 Boston Emma
1 Tommy Julien Cambridge 20 Cambridge Tommy Julien
2 London 21 London NaN
3 New York Liu 30 New York NaN
4 Anna-Madrid+ Pauline 26 Madrid Pauline
5 Mozart Cambridge 27 Cambridge NaN
6 Gigi Tokyo Lily 18 Tokyo NaN
7 Paris Diane Marie Dive 22 Paris NaN
I tried this code:
pat = f'({"|".join(cities)})'
df['Cities'] = df['Students'].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df['Liked Names'] = df['Students'].str.extract(pat, expand=False)
My code for cities works, I just need to repair the issue for the 'Liked Names'.
How to make this work? Thanks a lot!!!
I think what you are looking for are word boundaries. In a regular expression they can be expressed with a \b. An ugly (albeit working) solution is to modify the liked_names list to include word boundaries and then run the code:
l = [
["Boston Terry Emma", 23],
["Tommy Julien Cambridge", 20],
["London", 21],
["New York Liu", 30],
["Anna-Madrid+ Pauline", 26],
["Mozart Cambridge", 27],
["Gigi Tokyo Lily", 18],
["Paris Diane Marie Dive", 22],
]
cities = [
"Boston",
"Cambridge",
"Bruxelles",
"New York",
"London",
"Amsterdam",
"Madrid",
"Tokyo",
"Paris",
]
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
# here we modify the liked_names to include word boundaries.
liked_names = [r"\b" + n + r"\b" for n in liked_names]
df = pd.DataFrame(l, columns=["Students", "Age"])
pat = f'({"|".join(cities)})'
df["Cities"] = df["Students"].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)
print(df)
A nicer solution would be to include the word boundaries in the creation of the regular expression.
I first tried using \s, i.e. whitespace, but that did not work at the end of the list, so \b was the solution. You can check https://regular-expressions.mobi/wordboundaries.html?wlr=1 for some details.
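The "nicer solution" mentioned above might look like the following sketch, which puts the word boundaries around the whole alternation when the pattern is built and leaves liked_names unmodified. Using re.escape is an addition of mine, in case a name ever contains a regex metacharacter.

```python
import re
import pandas as pd

df = pd.DataFrame({"Students": ["Boston Terry Emma", "Tommy Julien Cambridge", "London",
                                "New York Liu", "Anna-Madrid+ Pauline", "Mozart Cambridge",
                                "Gigi Tokyo Lily", "Paris Diane Marie Dive"]})
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]

# \b on both sides of the group: a name only matches as a whole word
pat = r"\b(" + "|".join(re.escape(n) for n in liked_names) + r")\b"
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)
print(df)
```

With this pattern "iu" no longer matches inside "Liu", because there is no word boundary between "L" and "i".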
You can try this regex:
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
pat = (
"(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
)
df["Liked Names"] = df["Students"].str.extract(pat)
print(df)
Prints:
Students Age Liked Names
0 Boston Terry Emma 23 Emma
1 Tommy Julien Cambridge 20 Tommy Julien
2 London 21 NaN
3 New York Liu 30 Liu
4 Anna-Madrid+ Pauline 26 Pauline
5 Mozart Cambridge 27 NaN
6 Gigi Tokyo Lily 18 NaN
7 Paris Diane Marie Dive 22 NaN
You can do an additional check to see if matched name is in Students column.
import numpy as np
def check(row):
    if row['Liked Names'] == row['Liked Names']:
        # If `Liked Names` is not nan
        # Get all possible names
        patterns = row['Students'].split(' ')
        # If matched `Liked Names` in `Students`
        isAllMatched = all([name in patterns for name in row['Liked Names'].split(' ')])
        if not isAllMatched:
            return np.nan
        else:
            return row['Liked Names']
    else:
        # If `Liked Names` is nan, still return nan
        return np.nan
df['Liked Names'] = df.apply(check, axis=1)
# print(df)
Students Age Cities Liked Names
0 Boston Terry Emma 23 Boston Emma
1 Tommy Julien Cambridge 20 Cambridge Tommy Julien
2 London 21 London NaN
3 New York Liu 30 New York NaN
4 Anna-Madrid+ Pauline 26 Madrid Pauline
5 Mozart Cambridge 27 Cambridge NaN
6 Gigi Tokyo Lily 18 Tokyo NaN
7 Paris Diane Marie Dive 22 Paris NaN

Function to move specific row to top or bottom of pandas dataframe

I have two functions which shift a row of a pandas dataframe to the top or bottom, respectively. After applying them more than once to a dataframe, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
    """Shift row, given by index_to_shift, to bottom of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex(idx + [index_to_shift])
    return df

def shift_row_to_top(df, index_to_shift):
    """Shift row, given by index_to_shift, to top of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex([index_to_shift] + idx)
    return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country' : ['USA', 'GE', 'Russia', 'BR', 'France'],
'ID' : ['11', '22', '33','44', '55'],
'City' : ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
'short_name' : ['NY', 'Ber', 'Mosc','Lon', 'Pa']
})
This is my dataframe:
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
Now, apply function for the first time. Move row with index 0 to bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply function again. This time move row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when I apply the function a second time. The problem is analogous for the function shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to top / bottom of the dataframe? Maybe a pandas-function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
idx is a plain list, and idx.pop(index_to_shift) removes the item at position index_to_shift, not the item whose value is index_to_shift. Once the index is no longer in its original order, as in the second call, those two are not the same row.
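The difference between removing by position and removing by value can be seen on the index list from the question after the first shift. This is only an illustration of the list behaviour; the variable names are made up for the example.

```python
idx = [1, 2, 3, 4, 0]          # index labels after the first shift

by_position = idx.copy()
removed = by_position.pop(2)   # removes the item at position 2, which is label 3
print(removed, by_position)

by_value = idx.copy()
by_value.remove(2)             # removes the first item equal to 2
print(by_value)
```

This is why label 3 disappeared and label 2 showed up twice in the second result.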
Try this function:
def shift_row_to_bottom(df, index_to_shift):
    idx = [i for i in df.index if i!=index_to_shift]
    return df.loc[idx+[index_to_shift]]

# call the function twice
for i in range(2):
    df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
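The same label-based idea covers the shift-to-top case as well. Below is a minimal sketch with a single hypothetical helper, shift_row, whose to parameter is my own naming; it assumes the index labels are unique.

```python
import pandas as pd

def shift_row(df, label, to="bottom"):
    """Move the row with index label `label` to the top or bottom of df."""
    rest = [i for i in df.index if i != label]     # all other labels, order kept
    order = rest + [label] if to == "bottom" else [label] + rest
    return df.loc[order]

df = pd.DataFrame({'Country': ['USA', 'GE', 'Russia', 'BR', 'France'],
                   'ID': ['11', '22', '33', '44', '55']})

step1 = shift_row(df, 0)               # label 0 to the bottom
step2 = shift_row(step1, 2)            # label 2 to the bottom
step3 = shift_row(step2, 4, to="top")  # label 4 to the top
print(step3)
```

Because the helper filters by label value, it can be applied repeatedly without resetting the index.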

Apply operation to concatenate certain rows in dataframe returning None

I have some addresses that I would like to clean.
You can see that in column address1, we have some entries that are just numbers, where they should be numbers and street names like the first three rows.
df = pd.DataFrame({'address1':['15 Main Street','10 High Street','5 Other Street',np.nan,'15','12'],
'address2':['New York','LA','London','Tokyo','Grove Street','Garden Street']})
print(df)
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street
5 12 Garden Street
I'm trying to create a function that will check if address1 is a number, and if so, concat address1 and street name from address2, then delete address2.
My expected output is this. We can see index 4 and 5 now have complete address1 entries:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN <---
5 12 Garden Street NaN <---
What I have tried with the .apply() function:
def f(x):
    try:
        #if address1 is int
        if isinstance(int(x['address1']), int):
            # create new address using address1 + address 2
            newaddress = str(x['address1']) +' '+ str(x['address2'])
            # delete address2
            x['address2'] = np.nan
            # return newaddress to address1 column
            return newadress
    except:
        pass
Applying the function:
df['address1'] = df.apply(f,axis=1)
However, the column address1 is now all None.
I've tried a few variations of this function but can't get it to work. Would appreciate advice.
You can avoid apply by using str.isdigit to pick exactly the rows that need modifying. Create a mask m to identify these rows, use agg to join their columns into a sub-dataframe, and finally combine it with the remaining rows of the original df.
m = df.address1.astype(str).str.isdigit()
df1 = df[m].agg(' '.join, axis=1).to_frame('address1').assign(address2=np.nan)
Out[179]:
address1 address2
4 15 Grove Street NaN
5 12 Garden Street NaN
Finally, combine it with the untouched rows of df. DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
pd.concat([df[~m], df1])
Out[200]:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN
If you still want to use apply, you need to modify f so that the return sits outside the if. That way unmodified rows are returned together with the modified ones:
def f(x):
    y = x.copy()
    try:
        #if address1 is int
        if isinstance(int(x['address1']), int):
            # create new address using address1 + address 2
            y['address1'] = str(x['address1']) +' '+ str(x['address2'])
            # delete address2
            y['address2'] = np.nan
    except:
        pass
    return y

df.apply(f, axis=1)
Out[213]:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN
Note: it is recommended that apply should not modify the passed object, so I do y = x.copy(), then modify and return y.
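One more detail may explain why every row came back as None in the question, not just the non-numeric ones: the original function returns newadress, a misspelling of newaddress, so the numeric rows raise a NameError that the bare except silently swallows. The sketch below catches only the errors that int() is expected to raise, so a typo like that would surface as an error. The function name merge_numeric_address is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'address1': ['15 Main Street', '10 High Street', '5 Other Street', np.nan, '15', '12'],
                   'address2': ['New York', 'LA', 'London', 'Tokyo', 'Grove Street', 'Garden Street']})

def merge_numeric_address(row):
    """Join address1 and address2 when address1 is just a number."""
    out = row.copy()                      # do not modify the row passed in by apply
    try:
        int(row['address1'])
    except (TypeError, ValueError):       # not a bare number (or NaN): leave the row alone
        return out
    out['address1'] = f"{row['address1']} {row['address2']}"
    out['address2'] = np.nan
    return out

result = df.apply(merge_numeric_address, axis=1)
print(result)
```

Narrowing the except clause is a design choice: it keeps the "is this a number?" check while letting unrelated bugs raise normally.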
You can create a mask and update:
mask = pd.to_numeric(df.address1, errors='coerce').notna()
df.loc[mask, 'address1'] = df.loc[mask, 'address1'] + ' ' +df.loc[mask,'address2']
df.loc[mask, 'address2'] = np.nan
output:
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN
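Try this: use a try/except that attempts to convert address1 to int, store the result in a helper column, and then use np.where on that column.
def test(row):
    try:
        address = int(row['address1'])
        return 1
    except:
        return 0

# flag the rows where address1 is just a number
df['test'] = df.apply(test, axis=1)
df['address1'] = np.where(df['test']==1,df['address1']+ ' '+df['address2'],df['address1'])
df['address2'] = np.where(df['test']==1,np.nan,df['address2'])
df.drop(['test'],axis=1,inplace=True)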
address1 address2
0 15 Main Street New York
1 10 High Street LA
2 5 Other Street London
3 NaN Tokyo
4 15 Grove Street NaN
5 12 Garden Street NaN

Trying to match a column of names in one df where they could be an exact or partial match of another df's column?

Goal: If the name in df2 in row i is a sub-string or an exact match of a name in df1 in some row N and the state and district columns of row N in df1 are a match to the respective state and district columns of df2 row i, combine.
Break down of data frame inputs:
1. df1 is a time-series style data frame.
2. df2 is a regular data frame.
3. df1 and df2 do not have the same length.
4. df1 names contain initials, titles, and even weird character encodings.
5. df2 names are just a combination of first name, space and last name.
My attempts have centered around taking into account 1. Names, Districts and State.
My approaches have tried to take into account that names in df1 have initials or second names, titles, etc., whereas df2 has simply first and last names. I tried to use str.contains('[A-Za-z]') to account for this difference.
# Data Frame Samples
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
# CandidateName District Party State
#0 Theodorick A. Bland 9 VA
#1 Aedanus Rutherford Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
#4 Theodorick Bland 9 VA
#5 Aedanus Burke 2 SC
#6 Jason Initial Lewis 2 Democrat MN
#7 '' 1 Whig NH
#8 '' 1 Whig NH
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
# CandidateName District Party State
#0 Theodorick Bland 9 VA
#1 Aedanus Burke 2 SC
#2 Jason Lewis 2 Democrat MN
#3 Barbara Comstock 10 Democrat VA
# Attempt code
df3 = df1.merge(df2, left_on = (df1.State, df1.District,df1.CandidateName.str.contains('[A-Za-z]')), right_on=(df2.State, df2.District,df2.Name.str.contains('[A-Za-z]')))
I included merging on District and State in order to reduce redundancies and inaccuracies. When I removed District and State from left_on and right_on, not only did the output df3 increase in size, it also contained a lot of wrong matches.
Examples include CandidateName and Name being two different people:
Theodorick A. Bland sharing the same row as Jasson Lewis Sr.
Some of the row results with the Attempt Code above are as follows:
Header
key_0 key_1 key_2 CandidateName District_x Party_x State_x District_y Name Party_y State_y
Row 6, index 4
MN 2 True Jason Lewis 2 Democrat MN 2 Jasson Lewis Sr. Republican MN
Row 11, index 3
3 VA 10 True Barbara Comstock 10 VA 10 Barbara Comstock Democrat VA
We can use difflib for this to create an artificial key column to merge on. We call this column Name, like the one in df2:
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
print(df_merge)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Theodorick Bland VA 9 Theodorick Bland
2 Aedanus Rutherford Burke SC 2 Aedanus Burke
3 Aedanus Burke SC 2 Aedanus Burke
4 Jason Lewis MN 2 Jason Lewis
5 Jason Initial Lewis MN 2 Democrat Jason Lewis
6 Barbara Comstock VA 10 Democrat Barbara Comstock
Explanation of difflib.get_close_matches: it looks for similar strings in df2. This is what the new Name column in df1 looks like:
print(df1)
CandidateName State District Party Name
0 Theodorick A. Bland VA 9 Theodorick Bland
1 Aedanus Rutherford Burke SC 2 Aedanus Burke
2 Jason Lewis MN 2 Jason Lewis
3 Barbara Comstock VA 10 Democrat Barbara Comstock
4 Theodorick Bland VA 9 Theodorick Bland
5 Aedanus Burke SC 2 Aedanus Burke
6 Jason Initial Lewis MN 2 Democrat Jason Lewis
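One caveat with indexing the result with [0]: get_close_matches returns an empty list when nothing reaches its cutoff (0.6 by default), which is the case for the empty CandidateName entries in the question's df1. A guarded variant might look like the sketch below; the helper name closest_name is hypothetical.

```python
import difflib
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'CandidateName': ['Theodorick A. Bland', 'Aedanus Rutherford Burke', 'Jason Lewis',
                      'Barbara Comstock', 'Theodorick Bland', 'Aedanus Burke',
                      'Jason Initial Lewis', '', ''],
    'State': ['VA', 'SC', 'MN', 'VA', 'VA', 'SC', 'MN', 'NH', 'NH'],
    'District': [9, 2, 2, 10, 9, 2, 2, 1, 1],
    'Party': ['', '', '', 'Democrat', '', '', 'Democrat', 'Whig', 'Whig']})
df2 = pd.DataFrame({
    'Name': ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock'],
    'State': ['VA', 'SC', 'MN', 'VA'],
    'District': [9, 2, 2, 10],
    'Party': ['', '', 'Democrat', 'Democrat']})

names = df2['Name'].tolist()

def closest_name(candidate):
    """Best match from df2['Name'], or NaN when nothing is close enough."""
    hits = difflib.get_close_matches(candidate, names, n=1)
    return hits[0] if hits else np.nan

df1['Name'] = df1['CandidateName'].apply(closest_name)

# rows without a usable name are dropped before merging on the three keys
df_merge = (df1.dropna(subset=['Name'])
               .merge(df2.drop(columns='Party'), on=['Name', 'State', 'District']))
print(df_merge)
```

Merging on State and District as well as Name still guards against a close name match that belongs to a different district.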
