Pandas str.extract() regex to extract city info - python

I have a pandas df of addresses like this:
df['address']
0    ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina
1    Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina
2    State of South Carolina, County of Greenville, City of Hampton on the southern side
I want to extract the name of the city, with these expected results:
Travelers Rest
Greenville
Hampton
My code is below:
df['city'] = df['address'].str.extract(r'\b(?:City of?) (.+?(?=[,]))')
My results:
Travelers Rest
Greenville
City of Hampton on the...
However, when the city name isn't followed by a comma, the match picks up the rest of the string. But if I don't end my regex at a comma, I won't get the full city name in some cases. How can I resolve this?

One option for the example data is to match words that start with a capital letter A-Z followed by non-whitespace characters other than a comma:
\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)
import pandas as pd

data = [
"ALL that certain piece, parcel or tract of land situate, lying and being in the City of Travelers Rest, County of Greenville, State of South Carolina",
"Townes Street on the West, in the City of Greenville, County of Greenville, State of South Carolina",
"State of South Carolina, County of Greenville, City of Hampton on the southern side"
]
df = pd.DataFrame(data, columns=["address"])
df["city"] = df["address"].str.extract(r"\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)")
print(df)
Output
address city
0 ALL that certain piece, parcel or tract of lan... Travelers Rest
1 Townes Street on the West, in the City of Gree... Greenville
2 State of South Carolina, County of Greenville,... Hampton
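To see why this pattern stops where it does, here is the same regex run standalone with re on the third address; the requirement that every word starts with a capital letter is what ends the match before the lowercase "on the southern side" tail:

import re

pattern = r"\bCity\s+of\s+([A-Z][^\s,]+(?:\s+[A-Z][^\s,]+)*)"
address = ("State of South Carolina, County of Greenville, "
           "City of Hampton on the southern side")
print(re.search(pattern, address).group(1))
# Hampton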

Related

Columns must be same length as key Error python

I have 2 CSV files. In file1 I have a list of research group names; in file2 I have a list of the groups' full names along with their locations. I want to join these 2 CSV files wherever the words in them match.
I am getting Pandas ValueError: "Columns must be same length as key" on this line (I am using JupyterLab):
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
file1.csv has about 5,000 rows and file2.csv about 15,000.
file1.csv
research_groups_names_f1
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
University of California, Los Angeles
Harvard University
file2.csv
research_groups_names_f2             Locatio_f2
Chinese Academy of Sciences (CAS)    China
University of Michigan (U-M)         USA
The University of Queensland (UQ)    USA
University of California             USA
file_output.csv
research_groups_names_f1                research_groups_names_f2             Locatio_f2
Chinese Academy of Sciences             Chinese Academy of Sciences (CAS)    China
CAS                                     Chinese Academy of Sciences (CAS)    China
U-M                                     University of Michigan (U-M)         USA
UQ                                      The University of Queensland (UQ)    Australia
Harvard University                      Not found                            USA
University of California, Los Angeles   University of California             USA
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')

def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names_f1"] == row["research_groups_names_f2"]
            or row["research_groups_names_f1"] in n["research_groups_names_f2"]
        ):
            return n

df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help because the distance between CAS and Chinese Academy of Sciences (CAS) is quite large; the two have very little in common. You'll have to develop some custom approach based on your understanding of what the possible group names could be. Here is one approach that gets you most of the way there.
The idea here is to match on the university name OR the abbreviation. So in df2 we can split off the abbreviation and explode it into a new row, remove the parentheses, and in df remove any abbreviation surrounded by parentheses.
The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In this case, fuzzy matching like I mentioned in my first comment probably would help.
import pandas as pd

df = pd.DataFrame({'research_groups_names_f1': [
    'Chinese Academy of Sciences (CAS)',
    'CAS',
    'U-M',
    'UQ',
    'University of California, Los Angeles',
    'Harvard University']})
df2 = pd.DataFrame({'research_groups_names_f2': [
    'Chinese Academy of Sciences (CAS)',
    'University of Michigan (U-M)',
    'The University of Queensland (UQ)',
    'University of California'],
    'Locatio_f2': ['China', 'USA', 'USA', 'USA']})

# Split off the abbreviation and give it its own row
df2['key'] = df2['research_groups_names_f2'].str.split(r'\(')
df2 = df2.explode('key')
df2['key'] = df2['key'].str.replace(r'\(|\)', '', regex=True)

# Drop any parenthesized abbreviation from df so the full names line up
df['key'] = df['research_groups_names_f1'].str.replace(r'\(.*\)', '', regex=True)

df.merge(df2, on='key', how='left').drop(columns='key')
Output
research_groups_names_f1 research_groups_names_f2 Locatio_f2
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) USA
3 UQ The University of Queensland (UQ) USA
4 University of California, Los Angeles NaN NaN
5 Harvard University NaN NaN
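For the leftover UCLA row, a fuzzy-matching pass over just the unmatched rows could close the gap. Below is a rough sketch, assuming the fuzzywuzzy package; the cutoff of 80 is an arbitrary choice, and this fills only the name column (Locatio_f2 could then be pulled in with a second merge on the filled name):

from fuzzywuzzy import fuzz, process

merged = df.merge(df2, on='key', how='left').drop(columns='key')
choices = df2['research_groups_names_f2'].unique().tolist()

def fuzzy_fill(name):
    # extractOne returns (match, score) for a plain list of choices
    match, score = process.extractOne(name, choices, scorer=fuzz.token_set_ratio)
    return match if score >= 80 else None

unmatched = merged['research_groups_names_f2'].isna()
merged.loc[unmatched, 'research_groups_names_f2'] = (
    merged.loc[unmatched, 'research_groups_names_f1'].apply(fuzzy_fill))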

Creating a column identifier based on another column

I have the df below:
NAME
German Rural
1990 german
Mexican 1998
Mexican City
How can I create a new column based on the values of this column, if the value contains the term 'German' or 'german', i.e. matching case-insensitively?
Desired output:
NAME          Identifier
German Rural  Euro
1990 german   Euro
Mexican 1998  South American
Mexican City  South American
You could do that with something like the following:
import numpy as np

conditions = [df["NAME"].str.lower().str.contains("german"),
              df["NAME"].str.lower().str.contains("mexican")]
values = ["Euro", "South American"]
df["Identifier"] = np.select(conditions, values, default=np.nan)
print(df)
NAME      Identifier
0 German Rural Euro
1 1990 german Euro
2 Mexican 1998 South American
3 Mexican City South American
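As a side note, str.contains can do the case-folding itself through its case parameter, which reads a little more directly:

conditions = [df["NAME"].str.contains("german", case=False),
              df["NAME"].str.contains("mexican", case=False)]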

How to group categorical series (like you can in Tableau)

I don't even know how to ask this question, so forgive me if I'm not using appropriate terminology. I have a dataframe of every judicial case that is filed and disposed of. There is a series in this df called 'Court Name' which, as the name implies, lists the court where each case is filed or disposed. Here are the values:
df_combined['Court Name'].value_counts()
Out[27]:
JP 6-1 143768
JP 6-2 111792
JP 3 98831
JP 7 92768
JP 4 74083
383rd District Court 61505
JP 2 60038
JP 5 51013
JP 1 35475
Jury Duty Court 34033
388th District Court 25713
County Court at Law 7 17788
County Court at Law 1 17389
County Criminal Court 4 16877
County Court at Law 4 16823
County Court at Law 2 16812
County Criminal Court 1 16736
County Criminal Court 3 16180
County Criminal Court 2 16025
County Court at Law 5 13243
65th District Court 12635
327th District Court 11957
409th District Court 11707
County Court at Law 6 10818
120th District Court 10633
41st District Court 10308
243rd District Court 9944
Mental Health Court 1 9415
168th District Court 9252
210th District Court 9122
171st District Court 9079
384th District Court 8637
346th District Court 8470
Criminal District Court 1 8274
34th District Court 8228
205th District Court 6141
County Court at Law 3 5283
Mental Health Court 2 4575
448th District Court 3466
Magistration 1835
Probate Court 2 1597
Probate Court 1 1590
384th Competency Court 568
346th Veterans Treatment Court 153
District Clerk 92
County Clerk 43
County Courts at Law 15
Family Court Services 12
Probate Courts 7
Domestic Relations Office 3
County Criminal Courts 2
Deceptive Trade 1
Name: Court Name, dtype: int64
I'm converting from Tableau to Python/Pandas/Numpy/Plotly/Dash, and in Tableau, you can create groups based on a series. What I need to do is to categorize all of the above outputs into
District Courts
County Courts
JP Courts, and
None of the above courts / courts I'm going to filter out.
The end desired result is a new 'Category' series, so let's say case number 1 is filed in the 388th District Court, its category should be District, and if case 2 is filed in County Court at Law 1, its category should be County, and so on.
I have already created lists where each of the above 'Court Name' values falls into its proper category, but I don't know what to do with those lists, or even if creating these lists is appropriate. I'd like not to develop poor coding habits, so I'm relying on your collective expertise for the most efficient/elegant way to accomplish my end goal.
Thank you all so much in advance!
Jacob
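Since you already have the lists, one straightforward way to use them is numpy.select over isin masks. A sketch with hypothetical, abbreviated list contents (fill each list with the full set of 'Court Name' values you have already sorted into that bucket):

import numpy as np

# Hypothetical list names; populate with your full groupings.
district_courts = ['383rd District Court', '388th District Court']    # etc.
county_courts = ['County Court at Law 1', 'County Criminal Court 4']  # etc.
jp_courts = ['JP 1', 'JP 2', 'JP 3']                                  # etc.

conditions = [df_combined['Court Name'].isin(district_courts),
              df_combined['Court Name'].isin(county_courts),
              df_combined['Court Name'].isin(jp_courts)]
categories = ['District', 'County', 'JP']

# Anything outside the three lists falls through to 'Other',
# which can be filtered out afterwards.
df_combined['Category'] = np.select(conditions, categories, default='Other')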

Group a dataframe by a column and concatenate strings in another

I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the Neighbourhood is grouped by Postcode and all neighborhoods then become a concatenated string of Neighbourhoods as grouped by Postcode...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe; df is unchanged when I inspect it after running.
if I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into a Series of dtype object rather than a DataFrame.
Use this code
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() will take your group-by columns out of the index, return them as regular columns of the dataframe, and create a new integer index.
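For reference, here is that line run on a minimal, self-contained reproduction of the first few sample rows:

import pandas as pd

df = pd.DataFrame({'Postcode': ['M3A', 'M4A', 'M5A', 'M5A'],
                   'Borough': ['North York', 'North York',
                               'Downtown Toronto', 'Downtown Toronto'],
                   'Neighbourhood': ['Parkwoods', 'Victoria Village',
                                     'Harbourfront', 'Regent Park']})

new_df = df.groupby(['Postcode', 'Borough']).agg(
    {'Neighbourhood': lambda x: ', '.join(x)}).reset_index()
print(new_df)
#   Postcode           Borough              Neighbourhood
# 0      M3A        North York                  Parkwoods
# 1      M4A        North York           Victoria Village
# 2      M5A  Downtown Toronto  Harbourfront, Regent Park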

Apply fuzzy matching across a dataframe column and save results in a new column

I have two data frames, each with a different number of rows. Below are a couple of rows from each dataset.
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis=1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching commands from the fuzzywuzzy module and return the value and name of the best match. I want to store that in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The results would look like:
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series:
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
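To store the best match and its score as new columns, as the question asked, one option (not from the answer above; it passes the choices as a plain list, so extractOne returns a (match, score) pair) is process.extractOne with the scorer of your choice:

def best_match(name, choices):
    # extractOne returns (match, score) when choices is a plain list
    match, score = process.extractOne(name, choices,
                                      scorer=fuzz.token_sort_ratio)
    return pd.Series([match, score],
                     index=['token_sort_match', 'token_sort_score'])

df1[['token_sort_match', 'token_sort_score']] = (
    df1['Company'].apply(best_match, choices=df2['FDA Company'].tolist()))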
