Capturing the last group: everything when the first character appears - python

I am trying to capture everything after and including the first non-digit character in the following text:
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
For example, I would want regex to capture groups in a way that it matches: 1, 1,486,399.87, 5, and ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324.
The code I have right now is:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools
# text
t = """ 1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324"""
tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S*)", t)
ttgroup = len(tt.groups())
print(tt[ttgroup])
It returns only ORTIZ. I suppose the (\S*) group at the end of the pattern needs to be improved. Is there a way to capture the entire ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324 in the last group? Thank you so much!

I'd replace the last group, that is now (\S*), with (\S.*) since you want to capture the rest of the string. Also add the re.DOTALL flag since this is a multiline string:
tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S.*)", t, re.DOTALL)
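As a quick check, here is a minimal sketch of that pattern applied to the sample text. Two details are my own assumptions rather than part of the answer: the inner groups are made non-capturing so the record splits into exactly four groups, and the last group is collapsed onto one line at the end.

```python
import re

# Sample text from the question, kept as a multi-line string.
t = """ 1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324"""

# (?:...) keeps the thousands and decimal parts out of the group numbering.
pattern = r"(\d+)\s+(\$?[+-]?\d{1,3}(?:,\d{3})*%?(?:\.\d+)?)\s+(\d+)\s+(\S.*)"

# re.DOTALL lets "." match newlines, so the last group runs to the end of the text.
m = re.search(pattern, t, re.DOTALL)
first, amount, third, rest = m.groups()

# Collapse newlines and repeated spaces into single spaces.
rest_one_line = " ".join(rest.split())
print(first, amount, third)
print(rest_one_line)
```

Without re.DOTALL the last group would stop at the first newline, because "." does not match "\n" by default.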

Related

Column value not in index in pandas dataframe

I am calculating a new column in dataframe using a regular expression with named capturing groups as follows:
(df["Address Column"]
.str.extract("(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)")
.apply(lambda x: x.str.title()))
However, I am getting a KeyError when calling new column "Suburb"
KeyError: "['Suburb'] not in index"
Sample data:
**Address column**
4a Mcarthurs Road, Altona north
1 Neal court, Altona North
4 Vermilion Drive, Greenvale
Lot 307 Bonds Lane, Greenvale
430 Blackshaws rd, Altona North
159 Bonds lane, Greenvale
Desired output:
Address             Suburb
4a Mcarthurs Road   Altona North
1 Neal court        Altona North
4 Vermilion Drive   Greenvale
Lot 307 Bonds Lane  Greenvale
430 Blackshaws rd   Altona North
159 Bonds lane      Greenvale
Not sure why I am getting this!
Any help on this will be highly appreciated.
Thank you in advance for the support!
I think your problem is that you don't assign the result of your regexp query to the original df.
The following works for me:
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df = pd.concat([df, ret], axis=1)
df["Suburb"]
For completeness, this is how I initialized df.
import pandas as pd
s = pd.Series(["4a Mcarthurs Road, Altona north",
"1 Neal court, Altona North",
"4 Vermilion Drive, Greenvale",
"Lot 307 Bonds Lane, Greenvale",
"430 Blackshaws rd, Altona North",
"159 Bonds lane, Greenvale"])
df = pd.DataFrame({"Address Column": s})
The above code adds the new columns Address and Suburb to df:
Address Column                   Address             Suburb
4a Mcarthurs Road, Altona north  4A Mcarthurs Road   Altona North
1 Neal court, Altona North       1 Neal Court        Altona North
4 Vermilion Drive, Greenvale     4 Vermilion Drive   Greenvale
Lot 307 Bonds Lane, Greenvale    Lot 307 Bonds Lane  Greenvale
430 Blackshaws rd, Altona North  430 Blackshaws Rd   Altona North
159 Bonds lane, Greenvale        159 Bonds Lane      Greenvale
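The point about assignment can be sketched as follows. This is a minimal example that assumes pandas is installed and uses only two of the sample rows; df.join is swapped in for pd.concat here as an equivalent way to attach the extracted columns.

```python
import pandas as pd

df = pd.DataFrame({"Address Column": [
    "4a Mcarthurs Road, Altona north",
    "Lot 307 Bonds Lane, Greenvale",
]})

pattern = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"

# str.extract returns a new DataFrame; df itself is left untouched.
parts = df["Address Column"].str.extract(pattern).apply(lambda col: col.str.title())

# join aligns on the index and also returns a new frame, so assign it back.
df = df.join(parts)
print(df["Suburb"].tolist())
```

Until the result is assigned back, df has only the original column, which is why looking up "Suburb" raises a KeyError.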

Need help in matching strings from phrases from multiple columns of a dataframe in python

Need help in matching phrases in the data given below where I need to match phrases from both TextA and TextB.
The following code did not help. How can I address this? I have hundreds of these to match.
#sorting jumbled phrases
def sorts(string_value):
    sorted_string = sorted(string_value.split())
    sorted_string = ' '.join(sorted_string)
    return sorted_string
#Removing punctuations in string
punc = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
def punt(test_str):
    for ele in test_str:
        if ele in punc:
            test_str = test_str.replace(ele, "")
    return(test_str)
#matching strings
def lets_match(x):
    for text1 in TextA:
        for text2 in TextB:
            try:
                if sorts(punt(x[text1.casefold()])) == sorts(punt(x[text2.casefold()])):
                    return True
            except:
                continue
    return False
df['result'] = df.apply(lets_match,axis =1)
Even after sorting the words, removing punctuation, and ignoring case, I still get these strings as not matching. Am I missing something here? Can someone help me achieve this?
You can use difflib to compare two strings. Here's what you can try:
from difflib import SequenceMatcher
def similar(a, b):
    a=str(a).lower()
    b=str(b).lower()
    return SequenceMatcher(None, a, b).ratio()
def lets_match(d):
    print(d[0]," --- ",d[1])
    result=similar(d[0],d[1])
    print(result)
    if result>0.6:
        return True
    else:
        return False
df["result"]=df.apply(lets_match,axis =1)
You can tune the 0.6 threshold in if result>0.6.
For more information, see the difflib documentation. There are other sequence-matching libraries, such as textdistance, but I found difflib easy to use, so I tried it.
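A self-contained sketch of the same idea, combining the question's preprocessing (lowercasing, stripping punctuation, sorting words) with SequenceMatcher. The helper names normalize and is_match are hypothetical, and the 0.6 threshold is just the value suggested above, not a recommended setting.

```python
import string
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase, drop punctuation, and sort the words so word order is ignored.
    cleaned = str(text).lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(cleaned.split()))

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_match(a, b, threshold=0.6):
    return similar(a, b) > threshold

print(is_match("AKIL KUMAR SINGH", "Singh, Akil Kumar"))
print(is_match("CNOC LIMITED", "CNOOC LIMITED"))
print(is_match("MURAN", "SAMEERA HEAT AND ENERGY PROPERTY FUND"))
```

Sorting the words first means reordered names such as "Singh, Akil Kumar" compare as identical, while the ratio still tolerates small spelling differences.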
Are there any issues with using a fuzzy matching library? The implementation is pretty straightforward and works well, given that the data above is relatively similar. I ran the code below without any preprocessing.
import pandas as pd
""" Install the libs below via terminal:
$pip install fuzzywuzzy
$pip install python-Levenshtein
"""
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
#creating the data frames
text_a = ['AKIL KUMAR SINGH','OUSMANI DJIBO','PETER HRYB','CNOC LIMITED','POLY NOVA INDUSTRIES LTD','SAM GAWED JR','ADAN GENERAL LLC','CHINA MOBLE LIMITED','CASTAR CO., LTD.','MURAN','OLD SAROOP FOR CAR SEAT COVERS','CNP HEALTHCARE, LLC','GLORY PACK LTD','AUNCO VENTURES','INTERNATIONAL COMPANY','SAMEERA HEAT AND ENERGY FUND']
text_b = ['Singh, Akil Kumar','DJIBO, Ousmani Illiassou','HRYB, Peter','CNOOC LIMITED','POLYNOVA INDUSTRIES LTD.','GAWED, SAM','ADAN GENERAL TRADING FZE','CHINA MOBILE LIMITED','CASTAR GROUP CO., LTD.','MURMAN','Old Saroop for Car Seat Covers','CNP HEATHCARE, LLC','GLORY PACK LTD.','AUNCO VENTURE','INTL COMPANY','SAMEERA HEAT AND ENERGY PROPERTY FUND']
df_text_a = pd.DataFrame(text_a, columns=['text_a'])
df_text_b = pd.DataFrame(text_b, columns=['text_b'])
def lets_match(txt: str, chklist: list) -> tuple:
    return process.extractOne(txt, chklist, scorer=fuzz.token_set_ratio)
#match Text_A against Text_B
result_txt_ab = df_text_a.apply(lambda x: lets_match(str(x), text_b), axis=1, result_type='expand')
result_txt_ab.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_a[result_txt_ab.columns]=result_txt_ab
df_text_a
    text_a                          Return Match                           Match Value
0   AKIL KUMAR SINGH                Singh, Akil Kumar                      100
1   OUSMANI DJIBO                   DJIBO, Ousmani Illiassou               72
2   PETER HRYB                      HRYB, Peter                            100
3   CNOC LIMITED                    CNOOC LIMITED                          70
4   POLY NOVA INDUSTRIES LTD        POLYNOVA INDUSTRIES LTD.               76
5   SAM GAWED JR                    GAWED, SAM                             100
6   ADAN GENERAL LLC                ADAN GENERAL TRADING FZE               67
7   CHINA MOBLE LIMITED             CHINA MOBILE LIMITED                   79
8   CASTAR CO., LTD.                CASTAR GROUP CO., LTD.                 81
9   MURAN                           SAMEERA HEAT AND ENERGY PROPERTY FUND  41
10  OLD SAROOP FOR CAR SEAT COVERS  Old Saroop for Car Seat Covers         100
11  CNP HEALTHCARE, LLC             CNP HEATHCARE, LLC                     58
12  GLORY PACK LTD                  GLORY PACK LTD.                        100
13  AUNCO VENTURES                  AUNCO VENTURE                          56
14  INTERNATIONAL COMPANY           INTL COMPANY                           74
15  SAMEERA HEAT AND ENERGY FUND    SAMEERA HEAT AND ENERGY PROPERTY FUND  86
#match Text_B against Text_A
result_txt_ba= df_text_b.apply(lambda x: lets_match(str(x), text_a), axis=1, result_type='expand')
result_txt_ba.rename(columns={0:'Return Match', 1:'Match Value'}, inplace=True)
df_text_b[result_txt_ba.columns]=result_txt_ba
df_text_b
    text_b                                 Return Match                    Match Value
0   Singh, Akil Kumar                      AKIL KUMAR SINGH                100
1   DJIBO, Ousmani Illiassou               OUSMANI DJIBO                   100
2   HRYB, Peter                            PETER HRYB                      100
3   CNOOC LIMITED                          CNOC LIMITED                    74
4   POLYNOVA INDUSTRIES LTD.               POLY NOVA INDUSTRIES LTD        74
5   GAWED, SAM                             SAM GAWED JR                    86
6   ADAN GENERAL TRADING FZE               ADAN GENERAL LLC                86
7   CHINA MOBILE LIMITED                   CHINA MOBLE LIMITED             81
8   CASTAR GROUP CO., LTD.                 CASTAR CO., LTD.                100
9   MURMAN                                 ADAN GENERAL LLC                33
10  Old Saroop for Car Seat Covers         OLD SAROOP FOR CAR SEAT COVERS  100
11  CNP HEATHCARE, LLC                     CNP HEALTHCARE, LLC             56
12  GLORY PACK LTD.                        GLORY PACK LTD                  100
13  AUNCO VENTURE                          AUNCO VENTURES                  53
14  INTL COMPANY                           INTERNATIONAL COMPANY           50
15  SAMEERA HEAT AND ENERGY PROPERTY FUND  SAMEERA HEAT AND ENERGY FUND    100
I don't think you can do this without some notion of string distance. One option is the recordlinkage library.
I won't go into detail, but here is an example of how to use it for this case.
import pandas as pd
import recordlinkage as rl
from recordlinkage.preprocessing import clean
# creating first dataframe
df_text_a = pd.DataFrame({
"Text A":[
"AKIL KUMAR SINGH",
"OUSMANI DJIBO",
"PETER HRYB",
"CNOC LIMITED",
"POLY NOVA INDUSTRIES LTD",
"SAM GAWED JR",
"ADAN GENERAL LLC",
"CHINA MOBLE LIMITED",
"CASTAR CO., LTD.",
"MURAN",
"OLD SAROOP FOR CAR SEAT COVERS",
"CNP HEALTHCARE, LLC",
"GLORY PACK LTD",
"AUNCO VENTURES",
"INTERNATIONAL COMPANY",
"SAMEERA HEAT AND ENERGY FUND"]
}
)
# creating second dataframe
df_text_b = pd.DataFrame({
"Text B":[
"Singh, Akil Kumar",
"DJIBO, Ousmani Illiassou",
"HRYB, Peter",
"CNOOC LIMITED",
"POLYNOVA INDUSTRIES LTD. ",
"GAWED, SAM",
"ADAN GENERAL TRADING FZE",
"CHINA MOBILE LIMITED",
"CASTAR GROUP CO., LTD.",
"MURMAN ",
"Old Saroop for Car Seat Covers",
"CNP HEATHCARE, LLC",
"GLORY PACK LTD.",
"AUNCO VENTURE",
"INTL COMPANY",
"SAMEERA HEAT AND ENERGY PROPERTY FUND"
]
}
)
# preprocessing has a big effect on the results; you have to find what works well for your problem.
cleaned_a = pd.DataFrame(clean(df_text_a["Text A"], lowercase=True))
cleaned_b = pd.DataFrame(clean(df_text_b["Text B"], lowercase=True))
# create an indexer for the comparison; several indexing types are available, see the documentation.
indexer = rl.Index()
indexer.full()
# generate all possible pairs
pairs = indexer.index(cleaned_a, cleaned_b)
# starting evaluation phase
compare = rl.Compare(n_jobs=-1)
compare.string("Text A", "Text B", method='jarowinkler', label = 'text')
matches = compare.compute(pairs, cleaned_a, cleaned_b)
matches is now a DataFrame with a MultiIndex. The next step is to find, for each value of the first index level, the row with the maximum score across the second level. That gives you the results you need.
Results can be improved by tuning the distance measure, the indexing, and/or the preprocessing.
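The "max on the second index by first index" step could look like the sketch below. It assumes pandas is installed and uses a small hand-made score table in place of the real compare.compute output; the index names "a" and "b" and the score values are made up for illustration.

```python
import pandas as pd

# Hypothetical similarity scores shaped like the comparison result:
# a MultiIndex of (row in A, row in B) and one "text" score column.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1)], names=["a", "b"]
)
matches = pd.DataFrame({"text": [0.95, 0.40, 0.35, 0.88]}, index=index)

# For each row of A (index level 0), find the (a, b) pair with the highest score.
best_pairs = matches.groupby(level=0)["text"].idxmax()

# Select those pairs; each remaining row is the best candidate for one row of A.
best = matches.loc[best_pairs.tolist()]
print(best)
```

A minimum-score cutoff can be applied afterwards to discard rows of A that have no convincing counterpart in B.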

TypeError when trying to get a dictionary from a website

I am trying to get TV guide info with the following code. However, I get TypeError: string indices must be integers.
Any help would be very useful.
import requests
url="https://www.digiturk.com.tr/_Services/TVguide/jProxy.aspx?cid=271&sd=13_4_2020_0_0"
html_content = requests.get(url).text
remove_copy="/*Copyright © 2009 Digital Platform İletişim Hizmetleri A.Ş. Tüm Hakları Saklıdır. Bu servisin izinsiz kullanımından doğacak tüm yasal yükümlülükleri izinsiz kullanan kişiler kabul etmiş sayılır.*/"
page_content=html_content.split(remove_copy)[-1]
null="null"
for ch in f["BChannels"]:
    for pr in ch["CPrograms"]:
        print(pr["PName"], pr["POName"], pr["BID"], pr["PDuration"])
page_content=html_content.split(remove_copy)[-1]
page_content is a string. You have to parse it in order to use it as a dict:
import json
...
page_content = json.loads(html_content.split(remove_copy)[-1])
...
for ch in page_content["BChannels"]:
    for pr in ch["CPrograms"]:
        print(pr["PName"], pr["POName"], pr["BID"], pr["PDuration"])
DERİN SULAR SUBMERGENCE 1048340359 6285
AZ SONRA... 1048340614 622
YEŞİL REHBER GREEN BOOK 1048340360 7458
AZ SONRA... 1048446330 934
AQUAMAN 1048446078 8245
AZ SONRA... 1048446329 1027
EVCİL HAYVANLARIN GİZLİ...2 THE SECRET LIFE OF PETS 2, THE ( 1048446287 4947
AZ SONRA... 1048446328 1056
KINGS Kings 1048446079 4887
AZ SONRA... 1048446327 1149
PARAZİT PARASITE 1048446285 7486
AZ SONRA... 1048446326 1482
HIRSIZLAR KRALI KING OF THIEVES 1048446080 6197
AZ SONRA... 1048446325 1352
VOX LUX 1048446081 6546
AZ SONRA... 1048446331 923
10x10 1048446082 4813
AZ SONRA... 1048446324 594
TULLY 1048446083 5485
AZ SONRA... 1048446323 3526
PARAZİT PARASITE 1048446295 7486
AZ SONRA... 1048446332 313
HIRSIZLAR KRALI KING OF THIEVES 1048446084 6650
Try:
import requests
url="https://www.digiturk.com.tr/_Services/TVguide/jProxy.aspx?cid=271&sd=13_4_2020_0_0"
html_content = requests.get(url).text
remove_copy="/*Copyright © 2009 Digital Platform İletişim Hizmetleri A.Ş. Tüm Hakları Saklıdır. Bu servisin izinsiz kullanımından doğacak tüm yasal yükümlülükleri izinsiz kullanan kişiler kabul etmiş sayılır.*/"
page_content=html_content.split(remove_copy)[-1]
null="null"
f = eval(page_content)
for ch in f["BChannels"]:
    for pr in ch["CPrograms"]:
        print(pr["PName"], pr["POName"], pr["BID"], pr["PDuration"])
eval() converts the string to a dictionary so that it can be traversed.
Instead of eval(), the json library can be used; this is also safer, because eval() executes whatever code the response contains (add import json at the top).
Change:
f = eval(page_content)
To:
f = json.loads(page_content)
Output:
DERİN SULAR SUBMERGENCE 1048340359 6285
AZ SONRA... 1048340614 622
YEŞİL REHBER GREEN BOOK 1048340360 7458
AZ SONRA... 1048446330 934
AQUAMAN 1048446078 8245
AZ SONRA... 1048446329 1027
EVCİL HAYVANLARIN GİZLİ...2 THE SECRET LIFE OF PETS 2, THE ( 1048446287 4947
AZ SONRA... 1048446328 1056
KINGS Kings 1048446079 4887
AZ SONRA... 1048446327 1149
PARAZİT PARASITE 1048446285 7486
AZ SONRA... 1048446326 1482
HIRSIZLAR KRALI KING OF THIEVES 1048446080 6197
AZ SONRA... 1048446325 1352
VOX LUX 1048446081 6546
AZ SONRA... 1048446331 923
10x10 1048446082 4813
AZ SONRA... 1048446324 594
TULLY 1048446083 5485
AZ SONRA... 1048446323 3526
PARAZİT PARASITE 1048446295 7486
AZ SONRA... 1048446332 313
HIRSIZLAR KRALI KING OF THIEVES 1048446084 6650
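The parsing step can be tried offline with a stand-in payload. In this sketch the response body is hypothetical: the field names mirror the ones used above, but the comment text and values are made up, and splitting on the first "*/" assumes the only comment is the leading one.

```python
import json

# Hypothetical response body: a leading /* ... */ comment followed by JSON.
raw = (
    '/* copyright notice */'
    '{"BChannels": [{"CPrograms": [{"PName": "AQUAMAN", "POName": null,'
    ' "BID": 1, "PDuration": 8245}]}]}'
)

# Drop everything up to the end of the comment, then parse the rest as JSON.
payload = raw.split("*/", 1)[-1]
data = json.loads(payload)

for ch in data["BChannels"]:
    for pr in ch["CPrograms"]:
        # JSON null becomes Python None, so no null = "null" workaround is needed.
        print(pr["PName"], pr["POName"], pr["BID"], pr["PDuration"])
```

Because json.loads understands null, true, and false natively, the parsed result is a plain dict that can be indexed by key.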

How to apply regex to get the exact house number with approximate residual address match

import re
list =[]
for element in address1:
    z = re.match("^\d+", element)
    if z:
        list.append(z.string)
get_best_fuzzy("SATYAGRAH;OPP. RAJ SUYA BUNGLOW", list)
I have tried the code above; it gives me an approximate match for the addresses in my text file. How can I get an exact match on the house number together with an approximate match on the rest of the address? My addresses are in this format:
1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India
101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India
101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India
You can try this.
Code :
import re
address = ["1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
           "101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
           "101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India"]
for i in address:
    z = re.match("^([^ ;]+)", i)
    print(z.group())
Output :
1004
101
101/32-B
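To combine the two requirements, one possible approach is to filter candidates on the extracted house number first and only then score the remainder fuzzily. The sketch below uses difflib from the standard library as a stand-in for get_best_fuzzy, whose implementation is not shown in the question; split_address and best_match are hypothetical helper names.

```python
import re
from difflib import SequenceMatcher

def split_address(address):
    # House number = everything up to the first space or semicolon.
    m = re.match(r"\s*([^ ;]+)[ ;]*(.*)", address, re.DOTALL)
    if not m:
        return "", address
    return m.group(1), m.group(2)

def best_match(query, candidates):
    # Exact match on the house number, fuzzy match on the rest of the address.
    number, rest = split_address(query)
    best, best_score = None, 0.0
    for cand in candidates:
        cand_number, cand_rest = split_address(cand)
        if cand_number != number:
            continue  # the house number must match exactly
        score = SequenceMatcher(None, rest.lower(), cand_rest.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

addresses = [
    "1004; Jay Shiva Tower; Near Azad Society; Ambawadi Ahmedabad Gujarat 380015 India",
    "101 GAMBS TOWER; FOUR BUNGLOWS;OPPOSITE GOOD SHEPHERD CHURCH ANDHERI WEST MUMBAI Maharashtra 400053 India",
    "101/32-B; SHREE GANESH COMPLEX VEER SAVARKAR BLOCK; SHAKARPUR; EASE DEL HI DELHI Delhi 110092 India",
]
match, score = best_match("1004; Jayshiva Tower; Near Azad Society; Ambawadi Ahmedabad", addresses)
print(match, round(score, 2))
```

A query for house number 101 can then never be matched to 101/32-B, however similar the rest of the address looks.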

Python pandas regular expression replace part of the matching pattern

I've got a bunch of addresses like so:
df['street'] =
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
What I want to do is to remove those numbers at the end of the street part of the addresses to get:
df['street'] =
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
I have my regular expression like this:
df['street'] = df.street.str.replace(r"""\s(?:dr|ave|rd)[^a-zA-Z]\D*\d+$""", '', case=False)
which gives me this:
df['street'] =
5311 Whitsett
355 Sawyer St
607 Hampshire
342 Old Hwy 1
267 W Juniper
It dropped the words 'Ave', 'Rd' and 'Dr' from my original street addresses. Is there a way to keep part of the regular expression pattern (in my case 'Ave', 'Rd', 'Dr') and replace the rest?
EDIT:
Notice the address 342 Old Hwy 1. I do not want to also take out the number in such cases. That's why I specified the patterns ('Ave', 'Rd', 'Dr', etc) to have a better control of who gets changed.
import re
df_street = '''
5311 Whitsett Ave 34
355 Sawyer St
607 Hampshire Rd #358
342 Old Hwy 1
267 W Juniper Dr 402
'''
# digits on the end are preceded by one of ( Ave, Rd, Dr), space,
# may be preceded by a #, and followed by a possible space, and by the newline
# flags must be passed by keyword: the fourth positional argument of re.sub is count
df_street = re.sub(r'(Ave|Rd|Dr)\s+#?\d+\s*\n', r'\1\n', df_street, flags=re.MULTILINE|re.IGNORECASE)
print(df_street)
5311 Whitsett Ave
355 Sawyer St
607 Hampshire Rd
342 Old Hwy 1
267 W Juniper Dr
You could use the following regex. Note that it strips any trailing number, so it would also remove the 1 from 342 Old Hwy 1:
>>> import re
>>> example_str = "607 Hampshire Rd #358"
>>> re.sub(r"\s*\#?[^\D]+\s*$", r"", example_str)
'607 Hampshire Rd'
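To keep part of the match, capture it in a group and refer back to it in the replacement. The following is a minimal sketch with plain re on a list of the sample addresses; the same pattern string and r"\1" replacement could presumably be passed to df['street'].str.replace with case=False and regex=True, though that is not tested here.

```python
import re

streets = [
    "5311 Whitsett Ave 34",
    "355 Sawyer St",
    "607 Hampshire Rd #358",
    "342 Old Hwy 1",
    "267 W Juniper Dr 402",
]

# Group 1 captures the street suffix; \1 in the replacement puts it back,
# so only the trailing unit number (with an optional #) is removed.
pattern = re.compile(r"\b(ave|rd|dr)\s+#?\d+\s*$", re.IGNORECASE)

cleaned = [pattern.sub(r"\1", s) for s in streets]
print(cleaned)
```

Because the pattern only fires after Ave, Rd, or Dr, an address like 342 Old Hwy 1 keeps its trailing number.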
