Python regex solution for extracting German address format

I'm trying to write a Python regex that extracts German addresses like the ones shown below.
Abc Gmbh Ensisheimer Straße 6-8 79346 Endingen
Def Gmbh Keltenstr . 16 77971 Kippenheim Deutschland
Ghi Deutschland Gmbh 53169 Bonn
Jkl Gmbh Ensisheimer Str . 6 -8 79346 Endingen
I wrote the code below to extract the individual address components and also combined them into a single regex, but it still fails to detect the addresses above. Can anyone please help me with it?
import re
# TEST COMPANY NAME
string = 'Telekom Deutschland Gmbh 53169 Bonn Datum'
result = re.findall(r'([a-zA-Zäöüß]+\s*?[A-Za-zäöüß]+\s*?[A-Za-zäöüß]?)',string,re.MULTILINE)
print(result)
# TEST STREET NAME
result = re.findall(r'([a-zA-Zäöüß]+\s*\.)',string)
print(result)
# TEST STREET NUMBER
result = re.findall(r'(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)',string)
print(result)
# TEST POSTAL CODE
result = re.findall(r'(\d{5})',string)
print(result)
# TEST CITY NAME
result = re.findall(r'([A-Za-z]+)?',string)
print(result)
# TEST COMBINED ADDRESS COMPONENTS GROUP
result = re.findall(r'([a-zA-Zäöüß]+\s+?[A-Za-zäöüß]+\s+?[A-Za-zäöüß]+\s+([a-zA-Zäöüß]+\s*\.)+?\s+(\d{1,3}\s*[a-zA-Z]?[+|-]?\s*[\d{1,3}]?)+\s+(\d{5})+\s+([A-Za-z]+))',string)
print(result)
Please note my objective: if any of these addresses appear in a large paragraph of text, the regex should extract and print only the addresses. Can someone please help me?

I would opt against a regex solution and use libpostal instead. It has bindings for several languages (for Python, use postal). You will have to install libpostal separately, since it includes 1.8GB of training data.
The good thing is that you can give it address parts in any order, and it will pick the right parts most of the time.
It uses machine learning, trained on OpenStreetMap data in many languages.
For the examples given, you would not necessarily need to cut the company name and country from the string:
from postal.parser import parse_address
parse_address('Telekom Deutschland Gmbh 53169 Bonn Datum')
[('telekom deutschland gmbh', 'house'),
('53169', 'postcode'),
('bonn', 'city'),
('datum', 'house')]
parse_address('Keltenstr . 16 77971 Kippenheim')
[('keltenstr', 'road'),
('16', 'house_number'),
('77971', 'postcode'),
('kippenheim', 'city')]
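If you want to look parts up by label afterwards, a small helper can group the result. This is only a sketch: it assumes the parser returns a list of (value, label) pairs exactly as printed above, and group_components is a made-up name, not part of postal. Values are kept in lists because a label can occur more than once (see 'house' in the first example).

```python
from collections import defaultdict

def group_components(parsed):
    """Collect (value, label) pairs into a dict of label -> list of values."""
    grouped = defaultdict(list)
    for value, label in parsed:
        grouped[label].append(value)
    return dict(grouped)

# Pairs copied from the second parse_address output above
parsed = [('keltenstr', 'road'),
          ('16', 'house_number'),
          ('77971', 'postcode'),
          ('kippenheim', 'city')]
components = group_components(parsed)
print(components['postcode'], components['city'])
```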

Related

Replace a word in an address string with dictionary value using for-loop

I have an address, 2300 S SUPER TEMPLE PL, which I expect to become 2300 S SUPER TEMPLE PLACE after spelling out PL as PLACE. I have a dictionary of abbreviated street names:
st_abbr = {'DR': 'DRIVE',
'RD': 'ROAD',
'BLVD':'BOULEVARD',
'ST':'STREET',
'STE':'SUITE',
'APTS':'APARTMENTS',
'APT':'APARTMENT',
'CT':'COURT',
'LN' : 'LANE',
'AVE':'AVENUE',
'CIR':'CIRCLE',
'PKWY': 'PARKWAY',
'HWY': 'HIGHWAY',
'SQ':'SQUARE',
'BR':'BRIDGE',
'LK':'LAKE',
'MT':'MOUNT',
'MTN':'MOUNTAIN',
'PL':'PLACE',
'RTE':'ROUTE',
'TR':'TRAIL'}
With a for-loop, I would like to replace each key found in the address with its spelled-out value. My idea was to loop through each word in the address, which is why I use address.split(), and if a word matches one of the keys in the dictionary, replace it with the spelled-out word.
for key in st_abbr.keys():
    if key in address.split():
        address = address.replace(key, st_abbr[key])
print(address)
It works on abbreviated street names, but this is what I get: 2300 S SUPER TEMPLACEE PLACE. It also replaced the PL inside 'TEMPLE' with PLACE, which gave me 'TEMPLACEE'. I am trying to modify the for loop so that it only replaces a word from address.split() when that word exactly matches one of the dict.keys(). I would like guidance on how to achieve that.
Use a comprehension:
addr = '2300 S SUPER TEMPLE PL'
new_addr = ' '.join(st_abbr.get(c, c) for c in addr.split())
print(new_addr)
# Output
2300 S SUPER TEMPLE PLACE
Can you shed some light on the concept behind .get(c, c) in the context of my problem?
# Equivalent code
' '.join(st_abbr[c] if c in st_abbr else c for c in addr.split())
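To see what the second argument of .get does on its own, here is a tiny sketch with a shortened copy of the dictionary (the variable names expanded and unchanged are only for illustration):

```python
st_abbr = {'PL': 'PLACE', 'ST': 'STREET'}  # shortened copy of the dictionary above

# Key present: the mapped value is returned
expanded = st_abbr.get('PL', 'PL')
# Key missing: the second argument is returned, so the word passes through unchanged
unchanged = st_abbr.get('TEMPLE', 'TEMPLE')

print(expanded, unchanged)  # PLACE TEMPLE
```

Because the lookup happens per whole word from addr.split(), 'TEMPLE' is never searched for a 'PL' substring.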
Not sure whether it's the best idea or not, but regex usually can be helpful in these cases:
import re
def getValue(value):
    # Groups 1 and 3 hold the whitespace (or string boundary) around the candidate word
    before = value.group(1)
    name = value.group("name")
    after = value.group(3)
    if name in st_abbr:
        return before + st_abbr[name] + after
    else:
        return before + name + after
myString = "2300 S SUPER TEMPLE PL"
print(re.sub(r"(^|\s)+(?P<name>[A-Z]{2,4})($|\s)", getValue, myString))
Output
2300 S SUPER TEMPLE PLACE
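A possible variant of the regex idea, offered as a sketch rather than as part of the answer above: build the pattern from the dictionary keys and wrap it in \b word boundaries, so only whole words are replaced. The names pattern and expand are made up for this example, and the dictionary is shortened.

```python
import re

st_abbr = {'PL': 'PLACE', 'ST': 'STREET', 'STE': 'SUITE'}  # shortened copy

# One alternation of all keys; \b on both sides means "whole word only"
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, st_abbr)) + r')\b')

def expand(address):
    # Replace every whole-word abbreviation with its spelled-out form
    return pattern.sub(lambda m: st_abbr[m.group(1)], address)

print(expand('2300 S SUPER TEMPLE PL'))
```

The 'PL' inside 'TEMPLE' is left alone because there is no word boundary between 'M' and 'P'.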

How to match city names split by space?

Given two different types of strings, how can I determine whether a city name consists of more than one word? Working in Python, I split the string and save s[0] as the street number, s[-1] as the zip code, and so on, but how do I figure out whether the city name is split across several words, such as New York or San Jose?
E.g.: 123 Main Street St. Louisville OH 43071 [City name is a single word]
E.g.: 45 Holy Grail Al. Niagara Town ZP 32908 [City name 'Niagara Town' is two words]
Forgive the noob question.
Thank you,
I am making two assumptions here:
1) That the number code before the town name is always numeric
2) That there is no town name with a number name
index = list(filter(lambda x: x[1].isnumeric(),enumerate(x.split())))[-1][0]
" ".join(x.split()[index+1:])
So what is happening: We try to identify the last part of the split that is purely numeric, and then get the index of that element. Then we join all elements after that numeric element.
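The same idea written out as a function, as a rough sketch. The name text_after_last_number and the first sample address are made up for illustration. Note that this relies on the numeric code coming before the town name; for strings where the zip code is the last token, nothing follows it and the result is empty.

```python
def text_after_last_number(address):
    """Return the words that follow the last purely numeric token."""
    tokens = address.split()
    numeric_positions = [i for i, tok in enumerate(tokens) if tok.isnumeric()]
    if not numeric_positions:
        return ''
    return ' '.join(tokens[numeric_positions[-1] + 1:])

# Hypothetical address with the numeric code before a two-word town name
print(text_after_last_number('Main Street 12 90210 San Jose'))
# Zip code last, as in the question: nothing is left after it
print(repr(text_after_last_number('123 Main Street St. Louisville OH 43071')))
```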

Geocode the address written in native language using English letters

Friends,
I am analyzing some texts. My requirement is to geocode addresses that are written in English letters but in a different native language.
Ex: chandpur market ke paas, village gorthaniya, UP, INDIA
In the above sentence, "ke paas" is a Hindi phrase (Hindi being a language of India) that means "near" in English, and "chandpur market" is a noun (it can be ignored for conversion).
Now my challenge is to convert thousands of such words to English, identify the street name, and geocode it. Unfortunately, I do not have the postal code or the exact address.
Can anyone please help here?
Thanks in Advance !!
Google's geocode api supports Hindi, so you don't necessarily have to translate it to English. Here's an example using my googleway package (in R) specifying the language = "hi" argument.
You'll need an API key to use the Google API through googleway
library(googleway)
set_key("your_api_key")
res <- google_geocode(address = "village gorthaniya, UP, INDIA",
language = "hi")
geocode_address(res)
# [1] "गोर्थानिया, उत्तर प्रदेश 272181, भारत"
geocode_coordinates(res)
# lat lng
# 1 26.85848 82.50099
geocode_address_components(res)
# long_name short_name types
# 1 गोर्थानिया गोर्थानिया locality, political
# 2 बस्ती बस्ती administrative_area_level_2, political
# 3 उत्तर प्रदेश उ॰ प्र॰ administrative_area_level_1, political
# 4 भारत IN country, political
# 5 272181 272181 postal_code

How to extract text before a specific keyword in python?

import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print(b)
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so if there is a more efficient and faster way than regular expressions, please point it out.
Also, I want the last year, 2004, as output.
I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
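Applied to the string from the question, the same no-regex idea could look like the sketch below. It uses str.partition instead of find plus slicing, and it assumes the year is always the last whitespace-separated token, which holds for the sample but may not hold for all of your data.

```python
col4 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"

# partition splits once at the keyword: (text before, keyword, text after)
title, keyword, rest = col4.partition(" CiteSeerX")
# If the keyword was found, take the last token of the remainder as the year
year = rest.split()[-1] if keyword and rest.split() else None

print(title)
print(year)
```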
If the structure of all your data is similar to the sample you provided, this should get you going:
import re
data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match; unpack the two capturing groups of the first match
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts that we are interested in.
Update:
For the case where metadata follows the year of publication, use the following regular expression:
import re
YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match; return the (title, year) tuple of the first match
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')

Python, Regular Expression Postcode search

I am trying to use regular expressions to find a UK postcode within a string.
I have got the regular expression working inside RegexBuddy, see below:
\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b
I have a bunch of addresses and want to grab the postcode from them, example below:
123 Some Road Name Town, City County PA23 6NH
How would I go about this in Python? I am aware of the re module for Python but I am struggling to get it working.
Cheers
Eef
Repeating your address three times with the postcodes PA23 6NH, PA2 6NH and PA2Q 6NH as a test for your pattern, and running the regex from Wikipedia alongside yours, the code is:
import re
s="123 Some Road Name\nTown, City\nCounty\nPA23 6NH\n123 Some Road Name\nTown, City"\
"County\nPA2 6NH\n123 Some Road Name\nTown, City\nCounty\nPA2Q 6NH"
# custom
print(re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', s))
# regex from http://en.wikipedia.org/wiki/UK_postcodes#Validation
print(re.findall(r'[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}', s))
the result is
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
Both regexes give the same result.
Try
import re
re.findall("[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}", x)
You don't need the \b.
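Whether \b matters depends on the input. For a postcode that stands alone, both patterns agree, but if letters are glued directly to either end, only the version without \b still reports a match. A small sketch (the second test string is made up):

```python
import re

WITH_B = r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b'
WITHOUT_B = r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}'

clean = 'County PA23 6NH'
embedded = 'refXPA23 6NHZ'  # made-up string with letters attached on both sides

print(re.findall(WITH_B, clean), re.findall(WITHOUT_B, clean))
print(re.findall(WITH_B, embedded), re.findall(WITHOUT_B, embedded))
```

So dropping \b is fine for well-separated addresses, while keeping it avoids picking a postcode-shaped fragment out of a longer token.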
#!/usr/bin/env python
import re
ADDRESS="""123 Some Road Name
Town, City
County
PA23 6NH"""
reobj = re.compile(r'(\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b)')
matchobj = reobj.search(ADDRESS)
if matchobj:
    print(matchobj.group(1))
Example output:
[user@host]$ python uk_postcode.py
PA23 6NH
