Python, Regular Expression Postcode search - python

I am trying to use regular expressions to find a UK postcode within a string.
I have got the regular expression working inside RegexBuddy, see below:
\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b
I have a bunch of addresses and want to grab the postcode from them, example below:
123 Some Road Name Town, City County PA23 6NH
How would I go about this in Python? I am aware of the re module for Python but I am struggling to get it working.
Cheers
Eef

Repeating your address three times with the postcodes PA23 6NH, PA2 6NH and PA2Q 6NH as a test for your pattern, and comparing your regex with the one from Wikipedia, the code is:
import re
s = ("123 Some Road Name\nTown, City\nCounty\nPA23 6NH\n123 Some Road Name\nTown, City\n"
     "County\nPA2 6NH\n123 Some Road Name\nTown, City\nCounty\nPA2Q 6NH")
# custom pattern from the question
print(re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', s))
# regex from http://en.wikipedia.org/wiki/UK_postcodes#Validation
print(re.findall(r'[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}', s))
The result is:
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
Both regexes give the same result on this input.

Try
import re
re.findall("[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}", x)
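You don't strictly need the \b for input like this, although keeping it stops the pattern from matching inside a longer run of letters and digits. Here x is the string you want to search.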
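Whether the \b anchors matter depends on the input. A minimal sketch, using a made-up string (not from the question) in which a postcode-shaped run sits inside a longer token, shows the difference between the two patterns:

```python
import re

# Hypothetical input: the first postcode-shaped run is embedded in a longer token.
text = "ref XPA23 6NHX, delivered to PA23 6NH"

with_b = r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b'
without_b = r'[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}'

strict = re.findall(with_b, text)     # only the standalone postcode
loose = re.findall(without_b, text)   # also picks up the embedded run
print(strict)
print(loose)
```

For clean address data the two behave the same, so which one to use is mostly a question of how much you trust the input.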

#!/usr/bin/env python
import re
ADDRESS="""123 Some Road Name
Town, City
County
PA23 6NH"""
reobj = re.compile(r'(\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b)')
matchobj = reobj.search(ADDRESS)
if matchobj:
    print(matchobj.group(1))
Example output:
[user@host]$ python uk_postcode.py
PA23 6NH

Related

Question on regex not performing as expected

I am trying to change the suffixes of companies such that they are all in a common pattern such as Limited, Limiteed all to LTD.
Here is my code:
re.sub(r"\s+?(CORPORATION|CORPORATE|CORPORATIO|CORPORATTION|CORPORATIF|CORPORATI|CORPORA|CORPORATN)", r" CORP", 'ABC CORPORATN')
I'm trying 'ABC CORPORATN' and it's not converting it to CORP. I can't see what the issue is. Any help would be great.
Edit: I have tried the other endings that I included in the regex and they all work except for CORPORATN (the one I mentioned above).
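All the patterns begin with "CORPORA", so we can simply match that prefix and any word characters that follow it (\w* rather than \w+, so that a bare CORPORA, which is in your list, is covered too):
import re
print(re.sub(r"CORPORA\w*", "CORP", 'ABC CORPORATN'))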
Output:
ABC CORP
The same goes for the possible spellings of Limited: if they all begin with "Limit", you can use:
import re
print(re.sub(r"Limit\w+", "LTD", 'Shoe Shop Limited.'))
Output:
Shoe Shop LTD.
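As for why the original call fails: a regex alternation tries its branches from left to right and stops at the first one that matches, so CORPORA wins before CORPORATN is ever tried. The sketch below reproduces that and shows one possible fix. The trailing \b is my own assumption, not something taken from the answer above:

```python
import re

original = r"\s+?(CORPORATION|CORPORATE|CORPORATIO|CORPORATTION|CORPORATIF|CORPORATI|CORPORA|CORPORATN)"

# CORPORA matches first, so only ' CORPORA' is replaced and 'TN' is left behind.
broken = re.sub(original, r" CORP", 'ABC CORPORATN')
print(broken)

# Assumed fix: require a word boundary after the suffix, which forces the
# engine to backtrack past CORPORA and try the longer CORPORATN branch.
anchored = original + r"\b"
fixed = re.sub(anchored, r" CORP", 'ABC CORPORATN')
print(fixed)
```

Ordering the alternatives from longest to shortest would be another way to get the same effect.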

python regex for people names

Hello, I have tried to extract all the names from the following string:
import re
def Find(string):
    url = re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
    return url
string = 'Arnold Schwarzenegger was born in Austria. He and Sylvester Stalone used to run a restaurant with J. Edgar Hoover.'
print(Find(string))
but I have a problem with the output: it doesn't print the "J." in front of Edgar.
['Arnold Schwarzenegger', 'Sylvester Stalone', 'Edgar Hoover']
Another question for you :)
I have tried to extract the addresses from a second string, but I run into a problem.
I need a regex that returns each address without the www, http or https prefix, as in this example:
import re
def Find(string):
    url = re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', string)
    return url
string = 'To learn about pros/cons of data science, go to http://datascience.net. Alternatively, go to datascience.net/2020/'
print(Find(string))
output is:
['http://datascience.net.']
thanks
Question 1
Here's a regex that works for that specific case of three names:
((?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+)
yields
Arnold Schwarzenegger
Sylvester Stalone
J. Edgar Hoover
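A short sketch of how that pattern could be used from Python. The variable names are mine. Because the pattern has a single capturing group around the whole expression, re.findall returns the matched names as plain strings:

```python
import re

# Optional initial ("J. ") followed by two capitalised words.
NAME = re.compile(r'((?:[A-Z]\.\s)?[A-Z][a-z]+\s[A-Z][a-z]+)')

string = ('Arnold Schwarzenegger was born in Austria. He and Sylvester Stalone '
          'used to run a restaurant with J. Edgar Hoover.')
names = NAME.findall(string)
print(names)
```

As the answer says, this is tuned to these three names; hyphenated or single-word names would need a broader pattern.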
Question 2
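(?:http)?s?(?:\:\/\/)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[\./][A-Za-z0-9]+)*\/?)
The overall matches are
http://datascience.net
datascience.net/2020/
and capture group 1 holds each address without the scheme or the www prefix. (The character class is written [A-Za-z] because [A-z] also matches the punctuation characters that sit between Z and a in ASCII.)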
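To get the addresses without the http, https or www prefix in Python, one option is to rely on the capture group: re.findall returns the group rather than the whole match. The pattern below is a tidied variant of the one in this answer (https? as a unit, [A-Za-z] in place of [A-z], an escaped dot after www) and should be read as an assumption rather than a tested URL parser:

```python
import re

# Group 1 captures the address; the optional scheme and www prefix stay outside it.
URL = re.compile(r'(?:https?://)?(?:www\.)?([A-Za-z]+\.[A-Za-z]+(?:[./][A-Za-z0-9]+)*/?)')

string = ('To learn about pros/cons of data science, go to http://datascience.net. '
          'Alternatively, go to datascience.net/2020/')
addresses = URL.findall(string)
print(addresses)
```

The sentence-ending full stop after the first address is not captured, because a dot only counts as part of the address when more letters or digits follow it.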

Simple Regex in Python Three to replace text between '|' and '/' symbols

I want to replace the text between the '|' and '/' in the string ("|伊士曼柯达公司/") with '!!!'.
s = '柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'
print(s)
s = re.sub(r'\|.*?\/.', '/!!!', s)
print('\t', s)
I tested the code first on https://regex101.com/, and it worked perfectly.
I can't quite figure out why it's not doing the replacement in python.
Variants of escaping I've tried also include:
s = re.sub(r'|.*?\/.', '!!!', s)
s = re.sub(r'|.*?/.', '!!!', s)
s = re.sub(r'\|.*?/.', '!!!', s)
Each time the string comes out unchanged.
You can change your regex to this one, which uses lookarounds to ensure what you want to replace is preceded by | and followed by /
(?<=\|).*?(?=/)
Check this Python code,
import re
s = '柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'
print(s)
s = re.sub(r'(?<=\|).*?(?=/)', '!!!', s)
print(s)
Prints like you expect,
柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/
柯達⑀柯达⑀ /Kodak (brand, US film company)/full name Eastman Kodak Company 伊士曼柯達公司|!!!/
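A probable reason the pattern from the question never matches: the trailing `.` in `\|.*?\/.` demands one more character after the slash, and here the slash is the last character of the string. The sketch below checks that on a shortened copy of the string and shows a variant without lookarounds that puts the delimiters back by hand. Both the shortened string and the variant are my own illustration:

```python
import re

# Shortened version of the string from the question; it still ends in '/'.
s = 'Eastman Kodak Company 伊士曼柯達公司|伊士曼柯达公司/'

# The final '.' needs a character after '/', but '/' is the last one: no match.
unchanged = re.sub(r'\|.*?/.', '/!!!', s)
print(unchanged == s)

# Without the trailing '.', the span from '|' to '/' matches; re-insert the delimiters.
replaced = re.sub(r'\|.*?/', '|!!!/', s)
print(replaced)
```

This would also explain why the pattern can work in an online tester if the test text there has a newline or any other character after the final slash.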

How to extract university/school/college name from string in python using regular expression?

SAMPLE CODE
import re
line = "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa#dasdsa.com.lol"
match = re.search(r'/([A-Z][^\s,.]+[.]?\s[(]?)*(Hospital|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/', line)
print(match.group(0))
I'm trying to extract University/School/Organization names from given string using regular expression in python but it gives an error message.
ERROR MESSAGE
Traceback (most recent call last):
  File "C:/Python/addOrganization.py", line 4, in <module>
    print(match.group(0))
AttributeError: 'NoneType' object has no attribute 'group'
Instead of re.search, try re.sub to get your expected output:
import re
i = "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa#dasdsa.com.lol"
line = re.sub(r"[\w\W]* ((Hospital|University|Centre|Law School|School|Academy|Department)[\w -]*)[\w\W]*$", r"\1", i)
print(line)
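One likely cause of the AttributeError in the question is the pair of slashes wrapped around the pattern. In Python they are not delimiters but literal characters, and the test string contains no slash, so re.search returns None. A minimal sketch with deliberately simplified patterns (they are my own, not the full pattern from the question):

```python
import re

line = ("should we use regex more often, University of Pennsylvania. "
        "let me know at 321dsasdsa#dasdsa.com.lol")

# JavaScript-style /.../ delimiters are matched literally in Python.
with_slashes = re.search(r'/(Hospital|University)[^,.\d]*/', line)
print(with_slashes)

# Same idea without the slashes; check for None before calling group().
without_slashes = re.search(r'(Hospital|University)[^,.\d]*', line)
if without_slashes:
    print(without_slashes.group(0))
```

Guarding the result with an if, as in the last two lines, avoids the traceback whenever a line has no institution in it.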
The test string you've given looks made up: in it the university name is immediately followed by a full stop, while in the examples from your pastebin sample it is followed by a comma.
line = "should we use regex more often, University of Pennsylvania. let me know at 321dsasdsa#dasdsa.com.lol"
I managed to extract the names from the examples in your pastebin with a simple regex; the details are on regex101.com.
Logic
Since the institute name is separated by a comma (except the first case where it starts with the university name), you can see that the match string will either lie in group1 or group2.
Then you can check group1 and group2 to see whether either contains anything from the predefined match list, and return that value.
Code
I have used two examples to show it works.
import re
line1 = 'The George Washington University, Washington, DC, USA.'
line2 = 'Department of Pathology, University of Oklahoma Health Sciences Center, Oklahoma City, USA. adekunle-adesina#ouhsc.edu'
matchlist = ['Hospital','University','Institute','School','Academy'] # define all the keywords that you need to look up
p = re.compile(r'^(.*?),\s+(.*?),(.*?)\.') # regex pattern to match
# A list comprehension with 'any' checks whether any item in matchlist occurs in group1; if not, group2 is taken instead
line1match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in re.finditer(p,line1)]
line2match = [m.group(1) if any(x in m.group(1) for x in matchlist) else m.group(2) for m in re.finditer(p,line2)]
print (line1match)
[Out]: ['The George Washington University']
print (line2match)
[Out]: ['University of Oklahoma Health Sciences Center']
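The same logic could be wrapped in a small helper. This is only a sketch: the function name and the decision to return None when neither group contains a keyword are my assumptions (the list comprehensions above fall back to group2 unconditionally):

```python
import re

KEYWORDS = ['Hospital', 'University', 'Institute', 'School', 'Academy']
PATTERN = re.compile(r'^(.*?),\s+(.*?),(.*?)\.')

def institution(line):
    """Return the first of the two leading comma-separated fields that names an institution."""
    m = PATTERN.search(line)
    if m is None:
        return None
    for candidate in (m.group(1), m.group(2)):
        if any(keyword in candidate for keyword in KEYWORDS):
            return candidate.strip()
    return None

line1 = 'The George Washington University, Washington, DC, USA.'
line2 = ('Department of Pathology, University of Oklahoma Health Sciences Center, '
         'Oklahoma City, USA. adekunle-adesina#ouhsc.edu')
print(institution(line1))
print(institution(line2))
```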

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
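If the resolution is not always 1080p, the same field-based idea can key on the year instead. The sketch below is a variant of the approach above rather than the answerer's code; the function name is hypothetical, and it assumes the year is a dot-separated field starting with 19 or 20 that is not the first field:

```python
import re

def title_from_filename(name):
    """Join the dot-separated fields that come before the first year-like field."""
    fields = name.split('.')
    for i, field in enumerate(fields):
        # Skip index 0 so a title that starts with a year-like number is not emptied.
        if i > 0 and re.fullmatch(r'(?:19|20)\d{2}', field):
            return ' '.join(fields[:i])
    return ' '.join(fields)  # no year found: just replace the dots

print(title_from_filename('Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY'))
print(title_from_filename('Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'))
```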
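The following regex would work, assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=\.\d{4}.*\d{3,}p)
Explanation: the greedy .* captures the title, and the lookahead (?=\.\d{4}.*\d{3,}p) makes it stop just before a dot, a four-digit year and, somewhere later, a resolution such as 1080p or 720p. The match still contains the dots between the words, so replace them with spaces afterwards.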
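A sketch of how the lookahead pattern could be applied in Python, with the separator dot escaped. The function name and the fallback for names that do not fit the assumed layout are my additions:

```python
import re

# Title = everything before ".<year>" when a resolution such as 1080p follows later.
PATTERN = re.compile(r'.*(?=\.\d{4}.*\d{3,}p)')

def movie_title(name):
    m = PATTERN.match(name)
    if m is None:
        # Assumed fallback: no year/resolution found, so only swap dots for spaces.
        return name.replace('.', ' ')
    return m.group(0).replace('.', ' ')

print(movie_title('Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY'))
print(movie_title('Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'))
```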
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
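result = re.sub(p, subst, test_str)
# The year and everything after it become a single space, so strip each line before printing.
for title in result.splitlines():
    print(title.strip())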
Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts, matched by:
"(?:\.|\d{4}.+$)"
with a blank and strip() the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4}.+$)", ' ', test1).strip()
res2 = re.sub(r"(?:\.|\d{4}.+$)", ' ', test2).strip()
print(res1, res2, sep='\n')
Output:
Star Wars Episode 4 A New Hope
Interstellar
