Regex for scraping City and State out of bloated description - python

You can see the description here: http://www.mdh.org/sites/www/healthapp/jobs/View.aspx?id=10
MDH Human Resources
525 E. Grant St.
Macomb, IL 61455
T: 309-836-1577
F: 309-836-1677
The page has this address and I want to extract City and State using regex. In this case it's Macomb and IL.
At first I used the following regex, but it did not work when the description contained more than one similar pattern.
(\w+),\s+(\w{2})\s+\d+
How can I write a regex that first extracts these address lines and then matches the line with this pattern?

^([A-Z][A-Za-z\s]*),\s+([A-Z]{2})\s+\d{5}$
which I think should be strict enough to keep the noise away. The downside is that it could miss what you actually want; in that case, you may want to iterate through the page with a looser regular expression like yours. Either way, you can't achieve perfection with a regex.
I tested it with JavaScript; adjust the syntax as needed for Python.
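For illustration, here is a minimal Python sketch of applying that stricter pattern with re.MULTILINE so that only whole "City, ST 12345" lines match; the page_text string below is just a stand-in for whatever text you have already pulled out of the page:
import re
# Stand-in for the text extracted from the page (e.g. via Beautiful Soup)
page_text = """MDH Human Resources
525 E. Grant St.
Macomb, IL 61455
T: 309-836-1577"""
# Anchored per line with re.MULTILINE so only full "City, ST 12345" lines match
pattern = re.compile(r'^([A-Z][A-Za-z\s]*),\s+([A-Z]{2})\s+\d{5}$', re.MULTILINE)
for city, state in pattern.findall(page_text):
    print(city.strip(), state)   # -> Macomb IL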

Related

How to correct badly written emails?

I am trying to correct badly written emails contained in a list by searching for differences from the most common domains, e.g. hotmal.com to hotmail.com.
The thing is, there are tons of variations on a single domain. It would be extremely helpful if someone knew of an algorithm in Python that can work as an autocorrect for email domains, or could say whether this is too complex a problem for a few lines of code.
Check Levenshtein distance starting at https://en.wikipedia.org/wiki/Levenshtein_distance
It is commonly used for auto-correct
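For what it's worth, a minimal sketch of that idea: compute the Levenshtein distance from the typed domain to a list of known domains and snap to the closest one when it is close enough (the domain list and threshold below are assumptions, not anything from the question):
def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

KNOWN_DOMAINS = ["hotmail.com", "gmail.com", "yahoo.com", "outlook.com"]  # assumed list

def correct_email(email, max_distance=2):
    local, _, domain = email.partition("@")
    best = min(KNOWN_DOMAINS, key=lambda d: levenshtein(domain.lower(), d))
    return local + "@" + best if levenshtein(domain.lower(), best) <= max_distance else email

print(correct_email("someone@hotmal.com"))   # -> someone@hotmail.com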
What if you search for keywords in the domain? For hotmail.com, for example, you could search for hot, or something similar. Also, as #user10817019 wrote, you can combine it with searching for the first and last letters of the domain.
Write a small script in your preferred language that takes domains starting with h and ending with l and replaces the entire string with hotmail, so it fixes everything in between. Search for mai in case they forgot the l. I had to do this the other day in VB.NET to check my lists twice and correct bad data.

Getting Table Attributes from a Website

I am using Python 3.4, Windows 10, and Visual Studio 2015. I am trying to make a program that scrapes phone numbers from websites formatted like this one.
I am using Beautiful Soup 4 and am trying to get the number of beds from the table. I have tried soup.select('.td'), but it only returns an empty array, and I am not sure what else to try.
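As a side note on the selector: '.td' is a CSS class selector, so soup.select('.td') looks for elements with class="td", while the tag itself is selected with plain 'td'. A minimal sketch, with a made-up table standing in for the real page:
from bs4 import BeautifulSoup
# Stand-in for the fetched page HTML; the real page would be downloaded first
html = '<table><tr><td>Beds</td><td>3</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")
print(soup.select('.td'))                           # [] - '.td' means class="td"
print([c.get_text() for c in soup.select('td')])    # ['Beds', '3'] - 'td' selects the tag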
Why not grab the entire page HTML as a string and then use a regular expression to parse it? Is that not where Python excels?
In case you are afraid of regex, here is a beginner-friendly tutorial:
https://regexone.com/
The syntax for Python might be slightly different:
https://docs.python.org/2/library/re.html
And I seriously hope you are not scraping phone numbers for nefarious purposes. I don't want a phone call from you :-).
Here is another Stack Overflow answer which gives a good starting regex:
https://stackoverflow.com/a/123666/5129424
Here's a regex for a 7- or 10-digit number, with extensions allowed; delimiters are spaces, dashes, or periods:
^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
Just because you "might mess it up" doesn't mean you shouldn't try it and test it. Regardless of what you do, you are either at the mercy of the structure of the page, which may change, or the format of the phone numbers, which may also change. There is no perfect solution.
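A minimal sketch of that approach, for reference; the HTML string below is a stand-in for the fetched page, and the pattern is a looser find-anywhere variant of the validation regex quoted above (which is anchored to match a whole string on its own):
import re
# Stand-in for the page HTML; in practice you would fetch it with urllib.request or requests
html = '<td>Beds: 3</td><td>Phone: (828) 505-1638</td>'
phone_re = re.compile(r'\(?\b[2-9]\d{2}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b')
print(phone_re.findall(html))   # -> ['(828) 505-1638']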

Using Python, how can I validate a UK postcode?

I have a number of UK postcodes. However, they are not well formatted. Sometimes they look like normal postcodes, such as SW5 2RT; sometimes they are missing the space in the middle but are still valid, such as SW52RT. In some cases, though, they are wrong due to human error and look like ILovePython, which is not a valid postcode at all.
So, I am wondering how can I validate the postcodes efficiently (using Python)?
Many thanks.
=================================================================
EDIT:
Thanks for the answers at this page. But it seems they only check whether the characters in the postcode are letters or numbers, and don't care whether the combination makes sense. A false postcode such as AA1 1AA could still pass the validation.
You can use this package for your purpose:
uk-postcode-utils 0.1.0
Here is the link for it:
https://pypi.python.org/pypi/uk-postcode-utils
Also please have a look at:
Python, Regular Expression Postcode search
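For reference, a minimal sketch of that regex-based approach; the pattern only follows the general outward/inward shape of a postcode and, as the edit above points out, it will still accept combinations such as AA1 1AA that are not postally real:
import re
POSTCODE_RE = re.compile(r'^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$', re.IGNORECASE)
def looks_like_postcode(value):
    return bool(POSTCODE_RE.match(value.strip()))
for candidate in ["SW5 2RT", "SW52RT", "ILovePython"]:
    print(candidate, looks_like_postcode(candidate))   # True, True, False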
You are correct to say that regex validation only goes so far. To ensure that the postcode is 'postally' valid, you'll need a reference set to validate it against. There are a large number (low thousands) of changes to UK addresses per day to keep track of, and I do not believe this is something a regex alone can solve.
There are a couple of ways you can do it: either use a third party to help you capture a complete and correct address (many are available, including https://www.edq.com/uk/products/address-validation/real-time-capture (my company)), or get a data supply from Royal Mail and implement your own solution.
Coping with typos and different formats shouldn't be a problem either way you do it. Most third parties will do this easily for you and should be able to cope with some mistakes too (depending on what you have to search on). They'll all have web services you should be able to integrate easily, or integration snippets you can grab.
The UK Office for National Statistics publishes a list of UK postcodes, both current and retired, so you could pull the relevant columns out of the latest .csv download, duplicate the current ones with the space(s) removed and then do a lookup (it might be best to use a proper database such as MySQL with an index for this).
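A minimal sketch of that lookup approach, assuming the ONS download has been saved as postcodes.csv and that the postcode column is named pcd (both the file name and the column name are assumptions about the download, so adjust them to the actual file):
import csv
def load_postcodes(path):
    valid = set()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            valid.add(row["pcd"].replace(" ", "").upper())   # normalise: no spaces, upper case
    return valid
postcodes = load_postcodes("postcodes.csv")
print("SW5 2RT".replace(" ", "").upper() in postcodes)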

Address parser for Python, how do I split an address

I am very new to Python but seem to be getting along. I am writing a web crawler in Python.
I've got the crawler working using the Beautiful Soup library and want to find the best library for parsing or splitting an address into its constituent parts.
Here is a sample of the text to be parsed.
['\r\n\t \t\t \t25 Stockwood Road', <br/>, 'Asheville, NC 28803', <br/>, '\t (828) 505-1638\t \t']
I understand it's a list, and I can figure out how to remove the control characters.
Since I'm so new, I'd like recommendations on which libraries are being used for this - Python version, OS, and prerequisites.
I'd like to figure out the code for myself, but if you're inclined to offer a sample, I wouldn't argue. :)
You can try the Python library usaddress (there's also a web interface for trying it out).
It parses addresses probabilistically, and is much more robust than regex-based parsers when dealing with messy addresses.
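A minimal sketch of what that looks like (pip install usaddress); the input string is the sample's street and city lines joined together, and the field labels such as PlaceName and StateName should be checked against the library's documentation:
import usaddress
tagged, address_type = usaddress.tag("25 Stockwood Road, Asheville, NC 28803")
print(address_type)                                       # e.g. 'Street Address'
print(tagged.get("PlaceName"), tagged.get("StateName"))   # e.g. Asheville NC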
A list comprehension is pretty sleek for something like this. Also look into str.strip(). It won't remove the HTML <br/> elements, but the tabs, newlines, and spaces will be cleaned up:
out = [x.strip() if isinstance(x, str) else x for x in lst]

Python - Detect (spammy) URLS in string

So, I've been doing some research for a while now and I couldn't find anything about detecting a URL in a string. The problem is that most results are about detecting whether a string IS a URL, not whether it contains one. The two results that look best to me are
Regex to find urls in string in Python
and
Detecting a (naughty or nice) URL or link in a text string
but the first requires http://, which is not something spammers would use (:P), and the second one isn't regex - and with my limited knowledge I don't know how to translate either of these. Something I have considered doing is using something dull like
spamlist = [".com", ".co.uk", "etc"]
for word in string.split():
    if any(s in word for s in spamlist):
        Do().stuff()
But that would honestly do more harm than good, and I am 100% sure there is a better way using regex or something else!
So if anyone knows anything that could help, I'd be very grateful! I've only been doing Python for 1-2 months, and not very intensively during this period, but I feel like I'm making great progress and this one thing is all that's in the way, really.
EDIT: Sorry for not specifying earlier: I am looking to use this locally, not on a website (Apache) or anything similar. I'm mostly trying to clean out any links from files I've got lying around.
As I said in the comments, the solution in "Detecting a (naughty or nice) URL or link in a text string" is a regex, and you should probably make it a raw string or escape the backslashes in it when using it in Python.
You really shouldn't reinvent the square wheel here, especially since spam filtering is an arms race.
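A minimal sketch of that advice; the pattern below is a simplified URL/bare-domain matcher written as a raw string, not the exact regex from the linked answer, so treat it as a starting point:
import re
URL_RE = re.compile(r'\b(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?', re.IGNORECASE)
def contains_url(text):
    return bool(URL_RE.search(text))
print(contains_url("buy cheap stuff at spamsite.co.uk now"))   # True
print(contains_url("no links here, honest"))                   # False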
