I have a number of UK postcodes, but they are not well formatted. Sometimes they look like normal postcodes, such as SW5 2RT; sometimes they are missing the space in the middle but are still valid, such as SW52RT. In some cases, though, they are wrong due to human error and look like ILovePython, which is totally invalid as a postcode.
So, I am wondering how can I validate the postcodes efficiently (using Python)?
Many thanks.
=================================================================
EDIT:
Thanks for the answers at this page. But it seems they only check whether the characters in the postcode are letters or digits, without caring whether the combination makes sense. A false postcode such as AA1 1AA would pass that validation.
You can use this package for your purpose:
uk-postcode-utils 0.1.0
Here is the link for it:
https://pypi.python.org/pypi/uk-postcode-utils
Also please have a look at:
Python, Regular Expression Postcode search
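If you only need a format check and want to avoid extra dependencies, here is a minimal sketch using just the standard library; the pattern is a common approximation of the general UK postcode shape, not the full official specification, and (as the edit above notes) it will still accept well-formed but unallocated codes such as AA1 1AA:

import re

# Rough approximation of the UK postcode format: outward code
# (area + district), an optional space, then the inward code
# (sector + unit). This is an assumption, not the official spec.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}$")

def looks_like_postcode(raw):
    # Normalise case and surrounding whitespace before matching.
    candidate = raw.strip().upper()
    return bool(POSTCODE_RE.match(candidate))

print(looks_like_postcode("SW5 2RT"))      # True
print(looks_like_postcode("SW52RT"))       # True (space is optional)
print(looks_like_postcode("ILovePython"))  # False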
You are correct to say that regex validation only goes so far. To ensure that the postcode is 'postally' valid, you'll need a reference set to validate it against. There are a large number of changes (low thousands) to UK addresses per day to keep track of, and I do not believe this is something a regex alone can solve.
There are a couple of ways you can do it, either use a 3rd party to help you capture a complete & correct address (many available including https://www.edq.com/uk/products/address-validation/real-time-capture (my company)), or get a data supply from Royal Mail and implement your own solution.
Coping with typos and different formats shouldn't be a problem either way you do it. Most 3rd parties will handle this easily for you and should be able to cope with some mistakes too (depending on what you have to search on). They'll all have web services you should be able to implement easily, or integration snippets you can grab.
The UK Office for National Statistics publishes a list of UK postcodes, both current and retired, so you could pull the relevant columns out of the latest .csv download, duplicate the current ones with the space(s) removed and then do a lookup (it might be best to use a proper database such as MySQL with an index for this).
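A minimal sketch of that lookup approach (the file and column names here are assumptions; check the header row of whichever ONS download you use):

import csv

def load_postcodes(path, column="pcd"):
    # Store each postcode both with and without its internal space,
    # so either input format can be matched directly.
    postcodes = set()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            code = row[column].strip().upper()
            postcodes.add(code)
            postcodes.add(code.replace(" ", ""))
    return postcodes

valid = load_postcodes("onspd.csv")  # hypothetical file name
print("SW52RT" in valid)

For a few million postcodes an in-memory set is still workable; move to an indexed database table if you need persistence or the data won't fit comfortably in RAM.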
Related
I am trying to correct badly written email addresses contained in a list by searching for differences from the most common domains, e.g. hotmal.com to hotmail.com.
The thing is, there are tons of variations on a single domain. It would be extremely helpful if someone knew of an algorithm in Python that could work as an autocorrect for email domains, or could tell me whether this is too complex a problem for a few lines of code.
Check out Levenshtein distance, starting at https://en.wikipedia.org/wiki/Levenshtein_distance. It is commonly used for auto-correct.
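A minimal sketch using the standard library's difflib, which ranks candidates by a similarity ratio rather than raw Levenshtein distance but serves the same auto-correct purpose; the domain list and cutoff below are assumptions to tune against your data:

import difflib

# Assumed list of "known good" domains -- extend with your own.
COMMON_DOMAINS = ["hotmail.com", "gmail.com", "yahoo.com", "outlook.com"]

def correct_domain(address, cutoff=0.75):
    # Swap the domain for its closest common match if the match
    # clears the cutoff; otherwise leave the address untouched.
    local, _, domain = address.partition("@")
    matches = difflib.get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=cutoff)
    return local + "@" + matches[0] if matches else address

print(correct_domain("someone@hotmal.com"))  # someone@hotmail.com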
What if you searched for keywords in the domain? For hotmail.com, you could search for hot, or something similar. Also, as user10817019 wrote, you can combine it with searching for the first and last letters of the domain.
Write a small script in your preferred language that takes domains that start with h and end with l, and replaces the entire string with hotmail, fixing everything in between. Search for mai in case they forgot the l. I had to do this the other day in VB.NET to check my lists twice and correct bad data.
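A rough Python sketch of that heuristic; it is deliberately crude and will misfire on other domains that share the same first and last letters, so treat it as a fallback rather than a general fix:

def crude_fix(domain):
    # Hypothetical heuristic: anything that starts with "h" and ends
    # with "l" (before the TLD), or contains "mai", becomes hotmail.
    name, _, tld = domain.partition(".")
    if (name.startswith("h") and name.endswith("l")) or "mai" in name:
        return "hotmail." + (tld or "com")
    return domain

print(crude_fix("hotmal.com"))  # hotmail.com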
I'm creating a simple chatbot, and I want to extract information from the user's response. An example scenario:
Bot : Hi, what is your name?
User: My name is Edwin.
I wish to extract the name Edwin from the sentence. However, the user can respond in different ways, such as:
User: Edwin is my name.
User: I am Edwin.
User: Edwin.
I tried relying on the dependency relations between words, but the results were not good.
Any idea on what technique I could use to tackle this problem?
First off, I think complete name detection is really heavy to set up. If you want your bot to detect a name in 99% of cases, you've got some work ahead of you. And I suppose name detection is only the very beginning of your plans...
This said, here are the first ideas that came to my mind:
Names are, grammatically speaking, nouns. So if one can perform a grammatical analysis of the sentence, some candidates for the name can be found.
Names are supposed to begin with a capital letter, although in a chat this is likely not to be respected, so it might be of little use... Still, if you come across a word beginning with a capital, it is likely to be someone's name (though it could also be a place name...).
The patterns you could reasonably expect when someone introduces themselves are not that numerous, so you could "hard-code" them, with of course a little tolerance for typos (see the sketch after these ideas).
If you are expecting an actual name, you could use a database holding a huge number of names, but have fun with Hawaiian or Chinese names. Still, this appears to be a viable solution for European names.
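A minimal sketch of the hard-coded pattern idea; the patterns below are illustrative assumptions, not a complete list:

import re

NAME_PATTERNS = [
    re.compile(r"my name is (\w+)", re.IGNORECASE),
    re.compile(r"(\w+) is my name", re.IGNORECASE),
    re.compile(r"i am (\w+)", re.IGNORECASE),
    re.compile(r"i'm (\w+)", re.IGNORECASE),
    re.compile(r"call me (\w+)", re.IGNORECASE),
]

def extract_name(reply):
    # Try each introduction pattern in turn; fall back to treating
    # a one-word reply ("Edwin.") as the name itself.
    for pattern in NAME_PATTERNS:
        match = pattern.search(reply)
        if match:
            return match.group(1).capitalize()
    words = reply.strip(" .!").split()
    return words[0].capitalize() if len(words) == 1 else None

for reply in ["My name is Edwin.", "Edwin is my name.", "I am Edwin.", "Edwin."]:
    print(extract_name(reply))  # Edwin, four times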
However, I am no AI specialist, and I'm looking forward to seeing other proposals.
I'd suggest using NER (named entity recognition):
You can play with it yourself: http://nlp.cogcomp.org/
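For example, a minimal sketch with spaCy, one common Python NER library (this assumes you have installed spaCy and downloaded the model with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

for sentence in ["My name is Edwin.", "Edwin is my name.", "I am Edwin."]:
    doc = nlp(sentence)
    # Keep only entities tagged as people.
    names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    print(sentence, "->", names)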
There are many alternatives, split across just two 'models':
Based on NLP training; uses HTTP for integration/delivery:
Microsoft LUIS
API.AI
IBM Watson
Based on pattern matching; uses an interpreter (needs a native implementation or a bridge from another implementation):
Rivescript - Python interpreter available
ChatScript - needs a C++ bridge/interop
AIML - Python interpreter available
This is not an exhaustive list of the current options.
Detecting names can be complicated if you consider things like "My name is not important", "My name is very long", etc.
Here is a public-domain script in Self that attempts to parse a name. You may be able to adapt it to Python; it also does some crazy stuff, like looking words up on Wiktionary to see if they are classified as names:
https://www.botlibre.com/script?id=525804
You can see the description here: http://www.mdh.org/sites/www/healthapp/jobs/View.aspx?id=10
MDH Human Resources
525 E. Grant St.
Macomb, IL 61455
T: 309-836-1577
F: 309-836-1677
The page has this address and I want to extract City and State using regex. In this case it's Macomb and IL.
For a while I used the following regex, but it did not work where the description contains more than one similar pattern.
(\w+),\s+(\w{2})\s+\d+
How can I write a regex that first extracts these address lines, and then picks out the line with this pattern?
^([A-Z][A-Za-z\s]*),\s+([A-Z]{2})\s+\d{5}$
which I think would be good enough to keep the noise away. The downside is that it could miss some of what you want; in that case, you may want to iterate through the page using a less strict regular expression like yours. Either way, you can't achieve perfection using regex.
This works in JavaScript; adjust the syntax to the needs of Python.
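In Python, one hedged adaptation is to test the pattern line by line (using fullmatch per line rather than re.MULTILINE, since the \s inside the character class would otherwise happily swallow newlines across lines); the text variable is a stand-in for whatever your scraper returns:

import re

text = """MDH Human Resources
525 E. Grant St.
Macomb, IL 61455
T: 309-836-1577"""

# Same city/state/ZIP pattern as above, applied to each line.
ADDRESS_RE = re.compile(r"([A-Z][A-Za-z\s]*),\s+([A-Z]{2})\s+\d{5}")

for line in text.splitlines():
    m = ADDRESS_RE.fullmatch(line.strip())
    if m:
        city, state = m.group(1), m.group(2)
        print(city, state)  # Macomb IL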
I would like to scrape several different discussion forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of learning algorithm that could identify the different messages (i.e. structures) on each page and parse them individually, while ignoring all the extraneous crap (i.e. ads and other nonsense). Could someone please point me to some references or sample code for work that has already been carried out in this area?
Moreover, does anyone know of pseudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
Build a solution that:
takes some sample webpages with the same structure (eg forum threads)
analyzes the DOM tree of each to find which parts are the same / different
where they are different is the dynamic content you are after (posts, user names, etc)
This technique is known as wrapper induction.
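As a toy illustration of the idea, this compares the text fragments of two pages that share a template: whatever differs is a first approximation of the dynamic content. Real wrapper induction aligns the DOM trees properly; the file names here are placeholders:

from bs4 import BeautifulSoup

def texts(html):
    # All visible text fragments on a page, in document order.
    return list(BeautifulSoup(html, "html.parser").stripped_strings)

def dynamic_parts(html_a, html_b):
    # Fragments appearing in only one of two same-template pages are
    # likely dynamic content (posts, user names); shared fragments
    # are template boilerplate (navigation, headers, ads).
    shared = set(texts(html_a)) & set(texts(html_b))
    return [t for t in texts(html_a) if t not in shared]

with open("thread1.html") as a, open("thread2.html") as b:  # placeholders
    for fragment in dynamic_parts(a.read(), b.read()):
        print(fragment)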
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
Maybe not exactly what you're after, but there's an O'Reilly book called Programming Collective Intelligence that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in Python :)
Overall Plan
Get my class information to automatically optimize and select my uni class timetable
Overall Algorithm
1. Logon to the website using its Enterprise Sign On Engine login
2. Find my current semester and its related subjects (pre setup)
3. Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)
4. Strip the data of useless information
5. Rank the classes which are closer to each other higher, the ones on random days lower
6. Solve a best time table solution
7. Output me a detailed list of the BEST CASE information
8. Output me a detailed list of the possible class information (some might be full for example)
9. Get the program to select the best classes automatically
10. Keep checking to see if we can achieve 7.
Step 6 in detail
Get all the classes; the lectures, used as a focal point, would be ranked highest (only one per subject), and the other classes would be arranged around them.
Questions
Can anyone supply me with links to something similar to this, hopefully written in Python?
In regard to step 6: what data structure would you recommend for storing this information? A linked list where each node is a UniClass object?
Should I write all the information to a text file?
I am thinking UniClass would be set up with the following attributes (a quick sketch follows the list):
Subject
Rank
Time
Type
Teacher
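For example, as a Python dataclass (the field types are just my guesses; a plain list of these is simpler to work with than a linked list):

from dataclasses import dataclass

@dataclass
class UniClass:
    subject: str  # e.g. "Maths 101"
    rank: int     # score assigned by step 5
    time: str     # e.g. "Mon 09:00-10:00"; a richer type may help later
    type: str     # "lecture", "practical" or "workshop"
    teacher: str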
I am not very experienced in Python and thought this would be a good learning project to attempt.
Thanks for any help and links provided to get me started; open to edits to tag appropriately or whatever is necessary (not sure what this falls under other than programming and Python?).
EDIT: can't really get the proper formatting I want for this SO post ><
Depending on how far you plan on taking #6, and how big the dataset is, it may be non-trivial; it certainly smacks of NP-hard global optimisation to me...
Still, if you're talking about tens (rather than hundreds) of nodes, a fairly dumb algorithm should give good enough performance.
So, you have two constraints:
A total ordering on the classes by score; this is flexible.
Class clashes; this is not flexible.
What I mean by flexible is that you can go to more spaced out classes (with lower scores), but you cannot be in two classes at once. Interestingly, there's likely to be a positive correlation between score and clashes; higher scoring classes are more likely to clash.
My first pass at an algorithm:
# Greedy pass: consider the highest-scoring classes first and keep
# each one that doesn't clash with the classes already chosen.
# (clashes_with is assumed here; one way to implement it is sketched below.)
selected_classes = []
classes = sorted(classes, key=lambda c: c.score, reverse=True)  # best score first
for clas in classes:
    if not clas.clashes_with(selected_classes):
        selected_classes.append(clas)
Working out clashes might be awkward if classes are of uneven lengths, start at strange times and so on. Mapping start and end times into a simplified representation of "blocks" of time (every 15 minutes / 30 minutes or whatever you need) would make it easier to look for overlaps between the start and end of different classes.
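A small sketch of that block idea, assuming each class is reduced to a (day, start, end) triple with times in minutes; two classes clash when they occupy any common 15-minute block on the same day:

BLOCK = 15  # minutes per block

def blocks(day, start_min, end_min):
    # The set of (day, block-index) slots a class occupies.
    first = start_min // BLOCK
    last = (end_min + BLOCK - 1) // BLOCK  # round the end up
    return {(day, b) for b in range(first, last)}

def clashes(a, b):
    # Two classes clash when their block sets intersect.
    return bool(blocks(*a) & blocks(*b))

print(clashes(("Mon", 540, 600), ("Mon", 590, 650)))  # True: 9-10 vs 9:50-10:50
print(clashes(("Mon", 540, 600), ("Tue", 540, 600)))  # False: different days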
BeautifulSoup has been mentioned here a few times, e.g. get-list-of-xml-attribute-values-in-python.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.
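For instance, the queries described above translate almost directly into Beautiful Soup calls (a sketch against Beautiful Soup 4; the html string is a placeholder):

import re
from bs4 import BeautifulSoup

html = "<a class='externalLink' href='http://foo.com/x'>Foo</a>"  # placeholder
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                              # all the links
external = soup.find_all("a", class_="externalLink")        # links of class externalLink
foo_links = soup.find_all("a", href=re.compile("foo.com"))  # urls matching foo.com
print(len(all_links), len(external), len(foo_links))  # 1 1 1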
There are waaay too many questions here.
Please break this down into subject areas and ask specific questions on each one, and please define your terms: "best" doesn't mean anything without some specific measurement to optimize.
Here's what I think I see in your list of topics.
Scraping HTML
1. Logon to the website using its Enterprise Sign On Engine login
2. Find my current semester and its related subjects (pre setup)
3. Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)
4. Strip the data of useless information
Some algorithm to "rank" based on "closer to each other" looking for a "best time". Since these terms are undefined, it's nearly impossible to provide any help on this.
5. Rank the classes which are closer to each other higher, the ones on random days lower
6. Solve a best time table solution
Output something.
7. Output me a detailed list of the BEST CASE information
8. Output me a detailed list of the possible class information (some might be full for example)
Optimize something, looking for "best". Another undefinable term.
9. Get the program to select the best classes automatically
10. Keep checking to see if we can achieve 7.
BTW, Python has "lists". Whether or not they're "linked" doesn't really enter into it.