Parsing unstructured text using pyparsing in Python

I have hundreds of company report .txt files, and I want to extract some information from them. For example, one part of a file looks like this:
Mr. Davido will receive a base salary of $700,000 during the initial and any subsequent
term. The Chief Executive Officer of the Company (the CEO) and the Board (or a committee
thereof) shall review Mr. Davido's base salary at least annually, and may increase it at
any time in their sole discretion
I am trying to use pyparsing to extract the base salary value of the guy.
code
from pyparsing import *
# define grammar
digits = "0123456789"
integer = Word( digits )
money = Group("$"+integer+','+integer + Optional(','+integer , ' '))
start = Word("base salary")
salary = start + money
#search
for t in text:
    result = salary.parseString( text )
    print result
This always gives the error:
pyparsing.ParseException: Expected W:(base...) (at char 0), (line:1, col:1)
After some simple tests, I find that with this code I can only match text that starts with:
"base salary $700,000......"
and it only identifies the first occurrence in that text.
So I was wondering if someone could help me with it. And, if possible also identify the name of the guy, and store the name and salary into a dataframe.
Thank you so much.

I'll answer your specific question first. parseString is used when you have defined a comprehensive grammar that will match everything from the beginning of the text. Since you are trying to pick out a specific phrase from somewhere in the middle of the input line, use searchString or scanString instead.
As pyparsing's author, I will concur with @Tritium21 - unless there are some specific forms and phrases that you can look for, you will tear your hair out trying to parse this kind of natural language input.
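A minimal sketch of the searchString approach, assuming the salary always appears as "base salary", an optional "of", and a dollar amount. The grammar below is an illustration rather than the asker's original: it uses Keyword for the phrase because Word("base salary") matches any run of those characters, not the phrase itself.

```python
import pyparsing as pp

text = """Mr. Davido will receive a base salary of $700,000 during the initial and any subsequent
term. The Chief Executive Officer of the Company (the CEO) and the Board (or a committee
thereof) shall review Mr. Davido's base salary at least annually, and may increase it at
any time in their sole discretion"""

# A dollar amount such as $700,000 or $1,250,000, joined into a single token
money = pp.Combine("$" + pp.Word(pp.nums) + pp.ZeroOrMore("," + pp.Word(pp.nums)))

# The phrase "base salary", an optional "of", then the amount
salary = (pp.Keyword("base") + pp.Keyword("salary")
          + pp.Optional(pp.Keyword("of")) + money("amount"))

# searchString scans the whole text and returns every place the grammar matches
amounts = [match["amount"] for match in salary.searchString(text)]
print(amounts)
```

The second mention of "base salary" in the sample is followed by "at least annually" rather than an amount, so it produces no match.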

Related

Find VAT numbers of length 11 after a word in a string

I have the following text a="VAT number 12345678901 mobile number 34567890234" I want to find only the number corresponding to a VAT number made up of 11 numbers (ie 12345678901) and I don't want to find 34567890234.
the regex I use is:
rgx = "(?<!\d)\d{11}(?!\d)"
but re.findall(rgx, a) gives me both 34567890234 and 12345678901.
Any idea?
In the precise string a="VAT number 12345678901 mobile number 34567890234", this would look for 11 digits followed by a space and the word mobile but only return the digits. rgx = "\d{11}(?=\smobile)"
There are a lot of browser driven regular expression creators out there and they are great resource for learning.
Your original expression uses negative lookaround expressions, (?<!\d) and (?!\d). Lookbehind in particular is not supported by every regex engine, so I tend to avoid them. Additionally, in terms of language structure, detecting the presence of something is generally more precise than detecting the absence of something. If someone asks you what you want to drink and you reply "not poison" when you want a soda, you are less likely to get a soda.
So positive lookaround expressions, (?=abc) and (?<=abc), will be more robust.
Try this
(?:VAT\s*number\s*)(\d{11})\s+
The non-capturing group (?:VAT\s*number\s*) ensures that the number is only searched for after the words "VAT number".
The capturing group
(\d{11})\s+
captures the VAT number only if it consists of exactly 11 digits followed by whitespace.
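As a rough sketch of the same idea in Python, the pattern below anchors on the words "VAT number" and captures the 11 digits. One deliberate change from the expression above: the trailing \s+ is swapped for a negative lookahead so that a VAT number at the very end of the string still matches.

```python
import re

a = "VAT number 12345678901 mobile number 34567890234"

# Anchor on the label, capture exactly 11 digits, and refuse a 12th digit
vat_pattern = re.compile(r"VAT\s*number\s*(\d{11})(?!\d)")

vat_numbers = vat_pattern.findall(a)
print(vat_numbers)  # ['12345678901']
```

Because the pattern has a single capturing group, findall returns only the digits and not the "VAT number" label.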

Regex fuzzy word match

Tough regex question: I want to use regexes to extract information from news sentences about crackdowns. Here are some examples:
doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"
I want to match sentences based on lists of entities and outcomes:
entities = ['students','rebels']
outcomes = ['arrested','killed']
How can I use a regex to extract the number of participants from 0-99999, any of the entities, any of the outcomes, all while ignoring random text (such as 'young' or 'were reported')? This is what I have:
re.findall(r'\d{1,5} \D{1,50}'+ '|'.join(entities) + '\D{1,50}' + '|'.join(outcomes),doc1)
i.e., a number, some optional random text, an entity, some more optional random text, and an outcome.
Something is going wrong, I think because of the OR statements. Thanks for your help!
This regex should match your two examples:
pattern = r'\d+\s+.*?(' + '|'.join(entities) + r').*?(' + '|'.join(outcomes) + ')'
What you were missing were parentheses around the ORs.
However, using only regex likely won't give you good results. Consider using Natural Language Processing libraries like NLTK that parses sentences.
As @ReutSharabani already answered, this is not a proper way to do NLP, but this answers the literal question.
The regex should read:
import re
doc1 = "5 young students arrested"
entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']
# Parentheses group each alternation so that | does not split the whole pattern
p = re.compile(r'(\d{1,5})\D{1,50}(' + '|'.join(entities) + r')\D{1,50}(' + '|'.join(outcomes) + ')')
m = p.match(doc1)
number = m.group(1)
entity = m.group(2)
outcome = m.group(3)
You forgot to group () your OR-operations. Instead what you generated was a|b|\W|c|d|\W (short version).
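The grouping fix can be wrapped into a small helper, sketched here under a few assumptions of my own: the word lists are escaped with re.escape, word boundaries are added so that partial words do not match, and search is used so the number does not have to start the sentence. The names alternation and extract are hypothetical.

```python
import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']

def alternation(words):
    """Build one capturing group that matches any of the given words."""
    return '(' + '|'.join(re.escape(w) for w in words) + ')'

# number, filler text without digits, entity, more filler, outcome
pattern = re.compile(
    r'\b(\d{1,5})\b\D{0,50}?\b' + alternation(entities)
    + r'\b\D{0,50}?\b' + alternation(outcomes) + r'\b'
)

def extract(sentence):
    """Return (number, entity, outcome) or None if the sentence does not match."""
    m = pattern.search(sentence)
    return m.groups() if m else None

print(extract("5 young students arrested"))
print(extract("10 rebels were reported killed"))
```

The filler is limited to 50 non-digit characters, as in the question, so unrelated numbers further along the sentence are not swallowed.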
You ought to try out the regex module!
It has built in fuzzy match capabilities. The other answers seem much more robust and sleek, but this could be done simply with fuzzy matching as well!
pattern = r'\d{1,5}(%(entities)s)(%(outcomes)s){i}' %{'entities' : '|'.join(entities), 'outcomes' : '|'.join(outcomes)}
regex.match(pattern, news_sentence)
What's happening here is that the {i} indicates you want a match with any number of inserts. The problem here is that it could insert characters into one of the entities or outcomes and still yield a match. If you want to accept slight alterations on spelling to any of your outcomes or entities, then you could also use {e<=1} or something. Read more in the provided link about approximate matching!

Remove all white spaces inside specific delimiters

I'm trying to process an XML file containing wrongly formed elements.
A wrongly formed element is one which doesn't respect the following pattern : <name attribute1=value1 attribute2=value2 ... attributeN=valueN>
There can be 0 to n attributes.
As a consequence, <my element number> is invalid, while <my element=number> is not.
Here is a sample of my text :
<product_name>
A high wind in Jamaica <The innocent voyage> The modern library of the world s best books Books Richard Arthur Warren Hughes
</product_name>
Here, <product_name> is a good element, while <The innocent voyage> is not.
When an incorrect element is spotted, I would like to have the <> replaced with neutral characters, such as +.
Since the file containing these tags is pretty big (1.5 GB), I would rather not use a brute force approach.
Would you guys see a fast (and if possible, elegant) way to solve this problem ?
As you state that you would rather stay away from regex, I was able to create the following code that doesn't use regex (although I'm sure regex would be quite useful)
def valid_tag(tag):
    temp = tag.split()
    for word in temp[1:]:
        if "=" not in word:
            return False
    return True
Here you pass in a tag as a string as the parameter. For example: "<hello test=test>"
You can run this test on each tag by creating another method for getting a tag by finding a "<" and then the first ">" that follows and creating a substring from that which will be the tag that you pass into this method.
NOTE: This assumes that your tags are written as follows: <hello test=test> and not < hello test = test >
This method is still very primitive and makes a few assumptions as I stated above but hopefully it will give you the start you need.
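One possible way to combine the check above with the replacement step is sketched below. Unlike the answer, it does use a small regex to locate each <...> span, and it processes the input one line at a time so a 1.5 GB file never has to sit in memory. The names is_valid_tag, neutralise and neutralise_file are hypothetical, and the sketch assumes a tag never spans several lines.

```python
import re

# Any <...> span that contains no nested angle brackets
TAG = re.compile(r'<([^<>]*)>')

def is_valid_tag(inner):
    """inner is the text between < and >, e.g. 'name a=1 b=2' or '/name'."""
    parts = inner.split()
    if not parts:
        return False
    # Every word after the element name must look like attribute=value
    return all('=' in word for word in parts[1:])

def neutralise(line):
    """Replace the brackets of invalid tags with '+', leave valid tags alone."""
    def fix(match):
        inner = match.group(1)
        return match.group(0) if is_valid_tag(inner) else '+' + inner + '+'
    return TAG.sub(fix, line)

def neutralise_file(src_path, dst_path):
    """Stream src_path to dst_path line by line (paths are placeholders)."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(neutralise(line))

print(neutralise("A high wind in Jamaica <The innocent voyage> The modern library"))
```

As with valid_tag, spaces around the = sign inside a tag would make a well-formed element look invalid.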

Get address out of a paragraph with regex

Alright, this one's a bit of a pain. I'm doing some scraping with Python, trying to get an address out of a few lines of poorly tagged HTML. Here's a sample of the format:
256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
I'd like to retrieve only 1234 Fake Ave S, Gotham. Any ideas? I've been doing regex's all night and now my brain is mush...
Edit:
More detail about the possible scenarios of how the data will arrive. Sometimes the first line will be there, sometimes not. All of the addresses I have seen have Ave, Way, or St in them, although I would prefer not to use that as a factor in the selection as I am not certain they will always be that way. The second and third lines are always present.
What I had in mind was something that
Selects everything on 2nd to last line (so, second line if there are three lines, first line if just two when there isn't a phone number).
Selects everything on last line that isn't in parentheses.
Combine the 2nd to last line and last line, adding a ", " in between the two.
I'm using Scrapy to acquire the HTML code. The address is all in the same div, I want to use regex to further break the data up into appropriate sections. Now how to do that is what I'm unable to figure out.
Edit2:
As per Ofir's comment, I should mention that I have already made expressions to isolate the phone number and parentheses section.
Phone (or possible email or website):
((1[-. ])?[0-9]{3}[-. ])?\(?([0-9]{3}[-. ][A?([0-9]{4})|([\w\.-]+@[\w\.-]+)|(www.+)|([\w\.-]*(?:com|net|org|us))
parentheses:
\((.*?)\)
I'm not sure how to use those to construct a everything-but-these statement.
It is possible that in your case it is easier to focus on what you don't want:
html tags (<br>)
phone numbers
everything in parenthesis
Each of which can be matched easily with simple regular expressions, making it easy to construct one to match the rest (presumably - the address)
This attempts to isolate the last two lines out of the string:
>>> s="""256-555-5555<br/>
... 1234 Fake Ave S<br/>
... Gotham (Lower Ward)<br/>
... """
>>> import re
>>> m = re.search(r'((?!<br/>).*)<br/>\n((?!<br/>).*)<br/>$', s)
>>> print(m.group(1))
1234 Fake Ave S
Trimming the parentheses is probably best left to a separate line of code, rather than complicating the regular expression further.
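The three steps from the question (take the second-to-last line, strip the parenthesised part from the last line, join with ", ") could be sketched as follows. This assumes the div content has already been extracted as a string with <br/> separators; extract_address is a hypothetical helper name.

```python
import re

def extract_address(fragment):
    """Return 'street, city' from the last two <br/>-separated lines, or None."""
    lines = [part.strip() for part in re.split(r'<br\s*/?>', fragment)]
    lines = [line for line in lines if line]  # drop empty pieces
    if len(lines) < 2:
        return None
    street, city = lines[-2], lines[-1]
    # Remove anything in parentheses from the city line
    city = re.sub(r'\s*\([^)]*\)', '', city).strip()
    return street + ', ' + city

sample = """256-555-5555<br/>
1234 Fake Ave S<br/>
Gotham (Lower Ward)<br/>
"""
print(extract_address(sample))  # 1234 Fake Ave S, Gotham
```

Counting from the end means the optional phone line needs no special handling.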
As far as I understand your problem, I think you are going about solving it the wrong way.
Regexes are not a magical tool that could extract pertinent data from a pulp and jumble of undifferentiated elements of text. It is a tool that can only extract data from a text having variable parts but also a minimum of stable structure acting as anchors relatively to which the variable parts can be localized.
In your treatment, it seems to me that you first isolated this part containing possible phone number followed by address on 1/2 lines. But doing so, you lost information: what is before and what is after is anchoring information, you shouldn't try to find something in the remaining section obtained after having eliminated this information.
Moreover, I presume that you don't want only to catch a phone number and an address: you may want to extract other pieces of information lying before and after this section. With a good shaped regex, you could capture all the pieces in one shot.
So, please, give more of the text, with enough characters before and enough characters after the limited section allowing to write a correct and easier regex strategy to catch all the data you want. triplee has already asked you that, and you didn't, why ?

Python regex for fixing Australian/New Zealand Phone Numbers

I have a Python script that we're using to parse CSV files with user-entered phone numbers in it - ergo, there are quite a few weird format/errors. We need to parse these numbers into their separate components, as well as fix some common entry errors.
Our phone numbers are for Sydney or Melbourne (Australia), or Auckland (New Zealand), given in international format.
Our standard Sydney number looks like:
+61(2)8328-1972
We have the international prefix +61, followed by a single digit area code in brackets, 2, followed by the two halves of the local component, separated by a hyphen, 8328-1972.
Melbourne numbers simply have 3 instead of 2 in the area code, e.g.
+61(3)8328-1972
The Auckland numbers are similar, but they have a 7-digit local component (3 then 4 numbers), instead of the normal 8 digits.
+64(9)842-1000
We also have matches for a number of common errors. I've separated the regex expressions into their own class.
class PhoneNumberFormats():
    """Provides compiled regex objects for different phone number formats. We put these in their own class for performance reasons - there's no point recompiling the same pattern for each Employee"""
    standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
    extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
    missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{3,4})(?P<local_second_half>\d{4})')
    space_instead_of_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4}) (?P<local_second_half>\d{4})')
We have one for standard_format numbers, then others for various common error cases, e.g. putting an extra zero before the area code (02 instead of 2), or missing hyphens in the local component (e.g. 83281972 instead of 8328-1972), etc.
We then call these from cascaded if/elifs:
def clean_phone_number(self):
    """Perform some rudimentary checks and corrections, to make sure numbers are in the right format.
    Numbers should be in the form 0XYYYYYYYY, where X is the area code, and Y is the local number."""
    if not self.telephoneNumber:
        self.PHFull = ''
        self.PHFull_message = 'Missing phone number.'
    else:
        if PhoneNumberFormats.standard_format.search(self.telephoneNumber):
            result = PhoneNumberFormats.standard_format.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = ''
        elif PhoneNumberFormats.extra_zero.search(self.telephoneNumber):
            result = PhoneNumberFormats.extra_zero.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Extra zero in area code - ask user to remediate.'
        elif PhoneNumberFormats.missing_hyphen.search(self.telephoneNumber):
            result = PhoneNumberFormats.missing_hyphen.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Missing hyphen in local component - ask user to remediate.'
        elif PhoneNumberFormats.space_instead_of_hyphen.search(self.telephoneNumber):
            result = PhoneNumberFormats.space_instead_of_hyphen.search(self.telephoneNumber)
            self.PHFull = '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half')
            self.PHFull_message = 'Space instead of hyphen in local component - ask user to remediate.'
        else:
            self.PHFull = ''
            self.PHFull_message = 'Number didn\'t match recognised format. Original text is: ' + self.telephoneNumber
My aim is to make the matching as tight as possible, yet still at least catch the common errors.
There are number of problems with what I've done above though:
I'm using \d{3,4} to match the first half of the local component. Ideally, however, we only want to accept a 3-digit first half if it's a New Zealand number (i.e. starts with +64(9)). That way, we can flag Sydney/Melbourne numbers that are missing a digit. I could separate out auckland_number into its own regex pattern in PhoneNumberFormats; however, that means it wouldn't catch a New Zealand number combined with the error cases (extra_zero, missing_hyphen, space_instead_of_hyphen). So unless I recreate versions of them just for Auckland, like auckland_extra_zero, which seems pointlessly repetitive, I can't see how to address this easily.
We don't pick up combinations of errors - e.g. if they have an extra zero and a missing hyphen, we won't pick this up. Is there an easy way to do this using regex, without explicitly creating permutations of the different errors?
I'd like to address the above two issues, and hopefully tighten it up a bit to catch anything that I've missed. Is there a smarter way to do what I've attempted to do above?
Cheers,
Victor
Additional Comments:
The following is just to provide some context:
This script is for a global company, with one office in Sydney, one in Melbourne and one in Auckland.
The numbers come from an internal Active Directory listing of employees (i.e. it's not a customer listing, but our own office phones).
Hence, we're not looking for a general Australian phone number matching script; rather, we're looking for a general script to parse numbers from three specific offices. Generally, it's only the last 4 numbers that should differ.
Mobile phones aren't required.
The script is designed to parse a CSV dump of the Active Directory, and reformat the numbers into an acceptable format for another program (QuickComm)
This program is from a external vendor, and requires numbers in the exact format that I've produced in the code above - that's why the numbers are spat out like 0283433422.
The script I've written can't change the records, it only works on a CSV dump of them - the records are stored in Active Directory, and the only way to access them to get them fixed is to email the employee and ask them to login and change their own records.
So this script is run by a PA, to produce the output required by this program. She/he will also get a list of people who have incorrectly formatted numbers - hence the messages about asking the user to remediate. In theory, there should only be a small number of these. We then email/ring these employees, asking them to fix their records. The script is run once a month (numbers may change), and we also need to flag new employees who manage to enter their records incorrectly.
@John Macklin: Are you recommending I scrap regexes, and just try to pull specific-position digits out of the string?
I was looking for a way to catch the common error cases, in combinations (e.g. space instead of hyphen, combined with an extra zero), but is this not easily feasible?
Don't use complicated regexes. Delete EVERYTHING except digits -- non-digits are error-prone cruft. If the third digit is 0, delete it.
Expect 61 followed by valid AUS area code ([23478] for generality NB 4 is for mobiles) then 8 digits
or 64 followed by valid NZL area code (whatever that is) followed by 7 digits. Anything else is bad. In the good stuff, insert the +()- at the appropriate places.
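A rough sketch of that digits-only approach is below. The PLANS table is an assumption for illustration: it uses the Australian area codes listed above with 8-digit local numbers, and only area code 9 (Auckland, taken from the question) for New Zealand. The function name normalise is hypothetical.

```python
import re

# Assumed numbering plans for this sketch only
PLANS = {
    '61': {'area_codes': '23478', 'local_len': 8},  # Australia
    '64': {'area_codes': '9', 'local_len': 7},      # New Zealand (Auckland only)
}

def normalise(raw):
    """Return the number as +CC(A)XXXX-XXXX, or None if it is not recognised."""
    digits = re.sub(r'\D', '', raw)  # delete everything except digits
    # Drop a stray zero in third position, as in +61(02)...
    if len(digits) > 2 and digits[2] == '0':
        digits = digits[:2] + digits[3:]
    plan = PLANS.get(digits[:2])
    if plan is None:
        return None
    area, local = digits[2:3], digits[3:]
    if not area or area not in plan['area_codes'] or len(local) != plan['local_len']:
        return None
    split = len(local) - 4  # the last four digits form the second half
    return '+{}({}){}-{}'.format(digits[:2], area, local[:split], local[split:])

print(normalise("+61(02) 8328 1972"))  # +61(2)8328-1972
print(normalise("+64(9)842-1000"))     # +64(9)842-1000
```

Because punctuation is discarded before validation, an extra zero combined with a missing hyphen or a stray space is handled without writing a pattern for each combination.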
By the way (1) area code 2 is for the whole of NSW+ACT, not just Sydney, 3 is for VIC+TAS (2) lots of people these days don't have landlines, just mobiles, and people tend to retain the same mobile phone number longer than they maintain the same landline phone number or the same postal address, so mobile phone number is great for fuzzy matching customer records -- so I'm more than a little curious why you don't include them.
The following tell you all you ever wanted to know, plus a whole lot more, about the Australian and New Zealand phone numbering schemes.
Comment on the regexes:
(1) You are using the search method with a "^" prefix. Using the match method with no prefix is somewhat less inelegant.
(2) You don't seem to be checking for trailing rubbish in your phone number field:
>>> import re
>>> standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
>>> m = standard_format.search("+61(3)1234-567890whoopsie")
>>> m.groups()
('61', '3', '1234', '5678')
>>>
You may like to (a) end some of your regexes with \Z (NOT $) so that they don't match OK when there is trailing rubbish or (b) introduce another group to catch trailing rubbish.
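To illustrate option (a) and why \Z is preferred over $, here is a small sketch using a shortened version of the standard pattern (unnamed groups, for brevity). In Python, $ also matches just before a trailing newline, while \Z matches only at the very end of the string.

```python
import re

strict = re.compile(r'\+(\d{2})\((\d)\)(\d{3,4})-(\d{4})\Z')
loose = re.compile(r'\+(\d{2})\((\d)\)(\d{3,4})-(\d{4})$')

good = "+61(3)1234-5678"
trailing_rubbish = "+61(3)1234-567890whoopsie"
trailing_newline = "+61(3)1234-5678\n"

print(strict.match(good) is not None)              # True
print(strict.match(trailing_rubbish) is not None)  # False
print(loose.match(trailing_newline) is not None)   # True: $ tolerates the newline
print(strict.match(trailing_newline) is not None)  # False
```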
and a social engineering comment: Have you yet tested the user reaction to a staff member carrying out this directive: "Space instead of hyphen in local component - ask user to remediate"? Can't the script just fix it and carry on?
and some comments on the code:
the self.PHFull code
(a) is terribly repetitive (if you must have regexes put them in a list with corresponding action codes and error messages and iterate over the list)
(b) is the same for "error" cases as for standard cases (so why are you asking the users to "remediate"???)
(c) throws away the country code and substitutes a 0 i.e. your standard +61(2)1234-5678 is being kept as 0212345678 aarrgghhh ... even if you have the country stored with the address that's no good if an NZer migrates to Aus and the address gets updated but not the phone number and please don't say that you are relying on the current (no NZ customers outside the Auckland area???) non-overlap of area codes ...
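Point (a) could look something like the sketch below: the patterns and messages live in one list and a single loop replaces the if/elif chain. The patterns are adapted from the question rather than copied: each is anchored with \Z, match is used instead of search, and the missing-hyphen pattern here assumes no extra zero. The names FORMATS and clean_phone_number (as a plain function) are illustrative.

```python
import re

PREFIX = r'\+(?P<intl_prefix>\d{2})'
AREA = r'\((?P<area_code>\d)\)'
AREA_EXTRA_ZERO = r'\(0(?P<area_code>\d)\)'
FIRST = r'(?P<local_first_half>\d{3,4})'
SECOND = r'(?P<local_second_half>\d{4})'

# (compiled pattern, message) pairs, tried in order
FORMATS = [
    (re.compile(PREFIX + AREA + FIRST + '-' + SECOND + r'\Z'), ''),
    (re.compile(PREFIX + AREA_EXTRA_ZERO + FIRST + '-' + SECOND + r'\Z'),
     'Extra zero in area code.'),
    (re.compile(PREFIX + AREA + FIRST + SECOND + r'\Z'),
     'Missing hyphen in local component.'),
    (re.compile(PREFIX + AREA + FIRST + ' ' + SECOND + r'\Z'),
     'Space instead of hyphen in local component.'),
]

def clean_phone_number(number):
    """Return (cleaned_number, message); cleaned_number is '' if nothing matches."""
    for pattern, message in FORMATS:
        m = pattern.match(number)
        if m:
            cleaned = ('0' + m.group('area_code')
                       + m.group('local_first_half') + m.group('local_second_half'))
            return cleaned, message
    return '', "Number didn't match a recognised format: " + number

print(clean_phone_number("+61(02)8328-1972"))
```

Adding a new error case then means adding one entry to the list rather than another elif branch.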
Update after full story revealed
Keep it SIMPLE for both you and the staff. Instructions to staff using Active Directory should be (depending on which office) "Fill in +61(2)9876-7 followed by your 3-digit extension number". If they can't get that right after a couple of attempts, it's time they got the DCM.
So you use one regex per office, filling in the constant part, so that say the SYD offices have numbers of the form +61(2)9876-7ddd you use the regex r"\+61\(2\)9876-7\d{3,3}\Z". If a regex matches, then you remove all non-digits and use "0" + the_digits[2:] for the next app. If no regexes match, send a rocket.
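A minimal sketch of the one-regex-per-office idea, using only the Sydney example prefix given above; the other offices would get their own entries, which are not known here. OFFICE_PATTERNS and to_vendor_format are hypothetical names.

```python
import re

# One pattern per office; only the example Sydney prefix from above is filled in
OFFICE_PATTERNS = {
    'SYD': re.compile(r'\+61\(2\)9876-7\d{3}\Z'),
}

def to_vendor_format(number):
    """Return '0' + area code + local number, or None if no office pattern matches."""
    for pattern in OFFICE_PATTERNS.values():
        if pattern.match(number):
            digits = re.sub(r'\D', '', number)  # remove all non-digits
            return '0' + digits[2:]             # drop the two-digit country code
    return None

print(to_vendor_format("+61(2)9876-7123"))  # 0298767123
```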
+1 for @John Machin's recommendations.
The World Telephone Number Guide is quite useful for national numbering plans, especially the exceptions.
The ITU has freely available standards for lots of stuff too.
Phone numbers are formatted that way to make them easier to remember for people-- there's no reason that I can see for storing them like that. Why not split by commas and parse each number by simply ignoring anything that's not a digit?
>>> import string
>>> def parse_number(number):
...     n = ''
...     for x in number:
...         if x in string.digits:
...             n += x
...     return n
Once you've got it like that you can do verification based on the international prefix and area code (if the third digit is 3, the Melbourne area code, there should be 8 more digits, etc.).
After it's verified, splitting into components is easy. The first two digits are the prefix, the next is the area code, etc. You can do a check for all the common mistakes without using regex. Outputting is also pretty easy in this case.
