Regex: matching pattern fails - python

I'm trying to match a pattern with re in python, but I can't seem to get a match no matter how I try.
This is my matching pattern:
def get_report_date(report):
    report_data = {}
    with open(report, 'r') as f:
        report_date = re.findall(f'([Q\d \d\d\d\d\s])', f.read())[0]
    pprint(report_date)
    report_data.update({f"{report_date.replace(' ', '_')}": report})
    return report_data
and a piece of the file I'm trying to match:
(In millions, except number of shares which are reflected in thousands and per share amounts)
See accompanying Notes to Condensed Consolidated Financial Statements.
Apple Inc. | Q2 2018 Form 10-Q | 1 Apple Inc. CONDENSED CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (Unaudited)
I'm trying to scrape the Q2 2018
But I keep getting empty strings.

The square brackets in your pattern define a character class, so [Q\d \d\d\d\d\s] matches a single character (a Q, a digit, or whitespace) rather than the sequence you want. The f prefix is also unnecessary here; use a raw string instead.
RegExp: r'(Q\d\s\d+)'
Explanation:
r prefix for raw string
Q to match the Q of quarter
\d to match the quarter number afterward
\s to match space
\d+ to match multiple numbers which are the year
Example:
import re
text = """(In millions, except number of shares which are reflected in thousands and per share amounts)
See accompanying Notes to Condensed Consolidated Financial Statements.
Apple Inc. | Q2 2018 Form 10-Q | 1 Apple Inc. CONDENSED CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (Unaudited)"""
x = re.findall(r'(Q\d\s\d+)', text)[0]
# Q2 2018
print(x)
Code Fix:
import re
from pprint import pprint

def get_report_date(report):
    report_data = {}
    with open(report, 'r') as f:
        report_date = re.findall(r'(Q\d\s\d+)', f.read())[0]
    pprint(report_date)
    report_data.update({f"{report_date.replace(' ', '_')}": report})
    return report_data
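To see the difference between the character class and the sequence, here is a minimal sketch. The helper name quarter_label and the stricter pattern (quarter limited to 1-4, a four-digit year, word boundaries) are my own assumptions, not part of the code above; it also returns None instead of raising IndexError when nothing matches.

```python
import re

text = "Apple Inc. | Q2 2018 Form 10-Q | 1 Apple Inc. CONDENSED CONSOLIDATED STATEMENTS"

# Square brackets define a character class: every match is ONE character
# (a Q, a digit, or whitespace), never the whole "Q2 2018" sequence.
single_chars = re.findall(r'([Q\d \d\d\d\d\s])', text)

# Hypothetical stricter pattern: Q, a quarter digit 1-4, whitespace, 4-digit year.
QUARTER_RE = re.compile(r'\bQ([1-4])\s(\d{4})\b')

def quarter_label(content):
    """Return a label such as 'Q2_2018', or None if no quarter is found."""
    match = QUARTER_RE.search(content)
    if match is None:
        return None
    return f"Q{match.group(1)}_{match.group(2)}"

print(single_chars[:3])
print(quarter_label(text))
print(quarter_label("no date here"))
```

Using re.search with an explicit None check avoids the [0] on an empty list, which is what blows up when a report has no quarter in it.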

Related

Pandas Regex: Separate name from string that starts with word or start of string, and ends in certain words

I have a pandas series that contains rows of share names amongst other details:
Netflix DIVIDEND
Apple Inc (All Sessions) COMM
Intel Corporation CONS
Correction Netflix Section 31 Fee
I'm trying to use a regex to retrieve the stock name, which I did with this look ahead:
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(r"(^.*?(?=DIVIDEND|\(All|CONS|COMM|Section))")
The only thing I'm having trouble with is the row Correction Netflix Section 31 Fee, where my regex is getting the sharename as Correction Netflix. I don't want the word "Correction".
I need my regular expression to check for either the start of the string, OR the word "Correction ".
I tried a few things, such as an OR | with the start of string character ^. I also tried a look behind to check for ^ or Correction but the error says they need to be constant length.
r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))"
gives an error; ValueError: Wrong number of items passed 2, placement implies 1. I'm new to regex so I don't really know what this means.
You could use an optional part and, instead of lookarounds, a capture group followed by a match:
^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)
^ Start of string
(?:Correction\s*)? Optionally match Correction followed by 0+ whitespace chars (not captured)
(\S.*?)\s* Capture in group 1, matching a non whitespace char and as few chars as possible, then match (not capture) 0+ whitespace chars
(?: Non capture group for the alternation |
\([^()]*\) Match from ( till )
| Or
DIVIDEND|All|CONS|COMM|Section Match any of the words
) Close group
Regex demo
import pandas as pd

data = ["Netflix DIVIDEND", "Apple Inc (All Sessions) COMM", "Intel Corporation CONS", "Correction Netflix Section 31 Fee"]
pattern = r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"
transactions_df = pd.DataFrame(data, columns=['MarketName'])
transactions_df["Share Name"] = transactions_df["MarketName"].str.extract(pattern)
print(transactions_df)
Output
                          MarketName         Share Name
0                   Netflix DIVIDEND            Netflix
1      Apple Inc (All Sessions) COMM          Apple Inc
2             Intel Corporation CONS  Intel Corporation
3  Correction Netflix Section 31 Fee            Netflix
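As far as I can tell, the ValueError in the question comes from the attempted pattern having two capturing groups: str.extract returns one column per group, and two columns cannot be assigned to a single "Share Name" column. A small sketch with only the re module (no pandas needed) that counts the groups and checks the working pattern on the sample rows:

```python
import re

# The attempt from the question: the outer (...) and the inner (^|Correction )
# are BOTH capturing groups.
failing = r"((^|Correction ).*?(?=DIVIDEND|\(All|CONS|COMM|Section))"

# The pattern from the answer: (?:...) groups do not capture, so only one group remains.
working = r"^(?:Correction\s*)?(\S.*?)\s*(?:\([^()]*\)|DIVIDEND|All|CONS|COMM|Section)"

failing_groups = re.compile(failing).groups
working_groups = re.compile(working).groups
print(failing_groups, working_groups)

rows = [
    "Netflix DIVIDEND",
    "Apple Inc (All Sessions) COMM",
    "Intel Corporation CONS",
    "Correction Netflix Section 31 Fee",
]
names = [re.search(working, row).group(1) for row in rows]
print(names)
```

Turning any group you do not want returned into a non-capturing (?:...) group is usually the simplest way to keep str.extract down to one column.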

Line By Line output Python Regex

I am trying to figure out the best way to get the output to match in python using a few regex matches. Here is an example text.
Student ID: EDITED Sex: TRUCK
<<Fall 2016: 20160822 to 2
Rpt Dup
CRIJ 3310 Foundtns of Criminal Justice 3 A
COMM 3315 Leadership Communication 3 B
ENGL 3430 Professional Writing 4 A
<<Spring 2017: 20170117 to 20170512 () >>
MKTG 3303 Principles of Marketing 3 B
<<Summer 2017: 20170515 to 20170809 () >>
HUMA 4300 Selected Topics in Humanities 3
<<Fall 2017: 20170828 to 20171215 () >>
HUMA 4317 The Modern Era 3
COMM
4314 Intercultrl Communicatn 3
(((IT REPEATS THE SAME TYPE OF TEXT BUT WITH A DIFFERENT STUDENT BELOW)))
Here is some code:
import re
term_match = re.findall(r'^<<.*', filename, re.M)
course_match = re.findall(r'^[A-Z]{2,7}.*', filename, re.M)
print('\n'.join(term_match))
print('\n'.join(course_match))
I have a regex to match the student ID and the Course info, my problem is getting them to be outputted in line by line order. On the document there are multiple students with lots of coursework so just matching is not good enough. I need to match ID, print the following coursework matches, and then print the next ID and coursework when it gets to that line. Any help on how to achieve such a thing would be great!
The flag re.MULTILINE makes ^ and $ match at the start and end of every line, not just at the start and end of the whole string.
That said, you're probably better off looping line-by-line and recognizing when each new student id is encountered:
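# s holds the full text of the document
student_id = ''
for line in s.splitlines():
    if not line:
        continue
    elif line.startswith('Student ID:'):
        # startswith is case-sensitive, so match the text exactly as it appears
        student_id = line[len('Student ID:'):].strip()
    else:
        print(student_id, line)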
One other thought, you could simplify the problem by dividing the text into chunks (one per student id):
starts = [mo.start() for mo in re.finditer(r'^Student ID(.*)$', s, re.MULTILINE)]
starts.append(len(s))
chunks = []
for begin, end in zip(starts, starts[1:]):
    chunks.append(s[begin:end])
After that, isolating the courses for each student should be much easier :-)
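Putting the line-by-line idea together with the term and course patterns from the question, a rough sketch could look like this. The sample text, the student IDs 1001 and 1002, and the function name group_courses are made up for illustration; the real file layout may need different patterns.

```python
import re

# Hypothetical sample in the same shape as the transcript in the question.
sample = """Student ID: 1001 Sex: F
<<Fall 2016: 20160822 to 20161209 () >>
CRIJ 3310 Foundtns of Criminal Justice 3 A
COMM 3315 Leadership Communication 3 B
<<Spring 2017: 20170117 to 20170512 () >>
MKTG 3303 Principles of Marketing 3 B
Student ID: 1002 Sex: M
<<Fall 2017: 20170828 to 20171215 () >>
HUMA 4317 The Modern Era 3
"""

student_re = re.compile(r'^Student ID:\s*(\S+)')
term_re = re.compile(r'^<<\s*([^:]+):')
course_re = re.compile(r'^[A-Z]{2,7}\s+\d{4}\b.*')

def group_courses(text):
    """Return (student_id, term, course_line) tuples in document order."""
    records = []
    student = term = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = student_re.match(line)
        if match:
            # A new student starts: remember the ID and reset the term.
            student, term = match.group(1), None
            continue
        match = term_re.match(line)
        if match:
            term = match.group(1).strip()
            continue
        if course_re.match(line):
            records.append((student, term, line))
    return records

records = group_courses(sample)
for student, term, course in records:
    print(student, term, course, sep=' | ')
```

Because the loop carries the current student and term along, every course line is printed in order under the ID it belongs to, which two separate findall calls cannot give you.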

How to parse exact data without including surrounding text?

My code is very close to succeeding but I just need a little help.
I have 100's of pages of data but I am working on parsing only 1 page perfectly before I apply it to the others. In this one page, which is an email, I need to extract several things: a Date, Sector, Fish Species, Pounds, and Money. So far I have been successful in using RegularExpressions to recognize certain words and extract the data from that line: such as looking for "Sent" because I know the Date information will always follow that, and looking for either "Pounds" or "lbs" because the Pounds information will always precede that.
The problem I am having is that my code is grabbing the entire line that the data is on, not just the numeric data. I want to grab just the number value for Pounds, for example, but I realize this will be extremely difficult because every one of the 100's of emails is worded differently. I'm not sure if it is even possible to make this code foolproof because I need RegEx to recognize the text that surrounds the data, but not include it in my export command. So will I simply be blindly grabbing at characters following certain recognized words?
Here is a piece of my code used for extracting the Pounds data:
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) != None:
                    sector_result.append((linenum, line.rstrip('\n')))
            for linenum, line in sector_result:
                print ("Pounds:", line)
And here is what it prints out:
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
Pounds: -American Plaice 2,000 lbs .60 lbs or best offer
Ideally I would just like the 5,000 lb numeric value to be exported but I am not sure how I would go about grabbing just that number.
Here is the original email text I need to parse:
From:
Sent: Friday, November 15, 2013 2:43pm
To:
Subject: NEFS 11 fish for lease
Greetings,
NEFS 11 has the following fish for lease:
-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
-American Plaice 2,000 lbs .60 lbs or best offer
Here is another separate email though that will need to be parsed; this is why writing this code is difficult because it'll have to tackle a variety of differently worded emails, since all are written by different people:
From:
Sent: Monday, December 09, 2013 1:13pm
To:
Subject: NEFS 6 Stocks for lease October 28 2013
Hi All,
The following is available from NEFS VI:
4,000 lbs. GBE COD (live wt)
10,000 lbs. SNE Winter Flounder
10,000 lbs. SNE Yellowtail
10,000 lbs GB Winter Flounder
Will lease for cash or trade for GOM YT, GOM Cod, Dabs, Grey sole stocks on equitable basis.
Please forward all offers.
Thank you,
Any and all help is appreciated, as well as question asking criticism. Thanks.
Here's a regex flexible enough:
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            pattern = r'(\d[\d,.]+)\s*(?:lbs|[Pp]ounds)'
            content = f.read()
            ### if you want only the first match ###
            match = re.search(pattern, content)
            if match:
                print(match.group(1))
            ### if you want all the matches ###
            matches = re.findall(pattern, content)
            if matches:
                print(matches)
You could be more thorough with the regex if needed.
Hope this helps!
UPDATE
The main part here is the regular expression (\d[\d,.]+)\s*(?:lbs|[Pp]ounds). This is a basic one, explained as follows:
(
\d -> Start with any digit character
[\d,.]+ -> Followed by either other digits or commas or dots
)
\s* -> Followed by zero or more spaces
(?:
lbs|[Pp]ounds -> Followed by either 'lbs' or 'Pounds' or 'pounds'
)
The parentheses define the capturing group, so (\d[\d,.]+) is the part being captured, which is basically the numeric part.
Parentheses that start with ?: define a non-capturing group.
This regex will match:
2,890 lbs (capturing '2,890')
3.6 pounds (capturing '3.6')
5678829 Pounds
23 lbs
9,894Pounds
etc
As well as unwanted stuff like:
2..... lbs
3,4,6,7,8 pounds
It will not match:
7,423
23m lbs
45 ppounds
2.8 Pound
You could make a much more complicated regex depending on the complexity of the contents you have. I would think this regex is good enough for your purposes.
Hope this helps clarify
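If the loose pattern lets through too much, here is a sketch of a stricter variant run against lines from the two emails. The pattern and the helper name pounds_in are my own assumptions, not the regex explained above: it accepts thousands-separated numbers, plain or decimal numbers, and leading-dot decimals, followed by lb, lbs, pound or pounds in any case.

```python
import re

# Hypothetical stricter pattern:
#   1,000 / 10,000 style   |  plain or decimal number  |  .60 style
POUNDS_RE = re.compile(
    r'(\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?|\.\d+)\s*(?:lbs?\b|pounds?\b)',
    re.IGNORECASE,
)

def pounds_in(line):
    """Return every number that is directly followed by a pounds unit."""
    return POUNDS_RE.findall(line)

lines = [
    "-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs",
    "-American Plaice 2,000 lbs .60 lbs or best offer",
    "4,000 lbs. GBE COD (live wt)",
    "10,000 lbs GB Winter Flounder",
    "Sent: Friday, November 15, 2013 2:43pm",
]
for line in lines:
    found = pounds_in(line)
    # The first hit on a line is the quantity; convert it once commas are removed.
    first = float(found[0].replace(',', '')) if found else None
    print(found, first)
```

Only the capturing group is returned by findall, so the surrounding "lbs" text is recognized but never exported.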
Regex can recognize the text around a value without exporting it; this is done with a non-capturing group. For example:
Pounds: -GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs
To recognize "up to", the value you want, and "(live wt)", you could write a regex like this:
(?: up to).(\d+,\d+.lbs).(?:\(live wt\))
Essentially (?:) is a matching group that isn't captured, so the regex only captures the middle bracketed group.
If you provide the exact surrounding text you want I can be more specific.
Edit:
Going off your new examples I can see that the only similarity between all examples is that you have a number (in the thousands so it has a ,), followed by some amount of whitespace, followed by lbs. So your regex would look like:
(?:(\d+,\d+)\s+lbs)
This will return the matches of the numbers themselves. This regex will exclude the smaller values, by virtue of ignoring values that are not in the thousands (i.e. that do not contain a ,).
Edit 2:
I'd also point out that this can be done entirely without regex using str.split(). Instead of trying to find a particular word pattern, you can just use the fact that the number you want will be the word before lbs, i.e. if lbs is at position i, then your number is at position i-1.
The only other consideration you have to face is how to deal with multiple values, the two obvious ones are:
Biggest value.
First value.
Here's how both cases would work with your original code:
def max_pounds(line):
    pound_values = {}
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Convert the number into a float
            # and save the original string representation.
            pound_values[float(words[i-1].replace(',', ''))] = words[i-1]
    # Print the biggest numerical value.
    print(pound_values[max(pound_values.keys())])

def first_pounds(line):
    words = line.split()
    for i, word in enumerate(words):
        if word.lower() == 'lbs':
            # Print the number and exit.
            print(words[i-1])
            return

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Pounds | lbs", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) != None:
                    sector_result.append((linenum, line.rstrip('\n')))
            for linenum, line in sector_result:
                print("Pounds:", line)
                # Only one function is required.
                max_pounds(line)
                first_pounds(line)
The one caveat is that the code doesn't handle the edge case where lbs is the first word. In Python, words[i-1] with i equal to 0 silently wraps around to the last word instead of raising, so an explicit check that i > 0 is safer than relying on try/except alone.
Neither regex nor split will work if the value before lbs is something other than the number. If you run into that problem I would suggest searching your data for offending emails and, if there are few enough of them, editing them by hand.
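For reference, a sketch of the split-based idea with those edge cases guarded. The function name pounds_before_unit and the accepted unit spellings are assumptions of mine; it returns the numbers instead of printing them so the caller can pick the first or the biggest.

```python
def pounds_before_unit(line):
    """Return, as floats, the words that sit directly before a pounds unit."""
    words = line.split()
    values = []
    for i, word in enumerate(words):
        # i > 0 guards against words[i - 1] wrapping around to the last word.
        # rstrip handles spellings such as "lbs." or "lbs,".
        if i > 0 and word.lower().rstrip('.,;') in ('lb', 'lbs', 'pounds'):
            try:
                values.append(float(words[i - 1].replace(',', '')))
            except ValueError:
                # The word before the unit was not a number; skip it.
                continue
    return values

first_line = "-GOM Cod up to 5,000 lbs (live wt) # 1.40 lbs"
values = pounds_before_unit(first_line)
print(values[0] if values else None)    # first value on the line
print(max(values) if values else None)  # biggest value on the line
```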

Python regex to parse financial data

I am relatively new to regex (always struggled with it for some reason)...
I have text that is of this form:
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
Parsing the text reveals the following structure:
1. Two or more words beginning the sentence, and before the first comma, is the name of the person involved in the transaction
2. One or more words before ('sold'|'bought'|'exercised'|'sold post-exercise') is the title of the person
3. Presence of either one of these: ('sold'|'bought'|'exercised'|'sold post-exercise') AFTER the title, identifies the transaction type
4. The first numeric string following the transaction type ('sold'|'bought'|'exercised'|'sold post-exercise') denotes the size of the transaction
5. 'price of ' precedes a numeric string, which specifies the price at which the deal was struck.
My question is:
How can I use this knowledge (and regex), to write a function that parses similar text to return the variables of interest (listed 1 - 5 above)?
Pseudo code for the function I want to write ..
def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...
    pass
How would I use regex to implement the functions to return the variables of interest when passed in text that conforms to the structure I have identified above?
[[Edit]]
For some reason, I have struggled with regex for a while. If I am to learn from the correct answer here on S.O., it would be much better if it came with an explanation of why the magical expression (sorry, regexp) actually works.
I want to actually learn this stuff instead of copy pasting expressions ...
You can use the following regex:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
DEMO
Python:
import re
financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""
print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)', financialData))
Output:
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
EDIT 1
To understand what each part means, follow the DEMO link; at the top right you will find a block explaining each and every character.
Also Debuggex helps you simulate the string by showing what group matches which characters!
Here's a debuggex demo for your particular case:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
Debuggex Demo
I came up with this regex:
([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p
Debuggex Demo
Basically, we are using the parentheses to capture the important info you want, so let's check each one:
([\w ]+): \w matches any word character [a-zA-Z0-9_] one or more times, this will give us the name of the person;
([\w ]+): another one of these after a comma and a space to get the title;
(sold post-exercise|sold|bought|exercised): then we search for our transaction types. Notice I put sold post-exercise before sold so that it tries to match the longer alternative first;
([\d,\.]+): then we try to find the numbers, which are made of digits (\d), a comma, and probably a dot may appear as well;
([\d\.,]+): then we need to get to the price, which is basically the same as the size of the transaction.
The parts of the regex that connect the groups are pretty basic as well.
If you try it on regex101 it provides some explanation about the regex and generates this code in python to use:
import re
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')
test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."
re.findall(p, test_str)
You can use the following regex that just looks for characters surrounding the delimiters:
(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p
The parts in parentheses will be captured as groups.
>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
... print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]
This is the regex that will do it:
(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d,]+).*?price of ([\d.]+)
You use it like this:
import re
def get_data(line):
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d,]+).*?price of ([\d.]+)"
    m = re.match(pattern, line)
    return m.groups()
For the first line this will return
('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')
EDIT:
Adding an explanation. This regex works as follows:
The first characters, (.*?), mean: take the string until the next match (which is the ,)
. means any character
* means it can repeat many times (many characters and not just 1)
? means don't be greedy, so it stops at the first ',' and not a later one (if there are many ',')
After that there is (.*?) again
Again, take the characters until the next thing to match (which is one of the constant words)
After that there is (sold post-exercise|sold|bought|exercised), which means: find one of the words (separated by |)
After that there is a .*?, which again means take all text until the next match (this time it is not surrounded by (), so it won't be selected as a group and won't be part of the output)
([\d,]+) means take a digit (\d) or a comma; the + stands for one or more times. Inside a character class | is a literal pipe, so it is not needed to separate the options
Again .*? like before
'price of ' finds the actual string 'price of '
And last, ([\d.]+) again means take a digit or a dot one or more times (inside a character class the dot is literal, so it needs no escaping)
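To tie the answers back to the function asked for in the question, here is one possible sketch using named groups and re.VERBOSE so that each piece of the pattern sits on its own line. The pattern is my own variant of the ones above and only assumes the sentence structure listed in the question (name, comma, title, transaction type, number, "shares", ..., "price of", number, "p").

```python
import re

DEALING_RE = re.compile(r"""
    (?P<name>[^,]+),\s+                     # everything before the first comma
    (?P<title>.+?)\s+                       # as little as possible up to the verb
    (?P<transaction_type>sold\ post-exercise|sold|bought|exercised)\s+
    (?P<lot_size>[\d,]+)\s+shares           # first number after the verb
    .*?price\ of\s+
    (?P<price>\d+(?:\.\d+)?)p               # price in pence, without the trailing p
""", re.VERBOSE)

def grok_directors_dealings_text(text_input):
    """Return a dict of the five fields, or None if the text does not fit."""
    match = DEALING_RE.match(text_input.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["lot_size"] = int(fields["lot_size"].replace(",", ""))
    fields["price"] = float(fields["price"])
    return fields

example = ("David Meredith, Financial Director sold post-exercise 15,000 shares "
           "in the company on YYYY-mm-dd at a price of 1044.00p. The Director now "
           "holds 6,290 shares representing 0.01% of the...")
print(grok_directors_dealings_text(example))
```

The longer alternative "sold post-exercise" is listed before "sold" for the same reason given above: alternation tries the options left to right and takes the first one that fits.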

Processing malformed text data with machine learning or NLP

I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits all kindaCamelcase words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regex to extract a lot of combinations for the name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure if I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
    def __call__(self, match):
        group = match.group(0)
        if group[1:].lower().endswith('_nm'):
            return '(?:' + Matcher(group).regex[1:]
        else:
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
    name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year
    age = r'([1-9][0-9]|1[01][0-9])'
    age_upper = age
    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal
    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date
    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]

    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]
        if match.endswith('_instant'):
            match = match[:-8]
        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
        elif len(match) == 1:
        elif 'year' in match:
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern object:
class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []
        escape = not self.text.startswith('!') if escape is None else False
        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]
        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)

    def search(self, text):
        return re.search(self.regex, text)

    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))
        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())
        return results

    def match(self, text):
        # Build a real list: on Python 3, map() returns a lazy object that is always truthy.
        return [(x.groupdict(), x.start()) for x in re.finditer(self.regex, text)]
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
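For anyone who wants the gist without the full classes, here is a much smaller sketch of the same idea: turn $placeholders in a template into named capturing groups. It is not the code above. The two placeholder patterns, the rule "a placeholder called date is a date, everything else is a name", and the function name compile_template are all simplifying assumptions.

```python
import re

# Hypothetical, deliberately simple patterns for the two placeholder kinds.
PLACEHOLDER_PATTERNS = {
    "name": r"[A-Z][A-Za-z'\-]+",
    "date": r"[A-Z][a-z]+ \d{1,2}, \d{4}",
}

def compile_template(template):
    """Compile e.g. '$LASTNAME, $FirstName said on $date' into a regex."""
    parts = []
    pos = 0
    for placeholder in re.finditer(r"\$([A-Za-z_]+)", template):
        # Literal text between placeholders is escaped so it matches itself.
        parts.append(re.escape(template[pos:placeholder.start()]))
        name = placeholder.group(1)
        kind = "date" if name.lower() == "date" else "name"
        parts.append(f"(?P<{name}>{PLACEHOLDER_PATTERNS[kind]})")
        pos = placeholder.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))

regex = compile_template("$LASTNAME, $FirstName said on $date")
found = regex.search("SMITH, John said on January 25, 2012")
print(found.groupdict() if found else None)
```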
I have a similar problem, mainly caused by exporting data from Microsoft Office 2010: the result joins two consecutive words at somewhat regular intervals. The domain is morphological operations, like a spelling checker. You can jump to a machine learning solution or create a heuristic solution like I did.
The easy solution is to assume that the newly formed word is a combination of proper names (with the first character capitalized).
The second solution is to have a dictionary of valid words and try a set of partition locations that generate two (or at least one) valid words. Another problem may arise when one of them is a proper name, which by definition is out of vocabulary in that dictionary. Perhaps word length statistics can be used to identify whether a word is mistakenly formed or actually legitimate.
In my case, this is part of a manual correction of a large text corpus (human-in-the-loop verification); the only thing that can be automated is the selection of probably-malformed words and their recommended corrections.
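The dictionary-based partition idea can be sketched in a few lines. The tiny vocabulary and the function name candidate_splits are placeholders for illustration; a real word list would be much larger, and proper names would still need separate handling as noted above.

```python
def candidate_splits(token, vocabulary):
    """Return every (left, right) split where both halves are known words."""
    lowered = token.lower()
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if lowered[:i] in vocabulary and lowered[i:] in vocabulary
    ]

# Hypothetical vocabulary, just large enough for the demo.
vocabulary = {"text", "here", "this", "is", "why"}

print(candidate_splits("texthere", vocabulary))
print(candidate_splits("Nickname", vocabulary))
```

A token with no valid split is either a legitimate word or contains an out-of-vocabulary part, which is where the human check comes in.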
Regarding the concatenated words, you can split them using a tokenizer:
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
For example:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
is tokenized into:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
OpenNLP has a "learnable tokenizer" that you can train. If that doesn't work, you can try the answers to: Detect most likely words from text without spaces / combined words.
When splitting is done, you can eliminate the punctuation and pass it to a NER system such as CoreNLP:
Johnson John Doe Maybe a Nickname Why is this text here January 25 2012
which outputs:
Tokens
Id  Word      Lemma     Char begin  Char end  POS  NER     Normalized NER
1   Johnson   Johnson   0           7         NNP  PERSON
2   John      John      8           12        NNP  PERSON
3   Doe       Doe       13          16        NNP  PERSON
4   Maybe     maybe     17          22        RB   O
5   a         a         23          24        DT   O
6   Nickname  nickname  25          33        NN   MISC
7   Why       why       34          37        WRB  MISC
8   is        be        38          40        VBZ  O
9   this      this      41          45        DT   O
10  text      text      46          50        NN   O
11  here      here      51          55        RB   O
12  January   January   56          63        NNP  DATE    2012-01-25
13  25        25        64          66        CD   DATE    2012-01-25
14  2012      2012      67          71        CD   DATE    2012-01-25
One part of your problem: "all words that have a month name tacked onto the end,"
If as appears to be the case you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to munch that off first. Then you have a now much simpler job on the remainder of the input string.
Note: Otherwise you could run into problems with given names which are also month names e.g. April, May, June, August. Also March is a surname which could be used as a "middle name" e.g. SMITH, John March.
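A sketch of that "munch the date off first" step, assuming full English month names and a four-digit year at the very end of the line. The function name split_trailing_date is made up, and lines whose date sits in the middle would need a different pattern.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Month name, optional comma, day, optional comma, year, anchored at end of line.
DATE_AT_END = re.compile(rf"({MONTHS}),?\s+(\d{{1,2}}),?\s+(\d{{4}})\s*$")

def split_trailing_date(line):
    """Return (rest_of_line, (month, day, year)), or (line, None) if no date."""
    match = DATE_AT_END.search(line)
    if match is None:
        return line, None
    month, day, year = match.groups()
    return line[:match.start()].rstrip(), (month, int(day), int(year))

dirty = ("LASTNAME, Firstname Middlename (Maybe a Nickname)"
         "Why is this text hereJanuary, 25, 2012")
rest, date = split_trailing_date(dirty)
print(rest)
print(date)
```

Because the pattern is anchored to the end of the line, a given name such as April or May earlier in the line is left alone.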
Your use of the "last/first/middle" terminology is "interesting". There are potential problems if your data includes non-Anglo names like these:
Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno
A few pointers, to get you started:
for date parsing, you could start with a couple of regexes, and then you could use chronic or jChronic
for names, these OpenNlp models should work
As for training a machine learning model yourself, this is not so straightforward, especially regarding training data (work effort)...
