I am trying to figure out the best way to get matching output in Python using a few regexes. Here is an example text:
Student ID: EDITED Sex: TRUCK
<<Fall 2016: 20160822 to 2
Rpt Dup
CRIJ 3310 Foundtns of Criminal Justice 3 A
COMM 3315 Leadership Communication 3 B
ENGL 3430 Professional Writing 4 A
<<Spring 2017: 20170117 to 20170512 () >>
MKTG 3303 Principles of Marketing 3 B
<<Summer 2017: 20170515 to 20170809 () >>
HUMA 4300 Selected Topics in Humanities 3
<<Fall 2017: 20170828 to 20171215 () >>
HUMA 4317 The Modern Era 3
COMM
4314 Intercultrl Communicatn 3
(((IT REPEATS THE SAME TYPE OF TEXT BUT WITH A DIFFERENT STUDENT BELOW)))
Here is some code:
import re
term_match = re.findall(r'^<<.*', filename, re.M)
course_match = re.findall(r'^[A-Z]{2,7}.*', filename, re.M)
print('\n'.join(term_match))
print('\n'.join(course_match))
I have regexes to match the student ID and the course info; my problem is getting the output in line-by-line order. The document contains multiple students, each with lots of coursework, so matching alone is not enough. I need to match an ID, print the coursework matches that follow it, and then print the next ID and its coursework when the scan reaches that line. Any help on how to achieve this would be great!
The re.MULTILINE flag makes ^ and $ match at the start and end of every line, not just the whole string.
That said, you're probably better off looping line-by-line and recognizing when each new student id is encountered:
student_id = ''
for line in s.splitlines(False):
    if not line:
        continue
    elif line.startswith('Student ID:'):
        student_id = line[11:].strip()
    else:
        print(student_id, line)
One other thought, you could simplify the problem by dividing the text into chunks (one per student id):
starts = [mo.start() for mo in re.finditer(r'^Student ID(.*)$', s, re.MULTILINE)]
starts.append(len(s))
chunks = []
for begin, end in zip(starts, starts[1:]):
    chunks.append(s[begin:end])
After that, isolating the courses for each student should be much easier :-)
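For example, a minimal sketch of processing each chunk with the course pattern from the question (an illustration only, assuming the chunks list built above):

import re

for chunk in chunks:
    student_line = chunk.splitlines()[0]                 # the "Student ID: ..." line
    courses = re.findall(r'^[A-Z]{2,7}.*', chunk, re.M)  # course lines within this chunk
    for course in courses:
        print(student_line, course)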
For my upcoming project, I am supposed to take a text file that has been scrambled and unscramble it into a specific format.
Each line in the scrambled file contains a line from the original text, a line number, and a three-letter code that identifies the work. These items are separated by the | character. For example,
it ran away when it saw mine coming!"|164|ALC
cried to the man who trundled the barrow; "bring up alongside and help|27|TRI
"Of course he's stuffed," replied Dorothy, who was still angry.|46|WOO
My task is to write a program that reads each line in the text file, separates its parts, unscrambles the lines, and collects some basic data. For each work, I have to determine
its longest line (and the corresponding line number),
its shortest line (and corresponding line number), and
the average length of the lines in the entire work.
The summaries should be sorted by three-letter code and should be formatted as follows:
ALC
Longest Line (107): "No, I didn’t," said Alice: "I don’t think it’s at all a pity. I said
Shortest Line (148): to."
Average Length: 59
WOO
Longest Line (66): of my way. Whenever I’ve met a man I’ve been awfully scared; but I just
Shortest Line (71): go."
Average Length: 58
Then I have to make another file containing, for each work, its three-letter code followed by its text. All lines must be included, in order, without line numbers or three-letter codes. Works should be separated by a line of five dashes. The result should look like the following:
ALC
A large rose-tree stood near the entrance of the garden: the roses growing on it were white, but there were three gardeners at it, busily painting them red. Alice thought this a very curious thing, and she went nearer to watch them, and just as she came up to them she heard one of them say, "Look out now, Five! Don’t go splashing paint over me like that!" "I couldn’t help it," said Five, in a sulky tone; "Seven jogged my elbow." On which Seven looked up and said, "That’s right, Five! Always lay the blame on others!"
-----
TRI SQUIRE TRELAWNEY, Dr. Livesey, and the rest of these gentlemen having asked me to write down the whole particulars about Treasure Island, from the beginning to the end, keeping nothing back but the bearings of the island, and that only because there is still treasure not yet lifted, I take up my pen in the year of grace 17__ and go back to the time when my father kept the Admiral Benbow inn and the brown old seaman with the sabre cut first took up his lodging under our roof. I remember him as if it were yesterday, as he came plodding to the inn door, his sea-chest following behind him in a hand-barrow--a tall, strong, heavy, nut-brown man, his tarry pigtail falling over the
-----
WOO All this time Dorothy and her companions had been walking through the thick woods. The road was still paved with yellow brick, but these were much covered by dried branches and dead leaves from the trees, and the walking was not at all good. There were few birds in this part of the forest, for birds love the open country where there is plenty of sunshine. But now and then there came a deep growl from some wild animal hidden among the trees. These sounds made the little girl’s heart beat fast, for she did not know what made them; but Toto knew, and he walked close to Dorothy’s side, and did not even bark in return.
My question is: which of Python's data structures (lists or otherwise) and which tools or methods would be best for a project like this, where I have to move lines of text around and unscramble their order? I would greatly appreciate some advice or help with the code.
@Gesucther
The code you posted works, except that when the program tries to find the data for the summaries file, it only brings back:
TTL
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0
WOO
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0
ALG
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0
Is there something you can see that's causing the average, the shortest line, and the longest line not to print out correctly?
Here is a download link to the starting text file. https://drive.google.com/file/d/1Dwnk0ziqovEEuaC7r7YzZdkI5_bh7wvG/view?usp=sharing
EDIT*****
It is working properly now, but is there a way to change the code so it outputs the line number where the longest and shortest lines are found, instead of the character count?
TTL
Longest Line (82): *** END OF THE PROJECT GUTENBERG EBOOK TWENTY THOUSAND LEAGUES UNDER THE SEAS ***
Shortest Line (1): N
Average Length: 58
WOO
Longest Line (74): Section 5. General Information About Project Gutenberg-tm electronic works
Shortest Line (3): it.
Average Length: 58
ALG
Longest Line (76): 2. Alice through Q.’s 3d (_by railwayv) to 4th (_Tweedledum and Tweedledee_)
Shortest Line (1): 1
Average Length: 54
Above, next to the longest line it shows (76) because that's the character length of the sentence; is there a way to have it be the line number instead?
EDIT****
It looks like my summary and unscrambled files are coming out in non-alphabetical order. Is there a way to make them come out alphabetically instead?
I suggest using pandas for this. You can load your data as a dataframe with read_csv once you've added a newline character at the right positions, which can be done with a regex:
import pandas as pd
import io
import re
data = '''it ran away when it saw mine coming!"|164|ALC cried to the man who trundled the barrow; "bring up alongside and help|27|TRI "Of course he's stuffed," replied Dorothy, who was still angry.|46|WOO'''
data = re.sub(r'(?<=[A-Z]{3})\s', '\n', data)  # replace the space after three capital letters with a newline character
df = pd.read_csv(io.StringIO(data), sep='|', names=['text', 'line', 'book'])
This will output the following dataframe:
                                                                      text  line book
0                                    it ran away when it saw mine coming!"   164  ALC
1  cried to the man who trundled the barrow; "bring up alongside and help    27  TRI
2           Of course he's stuffed, replied Dorothy, who was still angry.    46  WOO
Now you can process the data as you like. For example by getting the number of characters in the lines and printing the desired statistics:
df['length'] = df['text'].str.len()
print('longest string:', df[df['length']==df['length'].max()])
print('shortest string:', df[df['length']==df['length'].min()])
print('average string length:', df['length'].mean())
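If you need those statistics per work rather than for the whole file (as the assignment asks), a possible sketch using groupby (untested, using the names defined above):

for book, group in df.groupby('book'):
    longest = group.loc[group['length'].idxmax()]    # row with the longest text
    shortest = group.loc[group['length'].idxmin()]   # row with the shortest text
    print(book)
    print('Longest Line (%d): %s' % (longest['line'], longest['text']))
    print('Shortest Line (%d): %s' % (shortest['line'], shortest['text']))
    print('Average Length: %d' % group['length'].mean())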
Or get the full texts of the books by sorting by line number, grouping the data by book and joining the lines per book:
full_texts = df.sort_values(['line']).groupby('book', as_index = False).agg({'text': ' '.join})
print('\n\n-----\n\n'.join(full_texts['book']+' '+full_texts['text']))
Result:
ALC it ran away when it saw mine coming!"
-----
TRI cried to the man who trundled the barrow; "bring up alongside and help
-----
WOO Of course he's stuffed, replied Dorothy, who was still angry.
If you aren't allowed to use any (third-party) imports, this approach might help you:
First of all, we need to parse your scrambled file, where
Each line [...] contains a line from the original text, a line number, and a three-letter code that identifies the work. These items are separated by the | character.
INPUT_FILE = "text.txt"
SUMMARIES_FILE = "summaries.txt"
UNSCRAMBLED_FILE = "unscrambled.txt"

books = {}
with open(INPUT_FILE, "r") as f:
    for l in f:
        l = l.strip().split("|")
        text, line, book = l
        texts = books.get(book, [])
        texts.append((line, text))
        books[book] = texts
The dictionary books will now look like this:
{
    'ALC': [('164', 'it ran away when it saw mine coming!"')],
    'TRI': [('27', 'cried to the man who trundled the barrow; "bring up alongside and help')],
    'WOO': [('46', '"Of course he\'s stuffed," replied Dorothy, who was still angry.')]
}
Now we can proceed to the processing of each line (note the comments in the code):
with open(SUMMARIES_FILE, "w") as summaries_file, open(UNSCRAMBLED_FILE, "w") as unscrambled_file:
    summary = ""
    unscrambled = ""
    # iterate over all books
    for book, texts in books.items():
        # sort the lines by line number
        texts = sorted(texts, key=lambda k: int(k[0]))
        unscrambled += f"{book}\n"
        total_len = 0
        longest = shortest = None
        # iterate over all (sorted) lines of the book
        for text in texts:
            line, text = text
            unscrambled += text + " "  # join the lines of a book with spaces
            length = len(text)
            # keep track of the total length of all lines (needed to calculate the average)
            total_len += length
            # check whether the current sentence is the longest one yet
            longest = longest if longest is not None and len(longest[1]) > length else (line, text)
            # check whether the current sentence is the shortest one yet
            shortest = shortest if shortest is not None and len(shortest[1]) < length else (line, text)
        unscrambled += "\n-----\n"
        # calculate the average length of lines
        average_len = total_len // len(texts)
        summary += f"{book}\n" \
                   f"Longest Line ({longest[0]}): {longest[1]}\n" \
                   f"Shortest Line ({shortest[0]}): {shortest[1]}\n" \
                   f"Average Length: {average_len}\n\n"
    # write the results to the appropriate files
    summaries_file.write(summary)
    unscrambled_file.write(unscrambled)
summaries.txt will contain:
ALC
Longest Line (164): it ran away when it saw mine coming!"
Shortest Line (164): it ran away when it saw mine coming!"
Average Length: 37
TRI
Longest Line (27): cried to the man who trundled the barrow; "bring up alongside and help
Shortest Line (27): cried to the man who trundled the barrow; "bring up alongside and help
Average Length: 70
WOO
Longest Line (46): "Of course he's stuffed," replied Dorothy, who was still angry.
Shortest Line (46): "Of course he's stuffed," replied Dorothy, who was still angry.
Average Length: 63
unscrambled.txt will contain:
ALC
it ran away when it saw mine coming!"
-----
TRI
cried to the man who trundled the barrow; "bring up alongside and help
-----
WOO
"Of course he's stuffed," replied Dorothy, who was still angry.
-----
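Regarding the alphabetical-order follow-up: a plain dict keeps insertion order, so to emit the works sorted by their three-letter code, you only need to change the loop header (a one-line sketch):

# Replace the loop header in the code above to get alphabetical output:
for book, texts in sorted(books.items()):  # sorted() orders the keys, i.e. the codes
    ...                                    # loop body unchanged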
However, this solution might not be as efficient as using pandas.
import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b = re.findall(r'\sCiteSeerX', col4)
print(b)
I have to print "May god bless our families studied". I'm using Python regular expressions to extract the file name, but I'm only getting CiteSeerX as output. I'm doing this on a very large dataset, so I only want to use a regular expression if there is no more efficient and faster way; if there is, please point it out.
Also, I want the last year, 2004, as output.
I'm new to regular expressions and I know that my implementation above is wrong, but I can't find a correct one. This is a very naive question. I'm sorry, and thank you in advance.
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
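Applied to the string from the question, a rough non-regex sketch (assuming the title always ends right before " CiteSeerX" and that the year is the last whitespace-separated token, as in the sample):

col4 = "May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"
idx = col4.find(" CiteSeerX")
title = col4[:idx]              # 'May god bless our families studied.'
year = col4.rsplit(" ", 1)[-1]  # '2004'
print(title, year)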
If the structure of all your data is similar to the sample you provided, this should get you going:
import re

data = re.findall(r"(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
    # we have a match; extract the first capturing group
    title, year = data[0]
    print(title, year)
else:
    print("Unable to parse the string")

# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last four digits as the year (again, assuming that the structure is similar for all the data you have). The parentheses mark the capturing groups for the parts we are interested in.
Update:
For the case, where there is metadata following the year of publishing, use the following regular expression:
import re

YEAR = r"\d{4}"
DATE = r"\d\d\d\d-\d\d-\d\d"

def parse_citation(s):
    regex = r"(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
    data = re.findall(regex, s)
    if data:
        # we have a match; extract the first group
        return data[0]
    else:
        return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')
I have four speakers like this:
Team_A = ["Fred", "Bob"]
Team_B = ["John", "Jake"]
They are having a conversation, and it is all represented by a single string, i.e. convo =
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine
How do I disassemble and reassemble the string so I can split it into two strings: one of what Team_A said and one of what Team_B said?
output: team_A_said="hello how is it going?", team_B_said="hi we are doing fine"
The lines don't matter.
I have some awful find-then-slice code that is not scalable. Can someone suggest something else? Are there any libraries that could help with this?
I didn't find anything in the nltk library.
This code assumes that the contents of convo strictly conform to the
name\nstuff they said\n\n
pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which creates a list of triplets of strings from the lines list. For a discussion on this technique and alternate techniques, please see How do you split a list into evenly sized chunks in Python?.
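To see the grouping trick in isolation, here is a tiny sketch:

lines = ['Fred', 'hello', '', 'John', 'hi', '']
print(list(zip(*[iter(lines)]*3)))
# [('Fred', 'hello', ''), ('John', 'hi', '')]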
#!/usr/bin/env python
team_ids = ('A', 'B')
team_names = (
    ('Fred', 'Bob'),
    ('John', 'Jake'),
)

# Build a dict to get the team id from a person's name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id

# Each block in convo MUST consist of <name>\n<one line of text>\n\n
# Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

# Group lines into <name><text><empty> chunks
# and append the text to the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    print('Team %s said: %r' % (team_id, ' '.join(said[team_id])))
output
Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'
You could use a regular expression to split up each entry. The built-in filter can then be used to extract the required entries for each team's conversation.
import re

def get_team_conversation(entries, team):
    return list(filter(lambda x: x.split('\n')[0] in team, entries))
Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']
convo = """
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine"""
find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S+re.M)]
print('Team-A', get_team_conversation(entries, Team_A))
print('Team-B', get_team_conversation(entries, Team_B))
Giving the following output:
Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
It is a problem of language parsing.
This answer is a work in progress.
Finite state machine
A conversation transcript can be understood by imagining it as parsed by an automaton with the following states:
[start] ---> [Name] ----> [Text] -+-----> [end]
               ^                  |
               |                  | (whitespaces)
               +------------------+
You can parse your conversation by making it follow that state machine. If your parsing succeeds (ie. follows the states to end of text) you can browse your "conversation tree" to derive meaning.
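A minimal sketch of such an automaton (my own illustration, assuming the name/text/blank-line layout used in the other answers):

def parse_convo(convo, speakers):
    # Two states: expecting a NAME line, or collecting TEXT lines.
    state = 'NAME'
    current = None
    turns = []                      # list of (speaker, text) pairs
    for line in convo.splitlines():
        if state == 'NAME':
            if line.strip() in speakers:
                current = line.strip()
                state = 'TEXT'
        else:                       # state == 'TEXT'
            if not line.strip():    # whitespace: back to expecting a name
                state = 'NAME'
            else:
                turns.append((current, line))
    return turns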
Tokenizing your conversation (lexer)
You need functions to recognize the name state. This is straightforward:
name = (Team_A | Team_B) + '\n'
Conversation alternation
In this answer I did not assume that a conversation alternates between speakers; a parser that did assume this would fail on a transcript like the following:
Fred # author 1
hello
John # author 2
hi
Bob # author 3
how is it going ?
Bob # ERROR : author 3 again !
are we still on for saturday, Fred ?
This might be problematic if your transcript concatenates consecutive answers from the same author.
There are two sentences in "test_tweet1.txt"
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:
import re

popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print("\n", count1, line)
    ltext1 = line.split(" ")
    for i, text in enumerate(ltext1):
        if text in rpopular_person:
            print(text)
    text2 = ' '.join(ltext1)
The code produced these results:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I tried to match words from "test_tweet1.txt" against the names in "Personal.txt".
Expected result:
Tony
Romo
Any suggestion?
Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask something like 'to' in rpopular_person, you get True, because the characters 't', 'o' appear there in sequence. I am assuming the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split(), with no arguments. This splits on all whitespace--newlines, tabs, and so on. This way you get everything separately like you intend. Or, if you don't actually want this (since this will give you results of "The" all the time based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet. The word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue). Rather than looking at each word in your tweet, look at each name in your Personal.txt file, and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    people = p.read().split('\n')  # Get full names rather than partial names

with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print(person)
You need to split rpopular_person to get it to match words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
Tony
The
The reason Romo isn't showing up is that after your line split you have "Romo." with a period. Maybe you should look for each popular person in the line, instead of the other way around. Maybe something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
for count1, line in enumerate(file1):
    print("\n", count1, line)
    for person in popular_person:
        if person in line:
            print(person)
I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits all kindaCamelcase words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regexes to extract the many combinations of name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure whether I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
    'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
    ['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
I ended up implementing a somewhat complicated series of exhaustive regexes that covered every possible use case, using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
    def __call__(self, match):
        group = match.group(0)
        if group[1:].lower().endswith('_nm'):
            return '(?:' + Matcher(group).regex[1:]
        else:
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
    name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year
    age = r'([1-9][0-9]|1[01][0-9])'
    age_upper = age
    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal
    # `months` and `months_short` are defined elsewhere in the (unposted) source
    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date

    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]

    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]
        if match.endswith('_instant'):
            match = match[:-8]
        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
        elif len(match) == 1:
            pass  # single-character patterns: handling omitted from the original post
        elif 'year' in match:
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern object:
class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []
        escape = not self.text.startswith('!') if escape is None else False
        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]
        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)

    def search(self, text):
        return re.search(self.regex, text)

    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))
        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())
        return results

    def match(self, text):
        # build a list so the truthiness test works (map() returns an iterator in Python 3)
        result = [(m.groupdict(), m.start()) for m in re.finditer(self.regex, text)]
        if result:
            return result
        else:
            return []
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
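Since the full source isn't posted, here is a speculative usage sketch; it may not run as-is, and the pattern and input strings are my own invented examples:

# Hypothetical usage of the Pattern class above:
p = Pattern('$LASTNAME, $Firstname said on $date')
m = p.search('SMITH, John said on January 25, 2012')
if m:
    print(m.groupdict())  # intended result: group names mapped to the matched fragments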
I had a similar problem, mainly caused by exporting data from Microsoft Office 2010: the result was a join between two consecutive words at somewhat regular intervals. The domain area is morphological operations, like a spell-checker. You can jump to a machine learning solution or create a heuristic solution like I did.
The easy solution is to assume that the newly formed word is a combination of proper names (with the first character capitalized).
A second, additional solution is to have a dictionary of valid words and to try a set of partition locations which generate two (or at least one) valid words. A problem may arise when one of them is a proper name, which by definition is out of vocabulary in that dictionary. Perhaps one could use word-length statistics to identify whether a word is a mistakenly formed word or actually a legitimate one. A small sketch of the partition idea follows below.
In my case, this was part of a manual correction of a large corpus of text (human-in-the-loop verification), but the only thing which could be automated was the selection of probably-malformed words and their recommended corrections.
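A minimal sketch of that dictionary-partition heuristic (the dictionary set here is a stand-in for a real word list):

def split_joined(word, dictionary):
    # Try every split point; keep those where both halves are known words.
    candidates = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left.lower() in dictionary and right.lower() in dictionary:
            candidates.append((left, right))
    return candidates

print(split_joined('NicknameWhy', {'nickname', 'why'}))  # [('Nickname', 'Why')]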
Regarding the concatenated words, you can split them using a tokenizer:
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
For example:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
is tokenized into:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
OpenNLP has a "learnable tokenizer" that you can train. If that doesn't work, you can try the answers to: Detect most likely words from text without spaces / combined words.
When splitting is done, you can eliminate the punctuation and pass it to a NER system such as CoreNLP:
Johnson John Doe Maybe a Nickname Why is this text here January 25 2012
which outputs:
Tokens

Id  Word      Lemma     Char begin  Char end  POS  NER     Normalized NER
1   Johnson   Johnson   0           7         NNP  PERSON
2   John      John      8           12        NNP  PERSON
3   Doe       Doe       13          16        NNP  PERSON
4   Maybe     maybe     17          22        RB   O
5   a         a         23          24        DT   O
6   Nickname  nickname  25          33        NN   MISC
7   Why       why       34          37        WRB  MISC
8   is        be        38          40        VBZ  O
9   this      this      41          45        DT   O
10  text      text      46          50        NN   O
11  here      here      51          55        RB   O
12  January   January   56          63        NNP  DATE    2012-01-25
13  25        25        64          66        CD   DATE    2012-01-25
14  2012      2012      67          71        CD   DATE    2012-01-25
One part of your problem: "all words that have a month name tacked onto the end,"
If, as appears to be the case, you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to munch that off first. Then you have a much simpler job on the remainder of the input string; a sketch follows after the note below.
Note: otherwise you could run into problems with given names that are also month names, e.g. April, May, June, August. Also, March is a surname which could be used as a "middle name", e.g. SMITH, John March.
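A rough sketch of that first pass (one possible pattern; it also allows the stray comma after the month name seen in your samples):

import re

MONTHS = ('January|February|March|April|May|June|July|'
          'August|September|October|November|December')
# Optional comma after the month name, as in "January, 25, 2012"
date_re = re.compile(r'(?:%s),?\s+\d{1,2},?\s+\d{4}\s*$' % MONTHS)

s = 'LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012'
m = date_re.search(s)
if m:
    date, rest = m.group(), s[:m.start()]  # munch the date off the end
    print(rest)  # 'LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text here'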
Your use of the "last/first/middle" terminology is "interesting". There are potential problems if your data includes non-Anglo names like these:
Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno
A few pointers, to get you started:
for date parsing, you could start with a couple of regexes, and then you could use chronic or jChronic
for names, these OpenNlp models should work
As for training a machine learning model yourself, this is not so straightforward, especially regarding training data (work effort)...