Using Regex to 'Clean Up' a List of Names [closed]

Using Regex to 'Clean Up' a List of Names [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
I am using Regex to clean up a list of names so that they are normal. Let's say this list was...
000000AAAAAARob Alsod ## Notice multiple 0's and A's?
AAAPerson Person ## Here, too
Jeff the awesome Guy ## Four words...
Jenna DEeath ## A name like this can exist.
GEOFFERY EVERDEEN ## All caps
shy guy ## All lowercase
Theone Normalperson ## Example name. This one is fine.
Guywith Whitespace ## Trailing or leading whitespace is a nono.
So, as you can see, people don't format their names correctly, so I need a program to highlight all the unwanted stuff. This includes:
Numbers at the start of the name.
Any uppercase without lowercase after. i.e. AAAAAAAJosh
Anything that is all uppercase.
Anything that doesn't start with uppercase. i.e. josh
Trailing and leading whitespace.
I think that is all I need to filter out. The ending product should look something like this:
Rob Alsod ## No more 0's and A's.
Person Person ## No more leading A's (or other letters).
Jeff Guy ## No lowercase words in his name.
Jenna DEeath ## HASN'T removed the D in the middle.
## Name removed as it was all uppercase.
## Name removed as it was all lowercase.
Theone Normalperson ## Nothing changed.
Guywith Whitespace ## Removed whitespace.
EDIT: Sorry about that. Here is my current code:
# Enter your code for "Name Cleaning" here.
import re
namenum = []
num = 0
for sen in open('file.txt'):
namenum += [sen.split(',')]
namenum[num][0] = re.sub(r'\s[a-z]+', '', namenum[num][0])
namenum[num][0] = re.sub(r'^([0-9]*)', '', namenum[num][0])
namenum[num][0] = re.sub(r'^[A-Z]*?\s[A-Z]*?$', '', namenum[num][0])
namenum[num][0] = re.sub(r'[^a-zA-Z ][A-Z]*(?=[A-Z])', '', namenum[num][0])
namenum[num][0] = re.sub(r'\b[a-z]+\b', '', namenum[num][0])
namenum[num][0] = re.sub(r'^\s*', '', namenum[num][0])
namenum[num][0] = re.sub(r'\s*$', '', namenum[num][0])
if namenum[num][0] == '':
namenum[num][0] = 'Invalid Name'
num += 1
for i in range(len(namenum)):
namenum[i][1] = int(namenum[i][1].strip())
namenum = sorted(namenum, key=lambda item: (-item[1], item[0]))
for i in range(0, len(namenum)):
print(namenum[i][0]+','+str(namenum[i][1]))
It does half the job, but it misses out on some stuff for some reason.
Here is the output:
AAAAAARob Alsod
AAAPerson Person
Guywith Whitespace
Invalid Name
Invalid Name
Jeff Guy
Jenna DEeath
Theone Normalperson
I also tried inputting a name like harry hamilton and it gave back harry, which it should have removed.

This regex removes all your invalid examples. None of your examples requires the for loop which filters banned words, but I think you will need it.
Although this code removes all invalid names from a list it should be easy to modify it to request a new input from the user. Also it doesn't let you know why a name is invalid, but you could just display all the rules.
from re import match
def rules(name):
for badWord in bannedWords:
if name.lower().find(badWord) >= 0:
return False
return match(r'^([A-Z][a-z]+(?:[A-Z]?[a-z]+)* ?){1,}$', name)
bannedWords = ('really', 'awesome')
input = ['000000AAAAAARob Alsod', 'AAAPerson Person', 'Jeff the awesome Guy', 'Jenna DEeath', 'GEOFFERY EVERDEEN', 'shy guy', 'Theone Normalperson', ' Guywith Whitespace', 'Someone Middlename MacIntyre', '', 'Jack Really Awesome']
results = filter(rules, input)
print results
Produces the result:
['Theone Normalperson', 'Someone Middlename MacIntyre']
Without the for loop:
from re import match
def rules(name):
return match(r'^([A-Z][a-z]+(?:[A-Z]?[a-z]+)* ?){1,}$', name)
input = ['000000AAAAAARob Alsod', 'AAAPerson Person', 'Jeff the awesome Guy', 'Jenna DEeath', 'GEOFFERY EVERDEEN', 'shy guy', 'Theone Normalperson', ' Guywith Whitespace', 'Someone Middlename MacIntyre', '', 'Jack Really Awesome']
results = filter(rules, input)
print results
Produces the result:
['Theone Normalperson', 'Someone Middlename MacIntyre', 'Jack Really Awesome']

Related

Python: split a text into individual English sentences; retain the punctuation [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
Improve this question
I am trying to make a function, takes a string/text as an argument, return list of sentences in the text. Sentence boundaries like(.,?,!) should not be removed.
I don't want it to split on abbreviations (Dr. Kg. Mr. Mrs., e.g. "Dr. Jones").
Should I make a dictionary of all abbreviations?
Given input:
input = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"
Expected output:
output=['I think Dr. Jones is busy now.','Can you visit some other day?','I was really surprised!']
What I've tried:
# performing somthing like this:
output = input.split('.')
# will produce
'''
['I think Dr', ' Jones is busy now', ' Can you visit some other day? I was really surprised!']
'''
# where as doing
output = input.split(' ')
# will produce
'''
['I', 'think', 'Dr.', 'Jones', 'is', 'busy', 'now.', 'Can', 'you', 'visit', 'some', 'other', 'day?', 'I', 'was', 'really', 'surprised!']
'''
Basic assumption is that the text intput is not anomalously punctuated!

A clumsy way of achieving it is as follows:
abbr = {'Dr.', 'Mr.', 'Mrs.', 'Ms.'}
sentence_ender = ['.', '?', '!']
s = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"
def containsAny(wrd, charList):
# The list comprehension generates a list of True and False.
# "1 in [ ... ]" returns true is the list has atleast 1 true, else false
# we are essentially testing whether the word contains the sentence ender char
return 1 in [c in wrd for c in charList]
def separate_sentences(string):
sentences = [] # will be a list of all complete sentences
temp = [] # will be a list of all words in current sentence
for wrd in string.split(' '): # the input string is split on spaces
temp.append(wrd) # append current word to temp
# The following condition checks that if the word is not an abbreviation
# yet contains any of the sentence delimiters,
# make 'space separated' sentence and clear temp
if wrd not in abbr and containsAny(wrd, sentence_ender):
sentences.append(' '.join(temp)) # combine words currently in temp
temp = [] # clear temp, for next sentence
return sentences
print(separate_sentences(s))
Should produce:
['I think Dr. Jones is busy now.', 'Can you visit some other day?', 'I was really surprised!']

Python -- split a string with multiple occurrences of same delimiter

How can I take a string that looks like this
string = 'Contact name: John Doe Contact phone: 222-333-4444'
and split the string on both colons? Ideally the output would look like:
['Contact Name', 'John Doe', 'Contact phone','222-333-4444']
The real issue is that the name can be an arbitrary length however, I think it might be possible to use re to split the string after a certain number of space characters (say at least 4, since there will likely always be at least 4 spaces between the end of any name and beginning of Contact phone) but I'm not that good with regex. If someone could please provide a possible solution (and explanation so I can learn), that would be thoroughly appreciated.

You can use re.split:
import re
s = 'Contact name: John Doe Contact phone: 222-333-4444'
new_s = re.split(':\s|\s{2,}', s)
Output:
['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
Regex explanation:
:\s => matches an occurrence of ': '
| => evaluated as 'or', attempts to match either the pattern before or after it
\s{2,} => matches two or more whitespace characters

Python regex to match Ledger/hledger account journal entry

I am writing a program in Python to parse a Ledger/hledger journal file.
I'm having problems coming up with a regex that I'm sure is quite simple. I want to parse a string of the form:
expenses:food:food and wine 20.99
and capture the account sections (between colons, allowing any spaces), regardless of the number of sub-accounts, and the total, in groups. There can be any number of spaces between the final character of the sub-account name and the price digits.
expenses:food:wine:speciality 19.99 is also allowable (no space in sub-account).
So far I've got (\S+):|(\S+ \S+):|(\S+ (?!\d))|(\d+.\d+) which is not allowing for any number of sub-accounts and possible spaces. I don't think I want to have OR operators in there either as this is going to concatenated with other regexes with .join() as part of the parsing function.
Any help greatly appreciated.
Thanks.

You can use the following:
((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)
Now we can use:
s = 'expenses:food:wine:speciality 19.99'
rgx = re.compile(r'((?:[^\s:]+)(?:\:[^\s:]+)*)\s*(\d+\.\d+)')
mat = rgx.match(s)
if mat:
categories,price = mat.groups()
categories = categories.split(':')
Now categories will be a list containing the categories, and price a string with the price. For your sample input this gives:
>>> categories
['expenses', 'food', 'wine', 'speciality']
>>> price
'19.99'

You don't need regex for such a simple thing at all, native str.split() is more than enough:
def split_ledger(line):
entries = line.split(":") # first split all the entries
last = entries.pop() # take the last entry
return entries + last.rsplit(" ", 1) # split on last space and return all together
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine ', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality ', '19.99']
Or if you don't want the leading/trailing whitespace in any of the entries:
def split_ledger(line):
entries = [e.strip() for e in line.split(":")]
last = entries.pop()
return entries + [e.strip() for e in last.rsplit(" ", 1)]
print(split_ledger("expenses:food:food and wine 20.99"))
# ['expenses', 'food', 'food and wine', '20.99']
print(split_ledger("expenses:food:wine:speciality 19.99"))
# ['expenses', 'food', 'wine', 'speciality', '19.99']

What Regex to use in this example

I am parsing a string that I know will definitely only contain the following distinct phrases that I want to parse:
'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'
The string that I am parsing could contain everything from none of the elements above to all of them (i.e. the string being parsed could be anything from None to 'Man of the Match Goal Assist Yellow Card Red Card'.
For those of you that understand football, you will also realise that the elements 'Goal' and 'Assist' could in theory be repeated an infinite number of times. The element 'Yellow Card' could be repeated 0, 1 or 2 times also.
I have built the following Regex (where 'incident1' is the string being parsed), which I believed would return an unlimited number of all preceding Regexes, however all I am getting is single instances:
regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)
mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)
#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:
print "Goal = ", mysearch2.group()
if mysearch3 is not None:
print "Assist = ", mysearch3.group()
if mysearch4 is not None:
print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
print "Yellow Card = ", mysearch5.group()
This works as long as there is only one instance of every element encountered in a string, however if a player was for example to score more than one goal, this code only returns one instance of 'Goal'.
Can anyone see what I am doing wrong?

You can try something like this:
import re
s = "here's an example Man of the Match match and a Red Card match, and another Red Card match"
patterns = [
'Man of the Match',
'Goal',
'Assist',
'Yellow Card',
'Red Card',
]
repattern = '|'.join(patterns)
matches = re.findall(repattern, s, re.IGNORECASE)
print matches # ['Man of the Match', 'Red Card', 'Red Card']

Some general overview on regex methods in python:
re.search | re.match
In your previous attempt, you tried to use re.search. This only returned one result, and as you'll see this isn't unusual. These two functions are used to identify if a line contains a certain regex. You'd use these for something like:
s = subprocess.check_output('ipconfig') # calls ipconfig and sends output to s
for line in s.splitlines():
if re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", str(line)):
# if line contains an IP address...
print(line)
You use re.match to specifically check if the regex matches at the BEGINNING of the string. This is usually used with a regex that matches the WHOLE string. For example:
lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
if repattern.match(line):
print(line)
# Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
# note if we'd used re.search for Age: 16, it would have found both lines!
The take away is that you use these two functions to select lines in a longer document (or any iterable)
re.findall | re.finditer
It seems in this case, you aren't trying to match a line, you're trying to pull some specifically-formatted information from the string. Let's see some examples of that.
s = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567','(555)987-6543','(555)135-7924']
re.finditer returns a generator instead of a list. You'd use this the same way you'd use xrange instead of range in Python2. re.findall(some_pattern, some_string) can make a GIANT list if there are a TON of matches. re.finditer will not.
other methods: re.split | re.sub
re.split is great if you have a number of things you need to split by. Imagine you had the string:
s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''
There's no great way to do that with str.split like you're used to, so instead do:
separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
# special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
re.sub is great in cases where you know something will be formatted regularly, but you're not sure exactly how. However, you REALLY want to make sure they're all formatted the same! This will be a little advanced and use several methods, but stick with me....
dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
new_sep = "/"
match_pat = re.compile(r'''
\d{1,2} # two digits
(.) # followed by a separator (capture)
\d{1,2} # two more digits
\1 # a backreference to that separator
\d{2}(?:\d{2})? # two digits and optionally four digits''', re.X)
for idx,date in enumerate(dates):
match = match_pat.match(date)
if match:
sep = match.group(1) # the separator
separators.append(sep)
else:
dates.pop(idx) # this isn't really a date, is it?
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
A slightly less advanced example, you can use re.sub with any sort of formatted expression and pass it a function to return! For instance:
def get_department(dept_num):
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
if hasattr(dept_num, 'group'): # then it's a match, not a number
dept_num = dept_num.group(0)
return departments.get(dept_num, "Unknown Dept")
file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file
dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance
Without using regex here you could do:
replaced_lines = []
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
for line in file.splitlines():
the_split_line = line.split(',')
replaced_lines.append(','.join(the_split_line[:-1]+ \
departments.get(the_split_line[-1], "Unknown Dept")))
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!
Instead we replace all that for loop and string splitting, list slicing, and string manipulation with a function and a re.sub call. In fact, if you use a lambda it's even easier!
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
re.sub(r'''\d+$''', lambda x: departments.get(x, "Unknown Dept"), file, flags=re.M)
# DONE!

Python Error"TypeError: coercing to Unicode: need string or buffer, list found"

The purpose of this code is to make a program that searches a persons name (on Wikipedia, specifically) and uses keywords to come up with reasons why that person is significant.
I'm having issues with this specific line "if fact_amount < 5 and (terms in sentence.lower()):" because I get this error ("TypeError: coercing to Unicode: need string or buffer, list found")
If you could offer some guidance it would be greatly appreciated, thank you.
import requests
import nltk
import re
#You will need to install requests and nltk
terms = ['pronounced'
'was a significant'
'major/considerable influence'
'one of the (X) most important'
'major figure'
'earliest'
'known as'
'father of'
'best known for'
'was a major']
names = ["Nelson Mandela","Bill Gates","Steve Jobs","Lebron James"]
#List of people that you need to get info from
for name in names:
print name
print '==============='
#Goes to the wikipedia page of the person
r = requests.get('http://en.wikipedia.org/wiki/%s' % (name))
#Parses the raw html into text
raw = nltk.clean_html(r.text)
#Tries to split each sentence.
#sort of buggy though
#For example St. Mary will split after St.
sentences = re.split('[?!.][\s]*',raw)
fact_amount = 0
for sentence in sentences:
#I noticed that important things came after 'he was' and 'she was'
#Seems to work for my sample list
#Also there may be buggy sentences, so I return 5 instead of 3
if fact_amount < 5 and (terms in sentence.lower()):
#remove the reference notation that wikipedia has
#ex [ 33 ]
sentence = re.sub('[ [0-9]+ ]', '', sentence)
#removes newlines
sentence = re.sub('\n', '', sentence)
#removes trailing and leading whitespace
sentence = sentence.strip()
fact_amount += 1
#sentence is formatted. Print it out
print sentence + '.'
print

You should be checking it the other way
sentence.lower() in terms
terms is list and sentence.lower() is a string. You can check if a particular string is there in a list, but you cannot check if a list is there in a string.

you might mean if any(t in sentence_lower for t in terms), to check whether any terms from terms list is in the sentence string.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using Regex to 'Clean Up' a List of Names [closed] - python

Related

Python: split a text into individual English sentences; retain the punctuation [closed]

Python -- split a string with multiple occurrences of same delimiter

Python regex to match Ledger/hledger account journal entry

What Regex to use in this example

Python Error"TypeError: coercing to Unicode: need string or buffer, list found"

Categories

Resources