Iterate and match all elements with regex - python

So I have something like this:
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
I want to replace every element with the first name so it would look like this:
data = ['Alice Smith', 'Tim', 'Uncle Neo']
So far I got:
for i in range(len(data)):
    if re.match('(.*) and|with|\&', data[i]):
        a = re.match('(.*) and|with|\&', data[i])
        data[i] = a.group(1)
But it doesn't seem to work. I think the problem is my pattern, but I can't figure out the right way to do this.

Use a list comprehension with re.split:
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]

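In your attempt, the alternatives separated by | need to be grouped with parentheses; as written, the pattern means "(.*) and" or "with" or "&". In any case, the approach is more complex than it needs to be.
I would just use re.sub to remove the separator word and everything after it: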
data = [re.sub(" (and|with|&) .*","",d) for d in data]
result:
['Alice Smith', 'Tim', 'Uncle Neo']
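To see why the original pattern fails, it may help to compare it with a grouped version. The following is only a minimal sketch; the names broken and fixed are made up for illustration, and the lazy (.*?) is my own choice rather than part of the question.

```python
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

# Without grouping, the pattern reads as: "(.*) and"  OR  "with"  OR  "&".
broken = re.compile(r'(.*) and|with|&')
# Grouping the alternatives (non-capturing) and making .* lazy.
fixed = re.compile(r'(.*?) (?:and|with|&) ')

# re.match anchors at the start of the string, so the bare "with" and "&"
# alternatives cannot match these inputs.
broken_hits = [broken.match(s) is not None for s in data]
fixed_names = [fixed.match(s).group(1) for s in data]

print(broken_hits)   # [True, False, False]
print(fixed_names)   # ['Alice Smith', 'Tim', 'Uncle Neo']
```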

You can try this:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
final_data = [re.sub(r'\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
Output:
['Alice Smith', 'Tim', 'Uncle Neo']

Simplify your approach to the following:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]
print(data)
The output:
['Alice Smith', 'Tim', 'Uncle Neo']
.*(?= (and|with|&)) - the (?= ...) part is a positive lookahead assertion: it ensures that the name/surname matched by .* is followed by a space and one of the items from the alternation group (and|with|&), without including them in the match
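Two things are worth checking with a lookahead like this: .* is greedy, so with several separator words it stops at the last one, and re.search returns None when no separator is present. A rough sketch of both points follows; the helper leading_name and the constants GREEDY and LAZY are hypothetical names, and the trailing space inside the lookahead is an assumption added here.

```python
import re

# Greedy: .* backs off only as far as the LAST separator word.
GREEDY = re.compile(r'.*(?= (?:and|with|&) )')
# Lazy: .*? stops at the FIRST separator word.
LAZY = re.compile(r'.*?(?= (?:and|with|&) )')

def leading_name(text, pattern=LAZY):
    """Return the text before the separator word, or the text itself if none is found."""
    m = pattern.search(text)
    return m.group() if m else text

print(leading_name('Tom and Jerry and Spike', GREEDY))  # Tom and Jerry
print(leading_name('Tom and Jerry and Spike'))          # Tom
print(leading_name('Solo'))                             # Solo
```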

Brief
I would suggest using Casimir's answer if possible, but, if you are not sure what word might follow (that is to say that and, with, and & are dynamic), then you can use this regex.
Note: This regex will not work for some special cases such as names with apostrophes ' or dashes -, but you can add them to the character list that you're searching for. This answer also depends on the name beginning with an uppercase character and the "union word" as I'll name it (and, with, &, etc.) not beginning with an uppercase character.
Code
Regex
^((?:[A-Z][a-z]*\s*)+)\s.*
Substitution
$1
Result
Input
Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31
Output
Alice Smith
Tim
Uncle Neo
Explanation
Assert position at the beginning of the string ^
Match an uppercase letter [A-Z]
Match any number of lowercase letters [a-z]*
Match any number of whitespace characters \s* (use a literal space followed by * instead if you'd prefer to match only spaces)
Match the above conditions between one and unlimited times, all captured into capture group 1 (...)+: where ... contains everything above
Match a whitespace character \s, followed by any character (except a newline) any number of times .*
$1: Replace with capture group 1
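In Python, the same substitution can be sketched with re.sub, where the replacement is written \1 rather than $1. The variable names below are illustrative only.

```python
import re

lines = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

# Group 1 collects the leading run of capitalized words; the rest is dropped.
pattern = re.compile(r'^((?:[A-Z][a-z]*\s*)+)\s.*')

# Python uses \1 (not $1) to refer to capture group 1 in the replacement.
names = [pattern.sub(r'\1', line) for line in lines]

print(names)  # ['Alice Smith', 'Tim', 'Uncle Neo']
```

A line that does not start with an uppercase letter is left unchanged, because the pattern does not match it.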

Related

Regex - removing everything after first word following a comma

I have a column that has name variations that I'd like to clean up. I'm having trouble with the regex expression to remove everything after the first word following a comma.
d = {'names':['smith,john s','smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)
Tried:
x['names'] = [re.sub(r'/.\s+[^\s,]+/','', str(x)) for x in x['names']]
Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']
Not sure why my regex isn't working, but any help would be appreciated.
You could try using a regex that looks for a comma, then an optional space, then only keeps the remaining word:
x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)
0 smith,john
1 smith, john
2 brown, bob
3 brown, bob
Name: names, dtype: object
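Try re.sub(r'(,\s*\w+).*$', r'\1', str(x)) instead.
Put the part you want to keep into capture group 1 and then restore it in the replacement. In Python, the pattern is not wrapped in / delimiters, and the backreference is written \1, not $1.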
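As a quick check without pandas, the sketch below runs both the original attempt and a capture-group version over a plain list. The variable names are made up for illustration.

```python
import re

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']

# The original attempt: the surrounding slashes are treated as literal
# characters in Python, so nothing in these strings matches.
untouched = [re.sub(r'/.\s+[^\s,]+/', '', n) for n in names]

# Keep the comma and the first word after it (group 1), drop the rest.
cleaned = [re.sub(r'(,\s*\w+).*$', r'\1', n) for n in names]

print(untouched == names)  # True
print(cleaned)             # ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```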

Capture the n previous words when matching a string

Let's say I have this text:
abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
I want to capture these personal names:
Mark Jones, Taylor Daniel Lautner, Allan Stewart Konigsberg Farrow.
Basically, when we find (P followed by any capital letter, we capture the n previous words that start with a capital letter.
What I have achieved so far is to capture just one previous word with this code: \w+(?=\s+(\(P+[A-Z])). But I couldn't evolve from that.
I appreciate it if someone can help :)
Regex pattern
\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]
In order to find all matching occurrences of the above regex pattern we can use re.findall
import re
text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""
matches = re.findall(r'\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', text)
>>> matches
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
Regex details
\b : Word boundary to prevent partial matches
((?:[A-Z]\w+\s?)+): First Capturing group
(?:[A-Z]\w+\s?)+: Non capturing group matches one or more times
[A-Z]: Matches a single uppercase letter from A to Z
\w+: Matches any word character one or more times
\s? : Matches any whitespace character zero or one times
\s : Matches a single whitespace character
\(: Matches the character ( literally
P : Matches the character P literally
[A-Z] : Matches a single uppercase letter from A to Z
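To illustrate what the \b contributes, here is a small comparison on the second sample line, with and without the word boundary. The variable names are hypothetical.

```python
import re

line = 'akslaskAS Taylor Daniel Lautner (PMB) blabla'

# With \b, a match may only start at the beginning of a word.
with_boundary = re.findall(r'\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', line)
# Without \b, the match can start inside "akslaskAS" at the capital "A".
without_boundary = re.findall(r'((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', line)

print(with_boundary)     # ['Taylor Daniel Lautner']
print(without_boundary)  # ['AS Taylor Daniel Lautner']
```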
With the samples shown, you could try the following, which uses Python's re library. First, findall is applied to the string var with the pattern (.*?)\s+\((?=P[A-Z]), which captures everything on a line before a ( that is followed by P and a capital letter; the results go into the list lst. Then re.sub removes the first run of non-whitespace characters, together with the whitespace after it, from each item, which leaves only the names.
import re
var="""abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)"""
lst = re.findall(r'(.*?)\s+\((?=P[A-Z])',var)
[re.sub(r'^\S+\s+','',s) for s in lst]
Output will be as follows:
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']

Python -- split a string with multiple occurrences of same delimiter

How can I take a string that looks like this
string = 'Contact name: John Doe     Contact phone: 222-333-4444'
and split the string on both colons? Ideally the output would look like:
['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
The real issue is that the name can be an arbitrary length. However, I think it might be possible to use re to split the string after a certain number of space characters (say at least 4, since there will likely always be at least 4 spaces between the end of any name and the beginning of Contact phone), but I'm not that good with regex. If someone could provide a possible solution (and an explanation so I can learn), that would be much appreciated.
You can use re.split:
import re
s = 'Contact name: John Doe     Contact phone: 222-333-4444'
new_s = re.split(r':\s|\s{2,}', s)
Output:
['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
Regex explanation:
:\s => matches an occurrence of ': '
| => evaluated as 'or', attempts to match either the pattern before or after it
\s{2,} => matches two or more whitespace characters
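Since the split result alternates between labels and values, one possible follow-up is to pair them into a dict. This is only a sketch that assumes the input always has an even number of parts and at least two spaces before each new label; the name record is illustrative.

```python
import re

s = 'Contact name: John Doe     Contact phone: 222-333-4444'

# Split on ": " or on any run of two or more whitespace characters.
parts = re.split(r':\s|\s{2,}', s)

# Even positions are labels, odd positions are the matching values.
record = dict(zip(parts[0::2], parts[1::2]))

print(parts)   # ['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
print(record)  # {'Contact name': 'John Doe', 'Contact phone': '222-333-4444'}
```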

strip white spaces and new lines when reading from file

I have the following code, that successfully strips end of line characters when reading from file, but doesn't do so for any leading and trailing white spaces (I want the spaces in between to be left!)
What is the best way to achieve this? (Note, this is a specific example, so not a duplicate of general methods to strip strings)
My code (try it with the test data "Mr Moose", which is not found; if you try "Mr Moose " with a space after Moose, it will work):
#A COMMON ERROR is leaving in blank spaces and then finding you cannot work with the data in the way you want!
"""Try the following program with the input: Mr Moose
...it doesn't work..........
but if you try "Mr Moose " (that is a space after Moose..."), it will work!
So how to remove both new lines AND leading and trailing spaces when reading from a file into a list. Note, the middle spaces between words must remain?
"""
alldata=[]
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
    for line in f.readlines():
        alldata.append((line.strip()))
print(alldata)
print()
print()
for x in alldata:
    teacher_names.append(x.split(delimiter)[col_num])
teacher=input("Enter teacher you are looking for:")
if teacher in teacher_names:
    print("found")
else:
    print("No")
Desired output, on producing the list alldata
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
i.e - remove all leading and trailing white space at the start, and before or after the delimiter. The spaces in between words such as Mr Moose, must be left.
Contents of teacherbook:
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English
Thanks in advance
You could use a regex:
txt='''\
Mr Moose : Maths
Mr Goose: History
Mrs Congenelipilling: English'''
>>> [re.sub(r'\s*:\s*', ':', line).strip() for line in txt.splitlines()]
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
So your code becomes:
import re
col_num=0
teacher_names=[]
delimiter=":"
with open("teacherbook.txt") as f:
    alldata=[re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip() for line in f]
print(alldata)
for x in alldata:
    teacher_names.append(x.split(delimiter)[col_num])
print(teacher_names)
Prints:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
['Mr Moose', 'Mr Goose', 'Mrs Congenelipilling']
The key part is the regex:
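re.sub(r'\s*{}\s*'.format(delimiter), delimiter, line).rstrip()
The first \s* matches zero or more whitespace characters before the delimiter.
The {} is the placeholder where the delimiter is inserted.
The second \s* matches zero or more whitespace characters after the delimiter.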
For an all Python solution, I would use str.partition to get the left hand and right hand side of the delimiter then strip the whitespace as needed:
alldata=[]
with open("teacherbook.txt") as f:
    for line in f:
        lh,sep,rh=line.rstrip().partition(delimiter)
        alldata.append(lh.rstrip() + sep + rh.lstrip())
Same output
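For experimenting without the file, the partition approach can be sketched on an in-memory list. The helper normalize_line and the sample raw_lines are hypothetical, and this version strips both sides of each half, which goes slightly beyond the snippet above.

```python
def normalize_line(line, delimiter=":"):
    """Strip whitespace around the line and around the first delimiter."""
    left, sep, right = line.partition(delimiter)
    # sep is '' when the delimiter is absent, so the line is just stripped.
    return left.strip() + sep + right.strip()

# Made-up sample lines standing in for the contents of teacherbook.txt.
raw_lines = ["  Mr Moose : Maths\n", "Mr Goose: History \n", "Mrs Congenelipilling: English"]
alldata = [normalize_line(line) for line in raw_lines]

print(alldata)  # ['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
```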
Another suggestion. Your data is more suited to a dict than a list.
You can do:
di={}
with open("teacherbook.txt") as f:
    for line in f:
        lh,sep,rh=line.rstrip().partition(delimiter)
        di[lh.rstrip()]=rh.lstrip()
Or comprehension version:
with open("teacherbook.txt") as f:
    di={lh.rstrip():rh.lstrip()
        for lh,_,rh in (line.rstrip().partition(delimiter) for line in f)}
Then access like this:
>>> di['Mr Moose']
'Maths'
No need to use readlines(): you can simply iterate over the file object to get each line, and use strip() to remove the \n and any surrounding whitespace. For example, you can use this list comprehension:
with open('teacherbook.txt') as f:
    alldata = [':'.join([value.strip() for value in line.split(':')])
               for line in f]
print(alldata)
Output:
['Mr Moose:Maths', 'Mr Goose:History', 'Mrs Congenelipilling:English']
Change:
teacher_names.append(x.split(delimiter)[col_num])
to:
teacher_names.append(x.split(delimiter)[col_num].strip())
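The reason the lookup fails without that strip() can be seen on a single line: the space before the delimiter stays attached to the name. A small sketch, with illustrative variable names:

```python
delimiter = ":"
line = "Mr Moose : Maths"

# Splitting keeps the space that sits before the delimiter.
raw_name = line.split(delimiter)[0]
clean_name = raw_name.strip()

print(repr(raw_name))              # 'Mr Moose '
print("Mr Moose" in [raw_name])    # False
print("Mr Moose" in [clean_name])  # True
```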
"remove all leading and trailing white space at the start, and before or after the delimiter. The spaces in between words such as Mr Moose, must be left."
You can split your string at the delimiter, strip the whitespace from them, and concatenate them back together again:
for line in f.readlines():
    new_line = ':'.join([s.strip() for s in line.split(':')])
    alldata.append(new_line)
Example:
>>> lines = [' Mr Moose : Maths', ' Mr Goose : History ']
>>> lines
[' Mr Moose : Maths', ' Mr Goose : History ']
>>> data = []
>>> for line in lines:
...     new_line = ':'.join([s.strip() for s in line.split(':')])
...     data.append(new_line)
...
>>> data
['Mr Moose:Maths', 'Mr Goose:History']
You can do it easily with regex - re.sub:
import re
re.sub(r"[\n \t]+$", "", "aaa \t asd \n ")
Out[17]: 'aaa \t asd'
first argument pattern - [all characters you want to remove]
+ - one or more matches
$ - end of the string
https://docs.python.org/2/library/re.html
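With string.rstrip('something') you can remove any trailing characters that appear in 'something' (it is treated as a set of characters, not as a substring) from the right end of the string, like this:
a = 'Mr Moose \n'
print(a.rstrip(' \n'))  # prints 'Mr Moose'; without rstrip the trailing space and newline would be kept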
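Note that the argument to rstrip is treated as a set of characters rather than a suffix. A short sketch that shows the difference (the 'banana' string is just an illustration):

```python
a = 'Mr Moose \n'

# Both trailing characters are in the set, so both are removed.
stripped_set = a.rstrip(' \n')
# With no argument, rstrip removes all trailing whitespace.
stripped_default = a.rstrip()
# Every trailing 'a' or 'n' is removed, not just the suffix "an".
stripped_letters = 'banana'.rstrip('an')

print(repr(stripped_set))      # 'Mr Moose'
print(repr(stripped_default))  # 'Mr Moose'
print(stripped_letters)        # b
```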

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing it because his state name contains two words: SOUTH CAROLINA
Changing your second regex to this should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
This is an optional, non-capturing group that matches a space followed by one or more word characters (letters, digits, or underscores).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly, after you remove the NOs and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3, and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches an # sign followed by at least one but as few characters as possible until a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. This only works because of the next piece, which MUST have two words.
((?:[\w]+?\s){2})
This matches and captures exactly two words, where a word is defined as as few word characters as possible followed by a whitespace character.
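In Python, the groups above are read with m.group(2), m.group(3) and m.group(4) rather than $2 to $4. The sketch below runs the master regex on a small hand-cleaned sample; the string cleaned and the names master and records are assumptions for illustration, and each entry is assumed to end with whitespace.

```python
import re

# Hand-cleaned sample (NOs and dashes already removed), one entry per line.
cleaned = "#jimdemint SOUTH CAROLINA Jim DeMint\n#orrinhatch UTAH Orrin Hatch\n"

master = re.compile(r'((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))')

# Each group keeps its trailing whitespace, so strip it off.
records = [
    (m.group(2).strip(), m.group(3).strip(), m.group(4).strip())
    for m in master.finditer(cleaned)
]

for handle, state, name in records:
    print(handle, '|', state, '|', name)
# #jimdemint | SOUTH CAROLINA | Jim DeMint
# #orrinhatch | UTAH | Orrin Hatch
```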
text=re.sub(r' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)
