How can I take a string that looks like this
string = 'Contact name: John Doe    Contact phone: 222-333-4444'
and split the string on both colons? Ideally the output would look like:
['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
The real issue is that the name can be of arbitrary length. However, I think it might be possible to use re to split the string after a certain number of space characters (say at least 4, since there will likely always be at least 4 spaces between the end of any name and the beginning of Contact phone), but I'm not that good with regex. If someone could provide a possible solution (and an explanation so I can learn), that would be greatly appreciated.
You can use re.split:
import re
s = 'Contact name: John Doe    Contact phone: 222-333-4444'
new_s = re.split(r':\s|\s{2,}', s)
Output:
['Contact name', 'John Doe', 'Contact phone', '222-333-4444']
Regex explanation:
:\s => matches an occurrence of ': '
| => evaluated as 'or', attempts to match either the pattern before or after it
\s{2,} => matches two or more whitespace characters
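As a small follow-on sketch, the flat list returned by re.split can be paired up into a dictionary. This assumes the fields always alternate label, value, and that the real string has at least two whitespace characters between the name and Contact phone (the four spaces below are an assumption taken from the question's description). The names parts and contact are hypothetical.

```python
import re

# Assumption: the name and the next label are separated by 2+ spaces.
s = 'Contact name: John Doe    Contact phone: 222-333-4444'

# Split on a colon followed by whitespace, or on a run of 2+ whitespace chars.
parts = re.split(r':\s|\s{2,}', s)

# Even positions are labels, odd positions are their values.
contact = dict(zip(parts[::2], parts[1::2]))
print(contact)
```

Single spaces inside the name are left alone, because neither alternative matches a lone space that is not preceded by a colon.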
I have a column that has name variations that I'd like to clean up. I'm having trouble with the regex expression to remove everything after the first word following a comma.
d = {'names':['smith,john s','smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)
Tried:
x['names'] = [re.sub(r'/.\s+[^\s,]+/','', str(x)) for x in x['names']]
Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']
Not sure why my regex isn't working, but any help would be appreciated.
You could try using a regex that looks for a comma, then an optional space, then only keeps the remaining word:
x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)
0 smith,john
1 smith, john
2 brown, bob
3 brown, bob
Name: names, dtype: object
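Try re.sub(r'(,\s*\w+).*$', r'\1', str(x))...
This puts the part you want to keep (the comma and the first word after it) into capture group 1 and restores it in the replacement. Note that Python patterns take no / delimiters, and the backreference is written \1 rather than $1.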
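For reference, here is a minimal sketch of the capture-group approach using plain re.sub on a list, without pandas. The sample names come from the question; the variable name cleaned is hypothetical.

```python
import re

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']

# Group 1 captures the comma, optional whitespace, and the first word after it.
# The trailing .*$ consumes whatever follows, and \1 puts group 1 back.
cleaned = [re.sub(r'(,\s*\w+).*$', r'\1', name) for name in names]
print(cleaned)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```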
So say, I have a sentence as follows:
sent = "My name is xyz and I got my name from my parents. My email address is nomail#gmail.com"
I want to get all the words in this sentence that start with a vowel, so words like is, I, is. This is my regular expression so far and it isn't working.
re.findall('^(aeiou|AEIOU)[\w|\s].',sent)
This is the result I get
['. ', '..', '.s', '#g', '.c']
Any help would be appreciated.
First of all, (aeiou|AEIOU) matches the literal text "aeiou" or "AEIOU" rather than a single vowel, the ^ anchors the match to the start of the string, and you are not checking for word boundaries. Try this:
r"\b[aeiouAEIOU]\w*"
You can use re.findall with re.I:
import re
sent = "My name is xyz and I got my name from my parents. My email address is nomail#gmail.com"
result = re.findall(r'(?<=\W)[aeiou]\w+|(?<=\W)[aeiou]', sent, re.I)
Output:
['is', 'and', 'I', 'email', 'address', 'is']
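One caveat with the lookbehind (?<=\W) is that it needs a non-word character before the vowel, so a vowel-initial word at the very start of the string is skipped. A word-boundary version, sketched below, covers that case too. The second sentence is a made-up example.

```python
import re

sent = ("My name is xyz and I got my name from my parents. "
        "My email address is nomail#gmail.com")

# \b asserts a word boundary, which also holds at the start of the string.
vowel_word = re.compile(r'\b[aeiou]\w*', re.I)

result = vowel_word.findall(sent)
print(result)  # ['is', 'and', 'I', 'email', 'address', 'is']

# Hypothetical sentence that starts with a vowel-initial word.
at_start = vowel_word.findall('An apple fell')
print(at_start)  # ['An', 'apple']
```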
So I have something like this:
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
I want to replace every element with the first name so it would look like this:
data = ['Alice Smith', 'Tim', 'Uncle Neo']
So far I got:
for i in range(len(data)):
    if re.match('(.*) and|with|\&', data[i]):
        a = re.match('(.*) and|with|\&', data[i])
        data[i] = a.group(1)
But it doesn't seem to work, I think it's because of my pattern but I can't figure out the right way to do this.
Use a list comprehension with re.split:
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
The | needs grouping with parentheses in your attempt. Anyway, it's too complex.
I would just use re.sub to remove the separation word & the rest:
data = [re.sub(" (and|with|&) .*","",d) for d in data]
result:
['Alice Smith', 'Tim', 'Uncle Neo']
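To illustrate the grouping point, here is a small sketch comparing the question's ungrouped pattern with a grouped one. Without parentheses, the pattern reads as three separate alternatives: "(.*) and", "with", or "&". Since re.match only tries position 0, the last two can never match these inputs. The grouped pattern is one possible fix and not the only one. The variable names are hypothetical.

```python
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

# As written in the question: alternation splits the whole pattern in three.
ungrouped = re.compile(r'(.*) and|with|&')
# Grouped: a lazy name, then a space, one separator word, and another space.
grouped = re.compile(r'(.*?) (?:and|with|&) ')

ungrouped_hits = [bool(ungrouped.match(d)) for d in data]
first_names = [grouped.match(d).group(1) for d in data]

print(ungrouped_hits)  # [True, False, False]
print(first_names)     # ['Alice Smith', 'Tim', 'Uncle Neo']
```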
You can try this:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
final_data = [re.sub(r'\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
Output:
['Alice Smith', 'Tim', 'Uncle Neo']
Simplify your approach to the following:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]
print(data)
The output:
['Alice Smith', 'Tim', 'Uncle Neo']
.*(?= (and|with|&)) - positive lookahead assertion, ensures that name/surname .* is followed by any item from the alternation group (and|with|&)
Brief
I would suggest using Casimir's answer if possible, but, if you are not sure what word might follow (that is to say that and, with, and & are dynamic), then you can use this regex.
Note: This regex will not work for some special cases such as names with apostrophes ' or dashes -, but you can add them to the character list that you're searching for. This answer also depends on the name beginning with an uppercase character and the "union word" as I'll name it (and, with, &, etc.) not beginning with an uppercase character.
Code
Regex
^((?:[A-Z][a-z]*\s*)+)\s.*
Substitution
$1
Result
Input
Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31
Output
Alice Smith
Tim
Uncle Neo
Explanation
Assert position at the beginning of the string ^
Match a capital alpha character [A-Z]
Match any number of lowercase alpha characters [a-z]*
Match any number of whitespace characters \s* (you can use a literal space followed by * instead if you only want to allow spaces)
Match the above conditions between one and unlimited times, all captured into capture group 1 (...)+: where ... contains everything above
Match a whitespace character, followed by any character (except new line) any number of times
$1: Replace with capture group 1 (in Python's re.sub the same backreference is written \1)
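Translated to Python, the regex and substitution above might look like the sketch below. It carries the same assumptions as the answer: names start with an uppercase letter, the "union word" does not, and apostrophes or dashes are not handled. The variable names are hypothetical.

```python
import re

lines = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

# Group 1: one or more capitalised words. Then one whitespace character
# and the rest of the line, which gets discarded.
pattern = re.compile(r'^((?:[A-Z][a-z]*\s*)+)\s.*')

# \1 is Python's spelling of the $1 substitution shown above.
result = [pattern.sub(r'\1', line) for line in lines]
print(result)  # ['Alice Smith', 'Tim', 'Uncle Neo']
```

The group first grabs the trailing space as well, then gives it back through backtracking so that the following \s can match, which is why no trailing space ends up in the result.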
I am trying to Parse the following string:
EDITED to account for spaces...
['THE LOCATION', 'THE NAME', 'THE JOB', 'THE AREA')]
Right now I use regular expressions and split the data with the comma into a list
InfoL = re.split(",", Info)
However my output is
'THE LOCATION'
'THE NAME'
'THE JOB'
'THE AREA')]
Looking to have the output as follows
THE LOCATION
THE NAME
THE JOB
THE AREA
Thoughts?
One possibility is to use strip() to remove the unwanted characters:
In [18]: s="['LOCATION', 'NAME', 'JOB', 'AREA')]"
In [19]: print('\n'.join(tok.strip("[]()' ") for tok in s.split(',')))
LOCATION
NAME
JOB
AREA
Like your original solution, this will break if any of the strings are allowed to contain commas.
P.S. If that closing parenthesis in your example is a typo, you might be able to use ast.literal_eval() on the string without it (after import ast):
In [22]: print('\n'.join(ast.literal_eval(s)))
LOCATION
NAME
JOB
AREA
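A short sketch of the ast.literal_eval() route, assuming the stray parenthesis really is a typo. With the parenthesis left in, the string is not a valid Python literal and parsing fails. The variable names are hypothetical.

```python
import ast

with_typo = "['LOCATION', 'NAME', 'JOB', 'AREA')]"
without_typo = "['LOCATION', 'NAME', 'JOB', 'AREA']"

# The mismatched ')' makes the string unparseable as a literal.
try:
    ast.literal_eval(with_typo)
    parsed_with_typo = True
except (SyntaxError, ValueError):
    parsed_with_typo = False

# With the typo removed, the string evaluates to an ordinary list of strings.
items = ast.literal_eval(without_typo)
print(parsed_with_typo)
print('\n'.join(items))
```

Unlike splitting on commas, this keeps working when an item itself contains a comma, because the quoting is parsed properly.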
InfoL = re.split(r"[, '()\[\]]+", Info)
Try this code snippet:
import re
tmpS = "['THE LOCATION', 'THE NAME', 'THE JOB', 'THE AREA')]"
# Remove everything except word characters, whitespace, and commas
tmpS = re.sub(r'[^\w\s,]+', '', tmpS)
print(tmpS)  # Result -> THE LOCATION, THE NAME, THE JOB, THE AREA
for s in tmpS.split(','):
    print(s.strip())
O/P:
THE LOCATION
THE NAME
THE JOB
THE AREA
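As an alternative not taken from the answers above, the quoted items can be pulled out directly with re.findall instead of stripping the punctuation around them. This is only a sketch, and it assumes every item is wrapped in single quotes and contains no single quote itself. The variable names are hypothetical.

```python
import re

info = "['THE LOCATION', 'THE NAME', 'THE JOB', 'THE AREA')]"

# Capture whatever sits between each pair of single quotes.
items = re.findall(r"'([^']*)'", info)

for item in items:
    print(item)
```

Because only the quoted text is captured, spaces inside an item and the stray closing parenthesis need no special handling.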
Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1,names) # this returns False; 'James' is not in the list
is_name_in_text(text2,names) # this returns 'James John'
is_name_in_text(text3,names) # this returns 'Paul'
is_name_in_text() checks whether any name from the list appears in the text.
The easy way to do this is to check each name against the text with the in operator, but the list has 5,000 items, so it is not efficient. I could split the text into words and check whether each word is in the list, but this is not going to work when a name is more than one word long. Line number 7 will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
        for possible_name in set(findnames.findall(text)):
            if possible_name in names:
                return possible_name
        return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
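A self-contained sketch of the alternation approach, wrapped so that it returns the same values the question asks for. Sorting the names longest-first is an added precaution (an assumption, not part of the answer) so that a longer name is tried before a shorter one that is its prefix. The three-name list stands in for the real 5K-item list.

```python
import re

names = ['James John', 'Robert David', 'Paul']

# Longest names first, so a longer alternative wins over a shorter prefix.
alternatives = sorted(names, key=len, reverse=True)
names_re = re.compile(
    r'\b(?:' + '|'.join(re.escape(name) for name in alternatives) + r')\b'
)

def is_name_in_text(text):
    """Return the first listed name found in text, or False if none is."""
    match = names_re.search(text)
    return match.group(0) if match else False

print(is_name_in_text('I saw James today'))       # False
print(is_name_in_text('I saw James John today'))  # James John
print(is_name_in_text('I met Paul'))              # Paul
```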
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
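One possible "mechanism of pulling the names out" is to slide a window of one to N words over the text and test each candidate against the set. This is a sketch under assumptions, not part of the answers above: the max_words parameter is hypothetical, names are taken to be separated by single spaces, and punctuation attached to a word would need stripping first.

```python
names = set(['James John', 'Robert David', 'Paul'])

def is_name_in_text(text, names, max_words=2):
    """Return the first word n-gram of text found in names, or False."""
    words = text.split()
    # Try longer candidates first so full names beat partial ones.
    for size in range(max_words, 0, -1):
        for start in range(len(words) - size + 1):
            candidate = ' '.join(words[start:start + size])
            if candidate in names:  # O(1) average-case set lookup
                return candidate
    return False

print(is_name_in_text('I saw James today', names))       # False
print(is_name_in_text('I saw James John today', names))  # James John
print(is_name_in_text('I met Paul', names))              # Paul
```

The work then scales with the length of the text rather than with the size of the name list.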