Using str.split for pandas dataframe values based on parentheses location - python

Let's say I have the following dataframe series df['Name'] column:
Name
'Jerry'
'Adam (and family)'
'Paul and Hellen (and family):\n'
'John and Peter (and family):/n'
How would I remove all the contents in Name after the first parentheses?
df['Name']= df['Name'].str.split("'(").str[0]
doesn't seem to work and I don't understand why?
The output I want is
Name
'Jerry'
'Adam'
'Paul and Hellen'
'John and Peter'
so everything after the parentheses is deleted.

Solution with split: the ( must be escaped with \ because the pattern is treated as a regular expression:
df['Name'] = df['Name'].str.split(r"\s+\(").str[0]
print (df)
Name
0 'Jerry'
1 'Adam
2 'Paul and Hellen
3 'John and Peter
Solution with regex and replace:
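# regex=True is required in pandas 2.0+, where str.replace matches literally by default
df['Name'] = df['Name'].str.replace(r"\s+\(.*$", "", regex=True)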
print (df)
Name
0 'Jerry'
1 'Adam
2 'Paul and Hellen
3 'John and Peter
\s+\(.*$ means: replace everything from the whitespace before the first ( (one or more whitespace characters) through the end of the string $ with "", the empty string.
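The same idea can be tried outside pandas with the standard re module. This is only a sketch: the sample values are assumed to mirror the question's column without the literal quote characters, and the pattern uses \s* (optional whitespace) plus re.DOTALL so that a trailing newline is removed as well.

```python
import re

# Hypothetical sample values modelled on the question's column
names = ["Jerry", "Adam (and family)", "Paul and Hellen (and family):\n"]

# \s* = optional whitespace, \( = literal parenthesis, .* = rest of the string.
# re.DOTALL lets . match the trailing newline too.
pattern = re.compile(r"\s*\(.*", re.DOTALL)

cleaned = [pattern.sub("", name) for name in names]
print(cleaned)  # ['Jerry', 'Adam', 'Paul and Hellen']
```

Values without a parenthesis are left unchanged because the pattern simply does not match them.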

Use regular expression:
>>> import re
>>> s = 'Adam (and family)'  # not named str, which would shadow the built-in
>>> result = re.sub(r"( \().*$", '', s)
>>> print(result)
Adam

Related

Regex - removing everything after first word following a comma

I have a column that has name variations that I'd like to clean up. I'm having trouble with the regex expression to remove everything after the first word following a comma.
d = {'names':['smith,john s','smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)
Tried:
x['names'] = [re.sub(r'/.\s+[^\s,]+/','', str(x)) for x in x['names']]
Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']
Not sure why my regex isn't working, but any help would be appreciated.
You could try using a regex that looks for a comma, then optional whitespace, then keeps only the next word (regex=True is needed in pandas 2.0+, where the default is a literal match):
x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)
0 smith,john
1 smith, john
2 brown, bob
3 brown, bob
Name: names, dtype: object
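Try re.sub(r'(,\s*\w+).*$', r'\1', str(x)). Python regex patterns take no surrounding / delimiters, and a backreference in the replacement is written \1 rather than $1.
Put the part you want to keep into capture group 1, then restore it in the replacement.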
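A minimal sketch of the capture-group approach in plain Python, using the question's sample names. In Python's re module the group is referenced in the replacement as \1:

```python
import re

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']

# Group 1 keeps the comma, optional whitespace and the first word;
# .*$ matches whatever follows, and the replacement restores only group 1.
cleaned = [re.sub(r'(,\s*\w+).*$', r'\1', name) for name in names]
print(cleaned)  # ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```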

How to replace certain parts of a string using a list?

namelist = ['John', 'Maria']
e_text = 'John is hunting, Maria is cooking'
I need to replace 'John' and 'Maria'. How can I do this?
I tried:
for name in namelist:
    if name in e_text:
        e_text.replace(name, 'replaced')
But it only works with 'John'. The output is: 'replaced is hunting, Maria is cooking'. How can I replace the two names?
Thanks.
Strings are immutable in python, so replacements don't modify the string, only return a modified string. You should reassign the string:
for name in namelist:
    e_text = e_text.replace(name, "replaced")
You don't need the if name in e_text check since replace already does nothing if it's not found.
You could form a regex alternation of names and then re.sub on that:
import re
namelist = ['John', 'Maria']
pattern = r'\b(?:' + '|'.join(namelist) + r')\b'
e_text = 'John is hunting, Maria is cooking'
output = re.sub(pattern, 'replaced', e_text)
print(e_text + '\n' + output)
This prints:
John is hunting, Maria is cooking
replaced is hunting, replaced is cooking
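The alternation approach can be wrapped in a small helper. This is a sketch: replace_names is a hypothetical name, the extra 'Johnny' in the sample text is added only to show the effect of \b, and re.escape is included as a safeguard in case a name contains regex metacharacters.

```python
import re

def replace_names(text, names, replacement='replaced'):
    # Build \b(?:John|Maria)\b so that only whole words are replaced
    pattern = r'\b(?:' + '|'.join(re.escape(name) for name in names) + r')\b'
    return re.sub(pattern, replacement, text)

result = replace_names('John is hunting, Maria is cooking, Johnny is sleeping',
                       ['John', 'Maria'])
print(result)  # replaced is hunting, replaced is cooking, Johnny is sleeping
```

Unlike a plain str.replace loop, the word boundaries leave 'Johnny' untouched.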

Remove Numbers and Turn into a List

A clearer way of asking the question is:
If I have a string as follows:
'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
How do I turn that string into:
Palm Beach, Gavea, Maronas, Iowa, Orange Park
So that is, make each item in the list 'title'(ie. Uppercase first letter and the rest lower case), delete the numbers and the word 'Race'.
I am setting up to export to Excel.
Thanks in advance - Angus
You can do it without importing any library:
races = """PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5"""
''.join([ch if not ch.isdigit() else 'xxx' for ch in races.replace('Race ','')]).split('xxx')
Output:
['PALM BEACH.', 'Gavea', 'Maronas', 'IOWA', 'ORANGE PARK.', '']
You can use re.split and some string manipulation:
>>> import re
>>> s = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
>>> # Split on the word Race followed by a number
>>> race_names = re.split(r'Race \d+', s)
>>> def format_name(name):
...     # Remove the trailing period on some race names
...     name = name.rstrip('.')
...     # Change name to title case
...     name = name.title()
...     return name
...
>>> # Format the name and remove any empty entries in the list
>>> race_names = [format_name(name) for name in race_names if name]
>>> race_names
['Palm Beach', 'Gavea', 'Maronas', 'Iowa', 'Orange Park']
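If the goal is the single comma-separated string from the question, the steps can be combined into one function. This is a sketch: clean_races is a hypothetical name, and it assumes every entry is followed by "Race" and a number, as in the sample string.

```python
import re

def clean_races(raw):
    # Split on "Race <number>", drop empty pieces, strip trailing periods, title-case
    parts = re.split(r'Race \d+', raw)
    return [part.rstrip('.').title() for part in parts if part]

s = 'PALM BEACH.Race 6GaveaRace 5MaronasRace 7IOWARace 3ORANGE PARK.Race 5'
joined = ', '.join(clean_races(s))
print(joined)  # Palm Beach, Gavea, Maronas, Iowa, Orange Park
```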

Iterate and match all elements with regex

So I have something like this:
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
I want to replace every element with the first name so it would look like this:
data = ['Alice Smith', 'Tim', 'Uncle Neo']
So far I got:
for i in range(len(data)):
    if re.match('(.*) and|with|\&', data[i]):
        a = re.match('(.*) and|with|\&', data[i])
        data[i] = a.group(1)
But it doesn't seem to work, I think it's because of my pattern but I can't figure out the right way to do this.
Use a list comprehension with re.split:
result = [re.split(r' (?:and|with|&) ', x)[0] for x in data]
The | needs grouping with parentheses in your attempt. Anyway, it's too complex.
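To illustrate the grouping problem: without parentheses, | splits the whole pattern into three alternatives, '(.*) and', 'with' and '&', and re.match only tries them at the start of the string. The sketch below compares that with a grouped version (the grouped pattern is one possible fix, not necessarily the one the asker had in mind):

```python
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

# Ungrouped: parsed as '(.*) and' OR 'with' OR '&'
ungrouped = re.compile(r'(.*) and|with|&')
# Grouped: a lazily matched name, then one separator word between spaces
grouped = re.compile(r'(.*?) (?:and|with|&) ')

# No ' and' in this string, and it does not start with 'with' or '&'
print(ungrouped.match(data[1]))  # None

first_names = [grouped.match(item).group(1) for item in data]
print(first_names)  # ['Alice Smith', 'Tim', 'Uncle Neo']
```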
I would just use re.sub to remove the separation word & the rest:
data = [re.sub(" (and|with|&) .*","",d) for d in data]
result:
['Alice Smith', 'Tim', 'Uncle Neo']
You can try this:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
final_data = [re.sub(r'\sand.*?$|\s&.*?$|\swith.*?$', '', i) for i in data]
Output:
['Alice Smith', 'Tim', 'Uncle Neo']
Simplify your approach to the following:
import re
data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']
data = [re.search(r'.*(?= (and|with|&))', i).group() for i in data]
print(data)
The output:
['Alice Smith', 'Tim', 'Uncle Neo']
.*(?= (and|with|&)) - positive lookahead assertion, ensures that name/surname .* is followed by any item from the alternation group (and|with|&)
Brief
I would suggest using Casimir's answer if possible, but, if you are not sure what word might follow (that is to say that and, with, and & are dynamic), then you can use this regex.
Note: This regex will not work for some special cases such as names with apostrophes ' or dashes -, but you can add them to the character list that you're searching for. This answer also depends on the name beginning with an uppercase character and the "union word" as I'll name it (and, with, &, etc.) not beginning with an uppercase character.
Code
Regex
^((?:[A-Z][a-z]*\s*)+)\s.*
Substitution
$1
Result
Input
Alice Smith and Bob
Tim with Sam Dunken
Uncle Neo & 31
Output
Alice Smith
Tim
Uncle Neo
Explanation
Assert position at the beginning of the string: ^
Match an uppercase letter: [A-Z]
Match any number of lowercase letters: [a-z]*
Match any number of whitespace characters: \s* (use a literal space followed by * instead if you only want to allow spaces)
Match the above between one and unlimited times, all captured into capture group 1: (...)+, where ... contains everything above
Match a whitespace character \s, followed by any character (except a newline) any number of times: .*
$1: replace with capture group 1 (in Python's re.sub the replacement is written \1)
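In Python this regex can be applied with re.sub, where the substitution is written \1. A short sketch using the question's data:

```python
import re

data = ['Alice Smith and Bob', 'Tim with Sam Dunken', 'Uncle Neo & 31']

# Group 1 collects the leading capitalised words; the rest of the line is discarded
pattern = re.compile(r'^((?:[A-Z][a-z]*\s*)+)\s.*')

names = [pattern.sub(r'\1', item) for item in data]
print(names)  # ['Alice Smith', 'Tim', 'Uncle Neo']
```

The same limitations apply: the name must start with an uppercase letter and the joining word must not.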

What is efficient way to match words in string?

Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1, names) # returns False: 'James' alone is not in the list
is_name_in_text(text2, names) # returns 'James John'
is_name_in_text(text3, names) # returns 'Paul'
is_name_in_text() checks whether any name from the list appears in the text.
The easy way is to test each name against the text with the in operator, but the list has 5,000 items, so that is not efficient. I could split the text into words and check whether each word is in the list, but that will not work when a name consists of more than one word. Line number 7 will fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
...     for possible_name in set(findnames.findall(text)):
...         if possible_name in names:
...             return possible_name
...     return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
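A fuller sketch of the alternation idea, shaped like the question's is_name_in_text. The three-item list stands in for the real 5K list, and sorting the alternatives longest first is an added precaution (not part of the answer above) so that a longer name is preferred over a shorter one that is its prefix:

```python
import re

names = ['James John', 'Robert David', 'Paul']  # hypothetical short stand-in list

# Longest first, so a longer name wins over any shorter name that is its prefix
alternatives = sorted(names, key=len, reverse=True)
names_re = re.compile(
    r'\b(?:' + '|'.join(re.escape(name) for name in alternatives) + r')\b'
)

def is_name_in_text(text):
    # Return the first matching name, or False when none is found
    match = names_re.search(text)
    return match.group(0) if match else False

print(is_name_in_text('I saw James today'))       # False
print(is_name_in_text('I saw James John today'))  # James John
print(is_name_in_text('I met Paul'))              # Paul
```

Compiling the pattern once and reusing it avoids rebuilding the alternation for every text.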
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.
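One way to pull candidate names out of the text without a regex is to test every run of consecutive words against the set. This is only a sketch under the assumption that names are at most max_words words long (a hypothetical parameter) and that words are separated by whitespace with no attached punctuation:

```python
def is_name_in_text(text, names, max_words=2):
    words = text.split()
    # Try longer word windows first so 'James John' is found before 'James'
    for size in range(max_words, 0, -1):
        for start in range(len(words) - size + 1):
            candidate = ' '.join(words[start:start + size])
            if candidate in names:  # O(1) on average for a set
                return candidate
    return False

names = set(['James John', 'Robert David', 'Paul'])
print(is_name_in_text('I saw James today', names))       # False
print(is_name_in_text('I saw James John today', names))  # James John
print(is_name_in_text('I met Paul', names))              # Paul
```

The cost depends on the length of the text rather than on the 5,000 names, since each candidate is a single set lookup.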
