I need to clean a list of strings containing names: first remove titles, then trailing bits like 's. The code below works, but I'd like to rewrite it as two list comprehensions. My attempt, [name.replace(e, '') for name in names_ for e in replace], didn't work; I'm definitely missing something. I'd appreciate your help!
names = ['Mrs Marple', 'Maj Gen Smith', "Tony Dobson's"]
replace = ['Mrs ', 'Maj ', 'Gen ']
names_new = []
for name in names:
    for e in replace:
        name = name.replace(e, '')
    names_new.append(name)

names_final = []
for name in names_new:
    if name.endswith("'s"):
        name = name[:-2]
        names_final.append(name)
    else:
        names_final.append(name)

print(names_final)
You can use re.sub() to do exactly what you want:
import re
names = ['Mrs Marple', 'Maj Gen Smith', "Tony Dobson's"]
replace = ['Mrs ', 'Maj ', 'Gen ']
names = [re.sub(r'(Mrs\s|Maj\s|Gen\s|\'s$)', '', x) for x in names]
print(names)
Output:
['Marple', 'Smith', 'Tony Dobson']
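If the replace list isn't fixed, the same pattern can be built from it programmatically. A sketch, using re.escape to guard against regex metacharacters in the titles:
import re

names = ['Mrs Marple', 'Maj Gen Smith', "Tony Dobson's"]
replace = ['Mrs ', 'Maj ', 'Gen ']

# Join the escaped titles into one alternation, plus the trailing 's rule
pattern = '|'.join(map(re.escape, replace)) + r"|'s$"
print([re.sub(pattern, '', x) for x in names])  # ['Marple', 'Smith', 'Tony Dobson']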
The problem is the name = name.replace(e, '') statement in the for loop: we can't use assignment inside a comprehension, so you wrote name.replace(e, ''), but replace() is not in-place because Python strings are immutable, so the result was discarded.
The solution I have written is based on reduce: for each name, it applies the replacements for all the elements of the replace sequence in turn.
from functools import reduce
names = ['Mrs Marple', 'Maj Gen Smith', "Tony Dobson's"]
replace = ['Mrs ','Maj ','Gen ']
result = [reduce(lambda s, e: s.replace(e, ''), replace, name) for name in names]
Here is the result:
print(result)
['Marple', 'Smith', "Tony Dobson's"]
The solution by @chrisz works, but if the replace list is generated on the fly or is very long, we can't hard-code a regex for it. This solution works in pretty much any scenario.
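The reduce step leaves the trailing 's intact; the question's second loop can become a conditional comprehension, which completes the two-comprehension goal. A minimal sketch, building on result from above:
# Second comprehension: drop a trailing 's from each cleaned name
names_final = [name[:-2] if name.endswith("'s") else name for name in result]
print(names_final)  # ['Marple', 'Smith', 'Tony Dobson']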
Using pandas in a Jupyter notebook, I would like to delete everything that is not a letter: hyphens, special characters, and so on. For example:
firstname,birthday_date
joe-down§,02-12-1990
lucash brown_ :),06-09-1980
^antony,11-02-1987
mary|,14-12-2002
which should become:
firstname,birthday_date
joe down,02-12-1990
lucash brown,06-09-1980
antony,11-02-1987
mary,14-12-2002
I'm trying with:
df['firstname'] = df['firstname'].str.replace(r'!', '')
df['firstname'] = df['firstname'].str.replace(r'^', '')
df['firstname'] = df['firstname'].str.replace(r'|', '')
df['firstname'] = df['firstname'].str.replace(r'§', '')
df['firstname'] = df['firstname'].str.replace(r':', '')
df['firstname'] = df['firstname'].str.replace(r')', '')
......
......
df
It seems to work, but on more heavily populated columns I always miss some characters.
Is there a way to eliminate all non-text characters completely and keep only the word or words in each column? In the example I used firstname to illustrate the idea, but it would also need to work for columns containing whole phrases!
Thanks!
P.S. It should also handle text with encoded emoticons.
You can use regex for this.
df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
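For reference, a minimal sketch of the setup assumed above (the question only shows the CSV, so the frame construction is an assumption):
import pandas as pd

# Hypothetical construction of the question's sample data
df = pd.DataFrame({
    'firstname': ['joe-down§', 'lucash brown_ :)', '^antony', 'mary|'],
    'birthday_date': ['02-12-1990', '06-09-1980', '11-02-1987', '14-12-2002'],
})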
Try the below; it works on the names you used in the post:
first_names = ['joe-down§','lucash brown_','^antony','mary|']
clean_names = []
keep = {'-',' '}
for name in first_names:
    clean_names.append(''.join(c if c not in keep else ' ' for c in name
                               if c.isalnum() or c in keep))
print(clean_names)
output
['joe down', 'lucash brown', 'antony', 'mary']
namelist = ['John', 'Maria']
e_text = 'John is hunting, Maria is cooking'
I need to replace 'John' and 'Maria'. How can I do this?
I tried:
for name in namelist:
    if name in e_text:
        e_text.replace(name, 'replaced')
But it only works with 'John'. The output is: 'replaced is hunting, Maria is cooking'. How can I replace the two names?
Thanks.
Strings are immutable in Python, so replace doesn't modify the string; it returns a modified copy. You should reassign the result:
for name in namelist:
    e_text = e_text.replace(name, "replaced")
You don't need the if name in e_text check since replace already does nothing if it's not found.
You could form a regex alternation of the names and then use re.sub on that:
import re

namelist = ['John', 'Maria']
pattern = r'\b(?:' + '|'.join(namelist) + r')\b'
e_text = 'John is hunting, Maria is cooking'
output = re.sub(pattern, 'replaced', e_text)
print(e_text + '\n' + output)
This prints:
John is hunting, Maria is cooking
replaced is hunting, replaced is cooking
I know the desired syntax lies in the first function, but for the life of me I can't find where it goes. I've tried removing commas and adding spaces around each .split(), but each attempt has yielded an undesired return value.
def get_country_codes(prices):
    price_list = prices.split(',')
    results = ''
    for price in price_list:
        results += price.split('$')[0]
    return results

def main():
    prices = "US$40, AU$89, JP$200"
    price_result = get_country_codes(prices)
    print(price_result)

if __name__ == "__main__":
    main()
The current output:
US AU JP
The desired output:
US, AU, JP
It looks like you could benefit from using a list to collect the country codes of the prices instead of a string. Then you can use ', '.join() later.
Maybe like this:
def get_country_codes(prices):
    country_code_list = []
    for price in prices.split(','):
        country_code = price.split('$')[0].strip()
        country_code_list.append(country_code)
    return country_code_list

if __name__ == '__main__':
    prices = "US$40, AU$89, JP$200"
    result_list = get_country_codes(prices)
    print(', '.join(result_list))
Or if you like really short code:
prices = "US$40, AU$89, JP$200"
print(
    ', '.join(
        price.split('$')[0].strip()
        for price in prices.split(',')))
You could also use regex if you want to. Since you know country codes will be two capital letters only (A-Z), you can look for a match of two capital letters that precede a dollar sign.
import re

def get_country_codes(prices):
    country_codes = re.findall(r'([A-Z]{2})\$', prices)
    return ', '.join(country_codes)
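A quick check against the question's input:
prices = "US$40, AU$89, JP$200"
print(get_country_codes(prices))  # US, AU, JP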
Look at the successive steps:
Your string:
In [1]: prices = "US$40, AU$89, JP$200"
split into a list on comma
In [2]: alist = prices.split(',')
In [3]: alist
Out[3]: ['US$40', ' AU$89', ' JP$200']
split the substrings on $
In [4]: [price.split('$') for price in alist]
Out[4]: [['US', '40'], [' AU', '89'], [' JP', '200']]
select the first element:
In [5]: [price.split('$')[0] for price in alist]
Out[5]: ['US', ' AU', ' JP']
Your += joins the strings as-is, the same as joining with ''. Note that the substrings still have the leading blank from the original string.
In [6]: ''.join([price.split('$')[0] for price in alist])
Out[6]: 'US AU JP'
Join with comma:
In [7]: ','.join([price.split('$')[0] for price in alist])
Out[7]: 'US, AU, JP'
join is the easiest way to combine a list of strings with a specific delimiter between them, in effect reversing a split. += in a loop is harder to get right, since it tends to add an extra delimiter at the start or end.
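A quick illustration of that last point:
# Building the string with += leaves a stray trailing delimiter
out = ''
for code in ['US', 'AU', 'JP']:
    out += code + ', '
print(out)                            # US, AU, JP,
print(', '.join(['US', 'AU', 'JP']))  # US, AU, JP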
I have a list of names that I'm using to pull matching strings out of a target list. For example:
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
output = ['Chris Smith', 'Kim', 'CHRIS']
So the rules so far are:
Case insensitive
Cannot match a partial word (i.e. Christmas/hijacked shouldn't match Chris/Jack)
Other words in the string are okay as long as the name is found per the above criteria.
To accomplish this, another SO user suggested this code in this thread:
[targ for targ in target if any(re.search(r'\b{}\b'.format(name), targ, re.I) for name in names)]
This works very accurately so far, but very slowly given the names list is ~5,000 long and the target list ranges from 20-100 lines long with some strings up to 30 characters long.
Any suggestions on how to improve performance here?
SOLUTION: Both of the regex-based solutions suffered from OverflowErrors so unfortunately I could not test them. The solution that worked (from @mgilson's answer) was:
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
This provided a tremendous increase in performance from 15 seconds to under 1 second.
Seems like you could combine them all into 1 super regex:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex_string = '|'.join(r"(?:\b"+re.escape(x)+r"\b)" for x in names)
print(regex_string)
regex = re.compile(regex_string,re.I)
print([t for t in target if regex.search(t)])
A non-regex solution which will only work if the names are a single word (no whitespace):
new_names = set(name.lower() for name in names)
[ t for t in target if any(map(new_names.__contains__,t.lower().split())) ]
the any expression could also be written as:
any(x in new_names for x in t.lower().split())
or
any(x.lower() in new_names for x in t.split())
or, another variant which relies on set.intersection (suggested by #DSM below):
[ t for t in target if new_names.intersection(t.lower().split()) ]
You can profile to see which performs best if performance is really critical, otherwise choose the one that you find to be easiest to read/understand.
*If you're using Python 2.x, you'll probably want to use itertools.imap instead of map if you go that route above, so it evaluates lazily. It also makes me wonder whether Python provides a lazy str.split that would have performance on par with the non-lazy version...
This one is the simplest I can think of:
[item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
all together:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
results = [item for item in target if re.search(r'\b(%s)\b' % '|'.join(names), item)]
print(results)
>>>
['Chris Smith', 'Kim']
And to make it more efficient, you can compile the regex first.
regex = re.compile( r'\b(%s)\b' % '|'.join(names) )
[item for item in target if regex.search(item)]
Edit
After considering the question and looking at some comments, I have revised the solution to the following:
import re
names = ['Chris', 'Jack', 'Kim']
target = ['Chris Smith', 'I hijacked this thread', 'Kim','Christmas is here', 'CHRIS']
regex = re.compile( r'\b((%s))\b' % ')|('.join([re.escape(name) for name in names]), re.I )
results = [item for item in target if regex.search(item)]
results:
>>>
['Chris Smith', 'Kim', 'CHRIS']
You're currently doing one loop inside another, iterating over two lists. That's always going to give you quadratic performance.
One local optimisation is to compile each name regex (which will make applying each regex faster). However, the big win is going to come from combining all of your regexes into one regex which you apply to each item in your input. See @mgilson's answer for how to do that. After that, your code's performance should scale linearly as O(M+N), rather than O(M*N).
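A rough way to see the difference with toy data (the sizes here are assumptions; this is a sketch, not a benchmark):
import re
import timeit

# Toy stand-ins for the ~5,000-name list and the target lines
names = ['Chris', 'Jack', 'Kim', 'Pat', 'Sam'] * 200
target = ['Chris Smith', 'I hijacked this thread', 'Kim was here'] * 30

combined = re.compile('|'.join(r'\b%s\b' % re.escape(n) for n in names), re.I)

def per_name_loop():
    # O(M*N): one regex search per name per target line
    return [t for t in target
            if any(re.search(r'\b%s\b' % re.escape(n), t, re.I) for n in names)]

def single_regex():
    # roughly O(M+N): one combined pattern, applied once per target line
    return [t for t in target if combined.search(t)]

print(timeit.timeit(per_name_loop, number=1))
print(timeit.timeit(single_regex, number=1))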
Example:
names = ['James John', 'Robert David', 'Paul' ... the list has 5K items]
text1 = 'I saw James today'
text2 = 'I saw James John today'
text3 = 'I met Paul'
is_name_in_text(text1, names)  # returns False: 'James' is not in the list
is_name_in_text(text2, names)  # returns 'James John'
is_name_in_text(text3, names)  # returns 'Paul'
is_name_in_text() checks whether any name from the list appears in the text.
The easy way is to just check whether each name is in the text using the in operator, but the list has 5,000 items, so that is not efficient. I could instead split the text into words and check whether each word is in the list, but that won't work when a matching name has more than one word: the text2 call above would fail in this case.
Make names into a set and use the in-operator for fast O(1) lookup.
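A minimal sketch of that idea:
names = {'James John', 'Robert David', 'Paul'}  # set literal: O(1) membership tests
print('Paul' in names)    # True
print('James' in names)   # False: only exact full names are members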
You can use a regex to parse out the possible names in a sentence:
>>> import re
>>> findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
>>> def is_name_in_text(text, names):
        for possible_name in set(findnames.findall(text)):
            if possible_name in names:
                return possible_name
        return False
>>> names = set(['James John', 'Robert David', 'Paul'])
>>> is_name_in_text('I saw James today', names)
False
>>> is_name_in_text('I saw James John today', names)
'James John'
>>> is_name_in_text('I met Paul', names)
'Paul'
Build a regular expression with all the alternatives. This way you don't have to worry about somehow pulling the names out of the phrases beforehand.
import re
names_re = re.compile(r'\b' +
                      r'\b|\b'.join(re.escape(name) for name in names) +
                      r'\b')
print(names_re.search('I saw James today'))
You may use Python's set in order to get good performance while using the in operator.
If you have a mechanism of pulling the names out of the phrases and don't need to worry about partial matches (the full name will always be in the string), you can use a set rather than a list.
Your code is exactly the same, with this addition at line 2:
names = set(names)
The in operation will now function much faster.