I'm trying to write a function process(s,d) to replace abbreviations in a string with their full meaning by using a dictionary. where s is the string input and d is the dictionary. For example:
>>>d = {'ASAP':'as soon as possible'}
>>>s = "I will do this ASAP. Regards, X"
>>>process(s,d)
>>>"I will do this as soon as possible. Regards, X"
I have tried using the split function to separate the string and compare each part with the dictionary.
def process(s):
return ''.join(d[ch] if ch in d else ch for ch in s)
However, it returns me the same exact string. I have a suspicion that the code doesn't work because of the full stop behind ASAP in the original string. If so, how do I ignore the punctuation and get ASAP to be replaced?
Here is a way to do it with a single regex:
In [24]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}
In [25]: s = 'I will do this ASAP, AFAIK. Regards, X'
In [26]: re.sub(r'\b' + '|'.join(d.keys()) + r'\b', lambda m: d[m.group(0)], s)
Out[26]: 'I will do this as soon as possible, as far as I know. Regards, X'
Unlike versions based on str.replace(), this observes word boundaries and therefore won't replace abbreviations that happen to appear in the middle of other words (e.g. "etc" in "fetch").
Also, unlike most (all?) other solutions presented thus far, it iterates over the input string just once, regardless of how many search terms there are in the dictionary.
You can do something like this:
def process(s,d):
for key in d:
s = s.replace(key,d[key])
return s
Here is a working solution: use re.split(), and split by word boundaries (preserving the interstitial characters):
''.join( d.get( word, word ) for word in re.split( '(\W+)', s ) )
One significant difference that this code has from Vaughn's or Sheena's answer is that this code takes advantage of the O(1) lookup time of the dictionary, while their solutions look at every key in the dictionary. This means that when s is short and d is very large, their code will take significantly longer to run. Furthermore, parts of words will still be replaced in their solutions: if d = { "lol": "laugh out loud" } and s="lollipop" their solutions will incorrectly produce "laugh out loudlipop".
use regular expressions:
re.sub(pattern,replacement,s)
In your application:
ret = s
for key in d:
ret = re.sub(r'\b'+key+r'\b',d[key],ret)
return ret
\b matches the beginning or end of a word. Thanks Paul for the comment
Instead of splitting by spaces, use:
split("\W")
It will split by anything that's not a character that would be part of a word.
python 3.2
[s.replace(i,v) for i,v in d.items()]
This is string replacement as well (+1 to #VaughnCato). This uses the reduce function to iterate through your dictionary, replacing any instances of the keys in the string with the values. s in this case is the accumulator, which is reduced (i.e. fed to the replace function) on every iteration, maintaining all past replacements (also, per #PaulMcGuire's point above, this replaces keys starting with the longest and ending with the shortest).
In [1]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}
In [2]: s = 'I will do this ASAP, AFAIK. Regards, X'
In [3]: reduce(lambda x, y: x.replace(y, d[y]), sorted(d, key=lambda i: len(i), reverse=True), s)
Out[3]: 'I will do this as soon as possible, as far as I know. Regards, X'
As for why your function didn't return what you expected - when you iterate through s, you are actually iterating through the characters of the string - not the words. Your version could be tweaked by iterating over s.split() (which would be a list of the words), but you then run into an issue where the punctuation is causing words to not match your dictionary. You can get it to match by importing string and stripping out string.punctuation from each word, but that will remove the punctuation from the final string (so regex would be likely be the best option if replacement doesn't work).
Related
My question is pretty simple, but I haven't been able to find a proper solution.
Given below is my program:
given_list = ["Terms","I","want","to","remove","from","input_string"]
input_string = input("Enter String:")
if any(x in input_string for x in given_list):
#Find the detected word
#Not in bool format
a = input_string.replace(detected_word,"")
print("Some Task",a)
Here, given_list contains the terms I want to exclude from the input_string.
Now, the problem I am facing is that the any() produces a bool result and I need the word detected by the any() and replace it with a blank, so as to perform some task.
Edit: any() function is not required at all, look for useful solutions below.
Iterate over given_list and replace them:
for i in given_list:
input_string = input_string.replace(i, "")
print("Some Task", input_string)
No need to detect at all:
for w in given_list:
input_string = input_string.replace(w, "")
str.replace will not do anything if the word is not there and the substring test needed for the detection has to scan the string anyway.
The problem with finding each word and replacing it is that python will have to iterate over the whole string, repeatedly. Another problem is you will find substrings where you don't want to. For example, "to" is in the exclude list, so you'd end up changing "tomato" to "ma"
It seems to me like you seem to want to replace whole words. Parsing is a whole new subject, but let's simplify. I'm just going to assume everything is lowercase with no punctuation, although that can be improved later. Let's use input_string.split() to iterate over whole words.
We want to replace some words with nothing, so let's just iterate over the input_string, and filter out the words we don't want, using the builtin function of the same name.
exclude_list = ["terms","i","want","to","remove","from","input_string"]
input_string = "one terms two i three want to remove"
keepers = filter(lambda w: w not in exclude_list, input_string.lower().split())
output_string = ' '.join(keepers)
print (output_string)
one two three
Note that we create an iterator that allows us to go through the whole input string just once. And instead of replacing words, we just basically skip the ones we don't want by having the iterator not return them.
Since filter requires a function for the boolean check on whether to include or exclude each word, we had to define one. I used "lambda" syntax to do that. You could just replace it with
def keep(word):
return word not in exclude_list
keepers = filter(keep, input_string.split())
To answer your question about any, use an assignment expression (Python 3.8+).
if any((word := x) in input_string for x in given_list):
# match captured in variable word
I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
g=""
for j in range(len(i)):
if i[j] == ':':
index=j+1
res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?
Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x), splits the string on the : character, this returns a list containing the prefix (if it exists) and the suffix, and builds a new list with only the suffix (i.e. item [-1] in the list that results from the split. In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']
Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)
So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string
data = []
for s in slist:
found = False
for c in string.ascii_letters:
if c in s:
found = True
if not found:
data.append(s)
And it works, but it is of course very slow and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S: I prefer to do it within a python program, but a grep command that does the same would also help
To check if a string contains any ASCII letters (ie. non-Hebrew) use:
re.search('[' + string.ascii_letters + ']', s)
If this returns true, your string is not pure Hebrew.
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: Just noticed, it filters out the English-only strings, but you need it do do the other way round. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
Python has extensive unicode support. It depends on what you're asking for. Is a hebrew word one that contains only hebrew characters and whitespace, or is it simply a word that contains no latin characters? Either way, you can do so directly. Just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iteration through string.ascii_letters.
Please note that I do not speak hebrew so I may have missed a letter or two of the alphabet.
def is_hebrew(word):
hebrew = set("אבגדהוזחטיכךלמנס עפצקרשתםןףץ"+string.whitespace)
for char in word:
if char not in hebrew:
return False
return True
def contains_latin(word):
return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())
# a generator expression like this is a terser way of expressing the
# above concept.
hebrew_words = [word for word in words if is_hebrew(word)]
non_latin words = [word for word in words if not contains_latin(word)]
Another option would be to create a dictionary of hebrew words:
hebrew_words = {...}
And then you iterate through the list of words and compare them against this dictionary ignoring case. This will work much faster than other approaches (O(n) where n is the length of your list of words).
The downside is that you need to get all or most of hebrew words somewhere. I think it's possible to find it on the web in csv or some other form. Parse it and put it into python dictionary.
However, it makes sense if you need to parse such lists of words very often and quite quickly. Another problem is that the dictionary may contain not all hebrew words which will not give a completely right answer.
Try this:
>>> import re
>>> filter(lambda x: re.match(r'^[^\w]+$',x),s)
I found a nice question where one can search for multiple endings of a string using: endswith(tuple)
Check if string ends with one of the strings from a list
My question is, how can I return which value from the tuple is actually found to be the match? and what if I have multiple matches, how can I choose the best match?
for example:
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ('AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA')
str.endswith(endings) ## this will return true for all of values inside the tuple, but how can I get which one matches the best
In this case, multiple matches can be found from the tuple, how can I deal with this and return only the best (biggest) match, which in this case should be: 'AAAAAAAAA' which I want to remove at the end (which can be done with a regular expression or so).
I mean one could do this in a for loop, but maybe there is an easier pythonic way?
>>> s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
>>> endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
>>> max([i for i in endings if s.endswith(i)],key=len)
'AAAAAAAAA'
import re
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
print max([i for i in endings if re.findall(i+r"$",str)],key=len)
How about:
len(str) - len(str.rstrip('A'))
str.endswith(tuple) is (currently) implemented as a simple loop over tuple, repeatedly re- running the match, any similarities between the endings are not taken into account.
In the example case, a regular expression should compile into an automaton that essentially runs in linear time:
regexp = '(' + '|'.join(
re.escape(ending) for ending in sorted(endings, key=len, reverse=True
) + ')$'
Edit 1: As pointed out correctly by Martijn Pieters, Python's re does not return the longest overall match, but for alternates only matches the first matching subexpression:
https://docs.python.org/2/library/re.html#module-re:
When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
(emphasis mine)
Hence, unfortunately the need for sorting by length.
Note that this makes Python's re different from POSIX regular expressions, which match the longest overall match.
Suppose we have:
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
s = 'Спорт russianA'
How can I replace each appearance within s of any of d's keys, with the corresponding value (in this case, the result would be 'Досуг englishA')?
Using re:
import re
s = 'Спорт not russianA'
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
keys = (re.escape(k) for k in d.keys())
pattern = re.compile(r'\b(' + '|'.join(keys) + r')\b')
result = pattern.sub(lambda x: d[x.group()], s)
# Output: 'Досуг not englishA'
This will match whole words only. If you don't need that, use the pattern:
pattern = re.compile('|'.join(re.escape(k) for k in d.keys()))
Note that in this case you should sort the words descending by length if some of your dictionary entries are substrings of others.
You could use the reduce function:
reduce(lambda x, y: x.replace(y, dict[y]), dict, s)
Solution found here (I like its simplicity):
def multipleReplace(text, wordDict):
for key in wordDict:
text = text.replace(key, wordDict[key])
return text
one way, without re
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
s = 'Спорт russianA'.split()
for n,i in enumerate(s):
if i in d:
s[n]=d[i]
print ' '.join(s)
Almost the same as ghostdog74, though independently created. One difference,
using d.get() in stead of d[] can handle items not in the dict.
>>> d = {'a':'b', 'c':'d'}
>>> s = "a c x"
>>> foo = s.split()
>>> ret = []
>>> for item in foo:
... ret.append(d.get(item,item)) # Try to get from dict, otherwise keep value
...
>>> " ".join(ret)
'b d x'
With the warning that it fails if key has space, this is a compressed solution similar to ghostdog74 and extaneons answers:
d = {
'Спорт':'Досуг',
'russianA':'englishA'
}
s = 'Спорт russianA'
' '.join(d.get(i,i) for i in s.split())
I used this in a similar situation (my string was all in uppercase):
def translate(string, wdict):
for key in wdict:
string = string.replace(key, wdict[key].lower())
return string.upper()
hope that helps in some way... :)
Using regex
We can build a regular expression that matches any of the lookup dictionary's keys, by creating regexes to match each individual key and combine them with |. We use re.sub to do the substitution, by giving it a function to do the replacement (this function, of course, will do the dict lookup). Putting it together:
import re
# assuming global `d` and `s` as in the question
# a function that does the dict lookup with the global `d`.
def lookup(match):
return d[match.group()]
# Make the regex.
joined = '|'.join(re.escape(key) for key in d.keys())
pattern = re.compile(joined)
result = pattern.sub(lookup, s)
Here, re.escape is used to escape any characters with special meaning in the replacements (so that they don't interfere with building the regex, and are matched literally).
This regex pattern will match the substrings anywhere they appear, even if they are part of a word or span across multiple words. To avoid this, modify the regex so that it checks for word boundaries:
# pattern = re.compile(joined)
pattern = re.compile(rf'\b({joined})\b')
Using str.replace iteratively
Simply iterate over the .items() of the lookup dictionary, and call .replace with each. Since this method returns a new string, and does not (cannot) modify the string in place, we must reassign the results inside the loop:
for to_replace, replacement in d.items():
s = s.replace(to_replace, replacement)
This approach is simple to write and easy to understand, but it comes with multiple caveats.
First, it has the disadvantage that it works sequentially, in a specific order. That is, each replacement has the potential to interfere with other replacements. Consider:
s = 'one two'
s = s.replace('one', 'two')
s = s.replace('two', 'three')
This will produce 'three three', not 'two three', because the 'two' from the first replacement will itself be replaced in the second step. This is normally not desirable; however, in the rare case when it should work this way, this approach is the only practical one.
This approach also cannot easily be fixed to respect word boundaries, because it must match literal text, and a "word boundary" can be marked in multiple different ways - by varying kinds of whitespace, but also without text at the beginning and end of the string.
Finally, keep in mind that a dict is not an ideal data structure for this approach. If we will iterate over the dict, then its ability to do key lookup is useless; and in Python 3.5 and below, the order of dicts is not guaranteed (making the sequential replacement problem worse). Instead, it would be better to specify a list of tuples for the replacements:
d = [('Спорт', 'Досуг'), ('russianA', 'englishA')]
s = 'Спорт russianA'
for to_replace, replacement in d: # no more `.items()` call
s = s.replace(to_replace, replacement)
By tokenization
The problem becomes much simpler if the string is first cut into pieces (tokenized), in such a way that anything that should be replaced is now an exact match for a dict key. That would allow for using the dict's lookup directly, and processing the entire string in one go, while also not building a custom regex.
Suppose that we want to match complete words. We can use a simpler, hard-coded regex that will match whitespace, and which uses a capturing group; by passing this to re.split, we split the string into whitespace and non-whitespace sections. Thus:
import re
tokenizer = re.compile('([ \t\n]+)')
tokenized = tokenizer.split(s)
Now we look up each of the tokens in the dictionary: if present, it should be replaced with the corresponding value, and otherwise it should be left alone (equivalent to replacing it with itself). The dictionary .get method is a natural fit for this task. Finally, we join the pieces back up. Thus:
s = ''.join(d.get(token, token) for token in tokenized)
More generally, for example if the strings to replace could have spaces in them, a different tokenization rule will be needed. However, it will usually be possible to come up with a tokenization rule that is simpler than the regex from the first section (that matches all the keys by brute force).
Special case: replacing single characters
If the keys of the dict are all one character (technically, Unicode code point) each, there are more specific techniques that can be used. See Best way to replace multiple characters in a string? for details.