Replace adjacent identical tokens that match a regex - python

In a Python application, I need to replace adjacent identical occurrences of whitespace-separated tokens that match a regex, e.g. for a pattern such as "a\w\w"
"xyz abc abc zzq ak9 ak9 ak9 foo abc" --> "xyz abc*2 zzq ak9*3 foo abc"
EDIT
My example above didn't make it clear that tokens which don't match the regex should not be aggregated. A better example is
"xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc"
--> "xyz xyz abc*2 zzq ak9*3 foo foo abc"
END EDIT
I have working code posted below, but it seems more complicated than it should be.
I'm not looking for a round of code golf, but I would be interested in a solution that's more readable using standard Python libraries with similar performance.
In my application, it's safe to assume that the input strings will be less than 10000 chars long and that any given string will contain only a handful, say < 10, of the possible strings that match the pattern.
import re
def fm_pattern_factory(ptnstring):
    """
    Return a regex that matches two or more occurrences
    of ptnstring separated by whitespace.
    >>> fm_pattern_factory('abc').match(' abc abc ') is None
    False
    >>> fm_pattern_factory('abc').match('abc') is None
    True
    """
    ptn = r"\s*({}(?:\s+{})+)\s*".format(ptnstring, ptnstring)
    return re.compile(ptn)
def fm_gather(target, ptnstring):
    """
    Replace adjacent occurences of ptnstring in target with
    ptnstring*N where n is the number occurrences.
    >>> fm_gather('xyz abc abc def abc', 'abc')
    'xyz abc*2 def abc'
    >>> fm_gather('xyz abc abc def abc abc abc qrs', 'abc')
    'xyz abc*2 def abc*3 qrs'
    """
    ptn = fm_pattern_factory(ptnstring)
    result = []
    index = 0
    for match in ptn.finditer(target):
        result.append(target[index:match.start()+1])
        repl = "{}*{}".format(ptnstring, match.group(1).count(ptnstring))
        result.append(repl)
        index = match.end() - 1
    result.append(target[index:])
    return "".join(result)
def fm_gather_all(target, ptn):
    """
    Apply fm_gather() to all distinct matches for ptn.
    >>> s = "x abc abc y abx abx z acq"
    >>> ptn = re.compile(r"a..")
    >>> fm_gather_all(s, ptn)
    'x abc*2 y abx*2 z acq'
    """
    ptns = set(ptn.findall(target))
    for p in ptns:
        target = fm_gather(target, p)
    return "".join(target)

Sorry, I was working on the answer before seeing your first comment. If this doesn't answer your question, let me know, and I'll remove it or try to modify it accordingly.
For the simple input provided in the question (what in the code below is stored in the my_string variable), you could maybe try a different approach: Walk your input list and keep a "bucket" of <matching_word, num_of_occurrences>:
my_string = "xyz abc abc zzq ak9 ak9 ak9 foo abc"
my_splitted_string = my_string.split(' ')
occurrences = []
print("my_splitted_string is a %s now containing: %s"
      % (type(my_splitted_string), my_splitted_string))
current_bucket = [my_splitted_string[0], 1]
occurrences.append(current_bucket)
for i in range(1, len(my_splitted_string)):
    current_word = my_splitted_string[i]
    print("Does %s match %s?" % (current_word, current_bucket[0]))
    if current_word == current_bucket[0]:
        current_bucket[1] += 1
        print("It does. Aggregating")
    else:
        current_bucket = [current_word, 1]
        occurrences.append(current_bucket)
        print("It doesn't. Creating a new 'bucket'")
print("Collected occurrences: %s" % occurrences)
# Now re-collect:
re_collected_str = ""
for occurrence in occurrences:
    if occurrence[1] > 1:
        re_collected_str += "%s*%d " % (occurrence[0], occurrence[1])
    else:
        re_collected_str += "%s " % (occurrence[0])
print("Compressed string: '%s'" % re_collected_str)
This outputs:
my_splitted_string is a <type 'list'> now containing: ['xyz', 'abc', 'abc', 'zzq', 'ak9', 'ak9', 'ak9', 'foo', 'abc']
Does abc match xyz?
It doesn't. Creating a new 'bucket'
Does abc match abc?
It does. Aggregating
Does zzq match abc?
It doesn't. Creating a new 'bucket'
Does ak9 match zzq?
It doesn't. Creating a new 'bucket'
Does ak9 match ak9?
It does. Aggregating
Does ak9 match ak9?
It does. Aggregating
Does foo match ak9?
It doesn't. Creating a new 'bucket'
Does abc match foo?
It doesn't. Creating a new 'bucket'
Collected occurrences: [['xyz', 1], ['abc', 2], ['zzq', 1], ['ak9', 3], ['foo', 1], ['abc', 1]]
Compressed string: 'xyz abc*2 zzq ak9*3 foo abc '
(beware of the final blank space)
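The same bucket idea can be sketched more compactly with itertools.groupby, which groups consecutive equal items. This is only a sketch under the same assumptions as the code above: every token is aggregated (no regex filter), tokens are split on any whitespace, and the function name compress_runs is made up for illustration. Joining with " " also avoids the trailing blank.

```python
from itertools import groupby

def compress_runs(text):
    """Collapse runs of identical adjacent tokens into 'token*N'."""
    parts = []
    # groupby yields one (key, iterator) pair per run of equal tokens.
    for word, run in groupby(text.split()):
        count = sum(1 for _ in run)
        parts.append(word if count == 1 else "%s*%d" % (word, count))
    return " ".join(parts)

print(compress_runs("xyz abc abc zzq ak9 ak9 ak9 foo abc"))
```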

The following seems to be solid and has good performance in my application. Thanks to BorrajaX for an answer that pointed out benefits of not scanning the input string more often than absolutely necessary.
The function below also preserves newlines and whitespace in the output. I forgot to state that in my question, but it turns out to be desirable in my app which needs to produce some human-readable intermediate output.
def gather_token_sequences(masterptn, target):
    """
    Find all sequences in 'target' of two or more identical adjacent tokens
    that match 'masterptn'. Count the number of tokens in each sequence.
    Return a new version of 'target' with each sequence replaced by one token
    suffixed with '*N' where N is the count of tokens in the sequence.
    Whitespace in the input is preserved (except where consumed within replaced
    sequences).
    >>> mptn = r'ab\w'
    >>> tgt = 'foo abc abc'
    >>> gather_token_sequences(mptn, tgt)
    'foo abc*2'
    >>> tgt = 'abc abc '
    >>> gather_token_sequences(mptn, tgt)
    'abc*2 '
    >>> tgt = '\\nabc\\nabc abc\\ndef\\nxyz abx\\nabx\\nxxx abc'
    >>> gather_token_sequences(mptn, tgt)
    '\\nabc*3\\ndef\\nxyz abx*2\\nxxx abc'
    """
    # Emulate python's strip() function except that the leading and trailing
    # whitespace are captured for final output. This guarantees that the
    # body of the remaining string will start and end with a token, which
    # slightly simplifies the subsequent matching loops.
    stripped = re.match(r'^(\s*)(\S.*\S)(\s*)$', target, flags=re.DOTALL)
    head, body, tail = stripped.groups()
    # Init the result list and loop variables.
    result = [head]
    i = 0
    token = None
    while i < len(body):
        ## try to match master pattern
        match = re.match(masterptn, body[i:])
        if match is None:
            ## Append char and advance.
            result += body[i]
            i += 1
        else:
            ## Start new token sequence
            token = match.group(0)
            esc = re.escape(token)  # might have special chars in token
            ptn = r"((?:{}\s+)+{})".format(esc, esc)
            seq = re.match(ptn, body[i:])
            if seq is None:  # token is not repeated.
                result.append(token)
                i += len(token)
            else:
                seqstring = seq.group(0)
                replacement = "{}*{}".format(token, seqstring.count(token))
                result.append(replacement)
                i += len(seq.group(0))
    result.append(tail)
    return ''.join(result)
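For comparison, a single re.sub call with a backreference might do the same job in one pass. The sketch below is not the code above: the function name is invented, it assumes the token pattern contains no capturing groups of its own (otherwise \1 would point at the wrong group), and it only treats a match as a token when it is delimited by whitespace or the string edges.

```python
import re

def gather_with_backreference(token_pattern, text):
    """Collapse runs of identical adjacent tokens that match token_pattern."""
    # (?<!\S) and (?!\S) require whitespace or a string edge around the run;
    # \1 forces every repeat to equal the first token exactly.
    run = re.compile(r"(?<!\S)({})(?:\s+\1)+(?!\S)".format(token_pattern))

    def collapse(match):
        token = match.group(1)
        count = len(match.group(0).split())  # tokens in the matched run
        return "{}*{}".format(token, count)

    return run.sub(collapse, text)

print(gather_with_backreference(r"a\w\w", "xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc"))
```

Whitespace outside a collapsed run is left untouched, so newlines between unrelated tokens survive.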

Related

Case-insensitive regex that keeps capitalization after replacement?

In Python, the following code lets me replace most sanely-written text case-consistently. It has bugs if, say, from_substring is "the" and my_string contains "theater," but this is just to illustrate what I want to do.
def cap_preserving_replace(my_string, from_substring, to_substring):
    temp = my_string.replace(from_substring.lower(), to_substring.lower())
    temp = temp.replace(from_substring.upper(), to_substring.upper())
    temp = temp.replace(from_substring.title(), to_substring.title())
    return temp
However, I also want (in much rarer cases) to be able to make the following sorts of conversions:
before -> afters
Before -> Afters
BEFORE -> AFTERS
BeFoRe -> AfTeRs
The general code I have is ugly. And while I probably just want to replace lower, upper and title cases, for which the code would be trivial, I am wondering if there is a good regex that can preserve capitalization schemes. The code below does so, converting abc to def.
import re
begin_string = 'abc'
end_string = 'def'
def case_keep_replace(my_string, begin_string = 'abc', end_string = 'def'):
    offsets = [m.start() for m in re.finditer(begin_string, my_string, re.IGNORECASE)]
    for x in offsets:
        temp_string = ''
        for y in range(0, len(begin_string)):
            if my_string[x+y].islower():
                temp_string += end_string[y]
            else:
                temp_string += end_string[y].upper()
        my_string = my_string[:x] + temp_string + my_string[x + len(temp_string):]
    return my_string
process_string = "All 8: abc, abC, aBc, aBC, Abc, AbC, ABc, ABC."
print("BEFORE:", process_string)
print("AFTERS: ", case_keep_replace(process_string))
Is there a suitable regex for case_keep_replace?
(Note: for case_keep_replace, the code should also check that from_substring and to_substring are the same length, but I wanted to focus the code chunk on the main question.)
The point here is that you need to pass the match object (re.Match) into the callback function, or a lambda, used as the second argument to re.sub.
However, your current code needs simplifying a bit; here is how it could look:
import re
def case_keep_replace(m, end_string):
    temp_string = ''
    for y in range(0, len(m.group())):
        if m.group()[y].islower():
            temp_string += end_string[y]
        else:
            temp_string += end_string[y].upper()
    return temp_string + end_string[y+1:]
# Now, testing
process_string = "All 8: abc, abC, aBc, aBC, Abc, AbC, ABc, ABC." # input string
begin_string = 'abc' # Actual pattern
end_string = 'def' # Replacement string
print( process_string)
print( re.sub(begin_string, lambda m: case_keep_replace(m, end_string), process_string, flags=re.I) )
# => All 8: abc, abC, aBc, aBC, Abc, AbC, ABc, ABC.
#    All 8: def, deF, dEf, dEF, Def, DeF, DEf, DEF.
See the Python demo.
Note: I added end_string[y+1:] in the return temp_string + end_string[y+1:], so as not to lose the rest of the end_string replacement, see this Python demo.
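One possible way to cover the before -> afters case from the question, where the replacement is longer than the match, is to reuse the case of the last matched character for the extra letters. That rule is an assumption made for this sketch, not something the question specifies, and the helper names match_case and case_preserving_sub are invented here.

```python
import re

def match_case(source, replacement):
    """Copy the upper/lower pattern of source onto replacement."""
    if not source:  # empty match: nothing to copy the case from
        return replacement
    out = []
    for i, ch in enumerate(replacement):
        # Past the end of source, keep using its last character's case.
        ref = source[min(i, len(source) - 1)]
        out.append(ch.upper() if ref.isupper() else ch.lower())
    return "".join(out)

def case_preserving_sub(pattern, replacement, text):
    return re.sub(pattern, lambda m: match_case(m.group(), replacement),
                  text, flags=re.IGNORECASE)

print(case_preserving_sub("before", "afters", "before Before BEFORE BeFoRe"))
```

If the replacement is shorter than the match, the loop simply stops early, so no IndexError is raised.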

find the uppercase letter on a string and replace it

This is my code :
def cap_space(txt):
    e = txt
    upper = "WLMFSC"
    letters = [each for each in e if each in upper]
    a = ''.join(letters)
    b = a.lower()
    c = txt.replace(a,' '+b)
    return c
which I built to find the uppercase letters in a given string and replace each one with a space followed by the lowercase letter.
example input :
print(cap_space('helloWorld!'))
print(cap_space('iLoveMyFriend'))
print(cap_space('iLikeSwimming'))
print(cap_space('takeCare'))
what should output be like :
hello world!
i love my friend
take care
i like swimming
what i get as output instead is :
hello world!
iLoveMyFriend
iLikeSwimming
take care
The problem is that the replacement is only applied when there is exactly one uppercase letter in the given string. How can I improve it so that it is applied to every uppercase letter in the string?
Being a regex addict, I can offer the following solution which relies on re.findall with an appropriate regex pattern:
import re
def cap_space(txt):
    parts = re.findall(r'^[a-z]+|[A-Z][a-z]*[^\w\s]?', txt)
    output = ' '.join(parts).lower()
    return output
inp = ['helloWorld!', 'iLoveMyFriend', 'iLikeSwimming', 'akeCare']
output = [cap_space(x) for x in inp]
print(inp)
print(output)
This prints:
['helloWorld!', 'iLoveMyFriend', 'iLikeSwimming', 'akeCare']
['hello world!', 'i love my friend', 'i like swimming', 'ake care']
Here is an explanation of the regex pattern used:
^[a-z]+ match an all lowercase word from the very start of the string
| OR
[A-Z] match a leading uppercase letter
[a-z]* followed by zero or more lowercase letters
[^\w\s]? followed by an optional "symbol" (defined here as any non word,
non whitespace character)
You can make use of nice python3 methods str.translate and str.maketrans:
In [281]: def cap_space(txt):
     ...:     upper = "WLMFSC"
     ...:     letters = [each for each in txt if each in upper]
     ...:     d = {i: ' ' + i.lower() for i in letters}
     ...:     return txt.translate(str.maketrans(d))
     ...:
     ...:
In [283]: print(cap_space('helloWorld!'))
     ...: print(cap_space('iLoveMyFriend'))
     ...: print(cap_space('iLikeSwimming'))
     ...: print(cap_space('takeCare'))
hello world!
i love my friend
i like swimming
take care
A simple and crude way. It might not be efficient but it is easier to understand
def cap_space(sentence):
    characters = []
    for character in sentence:
        # Test isupper() so that punctuation such as '!' is copied unchanged.
        if character.isupper():
            characters.append(f' {character.lower()}')
        else:
            characters.append(character)
    return ''.join(characters)
a is all the matching uppercase letters combined into a single string. When you try to replace them with txt.replace(a, ' '+b), it will only match if all the matching uppercase letters are consecutive in txt, or there's just a single match. str.replace() matches and replaces the whole search string, not any characters in it.
Combining all the matches into a single string won't work. Just loop through txt, checking each character to see if it matches.
def cap_space(txt):
    result = ''
    upper = "WLMFSC"
    for c in txt:
        if c in upper:
            result += ' ' + c.lower()
        else:
            result += c
    return result
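The character-by-character loop can also be written as a single re.sub call with a replacement function. This sketch assumes that every ASCII uppercase letter should be split off, not only the letters in "WLMFSC", and the name cap_space_re is made up for illustration.

```python
import re

def cap_space_re(txt):
    # Each uppercase letter becomes a space plus its lowercase form.
    return re.sub(r"[A-Z]", lambda m: " " + m.group().lower(), txt)

for s in ["helloWorld!", "iLoveMyFriend", "iLikeSwimming", "takeCare"]:
    print(cap_space_re(s))
```

A string that starts with an uppercase letter would get a leading space; calling .lstrip() on the result is one way to drop it.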

How to extract a specific symbols from string if they follow after the number

I need to extract a single or multiple symbols # from a string. I only need those symbols that follow one after another and are not separated with any characters and white spaces.
The symbol or multiple symbols # should follow right after number. If not the symbols should be disregarded and not returned.
From string a I would need to extract only the three ### symbols, since the fourth symbol is separated by a white space character.
a='some text 1 a8 777### # more text here 123 456'
result would be:
###
From variable b the function would return None since not a single symbol # follows after a number or numbers.
b='some text ### # more text here 123 456'
From c variable only a single symbol # is returned since it is the only one that follows after the numbers (and not separated from them):
c='some text ### 777# more text here 123 456'
result: #
You can use regex for this:
>>> import re
>>> r = re.compile(r'\d(#+)')
>>> a = 'some text 1 a8 777### # more text here 123 456'
>>> r.search(a).group(1)
'###'
>>> b = 'some text ### # more text here 123 456'
>>> r.search(b) #None
>>> c = 'some text ### 777# more text here 123 456'
>>> r.search(c).group(1)
'#'
Combine it with an if condition to check whether the regex matched anything in string or not:
>>> m = r.search(c)
>>> if m:
...     print m.group(1)
...
#
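The search and the None check can be wrapped into one small function. This is a sketch, not part of the answer above: the function name is hypothetical, and it uses a lookbehind, (?<=\d), in place of the capture group so that the match itself is just the run of # characters.

```python
import re

# One or more '#' immediately preceded by a digit.
HASHES_AFTER_DIGIT = re.compile(r"(?<=\d)#+")

def hashes_after_number(text):
    """Return the first run of '#' that directly follows a digit, or None."""
    m = HASHES_AFTER_DIGIT.search(text)
    return m.group(0) if m else None

print(hashes_after_number("some text 1 a8 777### # more text here 123 456"))
print(hashes_after_number("some text ### # more text here 123 456"))
print(hashes_after_number("some text ### 777# more text here 123 456"))
```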
While there's probably a regular expression to do this, a loop is easier to understand if you don't know what regex-es are.
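# Wrapped in a function (the name is illustrative) so that return is valid.
def find_hashes(string):
    i = 0
    found = False
    while i < len(string) and not found:
        if i != 0 and string[i] == '#':
            if string[i-1].isnumeric():
                found = True
            else:
                i+=1
        else:
            i+=1
    if not found:
        return None
    else:
        out = ''
        # The bounds check avoids an IndexError when the string ends with '#'.
        while i < len(string) and string[i] == '#':
            out += '#'
            i+=1
        return out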
Probably can be rewritten better, but that's the simplistic way to do it.
Footnote: A regex would be better though.
import re
print re.findall('[0-9]#+', a)
This would print a list containing all the matches, in the above case it would print
['7###']
Now you can do slicing on the string, to get what you want.
Hope this helps !
Does this work for you?
>>> import re
>>> re.search('\d(#+)', a).groups()[0]
'###'
>>> re.search('\d(#+)', b)
>>> re.search('\d(#+)', c).groups()[0]
'#'

Substring search for multiword strings - Python

I want to check a set of sentences and see whether some seed words occur in them. But I want to avoid a plain substring check such as seed in line, because that would report the seed word ring as appearing in a doc that contains the word bring.
I also want to check whether multiword expressions (MWE) like word with spaces appears in the document.
I've tried this but this is uber slow, is there a faster way of doing this?
seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]
docs_seed = []
for d in docs:
    toAdd = False
    for s in seed:
        if " " in s:
            if s in d:
                toAdd = True
        if s in d.split(" "):
            toAdd = True
        if toAdd == True:
            docs_seed.append((s,d))
            break
print docs_seed
The desired output should be this:
[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]
Consider using a regular expression:
import re
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)
\b matches the start or end of a "word" (sequence of word characters).
Example:
>>> for line in docs:
...     print pattern.findall(line)
...
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
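To get the (seed, doc) pairs the question asks for, the compiled pattern could be used with search, which stops at the first hit in each doc. This is a sketch using the seed and docs lists from the question, with the quoting fixed. Note that search reports the leftmost match in the text, which for these inputs happens to agree with the desired output but is not the same rule as "first seed in list order".

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

# \b on both sides keeps 'ring' from matching inside 'bring'.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')

docs_seed = []
for doc in docs:
    m = pattern.search(doc)
    if m:
        docs_seed.append((m.group(0), doc))

for pair in docs_seed:
    print(pair)
```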
This should work and be somewhat faster than your current approach:
docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        if pos == -1:
            continue
        end = pos + len(s)
        # Check both boundaries without indexing past either end of the string.
        if (pos == 0 or d[pos - 1] == " ") and (end == len(d) or d[end] == " "):
            docs_seed.append((s, d))
            break
find gives us the position of the seed value in the doc (or -1 if it is not found), we then check that the characters before and after the value are spaces (or the string ends after the substring). This also fixes the bug in your original code that multiword expressions don't need to start or end on a word boundary - your original code would match "words with spaces" for an input like "swords with spaces".

Determining if a query is in a string

I have a list of query terms, each with a boolean operator associated with them, like, say:
tom OR jerry OR desperate AND dan OR mickey AND mouse
Okay, now I have a string containing user-defined input, inputStr.
My question is, in Python, is there a way to determine if the string defined by the user contains the words in the "query"?
I have tried this:
if ('tom' or 'jerry' or 'desperate' and 'dan' or 'mickey' and 'mouse') in "cartoon dan character desperate":
    print "in string"
But it doesn't give the output I expect.
As you can see, I don't care about whether the query terms are ordered; just whether they are in the string or not.
Can this be done? Am I missing something like a library which can help me achieve the required functionality?
Many thanks for any help.
To check whether any of the words in a list are in the string:
any(word in string for word in lst)
Example:
# construct list from the query by removing 'OR', 'AND'
query = "tom OR jerry OR desperate AND dan OR mickey AND mouse"
lst = [term for term in query.split() if term not in ["OR", "AND"]]
string = "cartoon dan character desperate"
print any(word in string for word in lst)
If you use re.search() as @jro suggested then don't forget to escape words to avoid collisions with the regex syntax:
import re
m = re.search("|".join(map(re.escape, lst)), string)
if m:
    print "some word from the list is in the string"
The above code assumes that query has no meaning other than the words it contains. If it does then assuming that 'AND' binds stronger than 'OR' i.e., 'a or b and c' means 'a or (b and c)' you could check whether a string satisfies query:
def query_in_string(query, string):
    for term in query.split('OR'):
        lst = map(str.strip, term.split('AND'))
        if all(word in string for word in lst):
            return True
    return False
The above could be written more concisely but it might be less readable:
def query_in_string(query, string):
    return any(all(word.strip() in string for word in term.split('AND'))
               for term in query.split('OR'))
Example
query = "tom OR jerry AND dan"
print query_in_string(query, "cartoon jerry") # -> False no dan or tom
print query_in_string(query, "tom is happy") # -> True tom
print query_in_string(query, "dan likes jerry") # -> True jerry and dan
If you want to reject partial matches e.g., 'dan' should not match 'danial' then instead of word in string you could use re.search() and add '\b':
re.search(r"\b%s\b" % re.escape(word), string)
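Putting the word-boundary check together with the AND/OR handling might look like the sketch below. The function names are hypothetical, and it assumes the operators are written in upper case and surrounded by whitespace, so that splitting does not cut through a word that merely contains "OR" or "AND".

```python
import re

def contains_word(word, text):
    """True if word occurs in text as a whole word."""
    return re.search(r"\b%s\b" % re.escape(word), text) is not None

def query_matches(query, text):
    """Evaluate a flat query where AND binds tighter than OR."""
    or_terms = re.split(r"\s+OR\s+", query.strip())
    return any(
        all(contains_word(word.strip(), text)
            for word in re.split(r"\s+AND\s+", term))
        for term in or_terms
    )

query = "tom OR jerry AND dan"
print(query_matches(query, "cartoon jerry"))
print(query_matches(query, "tom is happy"))
print(query_matches(query, "dan likes jerry"))
print(query_matches(query, "danial likes jerry"))
```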
I would use a regular expression:
>>> import re
>>> s = "cartoon dan character desperate"
>>> l = ['dan', 'mickey', 'mouse']
>>> print re.search('(%s)' % '|'.join(l), s)
<_sre.SRE_Match object at 0x0233AA60>
>>> l = ['nothing']
>>> print re.search('(%s)' % '|'.join(l), s)
None
Where s is the string to search in and l is a list of words that should be in s. If the search function doesn't return None, you have a match.
if ('tom' or 'jerry' or 'desperate' and 'dan' or 'mickey' and 'mouse') in "cartoon dan character desperate"
does not mean what you think it means, because the parentheses cause the or and and operations to be evaluated first, e.g:
>>> "tom" or "jerry" or "desperate" and "dan" or "mickey" and "mouse"
'tom'
... so your if-clause really means if 'tom' in "cartoon dan character desperate".
What you probably meant was something like:
if ('tom' in inputStr) or ('jerry' in inputStr) or ('desperate' in inputStr and 'dan' in inputStr) or ('mickey' in inputStr and 'mouse' in inputStr)
