Substring search for multiword strings - Python - python

I want to check a set of sentences and see whether any seed words occur in them. I want to avoid a plain substring test such as seed in line, because that would report the seed word ring as appearing in a doc that only contains the word bring.
I also want to check whether multiword expressions (MWEs), such as words with spaces, appear in the document.
I've tried the code below, but it is very slow. Is there a faster way of doing this?
seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]
docs_seed = []
for d in docs:
    toAdd = False
    for s in seed:
        if " " in s:
            if s in d:
                toAdd = True
        if s in d.split(" "):
            toAdd = True
        if toAdd == True:
            docs_seed.append((s, d))
            break
print(docs_seed)
The desired output should be this:
[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]

Consider using a regular expression:
import re
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)
\b matches the start or end of a "word" (sequence of word characters).
Example:
>>> for line in docs:
...     print(pattern.findall(line))
...
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
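To get the (seed, doc) pairs the question asks for, the same pattern can be used with search instead of findall. The sketch below is one possible arrangement, not the answerer's code: it sorts the alternatives longest first (an assumption on my part) so that a multiword seed wins over a shorter seed starting at the same position, which means the third doc is reported with 'bar bar' rather than 'bar'.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

# Longest alternatives first, so multiword seeds are preferred over their prefixes.
alternatives = sorted(seed, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in alternatives) + r')\b')

docs_seed = []
for d in docs:
    m = pattern.search(d)  # leftmost whole-word match, or None
    if m:
        docs_seed.append((m.group(0), d))
```

Compiling the pattern once and scanning each doc a single time avoids the nested loop over seeds.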

This should work and be somewhat faster than your current approach:
docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        end = pos + len(s)
        # The match must start at the beginning of the doc or after a space,
        # and end at the end of the doc or before a space.
        if (pos != -1
                and (pos == 0 or d[pos - 1] == " ")
                and (end == len(d) or d[end] == " ")):
            docs_seed.append((s, d))
            break
find gives us the position of the seed value in the doc (or -1 if it is not found). We then check that the characters before and after the value are spaces (or that the value sits at the very start or end of the string). Only the first occurrence of each seed is checked. This also fixes a bug in your original code, where multiword expressions did not need to start or end on a word boundary: your original code would match "words with spaces" for an input like "swords with spaces".
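Since str.find only reports the first occurrence, a doc such as "bring the ring" would be rejected for the seed "ring" if only that first hit were inspected. A minimal sketch of a helper that keeps looking at later occurrences (has_whole_match is a hypothetical name, not part of the answer):

```python
def has_whole_match(doc, seed):
    """Return True if seed occurs in doc delimited by spaces or the string edges."""
    pos = doc.find(seed)
    while pos != -1:
        end = pos + len(seed)
        starts_ok = pos == 0 or doc[pos - 1] == " "
        ends_ok = end == len(doc) or doc[end] == " "
        if starts_ok and ends_ok:
            return True
        # Try the next occurrence, if any.
        pos = doc.find(seed, pos + 1)
    return False
```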

Related

Placeholder variable for a string in Python that can take the form of any character

Is there some kind of method in Python that would allow me to do something like this:
if string == "*" + "this" + "*" +"blue" + "*":
and have it be True if "string" was "this is blue", "this was blue" or "XYZ this XYZ blue XYZ"?
Is something like this possible in Python in a simple way?
I don't mean the %s format, because there you need to pass some value as %s. I need it to be able to check all possible forms, i.e. just ignore everything between "this" and "blue". However, I can't just check whether the first 4 characters of the string are "this" and the last 4 are "blue", because the string is actually a long text and I need to be able to check whether somewhere within this long text there is a part that says "this .... blue".
Use a regex, which is provided in Python via the re module:
>>> import re
>>> re.match(".*this.*blue.*", "this is blue")
<re.Match object; span=(0, 12), match='this is blue'>
In a regex, .* has the wildcard effect you're looking for; . means "any character" and * means "any number of them".
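For a long text, a variant of the same idea is to use re.search, which does not need the leading and trailing .*, together with re.DOTALL so that . also matches newlines. This is a sketch; contains_in_order is a hypothetical helper name, and re.escape is used on the assumption that the two words should be matched literally.

```python
import re

def contains_in_order(text, first, second):
    """True if `first` appears somewhere before `second` in text."""
    pattern = re.escape(first) + r'.*' + re.escape(second)
    return re.search(pattern, text, flags=re.DOTALL) is not None
```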
If you wanted to do this without a regex, you could use the str.find method, which gives you the index of the first occurrence of a string within a larger string. First find the index of the first word:
>>> string = "did you know that this blue bird is a corvid?"
>>> string.find("this")
18
You can then slice the string to everything after that word:
>>> string[18+4:]
' blue bird is a corvid?'
and repeat the find operation with the second word:
>>> string[18+4:].find("blue")
1
If either find() call returns -1, there is no match. In the form of a function this might look like:
>>> def find_words(string, *words):
...     for word in words:
...         i = string.find(word)
...         if i < 0:
...             return False
...         string = string[i+len(word):]
...     return True
...
>>> find_words(string, "blue", "bird")
True
>>> find_words(string, "bird", "blue")
False
Nope, but I think you can roll your own. Something like this:
inps = [
    'this is blue',
    'this was blue',
    'XYZ this XYZ blue XYZ',
    'this is',
    'blue here',
]
def find_it(string: str, *werds):
    num_werds = len(werds)
    werd_idx = 0
    cur_werd = werds[werd_idx]
    for w in string.split(' '):
        if w == cur_werd:
            werd_idx += 1
            if werd_idx == num_werds:
                return True
            cur_werd = werds[werd_idx]
    return False
for s in inps:
    print(find_it(s, 'this', 'blue'))
Out:
True
True
True
False
False
Try this; it uses only simple Python code. Each * stands for exactly one word.
def check(string, patt):
    string = string.split(' ')
    patt = patt.split(' ')
    if len(string) != len(patt):
        return False
    return all(v == "*" or v == string[i] for i, v in enumerate(patt))
patt = "* blue * sky *"
string = "Hii blue is sky h"
if check(string, patt):
    print("Yes")
else:
    print("No")
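The "*" syntax from the question is close to shell-style wildcards, which the standard library's fnmatch module supports. A minimal sketch, with matches_wildcard as a hypothetical wrapper name; note that in fnmatch patterns ? and [...] are also special, so this only suits patterns where those characters are not meant literally.

```python
from fnmatch import fnmatchcase

def matches_wildcard(text, pattern):
    """Case-sensitive shell-style match; '*' matches any run of characters."""
    return fnmatchcase(text, pattern)

for s in ['this is blue', 'this was blue', 'XYZ this XYZ blue XYZ',
          'this is', 'blue here']:
    print(s, matches_wildcard(s, '*this*blue*'))
```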

find the uppercase letter on a string and replace it

This is my code :
def cap_space(txt):
    e = txt
    upper = "WLMFSC"
    letters = [each for each in e if each in upper]
    a = ''.join(letters)
    b = a.lower()
    c = txt.replace(a,' '+b)
    return c
I wrote it to find the uppercase letters in a given string and replace each one with a space followed by its lowercase form.
Example input:
print(cap_space('helloWorld!'))
print(cap_space('iLoveMyFriend'))
print(cap_space('iLikeSwimming'))
print(cap_space('takeCare'))
What the output should look like:
hello world!
i love my friend
i like swimming
take care
What I get as output instead:
hello world!
iLoveMyFriend
iLikeSwimming
take care
The problem is that the replacement is only applied when there is a single uppercase letter in the given string. How can I improve it so that it is applied to every uppercase letter in the string?
Being a regex addict, I can offer the following solution which relies on re.findall with an appropriate regex pattern:
import re
def cap_space(txt):
    parts = re.findall(r'^[a-z]+|[A-Z][a-z]*[^\w\s]?', txt)
    output = ' '.join(parts).lower()
    return output
inp = ['helloWorld!', 'iLoveMyFriend', 'iLikeSwimming', 'takeCare']
output = [cap_space(x) for x in inp]
print(inp)
print(output)
This prints:
['helloWorld!', 'iLoveMyFriend', 'iLikeSwimming', 'takeCare']
['hello world!', 'i love my friend', 'i like swimming', 'take care']
Here is an explanation of the regex pattern used:
^[a-z]+ match an all lowercase word from the very start of the string
| OR
[A-Z] match a leading uppercase letter
[a-z]* followed by zero or more lowercase letters
[^\w\s]? followed by an optional "symbol" (defined here as any non word,
non whitespace character)
You can make use of nice python3 methods str.translate and str.maketrans:
In [281]: def cap_space(txt):
     ...:     upper = "WLMFSC"
     ...:     letters = [each for each in txt if each in upper]
     ...:     d = {i: ' ' + i.lower() for i in letters}
     ...:     return txt.translate(str.maketrans(d))
     ...:
     ...:
In [283]: print(cap_space('helloWorld!'))
     ...: print(cap_space('iLoveMyFriend'))
     ...: print(cap_space('iLikeSwimming'))
     ...: print(cap_space('takeCare'))
hello world!
i love my friend
i like swimming
take care
A simple and crude way. It might not be the most efficient, but it is easier to understand:
def cap_space(sentence):
    characters = []
    for character in sentence:
        # Only uppercase letters get a space; punctuation such as '!' is kept as-is.
        if character.isupper():
            characters.append(f' {character.lower()}')
        else:
            characters.append(character)
    return ''.join(characters)
a is all the matching uppercase letters combined into a single string. When you try to replace them with txt.replace(a, ' '+b), it will only match if all the matching uppercase letters are consecutive in txt, or there's just a single match. str.replace() matches and replaces the whole search string, not any characters in it.
Combining all the matches into a single string won't work. Just loop through txt, checking each character to see if it matches.
def cap_space(txt):
    result = ''
    upper = "WLMFSC"
    for c in txt:
        if c in upper:
            result += ' ' + c.lower()
        else:
            result += c
    return result
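If the fixed "WLMFSC" list is not essential, the per-character loop can also be expressed with re.sub and a replacement function. This is a sketch under that assumption (any ASCII uppercase letter is treated as a word start); cap_space_any is a hypothetical name. A string that begins with an uppercase letter would come back with a leading space.

```python
import re

def cap_space_any(txt):
    """Replace every ASCII uppercase letter with a space plus its lowercase form."""
    return re.sub(r'[A-Z]', lambda m: ' ' + m.group(0).lower(), txt)

print(cap_space_any('iLoveMyFriend'))
```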

Find substring in Python

I have found synonyms of a word "plant"
syn = wordnet.synsets('plant')[0].lemmas()
>>>[Lemma('plant.n.01.plant'), Lemma('plant.n.01.works'), Lemma('plant.n.01.industrial_plant')]
and an input word
word = 'work'
I want to find if 'work' appears in syn. How to do it?
Lemmas have a name() method, so what you could do is
>>> 'works' in map(lambda x: x.name(), syn)
True
Edit: I did not see that you said "work", not "works", so this would be:
>>> for i in syn:
...     if 'work' in i.name():
...         print(True)
...
True
You can wrap it in a function, for example.
Or a mixture of the two suggestions I made:
any(map(lambda x: 'work' in x, map(lambda x: x.name(), syn)))
You can easily check for the presence of a substring using the keyword in in python:
>>> word = "work"
>>> word in 'plant.n.01.works'
True
>>> word in 'plant.n.01.industrial_plant'
False
If you want to test this in a list you can do a loop:
syn = ["plant.one","plant.two"]
for plant in syn:
    if word in plant:
        print("ok")
Or better a list comprehension:
result = [word in plant for plant in syn]
# To get the number of matches, you can sum the resulting list:
sum(result)
Edit: If you have a long list of words to look for, you can just nest two loops:
words_to_search = ["work","spam","foo"]
syn = ["plant.one","plant.two"]
for word in words_to_search:
    if sum([word in plant for plant in syn]):
        print("{} is present in syn".format(word))
Note that you are manipulating Lemma objects and not strings. You might need to check for word in plant.name() instead of just word in plant if the object does not implement the [__contains__](https://docs.python.org/2/library/operator.html#operator.__contains__) method. I am not familiar with this library though.
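str1 = "this is a example , xxx"
str2 = "example"
target_len = len(str2)
str_start_position = str1.index(str2) #or str1.find(str2)
str_end_position = str_start_position + target_len
You can then use str_start_position and str_end_position to slice out the target substring, i.e. str1[str_start_position:str_end_position]. Note that index raises ValueError when str2 is absent, while find returns -1.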
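The substring checks above can be condensed with any() and a generator expression. The sketch below uses plain strings as a stand-in for the lemma names, on the assumption that [l.name() for l in syn] would yield the names shown in the question's output; NLTK itself is not imported here.

```python
# Stand-in for [l.name() for l in syn]; values taken from the question's output.
lemma_names = ['plant', 'works', 'industrial_plant']
word = 'work'

# True as soon as one name contains the word as a substring.
found = any(word in name for name in lemma_names)

# The names that actually contain it.
matches = [name for name in lemma_names if word in name]
print(found, matches)
```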

Python change a single character of a string [duplicate]

s = 'the brown fox'
...do something here...
s should be:
'The Brown Fox'
What's the easiest way to do this?
The .title() method of a string (either ASCII or Unicode is fine) does this:
>>> "hello world".title()
'Hello World'
>>> u"hello world".title()
u'Hello World'
However, look out for strings with embedded apostrophes, as noted in the docs.
The algorithm uses a simple language-independent definition of a word as groups of consecutive letters. The definition works in many contexts but it means that apostrophes in contractions and possessives form word boundaries, which may not be the desired result:
>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"
The .title() method doesn't handle apostrophes well:
>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"
Try the string.capwords() function instead:
>>> import string
>>> string.capwords("they're bill's friends from the UK")
"They're Bill's Friends From The Uk"
From the Python documentation on capwords:
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed, otherwise sep is used to split and join the words.
Just because this sort of thing is fun for me, here are two more solutions.
Split into words, initial-cap each word from the split groups, and rejoin. This will change the white space separating the words into a single white space, no matter what it was.
s = 'the brown fox'
lst = [word[0].upper() + word[1:] for word in s.split()]
s = " ".join(lst)
EDIT: I don't remember what I was thinking back when I wrote the above code, but there is no need to build an explicit list; we can use a generator expression to do it in lazy fashion. So here is a better solution:
s = 'the brown fox'
s = ' '.join(word[0].upper() + word[1:] for word in s.split())
Use a regular expression to match the beginning of the string, or white space separating words, plus a single non-whitespace character; use parentheses to mark "match groups". Write a function that takes a match object, and returns the white space match group unchanged and the non-whitespace character match group in upper case. Then use re.sub() to replace the patterns. This one does not have the punctuation problems of the first solution, nor does it redo the white space like my first solution. This one produces the best result.
import re
s = 'the brown fox'
def repl_func(m):
    """process regular expression match groups for word upper-casing problem"""
    return m.group(1) + m.group(2).upper()
s = re.sub(r"(^|\s)(\S)", repl_func, s)
>>> re.sub(r"(^|\s)(\S)", repl_func, "they're bill's friends from the UK")
"They're Bill's Friends From The UK"
I'm glad I researched this answer. I had no idea that re.sub() could take a function! You can do nontrivial processing inside re.sub() to produce the final result!
Here is a summary of different ways to do it, and some pitfalls to watch out for
They will work for all these inputs:
"" => ""
"a b c" => "A B C"
"foO baR" => "FoO BaR"
"foo    bar" => "Foo    Bar"
"foo's bar" => "Foo's Bar"
"foo's1bar" => "Foo's1bar"
"foo 1bar" => "Foo 1bar"
Splitting the sentence into words and capitalizing the first letter then join it back together:
# Be careful with multiple spaces, and empty strings
# for empty words w[0] would cause an index error,
# but with w[:1] we get an empty string as desired
def cap_sentence(s):
    return ' '.join(w[:1].upper() + w[1:] for w in s.split(' '))
Without splitting the string, checking blank spaces to find the start of a word:
def cap_sentence(s):
    return ''.join((c.upper() if i == 0 or s[i-1] == ' ' else c) for i, c in enumerate(s))
Or using generators:
# Iterate through each of the characters in the string
# and capitalize the first char and any char after a blank space
from itertools import chain
def cap_sentence(s):
    return ''.join((c.upper() if prev == ' ' else c) for c, prev in zip(s, chain(' ', s)))
Using regular expressions, from steveha's answer:
# match the beginning of the string or a space, followed by a non-space
import re
def cap_sentence(s):
    return re.sub(r"(^|\s)(\S)", lambda m: m.group(1) + m.group(2).upper(), s)
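A quick way to check the claim for the listed inputs is to run two of the variants against that table. This is a small sketch; cap_split and cap_regex are hypothetical names for the split-based and regex-based versions. One difference to keep in mind is that the regex version also treats tabs and newlines as word separators, while the split(' ') version only looks at spaces.

```python
import re

def cap_split(s):
    return ' '.join(w[:1].upper() + w[1:] for w in s.split(' '))

def cap_regex(s):
    return re.sub(r"(^|\s)(\S)", lambda m: m.group(1) + m.group(2).upper(), s)

# Input => expected output, following the table of inputs.
cases = {
    "": "",
    "a b c": "A B C",
    "foO baR": "FoO BaR",
    "foo    bar": "Foo    Bar",
    "foo's bar": "Foo's Bar",
    "foo's1bar": "Foo's1bar",
    "foo 1bar": "Foo 1bar",
}

for raw, expected in cases.items():
    assert cap_split(raw) == expected
    assert cap_regex(raw) == expected
```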
Now, these are some other answers that were posted, and inputs for which they don't work as expected if we define a word as being the start of the sentence or anything after a blank space:
.title()
return s.title()
# Undesired outputs:
"foO baR" => "Foo Bar"
"foo's bar" => "Foo'S Bar"
"foo's1bar" => "Foo'S1Bar"
"foo 1bar" => "Foo 1Bar"
.capitalize() or .capwords()
return ' '.join(w.capitalize() for w in s.split())
# or
import string
return string.capwords(s)
# Undesired outputs:
"foO baR" => "Foo Bar"
"foo    bar" => "Foo Bar"
using ' ' for the split will fix the second output, but not the first
return ' '.join(w.capitalize() for w in s.split(' '))
# or
import string
return string.capwords(s, ' ')
# Undesired outputs:
"foO baR" => "Foo Bar"
.upper()
Be careful with multiple blank spaces; this gets fixed by using ' ' for the split (as shown at the top of the answer)
return ' '.join(w[0].upper() + w[1:] for w in s.split())
# Undesired outputs:
"foo    bar" => "Foo Bar"
Why do you complicate your life with joins and for loops when the solution is simple and safe??
Just do this:
string = "the brown fox"
string[0].upper()+string[1:]
Copy-paste-ready version of @jibberia's answer:
def capitalize(line):
    return ' '.join(s[:1].upper() + s[1:] for s in line.split(' '))
If only you want the first letter:
>>> 'hello world'.capitalize()
'Hello world'
But to capitalize each word:
>>> 'hello world'.title()
'Hello World'
If str.title() doesn't work for you, do the capitalization yourself.
Split the string into a list of words
Capitalize the first letter of each word
Join the words into a single string
One-liner:
>>> ' '.join([s[0].upper() + s[1:] for s in "they're bill's friends from the UK".split(' ')])
"They're Bill's Friends From The UK"
Clear example:
input = "they're bill's friends from the UK"
words = input.split(' ')
capitalized_words = []
for word in words:
    title_case_word = word[0].upper() + word[1:]
    capitalized_words.append(title_case_word)
output = ' '.join(capitalized_words)
An empty string will raise an IndexError if you access [0]. Therefore I would use:
def my_uppercase(title):
    if not title:
        return ''
    return title[0].upper() + title[1:]
to uppercase the first letter only.
Although all the answers are already satisfactory, I'll try to cover two extra cases along with all the previous cases.
If the spaces are not uniform and you want to maintain them:
string = hello world i am here.
If not all of the words start with a letter:
string = 1 w 2 r 3g
Here you can use this:
def solve(s):
    a = s.split(' ')
    for i in range(len(a)):
        a[i] = a[i].capitalize()
    return ' '.join(a)
This will give you:
output = Hello World I Am Here
output = 1 W 2 R 3g
As Mark pointed out, you should use .title():
"MyAwesomeString".title()
However, if would like to make the first letter uppercase inside a Django template, you could use this:
{{ "MyAwesomeString"|title }}
Or using a variable:
{{ myvar|title }}
The suggested method str.title() does not work in all cases.
For example:
string = "a b 3c"
string.title()
> "A B 3C"
instead of "A B 3c".
I think, it is better to do something like this:
def capitalize_words(string):
    words = string.split(" ") # just change the split(" ") method
    return ' '.join([word.capitalize() for word in words])
capitalize_words(string)
>'A B 3c'
To capitalize words:
text = "this is string example.... wow!!!"
print("text.title() : ", text.title())
# Per @Gary02127's comment, the solution below works with titles containing apostrophes
import re
def titlecase(s):
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?", lambda mo: mo.group(0)[0].upper() + mo.group(0)[1:].lower(), s)
text = "He's an engineer, isn't he? SnippetBucket.com "
print(titlecase(text))
You can try this. Simple and neat.
def cap_each(string):
    list_of_words = string.split(" ")
    for word in list_of_words:
        list_of_words[list_of_words.index(word)] = word.capitalize()
    return " ".join(list_of_words)
Don't overlook the preservation of white space. If you want to process 'fred  flinstone' and you get 'Fred Flinstone' instead of 'Fred  Flinstone', you've corrupted your white space. Some of the above solutions will lose white space. Here's a solution that's good for Python 2 and 3 and preserves white space.
import re
def propercase(s):
    # The capturing group keeps the whitespace runs in the split result.
    return ''.join(map(str.capitalize, re.split(r'(\s+)', s)))
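The .title() method won't work in all test cases, so one option is to use .capitalize(), .replace() and .split() together to capitalize the first letter of each word. Note that y.replace(i, ...) replaces every occurrence of i, including occurrences inside longer words, so this only suits inputs where no word is a substring of another.
def caps(y):
    k = y.split()
    for i in k:
        y = y.replace(i, i.capitalize())
    return y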
You can use title() method to capitalize each word in a string in Python:
string = "this is a test string"
capitalized_string = string.title()
print(capitalized_string)
Output:
This Is A Test String
A quick function that worked for me in Python 3 (3.6.9 on Linux):
>>> capitalizeFirtChar = lambda s: s[:1].upper() + s[1:]
>>> print(capitalizeFirtChar('помните своих Предковъ. Сражайся за Правду и Справедливость!'))
Помните своих Предковъ. Сражайся за Правду и Справедливость!
>>> print(capitalizeFirtChar('хай живе вільна Україна! Хай живе Любовь поміж нас.'))
Хай живе вільна Україна! Хай живе Любовь поміж нас.
>>> print(capitalizeFirtChar('faith and Labour make Dreams come true.'))
Faith and Labour make Dreams come true.
Capitalize string with non-uniform spaces
I would like to add to @Amit Gupta's point about non-uniform spaces:
From the original question, we would like to capitalize every word in the string s = 'the brown fox'. What if the string was s = 'the brown      fox', with non-uniform spaces?
def solve(s):
    # If you want to maintain the spaces in the string, s = 'the brown      fox'
    # Use s.split(' ') instead of s.split().
    # s.split() returns ['the', 'brown', 'fox']
    # while s.split(' ') returns ['the', 'brown', '', '', '', '', '', 'fox']
    capitalized_word_list = [word.capitalize() for word in s.split(' ')]
    return ' '.join(capitalized_word_list)
Easiest solution for your question, it worked in my case:
import string
def solve(s):
    return string.capwords(s, ' ')
s=input()
res=solve(s)
print(res)
Another oneline solution could be:
" ".join(map(lambda d: d.capitalize(), word.split(' ')))
In case you want to lowercase everything instead:
from csv import reader  # assumption: reader here is csv.reader
# Assuming you are opening a new file
with open(input_file) as file:
    lines = [x for x in reader(file) if x]
# for loop to parse the file by line
for line in lines:
    name = [x.strip().lower() for x in line if x]
    print(name)  # Check the result
I really like this answer:
Copy-paste-ready version of @jibberia's answer:
def capitalize(line):
    return ' '.join([s[0].upper() + s[1:] for s in line.split(' ')])
But some of the lines that I was sending split into blank '' strings, which caused errors when evaluating s[0]. There is probably a better way to do this, but I had to add an if len(s)>0, as in
return ' '.join([s[0].upper() + s[1:] for s in line.split(' ') if len(s)>0])

Replace adjacent identical tokens that match a regex

In a Python application, I need to replace adjacent identical occurrences of whitespace-separated tokens that match a regex, e.g. for a pattern such as "a\w\w":
"xyz abc abc zzq ak9 ak9 ak9 foo abc" --> "xyz abc*2 zzq ak9*3 foo abc"
EDIT
My example above didn't make it clear that tokens which don't match the regex should not be aggregated. A better example is
"xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc"
--> "xyz xyz abc*2 zzq ak9*3 foo foo abc"
END EDIT
I have working code posted below, but it seems more complicated than it should be.
I'm not looking for a round of code golf, but I would be interested in a solution that's more readable using standard Python libraries with similar performance.
In my application, it's safe to assume that the input strings will be less than 10000 chars long and that any given string will contain only a handful, say < 10, of the possible strings that match the pattern.
import re
def fm_pattern_factory(ptnstring):
    """
    Return a regex that matches two or more occurrences
    of ptnstring separated by whitespace.
    >>> fm_pattern_factory('abc').match(' abc abc ') is None
    False
    >>> fm_pattern_factory('abc').match('abc') is None
    True
    """
    ptn = r"\s*({}(?:\s+{})+)\s*".format(ptnstring, ptnstring)
    return re.compile(ptn)
def fm_gather(target, ptnstring):
    """
    Replace adjacent occurrences of ptnstring in target with
    ptnstring*N where N is the number of occurrences.
    >>> fm_gather('xyz abc abc def abc', 'abc')
    'xyz abc*2 def abc'
    >>> fm_gather('xyz abc abc def abc abc abc qrs', 'abc')
    'xyz abc*2 def abc*3 qrs'
    """
    ptn = fm_pattern_factory(ptnstring)
    result = []
    index = 0
    for match in ptn.finditer(target):
        result.append(target[index:match.start()+1])
        repl = "{}*{}".format(ptnstring, match.group(1).count(ptnstring))
        result.append(repl)
        index = match.end() - 1
    result.append(target[index:])
    return "".join(result)
def fm_gather_all(target, ptn):
    """
    Apply fm_gather() to all distinct matches for ptn.
    >>> s = "x abc abc y abx abx z acq"
    >>> ptn = re.compile(r"a..")
    >>> fm_gather_all(s, ptn)
    'x abc*2 y abx*2 z acq'
    """
    ptns = set(ptn.findall(target))
    for p in ptns:
        target = fm_gather(target, p)
    return "".join(target)
Sorry, I was working on the answer before seeing your first comment. If this doesn't answer your question, let me know, and I'll remove it or try to modify it accordingly.
For the simple input provided in the question (stored in the my_string variable in the code below), you could try a different approach: walk your input list and keep a "bucket" of <matching_word, num_of_occurrences>:
my_string = "xyz abc abc zzq ak9 ak9 ak9 foo abc"
my_splitted_string = my_string.split(' ')
occurrences = []
print("my_splitted_string is a %s now containing: %s"
      % (type(my_splitted_string), my_splitted_string))
current_bucket = [my_splitted_string[0], 1]
occurrences.append(current_bucket)
for i in range(1, len(my_splitted_string)):
    current_word = my_splitted_string[i]
    print("Does %s match %s?" % (current_word, current_bucket[0]))
    if current_word == current_bucket[0]:
        current_bucket[1] += 1
        print("It does. Aggregating")
    else:
        current_bucket = [current_word, 1]
        occurrences.append(current_bucket)
        print("It doesn't. Creating a new 'bucket'")
print("Collected occurrences: %s" % occurrences)
# Now re-collect:
re_collected_str = ""
for occurrence in occurrences:
    if occurrence[1] > 1:
        re_collected_str += "%s*%d " % (occurrence[0], occurrence[1])
    else:
        re_collected_str += "%s " % (occurrence[0])
print("Compressed string: '%s'" % re_collected_str)
This outputs:
my_splitted_string is a <type 'list'> now containing: ['xyz', 'abc', 'abc', 'zzq', 'ak9', 'ak9', 'ak9', 'foo', 'abc']
Does abc match xyz?
It doesn't. Creating a new 'bucket'
Does abc match abc?
It does. Aggregating
Does zzq match abc?
It doesn't. Creating a new 'bucket'
Does ak9 match zzq?
It doesn't. Creating a new 'bucket'
Does ak9 match ak9?
It does. Aggregating
Does ak9 match ak9?
It does. Aggregating
Does foo match ak9?
It doesn't. Creating a new 'bucket'
Does abc match foo?
It doesn't. Creating a new 'bucket'
Collected occurrences: [['xyz', 1], ['abc', 2], ['zzq', 1], ['ak9', 3], ['foo', 1], ['abc', 1]]
Compressed string: 'xyz abc*2 zzq ak9*3 foo abc '
(beware of the final blank space)
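The bucket idea maps closely onto itertools.groupby, which groups adjacent equal items. The sketch below is one possible way to combine it with the regex filter from the question's edit (only tokens that fully match the pattern are aggregated); gather_runs is a hypothetical name, and runs of whitespace are normalised to single spaces, which the question did not ask to preserve at this point.

```python
import re
from itertools import groupby

def gather_runs(target, ptn):
    """Collapse runs of identical adjacent tokens that fully match ptn into token*N."""
    token_re = re.compile(ptn)
    out = []
    for token, run in groupby(target.split()):
        count = len(list(run))
        if count > 1 and token_re.fullmatch(token):
            out.append('{}*{}'.format(token, count))
        else:
            # Non-matching or unrepeated tokens are emitted unchanged.
            out.extend([token] * count)
    return ' '.join(out)

print(gather_runs("xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc", r"a\w\w"))
```

Joining with ' '.join also avoids the trailing blank space.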
The following seems to be solid and has good performance in my application. Thanks to BorrajaX for an answer that pointed out benefits of not scanning the input string more often than absolutely necessary.
The function below also preserves newlines and whitespace in the output. I forgot to state that in my question, but it turns out to be desirable in my app which needs to produce some human-readable intermediate output.
import re
def gather_token_sequences(masterptn, target):
    """
    Find all sequences in 'target' of two or more identical adjacent tokens
    that match 'masterptn'. Count the number of tokens in each sequence.
    Return a new version of 'target' with each sequence replaced by one token
    suffixed with '*N' where N is the count of tokens in the sequence.
    Whitespace in the input is preserved (except where consumed within replaced
    sequences).
    >>> mptn = r'ab\w'
    >>> tgt = 'foo abc abc'
    >>> gather_token_sequences(mptn, tgt)
    'foo abc*2'
    >>> tgt = 'abc abc '
    >>> gather_token_sequences(mptn, tgt)
    'abc*2 '
    >>> tgt = '\\nabc\\nabc abc\\ndef\\nxyz abx\\nabx\\nxxx abc'
    >>> gather_token_sequences(mptn, tgt)
    '\\nabc*3\\ndef\\nxyz abx*2\\nxxx abc'
    """
    # Emulate python's strip() function except that the leading and trailing
    # whitespace are captured for final output. This guarantees that the
    # body of the remaining string will start and end with a token, which
    # slightly simplifies the subsequent matching loops.
    stripped = re.match(r'^(\s*)(\S.*\S)(\s*)$', target, flags=re.DOTALL)
    head, body, tail = stripped.groups()
    # Init the result list and loop variables.
    result = [head]
    i = 0
    token = None
    while i < len(body):
        ## try to match master pattern
        match = re.match(masterptn, body[i:])
        if match is None:
            ## Append char and advance.
            result += body[i]
            i += 1
        else:
            ## Start new token sequence
            token = match.group(0)
            esc = re.escape(token)  # might have special chars in token
            ptn = r"((?:{}\s+)+{})".format(esc, esc)
            seq = re.match(ptn, body[i:])
            if seq is None:  # token is not repeated.
                result.append(token)
                i += len(token)
            else:
                seqstring = seq.group(0)
                replacement = "{}*{}".format(token, seqstring.count(token))
                result.append(replacement)
                i += len(seq.group(0))
    result.append(tail)
    return ''.join(result)
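For comparison, the same behaviour can be approximated with a single re.sub call and a backreference. This is only a sketch under stated assumptions, not a drop-in replacement: gather_with_backref is a hypothetical name, tokens are taken to be delimited by whitespace or the string edges, and masterptn is wrapped in a group so that \1 refers to the whole token.

```python
import re

def gather_with_backref(masterptn, target):
    """Replace runs of identical whitespace-separated tokens matching masterptn."""
    # (?<!\S) and (?!\S) require the run to start and end on token boundaries;
    # (?:\s+\1)+ requires at least one repeat of the exact same token.
    run_re = re.compile(r'(?<!\S)({})(?:\s+\1)+(?!\S)'.format(masterptn))

    def repl(m):
        count = len(m.group(0).split())
        return '{}*{}'.format(m.group(1), count)

    return run_re.sub(repl, target)

print(gather_with_backref(r'a\w\w', "xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc"))
```

Whitespace outside the replaced runs is left untouched, since re.sub only rewrites the matched spans.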
