Python regex to find multiple consecutive punctuations - python

I am streaming plain text records via MapReduce and need to check each plain text record for 2 or more consecutive punctuation symbols. The 12 symbols I need to check for are: -/\()!"+,'&..
I have tried translating this punctuation list into an array like this:
punctuation = [r'-', r'/', r'\\', r'\(', r'\)', r'!', r'"', r'\+', r',', r"'", r'&', r'\.']
I can find individual characters with nested for loops, for example:
for t in test_cases:
    print(t)
    for p in punctuation:
        print(p)
        if re.search(p, t):
            print('found a match!', p, t)
        else:
            print('no match')
However, the single backslash character is not found when I test this and I don't know how to get only results that are 2 or more consecutive occurrences in a row. I've read that I need to use the + symbol, but don't know the correct syntax to use this.
Here are some test cases:
The quick '''brown fox
The &&quick brown fox
The quick\brown fox
The quick\\brown fox
The -quick brown// fox
The quick--brown fox
The (quick brown) fox,,,
The quick ++brown fox
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox
The quick,, brown fox
The quick brown fox...
The quick-brown fox
The ((quick brown fox
The quick brown)) fox
The quick brown fox!!!
The 'quick' brown fox
Which when translated into a Pythonic list looks like this:
test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
How do I use Python regex to identify and report all matches where the punctuation symbol appears 2 or more times in a row?

The punctuation characters can be put into a character class in square brackets.
What follows depends on whether a run of two or more punctuation characters may mix different characters or must repeat the same character.
In the first case, curly braces can be appended to specify the minimum (2) and maximum number of repetitions. The maximum is unbounded and therefore left empty:
[...]{2,} # min. 2 or more
If only repetitions of the same character need to be found, the first matched punctuation character is put into a group. A back reference to that group (= the same character) then follows one or more times:
([...])\1+
The back reference \1 refers to the first group in the expression. Groups are numbered from left to right by their opening parentheses.
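Group numbering becomes relevant as soon as a second pair of parentheses is wrapped around the whole expression: the outer group is then number 1 and the character group number 2, so the back reference has to be \2. A minimal sketch of that, using a reduced character class (only comma and exclamation mark) purely for illustration:

```python
import re

# Outer group = 1 (the whole run), inner group = 2 (the single repeated character).
# The reduced class [,!] is an illustrative assumption, not the full punctuation list.
pattern = re.compile(r'(([,!])\2+)')

match = pattern.search('The (quick brown) fox,,,')
whole_run = match.group(1)    # the complete run of repeated characters
single_char = match.group(2)  # the character that was repeated

print(whole_run, single_char)
```

A mixed pair such as ,! is not matched by this pattern, because \2 only accepts the character that group 2 captured.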
The next issue is escaping. There are escaping rules for Python strings, and additional escaping is needed in the regular expression. The character class does not require much escaping, but the backslash must be doubled. Thus the following example quadruples the backslash: one doubling because of the string, the second because of the regular expression.
Raw strings r'...' are useful for patterns, but here both the single and double quotation marks are needed inside the pattern.
>>> import re
>>> test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
...     match = pattern_same_punctuation.search(t)
...     if match:
...         print("{:24} => {}".format(t, match.group(1)))
...     else:
...         print(t)
...
The quick '''brown fox => '''
The &&quick brown fox => &&
The quick\brown fox
The quick\\brown fox => \\
The -quick brown// fox => //
The quick--brown fox => --
The (quick brown) fox,,, => ,,,
The quick ++brown fox => ++
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox => ""
The quick,, brown fox => ,,
The quick brown fox... => ...
The quick-brown fox
The ((quick brown fox => ((
The quick brown)) fox => ))
The quick brown fox!!! => !!!
The 'quick' brown fox
>>>
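As a side note, the quadrupled backslash can be avoided with a raw triple-quoted string, in which both kinds of quotation marks may appear unescaped. The sketch below is only an illustration with a made-up sample sentence; it also shows how the "any punctuation" and "same punctuation" patterns differ on a mixed run such as ),

```python
import re

# Raw triple-quoted strings: the backslash is doubled only once (for the regex),
# and both ' and " can be written without escaping.
any_punct = re.compile(r'''[-/\\()!"+,&'.]{2,}''')
same_punct = re.compile(r'''([-/\\()!"+,&'.])\1+''')

sample = 'The (quick brown fox), really!!'  # made-up sample sentence

any_runs = any_punct.findall(sample)
same_runs = [m.group(0) for m in same_punct.finditer(sample)]

print(any_runs)   # runs of any listed punctuation characters
print(same_runs)  # runs of one repeated punctuation character
```

The first pattern reports both ), and !!, the second only !!.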

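You can use {2} in a regular expression to match two consecutive characters from a character class (the class below leaves out the backslash and the period, so those two test cases are not reported):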
>>> regex = re.compile(r'[-/()!"+,\'&]{2}')
>>> [s for s in test_cases if regex.search(s)]
["The quick '''brown fox",
'The &&quick brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!']

What about a regular expression? That could also help to find 2 or more consecutive punctuation symbols.
A regex like /([\\\-\/\(\)!"+,'&]{2,})/g
{2,} stands for two or more
/g stands for global search (don't stop at the first match); Python has no such flag, so use re.findall or re.finditer instead
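In Python the role of the global flag is played by re.finditer (or re.findall), which keeps scanning after the first match. A small sketch, assuming the same-character pattern and the punctuation list from the question; the helper name find_runs is made up for this example:

```python
import re

# Runs of one repeated punctuation character, length 2 or more.
RUN = re.compile(r'''([-/\\()!"+,&'.])\1+''')

def find_runs(text):
    """Return every run of a repeated punctuation character found in text."""
    # finditer + group(0) yields the whole run; findall would return only
    # the captured single character because the pattern contains a group.
    return [m.group(0) for m in RUN.finditer(text)]

print(find_runs('The ""quick"" brown fox'))
print(find_runs('The -quick brown// fox'))
print(find_runs('The quick-brown fox'))
```

The first call reports two separate runs of double quotes, which a single search() call would not show.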

Thanks to @Heiko Oberdiek, here is the exact code I am using that solves the problem (I added . to the punctuation list):
punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
x = 1
for t in test_cases:
    match = punctuation.search(t)
    if match:
        print("{0:2} {1:24} => {2}".format(x, t, match.group(1)))
        x += 1
This accurately covers all of my test cases:
1 The quick '''brown fox => '''
2 The &&quick brown fox => &&
3 The quick\\brown fox => \\
4 The -quick brown// fox => //
5 The quick--brown fox => --
6 The (quick brown) fox,,, => ,,,
7 The quick ++brown fox => ++
8 The ""quick"" brown fox => ""
9 The quick,, brown fox => ,,
10 The quick brown fox... => ...
11 The ((quick brown fox => ((
12 The quick brown)) fox => ))
13 The quick brown fox!!! => !!!

Related

Python String split using a regex

We want to split a multi-line string, for example:
|---------------------------------------------Title1(a)---------------------------------------------
Content goes here, the quick brown fox jumps over the lazy dog
|---------------------------------------------Title1(b)----------------------------------------------
Content goes here, the quick brown fox jumps over the lazy dog
Here's our Python code that splits using a regex:
import re
str1 = "|---------------------------------------------Title1(a)---------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"" \
"|---------------------------------------------Title1(b)----------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"|"
print(str1)
str2 = re.split(r"\|---------------------------------------------", str1)
print(str2)
We want the output to include only
str2[0]:
Content goes here, the quick brown fox jumps over the lazy dog
str2[1]:
Content goes here, the quick brown fox jumps over the lazy dog
What's the proper regex to use, or is there any other way to split using the format above?
Instead of using split, you can match the lines and capture the part that you want in a group.
\|-{2,}[^-]+-{2,}([^-].*?)(?=\|)
Explanation
\| Match |
-{2,} Match 2 or more -
[^-]+ Match 1+ times any char except -
-{2,} Match 2 or more -
( Capture group 1
[^-].*? Match any char except -, then any char, as few as possible
) Close group 1
(?=\|) Positive lookahead, assert a | to the right
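The breakdown above can also be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. A sketch with a shortened, made-up input (the dashes are abbreviated and the section bodies are placeholders):

```python
import re

header_then_body = re.compile(r"""
    \|          # a literal |
    -{2,}       # two or more dashes
    [^-]+       # the title: one or more chars except -
    -{2,}       # two or more dashes
    ([^-].*?)   # group 1: one non-dash char, then as few chars as possible
    (?=\|)      # lookahead: a | must follow (it is not consumed)
""", re.VERBOSE)

text = "|--Title1(a)--first body|--Title1(b)--second body|"  # shortened sample

bodies = header_then_body.findall(text)
print(bodies)
```

Because the lookahead does not consume the |, that same character can start the next match.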
Example
import re
regex = r"\|-{2,}[^-]+-{2,}([^-].*?)(?=\|)"
str1 = "|---------------------------------------------Title1(a)---------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"" \
"|---------------------------------------------Title1(b)----------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"|"
str2 = re.findall(regex, str1)
print(str2[0])
print(str2[1])
Output
Content goes here, the quick brown fox jumps over the lazy dog
Content goes here, the quick brown fox jumps over the lazy dog
If Title should be part of the line, another option is to make the match a bit more precise.
\|-+Title\d+\([a-z]\)-+(.+?)(?=\||$)
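The alternative (?=\||$) in this pattern lets the last section end at the end of the string when there is no closing |. A short sketch with a shortened, made-up input (the dashes are abbreviated and the section texts are placeholders):

```python
import re

title_pattern = re.compile(r"\|-+Title\d+\([a-z]\)-+(.+?)(?=\||$)")

# Shortened sample; note that there is no trailing "|" after the last section.
text = "|---Title1(a)---first section|---Title1(b)---last section"

sections = title_pattern.findall(text)
print(sections)
```

Without the |$ branch the second section would not be found in this input, since nothing but the end of the string follows it.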

Strip all whitespace in text except actual single space [duplicate]

How would I do this in python3?
The quick brown fox jumps over the lazy dog.
to...
The quick brown fox jumps over the lazy dog.
Where the above quote is a string.
You don't need to use a regex here. You can achieve what you want like this:
a = "The quick brown fox jumps over the lazy dog."
final = " ".join(a.split())
print(final)
Output:
The quick brown fox jumps over the lazy dog.
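To make the effect visible, here is a small sketch with a hypothetical input that contains runs of spaces, a tab and a newline. It also shows the regex equivalent for comparison; both names (joined, substituted) are made up for this example:

```python
import re

# Hypothetical input with repeated spaces, a tab and a newline.
messy = "The   quick brown\tfox  jumps over\n the lazy dog."

# split() without arguments splits on any whitespace run and drops empty pieces.
joined = " ".join(messy.split())

# Regex variant: collapse each whitespace run to one space, then trim the ends.
substituted = re.sub(r"\s+", " ", messy).strip()

print(joined)
```

Both variants give the same result here; the split/join version needs no import.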

How can I reverse parts of sentence in python?

I have a sentence, let's say:
The quick brown fox jumps over the lazy dog
I want to create a function that takes 2 arguments, a sentence and a list of things to ignore. And it returns that sentence with the reversed words, however it should ignore the stuff I pass to it in a second argument. This is what I have at the moment:
def main(sentence, ignores):
    return ' '.join(word[::-1] if word not in ignores else word for word in sentence.split())
But this will only work if I pass a second list like so:
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
However, I want to pass a list like this:
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
expected result:
ehT quick brown xof spmuj revo eht lazy dog
So basically the second argument (the list) will have parts of the sentence that should be ignored. Not just single words.
Do I have to use regexp for this? I was trying to avoid it...
I'm the first person to recommend avoiding regular expressions, but in this case, the complexity of doing without is greater than the complexity added by using them:
import re
def main(sentence, ignores):
    # Dedup and allow fast lookup for determining whether to reverse a component
    ignores = frozenset(ignores)
    # Make a pattern that will prefer matching the ignore phrases, but
    # otherwise matches each space and non-space run (so nothing is dropped)
    # Alternations match the first pattern by preference, so you'll match
    # the ignores phrases if possible, and general space/non-space patterns
    # otherwise
    pat = r'|'.join(map(re.escape, ignores)) + r'|\S+|\s+'
    # Returns the chopped up pieces (space and non-space runs, but ignore phrases stay together)
    parts = re.findall(pat, sentence)
    # Reverse everything not found in ignores and then put it all back together
    return ''.join(p if p in ignores else p[::-1] for p in parts)
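A variation on the same alternation idea, as a sketch only: the name reverse_words and the longest-phrase-first ordering are my assumptions, not part of the answer above. Sorting makes the preference between overlapping phrases deterministic, and building the alternation from a list keeps it free of an empty alternative when no phrases are given:

```python
import re

def reverse_words(sentence, ignores):
    # Longest phrases first, so a longer phrase wins over a shorter one
    # that starts at the same position (assumed to be the desired behaviour).
    phrases = sorted(set(ignores), key=len, reverse=True)
    # Ignore phrases first, then runs of non-space and runs of space.
    pattern = '|'.join([re.escape(p) for p in phrases] + [r'\S+', r'\s+'])
    keep = set(phrases)
    # Keep ignore phrases as they are, reverse every other piece.
    return ''.join(p if p in keep else p[::-1] for p in re.findall(pattern, sentence))

sentence = 'The quick brown fox jumps over the lazy dog'
print(reverse_words(sentence, ['quick brown', 'lazy dog']))
print(reverse_words(sentence, []))
```

Note that an ignore phrase is matched as plain text at the start of a word, so 'quick' would also protect the first five letters of 'quickly'.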
Just another idea, reverse every word and then reverse the ignores right back:
>>> from functools import reduce
>>> def main(sentence, ignores):
...     def r(s):
...         return ' '.join(w[::-1] for w in s.split())
...     return reduce(lambda s, i: s.replace(r(i), i), ignores, r(sentence))
...
>>> main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog'])
'ehT quick brown xof spmuj revo eht lazy dog'
Instead of placeholders, why not just initially reverse any phrase that you want to be around the right way, then reverse the whole string:
def main(sentence, ignores):
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())

print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
returns:
ehT quick nworb xof spmuj revo eht lazy god
ehT quick brown xof spmuj revo eht lazy dog
I have attempted to solve the issue of overlapping ignore phrases, e.g. ['brown fox', 'quick brown'], raised by @PadraicCunningham.
There's obviously a lot more looping and code feels less pythonic so I'd be interested in feedback on how to improve this.
import re
def _span_combiner(spans):
    """replace overlapping spans with encompassing single span"""
    for i, s in enumerate(spans):
        start = s[0]
        end = s[1]
        for x in spans[i:]:
            if x[0] < end:
                end = x[1]
        yield (start, end)

def main(sentence, ignores):
    # spans holds the start and end indices of each ignore phrase in order of occurrence
    spans = sorted(
        [[m.span() for m in re.finditer(p, sentence)][0] for p in ignores if p in sentence]
    )
    # replace overlapping indices with a single set of indices encompassing the overlapped range
    spans = [s for s in _span_combiner(spans)]
    # recreate ignore list by slicing sentence with combined spans
    ignores = [sentence[s[0]:s[1]] for s in spans]
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())

if __name__ == "__main__":
    print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
    print(main('The quick brown fox jumps over the lazy dog', ['brown fox', 'lazy dog']))
    print(main('The quick brown fox jumps over the lazy dog', ['nonexistent' ,'brown fox', 'quick brown']))
    print(main('The quick brown fox jumps over the brown fox', ['brown fox', 'quick brown']))
results:
ehT quick nworb xof spmuj revo eht lazy god
ehT kciuq brown fox spmuj revo eht lazy dog
ehT quick brown fox spmuj revo eht yzal god
ehT quick brown fox spmuj revo eht brown fox
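For comparison, here is the usual sort-and-sweep way of merging overlapping spans, as a sketch; merge_spans is a hypothetical helper and not part of the code above. It uses the same strict comparison, so spans that merely touch stay separate, and it emits each merged span only once:

```python
def merge_spans(spans):
    """Merge overlapping (start, end) spans into non-overlapping ones."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# (4, 15) is 'quick brown' and (10, 19) is 'brown fox' in the example sentence.
print(merge_spans([(10, 19), (4, 15)]))
print(merge_spans([(10, 19), (35, 43)]))
```

The first call collapses the two overlapping spans into one span covering 'quick brown fox'; the second leaves the two separate spans unchanged.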

Match optional string when random strings present between groups

I am very new to Python and am having trouble matching optional strings when there can be any number of strings between the groups. Here is an example of what I am looking for:
'The quick brown fox jumps over the lazy dog'
I want the word following 'brown' and if the word 'lazy' is present I want the word following it as well, i.e:
'The quick brown fox jumps over the lazy dog' --> ('fox', 'dog')
'The quick brown fox' --> ('fox', '')
'The quick brown fox dfjdnjcnjdn vvvv lazy mouse' --> ('fox', 'mouse')
'The quick brown fox lazy dog' --> ('fox', 'dog')
Here is what I tried, but it is not working
re.findall(r'brown (\S+)(.*?)(lazy )?(\S+)?', str)
What am I doing wrong and how to fix this?
You could use the following to get the words you're looking for:
brown (\S+)(?:.*lazy (\S+))?
Which would give a list of tuples, with the empty string if lazy is not present.
>>> import re
>>> s = """The quick brown fox jumps over the lazy dog
... The quick brown fox
... The quick brown fox dfjdnjcnjdn vvvv lazy mouse
... The quick brown fox lazy dog"""
>>> re.findall(r'brown (\S+)(?:.*lazy (\S+))?', s)
[('fox', 'dog'), ('fox', ''), ('fox', 'mouse'), ('fox', 'dog')]
>>>
(?: ... ) is used to make groups that won't get captured, so what's inside won't necessarily get into the tuple/list with re.findall unless it is itself within capture group(s).
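When each sentence is processed on its own with re.search instead of re.findall, an unmatched optional group comes back as None rather than as an empty string. A small sketch of how to get the ('fox', '') form in that case; the function name words_after is made up for this example:

```python
import re

PAIR = re.compile(r'brown (\S+)(?:.*lazy (\S+))?')

def words_after(line):
    """Return (word after 'brown', word after 'lazy' or ''), or None if 'brown' is missing."""
    m = PAIR.search(line)
    # groups(default='') replaces the None of an unmatched optional group.
    return m.groups(default='') if m else None

print(words_after('The quick brown fox jumps over the lazy dog'))
print(words_after('The quick brown fox'))
```

A line without 'brown' gives None here, which the caller has to handle separately.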
You can use a pattern such as:
(?:brown|lazy)\s(\S+)
Below is a breakdown of what it matches:
(?:brown|lazy) # The words 'brown' or 'lazy'
\s # A whitespace character
(\S+) # One or more non-whitespace characters
And here is a demonstration:
>>> import re
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox jumps over the lazy dog')
['fox', 'dog']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox')
['fox']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox dfjdnjcnjdn vvvv lazy mouse')
['fox', 'mouse']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox lazy dog')
['fox', 'dog']
>>>

Python replace spaces in string iteratively [closed]

I'm trying to replace spaces with hyphens one at a time in each possible position in python. For example the man said hi should produce a list of all the possible hyphen positions, including multiple hyphens:
the-man said hi
the man-said hi
the man said-hi
the-man said-hi
the-man-said hi
the man-said-hi
the-man-said-hi
The length of the strings varies in number of spaces, so it can't be a fix for just 3 spaces. I've been experimenting with re.search and re.sub in a while loop, but haven't found a nice way yet.
Use itertools.product() to produce all space-and-dash combinations, then recombine your string with those:
from itertools import product
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in product(' -', repeat=len(words) - 1):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
The last line zips the words together with the dashes and spaces (adding in an empty string at the end to make up the pairs), then flattens that and joins them into a single string.
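The zip-and-flatten step can be followed by hand for a single combination. The sketch below spells it out with intermediate variables (pairs, flat and result are names chosen for this illustration):

```python
words = 'the man said hi'.split()
combo = ('-', ' ', '-')  # one tuple as produced by product(' -', repeat=3)

# Pair every word with the separator that follows it; the last word gets ''.
pairs = list(zip(words, combo + ('',)))

# Flatten the pairs into one sequence of strings and join them.
flat = [piece for pair in pairs for piece in pair]
result = ''.join(flat)

print(pairs)
print(result)
```

The empty string is needed because zip() stops at the shorter input; without it the last word would be dropped.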
Demo:
>>> for combo in dashed_combos('the man said hi'):
...     print(combo)
...
the man said hi
the man said-hi
the man-said hi
the man-said-hi
the-man said hi
the-man said-hi
the-man-said hi
the-man-said-hi
You can always skip the first iteration of that loop (with only spaces) with itertools.islice():
from itertools import product, islice
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in islice(product(' -', repeat=len(words) - 1), 1, None):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
All this is extremely memory efficient; you can easily handle inputs with hundreds of words, provided you don't try and store all possible combinations in memory at once.
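To put a number on that: with n words there are n - 1 gaps and therefore 2 ** (n - 1) combinations, so the count doubles with every extra word. A sketch using the first version of the generator (the one that includes the all-spaces result), taking only a few items lazily:

```python
from itertools import islice, product

def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in product(' -', repeat=len(words) - 1):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)

sentence = 'the quick brown fox jumped over the lazy dog'
gaps = len(sentence.split()) - 1

# Counting consumes the generator one item at a time; nothing is stored.
total = sum(1 for _ in dashed_combos(sentence))

# Only the first three combinations are ever built here.
first_three = list(islice(dashed_combos(sentence), 3))

print(gaps, total)
print(first_three)
```

Iterating is cheap in memory, but the running time still grows exponentially with the number of words, so very long inputs can only be sampled, not exhausted.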
Slightly longer demo:
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 10):
...     print(combo)
...
the quick brown fox jumped over the lazy-dog
the quick brown fox jumped over the-lazy dog
the quick brown fox jumped over the-lazy-dog
the quick brown fox jumped over-the lazy dog
the quick brown fox jumped over-the lazy-dog
the quick brown fox jumped over-the-lazy dog
the quick brown fox jumped over-the-lazy-dog
the quick brown fox jumped-over the lazy dog
the quick brown fox jumped-over the lazy-dog
the quick brown fox jumped-over the-lazy dog
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 200, 210):
...     print(combo)
...
the-quick-brown fox jumped-over the lazy-dog
the-quick-brown fox jumped-over the-lazy dog
the-quick-brown fox jumped-over the-lazy-dog
the-quick-brown fox jumped-over-the lazy dog
the-quick-brown fox jumped-over-the lazy-dog
the-quick-brown fox jumped-over-the-lazy dog
the-quick-brown fox jumped-over-the-lazy-dog
the-quick-brown fox-jumped over the lazy dog
the-quick-brown fox-jumped over the lazy-dog
the-quick-brown fox-jumped over the-lazy dog
