How can I reverse parts of sentence in python?

I have a sentence, let's say:
The quick brown fox jumps over the lazy dog
I want to create a function that takes 2 arguments, a sentence and a list of things to ignore. And it returns that sentence with the reversed words, however it should ignore the stuff I pass to it in a second argument. This is what I have at the moment:
def main(sentence, ignores):
    return ' '.join(word[::-1] if word not in ignores else word for word in sentence.split())
But this will only work if I pass a second list like so:
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
However, I want to pass a list like this:
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
expected result:
ehT quick brown xof spmuj revo eht lazy dog
So basically the second argument (the list) will have parts of the sentence that should be ignored. Not just single words.
Do I have to use regexp for this? I was trying to avoid it...

I'm the first person to recommend avoiding regular expressions, but in this case, the complexity of doing without is greater than the complexity added by using them:
import re
def main(sentence, ignores):
    # Dedup and allow fast lookup for determining whether to reverse a component
    ignores = frozenset(ignores)
    # Make a pattern that will prefer matching the ignore phrases, but
    # otherwise matches each space and non-space run (so nothing is dropped)
    # Alternations match the first pattern by preference, so you'll match
    # the ignore phrases if possible, and general space/non-space patterns
    # otherwise
    pat = r'|'.join(map(re.escape, ignores)) + r'|\S+|\s+'
    # Returns the chopped up pieces (space and non-space runs, but ignore phrases stay together)
    parts = re.findall(pat, sentence)
    # Reverse everything not found in ignores and then put it all back together
    return ''.join(p if p in ignores else p[::-1] for p in parts)
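
One caveat with the pattern above: a frozenset has no defined order, so when one ignore phrase is a prefix of another (say 'quick' and 'quick brown') either may be tried first, and an empty ignores list leaves a leading empty alternative that only ever matches empty strings. Below is a minimal sketch of a variant that sorts the phrases longest-first and builds the alternation from a list. The name reverse_except is made up for illustration and is not part of the answer above.

```python
import re

def reverse_except(sentence, ignores):
    ignore_set = frozenset(ignores)
    # Longest phrases first, so 'quick brown' is tried before 'quick'
    phrases = sorted(ignore_set, key=len, reverse=True)
    # Build the alternation from a list so an empty ignores list
    # does not leave an empty alternative at the front
    pattern = '|'.join([re.escape(p) for p in phrases] + [r'\S+', r'\s+'])
    parts = re.findall(pattern, sentence)
    # Reverse every piece that is not an ignore phrase (whitespace is unchanged
    # by reversal as long as each run is a single repeated character)
    return ''.join(p if p in ignore_set else p[::-1] for p in parts)

print(reverse_except('The quick brown fox jumps over the lazy dog',
                     ['quick brown', 'lazy dog']))
# ehT quick brown xof spmuj revo eht lazy dog
```

Sorting by length is an assumption about which phrase should win when two overlap at the same starting position; a different tie-break may suit other inputs.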

Just another idea, reverse every word and then reverse the ignores right back:
>>> from functools import reduce
>>> def main(sentence, ignores):
...     def r(s):
...         return ' '.join(w[::-1] for w in s.split())
...     return reduce(lambda s, i: s.replace(r(i), i), ignores, r(sentence))
...
>>> main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog'])
'ehT quick brown xof spmuj revo eht lazy dog'

Instead of placeholders, why not first reverse each phrase that you want to end up the right way round, then reverse every word in the whole string:
def main(sentence, ignores):
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
returns:
ehT quick nworb xof spmuj revo eht lazy god
ehT quick brown xof spmuj revo eht lazy dog

I have attempted to solve the issue of overlapping ignore phrases, e.g. ['brown fox', 'quick brown'], raised by @PadraicCunningham.
There's obviously a lot more looping and the code feels less Pythonic, so I'd be interested in feedback on how to improve this.
import re
def _span_combiner(spans):
    """replace overlapping spans with encompassing single span"""
    for i, s in enumerate(spans):
        start = s[0]
        end = s[1]
        for x in spans[i:]:
            if x[0] < end:
                end = x[1]
        yield (start, end)
def main(sentence, ignores):
    # spans holds the start and end indices of the first match of each ignore phrase, in order of occurrence
    # (re.escape makes phrases containing regex metacharacters match literally)
    spans = sorted(
        [[m.span() for m in re.finditer(re.escape(p), sentence)][0] for p in ignores if p in sentence]
    )
    # replace overlapping indices with single set of indices encompassing overlapped range
    spans = [s for s in _span_combiner(spans)]
    # recreate ignore list by slicing sentence with combined spans
    ignores = [sentence[s[0]:s[1]] for s in spans]
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())
if __name__ == "__main__":
    print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
    print(main('The quick brown fox jumps over the lazy dog', ['brown fox', 'lazy dog']))
    print(main('The quick brown fox jumps over the lazy dog', ['nonexistent', 'brown fox', 'quick brown']))
    print(main('The quick brown fox jumps over the brown fox', ['brown fox', 'quick brown']))
results:
ehT quick nworb xof spmuj revo eht lazy god
ehT kciuq brown fox spmuj revo eht lazy dog
ehT quick brown fox spmuj revo eht yzal god
ehT quick brown fox spmuj revo eht brown fox
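
As one possible answer to the request for feedback, here is a sketch that merges the spans in a single pass over the sorted list and then rebuilds the sentence from slices, so no str.replace round trip is needed. The names merge_spans, reverse_words and reverse_outside are hypothetical, and unlike the code above this version looks at every occurrence of each phrase, not only the first.

```python
import re

def merge_spans(spans):
    """Merge overlapping (start, end) spans in one pass over the sorted list."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]

def reverse_words(text):
    """Reverse each run of non-whitespace, leaving the whitespace untouched."""
    return re.sub(r'\S+', lambda m: m.group()[::-1], text)

def reverse_outside(sentence, ignores):
    # Every occurrence of every non-empty ignore phrase, matched literally
    spans = [m.span() for p in ignores if p
             for m in re.finditer(re.escape(p), sentence)]
    pieces, pos = [], 0
    for start, end in merge_spans(spans):
        pieces.append(reverse_words(sentence[pos:start]))  # outside a span: reverse
        pieces.append(sentence[start:end])                 # inside a span: keep
        pos = end
    pieces.append(reverse_words(sentence[pos:]))
    return ''.join(pieces)

print(reverse_outside('The quick brown fox jumps over the brown fox',
                      ['brown fox', 'quick brown']))
# ehT quick brown fox spmuj revo eht brown fox
```

Working from indices rather than replacing text means a phrase can never be matched again after part of the sentence has already been rewritten.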

Related

How to extract the text between a word and its next occurrence?

I have the following sample text:
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\item Item 2
\end{itemize}
\end{document}'''
What I'm trying to accomplish is to create a regex expression which will extract the following lines from mystr:
['This is introduction paragraph','This is non-introduction paragraph',' This is sample section paragraph\n \begin{itemize}\n\item Item 1\n\item Item 2\n\end{itemize}']

If for some reason you need to use a regex (perhaps the splitting string is more involved than just "a"), the re module has a split function too:
import re
str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"
print(re.split(r'\s?\ba\b\s?',str_))
# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
EDIT: expanded answer with the new information you provided...
After your edit, which describes the problem better and includes text that looks like LaTeX, I think you need to extract the lines that do not start with a \, since those are the LaTeX commands. In other words, you need the lines that contain only text. Try the following, still using regular expressions:
import re
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\end{document}'''
pattern = r"^[^\\]*\n"
matches = re.findall(pattern, mystr, flags=re.M)
print(matches)
# ['This is introduction paragraph\n', 'This is non-introduction paragraph\n', 'This is sample section paragraph\n']
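
The pattern above drops the itemize block that the question's expected output keeps. A rough sketch of another route, assuming every chunk of interest starts right after a \section{...} line and the last one ends at \end{document}: cut out the document body, then split it on the section commands. The helper name section_bodies is made up for this example.

```python
import re

mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\item Item 2
\end{itemize}
\end{document}'''

def section_bodies(tex):
    # Keep only what lies between \begin{document} and \end{document}
    body = tex.split(r'\begin{document}', 1)[1].rsplit(r'\end{document}', 1)[0]
    # Split on each \section{...}; the first chunk precedes the first section
    chunks = re.split(r'\\section\{[^}]*\}', body)
    return [chunk.strip() for chunk in chunks[1:]]

paragraphs = section_bodies(mystr)
for paragraph in paragraphs:
    print(repr(paragraph))
```

This assumes section titles contain no nested braces; a title such as \section{\LaTeX{} notes} would need a more careful pattern or a real LaTeX parser.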

You can use the split method from str:
my_string = "a quick brown fox jumps over a lazy dog than a quick elephant"
word = "a "
my_string.split(word)
Results in:
['', 'quick brown fox jumps over ', 'lazy dog than ', 'quick elephant']
Note: Don't use str as a variable name. It is not a keyword, but it is the name of a built-in type in Python, and assigning to it shadows that built-in.

Python - removing some punctuation from text

I want Python to remove only some punctuation from a string, let's say I want to remove all the punctuation except '#'
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Here the output is
The quick brown fox like totally jumped man
But what I want is something like this
The quick brown fox like totally jumped #man
Is there a way to selectively remove punctuation from a text leaving out the punctuation that we want in the text intact?

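string.punctuation contains all the punctuation characters. Remove # from it, then replace every remaining punctuation character with ''. (re.escape keeps characters such as ] and \ from being read as regex syntax inside the character class.)
>>> import re
>>> import string
>>> a = string.punctuation.replace('#','')
>>> re.sub(r'[{}]'.format(re.escape(a)),'','The quick brown fox, like, totally jumped, #man!')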
'The quick brown fox like totally jumped #man'

Just remove the character you don't want to touch from the replacement string:
import string
remove = dict.fromkeys(map(ord, '\n' + string.punctuation.replace('#','')))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Also note that I changed '\n ' to '\n', as the former will remove spaces from your string.
Result:
The quick brown fox like totally jumped #man
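
The same idea can be wrapped in a small helper so the characters to keep become a parameter. This is only a sketch for Python 3; strip_punctuation and its keep argument are names invented here, while str.maketrans with three arguments is the standard way to build a deletion table.

```python
import string

def strip_punctuation(text, keep=''):
    """Remove ASCII punctuation from text, except the characters in keep."""
    to_remove = ''.join(ch for ch in string.punctuation if ch not in keep)
    # The third argument of str.maketrans lists characters to delete
    return text.translate(str.maketrans('', '', to_remove))

sample = 'The quick brown fox, like, totally jumped, #man!'
print(strip_punctuation(sample, keep='#'))
# The quick brown fox like totally jumped #man
```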

Python regex to find multiple consecutive punctuations

I am streaming plain text records via MapReduce and need to check each plain text record for 2 or more consecutive punctuation symbols. The 12 symbols I need to check for are: -/\()!"+,'&..
I have tried translating this punctuation list into an array like this:
punctuation = [r'-', r'/', r'\\', r'\(', r'\)', r'!', r'"', r'\+', r',', r"'", r'&', r'\.']
I can find individual characters with nested for loops, for example:
for t in test_cases:
    print(t)
    for p in punctuation:
        print(p)
        if re.search(p, t):
            print('found a match!', p, t)
        else:
            print('no match')
However, the single backslash character is not found when I test this and I don't know how to get only results that are 2 or more consecutive occurrences in a row. I've read that I need to use the + symbol, but don't know the correct syntax to use this.
Here are some test cases:
The quick '''brown fox
The &&quick brown fox
The quick\brown fox
The quick\\brown fox
The -quick brown// fox
The quick--brown fox
The (quick brown) fox,,,
The quick ++brown fox
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox
The quick,, brown fox
The quick brown fox...
The quick-brown fox
The ((quick brown fox
The quick brown)) fox
The quick brown fox!!!
The 'quick' brown fox
Which when translated into a Pythonic list looks like this:
test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
How do I use Python regex to identify and report all matches where the punctuation symbol appears 2 or more times in a row?

The punctuation characters can be put into a character class in square brackets.
Then it depends, whether the series of two or more punctuation characters consists of any punctuation character or whether the punctuation characters are the same.
In the first case curly braces can be appended to specify the number of minimum (2) and maximum repetitions. The latter is unbounded and left empty:
[...]{2,} # min. 2 or more
If only repetitions of the same character need to be found, then the first matched punctuation character is put into a group. A back reference to that group (= same character) then follows, repeated one or more times:
([...])\1+
The back reference \1 means the first group in the expression. The groups, represented by the opening parentheses are numbered from left to right.
The next issue is escaping. There are escaping rules for Python strings, and additional escaping is needed in the regular expression. The character class does not require much escaping, but the backslash must be doubled. Thus the following example quadruples the backslash: one doubling because of the string, the second because of the regular expression.
Raw strings r'...' are useful for patterns, but here both the single and double quotation marks are needed.
>>> import re
>>> test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
...     match = pattern_same_punctuation.search(t)
...     if match:
...         print("{:24} => {}".format(t, match.group(1)))
...     else:
...         print(t)
...
The quick '''brown fox => '''
The &&quick brown fox => &&
The quick\brown fox
The quick\\brown fox => \\
The -quick brown// fox => //
The quick--brown fox => --
The (quick brown) fox,,, => ,,,
The quick ++brown fox => ++
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox => ""
The quick,, brown fox => ,,
The quick brown fox... => ...
The quick-brown fox
The ((quick brown fox => ((
The quick brown)) fox => ))
The quick brown fox!!! => !!!
The 'quick' brown fox
>>>
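
The session above only exercises the same-character pattern, and search() stops at the first hit. The following sketch puts the two patterns side by side and uses finditer() to report every run in a string. The sample text and the names PUNCT, any_run and same_run are made up for illustration; a raw triple-quoted string is used so that both quote characters fit without extra string escaping.

```python
import re

# Raw triple-quoted string: only the regex-level doubling of the backslash is needed
PUNCT = r"""[-/\\()!"+,&'.]"""

any_run = re.compile(PUNCT + r'{2,}')            # two or more of any listed symbol
same_run = re.compile(r'(' + PUNCT + r')\1+')    # one symbol repeated

text = 'The (quick brown) fox,,, said "hi"!!'

print([m.group(0) for m in any_run.finditer(text)])
# [',,,', '"!!']
print([m.group(0) for m in same_run.finditer(text)])
# [',,,', '!!']
```

The mixed run "!! counts for the first pattern but only its !! tail counts for the second, which is the difference between the two cases described earlier.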

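You can use {2} in a regular expression to match two consecutive characters from a character class (the class below leaves out the backslash and the period, so the \\ and ... test cases are not reported):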
>>> regex = re.compile(r'[-/()!"+,\'&]{2}')
>>> [s for s in test_cases if regex.search(s)]
["The quick '''brown fox",
'The &&quick brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!']

What about a regular expression? That could also help to find 2 or more consecutive punctuation symbols.
A regex like /([\\\-\/\(\)!"+,'&]{2,})/g
{2,} stands for two or more
g stands for global search: don't stop at the first match (Python has no such flag; re.findall or re.finditer return every match)

Thanks to @Heiko Oberdiek, here is the exact code I am using that solves the problem (I added . to the punctuation list):
import re
punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
x = 1
for t in test_cases:
    match = punctuation.search(t)
    if match:
        print("{0:2} {1:24} => {2}".format(x, t, match.group(1)))
        x += 1
This accurately covers all of my test cases:
1 The quick '''brown fox => '''
2 The &&quick brown fox => &&
3 The quick\\brown fox => \\
4 The -quick brown// fox => //
5 The quick--brown fox => --
6 The (quick brown) fox,,, => ,,,
7 The quick ++brown fox => ++
8 The ""quick"" brown fox => ""
9 The quick,, brown fox => ,,
10 The quick brown fox... => ...
11 The ((quick brown fox => ((
12 The quick brown)) fox => ))
13 The quick brown fox!!! => !!!

Match optional string when random strings present between groups

I am very new to Python and I am having trouble matching optional strings when there can be any number of strings between the groups. Here is an example of what I am looking for:
'The quick brown fox jumps over the lazy dog'
I want the word following 'brown' and if the word 'lazy' is present I want the word following it as well, i.e:
'The quick brown fox jumps over the lazy dog' --> ('fox', 'dog')
'The quick brown fox' --> ('fox', '')
'The quick brown fox dfjdnjcnjdn vvvv lazy mouse' --> ('fox', 'mouse')
'The quick brown fox lazy dog' --> ('fox', 'dog')
Here is what I tried, but it is not working
re.findall(r'brown (\S+)(.*?)(lazy )?(\S+)?', str)
What am I doing wrong and how to fix this?

You could use the following to get the words you're looking for:
brown (\S+)(?:.*lazy (\S+))?
Which would give a list of tuples, with the empty string if lazy is not present.
>>> import re
>>> s = """The quick brown fox jumps over the lazy dog
... The quick brown fox
... The quick brown fox dfjdnjcnjdn vvvv lazy mouse
... The quick brown fox lazy dog"""
>>> re.findall(r'brown (\S+)(?:.*lazy (\S+))?', s)
[('fox', 'dog'), ('fox', ''), ('fox', 'mouse'), ('fox', 'dog')]
>>>
(?: ... ) is used to make groups that won't get captured, so what's inside won't necessarily get into the tuple/list with re.findall unless it is itself within capture group(s).
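
When each sentence is handled on its own, the same pattern can be used with re.search instead of re.findall. A small sketch, with words_after as a hypothetical helper name: Match.groups(default='') fills in the empty string for the optional group when lazy is absent.

```python
import re

PATTERN = re.compile(r'brown (\S+)(?:.*lazy (\S+))?')

def words_after(line):
    """Return (word after 'brown', word after 'lazy' or ''), or None if no 'brown'."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # An unmatched optional group is None by default; ask for '' instead
    return match.groups(default='')

print(words_after('The quick brown fox jumps over the lazy dog'))  # ('fox', 'dog')
print(words_after('The quick brown fox'))                          # ('fox', '')
```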

You can use a pattern such as:
(?:brown|lazy)\s(\S+)
Below is a breakdown of what it matches:
(?:brown|lazy) # The words 'brown' or 'lazy'
\s # A whitespace character
(\S+) # One or more non-whitespace characters
And here is a demonstration:
>>> import re
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox jumps over the lazy dog')
['fox', 'dog']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox')
['fox']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox dfjdnjcnjdn vvvv lazy mouse')
['fox', 'mouse']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox lazy dog')
['fox', 'dog']
>>>

Python replace spaces in string iteratively [closed]

I'm trying to replace spaces with hyphens one at a time in each possible position in python. For example the man said hi should produce a list of all the possible hyphen positions, including multiple hyphens:
the-man said hi
the man-said hi
the man said-hi
the-man said-hi
the-man-said hi
the man-said-hi
the-man-said-hi
The length of the strings varies in number of spaces, so it can't be a fix for just 3 spaces. I've been experimenting with re.search and re.sub in a while loop, but haven't found a nice way yet.

Use itertools.product() to produce all space-and-dash combinations, then recombine your string with those:
from itertools import product
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in product(' -', repeat=len(words) - 1):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
The last line zips the words together with the dashes and spaces (adding in an empty string at the end to make up the pairs), then flattens that and joins them into a single string.
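
To make the zip-and-flatten step easier to follow, here is the same logic pulled out into a helper and run on a three-word input. The name interleave is made up for this sketch; the behaviour is meant to mirror the generator expression described above.

```python
from itertools import product

def interleave(words, separators):
    """Join words using one separator per gap (len(separators) == len(words) - 1)."""
    # Pad the separators with '' so the last word also has a partner in zip()
    pairs = zip(words, tuple(separators) + ('',))
    # Flatten [(word, sep), ...] into word, sep, word, sep, ... and join
    return ''.join(piece for pair in pairs for piece in pair)

words = 'the man said'.split()
for combo in product(' -', repeat=len(words) - 1):
    print(combo, '->', interleave(words, combo))
# (' ', ' ') -> the man said
# (' ', '-') -> the man-said
# ('-', ' ') -> the-man said
# ('-', '-') -> the-man-said
```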
Demo:
>>> for combo in dashed_combos('the man said hi'):
...     print(combo)
...
the man said hi
the man said-hi
the man-said hi
the man-said-hi
the-man said hi
the-man said-hi
the-man-said hi
the-man-said-hi
You can always skip the first iteration of that loop (with only spaces) with itertools.islice():
from itertools import product, islice
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in islice(product(' -', repeat=len(words) - 1), 1, None):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
All this is extremely memory efficient; you can easily handle inputs with hundreds of words, provided you don't try and store all possible combinations in memory at once.
Slightly longer demo:
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 10):
...     print(combo)
...
the quick brown fox jumped over the lazy-dog
the quick brown fox jumped over the-lazy dog
the quick brown fox jumped over the-lazy-dog
the quick brown fox jumped over-the lazy dog
the quick brown fox jumped over-the lazy-dog
the quick brown fox jumped over-the-lazy dog
the quick brown fox jumped over-the-lazy-dog
the quick brown fox jumped-over the lazy dog
the quick brown fox jumped-over the lazy-dog
the quick brown fox jumped-over the-lazy dog
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 200, 210):
...     print(combo)
...
the-quick-brown fox jumped-over the lazy-dog
the-quick-brown fox jumped-over the-lazy dog
the-quick-brown fox jumped-over the-lazy-dog
the-quick-brown fox jumped-over-the lazy dog
the-quick-brown fox jumped-over-the lazy-dog
the-quick-brown fox jumped-over-the-lazy dog
the-quick-brown fox jumped-over-the-lazy-dog
the-quick-brown fox-jumped over the lazy dog
the-quick-brown fox-jumped over the lazy-dog
the-quick-brown fox-jumped over the-lazy dog
