I have a list of words in a txt file, each one in a line with its definition next to them. However, the definition sometimes gives a sentence using the word. I want to replace that word repeated in the example with the symbol ~. How could I do this with Python?
Ok, here is my example of replacing every instance of a word in a sentence with another character...
>>> my_string = "the quick brown fox jumped over the lazy dog"
>>> search_word = "the"
>>> replacement_symbol = "~"
>>> my_string.replace(search_word, replacement_symbol)
'~ quick brown fox jumped over ~ lazy dog'
Obviously this doesn't cover loading the file, reading it line by line, and skipping the first instance of the word. Let's extend it a little.
words.txt
fox the quick brown fox jumped over the lazy dog
the the quick brown fox jumped over the lazy dog
jumped the quick brown fox jumped over the lazy dog
And to read this, strip the first word and then replace that word in the rest of the line...
with open('words.txt') as f:
    for line in f.readlines():
        line = line.strip()
        search_term = line.split(' ')[0]
        sentence = ' '.join(line.split(' ')[1:])
        sentence = sentence.replace(search_term, '~')
        line = '%s %s' % (search_term, sentence)
        print(line)
and the output...
fox the quick brown ~ jumped over the lazy dog
the ~ quick brown fox jumped over ~ lazy dog
jumped the quick brown fox ~ over the lazy dog
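One caveat with str.replace is that it also replaces the term inside longer words (for example "the" inside "other"). Here is a minimal sketch of a whole-word variant, assuming headwords consist of letters or digits so that \b word boundaries apply. The name mask_headword is only illustrative and is not part of the answer above:

```python
import re

def mask_headword(line, symbol="~"):
    """Replace whole-word repeats of the first word in the rest of the line."""
    headword, _, rest = line.strip().partition(" ")
    if not headword or not rest:
        # Nothing to mask: empty line, or a headword with no definition
        return line.strip()
    # re.escape keeps any regex metacharacters in the headword literal
    pattern = r"\b" + re.escape(headword) + r"\b"
    return headword + " " + re.sub(pattern, symbol, rest)

print(mask_headword("the the other fox bathed near the dog"))
# the ~ other fox bathed near ~ dog
```

Whether partial matches should be masked as well (plurals, for instance) depends on the word list, so treat this as a starting point.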
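Assuming the word and definition are separated by #:
with open('file.txt', 'r') as f:
    for line in f:
        # Split on the first '#' only, so a '#' inside the definition is kept
        myword, mydefinition = line.rstrip('\n').split('#', 1)
        if myword in mydefinition:
            # str.replace returns a new string, so the result must be assigned
            mydefinition = mydefinition.replace(myword, '~')
        print(myword + '#' + mydefinition)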
I have a sentence, let's say:
The quick brown fox jumps over the lazy dog
I want to create a function that takes 2 arguments, a sentence and a list of things to ignore. And it returns that sentence with the reversed words, however it should ignore the stuff I pass to it in a second argument. This is what I have at the moment:
def main(sentence, ignores):
    return ' '.join(word[::-1] if word not in ignores else word for word in sentence.split())
But this will only work if I pass a second list like so:
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
However, I want to pass a list like this:
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
expected result:
ehT quick brown xof spmuj revo eht lazy dog
So basically the second argument (the list) will have parts of the sentence that should be ignored. Not just single words.
Do I have to use regexp for this? I was trying to avoid it...
I'm the first person to recommend avoiding regular expressions, but in this case, the complexity of doing without is greater than the complexity added by using them:
import re
def main(sentence, ignores):
    # Dedup and allow fast lookup for determining whether to reverse a component
    ignores = frozenset(ignores)
    # Make a pattern that will prefer matching the ignore phrases, but
    # otherwise matches each space and non-space run (so nothing is dropped)
    # Alternations match the first pattern by preference, so you'll match
    # the ignores phrases if possible, and general space/non-space patterns
    # otherwise
    pat = r'|'.join(map(re.escape, ignores)) + r'|\S+|\s+'
    # Returns the chopped up pieces (space and non-space runs, but ignore phrases stay together)
    parts = re.findall(pat, sentence)
    # Reverse everything not found in ignores and then put it all back together
    return ''.join(p if p in ignores else p[::-1] for p in parts)
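Because a frozenset has no defined order, the order of the alternatives in that pattern is arbitrary, which matters when one ignore phrase is a prefix of another. A hedged sketch of one possible variation that sorts the phrases longest-first and also copes with an empty ignore list (reverse_words is a name I made up for the example):

```python
import re

def reverse_words(sentence, ignores):
    keep = set(ignores)
    # Longest phrases first, so 'quick brown fox' is tried before 'quick'
    alternatives = sorted(map(re.escape, keep), key=len, reverse=True)
    # Fall back to runs of non-space and space characters
    pattern = "|".join(alternatives + [r"\S+", r"\s+"])
    parts = re.findall(pattern, sentence)
    # Reverse every piece that is not one of the protected phrases
    return "".join(p if p in keep else p[::-1] for p in parts)

print(reverse_words('The quick brown fox jumps over the lazy dog',
                    ['quick brown', 'lazy dog']))
# ehT quick brown xof spmuj revo eht lazy dog
```

Like the original, this can still match a phrase that is only the start of a longer word, so it assumes the ignore phrases line up with whole words.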
Just another idea, reverse every word and then reverse the ignores right back:
>>> from functools import reduce
>>> def main(sentence, ignores):
...     def r(s):
...         return ' '.join(w[::-1] for w in s.split())
...     return reduce(lambda s, i: s.replace(r(i), i), ignores, r(sentence))
...
>>> main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog'])
'ehT quick brown xof spmuj revo eht lazy dog'
Instead of placeholders, why not just initially reverse any phrase that you want to be around the right way, then reverse the whole string:
def main(sentence, ignores):
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
returns:
ehT quick nworb xof spmuj revo eht lazy god
ehT quick brown xof spmuj revo eht lazy dog
I have attempted to solve the issue of overlapping ignore phrases, e.g. ['brown fox', 'quick brown'], raised by @PadraicCunningham.
There's obviously a lot more looping and the code feels less Pythonic, so I'd be interested in feedback on how to improve this.
import re
def _span_combiner(spans):
    """replace overlapping spans with encompassing single span"""
    for i, s in enumerate(spans):
        start = s[0]
        end = s[1]
        for x in spans[i:]:
            if x[0] < end:
                end = x[1]
        yield (start, end)
def main(sentence, ignores):
    # spans holds the start and end indices of the first occurrence of each ignore phrase, in order of occurrence
    spans = sorted(
        # re.escape matches each phrase literally, consistent with the `p in sentence` test
        [[m.span() for m in re.finditer(re.escape(p), sentence)][0] for p in ignores if p in sentence]
    )
    # replace overlapping indices with single set of indices encompassing overlapped range
    spans = [s for s in _span_combiner(spans)]
    # recreate ignore list by slicing sentence with combined spans
    ignores = [sentence[s[0]:s[1]] for s in spans]
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())
if __name__ == "__main__":
    print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
    print(main('The quick brown fox jumps over the lazy dog', ['brown fox', 'lazy dog']))
    print(main('The quick brown fox jumps over the lazy dog', ['nonexistent' ,'brown fox', 'quick brown']))
    print(main('The quick brown fox jumps over the brown fox', ['brown fox', 'quick brown']))
results:
ehT quick nworb xof spmuj revo eht lazy god
ehT kciuq brown fox spmuj revo eht lazy dog
ehT quick brown fox spmuj revo eht yzal god
ehT quick brown fox spmuj revo eht brown fox
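As one possible simplification, the overlapping spans can be merged with a single pass over the sorted spans, and the sentence can then be rebuilt piece by piece instead of reversing phrases twice. This is only a sketch under the assumption that ignore phrases start and end on word boundaries; merge_spans and reverse_outside are names invented for the example:

```python
import re

def merge_spans(spans):
    """Merge overlapping (start, end) spans into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def flip_words(text):
    """Reverse each run of non-space characters, keeping whitespace as is."""
    return re.sub(r"\S+", lambda m: m.group()[::-1], text)

def reverse_outside(sentence, ignores):
    spans = []
    for phrase in ignores:
        # Every occurrence of every phrase, matched literally
        spans.extend(m.span() for m in re.finditer(re.escape(phrase), sentence))
    out, pos = [], 0
    for start, end in merge_spans(spans):
        out.append(flip_words(sentence[pos:start]))  # unprotected text
        out.append(sentence[start:end])              # protected span
        pos = end
    out.append(flip_words(sentence[pos:]))
    return "".join(out)

print(reverse_outside('The quick brown fox jumps over the brown fox',
                      ['brown fox', 'quick brown']))
# ehT quick brown fox spmuj revo eht brown fox
```

Unlike the str.replace approach, this protects every occurrence of a phrase by position, so a later replacement cannot disturb an earlier one.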
I am streaming plain text records via MapReduce and need to check each plain text record for 2 or more consecutive punctuation symbols. The 12 symbols I need to check for are: -/\()!"+,'&..
I have tried translating this punctuation list into an array like this:
punctuation = [r'-', r'/', r'\\', r'\(', r'\)', r'!', r'"', r'\+', r',', r"'", r'&', r'\.']
I can find individual characters with nested for loops, for example:
for t in test_cases:
    print t
    for p in punctuation:
        print p
        if re.search(p, t):
            print 'found a match!', p, t
        else:
            print 'no match'
However, the single backslash character is not found when I test this and I don't know how to get only results that are 2 or more consecutive occurrences in a row. I've read that I need to use the + symbol, but don't know the correct syntax to use this.
Here are some test cases:
The quick '''brown fox
The &&quick brown fox
The quick\brown fox
The quick\\brown fox
The -quick brown// fox
The quick--brown fox
The (quick brown) fox,,,
The quick ++brown fox
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox
The quick,, brown fox
The quick brown fox...
The quick-brown fox
The ((quick brown fox
The quick brown)) fox
The quick brown fox!!!
The 'quick' brown fox
Which when translated into a Pythonic list looks like this:
test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
How do I use Python regex to identify and report all matches where the punctuation symbol appears 2 or more times in a row?
The punctuation characters can be put into a character class in square brackets.
What comes next depends on whether a run of two or more punctuation characters may mix any of the characters, or whether all characters in the run must be the same.
In the first case, curly braces can be appended to specify the minimum (2) and maximum number of repetitions. The maximum is unbounded, so it is left empty:
[...]{2,} # min. 2 or more
If only repetitions of the same character need to be found, the first matched punctuation character is put into a group. A backreference to that group (that is, the same character) then follows one or more times:
([...])\1+
The backreference \1 refers to the first group in the expression. Groups are numbered from left to right by their opening parentheses.
The next issue is escaping. Python string literals have their own escaping rules, and the regular expression needs additional escaping on top of them. The character class does not require much escaping, but the backslash must be doubled. Thus the following example quadruples the backslash: one doubling for the string literal, the second for the regular expression.
Raw strings r'...' are useful for patterns, but here both single and double quotation marks are needed inside the pattern, so the example uses an ordinary string with escapes.
>>> import re
>>> test_cases = [
"The quick '''brown fox",
'The &&quick brown fox',
'The quick\\brown fox',
'The quick\\\\brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The "quick brown" fox',
'The quick/brown fox',
'The quick&brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The quick brown fox...',
'The quick-brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!',
"The 'quick' brown fox" ]
>>> pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.]{2,})')
>>> pattern_same_punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
>>> for t in test_cases:
...     match = pattern_same_punctuation.search(t)
...     if match:
...         print("{:24} => {}".format(t, match.group(1)))
...     else:
...         print(t)
...
The quick '''brown fox => '''
The &&quick brown fox => &&
The quick\brown fox
The quick\\brown fox => \\
The -quick brown// fox => //
The quick--brown fox => --
The (quick brown) fox,,, => ,,,
The quick ++brown fox => ++
The "quick brown" fox
The quick/brown fox
The quick&brown fox
The ""quick"" brown fox => ""
The quick,, brown fox => ,,
The quick brown fox... => ...
The quick-brown fox
The ((quick brown fox => ((
The quick brown)) fox => ))
The quick brown fox!!! => !!!
The 'quick' brown fox
>>>
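As a side note, a raw triple-quoted string is one way to keep both quote characters in the pattern without string-level escaping, so the backslash only needs the regex-level doubling. The sketch below also uses finditer to report every run in a line rather than just the first. The names SAME_PUNCT, ANY_PUNCT, and repeated_runs are made up for this example:

```python
import re

# Raw triple-quoted strings: ' and " can both appear unescaped, and the
# backslash is doubled only once (for the regular expression itself).
SAME_PUNCT = re.compile(r'''([-/\\()!"+,&'.])\1+''')   # same character repeated
ANY_PUNCT = re.compile(r'''[-/\\()!"+,&'.]{2,}''')     # any mix of the characters

def repeated_runs(text, pattern=SAME_PUNCT):
    """Return every run in text that matches the given pattern."""
    return [m.group(0) for m in pattern.finditer(text)]

print(repeated_runs('The ""quick"" brown fox'))   # ['""', '""']
print(repeated_runs('fox!!,'))                    # ['!!']
print(repeated_runs('fox!!,', ANY_PUNCT))         # ['!!,']
```

The last two calls show the difference between the two patterns: the backreference version stops at the first different character, while the {2,} version accepts a mixed run.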
You can use {2} in a regular expression to match two consecutive characters from a character class (the class below leaves out the backslash and the period, so add them if you need those cases):
>>> regex = re.compile(r'[-/()!"+,\'&]{2}')
>>> [s for s in test_cases if regex.search(s)]
["The quick '''brown fox",
'The &&quick brown fox',
'The -quick brown// fox',
'The quick--brown fox',
'The (quick brown) fox,,,',
'The quick ++brown fox',
'The ""quick"" brown fox',
'The quick,, brown fox',
'The ((quick brown fox',
'The quick brown)) fox',
'The quick brown fox!!!']
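What about a regular expression? That could also help to find 2 or more consecutive punctuation symbols.
A regex like /([\\\-\/\(\)!"+,'&]{2,})/g
{2,} stands for two or more
/g stands for global search (don't stop at the first match); this is regex-literal notation from other languages, and in Python re.findall or re.finditer returns all matches instead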
Thanks to @Heiko Oberdiek, here is the exact code I am using that solves the problem (I added . to the punctuation list):
punctuation = re.compile('(([-/\\\\()!"+,&\'.])\\2+)')
x = 1
for t in test_cases:
    match = punctuation.search(t)
    if match:
        print "{0:2} {1:24} => {2}".format(x, t, match.group(1))
        x += 1
This accurately covers all of my test cases:
1 The quick '''brown fox => '''
2 The &&quick brown fox => &&
3 The quick\\brown fox => \\
4 The -quick brown// fox => //
5 The quick--brown fox => --
6 The (quick brown) fox,,, => ,,,
7 The quick ++brown fox => ++
8 The ""quick"" brown fox => ""
9 The quick,, brown fox => ,,
10 The quick brown fox... => ...
11 The ((quick brown fox => ((
12 The quick brown)) fox => ))
13 The quick brown fox!!! => !!!
I am very new to Python and I am having trouble matching optional strings when there can be any number of words between the groups. Here is an example of what I am looking for:
'The quick brown fox jumps over the lazy dog'
I want the word following 'brown' and if the word 'lazy' is present I want the word following it as well, i.e:
'The quick brown fox jumps over the lazy dog' --> ('fox', 'dog')
'The quick brown fox' --> ('fox', '')
'The quick brown fox dfjdnjcnjdn vvvv lazy mouse' --> ('fox', 'mouse')
'The quick brown fox lazy dog' --> ('fox', 'dog')
Here is what I tried, but it is not working
re.findall(r'brown (\S+)(.*?)(lazy )?(\S+)?', str)
What am I doing wrong and how can I fix this?
You could use the following to get the words you're looking for:
brown (\S+)(?:.*lazy (\S+))?
This gives a list of tuples, with an empty string in the second position if 'lazy' is not present.
>>> import re
>>> s = """The quick brown fox jumps over the lazy dog
... The quick brown fox
... The quick brown fox dfjdnjcnjdn vvvv lazy mouse
... The quick brown fox lazy dog"""
>>> re.findall(r'brown (\S+)(?:.*lazy (\S+))?', s)
[('fox', 'dog'), ('fox', ''), ('fox', 'mouse'), ('fox', 'dog')]
>>>
(?: ... ) creates a non-capturing group: what it matches does not appear in the tuple/list returned by re.findall unless it is itself inside a capturing group.
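If you process one sentence at a time, the same pattern can be wrapped in a small helper. This is only a sketch, and words_after is a hypothetical name. Note that with re.search a group that did not take part in the match is None rather than the empty string that re.findall reports, so the helper converts it:

```python
import re

PATTERN = re.compile(r'brown (\S+)(?:.*lazy (\S+))?')

def words_after(text):
    """Return (word after 'brown', word after 'lazy' or ''), or None if no match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # group(2) is None when the optional 'lazy' part is absent
    return (m.group(1), m.group(2) or '')

print(words_after('The quick brown fox jumps over the lazy dog'))  # ('fox', 'dog')
print(words_after('The quick brown fox'))                          # ('fox', '')
```

Since . does not match a newline by default, the optional part never reaches into the next line, which is also why the multi-line findall example above yields one tuple per line.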
You can use a pattern such as:
(?:brown|lazy)\s(\S+)
Below is a breakdown of what it matches:
(?:brown|lazy) # The words 'brown' or 'lazy'
\s # A whitespace character
(\S+) # One or more non-whitespace characters
And here is a demonstration:
>>> import re
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox jumps over the lazy dog')
['fox', 'dog']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox')
['fox']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox dfjdnjcnjdn vvvv lazy mouse')
['fox', 'mouse']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox lazy dog')
['fox', 'dog']
>>>
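One thing to keep in mind with this pattern is that the result does not say which keyword each word followed, so a missing 'brown' and a missing 'lazy' look the same. A rough sketch that captures the keyword too and returns a fixed-shape tuple (following_words and its keywords parameter are assumptions made for the example):

```python
import re

def following_words(text, keywords=("brown", "lazy")):
    """Return the word following each keyword, or '' where a keyword is absent."""
    alternation = "|".join(map(re.escape, keywords))
    # Capture the keyword as well, so each word can be tied to its keyword
    pairs = re.findall(r"\b(" + alternation + r")\s(\S+)", text)
    found = dict(pairs)  # if a keyword repeats, the last occurrence wins
    return tuple(found.get(k, "") for k in keywords)

print(following_words('The quick brown fox jumps over the lazy dog'))  # ('fox', 'dog')
print(following_words('The quick brown fox'))                          # ('fox', '')
```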