Python String split using a regex - python

We want to split a multi-line string, for example:
|---------------------------------------------Title1(a)---------------------------------------------
Content goes here, the quick brown fox jumps over the lazy dog
|---------------------------------------------Title1(b)----------------------------------------------
Content goes here, the quick brown fox jumps over the lazy dog
Here is our Python code, which splits using a regex:
import re
str1 = "|---------------------------------------------Title1(a)---------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"" \
"|---------------------------------------------Title1(b)----------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"|"
print(str1)
str2 = re.split(r"\|---------------------------------------------", str1)
print(str2)
We want the output to include only
str2[0]:
Content goes here, the quick brown fox jumps over the lazy dog
str2[1]:
Content goes here, the quick brown fox jumps over the lazy dog
What is the proper regex to use, or is there another way to split a string in the format above?

Instead of using split, you can match the lines and capture the part that you want in a group.
\|-{2,}[^-]+-{2,}([^-].*?)(?=\|)
Explanation
\| Match |
-{2,} Match 2 or more -
[^-]+ Match 1+ times any char except -
-{2,} Match 2 or more -
( Capture group 1
[^-].*? Match any char except -, then any char, as few as possible
) Close group 1
(?=\|) Positive lookahead, assert a | to the right
Example
import re
regex = r"\|-{2,}[^-]+-{2,}([^-].*?)(?=\|)"
str1 = "|---------------------------------------------Title1(a)---------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"" \
"|---------------------------------------------Title1(b)----------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"|"
str2 = re.findall(regex, str1)
print(str2[0])
print(str2[1])
Output
Content goes here, the quick brown fox jumps over the lazy dog
Content goes here, the quick brown fox jumps over the lazy dog
If Title should be part of the line, another option is to make the match a bit more precise.
\|-+Title\d+\([a-z]\)-+(.+?)(?=\||$)
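As a rough sketch of how the more precise pattern could be used with re.findall, assuming the same single-line sample string as in the example above (the names TITLE_RE, sample and contents are only illustrative):

```python
import re

# Hypothetical names; the pattern itself is the one quoted above.
TITLE_RE = re.compile(r"\|-+Title\d+\([a-z]\)-+(.+?)(?=\||$)")

# Same shape as str1 above: title lines and content joined without newlines
sample = (
    "|" + "-" * 45 + "Title1(a)" + "-" * 45
    + "Content goes here, the quick brown fox jumps over the lazy dog"
    + "|" + "-" * 45 + "Title1(b)" + "-" * 46
    + "Content goes here, the quick brown fox jumps over the lazy dog"
    + "|"
)

# findall returns only the captured group, i.e. the content after each title
contents = TITLE_RE.findall(sample)
for chunk in contents:
    print(chunk)
```

Because of the `|$` alternative in the lookahead, the last block should still be captured if the trailing | is missing.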

Related

How to delete a word in a string if it appears more than once in the same line? [closed]

I have a list of words in a txt file, each one in a line with its definition next to them. However, the definition sometimes gives a sentence using the word. I want to replace that word repeated in the example with the symbol ~. How could I do this with Python?
Ok, here is my example of replacing every instance of a word in a sentence with another character...
>>> my_string = "the quick brown fox jumped over the lazy dog"
>>> search_word = "the"
>>> replacement_symbol = "~"
>>> my_string.replace(search_word, replacement_symbol)
'~ quick brown fox jumped over ~ lazy dog'
Obviously this doesn't cover loading the file, reading it line by line and skipping the first instance of the word... Let's extend it a little.
words.txt
fox the quick brown fox jumped over the lazy dog
the the quick brown fox jumped over the lazy dog
jumped the quick brown fox jumped over the lazy dog
And to read this, strip the first word and then replace that word in the rest of the line...
with open('words.txt') as f:
    for line in f.readlines():
        line = line.strip()
        search_term = line.split(' ')[0]
        sentence = ' '.join(line.split(' ')[1:])
        sentence = sentence.replace(search_term, '~')
        line = '%s %s' % (search_term, sentence)
        print(line)
and the output...
fox the quick brown ~ jumped over the lazy dog
the ~ quick brown fox jumped over ~ lazy dog
jumped the quick brown fox ~ over the lazy dog
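One caveat with str.replace is that it also replaces the word when it occurs inside a longer word (for example "the" inside "other"). A minimal sketch of a whole-word variant using re.sub with \b, assuming the headword is the first space-separated token on the line; mask_headword and the sample line are made up for illustration:

```python
import re

def mask_headword(line, symbol="~"):
    """Replace whole-word repeats of the first word in the rest of the line."""
    headword, _, sentence = line.strip().partition(" ")
    if not headword:
        return ""
    # \b restricts matches to whole words; re.escape guards regex metacharacters
    pattern = r"\b" + re.escape(headword) + r"\b"
    return (headword + " " + re.sub(pattern, symbol, sentence)).rstrip()

print(mask_headword("the the other fox jumped over the lazy dog"))
```

Note that \b only behaves as expected when the headword starts and ends with a letter, digit or underscore.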
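Assuming the word and its definition are separated by #:
with open('file.txt', 'r') as f:
    for line in f:
        # Split only on the first '#', in case the definition contains one
        myword, mydefinition = line.rstrip('\n').split("#", 1)
        if myword in mydefinition:
            # str.replace returns a new string, so the result must be assigned
            mydefinition = mydefinition.replace(myword, "~")
        print(myword + "#" + mydefinition)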

How can I reverse parts of sentence in python?

I have a sentence, let's say:
The quick brown fox jumps over the lazy dog
I want to create a function that takes two arguments: a sentence and a list of things to ignore. It should return the sentence with each word reversed, except for whatever I pass in the second argument. This is what I have at the moment:
def main(sentence, ignores):
    return ' '.join(word[::-1] if word not in ignores else word for word in sentence.split())
But this will only work if I pass a second list like so:
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
However, I want to pass a list like this:
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
expected result:
ehT quick brown xof spmuj revo eht lazy dog
So basically the second argument (the list) will have parts of the sentence that should be ignored. Not just single words.
Do I have to use regexp for this? I was trying to avoid it...
I'm the first person to recommend avoiding regular expressions, but in this case, the complexity of doing without is greater than the complexity added by using them:
import re
def main(sentence, ignores):
    # Dedup and allow fast lookup for determining whether to reverse a component
    ignores = frozenset(ignores)
    # Make a pattern that will prefer matching the ignore phrases, but
    # otherwise matches each space and non-space run (so nothing is dropped)
    # Alternations match the first pattern by preference, so you'll match
    # the ignores phrases if possible, and general space/non-space patterns
    # otherwise
    pat = r'|'.join(map(re.escape, ignores)) + r'|\S+|\s+'
    # Returns the chopped up pieces (space and non-space runs, but ignore phrases stay together)
    parts = re.findall(pat, sentence)
    # Reverse everything not found in ignores and then put it all back together
    return ''.join(p if p in ignores else p[::-1] for p in parts)
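A usage sketch of the same alternation idea. Since a frozenset has no defined order, this variant sorts the phrases longest-first so that a longer phrase is tried before a shorter one sharing its prefix; that ordering is an addition here, not part of the answer above, and reverse_except is a hypothetical name:

```python
import re

def reverse_except(sentence, ignores):
    keep = frozenset(ignores)
    # Longest phrases first, so that e.g. 'lazy dog' is tried before 'lazy'
    phrases = sorted(keep, key=len, reverse=True)
    # Ignore phrases first, then generic non-space and space runs
    pat = '|'.join([re.escape(p) for p in phrases] + [r'\S+', r'\s+'])
    parts = re.findall(pat, sentence)
    # Reverse every piece that is not an ignore phrase, then reassemble
    return ''.join(p if p in keep else p[::-1] for p in parts)

print(reverse_except('The quick brown fox jumps over the lazy dog',
                     ['quick brown', 'lazy dog']))
```

Building the alternation from a list also avoids a leading empty alternative when the ignore list is empty.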
Just another idea, reverse every word and then reverse the ignores right back:
>>> from functools import reduce
>>> def main(sentence, ignores):
...     def r(s):
...         return ' '.join(w[::-1] for w in s.split())
...     return reduce(lambda s, i: s.replace(r(i), i), ignores, r(sentence))
...
>>> main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog'])
'ehT quick brown xof spmuj revo eht lazy dog'
Instead of placeholders, why not just initially reverse any phrase that you want to be around the right way, then reverse the whole string:
def main(sentence, ignores):
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
returns:
ehT quick nworb xof spmuj revo eht lazy god
ehT quick brown xof spmuj revo eht lazy dog
I have attempted to solve the issue of overlapping ignore phrases, e.g. ['brown fox', 'quick brown'], raised by @PadraicCunningham.
There's obviously a lot more looping and the code feels less Pythonic, so I'd be interested in feedback on how to improve this.
import re
def _span_combiner(spans):
    """Replace overlapping spans with a single encompassing span."""
    for i, s in enumerate(spans):
        start = s[0]
        end = s[1]
        for x in spans[i:]:
            if x[0] < end:
                end = x[1]
        yield (start, end)
def main(sentence, ignores):
    # spans holds the start and end indices of each ignore phrase, in order of occurrence
    spans = sorted(
        [[m.span() for m in re.finditer(p, sentence)][0] for p in ignores if p in sentence]
    )
    # replace overlapping indices with a single set of indices encompassing the overlapped range
    spans = [s for s in _span_combiner(spans)]
    # recreate the ignore list by slicing the sentence with the combined spans
    ignores = [sentence[s[0]:s[1]] for s in spans]
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())
if __name__ == "__main__":
    print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
    print(main('The quick brown fox jumps over the lazy dog', ['brown fox', 'lazy dog']))
    print(main('The quick brown fox jumps over the lazy dog', ['nonexistent', 'brown fox', 'quick brown']))
    print(main('The quick brown fox jumps over the brown fox', ['brown fox', 'quick brown']))
results:
ehT quick nworb xof spmuj revo eht lazy god
ehT kciuq brown fox spmuj revo eht lazy dog
ehT quick brown fox spmuj revo eht yzal god
ehT quick brown fox spmuj revo eht brown fox

Python - removing some punctuation from text

I want Python to remove only some punctuation from a string. Let's say I want to remove all the punctuation except '#':
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Here the output is
The quick brown fox like totally jumped man
But what I want is something like this
The quick brown fox like totally jumped #man
Is there a way to selectively remove punctuation from a text leaving out the punctuation that we want in the text intact?
string.punctuation contains all the punctuation characters. Remove # from it, then replace any remaining punctuation character with '' wherever it occurs. re.escape keeps characters such as ], ^ and \ from being misread inside the character class.
>>> import re
>>> import string
>>> a = string.punctuation.replace('#','')
>>> re.sub(r'[{}]'.format(re.escape(a)),'','The quick brown fox, like, totally jumped, #man!')
'The quick brown fox like totally jumped #man'
Just remove the character you don't want to touch from the replacement string:
import string
remove = dict.fromkeys(map(ord, '\n' + string.punctuation.replace('#','')))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Also note that I changed '\n ' to '\n', as the former will remove spaces from your string.
Result:
The quick brown fox like totally jumped #man
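In Python 3 the same kind of table can also be built with str.maketrans, whose third argument lists the characters to delete. A small sketch; strip_punctuation and its keep parameter are illustrative names, not part of the answers above:

```python
import string

def strip_punctuation(text, keep="#"):
    """Delete ASCII punctuation from text, except the characters in keep."""
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    # Third argument of str.maketrans: characters mapped to None (deleted)
    return text.translate(str.maketrans("", "", to_delete))

print(strip_punctuation("The quick brown fox, like, totally jumped, #man!"))
```

Passing keep="#!" would leave the exclamation mark in place as well.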

Match optional string when random strings present between groups

I am very new to Python and I am having trouble matching optional strings when there can be any number of strings between the groups. Here is an example of what I am looking for:
'The quick brown fox jumps over the lazy dog'
I want the word following 'brown' and if the word 'lazy' is present I want the word following it as well, i.e:
'The quick brown fox jumps over the lazy dog' --> ('fox', 'dog')
'The quick brown fox' --> ('fox', '')
'The quick brown fox dfjdnjcnjdn vvvv lazy mouse' --> ('fox', 'mouse')
'The quick brown fox lazy dog' --> ('fox', 'dog')
Here is what I tried, but it is not working
re.findall(r'brown (\S+)(.*?)(lazy )?(\S+)?', str)
What am I doing wrong and how to fix this?
You could use the following to get the words you're looking for:
brown (\S+)(?:.*lazy (\S+))?
Which would give a list of tuples, with the empty string if lazy is not present.
>>> import re
>>> s = """The quick brown fox jumps over the lazy dog
... The quick brown fox
... The quick brown fox dfjdnjcnjdn vvvv lazy mouse
... The quick brown fox lazy dog"""
>>> re.findall(r'brown (\S+)(?:.*lazy (\S+))?', s)
[('fox', 'dog'), ('fox', ''), ('fox', 'mouse'), ('fox', 'dog')]
>>>
(?: ... ) is used to make groups that won't get captured, so what's inside won't necessarily get into the tuple/list with re.findall unless it is itself within capture group(s).
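For a single string, the same pattern can be used with re.search, filling in the empty string when the optional group did not take part in the match. A sketch, where PATTERN and words_after are hypothetical names:

```python
import re

PATTERN = re.compile(r'brown (\S+)(?:.*lazy (\S+))?')

def words_after(text):
    """Return (word after 'brown', word after 'lazy' or ''), or None if no match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # groups(default='') substitutes '' for groups that did not participate
    first, second = m.groups(default='')
    return first, second

print(words_after('The quick brown fox jumps over the lazy dog'))
print(words_after('The quick brown fox'))
```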
You can use a pattern such as:
(?:brown|lazy)\s(\S+)
Below is a breakdown of what it matches:
(?:brown|lazy) # The words 'brown' or 'lazy'
\s # A whitespace character
(\S+) # One or more non-whitespace characters
And here is a demonstration:
>>> import re
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox jumps over the lazy dog')
['fox', 'dog']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox')
['fox']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox dfjdnjcnjdn vvvv lazy mouse')
['fox', 'mouse']
>>> re.findall(r'(?:brown|lazy)\s(\S+)', 'The quick brown fox lazy dog')
['fox', 'dog']
>>>

Python replace spaces in string iteratively [closed]

I'm trying to replace spaces with hyphens one at a time in each possible position in python. For example the man said hi should produce a list of all the possible hyphen positions, including multiple hyphens:
the-man said hi
the man-said hi
the man said-hi
the-man said-hi
the-man-said hi
the man-said-hi
the-man-said-hi
The length of the strings varies in number of spaces, so it can't be a fix for just 3 spaces. I've been experimenting with re.search and re.sub in a while loop, but haven't found a nice way yet.
Use itertools.product() to produce all space-and-dash combinations, then recombine your string with those:
from itertools import product
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in product(' -', repeat=len(words) - 1):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
The last line zips the words together with the dashes and spaces (adding in an empty string at the end to make up the pairs), then flattens that and joins them into a single string.
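To make the zip-and-flatten step concrete, here is the same expression unpacked for one fixed combination; the variable names are only for illustration:

```python
words = 'the man said hi'.split()
combo = ('-', ' ', '-')            # one element of product(' -', repeat=3)

# Pad the separators with '' so that every word has a partner
pairs = list(zip(words, combo + ('',)))
print(pairs)   # [('the', '-'), ('man', ' '), ('said', '-'), ('hi', '')]

# Flatten the (word, separator) pairs and join them into one string
flat = [piece for pair in pairs for piece in pair]
result = ''.join(flat)
print(result)  # the-man said-hi
```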
Demo:
>>> for combo in dashed_combos('the man said hi'):
...     print(combo)
...
the man said hi
the man said-hi
the man-said hi
the man-said-hi
the-man said hi
the-man said-hi
the-man-said hi
the-man-said-hi
You can always skip the first iteration of that loop (with only spaces) with itertools.islice():
from itertools import product, islice
def dashed_combos(inputstring):
    words = inputstring.split()
    for combo in islice(product(' -', repeat=len(words) - 1), 1, None):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)
All this is extremely memory efficient; you can easily handle inputs with hundreds of words, provided you don't try and store all possible combinations in memory at once.
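As a rough illustration of the laziness, the sketch below feeds the generator a made-up 300-word input and takes only the first three results; long_input and first_three are hypothetical names, and the function is repeated so the snippet runs on its own:

```python
from itertools import islice, product

def dashed_combos(inputstring):
    words = inputstring.split()
    # Skip the first combination, which uses spaces only
    for combo in islice(product(' -', repeat=len(words) - 1), 1, None):
        yield ''.join(w for pair in zip(words, combo + ('',)) for w in pair)

# 300 words would give 2**299 - 1 results; only the requested ones are built
long_input = ' '.join('w%d' % i for i in range(300))
first_three = list(islice(dashed_combos(long_input), 3))
for line in first_three:
    print(line[-20:])   # show just the tail of each very long string
```

Calling list() on the whole generator for an input like this would never finish, which is why the results are consumed with islice.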
Slightly longer demo:
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 10):
...     print(combo)
...
the quick brown fox jumped over the lazy-dog
the quick brown fox jumped over the-lazy dog
the quick brown fox jumped over the-lazy-dog
the quick brown fox jumped over-the lazy dog
the quick brown fox jumped over-the lazy-dog
the quick brown fox jumped over-the-lazy dog
the quick brown fox jumped over-the-lazy-dog
the quick brown fox jumped-over the lazy dog
the quick brown fox jumped-over the lazy-dog
the quick brown fox jumped-over the-lazy dog
>>> for combo in islice(dashed_combos('the quick brown fox jumped over the lazy dog'), 200, 210):
...     print(combo)
...
the-quick-brown fox jumped-over the lazy-dog
the-quick-brown fox jumped-over the-lazy dog
the-quick-brown fox jumped-over the-lazy-dog
the-quick-brown fox jumped-over-the lazy dog
the-quick-brown fox jumped-over-the lazy-dog
the-quick-brown fox jumped-over-the-lazy dog
the-quick-brown fox jumped-over-the-lazy-dog
the-quick-brown fox-jumped over the lazy dog
the-quick-brown fox-jumped over the lazy-dog
the-quick-brown fox-jumped over the-lazy dog
