Replace single line-feed characters, keep multiples [duplicate] - python

This question already has answers here:
replacing only single instances of a character with python regexp
(4 answers)
Closed 6 years ago.
The community reviewed whether to reopen this question 1 year ago and left it closed:
Original close reason(s) were not resolved
I am parsing a text file and want to remove all in-paragraph line breaks, while actually keeping the double line feeds that form new paragraphs. e.g.
This is my first poem\nthat does not make sense\nhow far should it go\nnobody can know.\n\nHere is a seconds\nthat is not as long\ngoodbye\n\n
When printed out, this should look like this:
This is my first poem
that does not make sense
how far should it go
nobody can know.

Here is a seconds
that is not as long
goodbye
should become
This is my first poem that does not make sense how far should it go nobody can know.\n\nHere is a seconds that is not as long goodbye\n\n
Again, when printed, it should look like:
This is my first poem that does not make sense how far should it go nobody can know.

Here is a seconds that is not as long goodbye
The trick here is in removing single occurrences of '\n', while keeping the double line feed '\n\n', AND in preserving white space (i.e. "hello\nworld" becomes "hello world" and not "helloworld").
I can do this by first substituting the \n\n with a dummy string (like "$$$", or something equally ridiculous), then removing the \n followed by reconversion of "$$$" back to \n\n...but that seems overly circuitous. Can I make this conversion with a single regular expression call?

You may replace every newline that is neither preceded nor followed by another newline with a space:
re.sub(r"(?<!\n)\n(?!\n)", " ", s)
See the Python demo:
import re
s = "This is my first poem\nthat does not make sense\nhow far should it go\nnobody can know.\n\nHere is a seconds\nthat is not as long\ngoodbye\n\n"
res = re.sub(r"(?<!\n)\n(?!\n)", " ", s)
print(res)
Here, the (?<!\n) is a negative lookbehind that fails the match if a newline is preceded by another newline, and (?!\n) is a negative lookahead that fails the match if the newline is followed by another newline.
See more about Lookahead and Lookbehind Zero-Length Assertions here.
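As a minimal sketch, the substitution above can be wrapped in a small helper and checked on a shortened sample. The name join_wrapped_lines and the sample text are assumptions made for illustration; they are not part of the original answer.

```python
import re

# A newline with no other newline immediately before or after it.
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")

def join_wrapped_lines(text):
    """Replace isolated newlines with a space; runs of 2+ newlines are untouched."""
    return SINGLE_NEWLINE.sub(" ", text)

sample = "first line\nsecond line\n\nnext paragraph\nends here\n\n"
print(repr(join_wrapped_lines(sample)))
# 'first line second line\n\nnext paragraph ends here\n\n'
```

Note that a lone newline at the very end of the string also counts as "single" and would be replaced by a space; strip the text first if that is not wanted.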

Related

How to find all every element between text Python [duplicate]

This question already has answers here:
Find string between two substrings [duplicate]
(20 answers)
Closed last year.
I'd like to know how to find characters in between texts in python. What I mean is that you have for example:
cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
and I want to save to another string everything in between something: to that's it.
So the result would be:
soultion_to_cool_string = ' not cool 8+8 '
You can use str.find()
start = "something:"
end = "that's it"
cool_string[cool_string.find(start) + len(start):cool_string.find(end)]
If you need to remove the surrounding whitespace, use str.strip().
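One thing to watch with str.find() is that it returns -1 when a marker is missing, which silently produces a wrong slice. A hedged sketch of a safer wrapper follows; the helper name text_between is hypothetical, and it also starts the search for the end marker only after the start marker.

```python
def text_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    start_idx = text.find(start)
    if start_idx == -1:
        return None  # start marker not present
    start_idx += len(start)
    # Search for the end marker only after the start marker.
    end_idx = text.find(end, start_idx)
    if end_idx == -1:
        return None  # end marker not present
    return text[start_idx:end_idx]

cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
print(repr(text_between(cool_string, "something:", "that's it")))
# ' not cool 8+8 '
```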
You should look into regular expressions; they will do the job. https://docs.python.org/3/howto/regex.html
Now for your question we will first require the lookahead and lookbehind expressions
The lookahead:
Asserts that what immediately follows the current position in the string is foo.
Syntax: (?=foo)
The lookbehind:
Asserts that what immediately precedes the current position in the string is foo.
Syntax: (?<=foo)
We need to look behind for something: and lookahead for that's it
import re
regex = r"(?<=something:).*?(?=that's it)"  # .*? lazily matches everything in between, except line terminators
re.findall(regex, cool_string)
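The same lookbehind/lookahead idea can be generalised to arbitrary literal markers. This is only a sketch: the function name regex_between is an assumption, re.escape() is used so that characters such as + or . in the markers are taken literally, and re.DOTALL is added so that the match may also span line breaks.

```python
import re

def regex_between(text, start, end):
    """Find every shortest stretch of text between the literal markers."""
    # The escaped marker is a fixed-width literal, so it is valid in a lookbehind.
    pattern = "(?<=" + re.escape(start) + ").*?(?=" + re.escape(end) + ")"
    return re.findall(pattern, text, flags=re.DOTALL)

cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
print(regex_between(cool_string, "something:", "that's it"))
# [' not cool 8+8 ']
```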

Searching multiple sub strings with special character as marker [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 4 years ago.
I have a string like :
myStr = "abcd123[ 45][12] cd [67]"
I want to fetch all the sub-strings between '[' and ']' markers.
I am using findall to fetch them, but all I get is everything between the first '[' and the last ']'.
print(re.findall(r'\[(.+)\]', myStr))
What am I doing wrong here?
This will probably be marked as duplicate, but the simple fix here would be to just make your dot lazy:
print(re.findall(r'\[(.+?)\]', myStr))
[' 45', '12', '67']
Here .+? means consume everything until hitting first, or nearest, closing square bracket. Your current pattern is consuming everything until the very last closing square bracket.
Another logically equivalent pattern which would also work is \[([^\]]+)\]:
print(re.findall(r'\[([^\]]+)\]', myStr))
The .+ is greedy and selects as much as it can, including other [] characters.
You have two options: Make the selector non-greedy by using .+? which selects the least number of characters possible, or explicitly exclude [] from your match by using [^\[\]]+ instead of .+.
(Both of these options are about equally good in this case. Though the "non-greedy" option is preferable if your ending delimiter is a longer string instead of a single character, since the longer string is more difficult to exclude.)
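The three variants discussed in these answers can be compared side by side. This is a small illustrative sketch using the string from the question; the variable names are assumptions.

```python
import re

my_str = "abcd123[ 45][12] cd [67]"

# Greedy: .+ runs to the last ']' in the string.
greedy = re.findall(r"\[(.+)\]", my_str)
# Lazy: .+? stops at the nearest ']'.
lazy = re.findall(r"\[(.+?)\]", my_str)
# Negated class: brackets are excluded from the match explicitly.
negated = re.findall(r"\[([^\[\]]+)\]", my_str)

print(greedy)   # [' 45][12] cd [67']
print(lazy)     # [' 45', '12', '67']
print(negated)  # [' 45', '12', '67']
```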

Preserve key:value values in text while regex replacing non-word characters in keys (Notepad++)

I'm trying, without luck, to use Notepad++ to replace every non-word character (\W) with an underscore (_) in a block of multi-line text, except for a colon (:) and everything to the right of it. Not every line contains a colon; the text is something of a space-delineated hierarchy, terminating in a key-value pair. A Python solution could be of use as well, as I'm trying to do other things with the text once it is reformatted. Example:
This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
This_100_is_what_I_d_like: See?
Indentation_isn_t_necessary
_to_maintain_but_would_be_nice: :)<-preserved!
I_m_Mr_Conformist_over_here: |Whereas, I'm like whatever's clever.|
If_you_can_help: Thanks 100.1%!
I admit that I'm answering an off-topic question; I just liked the problem. Press Ctrl+H, enable Regular Expressions in N++, then search for:
(:[^\r\n]*|^\s+)|\W(?<![\r\n])
And replace with:
(?1\1:_)
The regex has two main parts. The first side of the outer alternation matches either the leading whitespace of a line (indentation) or everything from the first colon to the end of the line. The second side matches any non-word character except a carriage return \r or newline \n (excluded by the negative lookbehind), which preserves line breaks. The replacement string is a conditional: if the first capturing group matched, the match is replaced with itself; otherwise it is replaced with _.
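Python's re module has no conditional replacement syntax like (?1\1:_), but a replacement function can play the same role. The following is a rough Python adaptation of the idea, not the Notepad++ expression itself: the name sanitize_keys is hypothetical, indentation is matched with [ \t]+ rather than \s+, and line breaks are excluded with a negated class instead of a lookbehind.

```python
import re

# Group 1: text to keep as-is (colon to end of line, or leading indentation).
# Otherwise: a single non-word character that is not a line break.
KEEP_OR_REPLACE = re.compile(r"(:[^\r\n]*|^[ \t]+)|[^\w\r\n]", re.MULTILINE)

def sanitize_keys(text):
    """Replace non-word characters with '_' except indentation and values."""
    return KEEP_OR_REPLACE.sub(
        lambda m: m.group(1) if m.group(1) is not None else "_", text
    )

text = (
    "This 100% isn't what I want\n"
    "  Yet, it's-what-I've got currently: D#rnit :(\n"
    "If you can help: Thanks 100.1%!"
)
print(sanitize_keys(text))
```

Each non-word character becomes its own underscore here, so "100% " turns into "100__"; collapsing runs into a single underscore would need a further step.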
Seeing a better description of what you're trying to do, I don't think you'll be able to do it from inside Notepad++ using a single regular expression. However, you could write a Python script that scrolls through your document, one line at a time, and sanitizes anything to the left of a colon (if one exists).
Here's a quick and dirty example (untested). This assumes doc is an open file object for the file you want to sanitize:
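import re
sanitized_lines = []
for line in doc:
    # Split the line into indentation, the key left of the first colon, and the rest.
    # re.DOTALL lets (.*) keep the trailing newline so the lines rejoin correctly.
    line_match = re.match(r"^(\s*)([^:\n]*)(.*)", line, re.DOTALL)
    indentation = line_match.group(1)
    left_of_colon = line_match.group(2)
    remainder = line_match.group(3)
    # Only the key part is sanitized; the value after the colon is left alone.
    left_of_colon = re.sub(r"\W", "_", left_of_colon)
    sanitized_lines.append("".join((indentation, left_of_colon, remainder)))
sanitized_doc = "".join(sanitized_lines)
print(sanitized_doc)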
You may try this python script,
ss="""This 100% isn't what I want
Yet, it's-what-I've got currently: D#rnit :(
If you can help: Thanks 100.1%!"""
import re
splitcapture=re.compile(r'(?m)^([^:\n]+)(:[^\n]*|)$')
subregx=re.compile(r'\W+')
print(splitcapture.sub(lambda m: subregx.sub('_', m.group(1))+m.group(2), ss))
Here, each line is first matched and split into two captured parts: group 1 holds the part that contains no ':' character, and group 2 holds the optional part that starts with ':' and runs to the end of the line. The replacement is then applied only to group 1, and the result is joined back with group 2.
And output is
This_100_isn_t_what_I_want_
_Yet_it_s_what_I_ve_got_currently: D#rnit :(
If_you_can_help: Thanks 100.1%!

Sentence splitting based in regular expression [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I am trying to split an article into sentences, using the following code (written by someone who has left the organization). Please help me understand the code:
re.split(r' *[.,:-\@/_&?!;][\s ]+', x)
It looks for punctuation marks such as stops, commas and colons, optionally preceded by spaces and always followed by at least one whitespace character. In the commonest case that will be ". ". Then it splits the string x into pieces by removing the matched punctuation and returning whatever is left as a list.
>>> x = "First sentence. Second sentence? Third sentence."
>>> re.split(r' *[.,:-\@/_&?!;][\s ]+', x)
['First sentence', 'Second sentence', 'Third sentence.']
The regular expression is unnecessarily complex and doesn't do a very good job.
This bit: :-\@ has a redundant quoting backslash, and means the characters between ASCII 58 and 64, in other words : ; < = > ? @, but it would be better to list the 7 characters explicitly, because most people will not know what characters fall in that range. That includes me: I had to look it up. And it clearly also includes the code's author, since he redundantly specified ; again at the end.
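Rather than looking the range up, it can be expanded programmatically. A small sketch, assuming the range runs from ':' (code 58) to '@' (code 64) as described above; the variable names are illustrative.

```python
import re

# Expand the range by character code: 58 (':') through 64 ('@').
expanded = "".join(chr(code) for code in range(ord(":"), ord("@") + 1))
print(expanded)  # :;<=>?@

# Cross-check against what the character class itself accepts.
matched = [ch for ch in map(chr, range(128)) if re.fullmatch(r"[:-@]", ch)]
print(matched)
```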
This bit, [\s ]+, means one or more spaces or whitespace characters, but a space is a whitespace character, so that could be more simply expressed as \s+.
Note the retained full stop in the 3rd element of the returned list. That is because when the full stop comes at the end of the line, it is not followed by a space, and the regular expression insists that it must be. Retaining the full stop is okay, but only if it is done consistently for all sentences, not just for the ones that end at a line break.
Throw away that bit of code and start from scratch. Or use nltk, which has power tools for splitting text into sentences and is likely to do a much more respectable job.
>>> import nltk
>>> sent_tokenizer=nltk.punkt.PunktSentenceTokenizer()
>>> sent_tokenizer.sentences_from_text(x)
['First sentence.', 'Second sentence?', 'Third sentence.']
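For the "start from scratch" route with plain re, one possible sketch is to split on whitespace that follows sentence-ending punctuation, using a lookbehind so the punctuation is kept consistently. The function name split_sentences is an assumption, and this naive rule will still mis-split abbreviations such as "e.g. this".

```python
import re

def split_sentences(text):
    """Split on whitespace that directly follows '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to each sentence.
    return re.split(r"(?<=[.!?])\s+", text.strip())

x = "First sentence. Second sentence? Third sentence."
print(split_sentences(x))
# ['First sentence.', 'Second sentence?', 'Third sentence.']
```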

Regex not working to get string between 2 strings. Python 27 [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
From this URL view-source:https://www.amazon.com/dp/073532753X?smid=A3P5ROKL5A1OLE
I want to get string between var iframeContent = and obj.onloadCallback = onloadCallback;
I have this regex iframeContent(.*?)obj.onloadCallback = onloadCallback;
But it does not work. I am not good at regex so please pardon my lack of knowledge.
I even tried iframeContent(.*?)obj.onloadCallback but it does not work.
It looks like you just want that giant encoded string. I believe yours is failing for two reasons. You're not running in DOTALL mode, which means your . won't match across multiple lines, and your regex is failing because of catastrophic backtracking, which can happen when you have a very long variable length match that matches the same characters as the ones following it.
This should get what you want
m = re.search(r'var iframeContent = \"([^"]+)\"', html_source)
print(m.group(1))
The regex is just looking for any characters except double quotes [^"] in between two double quotes. Because the variable length match and the match immediately after it don't match any of the same characters, you don't run into the catastrophic backtracking issue.
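The effect of DOTALL, and of the negated character class, can be seen on a tiny stand-in string. The html_source below is a made-up placeholder, not the real page content, and the variable names are illustrative.

```python
import re

# Hypothetical stand-in for the page source; the real string is far longer.
html_source = 'var iframeContent = "line one\nline two";\nobj.onloadCallback = onloadCallback;'

# Without DOTALL, '.' stops at the newline, so nothing is found.
no_flag = re.search(r"iframeContent(.*?)obj\.onloadCallback", html_source)
print(no_flag)  # None

# With DOTALL, '.' also matches newlines and the lazy match succeeds.
with_dotall = re.search(r"iframeContent(.*?)obj\.onloadCallback", html_source, re.DOTALL)
print(repr(with_dotall.group(1)))

# The negated class [^"] matches newlines too, so no flag is needed.
quoted = re.search(r'var iframeContent = "([^"]+)"', html_source)
print(repr(quoted.group(1)))  # 'line one\nline two'
```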
I suspect that the input string spans multiple lines. Try adding re.S (re.DOTALL) to the search call, i.e. re.findall('someString', text_Holder, re.S), so that . also matches newlines; re.M only changes how ^ and $ behave.
You could try this regex too
(?<=iframeContent =)(.*)(?=obj.onloadCallback = onloadCallback)
You can test it at this site.
It is very important that you use DOTALL mode, which means that you will have single-line
