I need data to train a bot, so I have scraped SO questions. How can I replace new lines without removing \n from strings?
If I have the following string:
"""You can use \n to print a new line.
Text text text."""
How can I get: You can use \n to print a new line. Text text text.
I've tried this: string.replace("\n","")
But I end up with: 'You can use to print a new line.Text text text.'
Since I'm dealing with programming questions, I'm destined to run into \n in a string and wouldn't want to replace that.
You could write it as a raw string, which is done by prefixing the literal with the letter r.
example 1:
print(r"You can use \n to print a new line.")
# You can use \n to print a new line.
This does not remove anything: the backslash and the n are kept as two literal characters, so \n appears in the output as you want.
example 2:
text = r"You can use \n to print a new line."
print(text)
# You can use \n to print a new line.
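To make the difference concrete, here is a minimal sketch (the variable names and the sample text are made up for illustration) showing that a raw `\n` is two characters while an ordinary `"\n"` is one, and that `replace("\n", " ")` only touches the latter:

```python
escaped = "\n"   # one character: a real newline
raw = r"\n"      # two characters: a backslash followed by the letter n

# Build a string that contains both kinds, as scraped text might.
scraped = "You can use " + raw + " to print a new line." + escaped + "Text text text."

# Only the real newline is replaced; the backslash-n pair survives.
cleaned = scraped.replace("\n", " ")
print(cleaned)  # You can use \n to print a new line. Text text text.
```

If the scraped text was read from a file or an HTTP response, a `\n` typed by the question's author is already stored as backslash plus n, so no raw-string prefix is needed at that point.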
If you are printing the string and the output is:
You can use \n to print a new line.
Text text text.
then the \n visible in the output is actually the backslash character followed by the letter n, and not a newline character.
Doing replace("\n", "") should not remove the sequence of characters \n, because the replace pattern "\n" itself is not the sequence of characters \n, but rather the actual single newline character. So it does not match the \n sequence of characters visible in your string, but it does match (and replace) the newline characters.
This REPL snippet illustrates that:
>>> x = """You can use \\n to print a new line.\n\nText text text.""" # this string literal is how you would create the string you have shown in your question.
>>> x == r"""You can use \n to print a new line.
...
... Text text text.""" # or you can use a raw string literal to initialize your variable, it is exactly the same thing
True
>>> print(x)
You can use \n to print a new line.
Text text text.
>>> print(x.replace("\n", ""))
You can use \n to print a new line.Text text text.
If you mean that you are creating a string with the literal:
"""You can use \n to print a new line.
Text text text."""
Then it is impossible to distinguish between the typed \n and the result of pressing the Enter key in your string literal (unless you use a raw string initializer, as other answers have explained). Once the code is interpreted by Python they are identical. Consider escaping the backslash of the typed \n in your literal so that it is included in your string as is:
myString = """You can use \\n to print a new line.
Text text text."""
If you want to convert newlines to the literal two-character string \n, you can escape the backslash character:
string.replace("\n","\\n")
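A short sketch of that conversion and its reverse, using made-up sample text. Note that the round trip is only lossless if the original text did not already contain a backslash followed by n:

```python
text = "first line\nsecond line"

# Turn each real newline into the two characters backslash + n.
visible = text.replace("\n", "\\n")
print(visible)  # first line\nsecond line

# Turn the two-character sequence back into a real newline.
restored = visible.replace("\\n", "\n")
```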
The \n in your string is an escape sequence that gets evaluated to the newline character.
In [1]: s = """You can use \n to print a new line.
...:
...: Text text text."""
In [2]: print(s)
You can use
to print a new line.
Text text text.
If you want to actually include the characters \ and n in your string, you need to escape the backslash with another backslash.
In [3]: s = """You can use \\n to print a new line.
...:
...: Text text text."""
In [4]: print(s)
You can use \n to print a new line.
Text text text.
In [5]: print(s.replace("\n", ""))
You can use \n to print a new line.Text text text.
Alternatively, you could use a "raw string", i.e. a string prefixed with r, e.g. r"..." or r"""...""" but then you would no longer be able to use escape sequences such as \n to insert a newline character, \t to insert a tab, etc.
How can I remove escaped and double-escaped newlines, tabs, carriage returns, etc.?
sentence = "\ndirty string \n \\n \\\\n \t\\t\\\\t \r\\r\\\\r"
A classic brute force approach is to do
" ".join(sentence.split())
but the escaped characters remain:
"dirty string \\n \\\\n \\t\\\\t \\r\\\\r"
how can I transform my string so that it will look like:
"dirty string"
While, for example, \n is an escape sequence, \\n is not: it is an escaped backslash followed by the letter n. This is why you are left with strings like \\n \\\\n \\t\\\\t \\r\\\\r after sentence.split().
This will return the desired output:
result=" ".join(word for word in sentence.split() if not word.startswith("\\"))
It breaks the sentence down into words, stripping any leading or trailing whitespace, but only keeps words that do not start with a backslash. Remember that things like \\n are not escape sequences but representations of the literal two-character string \n.
Btw I wouldn't call your attempt "brute force", as string functions like split(), strip(), join(), replace() etc. are intended for solving exactly this type of problem.
Using a regex pattern such as (\\n|\\r|\\t|\\)
Input:
sentence = "\ndirty string \n \\n \\\\n \t\\t\\\\t \r\\r\\\\r"
Strip:
import re
x = re.sub(r"(\\n|\\r|\\t|\\)", "", sentence).strip()
Result:
'dirty string'
sentence = "\ndirty string \n \\n \\\\n \t\\t\\\\t \r\\r\\\\r"
print(''.join(s for s in sentence if (s.isalnum() or (s == ' '))))
# Output: dirty string n n tt rr
On looking at your sentence, some of the letters have not been escaped. I've put brackets around the escaped characters that I can see:
"(\n)dirty string (\n) (\\)n (\\)(\\)n (\t)(\\)t(\\)(\\)t (\r)(\\)r(\\)(\\)r"
In this string literal, any characters outside the brackets have not been escaped, and you should consider whether you do want them to be thrown away.
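One way to keep those unescaped letters, sketched here under the assumption that only n, t, and r escapes matter: remove any run of backslashes followed by one of those letters, then let split() discard the real whitespace. The function name strip_escapes is made up for illustration.

```python
import re

def strip_escapes(sentence):
    # Replace one or more backslashes followed by n, t, or r with a space.
    without_escaped = re.sub(r"\\+[ntr]", " ", sentence)
    # split() with no arguments drops real newlines, tabs, and carriage returns.
    return " ".join(without_escaped.split())

sentence = "\ndirty string \n \\n \\\\n \t\\t\\\\t \r\\r\\\\r"
print(strip_escapes(sentence))  # dirty string
```

Unlike filtering on isalnum(), this leaves a standalone letter such as n alone when it is not preceded by a backslash. It would still mangle text where a backslash legitimately precedes one of those letters, for example a Windows path.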
I have text with excessive line breaks that I want to remove. The goal is to remove the single line breaks (\n) but leave the double line breaks (\n\n), which indicate a new paragraph.
I created this regex to isolate the single breaks and tried substituting nothing, a space, and even a backspace ('\b'), but nothing works. The goal is for a sentence NOT to break on a single \n; it should continue naturally on screen or word-wrap on its own rather than being forced onto a new line. The consecutive line breaks \n\n (see the end of the sentence) are OK.
I added the * * so you can see them more easily. The regex is supposed to capture the single \n (\\n) only when it is preceded by 2 consecutive letters, (?<=[a-z][a-z]):
text = "more information*\n*on options concepts and strategies.*\n* Also,*\n* George Fontanills publishes*\n*several options *\n*learning tools that deal*\n*primarily with the Delta Neutral approach.*\n\n*Page 14 shows and example of the tools"
text1= re.sub( r"(?<=[a-z][a-z])(\\n)" , " ", text)
import re
text = "more information*\n*on options concepts and strategies. Also, George Fontanills publishes*\n*several options learning tools that deal*\n*primarily with the Delta Neutral approach.*\n\n*Page 14 shows and example of the tools a\n"
text = text.replace("*", "")
text1= re.sub(r'(?<=[a-z., ]{2})\n(?!\n)', '', text)
print(text1)
Explanation:
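Positive lookbehind (?<=[a-z., ]{2}): asserts that the two characters before the newline are each a lowercase letter, a period, a comma, or a space. The {2} quantifier matches exactly 2 times.
\n matches a line-feed (newline) character (ASCII 10).
Negative lookahead (?!\n): asserts that the newline is not followed by another newline, so the first break of a paragraph break is left alone. The second one is left alone too, because the character before it is a newline, which is not in the lookbehind's character class.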
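For comparison, here is a minimal sketch of a variant that does not depend on which characters precede the break: it replaces a newline only when there is no newline on either side of it. The sample string and the name unwrap are made up for illustration. The last two lines also show why the (\\n) in the question's raw-string pattern never matches: inside a raw string, \\n is a regex for a literal backslash followed by n, not for a line feed.

```python
import re

def unwrap(text):
    # Replace a newline only if it is neither preceded nor followed by another newline.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

sample = "several options\nlearning tools.\n\nPage 14 shows"
print(repr(unwrap(sample)))
# 'several options learning tools.\n\nPage 14 shows'

# r"\\n" looks for backslash + n, which the sample does not contain.
print(re.search(r"\\n", sample))              # None
print(re.search(r"\n", sample) is not None)   # True
```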
You could use a {n} quantifier to specify how many consecutive \n characters you want to match:
text = text.replace('*', '')
re.sub(r"(?<=\w{2})(\n{1})", " ", text)
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        line = re.sub(r"[^a-z ']+", " ", line)
        print(line)
if there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two approaches produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expression method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
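Putting the lookaround idea together with the lowercasing and punctuation stripping from the question might look like the sketch below. The function name normalize is made up, and the final split/join (which collapses repeated spaces) is an assumption about the desired output rather than something the question states.

```python
import re

def normalize(line):
    line = line.lower()
    # Replace anything that is not a lowercase letter, space, or quote with a space.
    line = re.sub(r"[^a-z ']+", " ", line)
    # Drop quotes that do not have a letter on both sides.
    line = re.sub(r"(?<![a-z])'|'(?![a-z])", "", line)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(line.split())

print(normalize("Here is some stuff. 'Now there are quotes.' Now there's not."))
# here is some stuff now there are quotes now there's not
```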
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it is both preceded and followed by a letter, digit, or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it is both preceded and followed by a letter (ASCII only, or any Unicode letter, respectively).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character
    ok = 0
    for x in text: ok += x.isalnum()
    return ok > 0
def stripit(text, ofwhat):
    # Strip each unwanted character, in order, from both ends of the word
    for x in ofwhat: text = text.strip(x)
    return text
def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line into a single space and drops leading and trailing whitespace on each line.
I have a number of txt files in a format like this -
\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text
I'd like to get these into a dictionary that looks like this -
{'Intro': '\n text \n text \n',
'Body': '\n text \n text',
'Refs': '\n test \n text'}
I'm concerned about the time it is going to take to process all of the txt files so wanted an approach that would take as little time as possible and I don't care about splitting the text into lines.
I am trying to use regex, but am struggling to get it to work correctly - I think my last regex group is incorrect. Below is what I currently have. Any suggestions would be great.
pattern = r"(====.)(.+?\b)(.*)"
matches = re.findall(pattern, data, re.DOTALL)
my_dict = {b:c for a,b,c in matches}
You don't need a regex here; you can use the classic split() function instead.
Here, I use textwrap for readability:
import textwrap
text = textwrap.dedent("""\
==== Intro
text
text
==== Body
text
text
==== Refs
test
text""")
You can do:
result = {}
for part in text.split("==== "):
    if part.strip():  # skip empty or whitespace-only pieces
        # assumes a space follows each section name, as in the question's data
        section, content = part.split(' ', 1)
        result[section] = content
Or initialise a dict from the pairs with a generator expression:
result = dict(part.split(' ', 1)
              for part in text.split("==== ")
              if part.strip())
This should do:
import re
d = dict(re.findall(r'(?<=\n====\s)(\w+)(\s+[^=]+)', text, re.M | re.DOTALL))
print(d)
{'Body': ' \n text \n text \n',
'Intro': ' \n text \n text \n',
'Refs': ' \n test \n text'}
Regex Details
(?<= # lookbehind (must be fixed width)
\n # newline
==== # four '=' chars in succession
\s # single wsp character
)
( # first capture group
\w+ # 1 or more word characters (letters, digits, or underscore)
)
( # second capture group
\s+ # one or more wsp characters
[^=]+ # one or more chars that are not '='
)
You can try this:
import re
s = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
final_data = re.findall(r"(?<=\n====\s)[a-zA-Z]+\s", s)
text = re.findall("\n .*? \n .*?$|\n .*? \n .*? \n", s)
final_body = {a:b for a, b in zip(final_data, text)}
Output:
{'Body ': '\n text \n text \n', 'Intro ': '\n text \n text \n', 'Refs ': '\n test \n text'}
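As a further sketch, a single pattern can capture the section name and its content together, which avoids zipping two separate findall results and gives keys without a trailing space. It assumes every header has the form "==== Name " with a space after the name, as in the question's sample; the names SECTION and parse_sections are made up for illustration.

```python
import re

# "==== ", the section name, a space, then lazily everything up to the
# next "==== " marker or the end of the string.
SECTION = re.compile(r"[=]{4} (\S+) (.*?)(?=[=]{4} |\Z)", re.DOTALL)

def parse_sections(data):
    # findall returns (name, content) pairs, which dict() accepts directly.
    return dict(SECTION.findall(data))

s = "\n==== Intro \n text \n text \n==== Body \n text \n text \n==== Refs \n test \n text"
print(parse_sections(s))
# {'Intro': '\n text \n text \n', 'Body': '\n text \n text \n', 'Refs': '\n test \n text'}
```

Content that itself contains "==== " would be cut short, so this only suits files where that marker appears solely in headers.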
If you do not want to read the whole file into memory, you can process it line-by-line like this:
marker = "==== "
def read_my_custom_format(file):
    current_header = None
    current_contents = []
    for line in file:
        line = line.strip() # trim whitespace, including trailing newline
        if line.startswith(marker):
            if current_header is not None:
                yield current_header, current_contents # emit finished section
            current_header = line[len(marker):] # trim marker
            current_contents = []
        else:
            current_contents.append(line)
    if current_header is not None:
        yield current_header, current_contents # emit the last section
This is a generator yielding tuples instead of building a dictionary.
This way it only holds one section at a time in memory.
Also, each key maps to a list of lines instead of one string, but you could easily "\n".join(lines) them.
If you want to produce a single dictionary, which again takes memory proportional to the input file, you can just do it like this:
with open("your_textfile.txt") as file:
data = dict(read_my_custom_format(file))
This works because dict() can take an iterable of 2-tuples.
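A small illustration of that last point, using a stand-in generator with made-up values rather than a real file: dict() consumes the pairs directly, and a dict comprehension can join each list of lines into one string along the way.

```python
def sections():
    # Stand-in generator yielding (header, lines) pairs; the values are made up.
    yield "Intro", ["text", "text"]
    yield "Body", ["text", "text"]
    yield "Refs", ["test", "text"]

# dict() accepts any iterable of 2-tuples, including a generator.
as_lists = dict(sections())

# Join each list of lines into a single string while building the dict.
as_strings = {header: "\n".join(lines) for header, lines in sections()}
print(as_strings["Refs"])
```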
Define a paragraph as a multi-line string delimited on both sides by double newlines ('\n\n'). If there is a paragraph that contains a certain string ('BAD'), I want to replace that paragraph (i.e. any text containing BAD, up to the closest preceding and following double newlines) with some other token ('GOOD'). This should be done with a Python 3 regex.
I have text such as:
dfsdf\n
sdfdf\n
\n
blablabla\n
blaBAD\n
bla\n
\n
dsfsdf\n
sdfdf
should be:
dfsdf\n
sdfdf\n
\n
GOOD\n
\n
dsfsdf\n
sdfdf
Here you are:
/\n\n(?:[^\n]|\n(?!\n))*BAD(?:[^\n]|\n(?!\n))*/g
OK, to break it down a little (because it's nasty looking):
\n\n matches two literal line breaks.
(?:[^\n]|\n(?!\n))* is a non-capturing group that matches either a single non-line break character, or a line break character that isn't followed by another. We repeat the entire group 0 or more times (in case BAD appears at the beginning of the paragraph).
BAD will match the literal text you want. Simple enough.
Then we use the same construction as above, to match the rest of the paragraph.
Then, you just replace it with \n\nGOOD, and you're off to the races.
Demo on Regex101
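Since the question asks for Python 3, here is a sketch of the same pattern used with re.sub. The /.../g delimiters above are not Python syntax; re.sub replaces every match by default. The sample text is the one from the question, written with real newlines.

```python
import re

# Two newlines, then a paragraph body (no blank line inside) containing BAD.
pattern = re.compile(r"\n\n(?:[^\n]|\n(?!\n))*BAD(?:[^\n]|\n(?!\n))*")

text = "dfsdf\nsdfdf\n\nblablabla\nblaBAD\nbla\n\ndsfsdf\nsdfdf"
result = pattern.sub("\n\nGOOD", text)
print(repr(result))
# 'dfsdf\nsdfdf\n\nGOOD\n\ndsfsdf\nsdfdf'
```

Because the pattern requires a leading \n\n, a paragraph at the very start of the text is not matched.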
Firstly, you're mixing actual newlines and '\n' characters in your example; I assume that you meant only one of the two. Secondly, let me challenge your assumption that you need regex for this:
inp = '''dfsdf
sdadf

blablabla
blaBAD
bla

dsfsdf
sdfdf'''
replaced = '\n\n'.join(['GOOD' if 'BAD' in k else k for k in inp.split('\n\n')])
The result is
print(repr(replaced))
'dfsdf\nsdadf\n\nGOOD\n\ndsfsdf\nsdfdf'