I have a string, line:
>>> line = " abc\n def\n\n ghi\n jkl"
>>> print line
abc
def
ghi
jkl
and I want to convert it to "abcdef\n\n ghijkl", like:
>>> print " abcdef\n\n ghijkl"
abcdef
ghijkl
I tried Python's re module and wrote something like this:
re.sub('(?P<word1>[^\n\s])\n\s*(?P<word2>[^\n\s])', '\g<word1>\g<word2>', line)
but I get this:
>>> re.sub('(?P<word1>[^\n\s])\n\s*(?P<word2>[^\n\s])', '\g<word1>\g<word2>', line)
Out: ' abcdefghijkl'
It seems to me that the \n\s* part is also matching \n\n. Can anyone point out where I went wrong?
\s matches space, \t, \n, and (depending on your regex engine) a few other whitespace characters.
So if you only want to replace single linebreaks + spaces/tabs, you can use this:
newline = re.sub(r"(?<!\n)\n[ \t]*(?!\n)", "", line)
Explanation:
(?<!\n) # Assert that the previous character isn't a newline
\n # Match a newline
[ \t]* # Match any number of spaces/tabs
(?!\n) # Assert that the next character isn't a newline
In Python:
>>> line = " abc\n def\n\n ghi\n jkl"
>>> newline = re.sub(r"(?<!\n)\n[ \t]*(?!\n)", "", line)
>>> print newline
abcdef
ghijkl
Try this,
line = " abc\n def\n\n ghi\n jkl"
print re.sub(r'\n(?!\n)\s*', '', line)
It gives,
abcdef
ghijkl
It says, "Replace a newline that is NOT followed by another newline, together with any whitespace after it, with nothing."
UPDATE: Here's a better version
>>> re.sub(r'([^\n])\n(?!\n)\s*', r'\1', line)
' abcdef\n\n ghijkl'
It gives exactly what you said in the first post.
You could simplify the regexp if you used \S, which matches any non-whitespace character:
>>> import re
>>> line = " abc\n def\n\n ghi\n jkl"
>>> print re.sub(r'(\S+)\n\s*(\S+)', r'\1\2', line)
abcdef
ghijkl
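However, the reason your own regexp is not working is that your <word1> and <word2> groups each match only a single character (i.e. they're not using +). After "c\n d" is replaced, the scan resumes in the middle of "def", so "f\n\n g" can still match, with \s* consuming the second newline. With that simple correction, your regexp will produce the correct output: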
>>> print re.sub(r'(?P<word1>[^\n\s]+)\n\s*(?P<word2>[^\n\s]+)', r'\g<word1>\g<word2>', line)
abcdef
ghijkl
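To see why the original pattern swallows the blank line, it can help to run the patterns side by side. This is a minimal sketch in Python 3 syntax; the variable names greedy, fixed, and lookaround are only illustrative, and the sample string assumes single spaces after each newline.

```python
import re

line = " abc\n def\n\n ghi\n jkl"

# Single-character groups: after "c\n d" is replaced, the scan resumes at "e",
# so "f\n\n g" is still available and \s* consumes the second newline.
greedy = re.sub(r"([^\n\s])\n\s*([^\n\s])", r"\1\2", line)

# With +, the whole word "def" is consumed by the first match, so the next
# match can only start at "ghi" and the blank line is left alone.
fixed = re.sub(r"([^\n\s]+)\n\s*([^\n\s]+)", r"\1\2", line)

# Lookarounds state the intent directly: only a lone newline is removed.
lookaround = re.sub(r"(?<!\n)\n[ \t]*(?!\n)", "", line)

print(repr(greedy))      # ' abcdefghijkl'
print(repr(fixed))       # ' abcdef\n\n ghijkl'
print(repr(lookaround))  # ' abcdef\n\n ghijkl'
```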
I want to eliminate all the whitespace from a string, on both ends, and in between words.
I have this Python code:
def my_handle(self):
    sentence = ' hello apple '
    sentence.strip()
But that only eliminates the whitespace on both sides of the string. How do I remove all whitespace?
If you want to remove leading and ending spaces, use str.strip():
>>> " hello apple ".strip()
'hello apple'
If you want to remove all space characters, use str.replace() (NB this only removes the “normal” ASCII space character ' ' U+0020 but not any other whitespace):
>>> " hello apple ".replace(" ", "")
'helloapple'
If you want to remove duplicated spaces, use str.split() followed by str.join():
>>> " ".join(" hello apple ".split())
'hello apple'
To remove only spaces use str.replace:
sentence = sentence.replace(' ', '')
To remove all whitespace characters (space, tab, newline, and so on) you can use split then join:
sentence = ''.join(sentence.split())
or a regular expression:
import re
pattern = re.compile(r'\s+')
sentence = re.sub(pattern, '', sentence)
If you want to only remove whitespace from the beginning and end you can use strip:
sentence = sentence.strip()
You can also use lstrip to remove whitespace only from the beginning of the string, and rstrip to remove whitespace from the end of the string.
An alternative is to use regular expressions and match these strange white-space characters too. Here are some examples:
Remove ALL spaces in a string, even between words:
import re
sentence = re.sub(r"\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the BEGINNING of a string:
import re
sentence = re.sub(r"^\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the END of a string:
import re
sentence = re.sub(r"\s+$", "", sentence, flags=re.UNICODE)
Remove spaces both in the BEGINNING and in the END of a string:
import re
sentence = re.sub(r"^\s+|\s+$", "", sentence, flags=re.UNICODE)
Remove ONLY DUPLICATE spaces:
import re
sentence = " ".join(re.split(r"\s+", sentence, flags=re.UNICODE))
(All examples work in both Python 2 and Python 3)
"Whitespace" includes space, tabs, and CRLF. So an elegant and one-liner string function we can use is str.translate:
Python 3
' hello apple '.translate(str.maketrans('', '', ' \n\t\r'))
OR if you want to be thorough:
import string
' hello apple'.translate(str.maketrans('', '', string.whitespace))
Python 2
' hello apple'.translate(None, ' \n\t\r')
OR if you want to be thorough:
import string
' hello apple'.translate(None, string.whitespace)
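One caveat with the string.whitespace variants: that constant only lists ASCII whitespace. A small Python 3 sketch, assuming the input might contain a no-break space (U+00A0), shows how this differs from the split/join approach:

```python
import string

text = " hello\u00a0apple\t\n"

# string.whitespace is ' \t\n\r\x0b\x0c', so the no-break space survives.
ascii_only = text.translate(str.maketrans("", "", string.whitespace))

# str.split() with no arguments splits on every character for which
# str.isspace() is true, which includes U+00A0 in Python 3.
all_unicode = "".join(text.split())

print(repr(ascii_only))   # 'hello\xa0apple'
print(repr(all_unicode))  # 'helloapple'
```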
For removing whitespace from beginning and end, use strip.
>>> " foo bar ".strip()
'foo bar'
' hello \n\tapple'.translate({ord(c):None for c in ' \n\t\r'})
MaK already pointed out the "translate" method above. And this variation works with Python 3 (see this Q&A).
In addition, strip has some variations:
Remove spaces in the BEGINNING and END of a string:
sentence= sentence.strip()
Remove spaces in the BEGINNING of a string:
sentence = sentence.lstrip()
Remove spaces in the END of a string:
sentence= sentence.rstrip()
All three string functions, strip, lstrip, and rstrip, can take a string of characters to strip as a parameter, with the default being all whitespace. This can be helpful when you are working with something particular; for example, you could remove only spaces but not newlines:
" 1. Step 1\n".strip(" ")
Or you could remove extra commas when reading in a string list:
"1,2,3,".strip(",")
Be careful:
strip does a rstrip and lstrip (removes leading and trailing spaces, tabs, returns and form feeds, but it does not remove them in the middle of the string).
If you only replace spaces and tabs you can end up with hidden CRLFs that appear to match what you are looking for, but are not the same.
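Both points can be checked quickly in an interpreter. The sketch below uses Python 3 syntax and made-up sample strings: it shows that the argument to strip is treated as a set of characters rather than a literal prefix or suffix, and that removing only spaces and tabs leaves a CRLF behind.

```python
# The strip argument is a set of characters to remove from both ends.
step = " 1. Step 1\n".strip(" ")     # trailing newline is kept
mixed = "xyxhelloyx".strip("xy")     # any mix of 'x' and 'y' is removed

print(repr(step))   # '1. Step 1\n'
print(repr(mixed))  # 'hello'

# Removing only spaces and tabs leaves the carriage return and newline.
raw = "hello \t world\r\n"
partly_clean = raw.replace(" ", "").replace("\t", "")
fully_clean = "".join(raw.split())

print(repr(partly_clean))  # 'helloworld\r\n'
print(repr(fully_clean))   # 'helloworld'
```

Printing with repr() is a simple way to make such hidden characters visible.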
eliminate all the whitespace from a string, on both ends, and in between words.
>>> import re
>>> re.sub(r"\s+",  # one or more repetitions of whitespace
...        '',      # replace with empty string (-> remove)
...        ''' hello
... apple
... ''')
'helloapple'
https://en.wikipedia.org/wiki/Whitespace_character
Python docs:
https://docs.python.org/library/stdtypes.html#textseq
https://docs.python.org/library/stdtypes.html#str.replace
https://docs.python.org/library/string.html#string.replace
https://docs.python.org/library/re.html#re.sub
https://docs.python.org/library/re.html#regular-expression-syntax
I use split() to ignore all whitespaces and use join() to concatenate
strings.
sentence = ''.join(' hello apple '.split())
print(sentence) #=> 'helloapple'
I prefer this approach because it is a single expression (not a statement).
It is easy to use, and the result can be used without binding it to a variable.
print(''.join(' hello apple '.split())) # no need to bind to a variable
import re
sentence = ' hello apple'
re.sub(' ', '', sentence)   # 'helloapple' (removes all spaces)
re.sub(' +', ' ', sentence) # ' hello apple' (collapses each run of spaces into one)
In the following script we import the regular expression module which we use to substitute one space or more with a single space. This ensures that the inner extra spaces are removed. Then we use strip() function to remove leading and trailing spaces.
# Import regular expression module
import re
# Initialize string
a = " foo bar "
# First replace any number of spaces with a single space
a = re.sub(' +', ' ', a)
# Then strip any leading and trailing spaces.
a = a.strip()
# Show results
print(a)
I found that this works the best for me:
test_string = ' test a s test '
string_list = [s.strip() for s in str(test_string).split()]
final_string = ' '.join(string_list)
# final_string: 'test a s test'
It removes any whitespaces, tabs, etc.
Try this. Instead of using re, I think using split with strip is much better:
def my_handle(self):
    sentence = ' hello apple '
    ' '.join(x.strip() for x in sentence.split())
    # hello apple
    ''.join(x.strip() for x in sentence.split())
    # helloapple
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
with open('test.txt','r') as f:
    for line in f:
        line = line.lower()
        line = re.sub('[^a-z\ \']+', " ", line)
        print line
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
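The difference between the two approaches can be illustrated with a quoted word followed by punctuation. This is a minimal sketch in Python 3 syntax; the sample sentence is made up for the purpose.

```python
import re

line = "He said 'this'."

# Per-word strip: the closing quote sits behind the period, so it stays.
stripped = " ".join(w.strip("'") for w in line.split())

# Two regex passes: drop quotes not preceded by a letter, then quotes
# not followed by a letter.
by_regex = re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line))

print(stripped)  # He said this'.
print(by_regex)  # He said this.
```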
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print rgx.sub('', s) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched unless it is both preceded and followed by a letter, digit, or _.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is matched unless it is both preceded and followed by a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext (text):
    ok = 0
    for x in text: ok += x.isalnum()
    return ok>0
def stripit (text, ofwhat):
    for x in ofwhat: text = text.strip(x)
    return text
def purge (text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    text = text.splitlines()
    text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
    return "\n".join(text)
>>> print purge("'Nice, .to, see! you. Isn't it?'")
Nice to see you Isn't it
Note: this will kill all whitespaces too and transform them to space or remove them completely.
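A shorter variant of the same idea is sketched below in Python 3 syntax. It is not the code above: the name purge_words is hypothetical, and using string.punctuation as the default set of unwanted characters is an assumption. Passing the whole set to strip at once removes the characters in any order, whereas stripping one character at a time depends on the order in which they are listed.

```python
import string

def purge_words(text, unwanted=string.punctuation):
    """Strip unwanted characters from both ends of every word, line by line."""
    cleaned_lines = []
    for line in text.splitlines():
        words = (word.strip(unwanted) for word in line.split())
        # Drop tokens that consisted of nothing but unwanted characters.
        cleaned_lines.append(" ".join(w for w in words if w))
    return "\n".join(cleaned_lines)

result = purge_words("'Nice, .to, see! you. Isn't it?'")
print(result)  # Nice to see you Isn't it
```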
Example String:
str = "test sdf sfwe \n \na dssdf"
I want to replace the:
\na
with
a
Where 'a' could be any character.
I tried:
str = "test \n \na"
res = re.sub('[\n.]','a',str)
But how can I store the character behind the \n and use it as replacement?
You may use this regex with a capture group:
>>> s = "test sdf sfwe \n \na dssdf"
>>> print re.sub(r'\n(.)', r'\1', s)
test sdf sfwe a dssdf
Search regex r'\n(.)' will match \n followed by any character and capture following character in group #1
Replacement r'\1' is back-reference to capture group #1 which is placed back in original string.
It is better to avoid str as a variable name, since it shadows the built-in str type in Python.
If by any character you meant any non-space character then use this regex with use of \S (non-whitespace character) instead of .:
>>> print re.sub(r'\n(\S)', r'\1', s)
test sdf sfwe
a dssdf
A lookahead-based approach that doesn't need any capture group will also work:
>>> print re.sub(r'\n(?=\S)', '', s)
test sdf sfwe
a dssdf
Note that [\n.] matches a single character that is either \n or a literal dot; it does not match \n followed by any character.
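The point about the character class can be checked directly. A small sketch in Python 3 syntax, with a made-up sample string:

```python
import re

s = "one.\ntwo"

# Inside [...] the dot is a literal dot, so this finds '.' and '\n' separately.
in_class = re.findall(r"[\n.]", s)

# Outside a class, \n(.) means "a newline followed by any one character".
after_newline = re.findall(r"\n(.)", s)

print(in_class)       # ['.', '\n']
print(after_newline)  # ['t']
```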
Find all the matches:
matches = re.findall( r'\n\w', str )
Replace all of them:
for m in matches :
    str = str.replace( m, m[1] )
That's all, folks! =)
I think that the best way for you so you don't have more spaces in your text is the following:
string = "test sdf sfwe \n \na dssdf"
import re
' '.join(re.findall('\w+',string))
'test sdf sfwe a dssdf'
For example, I have strings like this:
s = "chapter1 in chapters"
How can I replace it with regex to this:
s = "chapter 1 in chapters"
e.g. I need only to insert whitespace between "chapter" and its number if it exists. re.sub(r'chapter\d+', r'chapter \d+', s) doesn't work.
You can use lookarounds:
>>> s = "chapter1 in chapters"
>>> print re.sub(r"(?<=\bchapter)(?=\d)", ' ', s)
chapter 1 in chapters
RegEx Breakup:
(?<=\bchapter) # asserts a position where preceding text is chapter
(?=\d) # asserts a position where next char is a digit
You can use capture groups, something like this:
>>> s = "chapter1 in chapters"
>>> re.sub(r'chapter(\d+)',r'chapter \1',s)
'chapter 1 in chapters'
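The two approaches can be compared on a slightly longer input. This is a sketch in Python 3 syntax; the extended sample string is made up to include a multi-digit number and a chapter that is already spaced.

```python
import re

s = "chapter1 in chapters, chapter 2, chapter12"

# Zero-width approach: insert a space at the position between
# "chapter" and a digit, consuming no characters.
with_lookarounds = re.sub(r"(?<=\bchapter)(?=\d)", " ", s)

# Capture-group approach: match the digits and re-emit them after a space.
with_group = re.sub(r"chapter(\d+)", r"chapter \1", s)

print(with_lookarounds)  # chapter 1 in chapters, chapter 2, chapter 12
print(with_group)        # chapter 1 in chapters, chapter 2, chapter 12
```

One design difference: the capture-group pattern as written has no \b, so it would also match inside a longer word ending in "chapter".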
I have a long string which contains various combinations of \n, \r, \t and spaces in-between words and other characters.
I'd like to reduce all multiple spaces to a single space.
I want to reduce all \n, \r, \t combos to a single new-line character.
I want to reduce all \n, \r, \t and space combinations to a single new-line character as well.
I've tried ''.join(str.split()) in various ways without success.
What is the correct Pythonic way here?
Would the solution be different for Python 3.x?
Ex. string:
ex_str = u'Word \n \t \r \n\n\n word2 word3 \r\r\r\r\nword4\n word5'
Desired output [new new-line = \n]:
new_str = u'Word\nword2 word3\nword4\nword5'
Use a combination str.splitlines() and splitting on all whitespace with str.split():
'\n'.join([' '.join(line.split()) for line in ex_str.splitlines() if line.strip()])
This treats each line separately, removes empty lines, and then collapses all whitespace per line into single spaces.
Provided the input is a Python 3 string, the same solution works across both Python versions.
Demo:
>>> ex_str = u'Word \n \t \r \n\n\n word2 word3 \r\r\r\r\nword4\n word5'
>>> '\n'.join([' '.join(line.split()) for line in ex_str.splitlines() if line.strip()])
u'Word\nword2 word3\nword4\nword5'
To preserve tabs, you'd need to strip and split on just spaces and filter out empty strings:
'\n'.join([' '.join([s for s in line.split(' ') if s]) for line in ex_str.splitlines() if line.strip(' ')])
Demo:
>>> '\n'.join([' '.join([s for s in line.split(' ') if s]) for line in ex_str.splitlines() if line.strip(' ')])
u'Word\n\t\nword2 word3\nword4\nword5'
Use simple regexps:
import re
new_str = re.sub(r'[^\S\n]+', ' ', re.sub(r'\s*[\n\t\r]\s*', '\n', ex_str))
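The nested call can be read more easily when split into its two passes. This is a sketch in Python 3 syntax; the names step1 and step2 are illustrative, and the sample string is a variant of the one in the question with extra spaces between word2 and word3 (an assumption, so that the second pass has something to do).

```python
import re

ex_str = "Word \n \t \r \n\n\n word2   word3 \r\r\r\r\nword4\n word5"

# Pass 1: any whitespace run that contains a newline, tab, or carriage
# return becomes exactly one newline.
step1 = re.sub(r"\s*[\n\t\r]\s*", "\n", ex_str)

# Pass 2: remaining runs of non-newline whitespace become one space.
# [^\S\n] reads as "whitespace, but not a newline".
step2 = re.sub(r"[^\S\n]+", " ", step1)

print(repr(step1))  # 'Word\nword2   word3\nword4\nword5'
print(repr(step2))  # 'Word\nword2 word3\nword4\nword5'
```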
Use a regex:
>>> s
u'Word \n \t \r \n\n\n word2 word3 \r\r\r\r\nword4\t word5'
>>> re.sub(r'[\n\r\t ]{2,}| {2,}', lambda x: '\n' if x.group().strip(' ') else ' ', s)
u'Word\nword2 word3\nword4\nword5'
>>>
Another solution using regex, which replaces tabs with a space (as in u'word1\t\tword2'); or do you really want to add a line break there too?
import re
new_str = re.sub(r"[\n\ ]{2,}", "\n", re.sub(r"[\t\r\ ]+", " ", ex_str))
'\n'.join(ex_str.split())
Output:
u'Word\nword2\nword3\nword4\nword5'