Remove all line breaks from a long string of text - python

Basically, I'm asking the user to input a string of text into the console, but the string is very long and includes many line breaks. How would I take the user's string and delete all line breaks to make it a single line of text. My method for acquiring the string is very simple.
string = raw_input("Please enter string: ")
Is there a different way I should be grabbing the string from the user? I'm running Python 2.7.4 on a Mac.
P.S. Clearly I'm a noob, so even if a solution isn't the most efficient, the one that uses the most simple syntax would be appreciated.

How do you enter line breaks with raw_input? But, once you have a string with some characters in it you want to get rid of, just replace them.
>>> mystr = raw_input('please enter string: ')
please enter string: hello world, how do i enter line breaks?
>>> # pressing enter didn't work...
...
>>> mystr
'hello world, how do i enter line breaks?'
>>> mystr.replace(' ', '')
'helloworld,howdoienterlinebreaks?'
>>>
In the example above, I replaced all spaces. The string '\n' represents newlines. And \r represents carriage returns (if you're on windows, you might be getting these and a second replace will handle them for you!).
basically:
# you probably want to use a space ' ' to replace `\n`
mystring = mystring.replace('\n', ' ').replace('\r', '')
Note also, that it is a bad idea to call your variable string, as this shadows the module string. Another name I'd avoid but would love to use sometimes: file. For the same reason.

You can try using string replace:
string = string.replace('\r', '').replace('\n', '')

You can split the string with no separator arg, which will treat consecutive whitespace as a single separator (including newlines and tabs). Then join using a space:
In : " ".join("\n\nsome text \r\n with multiple whitespace".split())
Out: 'some text with multiple whitespace'
https://docs.python.org/2/library/stdtypes.html#str.split

The canonic answer, in Python, would be :
s = ''.join(s.splitlines())
It splits the string into lines (letting Python doing it according to its own best practices). Then you merge it. Two possibilities here:
replace the newline by a whitespace (' '.join())
or without a whitespace (''.join())

updated based on Xbello comment:
string = my_string.rstrip('\r\n')
read more here

Another option is regex:
>>> import re
>>> re.sub("\n|\r", "", "Foo\n\rbar\n\rbaz\n\r")
'Foobarbaz'

If anybody decides to use replace, you should try r'\n' instead '\n'
mystring = mystring.replace(r'\n', ' ').replace(r'\r', '')

A method taking into consideration
additional white characters at the beginning/end of string
additional white characters at the beginning/end of every line
various end-line characters
it takes such a multi-line string which may be messy e.g.
test_str = '\nhej ho \n aaa\r\n a\n '
and produces nice one-line string
>>> ' '.join([line.strip() for line in test_str.strip().splitlines()])
'hej ho aaa a'
UPDATE:
To fix multiple new-line character producing redundant spaces:
' '.join([line.strip() for line in test_str.strip().splitlines() if line.strip()])
This works for the following too
test_str = '\nhej ho \n aaa\r\n\n\n\n\n a\n '

Regular expressions is the fastest way to do this
s='''some kind of
string with a bunch\r of
extra spaces in it'''
re.sub(r'\s(?=\s)','',re.sub(r'\s',' ',s))
result:
'some kind of string with a bunch of extra spaces in it'

The problem with rstrip() is that it does not work in all cases (as I myself have seen few). Instead you can use
text = text.replace("\n"," ")
This will remove all new line '\n' with a space.

You really don't need to remove ALL the signs: lf cr crlf.
# Pythonic:
r'\n', r'\r', r'\r\n'
Some texts must have breaks, but you probably need to join broken lines to keep particular sentences together.
Therefore it is natural that line breaking happens after priod, semicolon, colon, but not after comma.
My code considers above conditions. Works well with texts copied from pdfs.
Enjoy!:
def unbreak_pdf_text(raw_text):
""" the newline careful sign removal tool
Args:
raw_text (str): string containing unwanted newline signs: \\n or \\r or \\r\\n
e.g. imported from OCR or copied from a pdf document.
Returns:
_type_: _description_
"""
pat = re.compile((r"[, \w]\n|[, \w]\r|[, \w]\r\n"))
breaks = re.finditer(pat, raw_text)
processed_text = raw_text
raw_text = None
for i in breaks:
processed_text = processed_text.replace(i.group(), i.group()[0]+" ")
return processed_text

Related

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, it did not perform the operation I wanted and the whole string is printing.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
The problem with your first regex is that . does not match newlines as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub("[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.
to keep text until that symbol you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives output as
' extra text. \n'

Replacing a \t with something in a list

>>>user_sentence = "hello \t how are you?"
>>>import re
>>>user_sentenceSplit = re.findall(r"([\s]|[\w']+|[.,!?;])",user_sentence)
>>>print user_sentenceSplit
I get ['hello', '\t', 'how', 'are', 'you', '?']
I don't know how to create any code that will replace the '\t' with 'tab'.
I do not believe that replacing \t in the original string will ever work, you have two issues:
Your code also outputs spaces as tokens, but you do not want to have them
The \t in between letters will become a part of a word token.
So, you need to replace [\s] with [^\S ] pattern that matches any whitespace but a regular space (add more excluded whitespace symbols if necessary into the negated character class) and you need to iterate through all the tokens and check if a token is equal to a tab, and then replace it with tab value. So, the best is to use re.finditer and push the found values into a list variable, see sample code below:
import re
user_sentence = "hello \t how are you?"
user_sentenceSplit = []
for x in re.finditer(r"[^\S ]|[\w']+|[.,!?;]",user_sentence):
if x.group() == "\t": # if it is a tab, replace the value
user_sentenceSplit.append("tab")
else: # else, push the match value
user_sentenceSplit.append(x.group())
print(user_sentenceSplit)
See the Python demo
I think str.replace would do the job.
user_sentence.replace('\t', 'tab')
Do this before splitting the string.
It is behavior of Python's compiler. You should not be worrying about it. Pyhton's Compiler store tab as \t. You need not to do anything on it as it will treat it as tab while performing any action over it. For example:
>>> my_string = 'Yes Hello So?' # <- String with tab
>>> my_string
'Yes\tHello\tSo?' # <- Stored tab as '\t'
>>> print my_string
Yes Hello So? # While printing, again tab
However you exact requirement is not clear to me. In case you want to replace the value of \t with tab string, you may do:
>>> my_string = my_string.replace('\t', 'tab')
>>> my_string
'YestabHellotabSo?'
where my_string is holding the value I mentioned in previous example.

Remove all newlines from inside a string

I'm trying to remove all newline characters from a string. I've read up on how to do it, but it seems that I for some reason am unable to do so. Here is step by step what I am doing:
string1 = "Hello \n World"
string2 = string1.strip('\n')
print string2
And I'm still seeing the newline character in the output. I've tried with rstrip as well, but I'm still seeing the newline. Could anyone shed some light on why I'm doing this wrong? Thanks.
strip only removes characters from the beginning and end of a string. You want to use replace:
str2 = str.replace("\n", "")
re.sub('\s{2,}', ' ', str) # To remove more than one space
As mentioned by #john, the most robust answer is:
string = "a\nb\rv"
new_string = " ".join(string.splitlines())
Answering late since I recently had the same question when reading text from file; tried several options such as:
with open('verdict.txt') as f:
First option below produces a list called alist, with '\n' stripped, then joins back into full text (optional if you wish to have only one text):
alist = f.read().splitlines()
jalist = " ".join(alist)
Second option below is much easier and simple produces string of text called atext replacing '\n' with space;
atext = f.read().replace('\n',' ')
It works; I have done it. This is clean, easier, and efficient.
strip() returns the string after removing leading and trailing whitespace. see doc
In your case, you may want to try replace():
string2 = string1.replace('\n', '')
or you can try this:
string1 = 'Hello \n World'
tmp = string1.split()
string2 = ' '.join(tmp)
This should work in many cases -
text = ' '.join([line.strip() for line in text.strip().splitlines() if line.strip()])
text = re.sub('[\r\n]+', ' ', text)
strip() returns the string with leading and trailing whitespaces(by default) removed.
So it would turn " Hello World " to "Hello World", but it won't remove the \n character as it is present in between the string.
Try replace().
str = "Hello \n World"
str2 = str.replace('\n', '')
print str2
If the file includes a line break in the middle of the text neither strip() nor rstrip() will not solve the problem,
strip family are used to trim from the began and the end of the string
replace() is the way to solve your problem
>>> my_name = "Landon\nWO"
>>> print my_name
Landon
WO
>>> my_name = my_name.replace('\n','')
>>> print my_name
LandonWO

Removing the single quotes after using re.sub() in python

After replacing all word characters in a string with the character '^', using re.sub("\w", "^" , stringorphrase) I'm left with :
>>> '^^^ ^^ ^^^^'
Is there any way to remove the single quotes so it looks cleaner?
>>> ^^^ ^^ ^^^^
Are you sure it's just not how it's displayed in the interactive prompt or something (and there aren't actually apost's in your string)?
If the ' is actually part of the string, and is first/last then either:
string = string.strip("'")
or:
string = string[1:-1] # lop ending characters off
Use the print statement. The quotes aren't actually part of the string.
To remove all occurrences of single quotes:
mystr = some_string_with_single_quotes
answer = mystr.replace("'", '')
To remove single quotes ONLY at the ends of the string:
mystr = some_string_with_single_quotes
answer = mystr.strip("'")
Hope this helps

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droidllwp`
string2: 1.1
Because there are multiple uses-permission, I want to get permission as a list, contains:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): check ahead of the main string in order to return only strings that are preceded by name='. The \ in front of = and ' serve to escape them so that the regex knows we're talking about the = string and not a regex command. name=' is not also returned when we get the result, we just know that the results we get are all preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character and underscore. \. is an escaped period, so the regex knows we mean . and not the regex command represented by an unescaped period. Putting these in [] means we're okay with anything we've stuck in brackets, so we're saying that we'll accept any alphanumeric character, _, or .. + afterwords means at least one of the previous thing, meaning at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy--we're telling the regex to get the smallest possible group that meets these specifications, since + could go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): check behind the main string in order to return only strings that are followed by '. The \ is also an escape, since otherwise regex or Python's string execution might misinterpret '. This final ' is not returned with our results, we just know that in the original string, it followed any result we do end up getting.
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
... if s.startswith('package'):
... return [i.split('=')[1] for i in s.split() if "=" in i]
... elif s.startswith('uses-permission'):
... return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is one example code
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
if line.startswith("package"):
words = line.split()
string1 = words[1].split("=")[1].replace("'","")
string2 = words[3].split("=")[1].replace("'","")
test.txt file contains input data you mentioned earlier..

Categories