Split a Unicode string only at universal newlines (\n, \r, \r\n) - python

In Python 3, the str.splitlines method splits at many line boundaries, including the "universal newlines" "\r", "\n", and "\r\n", as well as others.
Suppose I have a Unicode string and I want to split it into lines, only recognizing universal newlines "\r", "\n", and "\r\n". Example:
my_text = 'Line 1\f\rLine 2\r\nLine 3\f...\nLine 4\n'
# Desired output:
lines = split_only_universal_newlines(my_text)
print(lines)
# ['Line 1\x0c\r', 'Line 2\r\n', 'Line 3\x0c...\n', 'Line 4\n']
# Note that the form feed character \f is printed as '\x0c'.
# Incorrect output produced by str.splitlines:
lines = my_text.splitlines(keepends=True)
print(lines)
# ['Line 1\x0c', '\r', 'Line 2\r\n', 'Line 3\x0c', '...\n', 'Line 4\n']
The reason I need to only recognize universal newlines is for consistency with other code/tools that follow that convention.
What is the cleanest or most Pythonic way of doing this?

Besides regular expressions, there are two approaches that I can think of. The first is to employ bytes.splitlines, which according to the docs splits only at universal newlines.
A solution based on this idea is as follows.
lines = [l.decode() for l in my_text.encode().splitlines(keepends=True)]
Another approach is to use the Text IO classes:
import io
lines = list(io.StringIO(my_text, newline=''))
Here, the newline keyword works as follows according to the io.StringIO docs:
The newline argument works like that of TextIOWrapper.
and the io.TextIOWrapper docs:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
The latter approach looks better to me because it avoids the encode/decode round trip (my_text.encode() creates an extra bytes copy of the whole string, and each line then has to be decoded back to str). Meanwhile, if you want to iterate over each line in the input you can just write:
for line in io.StringIO(my_text, newline=''):
    ...
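For completeness, a regex-based sketch is also possible (the pattern below is my own, not from the question); each match is a run of non-newline characters plus its universal line ending, or a final chunk with no line ending at all:
import re

my_text = 'Line 1\f\rLine 2\r\nLine 3\f...\nLine 4\n'
lines = re.findall(r'[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+', my_text)
print(lines)
# ['Line 1\x0c\r', 'Line 2\r\n', 'Line 3\x0c...\n', 'Line 4\n']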

Use io.StringIO(my_text, newline='').readlines(). The newline='' means (only) universal newlines are treated as line separators, and furthermore line endings are returned to the caller unchanged.
import io
lines = io.StringIO(my_text, newline='').readlines()
print(lines)
# ['Line 1\x0c\r', 'Line 2\r\n', 'Line 3\x0c...\n', 'Line 4\n']
Python documentation: io.StringIO, readlines(), and the behavior of newline=''.

Related

Removing '\n' character from print statement

I have the following code when writing to a text file:
def writequiz(quizname, grade, perscore, score, username):
    details = [quizname, username, grade, perscore, score]
    with open('quizdb', 'a') as userquiz:
        print(details, file=userquiz)
Now the code is doing what I want it to (writing to a new line every time); however, if I wanted to write every list to the same line in the text file, how would I do this using print as above? I know I could use file.write, but how do I remove the newline character in the print statement? Slightly hypothetical, but it was bugging me.
If you are using Python 2.x, you can do the following:
print >> userquiz, details, # <- notice the comma at the end
If using Python 3.x, you can do this:
print(details, file=userquiz, end=" ")
Check print documentation.
You can set the end parameter of print to be an empty string (or some other character):
print(details, file=userquiz, end='')
From the docs, you can see that it defaults to a newline:
print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
Print objects to the text stream file, separated by sep and followed by end. sep, end, file and flush, if present, must be given as keyword arguments.
The default end='\n' is what currently puts each list on its own line in the file.
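For example (the values below are made up), two consecutive calls with end=' ' end up on the same line of the file:
details = ['quiz1', 'alice', 3, 80, 8]  # made-up example values
with open('quizdb', 'a') as userquiz:
    # No trailing newline, so the next print call continues the same line.
    print(details, file=userquiz, end=' ')
    print(details, file=userquiz, end=' ')
# quizdb now contains both list reprs on a single line, separated by a space.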

Most efficient way to delete needless newlines in Python

I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?
The nearest equivalent to the tcl string map would be str.translate, but unfortunately its keys can only be single characters, so multi-character replacements like '\n\n' are not possible. So it would be necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r's have to be replaced first:
import re
oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
\tThis would keep paragraphs separated.
\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
output:
This would keep paragraphs separated. This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
When, in the course of human events, it becomes necessary for one people
I doubt whether this is as efficient as the tcl code, though.
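As a side note on the str.translate limitation mentioned above: the \r-removal step on its own could be written as a translate call (a minimal sketch), but multi-character keys such as '\n\n' are not possible, which is why the regexp is still needed.
no_cr = oldtext.translate({ord('\r'): None})  # maps the single character '\r' to nothing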
UPDATE:
I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my python equivalent:
import re
with open('gutenberg.txt') as stream:
    oldtext = stream.read()
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
You can use a regular expression with a look-ahead search:
import re
text = """
...
"""
newtext = re.sub(r"\n(?=[^\n\t])", " ", text)
That will replace any new line that is not followed by a newline or a tab with a space.
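A quick check on a made-up sample (the sample string below is mine, not from the question):
import re

sample = "line one\nline two\n\tnext paragraph\n"
print(repr(re.sub(r"\n(?=[^\n\t])", " ", sample)))
# 'line one line two\n\tnext paragraph\n'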
I use the following script when I want to do this:
import sys
import os
filename, extension = os.path.splitext(sys.argv[1])
with open(filename+extension, encoding='utf-8-sig') as (file
), open(filename+"_unwrapped"+extension, 'w', encoding='utf-8-sig') as (output
):
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)
A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
The last line in the file may not have a trailing linefeed and therefore gets copied directly.
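Assuming the script is saved as, say, unwrap.py, running python unwrap.py book.txt would write the unwrapped text to book_unwrapped.txt next to the original (both file names here are only illustrative).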

Comments in continuation lines

Say I have a multiline command:
if 2>1 \
   and 3>2:
    print True
In an if block, I can add a comment next to one of the conditions by using parentheses to wrap the lines:
if (2>1 #my comment
    and 3>2):
    print True
And, in fact, this is aligned with the way recommended by the PEP 8 guidelines:
The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.
However, sometimes you need to use continuations. For example, long, multiple with-statements cannot use implicit continuation. Then, how can I add a comment next to a specific line? This does not work:
with open('a') as f1, #my comment\
     open('b') as f2:
    print True
More generally, is there a generic way to add a comment next to a specific continuation line?
You cannot. Here are some extracts from the Python reference manual (3.4):
A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line.
A line ending in a backslash cannot carry a comment.
A comment signifies the end of the logical line unless the implicit line joining rules are invoked.
Implicit line joining: expressions in parentheses, square brackets or curly braces can be split over more than one physical line without using backslashes.
Implicitly continued lines can carry comments.
So the reference manual explicitly disallows adding a comment on an explicitly continued line.
You can't have comments and backslash for line continuation on the same line. You need to use some other strategy.
The most basic would be to adjust the comment text and place it, e.g., before the relevant section. You could also document your intentions without comments at all by refactoring the code that sets up the context managers into a function or method with a descriptive name, as sketched below.
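A sketch of that refactoring idea (the helper name below is made up); wrapping the context managers in a small generator-based context manager leaves each open call on its own, ordinarily commentable line:
from contextlib import contextmanager

@contextmanager
def open_both_inputs(path_a, path_b):
    # The descriptive name documents the intent, and each line can carry a normal comment.
    with open(path_a) as f1:      # comment about the first file
        with open(path_b) as f2:  # comment about the second file
            yield f1, f2

with open_both_inputs('a', 'b') as (f1, f2):
    print(f1, f2)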
I don't see any solution except nesting the with:
with open('a.txt', 'w') as f1: #comment1
    with open('b.txt', 'w') as f2: #comment2
        print True
You cannot combine end-of-line comment (#) and line continuation (\) on the same line.
I am not recommending this. -- However, sometimes you can masquerade your comment as a string:
with open(('a', '# COMMENT THIS')[0]) as f1, \
     open(('b', '# COMMENT THAT')[0]) as f2:
    print(f1, f2)

python convert multiline to single line

I want to convert a Python multiline string to a single line. If I open the string in Vim, I can see ^M at the start of each line. How do I process the string to make it all one line, with tab separation between what used to be separate lines? For example, in Vim it looks like:
Serialnumber
^MName Rick
^MAddress 902, A.street, Elsewhere
I would like it to be something like:
Serialnumber \t Name \t Rick \t Address \t 902, A.street,......
where each string is in one line. I tried
somestring.replace(r'\r','\t')
But it doesn't work. Also, once the string is on a single line, if I wanted a newline (a UNIX newline?) at the end of the string, how would I do that?
Deleted my previous answer because I realized it was wrong and I needed to test this solution.
Assuming that you are reading this from the file, you can do the following:
f = open('test.txt', 'r')
lines = f.readlines()
mystr = '\t'.join([line.strip() for line in lines])
As ep0 said, the ^M represents '\r', which is the carriage return character on Windows. It is surprising that you would have ^M at the beginning of each line, since the Windows newline sequence is \r\n. Having ^M at the beginning of the line indicates that your file contains \n\r instead.
Regardless, the code above makes use of a list comprehension to loop over each of the lines read from test.txt. For each line in lines, we call str.strip() to remove any whitespace and non-printing characters from the ENDS of each line. Finally, we call '\t'.join() on the resulting list to insert tabs.
You can replace "\r" characters with "\t". Note that str.replace returns a new string rather than modifying the original, so assign the result:
my_string = my_string.replace("\r", "\t")
I use splitlines() to detect all types of line endings, and then join everything together. This way you don't have to guess whether to replace \r or \n, etc.
"".join(somestring.splitlines())
This is hard-coded, but it works.
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
lst=list(poem)
str=''
for i in lst:
    str+=i
print(str)
lst1=str.split("\n")
str1=""
for i in lst1:
    str1+=i+" "
str2=str1[:-2]
print(str2)
This occurs because of how Vim displays CR (carriage return), which Windows uses (as part of CRLF) to delimit new lines. You should use just one editor (I personally prefer Vim). Read this: VIM ^M
This trick also can be useful, write "\n" as a raw string. Like :
my_string = my_string.replace(r"\n", "\t")
This should do the job:
def flatten(multiline):
    lst = multiline.split('\n')
    flat = ''
    for line in lst:
        flat += line.replace(' ', '')+' '
    return flat
This should do the job:
string = """Name Rick
Address 902, A.street, Elsewhere"""
single_line = string.replace("\n", "\t")

Python - Ignore FIRST character (tab) every line when reading

This is a continuation of my former questions (check them if you are curious).
I can already see the light at the end of the tunnel, but I've got a last problem.
For some reason, every line starts with a TAB character.
How can I ignore that first character ("tab" (\t) in my case)?
filename = "terem.txt"
OraRend = collections.namedtuple('OraRend', 'Nap, OraKezdese, OraBefejezese, Azonosito, Terem, OraNeve, Emelet')
csv.list_dialects()
for line in csv.reader(open(filename, "rb"), delimiter='\t', lineterminator='\t\t', doublequote=False, skipinitialspace=True):
print line
orar = OraRend._make(line) # Here comes the trouble!
The text file:
http://pastebin.com/UYg4P4J1
(Can't really paste it here with all the tabs.)
I have found lstrip, strip and other methods, all of them would eat all the chars, so the filling of the tuple would fail.
You could do line = line[1:] to just strip the first character. But if you do this, you should add an assertion that the first character is indeed a tab, to avoid mangling data without leading tab.
There is an easier alternative that also handles several other cases and doesn't break things if the things to be removed aren't there. You can strip all leading and trailing whitespace with line = line.strip(). Alternatively, use .lstrip() to strip only leading whitespace, and add '\t' as argument to either method call if you want to leave other whitespace in place and just remove tabs.
To remove the first character from a string:
>>> s = "Hello"
>>> s
'Hello'
>>> s[1:]
'ello'
From the docs:
str.lstrip([chars])
Return a copy of the string with leading characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix; rather, all combinations of its values are stripped.
If you want to only remove the tab at the beginning of a line, use
str.lstrip("\t")
This has the benefit that you don't have to check whether the first character is, in fact, a tab. However, if there can be more than one leading tab and you want to keep everything from the second tab onward, you're going to have to use str[1:].
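A quick illustration of the difference (made-up string):
s = '\t\tfield1\tfield2'
print(repr(s.lstrip('\t')))  # 'field1\tfield2' (every leading tab removed)
print(repr(s[1:]))           # '\tfield1\tfield2' (only the first character removed)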
Consider this. You don't need to pass a "file" to csv.reader. A file-like object, or any other sequence of string values, works nicely.
filename = "terem.txt"
OraRend = collections.namedtuple('OraRend', 'Nap, OraKezdese, OraBefejezese, Azonosito, Terem, OraNeve, Emelet')
with open(filename, "rb") as source:
cleaned = ( line.lstrip() for line in source )
rdr= csv.reader( cleaned, delimiter='\t', lineterminator='\t\t', doublequote=False, skipinitialspace=True)
for line in rdr
print line
orar = OraRend._make(line) # Here comes the trouble!
