python fileinput line endings

I've got some python code which is getting line endings all wrong:
command = 'svn cat -r {} "{}{}"'.format(svn_revision, svn_repo, svn_filename)
content = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).stdout.read()
written = False
for line in fileinput.input(out_filename, inplace=1):
    if line.startswith("INPUT_TAG") and not written:
        print content
        written = True
    print line,
This fetches a copy of the file called svn_filename, and inserts the content into another file called out_filename at the "INPUT_TAG" location in the file.
The problem is the line endings in out_filename.
They're meant to be \r\n but the block I insert is \r\r\n.
Changing the print statement to:
print content, # just removes the newlines after the content block
or
print content.replace('\r\r','\r') # no change
has no effect. The extra carriage returns are inserted after the content leaves my code. It seems that, because I'm on Windows, something is converting every \n to \r\n on output.
How can I get around this?

I can "solve" this problem by doing the following:
content = content.replace('\r\n', '\n')
converting the newlines to unix style so when the internal magic converts it again it ends up correct.
This can't be the right/best/pythonic way though....

CRLF = Carriage Return Line Feed.
Python on Windows makes a distinction between text and binary files;
the end-of-line characters in text files are automatically altered
slightly when data is read or written.
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Can you output as binary file and not as a text file?
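Note that prefixing a string literal with r only makes it a raw string (backslashes are kept literally); it does not change how the file is opened, so it will not by itself prevent the extra \r in the output.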
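The idea of bypassing text-mode translation can be sketched as follows. This is a minimal illustration that assumes Python 3 (the question uses Python 2 and fileinput); the helper name write_verbatim and the file name are hypothetical. Passing newline='' to open() asks Python not to translate line endings on write, so content that already contains \r\n is stored unchanged. In Python 2, the rough equivalent is to open the output file in binary mode.

```python
import os
import tempfile


def write_verbatim(path, text):
    """Write text without any newline translation (Python 3)."""
    # newline='' disables translation, so '\n' and '\r\n' pass through as-is.
    with open(path, "w", newline="") as f:
        f.write(text)


tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "out.txt")  # hypothetical file name
write_verbatim(demo_path, "line one\r\nline two\r\n")

# Read the raw bytes back to check that no extra '\r' was added.
with open(demo_path, "rb") as f:
    raw = f.read()
print(repr(raw))
```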

Related

Why does '\x01\x1A' (Start-of-Header and Substitute control characters) in a textfile line stop a for-loop prematurely?

I'm using Python 2.7.15, Windows 7
Context
I wrote a script to read and tokenize each line of a FileZilla log file (specifications here) for the IP address of the host that initiated the connection to the FileZilla server. I'm having trouble parsing the log text field that follows the > character. The script I wrote uses the:
with open('fz.log','r') as rh:
    for lineno, line in enumerate(rh):
        pass
construct to read each line. That for-loop stopped prematurely when it encountered a log text field that contained the SOH and SUB characters. I can't show you the log file since it contains sensitive information but the crux of the problem can be reproduced by reading a textfile that contains those characters on a line.
My goal is to extract the IP addresses (which I can do using re.search()) but before that happens, I have to remove those control characters. I do this by creating a copy of the log file where the lines containing those control characters are removed. There's probably a better way, but I'm more curious why the for-loop just stops after encountering the control characters.
Reproducing the Issue
I reproduced the problem with this code:
if __name__ == '__main__':
    fn = 'writetest.txt'
    fn2 = 'writetest_NoControlChars.txt'
    # Create the problematic text file
    with open(fn, 'w') as wh:
        wh.write("This line comes first!\n")
        wh.write("Blah\x01\x1A\n")  # Write the Start-of-Header and Substitute control characters to the line
        wh.write("This comes after!")
    # Try to read the file above, removing the SOH/SUB characters if encountered
    with open(fn, 'r') as rh:
        with open(fn2, 'w') as wh:
            for lineno, line in enumerate(rh):
                sline = line.translate(None,'\x01\x1A')
                wh.write(sline)
                print "Line #{}: {}".format(lineno, sline)
    print "Program executed."
Output
The code above creates 2 output files and produces the following in a console window:
Line #0: This line comes first!
Line #1: Blah
Program executed.
I step-debugged through the code in Eclipse and immediately after executing the
for lineno, line in enumerate(rh):
statement, rh (the handle for the opened file) was closed. I had expected it to move on to the third line, printing This comes after! to the console and writing it to writetest_NoControlChars.txt, but neither event happened. Instead, execution jumped to print "Program executed".
Picture of Local Variable values in Debug Console
You have to open this file in binary mode if you know it contains non-text data: open(fn, 'rb')
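The reason is that on Windows, a file opened in text mode under Python 2 treats the byte 0x1A (Ctrl-Z, the SUB character) as an end-of-file marker, so iteration stops there. Binary mode reads every byte. A minimal sketch of the binary-mode approach, written so it also runs on Python 3; the function name strip_control_chars and the file names are hypothetical:

```python
import os
import tempfile

CONTROL_CHARS = b"\x01\x1a"  # SOH and SUB


def strip_control_chars(src_path, dst_path):
    """Copy src to dst in binary mode, deleting SOH/SUB bytes.

    Returns the number of lines read.
    """
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # translate(None, delete) removes the listed bytes.
            dst.write(line.translate(None, CONTROL_CHARS))
            count += 1
    return count


tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "writetest.txt")
dst = os.path.join(tmp_dir, "writetest_NoControlChars.txt")
with open(src, "wb") as f:
    f.write(b"This line comes first!\nBlah\x01\x1a\nThis comes after!")

lines_read = strip_control_chars(src, dst)
with open(dst, "rb") as f:
    cleaned = f.read()
print(lines_read, repr(cleaned))
```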

Is it possible to export a list to .txt such that line breaks can be read by notepad?

For accessibility reasons, I wonder whether it is possible, in Python, to export a list to .txt such that line breaks can be read by Notepad. Below is example code whose output is read correctly in Notepad++ but not in Notepad. In Notepad++ each entry of the list is on a separate line; in Notepad all entries are on the same line.
string =['str1 123','str2 234','str3 345']
outF = open("outp.txt", "w")
for item in string:
    outF.write("%s\n" % item)
outF.close()
Windows uses Carriage Return, Line Feed (\r\n) to indicate line breaks, which is the only line ending recognized by Windows Notepad. Example:
In [7]: s = ['hello', 'world']
In [8]: with open('test.txt', 'w') as f:
   ...:     for item in s:
   ...:         f.write('%s\r\n' % item)
Linux-based systems use Line Feed to indicate line breaks, and classic Mac OS used just Carriage Return. An editor like Notepad++ can be configured to recognize all of these, while Notepad cannot.
I'll flesh out the comment's answer a little bit. Windows Notepad only treats a carriage return followed by a line feed as a line break. Therefore, it is best practice to write both characters when making a line break in text meant for Notepad.
So do:
outF.write("%s\r\n" % item)
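One caveat: if the script itself runs on Windows and the file is opened in text mode, writing a literal \r\n can end up as \r\r\n, because the \n is translated again on output. A sketch that avoids this, assuming Python 3, is to keep writing \n and let open() do the translation through its newline parameter. The function name write_crlf_lines is hypothetical.

```python
import os
import tempfile


def write_crlf_lines(path, items):
    """Write one item per line, always terminated by CR LF (Python 3)."""
    # newline='\r\n' makes every '\n' written come out as '\r\n',
    # regardless of the platform the script runs on.
    with open(path, "w", newline="\r\n") as f:
        for item in items:
            f.write("%s\n" % item)


tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "outp.txt")
write_crlf_lines(out_path, ['str1 123', 'str2 234', 'str3 345'])

with open(out_path, "rb") as f:
    raw = f.read()
print(repr(raw))
```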

Write new line at the end of a file

I am working with a numpy array in python. I want to print the array and its properties to a txt output. I want the text output to end with a blank line. How can I do this?
I have tried:
# Create a text document of the output
with open("demo_numpy.txt","w") as text:
    text.write('\n'.join(map(str, [a,shape,size,itemsize,ndim,dtype])) + '\n')
And also:
# Create a text document of the output
with open("demo_numpy.txt","w") as text:
    text.write('\n'.join(map(str, [a,shape,size,itemsize,ndim,dtype])))
    text.write('\n')
However, when I open the file in GitHub desktop, I still get the indication that the last line of the file is "dtype"
When you do "\n".join( ... ) you will get a string of the following form:
abc\ndef\nghi\nhjk
-- in other words, it won't end with \n.
If your code writes another \n then your string will be of the form
abc\ndef\nghi\nhjk\n
But that does not put a blank line at the end of your file, because every line in a text file is supposed to end in \n, including the last one. That is what the POSIX standard says.
So you need another \n so that the last two lines of your file are
hjk\n
\n
Python will not choke if you ask it to read a textfile where the final trailing \n is missing. But it also won't treat a single trailing \n in a textfile as a blank line. It would not surprise me to learn that GitHub does likewise.
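The point about needing a second \n can be sketched like this. The helper name format_with_blank_last_line is hypothetical, and plain strings stand in for the array and its properties:

```python
def format_with_blank_last_line(values):
    """Join values one per line and finish with an empty final line."""
    # The first '\n' terminates the last value's line;
    # the second '\n' is the blank line itself.
    return "\n".join(map(str, values)) + "\n\n"


text = format_with_blank_last_line(["abc", "def", "ghi", "hjk"])
print(repr(text))
print(text.splitlines())
```

Here splitlines() reports an empty string as the final line, which is the blank line the question asks for.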
This was solved using the Python 3.x print function, which automatically appends a newline at the end of each call.
Here is the code:
with open("demo_numpy.txt","w") as text:
    print(a, file = text)
Note: print is apparently the more appropriate choice over .write when dealing with text values rather than binary data, since it converts its arguments to strings and appends the newline itself. The with block closes the file, so no explicit close() is needed.

Python not splitting CRLF correctly

I'm writing a script to convert very simple function documentation to XML in python. The format I'm using would convert:
date_time_of(date) Returns the time part of the indicated date-time value, setting the date part to 0.
to:
<item name="date_time_of">
<arg>(date)</arg>
<help> Returns the time part of the indicated date-time value, setting the date part to 0.</help>
</item>
So far it works great (the XML I posted above was generated from the program) but the problem is that it should be working with several lines of documentation pasted, but it only works for the first line pasted into the application. I checked the pasted documentation in Notepad++ and the lines did indeed have CRLF at the end, so what is my problem?
Here is my code:
mainText = input("Enter your text to convert:\r\n")
try:
    for line in mainText.split('\r\n'):
        name = line.split("(")[0]
        arg = line.split("(")[1]
        arg = arg.split(")")[0]
        hlp = line.split(")",1)[1]
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
    print("Error!")
Any idea of what the issue is here?
Thanks.
input() only reads one line.
Try this. Enter a blank line to stop collecting lines.
lines = []
while True:
    line = input('line: ')
    if line:
        lines.append(line)
    else:
        break
print(lines)
The best way to handle reading lines from standard input (the console) is to iterate over the sys.stdin object. Rewritten to do this, your code would look something like this:
from sys import stdin
try:
    for line in stdin:
        name = line.split("(")[0]
        arg = line.split("(")[1]
        arg = arg.split(")")[0]
        hlp = line.split(")",1)[1]
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
except:
    print("Error!")
That said, it's worth noting that your parsing code could be significantly simplified with a little help from regular expressions. Here's an example:
import re, sys
for line in sys.stdin:
    result = re.match(r"(.*?)\((.*?)\)(.*)", line)
    if result:
        name = result.group(1)
        arg = result.group(2).split(",")
        hlp = result.group(3)
        print('<item name="%s">\r\n<arg>(%s)</arg>\r\n<help>%s</help>\r\n</item>\r\n' % (name,arg,hlp))
    else:
        print("There was an error parsing this line: '%s'" % line)
I hope this helps you simplify your code.
Patrick Moriarty,
It seems to me that you didn't specifically mention the console, and that your main concern is to pass several lines at once to be processed. I could reproduce your problem in only one way: running the program in IDLE, manually copying several lines from a file, and pasting them into raw_input().
Trying to understand your problem led me to the following facts:
When data is copied from a file and pasted into raw_input(), the \r\n newlines are transformed into \n, so the string returned by raw_input() no longer contains \r\n. Hence split('\r\n') cannot split this string.
When data containing isolated \r and \n characters is pasted into a Notepad++ window with the display of special characters enabled, CR LF symbols appear at the end of every line, even where there is only a lone \r or \n. Hence, using Notepad++ to verify the nature of the newlines leads to an erroneous conclusion.
The first fact is the cause of your problem. I don't know the underlying reason for this transformation of data copied from a file and passed to raw_input(), which is why I posted a question on Stack Overflow:
Strange vanishing of CR in strings coming from a copy of a file's content passed to raw_input()
The second fact is responsible for your confusion and despair. Bad luck...
So, what can be done to solve your problem?
Here's code that reproduces the problem. Note the modified algorithm in it, which replaces the repeated splits you applied to each line.
ch = "date_time_of(date) Returns the time part.\r\n"+\
"divmod(a, b) Returns quotient and remainder.\r\n"+\
"enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
"A\rB\nC"
with open('funcdoc.txt','wb') as f:
    f.write(ch)
print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)
print "open 'funcdoc.txt' to manually copy its content, and paste it on the following line"
mainText = raw_input("Enter your text to convert:\n")
print "OK, copy-paste of file 'funcdoc.txt' ' s content has been performed"
print "\nrepr(mainText)==",repr(mainText)
try:
    for line in mainText.split('\r\n'):
        name,_,arghelp = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")
Here's the solution mentioned by delnan : « read from the source instead of having a human copy and paste it. »
It works with your split('\r\n') :
ch = "date_time_of(date) Returns the time part.\r\n"+\
"divmod(a, b) Returns quotient and remainder.\r\n"+\
"enumerate(sequence[, start=0]) Returns an enumerate object.\r\n"+\
"A\rB\nC"
with open('funcdoc.txt','wb') as f:
    f.write(ch)
print "Having just recorded the following string in a file named 'funcdoc.txt' :\n"+repr(ch)
#####################################
with open('funcdoc.txt','rb') as f:
    mainText = f.read()
print "\nfile 'funcdoc.txt' has just been opened and its content copied and put to mainText"
print "\nrepr(mainText)==",repr(mainText)
print
try:
    for line in mainText.split('\r\n'):
        name,_,arghelp = line.partition("(")
        arg,_,hlp = arghelp.partition(") ")
        print('<item name="%s">\n<arg>(%s)</arg>\n<help>%s</help>\n</item>\n' % (name,arg,hlp))
except:
    print("Error!")
And finally, here's Python's own solution for processing the altered text from a manual copy: the splitlines() method, which treats all kinds of newlines (\r, \n, or \r\n) as separators. So replace
for line in mainText.split('\r\n'):
with
for line in mainText.splitlines():
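A small sketch of splitlines() combined with the partition-based parsing used above, assuming Python 3. The helper name parse_doc_line is hypothetical, and the sample string deliberately mixes \n, \r\n, and \r:

```python
def parse_doc_line(line):
    """Split 'name(args) help text' into (name, args, help)."""
    name, _, arghelp = line.partition("(")
    arg, _, hlp = arghelp.partition(") ")
    return name, arg, hlp


pasted = ("date_time_of(date) Returns the time part.\n"
          "divmod(a, b) Returns quotient and remainder.\r\n"
          "A\rB")

# splitlines() breaks on '\n', '\r\n' and '\r' alike.
lines = pasted.splitlines()
parsed = [parse_doc_line(line) for line in lines[:2]]
print(lines)
print(parsed)
```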

How can I detect DOS line breaks in a file?

I have a bunch of files. Some have Unix line endings, many have DOS ones. I'd like to test each file to see whether it is DOS formatted before I switch the line endings.
How would I do this? Is there a flag I can test for? Something similar?
Python can automatically detect what newline convention is used in a file, thanks to the "universal newline mode" (U), and you can access Python's guess through the newlines attribute of file objects:
f = open('myfile.txt', 'U')
f.readline() # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)
This gives the newline ending of the first line (Unix, DOS, etc.), if any.
As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
Reference: http://docs.python.org/2/library/functions.html#open
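The 'U' mode flag belongs to Python 2. In Python 3, universal newlines are the default for text-mode reads, and file objects still expose a newlines attribute. A minimal sketch under that assumption; the function name detect_newlines and the file names are hypothetical:

```python
import os
import tempfile


def detect_newlines(path):
    """Return the newline style(s) seen in a text file (Python 3)."""
    # The default newline=None enables universal newlines on read.
    with open(path, "r") as f:
        f.read()           # read everything so all endings are seen
        return f.newlines  # None, a str, or a tuple of str


tmp_dir = tempfile.mkdtemp()
dos_path = os.path.join(tmp_dir, "dos.txt")
mixed_path = os.path.join(tmp_dir, "mixed.txt")
with open(dos_path, "wb") as f:
    f.write(b"one\r\ntwo\r\n")
with open(mixed_path, "wb") as f:
    f.write(b"one\r\ntwo\n")

print(repr(detect_newlines(dos_path)))
print(repr(detect_newlines(mixed_path)))
```

For a file with mixed endings the attribute is a tuple, so a check such as `"\r\n" in result` needs to handle both the string and the tuple case.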
If you just want to convert a file, you can simply do:
with open('myfile.txt', 'U') as infile:
    text = infile.read() # Automatic ("Universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text) # Writes newlines for the platform running the program
You could search the string for \r\n. That's DOS style line ending.
EDIT: Take a look at this
(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:
print open('myfile.txt', 'U').read()
That is, Python's "universal" file reader will automatically use all the different end of line markers, translating them to "\n".
http://docs.python.org/library/functions.html#open
(Thanks handle!)
As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:
if "\r\n" in open("/path/file.txt","rb").read():
    print "DOS line endings found"
Edit: simplified as per John Machin's comment (no need to use regular expressions).
DOS line breaks are \r\n, Unix ones are only \n. So just search for \r\n.
Using grep & bash:
grep -c -m 1 $'\r$' file
echo $'\r\n\r\n' | grep -c $'\r$' # test
echo $'\r\n\r\n' | grep -c -m 1 $'\r$'
You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide. This is faster and less memory consuming when you have larger text files, but it does not detect mixed newline endings.
In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.
def get_newline(filename):
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'
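A possible way to use this with the newline parameter, assuming Python 3, is sketched below. The function get_newline is repeated so the snippet runs on its own; rewrite_preserving_newlines and the file name are hypothetical.

```python
import os
import tempfile


def get_newline(filename):
    """Return the first newline style found in the file ('\\n' by default)."""
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'


def rewrite_preserving_newlines(path, transform):
    """Apply transform to the file's text and keep its newline style."""
    nl = get_newline(path)
    with open(path, "r") as f:              # universal newlines: text uses '\n'
        text = f.read()
    with open(path, "w", newline=nl) as f:  # '\n' is written back as nl
        f.write(transform(text))


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dos.txt")
with open(path, "wb") as f:
    f.write(b"alpha\r\nbeta\r\n")

rewrite_preserving_newlines(path, lambda text: text.upper())
with open(path, "rb") as f:
    result = f.read()
print(repr(result))
```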
