How can I detect DOS line breaks in a file? - python

I have a bunch of files. Some have Unix line endings, many have DOS line endings. I'd like to test each file to see if it is DOS-formatted before I switch the line endings.
How would I do this? Is there a flag I can test for? Something similar?

Python can automatically detect what newline convention is used in a file, thanks to the "universal newline mode" (U), and you can access Python's guess through the newlines attribute of file objects:
f = open('myfile.txt', 'U')
f.readline() # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)
This gives the newline ending of the first line (Unix, DOS, etc.), if any.
As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
Reference: http://docs.python.org/2/library/functions.html#open
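For Python 3, the same idea can be sketched roughly as follows. The 'U' mode flag was removed in Python 3.11, but universal newlines are the default for text mode (newline=None) and the newlines attribute is still there. The helper name detect_newlines and the latin-1 encoding (picked only so that any byte sequence decodes) are assumptions for illustration, not part of the answer above.

```python
def detect_newlines(path):
    """Return the newline style(s) seen in a file.

    The result is None (no newline found), a single string (for example
    the CRLF pair), or a tuple of strings if the file mixes conventions.
    """
    # newline=None enables universal newlines, which is what makes Python
    # record the endings it has translated in f.newlines.
    # latin-1 is an assumption so that arbitrary bytes decode without errors.
    with open(path, "r", newline=None, encoding="latin-1") as f:
        for _ in f:  # read everything so every line ending is seen
            pass
        return f.newlines
```

Reading the whole file is what lets mixed endings show up as a tuple; stopping after the first readline() only reports the first line's ending.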
If you just want to convert a file, you can simply do:
with open('myfile.txt', 'U') as infile:
    text = infile.read()  # Automatic ("Universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text)  # Writes newlines for the platform running the program

You could search the string for \r\n. That's DOS style line ending.

(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:
print open('myfile.txt', 'U').read()
That is, Python's "universal" file reader will automatically use all the different end of line markers, translating them to "\n".
http://docs.python.org/library/functions.html#open
(Thanks handle!)

As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:
if "\r\n" in open("/path/file.txt","rb").read():
    print "DOS line endings found"
Edit: simplified as per John Machin's comment (no need to use regular expressions).
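The snippet above is Python 2. A minimal Python 3 sketch of the same check might look like this; the function name has_dos_line_endings is made up for the example, and the only real change is that the search string has to be a bytes literal because the file is opened in binary mode.

```python
def has_dos_line_endings(path):
    """Return True if the file contains at least one CRLF pair."""
    # Binary mode matters here: text mode may translate "\r\n" to "\n"
    # before the data reaches our code.
    with open(path, "rb") as f:
        return b"\r\n" in f.read()
```

This reads the whole file into memory, which is fine for small files but may not suit very large ones.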

DOS line breaks are \r\n; Unix uses only \n. So just search for \r\n.

Using grep & bash:
grep -c -m 1 $'\r$' file
echo $'\r\n\r\n' | grep -c $'\r$' # test
echo $'\r\n\r\n' | grep -c -m 1 $'\r$'

You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide. This is faster and less memory consuming when you have larger text files, but it does not detect mixed newline endings.
In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.
def get_newline(filename):
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'
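As a rough Python 3 illustration of the second paragraph of this answer, the detected newline string can be handed to open() when the file is written back. The helper name rewrite_with_newline, its transform argument and the utf-8 encoding are assumptions made for this sketch.

```python
def rewrite_with_newline(path, transform, newline):
    """Apply transform to the file's text and write it back using newline."""
    # Default text mode reads with universal newlines, so the text seen by
    # transform only ever contains "\n".
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # On write, every "\n" is translated to the requested newline string
    # ("\n" itself means no translation).
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        f.write(transform(text))
```

A call could look like rewrite_with_newline(name, str.upper, get_newline(name)), using the function defined above to pick the newline.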

Related

Using Mac Automator to run Python Find & Replace. Multiline (\n\r) not working

I'm using Mac Automator to run a 'Find & Replace' python script. (Below is a simplified version.) It's working really well, with the exception of line breaks...
I've tried multiple variants of \r \n \r\n but it's not removing the line breaks.
replacements = {
    'class="spec-container"':'class="row"',
    '</span>\n\n':'</span>',
    '</div>\n\n\n':'</div>\n'
}
with open('/../TEMP/INPUT.txt') as infile, \
        open('/../TEMP/OUTPUT.txt', 'w') as outfile:
    for line in infile:
        for find, replace in replacements.iteritems():
            line = line.replace(find, replace)
        outfile.write(line)
I'm really only just finding my feet with Python, so apologies in advance. Any help would be gratefully received.
You can use the strip() method for removing whitespace around a string in Python. "hello\n".strip() will return "hello". You can also use lstrip() or rstrip() if you only want it to strip the string on one side.
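One likely reason the multi-line replacements in the question never match is that the loop handles one line at a time, so a pattern spanning several "\n" characters never appears within a single line. A minimal Python 3 sketch that applies the replacements to the whole text instead could look like this; the function names and the utf-8 encoding are assumptions for illustration.

```python
def apply_replacements(text, replacements):
    """Apply each find/replace pair to the whole text, in dict order."""
    for find, replace in replacements.items():
        text = text.replace(find, replace)
    return text


def convert_file(in_path, out_path, replacements):
    """Read the input in one go so patterns can span line breaks."""
    # Default text mode turns "\r\n" and "\r" into "\n" on read,
    # so the patterns only need to mention "\n".
    with open(in_path, "r", encoding="utf-8") as infile:
        text = infile.read()
    with open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write(apply_replacements(text, replacements))
```

Because the file is read with universal newlines, this also covers input that was saved with \r\n or \r endings.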

Is it possible to export a list to .txt such that line breaks can be read by notepad?

For accessibility reasons I wonder whether it is possible, in Python, to export a list to .txt such that line breaks can be read by Notepad. Below is example code whose output is read correctly in Notepad++ but not in Notepad. In Notepad++ each entry of the list is on a separate line; in Notepad all entries are on the same line.
string =['str1 123','str2 234','str3 345']
outF = open("outp.txt", "w")
for item in string:
    outF.write("%s\n" % item)
outF.close()
Windows uses Carriage Return, Line Feed (\r\n) to indicate line breaks, which is the only line ending recognized by Windows Notepad.
Example:
In [7]: s = ['hello', 'world']
In [8]: with open('test.txt', 'w') as f:
   ...:     for item in s:
   ...:         f.write('%s\r\n' % item)
Linux-based systems use Line Feed to indicate line breaks, and classic Mac OS used just Carriage Return. An editor like Notepad++ can be configured to recognize all of these, while Notepad cannot.
I'll flesh out the comment's answer a little bit. Windows Notepad only recognizes a carriage return followed by a line feed as a line break. Therefore, it is best practice to write both carriage return and line feed when making a line break in text meant for Notepad.
So do:
outF.write("%s\r\n" % item)
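One caveat with writing "\r\n" by hand: on Windows, a file opened in text mode already translates "\n" to "\r\n" on write, so an explicit "\r\n" can end up as "\r\r\n". A Python 3 sketch that sidesteps this by fixing the newline translation explicitly is shown below; the function name write_lines_crlf is made up for the example.

```python
def write_lines_crlf(path, items):
    """Write one item per line, always terminated by CRLF."""
    # newline="\r\n" tells Python to translate every "\n" written to the
    # file into "\r\n", regardless of the platform running the script.
    with open(path, "w", newline="\r\n") as out:
        for item in items:
            out.write("%s\n" % item)
```

With this, the loop body keeps the plain "\n" from the question and the file still opens correctly in Notepad.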

When piping a file in windows to a python script, my \r are deleting my characters

I have a file like this:
A\r
B\n
C\r\n.
(By \r I'm referring to CR, and \n is LF)
And this script:
import fileinput
for line in fileinput.input(mode='rU'):
    print(line)
When I call python script.py myfile.txt I get the correct output:
A
B
C
But when I call it like this: type myfile.txt|python script.py, I get this:
B
C
You see? No more "A".
What is happening? I thought the mode='rU' would take care of every newline problem...
EDIT: In Python 3 there is no such problem! Only in Python 2. But that does not solve the problem.
Thanks
EDIT:
Just for the sake of completeness.
- It happens also in Linux.
Python 3 handles every newline type (\n, \r or \r\n) transparently to the
user. Doesn't matter which one your file got, you don't have to worry.
Python 2 needs the parameter mode='rU' passed to fileinput.input to allow it
to handle every newline transparently.
The thing is, in Python 2 this does not work correctly when piping content
to it.
Having tried to pipe a file like this:
CR: \r
LF: \n
CRLF: \r\n
Python 2 treats the first two lines as a single line. If you try to print
every line with this code:
from __future__ import print_function  # needed for print(..., end='') in Python 2
for i,line in enumerate(fileinput.input(mode='rU')):
    print("Line {}: {}".format(i,line), end='')
It outputs this:
Line 0: CR:
LF:
Line 1: CRLF:
This doesn't happen in Python 3. There, these are 2 different lines.
When passing this text as a file, it works ok though.
Piping data like this:
LF: \n
CR: \r
CRLF: \r\n
Gives me a similar result:
Line 0: LF:
Line 1: CR:
CRLF:
My conclusion is the following:
For some reason, when piping data, Python 2 looks for the first newline
symbol it encounters and then on, it just considers that specific character
as a newline. In this example Python 2 encounters \r as the first newline
character and all the others (\n or \r\n) are just common characters.
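If the piped input has to be split the same way no matter which newline comes first, one option is to read the raw bytes and split them yourself. The following is a Python 3 sketch under that assumption, not a fix for the Python 2 behaviour described above; the function name split_any_newline is hypothetical.

```python
def split_any_newline(binary_stream):
    """Read a binary stream and split it on \\r, \\n or \\r\\n."""
    data = binary_stream.read()
    # bytes.splitlines() treats CR, LF and CRLF all as line boundaries
    # and drops the terminators from the returned lines.
    return data.splitlines()
```

In a script fed through a pipe, the stream would typically be sys.stdin.buffer, for example split_any_newline(sys.stdin.buffer).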

python fileinput line endings

I've got some python code which is getting line endings all wrong:
command = 'svn cat -r {} "{}{}"'.format(svn_revision, svn_repo, svn_filename)
content = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).stdout.read()
written = False
for line in fileinput.input(out_filename, inplace=1):
    if line.startswith("INPUT_TAG") and not written:
        print content
        written = True
    print line,
This fetches a copy of the file called svn_filename, and inserts the content into another file called out_filename at the "INPUT_TAG" location in the file.
The problem is the line endings in out_filename.
They're meant to be \r\n but the block I insert is \r\r\n.
Changing the print statement to:
print content, # just removes the newlines after the content block
or
print content.replace('\r\r','\r') # no change
has no effect. The extra carriage returns are inserted after the content leaves my code. It seems like something is deciding that because I'm on windows it should convert all \n to \r\n.
How can I get around this?
I can "solve" this problem by doing the following:
content = content.replace('\r\n', '\n')
converting the newlines to unix style so when the internal magic converts it again it ends up correct.
This can't be the right/best/pythonic way though....
CRLF = Carriage Return Line Feed.
Python on Windows makes a distinction between text and binary files;
the end-of-line characters in text files are automatically altered
slightly when data is read or written.
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Can you output as binary file and not as a text file?
If you prefix the string with r to open the file as raw, does this prevent extra \r in the output?
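To make the "output as binary" suggestion concrete, here is a hedged Python 3 sketch that does the insertion entirely in binary mode, so no newline translation happens in either direction. It replaces the fileinput-based loop with plain file reads and writes; the function name insert_before_tag and its parameters are assumptions for illustration, with the INPUT_TAG marker taken from the question.

```python
def insert_before_tag(path, content, tag=b"INPUT_TAG"):
    """Insert content (bytes) before the first line starting with tag."""
    # Binary mode keeps every "\r\n" exactly as it is stored on disk.
    with open(path, "rb") as f:
        lines = f.readlines()

    out = []
    written = False
    for line in lines:
        if line.startswith(tag) and not written:
            out.append(content)
            written = True
        out.append(line)

    # Writing in binary mode means "\n" is not expanded to "\r\n" again.
    with open(path, "wb") as f:
        f.writelines(out)
```

Here content would be the raw bytes read from the subprocess pipe, passed through unchanged.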

File reading and regex - Python

I read a file which has a line : Fixes: Saurabh Likes python
I want to remove the Fixes: part of above line. I am employing regex for that
but the snippet below returns output like
Saurabh Likes python\r
I am wondering where the \r is coming from. I tried all the strip options for removing it, like rstrip(), lstrip(), etc., but nothing worked. Could anybody suggest a way to get rid of the \r?
patternFixes ='\s*'+'Fixes'+':'+'\s*'
matchFixes= re.search(patternFixes,line, re.IGNORECASE)
if matchFixes:
    patternCompiled = re.compile(patternFixes)
    line=patternCompiled.sub("", line)
    #line=line.lstrip()
    relevantInfo = relevantInfo+line
    continue
Thanks in advance!
-Saurabh
Suggestion to get rid of \r:
I suppose you have opened your file using open(filename). Following the manual of open:
If mode is omitted, it defaults to 'r'. ... In addition to the
standard fopen() values mode may be 'U' or 'rU'. Python is usually
built with universal newlines support; supplying 'U' opens the file as
a text file, but lines may be terminated by any of the following: the
Unix end-of-line convention '\n', the Macintosh convention '\r', or
the Windows convention '\r\n'. All of these external representations
are seen as '\n' by the Python program.
So, in short, please try to open your file using 'rU' and see if the \r vanishes:
with open(filename, "rU") as f:
    # do your stuff here.
    ...
Does the \r vanish in your output?
Of course your code looks rather clunky, but others have already commented on this part.
You probably opened the file in binary mode (open(filename, "rb") or something like that). Don't do this if you're working with text files.
Use open(filename) instead. In Python 3, this automatically normalizes all newlines to \n, regardless of the current platform; in Python 2, open the file with mode 'rU' to get the same behavior.
Also, why not simply patternFixes = r'\s*Fixes:\s*'? Why all the +es?
Then, you're doing a lot of unnecessary stuff like recompiling a regex over and over.
So, my suggestion (which does the same thing as your code, plus the file handling):
r = re.compile(r'\s*Fixes:\s*')
with open(filename) as infile:
    relevantInfo = "".join(r.sub("", line) for line in infile if "Fixes:" in line)
>>> import re
>>> re.sub('Fixes:\s*', '', 'Fixes: Saurabh Likes python')
'Saurabh Likes python'
No '\r'
>>> re.sub('\s*'+'Fixes'+':'+'\s*', '', 'Fixes: Saurabh Likes python')
'Saurabh Likes python'
No '\r' again
Can you provide more details on how to reproduce this?
EDIT: I cannot reproduce it with your code either:
>>> line = 'Fixes: Saurabh Likes python'
>>> patternFixes ='\s*'+'Fixes'+':'+'\s*'
>>> matchFixes= re.search(patternFixes,line, re.IGNORECASE)
>>> if matchFixes:
...     patternCompiled = re.compile(patternFixes)
...     line=patternCompiled.sub("", line)
...     print line
...     line=line.lstrip()
...     print line
...
Saurabh Likes python
Saurabh Likes python
>>>
The '\r' is a carriage return -- http://en.wikipedia.org/wiki/Carriage_return, and it's being picked up from your file.
I will note that if all the lines you need to 'fix' actually DO start with "Fixes: " and that's all you want to change, you could just do something like:
line = line[line.find('Fixes: ')+7:-1]
Saves you all the regex stuff. Not sure on performance, though. And this SHOULD kill your '\r's at the same time.
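Pulling the suggestions in this thread together, a small sketch that removes the "Fixes:" prefix and any trailing \r or \n in one step might look like this. The names FIXES_PREFIX and strip_fixes are hypothetical, and it assumes each line is handled as a str.

```python
import re

# Compiled once, outside any loop; matches the prefix case-insensitively.
FIXES_PREFIX = re.compile(r"\s*Fixes:\s*", re.IGNORECASE)


def strip_fixes(line):
    """Drop the 'Fixes:' prefix and any trailing CR/LF characters."""
    # rstrip("\r\n") removes every trailing "\r" and "\n", in any order.
    return FIXES_PREFIX.sub("", line).rstrip("\r\n")
```

Passing the characters explicitly to rstrip() keeps other trailing whitespace, such as spaces that are part of the text, untouched.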
