Replace ^M (control-M character) in a text file in Python

The file is like this:
This line has control character ^M this is bad
I will try it
I want to remove control M characters in the file, and create a new file like this using Python
This line has control character this is bad
I will try it
I tried the methods I found on Stack Overflow, using string replacement like this:
line.replace("\r", "r")
and
line.replace("\r\n", "r")
Here is part of the code snippet:
with open(file_path, "r") as input_file:
    lines = input_file.readlines()
new_lines = []
for line in lines:
    new_line = line.replace("\r", "")
    new_lines.append(new_line)
new_file_name = "replace_control_char.dat"
new_file_path = os.path.join(here, data_dir, new_file_name)
with open(new_file_path, "w") as output_file:
    for line in new_lines:
        output_file.write(line)
However, the new file I got is:
This line has control character
this is bad
I will try it
"This line has control character" and " this is bad" are not on the same line. I expected that removing the control-M character would put these two phrases on the same line.
Can someone help me solve this issue?
Thanks,
Arthur

You cannot rely on text mode in that case.
Text mode treats a lone \r as a line break (even though the "official" Windows line terminator is \r\n, and on classic Macintosh the line terminator was \r alone). It converts each such line break to \n, and collapses \r\n into a single \n, so it destroys the information you need.
Because universal newlines are enabled by default, this code also fails on Unix/Linux. Python behaves the same on all platforms:
Python doesn’t depend on the underlying operating system’s notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
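A minimal Python 3 sketch of the translation described above. The sample bytes and the temporary file are made up for illustration; the second read uses newline="", which the open() documentation describes as returning line endings untranslated.

```python
import os
import tempfile

# Hypothetical sample: one stray CR inside a line, CR+LF line terminators.
raw = b"has control character \rthis is bad\r\nI will try it\r\n"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(raw)

# Default text mode: universal newlines translate "\r" and "\r\n" to "\n".
with open(path, "r", encoding="ascii") as fh:
    text_default = fh.read()

# newline="" keeps the original line endings untouched.
with open(path, "r", encoding="ascii", newline="") as fh:
    text_untranslated = fh.read()

os.remove(path)

print(repr(text_default))
print(repr(text_untranslated))
```

In the first result the stray \r has already become \n, which is why replace("\r", "") finds nothing to remove and the line stays split.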
If you want to remove those, you have to use binary mode.
with open(file_path, "rb") as input_file:
    contents = input_file.read().replace(b"\r",b"")
with open(file_path, "wb") as output_file:
    output_file.write(contents)
That code will remove all \r characters (including line terminators). That works but if your aim is just to remove stray \r and preserve endlines, another method is required.
One way to do it is to use a regular expression, which can accept binary (bytes) as well:
re.sub(rb"\r([^\n])",rb"\1",contents)
That regular expression removes a \r only when it is followed by a character other than \n, effectively preserving CR+LF Windows end-of-line sequences.
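As a variation on the same idea (not the pattern from the answer), a negative lookahead avoids consuming the character after the \r, so it also handles a \r at the very end of the data and runs of consecutive \r. The function name strip_stray_cr and the sample bytes are hypothetical.

```python
import re


def strip_stray_cr(data: bytes) -> bytes:
    """Remove every CR that is not immediately followed by LF."""
    return re.sub(rb"\r(?!\n)", b"", data)


sample = b"This line has control character \rthis is bad\r\nI will try it\r\n"
cleaned = strip_stray_cr(sample)
print(cleaned)
```

The CR+LF terminators survive because the lookahead rejects any \r that sits directly before a \n.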

Related

Replacing \n while keeping \r\n intact

I have a huge CSV file (196244 lines) that contains \n in places other than line endings. I want to remove those \n but keep \r\n intact.
I tried line.replace, but it does not seem to recognize \r\n, so next I tried a regex:
with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            line = re.sub("(?<!\r)\n", " ", line)
            fixed.write(line)
But it does not keep \r\n; it removes every line break. I can't do this in Notepad++ because it crashes on this file.
You are not exposing the line breaks to the regex engine: when a file is opened in r mode, line breaks are "normalized" to LF, and reading line by line gives the regex only one line at a time. To keep them all in the input, read the file in binary mode using b. Then remember to also use the b prefix on the regex pattern and the replacement.
You can use
with open(filetoread, "rb") as inf:
    with open(filetowrite, "wb") as fixed:
        fixed.write(re.sub(b"(?<!\r)\n", b" ", inf.read()))
Now, the whole file will be read into a single bytes object (with inf.read()), so all the line breaks are visible to the regex and can be matched and replaced.
Pay attention to
"rb" when reading file in
"wb" to write file out
re.sub(b"(?<!\r)\n", b" ", inf.read()) contains b prefixes with string literals, and inf.read() reads in the file contents into single variable.
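To see the lookbehind at work without touching a file, here is a small sketch on in-memory bytes. The sample data is invented; only the pattern comes from the answer above.

```python
import re

# Hypothetical CSV fragment: a bare "\n" inside a record, "\r\n" between records.
data = b"id,comment\r\n1,first part\nsecond part\r\n2,ok\r\n"

# Replace only the LFs that are NOT preceded by CR.
fixed = re.sub(rb"(?<!\r)\n", b" ", data)
print(fixed)
```

Only the bare \n in the middle of the second record is replaced; every \r\n is left alone.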
When you open a file with a plain open() call, TextIOWrapper presents a view of the file in which the various newline sequences are all translated to \n.
Explicitly setting newline="\r\n" should allow you to read and write the newlines the way you expect
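with open(path_src, newline="\r\n") as fh_src:
    with open(path_dest, "w", newline="\r\n") as fh_dest:
        for line in fh_src:  # file-likes are iterable by-lines
            # drop the trailing "\r\n", then replace embedded "\n"
            fh_dest.write(line[:-2].replace("\n", " "))
            # newline="\r\n" translates each written "\n" into "\r\n"
            fh_dest.write("\n")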
content example
>>> with open("test.data", "wb") as fh:
...     fh.write(b"""foo\nbar\r\nbaz\r\n""")
...
14
>>> with open("test.data", newline="\r\n") as fh:
...     for line in fh:
...         print(repr(line))
...
'foo\nbar\r\n'
'baz\r\n'
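One detail worth checking on the output side: per the open() documentation, when newline is set to "\r\n" for writing, every "\n" written is translated to "\r\n". The sketch below uses io.TextIOWrapper over an in-memory buffer to show the effect; the helper name written_bytes is hypothetical.

```python
import io


def written_bytes(text: str, newline: str) -> bytes:
    """Return the bytes produced by writing `text` with the given newline setting."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # keep the wrapper from closing the buffer
    return data


print(written_bytes("a\n", "\r\n"))    # "\n" becomes "\r\n"
print(written_bytes("a\r\n", "\r\n"))  # the "\n" in "\r\n" is translated too
print(written_bytes("a\r\n", ""))      # newline="" writes the text unchanged
```

So with newline="\r\n" on the destination, writing "\n" is enough to get a CR+LF terminator, while writing "\r\n" would double the \r.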

Specify Newline character ('\n') in reading csv using Python

I want to read a csv file with each line dictated by a newline character ('\n') using Python 3. This is my code:
import csv
with open(input_data.csv, newline ='\n') as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
This above code gave error:
batch_data = [line for line in csvread].
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
After reading this post: CSV new-line character seen in unquoted field error, I also tried these alternatives:
with open(input_data.csv, 'rU', newline ='\n') as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
with open(input_data.csv, 'rU', newline ="\n") as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
No luck getting this right yet. Any suggestions?
I am also reading the documentation about newline: if newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
So my understanding of this newline method is:
1) it is a necessity,
2) does it indicate the input file would be split into lines by empty space character?
newline='' is correct in all csv cases, and failing to specify it is an error in many cases. The docs recommend it for the very reason you're encountering.
newline='' doesn't mean "empty space" is used for splitting; it's specifically documented on the open function:
If [newline] is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
So with newline='' all original \r and \n characters are returned unchanged. Normally, in universal newlines mode, any newline-like sequence (\r, \n, or \r\n) is converted to \n in the input. But you don't want this for CSV input, because CSV dialects are often quite picky about what constitutes a newline (Excel dialect requires \r\n only).
Your code should be:
import csv
with open('input_data.csv', newline='') as f:
    csvread = csv.reader(f)
    batch_data = list(csvread)
If that doesn't work, you need to look at your CSV dialect and make sure you're initializing csv.reader correctly.
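A self-contained Python 3 sketch of the recommended call, using a temporary file with made-up contents (a quoted field containing an embedded \n, and \r\n between records):

```python
import csv
import os
import tempfile

# Hypothetical file contents, written in binary so nothing is translated.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as fh:
    fh.write(b'id,comment\r\n1,"first line\nsecond line"\r\n2,ok\r\n')

# newline="" hands the raw line endings to the csv module.
with open(path, newline="", encoding="ascii") as f:
    batch_data = list(csv.reader(f))

os.remove(path)
print(batch_data)
```

The embedded \n stays inside the quoted field instead of splitting the record in two.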
As I understand it, the line terminator '\n' marks the end of a record in a CSV file, while commas separate the values within each line, whether the file is opened in Notepad or in Excel.

Python 2.7: len() returns wrong value for line from file with newline character

I am using WinPython 2.7 on Windows 7 64bit.
I want to open a file, read its contents line by line and when encountering a certain sequence, I want to continue operating on the file contents from there on.
To save the current position, I am appending the length of the current line to a list of line lengths. However, len(line) returns a value that is too small by 1. I suspect this is somehow because of Windows' newline character \r\n.
Consider the following code for an example.
testfile.txt:
Line1
Line2
Line3
test.py
fn = 'testfile.txt'
f = open(fn)
line_offsets = []
for line in f:
    line_offsets.append(len(line))
f.seek(line_offsets[1])
print '%r' % f.read()
Output:
'\nLine2\nLine3'
Expected Output:
'Line2\nLine3'
I tried opening the file by specifying the read method (with universal newlines): f = open(fn, 'rU')
but this didn't do the trick either. I can get it to work if I open the file in binary mode, but this is in fact a text file, not a binary file, so I would like to avoid that and I also want to understand what is going on here.
Open the file in binary mode, and the '\r' won't be stripped from the line. Then the len will return the proper byte count.
f = open(fn, 'rb')
This will be especially important if you port to Python 3, since non-binary files will decode the bytes into Unicode characters as you read them and the count could be way off.
You could use splitlines() to get the lines off your file. It has a tolerance for the various newline characters as per the documentation.
In order to get the behavior you want here, you can explicitly call f.tell() before reading each line, and use f.readline() to read the line afterwards. You will likely have to work in binary mode as well due to a Windows issue with tell(), and deal with any line ending issues yourself. Using the file as an iterator will not work as it is buffered and may advance the file pointer beyond the line you are reading in the file.
>>> with open('testfile.txt', 'rb') as f:
...     while True:
...         here = f.tell()
...         line = f.readline()
...         if not line:
...             break
...         print('%02d\t%r' % (here, line))
...
00 'Line1\n'
06 'Line2\n'
12 'Line3\n'
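The same tell()/readline() approach can be sketched in Python 3 against a file that really does use \r\n, which is the case the question was about. The temporary file and its contents are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical test file with Windows line endings.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(b"Line1\r\nLine2\r\nLine3")

offsets = []
with open(path, "rb") as f:
    while True:
        here = f.tell()        # byte offset where the next line starts
        line = f.readline()
        if not line:
            break
        offsets.append(here)
    f.seek(offsets[1])         # jump back to the start of the second line
    rest = f.read()

os.remove(path)
print(offsets)
print(rest)
```

Because the file is read in binary mode, each \r\n counts as two bytes and the recorded offsets line up with what seek() expects.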

Python is adding extra newline to the output

The input file: a.txt
aaaaaaaaaaaa
bbbbbbbbbbb
cccccccccccc
The python code:
with open("a.txt") as f:
    for line in f:
        print line
The problem:
aaaaaaaaaaaa

bbbbbbbbbbb

cccccccccccc
As you can see, the output has an extra blank line between each item.
How can I prevent this?
print appends a newline, and the input lines already end with a newline.
A standard solution is to output the input lines verbatim:
import sys
with open("a.txt") as f:
    for line in f:
        sys.stdout.write(line)
PS: For Python 3 (or Python 2 with the print function), abarnert's print(…, end='') solution is the simplest one.
As the other answers explain, each line already ends with a newline, and when you print a bare string, print adds another newline at the end. There are two ways around this; everything else is a variation on the same two ideas.
First, you can strip the newlines as you read them:
with open("a.txt") as f:
    for line in f:
        print line.rstrip()
This will strip any other trailing whitespace, like spaces or tabs, as well as the newline. Usually you don't care about this. If you do, you probably want to use universal newline mode, and strip off the newlines:
with open("a.txt", "rU") as f:
    for line in f:
        print line.rstrip('\n')
However, if you know the text file will be, say, a Windows-newline file, or a native-to-whichever-platform-I'm-running-on-right-now-newline file, you can strip the appropriate endings explicitly:
with open("a.txt") as f:
    for line in f:
        print line.rstrip('\r\n')
import os
with open("a.txt") as f:
    for line in f:
        print line.rstrip(os.linesep)
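The difference between these rstrip variants is easy to check on a single string. This sketch uses Python 3 print() and an invented sample line with trailing spaces, a tab, and a Windows line ending:

```python
line = "value \t\r\n"

stripped_all = line.rstrip()          # removes all trailing whitespace
stripped_lf = line.rstrip("\n")       # removes only trailing "\n" characters
stripped_crlf = line.rstrip("\r\n")   # removes any trailing run of "\r" and "\n"

print(repr(stripped_all))
print(repr(stripped_lf))
print(repr(stripped_crlf))
```

Note that the argument to rstrip is a set of characters, not a suffix, so rstrip("\r\n") would also remove a trailing "\n\n" or "\r\r".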
The other way to do it is to leave the original newline, and just avoid printing an extra one. While you can do this by writing to sys.stdout with sys.stdout.write(line), you can also do it from print itself.
If you just add a comma to the end of the print statement, instead of printing a newline, it adds a "smart space". Exactly what that means is a bit tricky, but the idea is supposed to be that it adds a space when it should, and nothing when it shouldn't. Like most DWIM algorithms, it doesn't always get things right—but in this case, it does:
with open("a.txt") as f:
    for line in f:
        print line,
Of course we're now assuming that the file's newlines match your terminal's—if you try this with, say, classic Mac files on a Unix terminal, you'll end up with each line printing over the last one. Again, you can get around that by using universal newlines.
Anyway, you can avoid the DWIM magic of smart space by using the print function instead of the print statement. In Python 2.x, you get this by using a __future__ declaration:
from __future__ import print_function
with open("a.txt") as f:
    for line in f:
        print(line, end='')
Or you can use a third-party wrapper library like six, if you prefer.
What happens is that each line has a newline at the end, and the print statement in Python also adds a newline. You can strip the newlines:
with open("a.txt") as f:
    for line in f:
        print line.strip()
You could also try the splitlines() function, it strips automatically:
f = open('a.txt').read()
for l in f.splitlines():
    print l
It is not adding a newline, but each scanned line from your file has a trailing one.
Try:
with open ("a.txt") as f:
    for line in (x.rstrip ('\n') for x in f):
        print line

Convert \r text to \n so readlines() works as intended

In Python, you can read a file and load its lines into a list by using
f = open('file.txt','r')
lines = f.readlines()
Each individual line is delimited by \n but if the contents of a line have \r then it is not treated as a new line. I need to convert all \r to \n and get the correct list lines.
If I do .split('\r') inside the lines I'll get lists inside the list.
I thought about opening the file, replacing all \r with \n, closing the file, reading it in again and then using readlines(), but this seems wasteful.
How should I implement this?
f = open('file.txt','rU')
This opens the file with Python's universal newline support and \r is treated as an end-of-line.
If it's a concern, open in binary format and convert with this code:
from __future__ import with_statement
with open(filename, "rb") as f:
    s = f.read().replace('\r\n', '\n').replace('\r', '\n')
lines = s.split('\n')
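The snippet above is Python 2 style. In Python 3, data read in "rb" mode is bytes, so the literals need a b prefix. A minimal sketch on in-memory data follows; the function name split_any_newline is hypothetical.

```python
def split_any_newline(data):
    """Split bytes on "\\r\\n", "\\r" or "\\n" by normalising everything to "\\n" first."""
    normalised = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalised.split(b"\n")


lines = split_any_newline(b"a\rb\r\nc\nd")
print(lines)
```

As with the original, data that ends with a line terminator produces an empty last element; bytes.splitlines() is an alternative that does not.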
