Python write lines to a UCS-2 LE BOM encoded text file - python

I have an array of text strings in Python 3.7.
Now I want to write them all to a text file. The problem is that the text file has to be encoded as UCS-2 LE BOM (that's what Notepad++ reports as its encoding), otherwise the file won't work in further processing.
How do I write the text strings to the file in that encoding while keeping the strings readable?
with open(textpath, "w", encoding='utf-16-le') as f:
    for line in newlines:
        f.write(line)
This does not work because it generates gibberish text...

Try writing an explicit BOM:
with open(textpath, "w", encoding='utf-16-le') as f:
    f.write('\ufeff')
    for line in newlines:
        f.write(line)
        # Perhaps you also need to add a newline after each line?
        f.write('\n')
Obviously revert the last addition if your lines already have newlines.
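A minimal sketch of the explicit-BOM approach, checked against the bytes that end up on disk. The helper name write_utf16_le_bom, the temporary file and the sample strings are assumptions made for illustration, and newline="" is an extra choice here so that "\n" is written without platform-specific translation:

```python
import os
import tempfile

def write_utf16_le_bom(path, lines):
    """Write lines as UTF-16 LE preceded by a BOM (hypothetical helper)."""
    # newline="" disables newline translation, so "\n" is written as-is.
    with open(path, "w", encoding="utf-16-le", newline="") as f:
        f.write("\ufeff")  # explicit byte order mark
        for line in lines:
            f.write(line + "\n")

demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
write_utf16_le_bom(demo_path, ["hello", "caf\u00e9"])

with open(demo_path, "rb") as f:
    raw = f.read()

# The UTF-16 LE BOM is the byte pair FF FE.
starts_with_bom = raw.startswith(b"\xff\xfe")
# Decoding with "utf-16" consumes the BOM and uses it to pick the byte order.
round_trip = raw.decode("utf-16")
print(starts_with_bom, repr(round_trip))
```

If the consumer expects Windows line endings, writing "\r\n" instead of "\n" in the helper would be the corresponding change.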

Related

replace ^M(control M character) in a text file in python

The file is like this:
This line has control character ^M this is bad
I will try it
I want to remove control M characters in the file, and create a new file like this using Python
This line has control character this is bad
I will try it
I tried the methods I found on Stack Overflow and used string replacement like this:
line.replace("\r", "r")
and
line.replace("\r\n", "r")
Here is part of the code snippet:
with open(file_path, "r") as input_file:
    lines = input_file.readlines()
    new_lines = []
    for line in lines:
        new_line = line.replace("\r", "")
        new_lines.append(new_line)
    new_file_name = "replace_control_char.dat"
    new_file_path = os.path.join(here, data_dir, new_file_name)
    with open(new_file_path, "w") as output_file:
        for line in new_lines:
            output_file.write(line)
However, the new file I got is:
This line has control character
this is bad
I will try it
"This line has control character" and " this is bad" are not on the same line. I expected that removing the control M character would put these two phrases on the same line.
Can someone help me solve this issue?
Thanks,
Arthur
You cannot rely on text mode in that case.
Text mode treats a lone \r as a line ending (on Windows the "official" line terminator is \r\n, and on classic Macintosh the line terminator was \r alone). It converts every \r and every \r\n to \n while reading, so the information you need is destroyed before replace() ever sees it.
Universal newlines are enabled by default, so this code also fails on Unix/Linux. Python behaves the same on all platforms:
Python doesn't depend on the underlying operating system's notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
If you want to remove those, you have to use binary mode.
with open(file_path, "rb") as input_file:
    contents = input_file.read().replace(b"\r",b"")
with open(file_path, "wb") as output_file:
    output_file.write(contents)
That code will remove all \r characters (including line terminators). That works but if your aim is just to remove stray \r and preserve endlines, another method is required.
One way to do it is to use a regular expression, which can accept binary (bytes) as well:
re.sub(rb"\r([^\n])",rb"\1",contents)
That regular expression removes a \r only when it is followed by a character other than \n, effectively preserving CR+LF Windows end-of-line sequences.
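One caveat with the ([^\n]) form is that it needs a following byte to match, so a \r at the very end of the data is left in place. A possible variation, sketched here with a negative lookahead instead of a capture group, also handles that case. The function name strip_stray_cr and the sample bytes are assumptions for illustration:

```python
import re

def strip_stray_cr(data: bytes) -> bytes:
    """Remove every CR that is not immediately followed by LF."""
    # (?!\n) is a negative lookahead: it inspects the next byte without consuming it.
    return re.sub(rb"\r(?!\n)", b"", data)

sample = b"has control character \r this is bad\r\nI will try it\r\ntrailing\r"
cleaned = strip_stray_cr(sample)
print(cleaned)

# For comparison: the capture-group pattern leaves a CR at the end of the data.
capture_group_result = re.sub(rb"\r([^\n])", rb"\1", b"trailing\r")
print(capture_group_result)
```

Both patterns keep \r\n pairs untouched; they differ only in how they treat a \r that is the last byte.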

Replace an arrow character, repeating headers and blank lines in text file and paste the data cleanly in Excel sheet

My attempt to remove arrow character, blank lines and headers from this text file is as below -
I am trying to ignore arrow character and blank lines and write in the new file MICnew.txt but my code doesn't do it. Nothing changes in the new file.
Please help, Thanks so much
I have attached sample file as well.
import re
with open('MIC.txt') as oldfile, open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(re.sub(r'[^\x00-\x7f]',r' ',line))
with open('MICnew.txt','r+') as file:
    for line in file:
        if not line.isspace():
            file.write(line)
You can't read from and write to the same file simultaneously. When you open a file with mode r+, the I/O pointer is initially at the beginning but reading will push it to the end (as explained in this answer). So in your case, you read the first line of the file, which moves the pointer to the end of the file. Then you write out that line (unless it's all whitespace) but crucially, the pointer stays at the end. That means on the next iteration of the loop you will have reached the end of the file and your program stops.
To avoid this, read in all the contents of the file first, then loop over them and write out what you want:
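from pathlib import Path
file_data = Path('MICnew.txt').read_text()
with open('MICnew.txt', 'w') as out_handle: # THIS WILL OVERWRITE THE FILE!
    for line in file_data.splitlines():
        # splitlines() drops the line endings, so blank lines arrive as ''
        # and ''.isspace() is False; strip() catches both cases
        if line.strip():
            out_handle.write(line + '\n')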
But that double loop is a bit clumsy and you can instead combine the two steps into one:
import re
with open('MIC.txt', errors='ignore') as oldfile, \
        open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        clean_line = re.sub(r'[^\x00-\x7f]', ' ', line.strip('\x0c'))
        if not clean_line.isspace():
            newfile.write(clean_line)
To drop bytes that cannot be decoded, the file is opened with errors='ignore', which omits the improperly encoded characters. The sample file also contains a number of rogue form feed characters (ASCII code 12, or \x0c in hex); line.strip('\x0c') removes them from the start and end of each line.
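The same cleaning rules can be tried on in-memory strings before touching any file. This is only a sketch: the generator name clean_lines and the sample lines are assumptions, and unlike the code above it uses replace() so that form feeds are removed anywhere in the line, not just at its ends:

```python
import re

def clean_lines(lines):
    """Yield lines without form feeds or non-ASCII characters, skipping blank lines."""
    for line in lines:
        # Drop form feeds anywhere in the line, then blank out non-ASCII characters.
        cleaned = re.sub(r"[^\x00-\x7f]", " ", line.replace("\x0c", ""))
        if cleaned.strip():  # skips empty and whitespace-only lines
            yield cleaned

sample = ["\x0cHEADER\n", "\n", "value \u2192 42\n", "   \n"]
result = list(clean_lines(sample))
print(result)
```

Because clean_lines accepts any iterable of strings, an open file object can be passed in directly and the results written to the new file line by line.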

Replacing \n while keeping \r\n intact

I have a huge CSV file (196244 lines) that contains \n in places other than line endings. I want to remove those \n but keep \r\n intact.
I've tried line.replace, but it does not seem to recognize \r\n, so next I tried a regex:
with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            line = re.sub("(?<!\r)\n", " ", line)
            fixed.write(line)
but it does not keep \r\n; it removes everything. I can't do it in Notepad++ because it crashes on this file.
You are not exposing the line breaks to the regex engine. Also, the line breaks are "normalized" to LF when using open with r mode. To keep them all in the input, you can read the file in binary mode using b. Then you need to remember to also use the b prefix with the regex pattern and replacement.
You can use
with open(filetoread, "rb") as inf:
    with open(filetowrite, "wb") as fixed:
        fixed.write(re.sub(b"(?<!\r)\n", b" ", inf.read()))
Now the whole file is read into a single bytes object (with inf.read()), so the original line breaks reach the regex and are replaced wherever the pattern matches.
Pay attention to
"rb" when reading file in
"wb" to write file out
re.sub(b"(?<!\r)\n", b" ", inf.read()) contains b prefixes with string literals, and inf.read() reads in the file contents into single variable.
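To see what the lookbehind does without creating a file, the same substitution can be run on a small bytes literal. The function name join_stray_lf and the sample data are assumptions for illustration:

```python
import re

def join_stray_lf(data: bytes) -> bytes:
    """Replace every LF that is not preceded by CR with a space."""
    # (?<!\r) is a negative lookbehind: the LF matches only if the previous byte is not CR.
    return re.sub(rb"(?<!\r)\n", b" ", data)

sample = b"id,comment\r\n1,first\nhalf\r\n2,ok\r\n"
fixed = join_stray_lf(sample)
print(fixed)
```

The stray \n inside the second record becomes a space, while all three \r\n terminators are left as they were.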
When you open a file with a plain open() call, the TextIOWrapper it returns translates every kind of newline to a simple \n.
Explicitly setting newline="\r\n" when reading makes \r\n the only line terminator, so lone \n characters stay inside the line and can be replaced:
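with open(path_src, newline="\r\n") as fh_src:
    # newline="" writes "\r\n" as-is; newline="\r\n" would turn it into "\r\r\n"
    with open(path_dest, "w", newline="") as fh_dest:
        for line in fh_src: # file-likes are iterable by-lines
            # assumes every line, including the last, ends with "\r\n"
            fh_dest.write(line[:-2].replace("\n", " "))
            fh_dest.write("\r\n")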
content example
>>> with open("test.data", "wb") as fh:
...     fh.write(b"""foo\nbar\r\nbaz\r\n""")
...
14
>>> with open("test.data", newline="\r\n") as fh:
...     for line in fh:
...         print(repr(line))
...
'foo\nbar\r\n'
'baz\r\n'

Specify Newline character ('\n') in reading csv using Python

I want to read a csv file with each line dictated by a newline character ('\n') using Python 3. This is my code:
import csv
with open('input_data.csv', newline ='\n') as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
This above code gave error:
batch_data = [line for line in csvread].
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Reading these posts: CSV new-line character seen in unquoted field error, also tried these alternatives that I could think about:
with open('input_data.csv', 'rU', newline ='\n') as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
with open('input_data.csv', 'rU', newline ="\n") as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
No luck getting this right yet. Any suggestions?
I am also reading the documentation about newline: if newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
So my understanding of this newline method is:
1) it is a necessity,
2) does it indicate the input file would be split into lines by empty space character?
newline='' is correct in all csv cases, and failing to specify it is an error in many cases. The docs recommend it for the very reason you're encountering.
newline='' doesn't mean "empty space" is used for splitting; it's specifically documented on the open function:
If [newline] is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
So with newline='' all original \r and \n characters are returned unchanged. Normally, in universal newlines mode, any newline like sequence (\r, \n, or \r\n) is converted to \n in the input. But you don't want this for CSV input, because CSV dialects are often quite picky about what constitutes a newline (Excel dialect requires \r\n only).
Your code should be:
import csv
with open('input_data.csv', newline='') as f:
    csvread = csv.reader(f)
    batch_data = list(csvread)
If that doesn't work, you need to look at your CSV dialect and make sure you're initializing csv.reader correctly.
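A small self-contained check of the newline='' behaviour, assuming a throwaway CSV file with one quoted field that spans two lines (the file contents and the temporary path are made up for illustration):

```python
import csv
import os
import tempfile

csv_path = os.path.join(tempfile.mkdtemp(), "input_data.csv")

# Write raw bytes so the line endings on disk are exactly what we intend.
with open(csv_path, "wb") as f:
    f.write(b'id,comment\r\n1,"two\r\nlines"\r\n2,plain\r\n')

# newline="" hands the original \r\n sequences to the csv module untranslated.
with open(csv_path, newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```

The quoted field keeps its embedded \r\n, and the file still parses into three rows rather than four.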
As I understand it, the line terminator '\n' marks the end of each record in a .csv file, and commas separate the values within a line. You can see this by opening the file in Notepad or Excel.

Convert \r text to \n so readlines() works as intended

In Python, you can read a file and load its lines into a list by using
f = open('file.txt','r')
lines = f.readlines()
Each individual line is delimited by \n but if the contents of a line have \r then it is not treated as a new line. I need to convert all \r to \n and get the correct list lines.
If I do .split('\r') inside the lines I'll get lists inside the list.
I thought about opening a file, replace all \r to \n, closing the file and reading it in again and then use the readlines() but this seems wasteful.
How should I implement this?
f = open('file.txt','rU')
This opens the file with Python's universal newline support and \r is treated as an end-of-line.
If it's a concern, open in binary format and convert with this code:
from __future__ import with_statement
with open(filename, "rb") as f:
    s = f.read().replace('\r\n', '\n').replace('\r', '\n')
    lines = s.split('\n')
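The answer above targets Python 2. In Python 3, text mode already applies universal newlines by default, and the 'U' flag was removed from open() in Python 3.11. A rough Python 3 sketch of both routes follows; the temporary file and its contents are assumptions for illustration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "wb") as f:
    f.write(b"one\rtwo\r\nthree\nfour")

# Text mode with the default newline=None translates \r, \r\n and \n to \n.
with open(path, "r") as f:
    lines = f.readlines()

# Binary alternative: bytes.splitlines() splits on \r, \r\n and \n and drops them.
with open(path, "rb") as f:
    raw_lines = f.read().splitlines()

print(lines)
print(raw_lines)
```

The text-mode route keeps a trailing "\n" on each line, as readlines() normally does, while splitlines() returns the lines without their terminators.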
