Specify Newline character ('\n') in reading csv using Python

I want to read a csv file with each line dictated by a newline character ('\n') using Python 3. This is my code:
import csv
with open('input_data.csv', newline='\n') as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
The above code gave this error:
batch_data = [line for line in csvread]
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
After reading the post "CSV new-line character seen in unquoted field error", I also tried these alternatives:
with open('input_data.csv', 'rU', newline='\n') as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
with open('input_data.csv', 'rU', newline="\n") as f:
    csvread = csv.reader(f)
    batch_data = [line for line in csvread]
No luck getting this right yet. Any suggestions?
I am also reading the documentation about newline: if newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
So my understanding of this newline argument is:
1) it is a necessity, but
2) does it mean the input file is split into lines on an empty space character?

newline='' is correct in all csv cases, and failing to specify it is an error in many cases. The docs recommend it for the very reason you're encountering.
newline='' doesn't mean "empty space" is used for splitting; it's specifically documented on the open function:
If [newline] is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
So with newline='' all original \r and \n characters are returned unchanged. Normally, in universal newlines mode, any newline-like sequence (\r, \n, or \r\n) is converted to \n in the input. But you don't want this for CSV input, because CSV dialects are often quite picky about what constitutes a newline (Excel dialect requires \r\n only).
Your code should be:
import csv
with open('input_data.csv', newline='') as f:
    csvread = csv.reader(f)
    batch_data = list(csvread)
If that doesn't work, you need to look at your CSV dialect and make sure you're initializing csv.reader correctly.
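To see what newline='' buys you, here is a minimal sketch. The sample bytes and the temporary file are assumptions made up for illustration; only open(..., newline='') and csv.reader come from the answer above. The first record has a line break inside a quoted field, and the records themselves end with \r\n.

```python
import csv
import os
import tempfile

# Hypothetical sample data: a quoted field containing an embedded \n,
# with \r\n as the record terminator.
raw = b'id,comment\r\n1,"first line\nsecond line"\r\n2,plain\r\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_data.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # newline='' hands \r and \n to the csv module untranslated,
    # so csv.reader itself decides where each record ends.
    with open(path, newline="") as f:
        batch_data = list(csv.reader(f))

print(batch_data)
```

The embedded \n stays inside the second field of the first data row instead of splitting that row in two.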

As I understand it, the line terminator '\n' means that in a .csv file the values are separated by commas and each record starts on a new line, whether you open the file in Notepad or in Excel.

Related

replace ^M(control M character) in a text file in python

The file is like this:
This line has control character ^M this is bad
I will try it
I want to remove control M characters in the file, and create a new file like this using Python
This line has control character this is bad
I will try it
I tried the methods I found on Stack Overflow, like this:
line.replace("\r", "r")
and
line.replace("\r\n", "r")
Here is part of the code snippet:
with open(file_path, "r") as input_file:
    lines = input_file.readlines()
new_lines = []
for line in lines:
    new_line = line.replace("\r", "")
    new_lines.append(new_line)
new_file_name = "replace_control_char.dat"
new_file_path = os.path.join(here, data_dir, new_file_name)
with open(new_file_path, "w") as output_file:
    for line in new_lines:
        output_file.write(line)
However, the new file I got is:
This line has control character
this is bad
I will try it
"This line has control character" and " this is bad" are not on the same line. I expected that removing the control M character would put these two phrases on the same line.
Can someone help me solve this issue?
Thanks,
Arthur

You cannot rely on text mode in that case.
In text mode a lone \r is treated as a line break (on Windows too, even though the "official" line terminator there is \r\n; on classic Macintosh the line terminator could be \r alone). Text mode converts such line breaks to \n, or drops the \r when it is followed by \n, so it destroys the information you need.
Because universal newlines are the default, this code also fails on Unix/Linux. Python behaves the same on all platforms:
Python doesn't depend on the underlying operating system's notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
If you want to remove those, you have to use binary mode.
with open(file_path, "rb") as input_file:
    contents = input_file.read().replace(b"\r", b"")
with open(file_path, "wb") as output_file:
    output_file.write(contents)
That code will remove all \r characters (including those in line terminators). That works, but if your aim is just to remove stray \r and preserve line endings, another method is required.
One way to do it is to use a regular expression, which can accept binary (bytes) as well:
re.sub(rb"\r([^\n])",rb"\1",contents)
That regular expression removes \r chars only if not followed by \n chars, effectively preserving CR+LF Windows end-of-line sequences.
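A possible variation, offered as a sketch rather than as the answer's own method: the pattern \r([^\n]) needs a character after the \r, so it leaves a \r at the very end of the data and one of two consecutive \r characters. A negative lookahead avoids both cases. The helper name strip_stray_cr and the sample bytes are made up for illustration.

```python
import re


def strip_stray_cr(data: bytes) -> bytes:
    """Remove every \\r that is not immediately followed by \\n."""
    # (?!\n) is a negative lookahead: it checks the next byte without
    # consuming it, so a trailing \r at end of data is matched as well.
    return re.sub(rb"\r(?!\n)", b"", data)


sample = b"This line has control character\r this is bad\r\nI will try it\r"
cleaned = strip_stray_cr(sample)
print(cleaned)
```

Here the stray \r in the first line and the trailing \r are removed, while the \r\n terminator is kept.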

Replacing \n while keeping \r\n intact

I have a huge CSV file (196244 lines) that contains \n in places other than line endings. I want to remove those \n but keep \r\n intact.
I've tried line.replace, but it doesn't seem to recognize \r\n, so next I tried a regex:
with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            line = re.sub("(?<!\r)\n", " ", line)
            fixed.write(line)
but it is not keeping \r\n; it is removing everything. I can't do it in Notepad++ because it crashes on this file.

You are not exposing the line breaks to the regex engine. Also, the line breaks are "normalized" to LF when using open with r mode. To keep them all in the input, you can read the file in binary mode using b. Then, you need to remember to also use the b prefix with the regex pattern and replacement.
You can use
with open(filetoread, "rb") as inf:
    with open(filetowrite, "wb") as fixed:
        fixed.write(re.sub(b"(?<!\r)\n", b" ", inf.read()))
Now, the whole file will be read into a single bytes object (with inf.read()) and the line breaks will be matched, and eventually replaced.
Pay attention to
"rb" when reading file in
"wb" to write file out
re.sub(b"(?<!\r)\n", b" ", inf.read()) contains b prefixes with string literals, and inf.read() reads in the file contents into single variable.
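The difference between the two modes can be checked on a small file. This is a minimal sketch with made-up sample bytes and a temporary file. It shows that the default text mode has already turned every \r\n into \n before a regex could see it, while binary mode keeps the bytes as they are on disk.

```python
import os
import re
import tempfile

# Hypothetical sample: one stray \n inside a record, \r\n as the real terminator.
raw = b"id,note\r\n1,first part\nsecond part\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode with the default newline=None: \r\n is translated to \n.
    with open(path, "r") as f:
        text_view = f.read()

    # Binary mode: nothing is translated.
    with open(path, "rb") as f:
        binary_view = f.read()

# Replace only the \n that is not preceded by \r.
fixed = re.sub(rb"(?<!\r)\n", b" ", binary_view)

print(repr(text_view))
print(repr(fixed))
```

In text_view no \r is left, which is why the lookbehind in the question matched every line break.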

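When you open a file with a plain open() call, TextIOWrapper presents every kind of newline in the file as a simple \n.
Explicitly setting newline="\r\n" when reading makes only \r\n end a line, so a lone \n stays inside the line. For writing, use newline="" so that the "\r\n" you write is passed through as-is (with newline="\r\n" on the output file, every \n written is translated to \r\n, which would turn "\r\n" into "\r\r\n").
with open(path_src, newline="\r\n") as fh_src:
    with open(path_dest, "w", newline="") as fh_dest:
        for line in fh_src:  # file-likes are iterable by-lines
            if line.endswith("\r\n"):  # the last line may lack a terminator
                line = line[:-2]
            fh_dest.write(line.replace("\n", " "))
            fh_dest.write("\r\n")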
Content example:
>>> with open("test.data", "wb") as fh:
...     fh.write(b"""foo\nbar\r\nbaz\r\n""")
...
14
>>> with open("test.data", newline="\r\n") as fh:
...     for line in fh:
...         print(repr(line))
...
'foo\nbar\r\n'
'baz\r\n'

Python write lines to a UCS-2 LE BOM encoded text file

I have an array of text strings in Python 3.7.
Now I want to write them all to a text file. The problem is that the text file has to be encoded as UCS-2 LE BOM (that's what Notepad++ reports as its encoding), otherwise the file won't work in further processing.
How do I write the text strings to the file in that encoding while keeping the strings readable?
with open(textpath, "w", encoding='utf-16-le') as f:
    for line in newlines:
        f.write(line)
This does not work because it generates gibberish text...

Try writing an explicit BOM:
with open(textpath, "w", encoding='utf-16-le') as f:
    f.write('\ufeff')
    for line in newlines:
        f.write(line)
        # Perhaps you also need to add a newline after each line?
        f.write('\n')
Obviously revert the last addition if your lines already have newlines.
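One way to check the result is to read the file back as bytes and look for the little-endian BOM. This is a minimal sketch: the sample strings and the temporary path are made up, and newline="" is an assumption added only so the written bytes are the same on every platform.

```python
import codecs
import os
import tempfile

newlines = ["first line", "zweite Zeile: äöü"]  # hypothetical sample data

with tempfile.TemporaryDirectory() as tmp:
    textpath = os.path.join(tmp, "out.txt")

    # 'utf-16-le' never writes a BOM on its own, so write U+FEFF explicitly.
    with open(textpath, "w", encoding="utf-16-le", newline="") as f:
        f.write("\ufeff")
        for line in newlines:
            f.write(line + "\n")

    with open(textpath, "rb") as f:
        raw = f.read()

# The file should start with the bytes FF FE (the UTF-16 LE BOM).
has_bom = raw.startswith(codecs.BOM_UTF16_LE)
# The generic 'utf-16' codec consumes the BOM and picks the byte order from it.
decoded = raw.decode("utf-16")

print(has_bom)
print(repr(decoded))
```

If has_bom is True and decoded matches the input strings, the file should have the same layout that Notepad++ labels "UCS-2 LE BOM", at least for characters in the Basic Multilingual Plane.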

CSV Writer truncates characters in sequence in Excel 2013

I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to csv file:
import os
import csv
def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line=line.strip()
            with open(csvfile, "ab") as f:
                if line.startswith("# Online_Resource"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
Here is a sample of some strings from the original text file:
# Online_Resource: https://www.ncdc.noaa.gov/
## Corg% percent organic carbon,,,%,,paleoceanography,,,N
What is really bizarre is the final csv file looks good, except the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from the cell:
Oddly enough, too, there seems to be no formula to how the characters get jumbled each time I try to delete some after running the script. I tried encoding the csv file as unicode to no avail.
Thanks.

You've selected excel dialect but you overrode it with weird parameters:
You're using TAB as separator and line terminator, which creates a 1-line CSV file. Close enough to "truncated" to me
Also quotechar shouldn't be a space.
This conveyed a nice side-effect as you noted: the csv module actually splits the lines according to commas!
The code is inefficient and error-prone: you're opening the file in append mode in the loop and create a new csv writer each time. Better done outside the loop.
Also, comma split must be done by hand now. So even better: use csv module to read the file as well. My fix proposal for your routine:
import os
import csv
def csv_save_use(textfile, csvfile):
    with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel',
                           delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                write.writerow([row[0].lstrip("# ")]+row[1:])  # write row, stripping the first item from hashes
Note that the file isn't properly displayed in Excel unless you remove delimiter='\t' (which reverts to the default comma).
Also note that for Python 3 you need to replace open(csvfile, "wb") as f with open(csvfile, "w", newline='') as f.
Here's how the output looks now (note that the empty cells are because there are several commas in a row):
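For reference, a Python 3 version of the routine might look like the sketch below. It is an adaptation, not the answer's exact code: both files are opened in text mode with newline='' as the csv documentation recommends, and the "rU" mode is dropped because the "U" flag was removed in Python 3.11.

```python
import csv


def csv_save_use(textfile, csvfile):
    """Copy the '# Online_Resource' and '##' lines of textfile to a tab-separated file."""
    # newline="" on both files lets the csv module handle line endings itself.
    with open(textfile, newline="") as text, open(csvfile, "w", newline="") as f:
        writer = csv.writer(f, dialect="excel", delimiter="\t")
        for row in csv.reader(text, delimiter=","):
            if not row:
                continue  # skip empty rows
            if row[0].startswith("# Online_Resource"):
                writer.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                # keep the remaining comma-separated fields as separate cells
                writer.writerow([row[0].lstrip("# ")] + row[1:])
```

Opening the output with "w" instead of appending inside the loop means each run starts from an empty file, which matches the rewritten routine above.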

more problems:
line=line.strip(" ") removes leading and trailing spaces. It doesn't remove \r or \n ... try line=line.strip() which removes leading and trailing whitespace
you get all your line including commas in one cell because you haven't split it up somehow ... like using a csv.reader instance. See here:
https://docs.python.org/2/library/csv.html#csv.reader
str.lstrip non-default arg is treated as a set of characters to be removed, so '## ' has the same effect as '# '. if guff.startswith('## ') then do guff = guff[3:] to get rid of the unwanted text
It is not very clear at all what the sentence containing "bizarre" means. We need to see exactly what is in the output csv file. Create a small test file with 3 records (1) with '# Online_Resource' (2) with "## " (3) none of the above, run your code, and show the output, like this:
print(repr(open('testout.csv', 'rb').read()))

Convert \r text to \n so readlines() works as intended

In Python, you can read a file and load its lines into a list by using
f = open('file.txt','r')
lines = f.readlines()
Each individual line is delimited by \n but if the contents of a line have \r then it is not treated as a new line. I need to convert all \r to \n and get the correct list lines.
If I do .split('\r') inside the lines I'll get lists inside the list.
I thought about opening the file, replacing all \r with \n, closing the file, reading it in again and then using readlines(), but this seems wasteful.
How should I implement this?

f = open('file.txt','rU')
This opens the file with Python's universal newline support and \r is treated as an end-of-line.
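In Python 3, universal newlines are already the default for text mode, and the 'U' flag was deprecated and then removed in Python 3.11, so a plain "r" is enough there. A minimal sketch with made-up sample bytes:

```python
import os
import tempfile

# Hypothetical sample mixing all three line-ending styles.
raw = b"one\rtwo\r\nthree\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Default newline=None: \r, \r\n and \n each end a line,
    # and every one of them is returned to the caller as \n.
    with open(path, "r") as f:
        lines = f.readlines()

print(lines)  # ['one\n', 'two\n', 'three\n']
```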

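If it's a concern, open in binary format and convert with this code (the b prefixes are needed on Python 3, where a binary read returns bytes, so the resulting lines are bytes as well):
from __future__ import with_statement
with open(filename, "rb") as f:
    s = f.read().replace(b'\r\n', b'\n').replace(b'\r', b'\n')
lines = s.split(b'\n')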
