Open a newline-separated text file in Python

Using the python open built-in function in this way:
with open('myfile.csv', mode='r') as rows:
    for r in rows:
        print(r.__repr__())
I obtain this output:
'col1,col2,col3\n'
'fst,snd,trd\n'
'1,2,3\n'
I don't want the \n character. Do you know some efficient way to remove that char (in place of the obvious r.replace('\n',''))?

If you are trying to read and parse csv file, Python's csv module might serve better:
import csv
with open('myfile.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(', '.join(row))
You cannot change the line terminator for the reader: it is hard-coded to recognise either '\r' or '\n' as end-of-line, which works for your case.
https://docs.python.org/3/library/csv.html#csv.Dialect.lineterminator
Again, in most cases I don't think you need to parse a csv file manually. Several situations are easier to handle with the csv module: a field containing the separator, a field containing a newline character, a field containing a quote character, etc.
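To make those cases concrete, here is a minimal sketch comparing csv.reader with a naive split. The sample data is made up for illustration, and io.StringIO stands in for a real file.

```python
import csv
import io

# Hypothetical sample data: one field contains a comma, another an embedded newline.
sample = 'name,comment\n"Smith, J.","line one\nline two"\n'

# newline='' leaves newline handling to the csv module.
rows = list(csv.reader(io.StringIO(sample, newline='')))

# A naive split treats the quoted comma as a separator and breaks the record apart.
naive = sample.splitlines()[1].split(',')

print(rows)   # two rows; the second has exactly two fields
print(naive)  # three broken fragments of the second record
```

The reader keeps 'Smith, J.' and the two-line comment as single fields, while the naive split cuts the record in the wrong places.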

You can use str.strip(), which (with no arguments) removes any whitespace from the start and end of a string:
for r in rows:
    print(r.strip())
If you want to remove only newlines, you can pass that character as an argument to strip:
for r in rows:
    print(r.strip('\n'))
For a clean solution, you could use a generator to wrap open, like this:
def open_no_newlines(*args, **kwargs):
    with open(*args, **kwargs) as f:
        for line in f:
            yield line.strip('\n')
You can then use open_no_newlines like this:
for line in open_no_newlines('myfile.csv', mode='r'):
    print(line)
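One detail worth checking is what each variant removes. The sketch below uses an in-memory stand-in for myfile.csv (the tab-filled field is an assumption added to show the difference) and compares rstrip('\n') with a bare strip().

```python
import io

# Hypothetical stand-in for myfile.csv; the second row ends with a field holding a tab.
fake_file = io.StringIO('col1,col2,col3\nfst,snd,\t\n1,2,3')

stripped = []
fully_stripped = []
for r in fake_file:
    stripped.append(r.rstrip('\n'))   # removes only trailing newlines
    fully_stripped.append(r.strip())  # removes all leading/trailing whitespace

print(stripped)
print(fully_stripped)
```

With rstrip('\n') the trailing tab in the second row survives; strip() with no arguments removes it along with the newline, which may or may not be what you want for delimited data.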

Related

CSV Writer truncates characters in sequence in Excel 2013

I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to csv file:
import os
import csv
def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line=line.strip()
            with open(csvfile, "ab") as f:
                if line.startswith("# Online_Resource"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])
Here is a sample of some strings from the original text file:
# Online_Resource: https://www.ncdc.noaa.gov/
## Corg% percent organic carbon,,,%,,paleoceanography,,,N
What is really bizarre is the final csv file looks good, except the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from the cell:
Oddly enough, too, there seems to be no formula to how the characters get jumbled each time I try to delete some after running the script. I tried encoding the csv file as unicode to no avail.
Thanks.
You've selected excel dialect but you overrode it with weird parameters:
You're using TAB as separator and line terminator, which creates a 1-line CSV file. Close enough to "truncated" to me
Also quotechar shouldn't be a space.
This conveyed a nice side-effect as you noted: the csv module actually splits the lines according to commas!
The code is inefficient and error-prone: you open the file in append mode inside the loop and create a new csv writer on every iteration. Both are better done once, outside the loop.
Also, comma split must be done by hand now. So even better: use csv module to read the file as well. My fix proposal for your routine:
import os
import csv
def csv_save_use(textfile, csvfile):
    with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel',
                           delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
            elif row[0].startswith("##"):
                write.writerow([row[0].lstrip("# ")]+row[1:])  # write row, stripping the first item from hashes
Note that the file isn't properly displayed in Excel unless you remove delimiter='\t' (which reverts to the default comma).
Also note that you need to replace open(csvfile, "wb") as f by open(csvfile, "w",newline='') as f for Python 3.
here's how the output looks now (note that the empty cells are because there are several commas in a row)
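For Python 3, the same routine might look like the sketch below. It assumes the default comma delimiter (so Excel opens the result directly) and text-mode files with newline='' on both sides; the function name is reused from the question.

```python
import csv

def csv_save_use(textfile, csvfile):
    # Python 3 sketch: text mode with newline='' so the csv module handles line endings.
    with open(textfile, newline='') as text, open(csvfile, "w", newline='') as f:
        writer = csv.writer(f, dialect='excel')  # default comma delimiter
        for row in csv.reader(text):
            if not row:
                continue  # skip empty rows
            first = row[0]
            if first.startswith("# Online_Resource"):
                writer.writerow([first.lstrip("# ")])
            elif first.startswith("##"):
                # keep the remaining comma-separated fields as separate cells
                writer.writerow([first.lstrip("# ")] + row[1:])
```

Opening the output once in "w" mode also means rerunning the script replaces the file instead of appending to it.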
more problems:
line=line.strip(" ") removes leading and trailing spaces. It doesn't remove \r or \n ... try line=line.strip() which removes leading and trailing whitespace
you get all your line including commas in one cell because you haven't split it up somehow ... like using a csv.reader instance. See here:
https://docs.python.org/2/library/csv.html#csv.reader
str.lstrip non-default arg is treated as a set of characters to be removed, so '## ' has the same effect as '# '. if guff.startswith('## ') then do guff = guff[3:] to get rid of the unwanted text
It is not very clear at all what the sentence containing "bizarre" means. We need to see exactly what is in the output csv file. Create a small test file with 3 records (1) with '# Online_Resource' (2) with "## " (3) none of the above, run your code, and show the output, like this:
print repr(open('testout.csv', 'rb').read())
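The point about str.lstrip treating its argument as a set of characters can be checked with a short sketch. The sample line and the strip_prefix helper are hypothetical, written only to show the difference.

```python
def strip_prefix(text, prefix):
    """Remove prefix once if present (hypothetical helper, not from the original code)."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

line = "## #1 sample, depth"

set_stripped = line.lstrip("# ")            # strips every leading '#' and ' '
prefix_stripped = strip_prefix(line, "## ")  # strips exactly the marker

print(set_stripped)     # the '#' that belongs to the data is gone too
print(prefix_stripped)  # only the leading "## " is removed
```

Slicing off a known prefix is the safer choice when the data itself may begin with one of the marker characters.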

How to remove newline within a column in delimited file?

I have a file that looks like this:
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
3333,CCC\nC,cccc\n
...
Where \n represents a newline.
When I read this line-by-line, it's read as:
1111,AAAA,aaaa\n
2222,BB\n
BB,bbbb\n
3333,CCC\n
C,cccc\n
...
This is a very large file. Is there a way to read a line until a specific number of delimiters, or remove the newline character within a column in Python?
I think after you read the line, you need to count the number of commas
aStr.count(',')
While the number of commas is too small (there can be more than one \n in the input), then read the next line and concatenate the strings
while aStr.count(',') < Num:
    another = file.readline()
    aStr = aStr + another
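The comma-counting idea can be wrapped in a generator so the file is still processed one line at a time. This is only a sketch: read_records and num_fields are hypothetical names, and it assumes that fields never contain commas and that the last field never contains a newline.

```python
import io

def read_records(lines, num_fields):
    """Yield complete records, joining physical lines until each record
    holds num_fields - 1 commas. Embedded newlines are removed."""
    record = ''
    for line in lines:
        record += line
        if record.count(',') >= num_fields - 1:
            yield record.rstrip('\n').replace('\n', '')
            record = ''
    if record:
        yield record.rstrip('\n')  # trailing incomplete record, returned as-is

# Hypothetical in-memory stand-in for the large file.
data = io.StringIO('1111,AAAA,aaaa\n2222,BB\nBB,bbbb\n3333,CCC\nC,cccc\n')
records = list(read_records(data, num_fields=3))
print(records)
```

Because it consumes any iterable of lines, the same function works on an open file object without loading the whole file into memory.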
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
According to your file \n here is not actually a newline character, it is plain text.
For actually stripping newline characters you could use strip() or other variations like rstrip() or lstrip().
If you work with large files you don't need to load full content in memory. You could iterate line by line until some counter or anything else.
I think perhaps you are parsing a CSV file that has embedded newlines in some of the text fields. Further, I suppose that the program that created the file put quotation marks (") around the fields.
That is, I supposed that your text file actually looks like this:
1111,AAAA,aaaa
2222,"BB
BB",bbbb
3333,"CCC
C",cccc
If that is the case, you might want to use code with better CSV support than just line.split(','). Consider this program:
import csv
with open('foo.csv') as fp:
    reader = csv.reader(fp)
    for row in reader:
        print row
Which produces this output:
['1111', 'AAAA', 'aaaa']
['2222', 'BB\nBB', 'bbbb']
['3333', 'CCC\nC', 'cccc']
Notice the five lines (delimited by newline characters) of the CSV file become 3 rows (some with embedded newline characters) in the CSV data structure.

Python: \n added after splitting csv file

Find this really weird, for some reason '\n' is added to the last entry in my list when I split a line from a .csv file.
Script
f = open("temp.csv")
lines = f.readlines()
headings = lines[0]
global heading_list
heading_list = headings.split(";")
print headings
I've printed out just headings itself and it doesn't seem to have '\n' at the end of it; it seems to appear only when the line is split at the semicolon.
.csv file
timestamp;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
10-20-39;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
10-20-41;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-43;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
10-20-45;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-47;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
10-20-49;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
Output from script
When you read a line in Python, the end of line character is not removed. You have to do this manually, for example with line.rstrip("\r\n"). It's not a problem with split, but with readlines.
Short answer - use the csv module. See below.
The new line character is present in the data that was read from the file. readlines() does not remove it, and in fact you will find that the new line character is present in headings :
>>> headings = lines[0]
>>> headings
'timestamp;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle\n'
A better way is to use splitlines() on the data read from the file. This will remove new lines, regardless of the type ('\n', '\r\n', '\r'):
>>> with open("temp.csv") as f:
...     lines = f.read().splitlines()
...
>>> headings = lines[0]
>>> headings
'timestamp;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle'
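In Python 2, readlines() does not split on old Mac newlines ('\r') unless the file is opened with universal newline support, so you should specify 'rU' as the mode (Python 3 text mode enables universal newlines by default, and the 'U' flag was removed in Python 3.11):
with open('temp.csv', 'rU') as f:
    ...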
One other thing worth mentioning is that processing files this way can consume a lot of memory if the file is large because the whole file is read in one go. Instead it is more efficient to iterate over the file like this:
with open('temp.csv', 'rU') as f:
    heading_list = next(f).rstrip().split(';')  # headings on the first line
    for line in f:
        process_data_row(line.rstrip().split(';'))
Finally, the real answer. You can avoid all of the mess above by using the csv module:
import csv
with open('temp.csv', 'rU') as csv_file:  # NB. 'rU' is important for handling mac newlines
    csv_data = csv.reader(csv_file, delimiter=';')
    heading_list = next(csv_data)
    for row in csv_data:
        process_data_row(row)
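On Python 3 the same approach would look roughly like the sketch below. The helper name read_semicolon_csv is hypothetical; newline='' takes over the role of 'rU' and lets the csv module handle the line endings.

```python
import csv

def read_semicolon_csv(path):
    """Return (headings, rows) from a ';'-delimited file (hypothetical Python 3 helper)."""
    # newline='' passes line endings through untranslated; the csv reader
    # itself recognises '\n', '\r\n' and '\r' as end-of-line.
    with open(path, newline='') as csv_file:
        csv_data = csv.reader(csv_file, delimiter=';')
        headings = next(csv_data)
        rows = list(csv_data)  # for very large files, iterate instead of building a list
    return headings, rows
```

Calling read_semicolon_csv('temp.csv') on the file from the question should give the ten column names as headings, with no trailing '\n' on the last one.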

CSV writing strings of text that need a unique delimiter

I wrote an HTML parser in python used to extract data to look like this in a csv file:
itemA, itemB, itemC, Sentence that might contain commas, or colons: like this,\n
so I used a delimiter ":::::" thinking that it wouldn't appear in the mined data
itemA, itemB, itemC, ::::: Sentence that might contain commas, or colons: like this,::::\n
This works for most of the thousands of lines, however, apparently a colon : offset this when I imported the csv in Calc.
My question is, what is the best or a unique delimiter to use when creating a csv with many variations of sentences that need to be separated with some delimiter? Am I understanding delimiters correctly in that they separate the values within a CSV?
As I suggested informally in a comment, unique just means you need to use some character that won't be in the data — chr(255) might be a good choice. For example:
Note: The code shown is for Python 2.x — see comments for a Python 3 version.
import csv
DELIMITER = chr(255)
data = ["itemA", "itemB", "itemC",
        "Sentence that might contain commas, colons: or even \"quotes\"."]
with open('data.csv', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter=DELIMITER)
    writer.writerow(data)
with open('data.csv', 'rb') as infile:
    reader = csv.reader(infile, delimiter=DELIMITER)
    for row in reader:
        print row
Output:
['itemA', 'itemB', 'itemC', 'Sentence that might contain commas, colons: or even "quotes".']
If you're not using the csv module and instead are writing and/or reading the data manually, then it would go something like this:
with open('data.csv', 'wb') as outfile:
    outfile.write(DELIMITER.join(data) + '\n')
with open('data.csv', 'rb') as infile:
    row = infile.readline().rstrip().split(DELIMITER)
    print row
Yes, delimiters separate values within each line of a CSV file. There are two strategies to delimiting text that has a lot of punctuation marks. First, you can quote the values, e.g.:
Value 1, Value 2, "This value has a comma, <- right there", Value 4
The second strategy is to use tabs (i.e., '\t').
Python's built-in CSV module can both read and write CSV files that use quotes. Check out the example code under the csv.reader function. The built-in csv module will handle quotes correctly, e.g. it will escape quotes that are in the value itself.
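CSV files usually use double quotes " to wrap fields that might contain a field separator like a comma. If the field itself contains a double quote, the usual convention (and the csv module's default) is to escape it by doubling it: "". Some dialects use a backslash escape (\") instead.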
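As a quick check of the quoting strategy, this sketch writes the sample row with the csv module's default dialect and reads it back. io.StringIO is used in place of a real file, and the code is written for Python 3.

```python
import csv
import io

data = ["itemA", "itemB", "itemC",
        'Sentence that might contain commas, colons: or even "quotes".']

buffer = io.StringIO(newline='')
csv.writer(buffer).writerow(data)  # default dialect: comma delimiter, minimal quoting
line = buffer.getvalue()
print(line)

buffer.seek(0)
round_trip = next(csv.reader(buffer))
print(round_trip == data)
```

Only the last field gets wrapped in quotes, its inner quotes are doubled, and reading the line back returns the original four values, so no exotic delimiter is needed.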

how to properly read and modify a file using python

I'm trying to remove all (non-space) whitespace characters from a file and replace all spaces with commas. Here is my current code:
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
content = file_get_contents('file.txt')
content = content.split
content = str(content).replace(' ',',')
with open ('file.txt', 'w') as f:
    f.write(content)
when this is run, it replaces the contents of the file with:
<built-in,method,split,of,str,object,at,0x100894200>
The main issue you have is that you're assigning the method content.split to content, rather than calling it and assigning its return value. If you print out content after that assignment, it will be: <built-in method split of str object at 0x100894200> which is not what you want. Fix it by adding parentheses, to make it a call of the method, rather than just a reference to it:
content = content.split()
I think you might still have an issue after fixing that, though. str.split returns a list, which you're then turning back into a string using str (before trying to substitute commas for spaces). That's going to give you square brackets and quotation marks, which you probably don't want, and you'll get a bunch of extra commas. Instead, I suggest using the str.join method like this:
content = ",".join(content) # joins all members of the list with commas
I'm not exactly sure if this is what you want though. Using split is going to replace all the newlines in the file, so you're going to end up with a single line with many, many words separated by commas.
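If keeping the line structure matters, one possible variation is to split and join each line separately. This is a sketch under that assumption; spaces_to_commas is a hypothetical helper, and runs of spaces or tabs collapse into a single comma.

```python
def spaces_to_commas(text):
    """Join the whitespace-separated words of each line with commas, keeping line breaks."""
    return "\n".join(",".join(line.split()) for line in text.splitlines())

result = spaces_to_commas("a b  c\nd\te f\n")
print(result)
```

Each input line becomes one comma-separated output line instead of everything ending up on a single line.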
When you split the content, you forgot to call the function. Also, once you split, it's a list, so you should loop to replace things and then join the items back together.
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
content = file_get_contents('file.txt')
content = content.split()  # <- HERE: call the method
content = [c.replace(' ',',') for c in content]
content = ",".join(content)  # join with commas; "".join would run the words together
with open ('file.txt', 'w') as f:
    f.write(content)
if you are looking to replace characters i think you would be better off using python's re module for regular expressions. sample code would be as follows:
import re
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
if __name__=='__main__':
    content = file_get_contents('file.txt')
    # First replace any spaces with commas, then remove any other whitespace
    new_content = re.sub(r'\s', '', re.sub(' ', ',', content))
    with open ('new_file.txt', 'w') as f:
        f.write(new_content)
it's more succinct than trying to split all the time and gives you a little bit more flexibility. just also be careful with how large of a file you are opening and reading with your code - you may want to consider using a line iterator or something instead of reading all the file contents at once
