I am creating a CSV file with Python's csv writer. I want the same data as the input CSV, plus a few extra lines of text that contain double quotes.
I have successfully added the text I wanted, but I'm struggling with the double quotes in it.
The output file gives me 3 double quotes instead of just 1.
Here is my code so far:
import csv
with open('test.txt', newline='') as f:
    r = csv.reader(f, delimiter='\t')
    data = [line for line in r]
with open('abc.csv', 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')
    w.writerow(["some of my text"])
    w.writerow(["some more: 123456"])
    w.writerow(["even more: 5555"])
    w.writerow([f"with a variable: {time}"])
    w.writerows(data)
The inserted text comes out like this:
"""some of my text"""
"""some more: 123456"""
"""even more: 5555"""
"""with a variable: 28th oct"""
Please suggest what I am missing, so that I can get rid of these triple quotes.
3 double quotes instead of just 1
That is correct. In CSV, the " is used to surround values containing special characters. In order to include a literal " it must be escaped by doubling it.
So you get one " for the start/end of the value, and then two "" to encode the quote in the value.
RFC 4180 §2 ¶7
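A small sketch of this rule, using only the standard library. The helper name to_csv_text and the in-memory buffer are illustrative assumptions, not part of the question's code. It writes one value that itself contains literal quotes and one that does not, then reads the result back.

```python
import csv
import io

def to_csv_text(rows):
    """Write rows with csv.writer into memory and return the CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

# The first value contains literal double quotes, the second does not.
text = to_csv_text([['"some of my text"'], ["some of my text"]])
print(text)
# """some of my text"""
# some of my text

# Reading the text back undoes the escaping.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```

If the aim is exactly one pair of quotes around each field in the file, one option is to pass values without quote characters and set quoting=csv.QUOTE_ALL on the writer.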
I need to prepare a .csv file so that double quotation marks get ignored by the program processing it (ArcMap). Arc was blending the contents of all following cells on that line into any previous one containing double quotation marks. For example:
...and no further rows would get processed at all.
How does one make a CSV escape Double Quotation Marks for successful processing in ArcMap (10.2)?
Let's say df is the DataFrame created from the CSV file as follows:
import pandas as pd
df = pd.read_csv('filename.csv')
Assume that comments is the name of the column where the issue occurs, i.e. you want to replace every double quote (") with an empty string ('').
The following one-liner does that. It replaces every double quote in every row of df['comments'] with an empty string.
df['comments'] = df['comments'].apply(lambda x: x.replace('"', ''))
The lambda receives each value of df['comments'] in turn as x.
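EDIT: To escape the double quotes you need to convert the string to its raw format. Again, another one-liner very similar to the one above.
df['comments'] = df['comments'].apply(lambda x: r'{0}'.format(x))
The r before the string marks a raw string literal, in which backslashes are not treated as escape characters. Note that it applies only to the literal '{0}' itself, so the values passed to format come back unchanged.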
You could try reading the file with the csv module and writing it back in the hopes that the output format will be more digestible for your other tool. See the docs for formatting options.
import csv
with open('in.csv', 'r', newline='') as fin, open('out.csv', 'w', newline='') as fout:
    reader = csv.reader(fin, delimiter='\t')
    writer = csv.writer(fout, delimiter='\t')
    # alternative:
    # writer = csv.writer(fout, delimiter='\t', escapechar='\\', doublequote=False)
    for line in reader:
        writer.writerow(line)
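To make the two formatting options concrete, here is a hedged sketch that writes the same row both ways and reads it back. The helper names render and parse and the sample row are assumptions for illustration. Whether ArcMap accepts either style is not something this sketch can confirm.

```python
import csv
import io

def render(row, **fmt):
    """Write a single row with the given formatting options; return the text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(row)
    return buf.getvalue()

def parse(text, **fmt):
    """Read the first row back using the same formatting options."""
    return next(csv.reader(io.StringIO(text), **fmt))

row = ['He said "hello"', "plain"]

# Default dialect: embedded quotes are escaped by doubling them.
doubled = render(row)

# Alternative: no doubling, embedded quotes are prefixed with a backslash.
backslash = {"doublequote": False, "escapechar": "\\"}
escaped = render(row, **backslash)

print(doubled)
print(escaped)

# Both variants round-trip as long as the reader uses matching options.
assert parse(doubled) == row
assert parse(escaped, **backslash) == row
```

The consuming tool has to understand whichever convention is chosen, so it is worth testing a small file in the target program before converting everything.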
What worked for me was writing a module to do some "pre-processing" of the CSV file as follows. The key line is where the "writer" has the parameter "quoting=csv.QUOTE_ALL". Hopefully this is useful to others.
def work(Source_CSV):
    from __main__ import *
    import csv, arcpy, os
    # Derive name and location for newly-formatted .csv file
    Head = os.path.split(Source_CSV)[0]
    Tail = os.path.split(Source_CSV)[1]
    name = Tail[:-4]
    new_folder = "formatted"
    new_path = os.path.join(Head,new_folder)
    Formatted_CSV = os.path.join(new_path,name+"_formatted.csv")
    #arcpy.AddMessage("Formatted_CSV = "+Formatted_CSV)
    # Populate the new .csv file with quotation marks around all field contents ("quoting=csv.QUOTE_ALL")
    with open(Source_CSV, 'rb') as file1, open(Formatted_CSV,'wb') as file2:
        # Instantiate the .csv reader
        reader = csv.reader(file1, skipinitialspace=True)
        # Write column headers without quotes
        headers = reader.next() # 'next' function actually begins at the first row of the .csv.
        str1 = ''.join(headers)
        writer = csv.writer(file2)
        writer.writerow(headers)
        # Write all other rows wrapped in double quotes
        writer = csv.writer(file2, delimiter=',', quoting=csv.QUOTE_ALL)
        # Write all other rows, at first quoting none...
        #writer = csv.writer(file2, quoting=csv.QUOTE_NONE,quotechar='\x01')
        for row in reader:
            # ...then manually doubling double quotes and wrapping 3rd column in double quotes.
            #row[2] = '"' + row[2].replace('"','""') + '"'
            writer.writerow(row)
    return Formatted_CSV
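The module above is written in Python 2 style ('rb'/'wb' modes, reader.next()). For readers on Python 3, the core of the pre-processing step can be sketched without arcpy. This is an assumption-laden rewrite rather than the original module: the function name quote_all_rows is hypothetical, and it works on file-like objects instead of paths.

```python
import csv
import io

def quote_all_rows(src, dst):
    """Copy CSV rows from src to dst: header as-is, data rows fully quoted."""
    reader = csv.reader(src, skipinitialspace=True)
    plain = csv.writer(dst, lineterminator="\n")
    quoted = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    headers = next(reader, None)  # None if the input is empty
    if headers is not None:
        plain.writerow(headers)
    for row in reader:
        quoted.writerow(row)

src = io.StringIO('id,comment\n1,He said "hi"\n2,plain\n')
dst = io.StringIO()
quote_all_rows(src, dst)
print(dst.getvalue())
# id,comment
# "1","He said ""hi"""
# "2","plain"
```

With real files, both would be opened in text mode with newline='' as the csv documentation recommends.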
I want to find the delimiter used in a text file.
The text looks like this:
ID; Name
1; John Mak
2; David H
4; Herry
The file contains tabs along with the delimiter.
I tried the following:
import csv
with open(filename, 'r') as f1:
    dialect = csv.Sniffer().sniff(f1.read(1024), "\t")
    print('Delimiter:', dialect.delimiter)
The result shows: Delimiter:
Expected result: Delimiter: ;
sniff settles on exactly one character as the delimiter. Since your file puts two characters between fields (the semicolon and a tab), sniff would simply pick one of them. But because you also pass the optional second argument, sniff only considers the characters in that value as possible delimiters, which in your case is '\t' (a tab, which is why nothing visible shows up in your print output).
From sniff's documentation:
If the optional delimiters parameter is given, it is interpreted as a
string containing possible valid delimiter characters.
Sniffing is not guaranteed to work.
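As a rough illustration of how the delimiters argument narrows the result, here is a sketch under the assumption that each field separator in the file is a semicolon followed by a tab. The sample text and the helper name sniff_delimiter are made up for this example.

```python
import csv

# Assumed layout: a semicolon followed by a tab between the two columns.
sample = "ID;\tName\n1;\tJohn Mak\n2;\tDavid H\n4;\tHerry\n"

def sniff_delimiter(text, candidates):
    """Return the delimiter the Sniffer picks among the candidate characters."""
    return csv.Sniffer().sniff(text, delimiters=candidates).delimiter

# Restricting the candidates to a tab can only ever yield a tab...
print(repr(sniff_delimiter(sample, "\t")))
# ...while restricting them to a semicolon yields the semicolon.
print(repr(sniff_delimiter(sample, ";")))
```

Printing with repr makes an invisible delimiter such as a tab show up as '\t'.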
Here is one approach that will work with any kind of delimiter.
You start with what you assume is the most common delimiter ; if that fails, then you try others until you manage to parse the row.
import csv
with open('sample.csv') as f:
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        try:
            a,b = row
        except ValueError:
            try:
                a,b = row[0].split(None, 1)
            except ValueError:
                a,b = row[0].split('\t', 1)
        print('{} - {}'.format(a.strip(), b.strip()))
You can play around with this at this repl.it link; edit the sample.csv file there if you want to try out different delimiters.
You can combine sniffing with this to catch any odd delimiters that are not known to you.
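One possible way to combine the two ideas, sketched under assumptions: detect_delimiter and read_rows are hypothetical helper names, and the candidate characters and the ';' default are choices made for this example only.

```python
import csv
import io

def detect_delimiter(sample, candidates=";,|", default=";"):
    """Try to sniff the delimiter; fall back to a default if sniffing fails."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return default

def read_rows(text):
    """Parse text with the detected delimiter and strip stray whitespace."""
    delimiter = detect_delimiter(text[:1024])
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[field.strip() for field in row] for row in reader if row]

text = "ID; Name\n1; John Mak\n2; David H\n4; Herry\n"
print(read_rows(text))
# [['ID', 'Name'], ['1', 'John Mak'], ['2', 'David H'], ['4', 'Herry']]
```

When the Sniffer cannot settle on any of the candidates it raises csv.Error, which is why the fallback sits in the except branch.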
I have a string in a variable str1 that already holds values like 'a','b','c' (comma separated).
I want this string to be the header of a CSV file that I am writing.
This works when I use writer.writerow of the csv writer,
but it puts "" around the whole " 'a','b','c' ".
Instead, I want 'a' to be the header for col1, 'b' the header for col2, and so on.
Here is the data:
Printing only str1 gives:
'AV, AZ$$38060','BB, BZ$$31100','CO, CZ$$31120'.... till X.
But when I use it in writer.writerow,
it gives:
'YEAR'," 'AV, AZ$$38060','BB, BZ$$31100','CO, CZ$$31120' ", 'Index'.
How do I make the csv writer understand that I want str1 without the enclosing double quotes?
Code: (pretty basic so far)
import csv
with open('478558_output_new.csv') as sample, \
        open('478558_output_final.csv', 'w') as output:
    reader = csv.reader(sample)
    writer = csv.writer(output)
    # discard input header
    next(reader)
    # write new output header
    writer.writerow(['YEAR', str1, 'Index'])
You can split a string into a list with the split('CHARACTER') method:
yourString = " 'a','b','c' "
yourList = yourString.split(',')
In your code:
writer.writerow(['YEAR']+str1.split(',')+['Index'])
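One caveat with a plain split(','): the values pasted in the question contain commas inside the single quotes, so they would be cut in two. A hedged alternative is to let the csv module parse the string with a single quote as the quote character. The helper name split_quoted is an assumption, and str1 below is only the visible part of the pasted data.

```python
import csv

def split_quoted(text):
    """Split a comma-separated string whose items are wrapped in single quotes."""
    return next(csv.reader([text], quotechar="'", skipinitialspace=True))

# Only the part of str1 shown in the question; the real value is longer.
str1 = "'AV, AZ$$38060','BB, BZ$$31100','CO, CZ$$31120'"

naive = str1.split(',')
print(len(naive))   # 6 pieces, because each item contains a comma

header = ['YEAR'] + split_quoted(str1) + ['Index']
print(header)
# ['YEAR', 'AV, AZ$$38060', 'BB, BZ$$31100', 'CO, CZ$$31120', 'Index']
```

Passing header to writer.writerow then gives one column per item, and the csv writer quotes the items that contain commas.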
I wrote a small script in Python to accept a csv file similar to what comes from Excel and output it as a pipe delimited file. When encountering a cell containing multiple lines, it currently adds a backslash (as that is what I specified as the escape character) at the end of the line and continues the cell on the next line. What I want to do though is be able to specify a space character or a string that the new line would be replaced with instead of the backslash and continue the record on the same line. I am having some trouble accomplishing this though. Is there an easy way to do this using the csv module? What I have so far:
fout = open (tfile, "wt")
cout = csv.writer(fout, delimiter = '|', quotechar = '', quoting = csv.QUOTE_NONE, lineterminator = '\n', escapechar='\\')
cin = csv.reader(fin)
for row in cin:
    cout.writerow(row)
I worked out the answer to my issue. Instead of using a backslash as the escape character, I use '\n' and then do a find and replace of the new line for every field in each row. I.e.
cout = csv.writer(fout, delimiter = '|', quotechar = '', quoting = csv.QUOTE_NONE, lineterminator = '\n', escapechar='\n')
and
for row in cin:
    # replace embedded newlines in every field, modifying the list in place
    row[:] = [field.replace("\n", " ") for field in row]
    cout.writerow(row)
The find and replace of the list is based on the response to How to modify list entries during for loop?
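For comparison, here is a self-contained Python 3 sketch of the same idea. It is not the poster's script: the function name to_pipe_delimited and its replacement parameter are hypothetical, and it keeps a backslash as the escape character, which is then only needed for fields that contain a pipe or a double quote.

```python
import csv
import io

def to_pipe_delimited(src, dst, replacement=" "):
    """Copy CSV rows from src to dst as pipe-delimited, single-line records."""
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="|", quoting=csv.QUOTE_NONE,
                        escapechar="\\", lineterminator="\n")
    for row in reader:
        # Replace embedded line breaks so each record stays on one line.
        cleaned = [field.replace("\r\n", replacement).replace("\n", replacement)
                   for field in row]
        writer.writerow(cleaned)

src = io.StringIO('a,"line one\nline two",c\n')
dst = io.StringIO()
to_pipe_delimited(src, dst)
print(dst.getvalue())
# a|line one line two|c
```

The replacement argument can be any string, so a marker such as ' / ' could be used instead of a space.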
I'm parsing a very big csv (big = tens of gigabytes) file in python and I need only the value of the first column of every line. I wrote this code, wondering if there is a better way to do it:
delimiter = ','
f = open('big.csv','r')
for line in f:
    pos = line.find(delimiter)
    id = int(line[0:pos])
Is there a more effective way to get the part of the string before the first delimiter?
Edit: I do know about the csv module (and I have used it occasionally), but I do not need to load every line of this file into memory; I only need the first column. So let's focus on string parsing.
>>> a = '123456'
>>> print(a.split('2', 1)[0])
1
>>> print(a.split('4', 1)[0])
123
But, if you're dealing with a CSV file, then:
import csv
with open('some.csv', newline='') as fin:
    for row in csv.reader(fin):
        print(int(row[0]))
And the csv module will handle quoted columns containing quotes etc...
If the first field can't contain an escaped delimiter (as in your case, where the first field is an integer) and no field contains embedded newlines (i.e. each row corresponds to exactly one physical line in the file), then the csv module is overkill, and you could use your code from the question or line.split(',', 1) as suggested by @Jon Clements.
To handle occasional lines that have no delimiter in them you could use str.partition:
with open('big.csv', 'rb') as file:
    for line in file:
        first, sep, rest = line.partition(b',')
        if sep: # the line has ',' in it
            process_id(int(first)) # or `yield int(first)`
Note: s.split(',', 1)[0] silently returns a wrong result (the whole string) if there is no delimiter in the string.
'rb' file mode is used to avoid unnecessary end-of-line manipulation (and implicit decoding to Unicode on Python 3). It is safe to use if the csv file has '\n' at the end of each row, i.e. the newline is either '\n' or '\r\n'.
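The difference between split and partition on a line without a delimiter can be seen in a small text-mode sketch. The function names and the sample lines are assumptions made for this example.

```python
def first_field_split(line, delimiter=","):
    """split-based variant: returns the whole line if no delimiter is present."""
    return line.split(delimiter, 1)[0]

def first_field_partition(line, delimiter=","):
    """partition-based variant: returns None if no delimiter is present."""
    first, sep, _rest = line.partition(delimiter)
    return first if sep else None

def first_column_ids(lines, delimiter=","):
    """Yield the integer in the first column, skipping lines without a delimiter."""
    for line in lines:
        first = first_field_partition(line, delimiter)
        if first is not None:
            yield int(first)

lines = ["12,foo,bar\n", "garbage without delimiter\n", "7,baz\n"]
print(repr(first_field_split(lines[1])))      # the whole line comes back
print(repr(first_field_partition(lines[1])))  # None
print(list(first_column_ids(lines)))          # [12, 7]
```

Because first_column_ids is a generator, it can be fed an open file object directly and never holds more than one line in memory.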
Personally, I would do it with generators:
import csv
def int_of_0(x):
    return int(x[0])
def obtain(filepath, treat):
    # on Python 3 the built-in map is lazy, so itertools.imap is not needed
    with open(filepath, newline='') as f:
        for i in map(treat, csv.reader(f)):
            yield i
for x in obtain('essai.txt', int_of_0):
    pass  # instructions