Make CSV escape Double Quotation Marks - python

I need to prepare a .csv file so that double quotation marks get ignored by the program processing it (ArcMap). Arc was merging the contents of all cells following one that contained a double quotation mark into that cell. For example:
...and no further rows would get processed at all.
How does one make a CSV escape Double Quotation Marks for successful processing in ArcMap (10.2)?

Let's say df is the DataFrame created from the CSV file as follows
df = pd.read_csv('filename.csv')
Let us assume that comments is the name of the column where the issue occurs, i.e. you want to replace every double quote (") with an empty string ("").
The following one-liner does that for you. It replaces every double quote in every row of df['comments'] with an empty string.
df['comments'] = df['comments'].apply(lambda x: x.replace('"', ''))
The lambda receives each value in df['comments'] as x.
EDIT: If you need to escape the double quotes rather than remove them, note that the CSV convention is to double them. Another one-liner, very similar to the one above:
df['comments'] = df['comments'].apply(lambda x: x.replace('"', '""'))
(A raw-string literal like r'{0}'.format(x) would not help here: the r prefix only changes how backslash escapes in a string literal are parsed, and has no effect on the contents of x.)
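To write the frame back out with CSV-safe quoting, a sketch using pandas (the comments column name is taken from the question; the sample values are made up for illustration):

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({"comments": ['he said "ok"', "plain text"]})

# csv.QUOTE_ALL wraps every field in quotes; embedded quotes are doubled ("")
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)
print(buf.getvalue())
```

Passing a `quoting` constant from the csv module is the documented way to control quoting in `DataFrame.to_csv`.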

You could try reading the file with the csv module and writing it back in the hopes that the output format will be more digestible for your other tool. See the docs for formatting options.
import csv
with open('in.csv', 'r') as fin, open('out.csv', 'w') as fout:
    reader = csv.reader(fin, delimiter='\t')
    writer = csv.writer(fout, delimiter='\t')
    # alternative:
    # writer = csv.writer(fout, delimiter='\t', escapechar='\\', doublequote=False)
    for line in reader:
        writer.writerow(line)
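For illustration, the two writer configurations produce different escape styles for an embedded quote; a minimal sketch using in-memory buffers:

```python
import csv
import io

row = ['say "hi"']

# Default dialect: the field is quoted and the inner quote is doubled
buf1 = io.StringIO()
csv.writer(buf1, delimiter='\t').writerow(row)

# escapechar + doublequote=False: the inner quote is backslash-escaped instead
buf2 = io.StringIO()
csv.writer(buf2, delimiter='\t', escapechar='\\', doublequote=False).writerow(row)

print(buf1.getvalue())  # "say ""hi"""
print(buf2.getvalue())  # say \"hi\"
```

Which style ArcMap accepts is something to test; the point is that the csv module can emit either.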

What worked for me was writing a module to do some "pre-processing" of the CSV file as follows. The key line is where the "writer" has the parameter "quoting=csv.QUOTE_ALL". Hopefully this is useful to others.
def work(Source_CSV):
    from __main__ import *
    import csv, arcpy, os
    # Derive name and location for the newly-formatted .csv file
    Head = os.path.split(Source_CSV)[0]
    Tail = os.path.split(Source_CSV)[1]
    name = Tail[:-4]
    new_folder = "formatted"
    new_path = os.path.join(Head, new_folder)
    Formatted_CSV = os.path.join(new_path, name + "_formatted.csv")
    #arcpy.AddMessage("Formatted_CSV = " + Formatted_CSV)
    # Populate the new .csv file with quotation marks around all field contents ("quoting=csv.QUOTE_ALL")
    with open(Source_CSV, 'rb') as file1, open(Formatted_CSV, 'wb') as file2:
        # Instantiate the .csv reader
        reader = csv.reader(file1, skipinitialspace=True)
        # Write column headers without quotes
        headers = reader.next()  # 'next' consumes the first (header) row of the .csv (Python 2)
        writer = csv.writer(file2)
        writer.writerow(headers)
        # Write all other rows wrapped in double quotes
        writer = csv.writer(file2, delimiter=',', quoting=csv.QUOTE_ALL)
        # Write all other rows, at first quoting none...
        #writer = csv.writer(file2, quoting=csv.QUOTE_NONE, quotechar='\x01')
        for row in reader:
            # ...then manually doubling double quotes and wrapping 3rd column in double quotes.
            #row[2] = '"' + row[2].replace('"','""') + '"'
            writer.writerow(row)
    return Formatted_CSV
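The snippet above is Python 2 (reader.next(), binary file modes). Under Python 3 the same pre-processing might be sketched like this; os.makedirs is added so the "formatted" folder exists, and arcpy is omitted since it isn't needed for the CSV work itself:

```python
import csv
import os

def work(source_csv):
    """Rewrite source_csv with every non-header field wrapped in double quotes."""
    head, tail = os.path.split(source_csv)
    name = os.path.splitext(tail)[0]
    new_path = os.path.join(head, "formatted")
    os.makedirs(new_path, exist_ok=True)
    formatted_csv = os.path.join(new_path, name + "_formatted.csv")
    with open(source_csv, newline='') as fin, \
         open(formatted_csv, 'w', newline='') as fout:
        reader = csv.reader(fin, skipinitialspace=True)
        headers = next(reader)              # next() replaces Python 2's reader.next()
        csv.writer(fout).writerow(headers)  # headers without forced quoting
        writer = csv.writer(fout, quoting=csv.QUOTE_ALL)
        writer.writerows(reader)            # every remaining field gets quoted
    return formatted_csv
```

Opening both files with newline='' is what the csv docs recommend in Python 3, in place of the old binary modes.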

Python CSV: nested double quotes

I have a test.csv file as follows:
"N";"INFO"
"1";"www.google.it"
I use the following program to print out the contents of the CSV file
import csv
with open('test.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=';')
    for p in reader:
        print("%s %s" % (p['N'], p['INFO']))
The output is
1 www.google.it"
The reason lies probably in the fact that the csv file has some "nested" double quotes. However, the separating character is ";", and so I would like the library to simply remove the double quote " at the beginning and at the end of the field INFO, keeping the rest of the string intact.
In other words, I would like the output of the program to be
1 www.google.it
How can I fix that, without modifying the test.csv file?
One possibility is to use the csv module with csv.QUOTE_NONE, then handle the removal of the quotes (on both the fieldnames and the values) manually:
import csv

def strip_outer_quotes(s):
    """ Strip an outer pair of quotes (only) from a string. If not quoted,
    string is returned unchanged. """
    if s[:1] == s[-1:] == '"':  # slicing guards against empty fields
        return s[1:-1]
    else:
        return s

def my_csv_reader(fh):
    """ Thin wrapper around csv.DictReader to handle fields which are
    quoted but contain unquoted " characters. """
    reader = csv.DictReader(fh, delimiter=';', quoting=csv.QUOTE_NONE)
    reader.fieldnames = [strip_outer_quotes(fn) for fn in reader.fieldnames]
    for row in reader:
        yield {k: strip_outer_quotes(v) for k, v in row.items()}

with open('test.csv', newline='') as csvfile:
    reader = my_csv_reader(csvfile)
    for p in reader:
        print("%s %s" % (p['N'], p['INFO']))
Note: instead of my_csv_reader, probably name the function after the source of this particular variant of CSV; acme_csv_reader or similar
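A self-contained version of the approach, run on the two sample lines from the question (the helper restates the answer's logic so the snippet runs on its own):

```python
import csv
import io

def strip_outer_quotes(s):
    """Remove one outer pair of double quotes, if present."""
    if s[:1] == s[-1:] == '"':
        return s[1:-1]
    return s

# Sample data from the question
data = '"N";"INFO"\n"1";"www.google.it"\n'

# QUOTE_NONE makes the reader treat " as an ordinary character,
# so we strip the outer pair ourselves
reader = csv.DictReader(io.StringIO(data), delimiter=';', quoting=csv.QUOTE_NONE)
reader.fieldnames = [strip_outer_quotes(fn) for fn in reader.fieldnames]
rows = [{k: strip_outer_quotes(v) for k, v in row.items()} for row in reader]
print(rows)  # [{'N': '1', 'INFO': 'www.google.it'}]
```

Because quoting is disabled entirely, an unquoted " in the middle of a field simply passes through intact.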

Python – cleaning CSV file with split records

I have a delimited file in which some of the fields contain line termination characters. They can be LF or CR/LF.
The line terminators cause the records to split over multiple lines.
My objective is to read the file, remove the line termination characters, then write out a delimited file with quotes around the fields.
Sample input record:
444,2018-04-06,19:43:47,43762485,"Request processed"CR/LF
555,2018-04-30,19:17:56,43762485,"Added further note:LF
email customer a receipt" CR/LF
The first record is fine but the second has a LF (line feed) causing the record to fold.
import csv
with open(raw_data, 'r', newline='') as inp, open(csv_data, 'w') as out:
    csvreader = csv.reader(inp, delimiter=',', quotechar='"')
    for row in csvreader:
        print(str(row))
        out.write(str(row)[1:-1] + '\n')
My code nearly works but I don’t think it is correct.
The output I get is:
['444', '2018-04-06', '19:43:47', '43762485', 'Request processed']
['555', '2018-04-30', '19:17:56', '43762485', 'Added further note:\nemail customer a receipt']
I use the substring to remove the square brackets at the start and end of the line which I think is not the correct way.
Notice on the second record the new line character has been converted to \n. I would like to know how to get rid of that and also incorporate a csv writer to the code to place double quoting around the fields.
To remove the line terminators I tried replace, but it did not work.
(row.replace('\r', '').replace('\n', '') for row in csvreader)
I also tried to incorporate a csv writer but could not get it working with the list.
Any advice would be appreciated.
This snippet does what you want:
import csv

with open('raw_data.csv', 'r', newline='') as inp, open('csv_data.csv', 'w', newline='') as out:
    reader = csv.reader(inp, delimiter=',', quotechar='"')
    writer = csv.writer(out, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    for row in reader:
        fixed = [cell.replace('\n', '') for cell in row]
        writer.writerow(fixed)
Quoting all cells is handled by passing csv.QUOTE_ALL as the writer's "quoting" argument.
The line
fixed = [cell.replace('\n', '') for cell in row]
creates a new list of cells where embedded '\n' characters are replaced by the empty string.
By default the csv writer terminates each row with '\r\n'; if you want to override this you can pass a lineterminator argument to the writer (and open the file with newline='' so Python does not translate line endings a second time).
To me the original csv seems fine: it's normal to have embedded newlines ("soft line-breaks") inside quoted cells, and csv-aware applications such as spreadsheets will handle them correctly. However, they will look wrong in applications that don't understand csv formatting and so treat the embedded newlines as actual end-of-line characters.
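To illustrate, the csv module already treats a newline inside a quoted cell as part of the cell, not as a record boundary; a minimal sketch with the question's second record:

```python
import csv
import io

data = '555,"Added further note:\nemail customer a receipt"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # one record: ['555', 'Added further note:\nemail customer a receipt']
```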

remove non ascii characters from csv file using Python

I am trying to remove non-ascii characters from a file. I am actually trying to convert a text file which contains these characters (eg. hello§‚å½¢æˆ äº†å¯¹æ¯”ã€‚ 花å) into a csv file.
However, I am unable to iterate through these characters and hence I want to remove them (i.e chop off or put a space). Here's the code (researched and gathered from various sources)
The problem with the code is, after running the script, the csv/txt file has not been updated. Which means the characters are still there. Have absolutely no idea how to go about doing this anymore. Researched for a day :(
Would kindly appreciate your help!
import csv
txt_file = r"xxx.txt"
csv_file = r"xxx.csv"
in_txt = csv.reader(open(txt_file, "rb"), delimiter='\t')
out_csv = csv.writer(open(csv_file, 'wb'))
for row in in_txt:
    for i in row:
        i = "".join([a if ord(a) < 128 else '' for a in i])
out_csv.writerows(in_txt)
Variable assignment is not magically transferred to the original source; you have to build up a new list of your changed rows:
import csv
txt_file = r"xxx.txt"
csv_file = r"xxx.csv"
in_txt = csv.reader(open(txt_file, "rb"), delimiter='\t')
out_csv = csv.writer(open(csv_file, 'wb'))
out_txt = []
for row in in_txt:
    out_txt.append([
        "".join(a if ord(a) < 128 else '' for a in i)
        for i in row
    ])
out_csv.writerows(out_txt)
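As an aside, in Python 3 a simpler way to drop non-ASCII characters is to round-trip the string through ASCII with errors='ignore'; this is an alternative to the ord() comprehension above, not the answer's code:

```python
def strip_non_ascii(text):
    # Encode to ASCII, silently dropping anything unrepresentable, then decode back
    return text.encode('ascii', errors='ignore').decode('ascii')

print(strip_non_ascii('hello§å world'))  # hello world
```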

Python: writing an entire row to a CSV file. Why does it work this way?

I had exported a csv from Nokia Suite.
"sms","SENT","","+12345678901","","2015.01.07 23:06","","Text"
Reading from the PythonDoc, I tried
import csv
with open(sourcefile, 'r', encoding='utf8') as f:
    reader = csv.reader(f, delimiter=',')
    for line in reader:
        # write entire csv row
        with open(filename, 'a', encoding='utf8', newline='') as t:
            a = csv.writer(t, delimiter=',')
            a.writerows(line)
It didn't work, until I put brackets around 'line' as so i.e. [line].
So at the last part I had
a.writerows([line])
Why is that so?
The writerows method expects an iterable of rows, where each row is itself an iterable of fields. Your line is already a single row (a list of strings), so writerows treats each string as a row and writes its characters as separate fields. Wrapping it as [line] turns it into a list containing one row.
What you probably want to use instead is writerow, which writes a single row.
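The difference is easy to see with an in-memory buffer; a minimal sketch:

```python
import csv
import io

line = ['ab', 'cd']

buf = io.StringIO()
writer = csv.writer(buf)

writer.writerow(line)     # one row: ab,cd
writer.writerows([line])  # same row, via a one-row list
writer.writerows(line)    # each string becomes a row of single characters!

print(buf.getvalue())
```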

Remove double quotes from iterator when using csv writer

I want to create a csv from an existing csv, by splitting its rows.
Input csv:
A,R,T,11,12,13,14,15,21,22,23,24,25
Output csv:
A,R,T,11,12,13,14,15
A,R,T,21,22,23,24,25
So far my code looks like:
def update_csv(name):
    # load csv file
    file_ = open(name, 'rb')
    # init first values
    current_a = ""
    current_r = ""
    current_first_time = ""
    file_content = csv.reader(file_)
    # LOOP
    for row in file_content:
        current_a = row[0]
        current_r = row[1]
        current_first_time = row[2]
        i = 2
        # Write row to new csv
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            writer.writerow((current_a,
                             current_r,
                             current_first_time,
                             ",".join(row[x] for x in range(i + 1, i + 5))
                             ))
        # do only one row, for debug purposes
        return
But the row contains double quotes that I can't get rid of:
A002,R051,02-00-00,"05-21-11,00:00:00,REGULAR,003169391"
I've tried to use writer = csv.writer(f,quoting=csv.QUOTE_NONE) and got a _csv.Error: need to escape, but no escapechar set.
What is the correct approach to delete those quotes?
I think you could simplify the logic to split each row into two using something along these lines:
import csv

def update_csv(name):
    with open(name, 'rb') as file_:
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            # read one row from input csv
            for row in csv.reader(file_):
                # write 2 rows to new csv
                writer.writerow(row[:8])
                writer.writerow(row[:3] + row[8:])
writer.writerow expects an iterable and writes each of its items as one field, separated by the appropriate delimiter, into the file. So:
writer.writerow([1, 2, 3])
would write "1,2,3\n" to the file.
Your call provides it with an iterable, one of whose items is a string that already contains the delimiter. It therefore needs some way to either escape the delimiter or quote that item. For example,
writer.writerow([1, '2,3'])
doesn't give "1,2,3\n" but '1,"2,3"\n' - the string counts as one item in the output.
Therefore, if you don't want quotes in the output, you need to provide an escape character (e.g. '/') to mark the delimiters that shouldn't be counted as such (giving something like "1,2/,3\n").
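Concretely, a minimal sketch of both behaviors using in-memory buffers:

```python
import csv
import io

# Default dialect: the field containing the delimiter gets quoted
buf_quoted = io.StringIO()
csv.writer(buf_quoted).writerow([1, '2,3'])

# QUOTE_NONE with an escapechar: the delimiter is escaped instead of quoted
buf_escaped = io.StringIO()
csv.writer(buf_escaped, quoting=csv.QUOTE_NONE, escapechar='/').writerow([1, '2,3'])

print(buf_quoted.getvalue())   # 1,"2,3"
print(buf_escaped.getvalue())  # 1,2/,3
```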
However, I think what you actually want to do is include all of those elements as separate items. Don't ",".join(...) them yourself; pass the relevant slice of row as separate items instead (the * unpacking inside a tuple requires Python 3.5+):
writer.writerow((current_a, current_r,
                 current_first_time, *row[i + 1:i + 5]))
This provides the items from row as separate fields, matching the range(i + 1, i + 5) indices used in the question.