I have a test.csv file as follows:
"N";"INFO"
"1";"www.google.it"
I use the following program to print out the contents of the CSV file
import csv
with open('test.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=';')
    for p in reader:
        print("%s %s" % (p['N'], p['INFO']))
The output is
1 www.google.it"
The reason lies probably in the fact that the csv file has some "nested" double quotes. However, the separating character is ";", and so I would like the library to simply remove the double quote " at the beginning and at the end of the field INFO, keeping the rest of the string intact.
In other words, I would like the output of the program to be
1 www.google.it
How can I fix that, without modifying the test.csv file?
One possibility is to use the csv module with csv.QUOTE_NONE, then handle the removal of the quotes (on both the fieldnames and the values) manually:
import csv

def strip_outer_quotes(s):
    """ Strip an outer pair of quotes (only) from a string. If not quoted,
    the string is returned unchanged. """
    if s[0] == s[-1] == '"':
        return s[1:-1]
    else:
        return s

def my_csv_reader(fh):
    """ Thin wrapper around csv.DictReader to handle fields which are
    quoted but contain unquoted " characters. """
    reader = csv.DictReader(fh, delimiter=';', quoting=csv.QUOTE_NONE)
    reader.fieldnames = [strip_outer_quotes(fn) for fn in reader.fieldnames]
    for row in reader:
        yield {k: strip_outer_quotes(v) for k, v in row.items()}

with open('test.csv', newline='') as csvfile:
    reader = my_csv_reader(csvfile)
    for p in reader:
        print("%s %s" % (p['N'], p['INFO']))
Note: instead of my_csv_reader, it is probably better to name the function after the source of this particular CSV variant: acme_csv_reader or similar.
Related
I need to prepare a .csv file so that double quotation marks get ignored by the program processing it (ArcMap). Arc was blending the contents of all following cells on that line into any previous one containing double quotation marks. For example:
...and no further rows would get processed at all.
How does one make a CSV escape Double Quotation Marks for successful processing in ArcMap (10.2)?
Let's say df is the DataFrame created for the csv file as follows
df = pd.read_csv('filename.csv')
Let us assume that comments is the name of the column where the issue occurs, i.e. you want to replace every double quote (") with an empty string ('').
The following one-liner does that for you. It will replace every double quote in every row of df['comments'] with an empty string.
df['comments'] = df['comments'].apply(lambda x: x.replace('"', ''))
The lambda captures every row in df['comments'] in variable x.
EDIT: To escape the double quotes you need to convert the string to its raw format. Again, another one-liner very similar to the one above.
df['comments'] = df['comments'].apply(lambda x: r'{0}'.format(x))
The r prefix marks a raw string literal, which leaves backslash escape sequences in the string uninterpreted.
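As a side note, the same quote-stripping can be written with pandas' string accessor instead of an explicit lambda. A minimal sketch, using a made-up comments column to stand in for the real data:

```python
import pandas as pd

# Toy DataFrame standing in for the real data
df = pd.DataFrame({'comments': ['say "hi"', 'plain']})

# regex=False treats the pattern as a literal string, not a regex
df['comments'] = df['comments'].str.replace('"', '', regex=False)

print(df['comments'].tolist())  # → ['say hi', 'plain']
```

`.str.replace` is vectorized, so it avoids the per-row Python function call that `.apply(lambda ...)` incurs.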
You could try reading the file with the csv module and writing it back in the hopes that the output format will be more digestible for your other tool. See the docs for formatting options.
import csv
with open('in.csv', 'r') as fin, open('out.csv', 'w') as fout:
    reader = csv.reader(fin, delimiter='\t')
    writer = csv.writer(fout, delimiter='\t')
    # alternative:
    # writer = csv.writer(fout, delimiter='\t', escapechar='\\', doublequote=False)
    for line in reader:
        writer.writerow(line)
What worked for me was writing a module to do some "pre-processing" of the CSV file as follows. The key line is where the "writer" has the parameter "quoting=csv.QUOTE_ALL". Hopefully this is useful to others.
def work(Source_CSV):
    import csv, arcpy, os
    # Derive name and location for the newly-formatted .csv file
    Head = os.path.split(Source_CSV)[0]
    Tail = os.path.split(Source_CSV)[1]
    name = Tail[:-4]
    new_folder = "formatted"
    new_path = os.path.join(Head, new_folder)
    Formatted_CSV = os.path.join(new_path, name + "_formatted.csv")
    #arcpy.AddMessage("Formatted_CSV = " + Formatted_CSV)
    # Populate the new .csv file with quotation marks around all field contents ("quoting=csv.QUOTE_ALL")
    with open(Source_CSV, 'rb') as file1, open(Formatted_CSV, 'wb') as file2:
        # Instantiate the .csv reader
        reader = csv.reader(file1, skipinitialspace=True)
        # Write column headers without quotes
        headers = reader.next()  # 'next' consumes the first row of the .csv
        writer = csv.writer(file2)
        writer.writerow(headers)
        # Write all other rows wrapped in double quotes
        writer = csv.writer(file2, delimiter=',', quoting=csv.QUOTE_ALL)
        # Write all other rows, at first quoting none...
        #writer = csv.writer(file2, quoting=csv.QUOTE_NONE, quotechar='\x01')
        for row in reader:
            # ...then manually doubling double quotes and wrapping 3rd column in double quotes.
            #row[2] = '"' + row[2].replace('"','""') + '"'
            writer.writerow(row)
    return Formatted_CSV
I am trying to import csv data from files where sometimes the enclosing char " is missing.
So I have rows like this:
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
# In the second row the closing " after 2200.00 is missing
# also the closing " before EUR" is missing
Now I am reading the csv data with this:
csv.reader(
    codecs.open(filename, 'r', encoding='latin-1'),
    delimiter=";",
    dialect=csv.excel_tab)
And the data I get for the second row is this:
["MacBookPro", "2200.00;EUR"]
Aside from pre-processing my csv files with a unix command like sed and removing all closing chars " and relying on the semicolon to separate the columns, what else can I do?
This might work:
import csv
import io
file = io.StringIO(u'''
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
'''.strip())
reader = csv.reader((line.replace('"', '') for line in file), delimiter=';', quotechar='"')
for row in reader:
    print(row)
The problem is that if there are any legitimately quoted lines, e.g.
"MacBookPro;Awesome Edition";"2200.00";"EUR"
Or, worse:
"MacBookPro:
Description: Awesome Edition";"2200.00";"EUR"
Your output is going to produce too few/many columns. But if you know that's not a problem then it will work fine. You could pre-screen the file by adding this before the read part, which would give you the malformed line:
for line in file:
    if line.count(';') != 2:
        raise ValueError('No! This file has broken data on line {!r}'.format(line))
file.seek(0)
Or alternatively you could screen as you're reading:
for row in reader:
    if any(';' in _ for _ in row):
        print('Error:')
        print(row)
Ultimately your best option is to fix whatever is producing your garbage csv file.
If you're looping through all the lines/rows of the file, you can use the string .replace() method to get rid of the quotes (if you don't need them later on for other purposes).
>>> import csv, codecs
>>> with codecs.open('eggs.csv', 'r', encoding='latin-1') as csvfile:
...     # QUOTE_NONE keeps the " characters in the fields so they can be stripped below
...     my_file = csv.reader(csvfile, delimiter=";", quoting=csv.QUOTE_NONE)
...     for row in my_file:
...         (model, price, currency) = row
...         # str.replace returns a new string, so assign the result back
...         model = model.replace('"', '')
...         price = price.replace('"', '')
...         currency = currency.replace('"', '')
...         print 'Model is: %s (costs %s%s).' % (model, price, currency)
...
Model is: ThinkPad (costs 2000.00EUR).
Model is: MacBookPro (costs 2200.00EUR).
I have a large csv file (comma delimited). I would like to replace/rename a few random cells containing the value "NIL" with an empty string "".
I tried the following to find the keyword "NIL" and replace it with an empty string, but it gives me an empty csv file:
ifile = open('outfile', 'rb')
reader = csv.reader(ifile,delimiter='\t')
ofile = open('pp', 'wb')
writer = csv.writer(ofile, delimiter='\t')
findlist = ['NIL']
replacelist = [' ']
s = ifile.read()
for item, replacement in zip(findlist, replacelist):
    s = s.replace(item, replacement)
ofile.write(s)
From looking at your code, I feel you should simply read the file directly:
with open("test.csv") as opened_file:
    data = opened_file.read()
Then use a regex to change all NIL to "" or " " and save the data back to the file.
import re
data = re.sub("NIL"," ",data) # this code will replace NIL with " " in the data string
NOTE: you can give any regex instead of NIL
for more info see re module.
EDIT 1: re.sub returns a new string so you need to return it to data.
A few tweaks and your example works. I edited your question to get rid of some indenting errors - assuming those were a cut/paste problem. The next problem is that you don't import csv ... but even though you create a reader and writer, you don't actually use them, so it could just be removed. So, opening in text instead of binary mode, we have
ifile = open('outfile') # 'outfile' is the input file...
ofile = open('pp', 'w')
findlist = ['NIL']
replacelist = [' ']
s = ifile.read()
for item, replacement in zip(findlist, replacelist):
    s = s.replace(item, replacement)
ofile.write(s)
We could add 'with' clauses and use a dict to make replacements more clear
replace_this = {'NIL': ' '}

with open('outfile') as ifile, open('pp', 'w') as ofile:
    s = ifile.read()
    for item, replacement in replace_this.items():
        s = s.replace(item, replacement)
    ofile.write(s)
The only real problem now is that it also changes things like "NILIST" to "IST". If this is a csv with all numbers except for "NIL", that's not a problem. But you could also use the csv module to only change cells that are exactly "NIL".
with open('outfile') as ifile, open('pp', 'w') as ofile:
    reader = csv.reader(ifile)
    writer = csv.writer(ofile)
    for row in reader:
        # row is a list of columns. The following builds a new list
        # while checking and changing any column that is 'NIL'.
        writer.writerow([c if c.strip() != 'NIL' else ' '
                         for c in row])
I want to create a csv from an existing csv, by splitting its rows.
Input csv:
A,R,T,11,12,13,14,15,21,22,23,24,25
Output csv:
A,R,T,11,12,13,14,15
A,R,T,21,22,23,24,25
So far my code looks like:
def update_csv(name):
    # load csv file
    file_ = open(name, 'rb')
    # init first values
    current_a = ""
    current_r = ""
    current_first_time = ""
    file_content = csv.reader(file_)
    # LOOP
    for row in file_content:
        current_a = row[0]
        current_r = row[1]
        current_first_time = row[2]
        i = 2
        # Write row to new csv
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            writer.writerow((current_a,
                             current_r,
                             current_first_time,
                             ",".join((row[x] for x in range(i+1, i+5)))
                             ))
        # do only one row, for debug purposes
        return
But the row contains double quotes that I can't get rid of:
A002,R051,02-00-00,"05-21-11,00:00:00,REGULAR,003169391"
I've tried to use writer = csv.writer(f,quoting=csv.QUOTE_NONE) and got a _csv.Error: need to escape, but no escapechar set.
What is the correct approach to delete those quotes?
I think you could simplify the logic to split each row into two using something along these lines:
def update_csv(name):
    with open(name, 'rb') as file_:
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            # read one row from input csv
            for row in csv.reader(file_):
                # write 2 rows to new csv
                writer.writerow(row[:8])
                writer.writerow(row[:3] + row[8:])
writer.writerow expects an iterable, and writes each item within the iterable as one item, separated by the appropriate delimiter, into the file. So:
writer.writerow([1, 2, 3])
would write "1,2,3\n" to the file.
Your call provides it with an iterable, one of whose items is a string that already contains the delimiter. It therefore needs some way to either escape the delimiter or a way to quote out that item. For example,
writer.writerow([1, '2,3'])
doesn't just give "1,2,3\n", but e.g. '1,"2,3"\n' - the string counts as one item in the output.
Therefore if you want to not have quotes in the output, you need to provide an escape character (e.g. '/') to mark the delimiters that shouldn't be counted as such (giving something like "1,2/,3\n").
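That escapechar route can be sketched like this (using '/' as the escape character, as suggested above):

```python
import csv
import io

buf = io.StringIO()
# QUOTE_NONE disables quoting entirely, so an escapechar is required
# for any field that contains the delimiter
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar='/')
writer.writerow([1, '2,3'])
print(buf.getvalue())  # → 1,2/,3  (the embedded comma comes out escaped)
```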
However, I think what you actually want to do is include all of those elements as separate items. Don't ",".join(...) them yourself, try:
writer.writerow((current_a, current_r, current_first_time) +
                tuple(row[i+1:i+5]))
to provide the relevant items from row as separate items in the tuple.
I have similar problem to this guy: find position of a substring in a string
The difference is that I don't know what my "mystr" is. I know my substring, but my string in the input file could be a random number of words in any order; I only know that one of those words includes the substring cola.
For example a csv file: fanta,coca_cola,sprite in any order.
If my substring is "cola", then how can I make a code that says
mystr.find('cola')
or
match = re.search(r"[^a-zA-Z](cola)[^a-zA-Z]", mystr)
or
if "cola" in mystr
When I don't know what my "mystr" is?
this is my code:
import csv
with open('first.csv', 'rb') as fp_in, open('second.csv', 'wb') as fp_out:
reader = csv.DictReader(fp_in)
rows = [row for row in reader]
writer = csv.writer(fp_out, delimiter = ',')
writer.writerow(["new_cola"])
def headers1(name):
if "cola" in name:
return row.get("cola")
for row in rows:
writer.writerow([headers1("cola")])
and the first.csv:
fanta,cocacola,banana
0,1,0
1,2,1
so it prints out
new_cola
""
""
when it should print out
new_cola
1
2
Here is a working example:
import csv
with open("first.csv", "rb") as fp_in, open("second.csv", "wb") as fp_out:
reader = csv.DictReader(fp_in)
writer = csv.writer(fp_out, delimiter = ",")
writer.writerow(["new_cola"])
def filter_cola(row):
for k,v in row.iteritems():
if "cola" in k:
yield v
for row in reader:
writer.writerow(list(filter_cola(row)))
Notes:
rows = [row for row in reader] is unnecessary and inefficient (it converts a generator to a list, which consumes a lot of memory for huge data)
instead of return row.get("cola") you meant return row.get(name)
in the statement return row.get("cola") you access a variable outside of the current scope
You can also use the Unix tool cut. For example:
cut -d "," -f 2 < first.csv > second.csv