Distinguish between "" and empty value when reading csv file using python - python

The CSV file contains values such as "","ab,abc",,"abc". Note that by empty value I mean ,, as in an unknown value. This is different from "", which means a value has not been set yet. I treat these two values differently.
I need a way to read both "" and the empty value ,, and distinguish between the two. I am mapping the data to numbers such that "" is mapped to 0 and ,, is mapped to NaN.
Note that I am not having a parsing issue: a field such as "ab,abc" is parsed just fine with comma as the delimiter. The issue is that Python reads both "" and the empty value ,, as the empty string ''. These two values are not the same and should not be collapsed into one.
Not only that, I also need to write the csv file such that "" is written as "" (not ,,) and NaN is written as ,, (empty value).
I have looked into csv dialect options such as doublequote, escapechar, quotechar, and quoting. These are NOT what I want: they all deal with cases where the delimiter appears within the data, e.g. "ab,abc", and as I mentioned, parsing with special characters is not the issue.
I don't want to use Pandas. The only thing I can think of is a regex, but that is too much overhead if I have millions of lines to process.
The behaviour I want is this:
import numpy as np

mapped = {}
a = '""'  # or a = '' or a = 'ab,abc'
if a == '""':
    mapped[0] = 0
elif a == '':
    mapped[0] = np.nan
else:
    mapped[0] = a
My csv reader is as follows:
import csv
f = open(filepath, 'r')
csvreader = csv.reader(f)
for row in csvreader:
    print(row)
I want the above behaviour when reading csv files. Currently only two values are read: '' (empty string) or 'ab,abc'.
I want 3 different values to be read: '' (empty string), '""' (string with double quotes), and the actual string 'ab,abc'.

Looking through the csv module in the CPython source (search for IN_QUOTED_FIELD), it doesn't have any internal state that would let you do this. For example, the line:
"a"b"c"d
is parsed as 'ab"c"d', which might not be what you expect, e.g.:
import csv
from io import StringIO

[row] = csv.reader(StringIO('"a"b"c"d'))
print(row)  # ['ab"c"d']
Specifically, quotes are only handled specially at the beginning of fields; after that, characters are simply appended to the field as they are encountered, rather than triggering any special behaviour when "un-quoting" fields.

The solution I figured is this:
If I change the input file so that quoted fields use the character '\' as the quote character, the input file becomes:
col1,col2,col3
"",a,b
\cde \,f,g
,h,i
\j,kl\,mno,p
Then the double-quoted empty field and the unquoted empty field are separable:
csvreader = csv.reader(f, quotechar='\\')
for row in csvreader:
    print(row)
That's my best solution so far...
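Building on that, here is a minimal self-contained sketch of the full read-and-map step under the same assumption (backslash as the quote character, so a literal "" survives parsing); the sample data mirrors the file layout above, and the 0/NaN mapping is the one described in the question:

```python
import csv
import io
import math

# Hypothetical sample in the backslash-quoted layout shown above:
# "" is now an ordinary two-character field, while ,, stays empty.
data = 'col1,col2,col3\n"",a,b\n\\cde \\,f,g\n,h,i\n'

def map_value(v):
    if v == '""':        # explicitly set to empty -> 0
        return 0
    if v == '':          # unknown / missing -> NaN
        return float('nan')
    return v             # ordinary value, kept as-is

reader = csv.reader(io.StringIO(data), quotechar='\\')
header = next(reader)
rows = [[map_value(v) for v in row] for row in reader]
# rows[0][0] == 0, rows[1][0] == 'cde ', and rows[2][0] is NaN
```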

Related

How can I validate the double quotes of values in a pipe delimited file using Python?

I have a pipe delimited file in S3, where rows look like this:
123 | "val 2" | "" | """ | | val5
I'm reading the bytestream and converting it to a dictionary using csv.DictReader:
data_iter = stream_from_s3_utf8(s3_stream)
csv_iter = csv.DictReader(data_iter)
When I use packages to convert the contents of a file to Python objects, these packages (sensibly) infer that double quotes are just an indicator that some value is supposed to be a string, so "val 2" (with literal double quotes in the file) goes into my dictionary as a string value without any quotes. And both an empty value (the fifth value above) and a pair of double quotes (the third value above) go into my dictionary as an empty string. But I need to validate the quoting in my file, so I need access to the literal quotes. (For example, the third value above is not valid, but the fifth is.) Is there any way in Python to read the contents of a file while preserving the quotes?
You can control how csv.reader and, by extension, csv.DictReader handle quoting by passing the quoting parameter to the constructor. The whole range of possibilities is defined in the csv module, but here you need csv.QUOTE_NONE:
data_iter = stream_from_s3_utf8(s3_stream)
csv_iter = csv.DictReader(data_iter, delimiter='|', quoting=csv.QUOTE_NONE)
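As a quick check of what QUOTE_NONE does, here is a self-contained sketch; the sample data and the delimiter='|' argument are assumptions based on the pipe-delimited rows in the question, and stream_from_s3_utf8 is replaced by an in-memory stream:

```python
import csv
import io

# Assumed sample mirroring the pipe-delimited file in the question.
data = 'c1|c2|c3\n123|"val 2"|""\n'

reader = csv.DictReader(io.StringIO(data), delimiter='|',
                        quoting=csv.QUOTE_NONE)
row = next(reader)
# With QUOTE_NONE the quote characters survive verbatim,
# so the raw quoting can be validated:
# row == {'c1': '123', 'c2': '"val 2"', 'c3': '""'}
```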
I know you've found an answer, but here's one that does it from scratch:
with open('filename.foo') as f:
    raw = f.read()
data = [[field.strip() for field in line.split('|')]
        for line in raw.splitlines()]

Python write to csv with comas in each field

I'm trying to export a list of strings to csv using comma as a separator. Each of the fields may contain a comma. What I obtain after writing to a csv file is that every comma is treated as a separator.
My question is: is it possible to ignore the commas (as separators) in each field?
Here is what I'm trying, as an example.
import csv
outFile = "output.csv"
row = ['hello, world', '43333', '44141']
with open(outFile, 'w', newline='') as writeFile:
    writer = csv.writer(writeFile)
    writer.writerow(row)
The output I obtain has 'hello, world' split across two cells; what I would like is 'hello, world' kept together in a single cell.
I think a way to solve this would be to use a different separator, as I read in some sites. But my question is if there is a way to solve this using comma (',') separators.
Thanks
You just need to tell Calc/Excel that your quoting character is a "
You can do .replace(',', ' '), which will replace the commas in the text with a space (or whatever else you want).
I don't know of a way to keep the comma in the text without it acting as a separator, like \, maybe? Somebody else could probably shed some light on that.
I believe you can try wrapping the fields which contain a non-delimiting comma in double quotes. If that does not work, you are probably out of options. After all, you need to somehow convey which comma is doing what to the software that is displaying the CSV for you or the user.
Perhaps this will be helpful:
https://www.rfc-editor.org/rfc/rfc4180#page-3
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
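For what it's worth, Python's csv.writer already follows this RFC 4180 rule with its default QUOTE_MINIMAL setting, so the writer in the question produces correct CSV; a small sketch:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # quoting=csv.QUOTE_MINIMAL is the default
writer.writerow(['hello, world', '43333', '44141'])
# Only the field containing a comma gets quoted:
# buf.getvalue() == '"hello, world",43333,44141\r\n'
```

If the spreadsheet still splits the field, the problem is the import settings in Calc/Excel, not the file.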

Remove unwanted commas from CSV using Python

I need some help, I have a CSV file that contains an address field, whoever input the data into the original database used commas to separate different parts of the address - for example:
Flat 5, Park Street
When I try to use the CSV file, it treats this one entry as two separate fields when in fact it is a single field. I have used Python to strip out commas where they appear between inverted commas, as those are easy to distinguish from a comma that should actually be there, but this problem has me stumped.
Any help would be gratefully received.
Thanks.
You can define the separating and quoting characters with Python's CSV reader. For example:
With this CSV:
1,`Flat 5, Park Street`
And this Python:
import csv
with open('14144315.csv', newline='') as csvfile:
    rowreader = csv.reader(csvfile, delimiter=',', quotechar='`')
    for row in rowreader:
        print(row)
You will see this output:
['1', 'Flat 5, Park Street']
This uses commas to separate values, but backticks (the "inverted commas" here) to quote fields that contain commas.
The CSV file was not generated properly. CSV files should have some form of escaping of text, usually using double-quotes:
1,John Doe,"City, State, Country",12345
Some CSV exports do this to all fields (this is an option when exporting from Excel/LibreOffice), but ambiguous fields (such as those including commas) must be escaped.
Either fix this manually or properly regenerate the CSV. Naturally, this cannot be fixed programmatically.
Edit: I just noticed something about "inverted commas" being used for escaping - if that is the case see Jason Sperske's answer, which is spot on.

No Newlines in the CSV when the field of the Dict is empty [Python, CSV]

I'm a Python beginner (I'm actually using version 2.7).
My code is working well (it's a simple one),
but it inserts new lines between the results in my output file. I've noticed that the documentation mentions this (see further below), but not how to avoid automatically inserting a newline when a field is empty.
Can anyone help me?
new = csv.DictReader(ifile, fieldnames=None)
output = csv.DictWriter(ofile, fieldnames=None, dialect='excel', delimiter="\t")
for item in new:
    output.writer.writerow(item.values())
    print item
Changed in version 2.5: The parser is now stricter with respect to multi-line quoted fields. Previously, if a line ended within a quoted field without a terminating newline character, a newline would be inserted into the returned field. This behavior caused problems when reading files which contained carriage return characters within fields. The behavior was changed to return the field without inserting newlines. As a consequence, if newlines embedded within fields are important, the input should be split into lines in a manner which preserves the newline characters.
Results:
{'Url;Site Source;Communaute': 'lafranceagricole.fr;Site Media;'} {'Url;Site Source;Communaute': 'economiesolidaire.com;Blog;'} {'Url;Site Source;Communaute': 'arpentnourricier.org;Blog;'} {'Url;Site Source;Communaute': 'arpentnourricier.org;Blog;'} {'Url;Site Source;Communaute': 'mamienne.canalblog.com;Blog;'}
So I have an error: my header is becoming the key, not the simple header
I wanted (Url; Site source; Communauté), e.g.:
{lafranceagricole.fr; Site Media}
There is no blank line in my csv, so the new line might not be produced by a blank field.
Maybe the Communaute field, which is empty by default, creates the new line.
If you want the ; to be the field delimiter, you have to specify that:
new = csv.reader(ifile, delimiter=';')
output = csv.writer(ofile, dialect='excel', delimiter="\t")
for item in new:
    output.writerow(item)
This works just as well with a DictReader to separate the fields.
It writes the fields tab-delimited; I don't see any blank rows at all.
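Another common cause of extra blank lines, especially with Python 2 on Windows, is opening the output file in text mode, which turns the writer's \r\n line endings into \r\r\n. In Python 2 the fix is to open the file with 'wb'; the Python 3 equivalent is newline='', sketched here against an in-memory stream:

```python
import csv
import io

# newline='' disables newline translation, so csv's own
# '\r\n' terminators reach the file unchanged.
buf = io.StringIO(newline='')
writer = csv.writer(buf, delimiter='\t')
writer.writerow(['a', 'b'])
# buf.getvalue() == 'a\tb\r\n'
```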
Old answer from before the question was clarified
It's not clear to me why you're using a DictReader to read your input, if you only want the cell values, not the field names. However, if for some reason you do need to use a DictReader:
from itertools import chain
output.writer.writerow(list(chain.from_iterable(item.values() for item in new)))
This will write all of the values, tab-separated, into the output file, with no newlines in between.
If you don't really need a DictReader:
new = csv.reader(ifile)
output = csv.writer(ofile, dialect='excel', delimiter="\t")
next(new)  # skip the field names
output.writerow(list(chain.from_iterable(new)))
This will again write the cells as one row, tab-separated, into the output file.

How to read a CSV line with "?

A trivial CSV line could be split using the string split function. But some lines contain ", e.g.:
"good,morning", 100, 300, "1998,5,3"
so directly using string split would not solve the problem.
My solution is to first split the line on , and then recombine the pieces that have a " at the beginning or end.
What's the best practice for this problem?
I'd be interested in a Python or F# code snippet for this.
EDIT: I am more interested in the implementation detail, rather than using a library.
There's a csv module in Python, which handles this.
Edit: This task falls into "build a lexer" category. The standard way to do such tasks is to build a state machine (or use a lexer library/framework that will do it for you.)
The state machine for this task would probably only need two states:
The initial state, where it reads every character except comma and newline as part of the field (exception: leading and trailing spaces), comma as the field separator, and newline as the record separator. When it encounters an opening quote it goes into the
read-quoted-field state, where every character (including comma and newline) except the quote is treated as part of the field; a quote not followed by another quote ends the quoted field (back to the initial state), and a quote followed by a quote is treated as a single literal quote (an escaped quote).
By the way, your concatenating solution will break on "Field1","Field2" or "Field1"",""Field2".
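A minimal sketch of that two-state machine (the function name is mine; it handles quoted fields, escaped "" pairs, and commas inside quotes, but skips the leading/trailing-space stripping for brevity):

```python
def parse_csv_line(line):
    """Split one CSV line into fields using a two-state machine."""
    fields, field, in_quotes, i = [], [], False, 0
    while i < len(line):
        ch = line[i]
        if in_quotes:                       # read-quoted-field state
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    field.append('"')       # "" -> escaped literal quote
                    i += 1
                else:
                    in_quotes = False       # closing quote
            else:
                field.append(ch)            # commas included verbatim
        else:                               # initial state
            if ch == '"':
                in_quotes = True            # opening quote
            elif ch == ',':
                fields.append(''.join(field))
                field = []
            else:
                field.append(ch)
        i += 1
    fields.append(''.join(field))
    return fields

# parse_csv_line('"good,morning",100,"1998,5,3"')
# -> ['good,morning', '100', '1998,5,3']
```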
From Python's csv module documentation, reading a normal CSV file:
import csv
with open("some.csv", newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Reading a file with an alternate format:
import csv
with open("passwd", newline='') as f:
    reader = csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)
There are some nice usage examples in LinuxJournal.com.
If you're interested in the details, read "split string at commas respecting quotes when string not in csv format", which shows some nice regexes related to this problem, or simply read the csv module source.
Chapter 4 of The Practice of Programming gave both C and C++ implementations of the CSV parser.
The generic implementation detail would be something like this (a rough, lightly tested sketch):
SEPARATOR = ','

def csvline2fields(line):
    fields = []
    while line.strip():
        line = line.strip()
        if line[0] in ("'", '"'):
            # Find the matching closing quote (search after the opener):
            end = line.find(line[0], 1)
            fields.append(line[1:end])
            line = line[end + 1:]
            # Find the beginning of the next field:
            next = line.find(SEPARATOR)
            if next == -1:
                break
            line = line[next + 1:]
            continue
        # Find the next separator:
        next = line.find(SEPARATOR)
        if next == -1:
            fields.append(line)
            break
        fields.append(line[0:next])
        line = line[next + 1:]
    return fields

# csvline2fields('"good,morning", 100, 300, "1998,5,3"')
# -> ['good,morning', '100', '300', '1998,5,3']
