Why do some rows in csv file have an invalid format? - python

I am currently fetching data from an API and I would like to store those data as csv.
However, some lines are always invalid which means I cannot split them via Excel's text-in-columns functionality.
I create the csv file as follows:
with open(directory_path + '/' + file_name + '-data.csv', 'a', newline='') as file:
    # Set up a writer
    csvwriter = csv.writer(file, delimiter='|')
    # Write the headline row
    if not headline_exists:
        csvwriter.writerow(['Title', 'Text', 'Tip'])
    # Build the data row
    record = data['title'] + '|' + data['text'] + '|' + data['tip']
    csvwriter.writerow([record])
If you open the csv file in Excel, you can also immediately see which rows are invalid: a valid row has the default height and spans the full width, while an invalid one is taller but narrower.
Does anyone know the reason for this problem?

The rows themselves are not invalid, but what you do with them is.
So first of all: you use pipes as delimiters. That's fine in some scenarios, but given that you want to load the file into Excel immediately, it seems wiser to me to export the data in the "excel" dialect:
csvwriter = csv.writer(file, dialect='excel')
Second, look at the following lines:
record = data['title'] + '|' + data['text'] + '|' + data['tip']
csvwriter.writerow([record])
This way you basically tell the csv writer that you want a single column with pipes in it. When you use a csv writer you must not concatenate the delimiters yourself; that defeats the point of using a writer. Here is how it should be done instead:
record = [data['title'], data['text'], data['tip']]
csvwriter.writerow(record)
Hope it helps.

I have finally found out that I had to strip the text and the tip fields, because they sometimes contain whitespace that breaks the format.
Additionally, I also followed the recommendation to use the excel dialect since I think this will make it easier to process the data later on.
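Putting the answer and the whitespace fix together, the writing code might look like the sketch below (the file handling is simplified to an in-memory buffer, and `write_record` is an illustrative name; field names follow the question):

```python
import csv
import io

def write_record(file_obj, data, headline_exists):
    """Write one record with the excel dialect, stripping stray whitespace."""
    csvwriter = csv.writer(file_obj, dialect='excel')
    if not headline_exists:
        csvwriter.writerow(['Title', 'Text', 'Tip'])
    # Pass the fields as a list; never pre-join them with the delimiter.
    csvwriter.writerow([data['title'].strip(),
                        data['text'].strip(),
                        data['tip'].strip()])

buf = io.StringIO()
write_record(buf, {'title': 'A', 'text': ' b ', 'tip': 'c\n'}, headline_exists=False)
print(buf.getvalue())  # writes "Title,Text,Tip" then "A,b,c"
```

The `.strip()` calls are what remove the stray whitespace that broke the format for the asker.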

Related

Parse pipe delimited CSV Python [duplicate]

I have a text file (.txt) which could be in tab-separated or pipe-separated format, and I need to convert it into CSV file format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file?
Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:
the Microsoft version of CSV is a textbook example of how not to design a textual file format.
The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying the delimiter is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped, too. This is not a trivial task.
So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.
Edit:
Python provides csv.Sniffer that can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):
a|b|c
"a|b"|c|d
foo|"bar|baz"|qux
You can do this:
import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
# => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
# write records using other dialect
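On Python 3 the same Sniffer approach looks like this; passing a delimiters string restricts the candidates the Sniffer considers, which tends to make detection more reliable (a sketch using the sample data from above):

```python
import csv
import io

sample = 'a|b|c\n"a|b"|c|d\nfoo|"bar|baz"|qux\n'

# Only consider pipe, tab and comma as candidate delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters='|\t,')

reader = csv.DictReader(io.StringIO(sample), dialect=dialect)
rows = list(reader)
print(dialect.delimiter)
print(rows[0])
```

The quoted delimiter in the second line is what lets the Sniffer tell the real delimiter apart from characters that merely occur inside fields.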
Your strategy could be the following:

- parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
- calculate some statistics on the resulting rows to decide which result set is the one you want to write. One idea is to count the total number of fields in the two record sets (expecting that tabs and pipes are not common inside fields). Another (if your data is strongly structured and you expect the same number of fields on each line) is to measure the standard deviation of the number of fields per line and take the record set with the smallest standard deviation.
The following example uses the simpler statistic (total number of fields):

import csv

piperows = []
tabrows = []

# parsing with | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing with TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()

# in this example, we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
totfieldspipe = reduce(lambda x, y: x + y, [len(r) for r in piperows])
totfieldstab = reduce(lambda x, y: x + y, [len(r) for r in tabrows])

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows
# the var yourrows contains the rows; now just write them in any format you like
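The other statistic suggested above (smallest standard deviation of fields per line) can be sketched like this; `most_uniform_parse` and the sample text are made up for illustration:

```python
import csv
import io
import statistics

def most_uniform_parse(text, delimiters=('|', '\t')):
    """Parse with each candidate delimiter and keep the result whose
    per-line field counts have the smallest standard deviation."""
    best = None
    for delim in delimiters:
        rows = list(csv.reader(io.StringIO(text), delimiter=delim))
        spread = statistics.pstdev(len(r) for r in rows)
        if best is None or spread < best[0]:
            best = (spread, delim, rows)
    return best[1], best[2]

# The stray tab in the second line makes the tab parse non-uniform,
# so the pipe parse wins.
delim, rows = most_uniform_parse('a|b|c\nd|e\tx|f\ng|h|i\n')
print(delim)
```

As the answer warns, neither statistic is guaranteed to work; it depends on the nature of your data.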
Like this:

from __future__ import with_statement
import csv
import re

with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:
            writer.writerow(re.split('[\t|]', line.rstrip('\n')))
I would suggest taking some of the example code from the existing answers (or better, using Python's csv module) and changing it to first assume tab-separated, then pipe-separated, producing two comma-separated output files. Then you visually examine both files to determine which one you want and pick that.
If you actually have lots of files, then you need to try to find a way to detect which file is which.
One of the examples has this:
if "|" in line:
This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe separated, else assume a tab separated file.
Alternatively fix the file to contain a key field in the first line which is easily identified - or maybe the first line contains column headers which can be detected.
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))

Write data with special chars/quotation into CSV, using python

My data, that I have in a Python list, can contain quotes, etc.:
the Government's
projections`
indexation" that
Now I'd like to write it into a CSV file but it seems the special chars "break" the CSV structure.
csv.register_dialect('doublequote', quotechar='"', delimiter=';', quoting=csv.QUOTE_ALL)
with open('csv_data.csv', 'r+b') as f:
    header = next(csv.reader(f))
    dict_writer = csv.DictWriter(f, header, -999, dialect='doublequote')
    dict_writer.writerow(csv_data_list)
It usually manages to write the first 50 lines or so before breaking. When I deleted the offending row from the source list, it got to about 60 lines.
Is there any "better" way of writing all sorts of data into a CSV?
I'm trying something like this:
data['title'] = data['title'].replace("'", '`')
but that doesn't seem to be the right approach.
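For what it's worth, the csv module already protects embedded quote characters by doubling them, as long as you hand the writer clean field values instead of pre-built lines; a minimal round-trip sketch (field values mirror the samples above, dialect settings follow the question):

```python
import csv
import io

rows = [["the Government's", 'projections`', 'indexation" that']]

buf = io.StringIO()
writer = csv.writer(buf, delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL)
writer.writerows(rows)

# Reading it back restores the original values, quotes and all.
restored = list(csv.reader(io.StringIO(buf.getvalue()), delimiter=';'))
print(restored == rows)  # -> True
```

So there is no need to replace quote characters in the data yourself; QUOTE_ALL plus the default doubling handles them.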

writing to csv but not getting desired formatting

I want to remove quoting altogether, but when I write to my csv file I'm getting an extra \ between the name and the ip.
with open('test.csv', 'w') as csvfile:
    info = csv.writer(csvfile, quoting=csv.QUOTE_NONE, delimiter=',', escapechar='\\')
    for json_object in json_data:
        if len(json_object['names']) != 0:
            name = json_object['names'][0]
            ip = json_object['ip_address']
            combined = (name + ',' + ip)
            print combined
            info.writerow([combined])
this is what I'm seeing in my csv file:
ncs1.aol.net\,10.136.0.2
the formatting i'm trying to achieve is:
ncs1.aol.net,10.136.0.2
You don't need to build the row yourself; that's the point of using csv. Instead of creating combined, just use:
info.writerow([name, ip])
Currently you're writing a single item to the row, which means the module escapes the , for you and you get \,.
Note that stripping combined, as in combined.strip('\\'), would not help here: the backslash is inserted by the writer when it escapes the delimiter; it is never part of combined itself.
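A quick check of the recommended form, with an in-memory buffer and the host data from the question:

```python
import csv
import io

buf = io.StringIO()
info = csv.writer(buf, quoting=csv.QUOTE_NONE, delimiter=',', escapechar='\\')
# Two separate fields: the writer inserts the comma itself,
# so nothing needs escaping and no backslash appears.
info.writerow(['ncs1.aol.net', '10.136.0.2'])
print(buf.getvalue())  # -> ncs1.aol.net,10.136.0.2
```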

Two Tab-Separated Delimiter in Python Writing to TSV

I have the following code:
f = open("test.tsv", 'wt')
try:
    writer = csv.writer(f, delimiter='\t')
    for x in range(0, len(ordered)):
        writer.writerow((ordered[x][0], "\t", ordered[x][1]))
finally:
    f.close()
I need the TSV file to have ordered[x][0] separated from ordered[x][1] by two tabs. The "\t" item adds space, but it's not a tab, and the quotation marks around it show up in the output.
Thank You!
You could replace the "\t" by "" to obtain what you want:
writer.writerow((ordered[x][0],"", ordered[x][1]))
Indeed, the empty string in the middle will then be surrounded by a tab on both sides, effectively putting two tabs between ordered[x][0] and ordered[x][1].
However, a more natural code doing exactly the same thing would be:
with open("test.tsv", "w") as fh:
    for e in ordered:
        fh.write("\t\t".join(map(str, e[:2])) + "\n")
where I:

- used a with statement (explained here) instead of the try ... finally construct
- removed the t mode in the open call (text mode is the default behavior)
- iterated over the elements of ordered with for ... in instead of using an index
- used join instead of a csv writer: csv writers are only suited to cases where the delimiter is a single character
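The last point can be checked directly: the csv module refuses multi-character delimiters, so a double tab has to be produced with join (a small sketch):

```python
import csv
import io

try:
    csv.writer(io.StringIO(), delimiter='\t\t')
    raised = False
except TypeError:
    # The csv module only accepts 1-character delimiters.
    raised = True

print(raised)  # -> True
print('\t\t'.join(['a', 'b']) == 'a\t\tb')  # -> True
```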

Remove double quotes from iterator when using csv writer

I want to create a csv from an existing csv, by splitting its rows.
Input csv:
A,R,T,11,12,13,14,15,21,22,23,24,25
Output csv:
A,R,T,11,12,13,14,15
A,R,T,21,22,23,24,25
So far my code looks like:
def update_csv(name):
    # load csv file
    file_ = open(name, 'rb')
    # init first values
    current_a = ""
    current_r = ""
    current_first_time = ""
    file_content = csv.reader(file_)
    # LOOP
    for row in file_content:
        current_a = row[0]
        current_r = row[1]
        current_first_time = row[2]
        i = 2
        # Write row to new csv
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            writer.writerow((current_a,
                             current_r,
                             current_first_time,
                             ",".join(row[x] for x in range(i + 1, i + 5))
                             ))
        # do only one row, for debug purposes
        return
But the row contains double quotes that I can't get rid of:
A002,R051,02-00-00,"05-21-11,00:00:00,REGULAR,003169391"
I've tried to use writer = csv.writer(f,quoting=csv.QUOTE_NONE) and got a _csv.Error: need to escape, but no escapechar set.
What is the correct approach to delete those quotes?
I think you could simplify the logic to split each row into two using something along these lines:
def update_csv(name):
    with open(name, 'rb') as file_:
        with open("updated_" + name, 'wb') as f:
            writer = csv.writer(f)
            # read one row from the input csv
            for row in csv.reader(file_):
                # write 2 rows to the new csv
                writer.writerow(row[:8])
                writer.writerow(row[:3] + row[8:])
writer.writerow expects an iterable and writes each item within it into the file as one field, separated by the appropriate delimiter. So:
writer.writerow([1, 2, 3])
would write "1,2,3\n" to the file.
Your call provides it with an iterable, one of whose items is a string that already contains the delimiter. The writer therefore needs some way to either escape the delimiter or quote out that item. For example,
writer.writerow([1, '2,3'])
doesn't just give "1,2,3\n", but e.g. '1,"2,3"\n' - the string counts as one item in the output.
Therefore if you want to not have quotes in the output, you need to provide an escape character (e.g. '/') to mark the delimiters that shouldn't be counted as such (giving something like "1,2/,3\n").
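The escaping behaviour just described can be reproduced with QUOTE_NONE and an escape character (same toy data as in the example above):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar='/')
# The comma inside '2,3' is escaped rather than quoted.
writer.writerow([1, '2,3'])
print(repr(buf.getvalue()))  # -> '1,2/,3\r\n'
```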
However, I think what you actually want is to include all of those elements as separate items. Don't ",".join(...) them yourself; try:
writer.writerow([current_a, current_r,
                 current_first_time] + row[i + 1:i + 5])
to provide the relevant items from row as separate fields (row[i + 1:i + 5] matches the range(i + 1, i + 5) indices used in the question).
