trouble with unicodecsv reader in python

I'm having trouble using the unicodecsv reader. I keep looking for different examples of how to use the module, but everyone keeps referencing the exact sample from the unicodecsv website (or some similar variation).
import unicodecsv as csv
from io import BytesIO
f = BytesIO()
w = csv.writer(f, encoding='utf-8')
_ = w.writerow((u'é', u'ñ'))
_ = f.seek(0)
r = csv.reader(f, encoding='utf-8')
next(r) == [u'é', u'ñ']
>>> True
For me this example makes too many assumptions about our understanding. It doesn't look like a csv file is being passed. I've completely missed the plot.
What I want to do is:
Read the first line of the csv file, which contains the headers
Read the remaining lines and put them in a dictionary
My broken code:
import unicodecsv
#
i = 0
myCSV = "$_input.csv"
dic = {}
#
f = open(myCSV, "rb")
reader = unicodecsv.reader(f, delimiter=',')
strHeader = reader.next()
#
# read the first line of csv
# use custom function to parse the header
myHeader = FNC.PARSE_HEADER(strHeader)
#
# read the remaining lines
# put data into dictionary of class objects
for row in reader:
    i += 1
    dic[i] = cDATA(myHeader, row)
And, as expected, I get the 'UnicodeDecodeError'. Maybe the example above has the answers, but they are just completely going over my head.
Can someone please fix my code? I'm running out of hair to pull out.
I switched the reader line to:
reader = unicodecsv.reader(f, encoding='utf-8')
Traceback:
for row in reader:
File "C:\Python27\unicodecsv\py2.py", line 128 in next
for value in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 48: invalid start byte
When I strictly print the data using:
f = open(myCSV, "rb")
reader = csv.reader(f, delimiter=',')
for row in reader:
    print(str(row[9]) + '\n')
    print(repr(row[9]) + '\n')
>>> UTAS ? Offline
>>> 'UTAS ? Offline'

You need to declare the encoding of the input file when creating the reader, just like you did when creating the writer:
>>> import unicodecsv as csv
>>> with open('example.csv', 'wb') as f:
...     writer = csv.writer(f, encoding='utf-8')
...     writer.writerow(('heading0', 'heading1'))
...     writer.writerow((u'é', u'ñ'))
...     writer.writerow((u'ŋ', u'ŧ'))
...
>>> with open('example.csv', 'rb') as f:
...     reader = csv.reader(f, encoding='utf-8')
...     headers = next(reader)
...     print headers
...     data = {i: v for (i, v) in enumerate(reader)}
...     print data
...
[u'heading0', u'heading1']
{0: [u'\xe9', u'\xf1'], 1: [u'\u014b', u'\u0167']}
Printing the dictionary shows the escaped representation of the data, but you can see the characters by printing them individually:
>>> for v in data.values():
...     for s in v:
...         print s
...
é
ñ
ŋ
ŧ
EDIT:
If the encoding of the file is unknown, then it's best to use something like chardet to determine the encoding before processing.
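As a rough illustration of the idea, and without relying on chardet itself, the sketch below tries a short list of candidate encodings in order and keeps the first one that decodes without error. It is written for Python 3; the function name `guess_encoding` and the candidate list are assumptions made up for this example, and a clean decode does not prove that the guess is the encoding the file was really written in.

```python
def guess_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error.

    Hypothetical helper for illustration only; it is not part of unicodecsv
    or chardet. Returns None if no candidate works.
    """
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# A byte string that is valid cp1252 but not valid UTF-8.
sample = u"UTAS \u00e9 Offline".encode("cp1252")
print(guess_encoding(sample))
```

The result can then be passed as the `encoding` argument when creating the reader. Since latin-1 accepts every byte value, it only makes sense as the last candidate.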

If your final goal is to read a csv file and convert the data into dicts, then I would recommend using csv.DictReader. DictReader takes care of reading the header and converting each of the remaining rows into a dict. It is part of the standard csv module, for which plenty of documentation and examples are available.
>>> import csv
>>> with open('names.csv') as csvfile:
...     reader = csv.DictReader(csvfile)
...     for row in reader:
...         print(row['first_name'], row['last_name'])
For more detail, see the examples here: https://docs.python.org/2/library/csv.html#csv.DictReader
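For the goal stated in the question (headers from the first line, remaining rows into a dictionary), a Python 3 sketch could look like the following. It assumes Python 3, where the built-in csv module works on text and the encoding is handled by `open()`; the function name `read_rows_as_dicts` and the sample data are made up for this example.

```python
import csv
import io


def read_rows_as_dicts(text_stream, delimiter=","):
    """Map a 1-based row number to a dict keyed by the header names."""
    reader = csv.DictReader(text_stream, delimiter=delimiter)
    return {i: row for i, row in enumerate(reader, start=1)}


# In-memory stand-in for a real file. For a file on disk one would pass
# open(path, encoding="utf-8", newline="") instead.
sample = io.StringIO(u"heading0,heading1\n\u00e9,\u00f1\n\u014b,\u0167\n")
rows = read_rows_as_dicts(sample)
print(len(rows))
```

Each value in `rows` is a mapping such as `rows[1]["heading0"]`, so no separate header-parsing step is needed.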

Related

decoding breaks lines into characters in python 3

I am reading a CSV file through samba share. My CSV file format
hello;world
1;2;
Python code
import csv
import urllib.request
from smb.SMBHandler import SMBHandler
PATH = 'smb://myusername:mypassword#192.168.1.200/myDir/'
opener = urllib.request.build_opener(SMBHandler)
fh = opener.open(PATH + 'myFileName')
data = fh.read().decode('utf-8')
print(data)  # This prints the data right
csvfile = csv.reader(data, delimiter=';')
for myrow in csvfile:
    print(myrow)  # This just prints ['h'], however it should print hello;world
    break
fh.close()
The problem is that after decoding to utf-8, the rows are not the actual lines in the file
Desired output of a row after reading the file: hello;world
Current output of a row after reading the file: h
Any help is appreciated.

csv.reader takes an iterable that returns lines. Strings, when iterated, yield characters. The fix is simple:
csvfile = csv.reader(data.splitlines(), delimiter=';')
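The difference can be seen in a small self-contained sketch (Python 3, using an in-memory string in place of the data read from the share; the variable names are only illustrative):

```python
import csv
import io

data = "hello;world\n1;2;\n"

# Passing the string itself: each "line" the reader sees is one character.
by_char = next(csv.reader(data, delimiter=";"))

# Passing a list of lines: each row is parsed as intended.
by_line = list(csv.reader(data.splitlines(), delimiter=";"))

# Wrapping the string in a file-like object also works, and keeps
# newlines inside quoted fields intact.
by_stream = list(csv.reader(io.StringIO(data, newline=""), delimiter=";"))

print(by_char)
print(by_line)
```

Here `by_char` is `['h']`, while `by_line` and `by_stream` both hold the two parsed rows.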

Issue in writing result into csv file using python 2.7

I am using Python 2.7 on a Windows 10 (64-bit) system. I have a string str which, when executed, shows the following result:
'abcd'
'wxyz'
Now I want to write this result into a result.csv file, so I wrote the following Python script:
import csv
with open('result.csv', 'w') as csv_file:
    csv_write = csv.writer(csv_file)
    csv_write.writerow([str])
But whenever I execute this script, I am finding only wxyz in result.csv file.
Help me with this issue.
Thanks in advance.

Python 2.7 csv likes the 'b' mode for writing (in Python 3, use 'w' together with newline='').
Example: Pre-built list of string to file
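import csv
strings = []
s1 = 'abcd'
s2 = 'wxyz'
strings.append(s1)
strings.append(s2)
csvf = r"C:\path\to\my\file.csv"
with open(csvf, 'wb') as f:
    w = csv.writer(f, delimiter=',')
    for s in strings:
        # writerow() expects a sequence of fields; wrap the string in a
        # list so it is written as one field, not one field per character
        w.writerow([s])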
Example: Use of reader() to build list of rows to supply writer()
import csv
# read current rows in csv and return them as a list
# (the reader is lazy, so it must be consumed before the file is closed)
def read(_f):
    with open(_f, 'rb') as f:
        reader = csv.reader(f, delimiter=',')
        return list(reader)
# writes the existing rows
# then writes the new content to end
def write(_f, _rows, _adding):
    with open(_f, 'wb') as f:
        writer = csv.writer(f, delimiter=',')
        for row in _rows:
            writer.writerow(row)
        for s in _adding:
            writer.writerow([s])  # one field per string
strings = []
s1 = 'abcd'
s2 = 'wxyz'
strings.append(s1)
strings.append(s2)
csvf = r"C:\path\to\my\file.csv"
content = read(csvf)
write(csvf, content, strings)
Example: Quick append
import csv
strings = []
s1 = 'abcd'
s2 = 'wxyz'
strings.append(s1)
strings.append(s2)
csvf = r"C:\path\to\my\file.csv"
with open(csvf, 'ab') as f:
    writer = csv.writer(f, delimiter=',')
    for s in strings:
        writer.writerow([s])  # one field per string
References:
In Python 2.x, files passed to reader() and writer() had to be opened with the 'b' flag. This was a result of how the module handles line termination.
In Python 3.x this was changed so that files for reader() and writer() should be opened with newline=''; the module still handles line termination itself.
There is also this post and that post covering some of this.
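For comparison, a Python 3 version of the same write might look like the sketch below. It assumes Python 3 and writes to a temporary directory rather than the path used above; each string is wrapped in a list so that it becomes a single field.

```python
import csv
import os
import tempfile

strings = ["abcd", "wxyz"]

# Temporary location chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "result.csv")

# Text mode with newline="" lets the csv module control line endings.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for s in strings:
        writer.writerow([s])  # one row with one field per string

# Read the file back to check what was written.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Passing the bare string to `writerow()` instead of a one-element list would write each character as its own field.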

Python csv reader // how to ignore enclosing char (because sometimes it's missing)

I am trying to import csv data from files where sometimes the enclosing char " is missing.
So I have rows like this:
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
# In the second row the closing " after 2200.00 is missing
# also the opening " before EUR is missing
Now I am reading the csv data with this:
csv.reader(
    codecs.open(filename, 'r', encoding='latin-1'),
    delimiter=";",
    dialect=csv.excel_tab)
And the data I get for the second row is this:
["MacBookPro", "2200.00;EUR"]
Aside from pre-processing my csv files with a unix command like sed, removing all enclosing " chars and relying on the semicolon to separate the columns, what else can I do?

This might work:
import csv
import io
file = io.StringIO(u'''
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
'''.strip())
reader = csv.reader((line.replace('"', '') for line in file), delimiter=';', quotechar='"')
for row in reader:
    print(row)
The problem is that if there are any legitimately quoted lines, e.g.
"MacBookPro;Awesome Edition";"2200.00";"EUR"
Or, worse:
"MacBookPro:
Description: Awesome Edition";"2200.00";"EUR"
Your output is going to produce too few/many columns. But if you know that's not a problem then it will work fine. You could pre-screen the file by adding this before the read part, which would give you the malformed line:
for line in file:
    if line.count(';') != 2:
        raise ValueError('No! This file has broken data on line {!r}'.format(line))
file.seek(0)
Or alternatively you could screen as you're reading:
for row in reader:
    if any(';' in _ for _ in row):
        print('Error:')
        print(row)
Ultimately your best option is to fix whatever is producing your garbage csv file.

If you're looping through all the lines/rows of the file, you can use the string .replace() method to get rid of the quotes (if you don't need them later on for other purposes).
>>> import csv
>>> import codecs
>>> with codecs.open(filename, 'r', encoding='latin-1') as csvfile:
...     # QUOTE_NONE treats " as an ordinary character, so rows are split on ; only
...     my_file = csv.reader(csvfile, delimiter=";", quoting=csv.QUOTE_NONE)
...     for row in my_file:
...         # str.replace() returns a new string, so keep the result
...         (model, price, currency) = [field.replace('"', '') for field in row]
...         print 'Model is: %s (costs %s%s).' % (model, price, currency)
>>>
Model is: MacBookPro (costs 2200.00EUR).
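A self-contained way to try this idea is sketched below. It assumes Python 3 and uses an in-memory copy of the two rows from the question. Using `csv.QUOTE_NONE` here is one possible approach, not necessarily what the answer above intended: it makes the reader split on `;` only and leaves the quote characters in the fields, where they can be stripped afterwards.

```python
import csv
import io

raw = io.StringIO(
    '"ThinkPad";"2000.00";"EUR"\n'
    '"MacBookPro";"2200.00;EUR"\n'
)

# QUOTE_NONE: the " character gets no special treatment while parsing.
reader = csv.reader(raw, delimiter=";", quoting=csv.QUOTE_NONE)

# Remove the leftover quote characters from every field.
rows = [[field.replace('"', "") for field in row] for row in reader]

for model, price, currency in rows:
    print("Model is: %s (costs %s%s)." % (model, price, currency))
```

As noted earlier in this thread, this only holds up when no field legitimately contains a semicolon or a line break.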

python 3 csv reader + Ignore empty records [duplicate]

This is my code. I am able to print each line, but when a blank line appears it prints ; because of the CSV file format, so I want to skip blank lines.
import csv
import time
ifile = open(r"C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv", "rb")
empty_lines = 0
for line in csv.reader(ifile):
    if not line:
        empty_lines += 1
        continue
    print line

If you want to skip all whitespace lines, you should use this test: ' '.isspace().
Since you may want to do something more complicated than just printing the non-blank lines to the console (no need to use the csv module for that), here is an example that involves a DictReader:
#!/usr/bin/env python
# Tested with Python 2.7
# I prefer this style of importing - hides the csv module
# in case you do from this_file.py import * inside of __init__.py
import csv as _csv
# Real comments are more complicated ...
def is_comment(line):
    return line.startswith('#')
# Kind of silly wrapper
def is_whitespace(line):
    return line.isspace()
def iter_filtered(in_file, *filters):
    for line in in_file:
        if not any(fltr(line) for fltr in filters):
            yield line
# A disadvantage of this approach is that it requires storing rows in RAM
# However, the largest CSV files I worked with were all under 100 Mb
def read_and_filter_csv(csv_path, *filters):
    with open(csv_path, 'rb') as fin:
        iter_clean_lines = iter_filtered(fin, *filters)
        reader = _csv.DictReader(iter_clean_lines, delimiter=';')
        return [row for row in reader]
# Stores all processed lines in RAM
def main_v1(csv_path):
    for row in read_and_filter_csv(csv_path, is_comment, is_whitespace):
        print(row)  # Or do something else with it
# Simpler, less refactored version, does not use with
def main_v2(csv_path):
    # open before the try block so that fin always exists in finally
    fin = open(csv_path, 'rb')
    try:
        reader = _csv.DictReader((line for line in fin if not
                                  line.startswith('#') and not line.isspace()),
                                 delimiter=';')
        for row in reader:
            print(row)  # Or do something else with it
    finally:
        fin.close()
if __name__ == '__main__':
    # raw string, so the backslashes in the Windows path are kept literally
    csv_path = r"C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv"
    main_v1(csv_path)
    print('\n'*3)
    main_v2(csv_path)

Instead of
if not line:
This should work:
if not ''.join(line).strip():
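A small sketch of that test in action, assuming Python 3 and a semicolon-delimited file (the sample rows are made up): a line holding only `;` is parsed into a list of empty strings, which is truthy, so `if not line` does not catch it, while the joined-and-stripped test does.

```python
import csv
import io

# Rows 2 to 4 are "blank": only a delimiter, nothing at all, only spaces.
raw = io.StringIO("a;b\n;\n\n  ; \nc;d\n")

rows = [
    row
    for row in csv.reader(raw, delimiter=";")
    if "".join(row).strip()  # empty once all fields are joined and stripped
]
print(rows)
```

Only the two rows that contain data remain in `rows`.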

My suggestion would be to just use the csv reader, which splits the file into rows. That way you can check whether a row is empty and, if so, just continue.
import csv
with open('some.csv', 'r') as csvfile:
    # the delimiter depends on how your CSV separates values
    csvReader = csv.reader(csvfile, delimiter = '\t')
    for row in csvReader:
        # check if row is empty
        if not (row):
            continue

You can always check for the number of comma separated values. It seems to be much more productive and efficient.
When reading the lines iteratively, each row is a list of the comma-separated values. So if the list has no elements (blank line), we can skip it.
import csv
with open(filename) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    for row in csv_reader:
        if len(row) == 0:
            continue

You can strip leading and trailing whitespace, and if the length is zero after that the line is empty.

import csv
with open('userlist.csv') as f:
    reader = csv.reader(f)
    user_header = next(reader)  # Add this line if the file has a header row
    user_list = []  # Create a new user list for input
    for row in reader:
        if any(row):  # Pick up the non-blank rows
            print (row)  # Just for verification
            user_list.append(row)  # Collect the remaining data in the list

This example just prints the data in array form while skipping the empty lines:
import csv
file = open("data.csv", "r")
data = csv.reader(file)
for line in data:
    if line: print line
file.close()
I find it much clearer than the other provided examples.

import csv
ifile = csv.reader(open(r'C:\Users\BKA4ABT\Desktop\Test_Specification\RDBI.csv', 'rb'), delimiter=';')
for line in ifile:
    # skip rows that are empty or contain only empty cells
    if not any(line):
        continue
    for cell_value in line:
        print cell_value

Reading russian language data from csv

I have some data in CSV file that are in Russian:
2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы
Delimiter is ; symbol.
I want to read data and put it into array. I tried to read this data using this code:
def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"),delimiter=";" )
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset
Then I read and print result:
mydata = loadCsv('krish(csv3).csv')
print mydata
Output:
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
I found that in this case codecs are required and tried to do the same with this code:
import codecs
with codecs.open('krish(csv3).csv','r',encoding='utf8') as f:
    text = f.read()
print text
I got this error:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
What is the problem? When using codecs how to specify delimiter in my data?
I just want to read data from file and put it in 2-dimensional array.

\xea is the windows-1251 / cp5347 encoding for к. Therefore, you need to use windows-1251 decoding, not UTF-8.
In Python 2.7, the CSV library does not support Unicode properly - See "Unicode" in https://docs.python.org/2/library/csv.html
They propose a simple work around using:
# UTF8Recoder is a helper class defined in the same section of the docs
class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self
This would allow you to do:
def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # if you really need a list, uncomment the next line;
    # this lets you access specific rows, e.g. `line_12 = lines[12]`
    # return list(lines)
    # this returns an iterator, so the file is read lazily as you loop;
    # use this if you'll do a `for x in lines`
    return lines
If you try to print dataset, then you'll get a representation of a list within a list, where the first list is rows, and the second list is columns. Any encoded bytes or literals will be represented with \x or \u. To print the values, do:
for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)
If you need to write your results to another file (fairly typical), you could do:
import io
with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        my_output.write(u", ".join(csv_line) + u"\n")
This will automatically convert/encode your output to UTF-8.
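In Python 3 the recoding wrapper is not needed, because `open()` decodes the file and the csv module works on text. A minimal sketch follows, assuming Python 3; the function name `load_csv` and the temporary sample file are made up for this example and stand in for the real data.

```python
import csv
import os
import tempfile

# Create a small windows-1251 encoded sample file to read back.
path = os.path.join(tempfile.mkdtemp(), "sample_cp1251.csv")
with open(path, "wb") as f:
    f.write(u"квартира;Алматы\n".encode("cp1251"))


def load_csv(filename, encoding="cp1251", delimiter=";"):
    """Read the whole file into a list of rows (a two-dimensional list)."""
    with open(filename, encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))


dataset = load_csv(path)
print(len(dataset), len(dataset[0]))
```

`dataset[0][1]` then holds the second column of the first row as an ordinary `str`.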

You can try:
import pandas as pd
pd.read_csv(path_file, sep=";", encoding="cp1251")
or
import csv
with open(path_file, encoding="cp1251", errors='ignore') as source_file:
    reader = csv.reader(source_file, delimiter=";")

Could your .csv be in an encoding other than UTF-8? (Judging by the error message, it must be.) Try other Cyrillic encodings such as Windows-1251, CP866 or KOI8.

In py3:
import csv
path = 'C:/Users/me/Downloads/sv.csv'
with open(path, encoding="UTF8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
