Decoding breaks lines into characters in Python 3

I am reading a CSV file through a Samba share. My CSV file looks like this:
hello;world
1;2;
Python code:

import csv
import urllib.request
from smb.SMBHandler import SMBHandler

PATH = 'smb://myusername:mypassword#192.168.1.200/myDir/'
opener = urllib.request.build_opener(SMBHandler)
fh = opener.open(PATH + 'myFileName')
data = fh.read().decode('utf-8')
print(data)  # this prints the data correctly
csvfile = csv.reader(data, delimiter=';')
for myrow in csvfile:
    print(myrow)  # this just prints ['h'], but it should print ['hello', 'world']
    break
fh.close()
The problem is that after decoding to UTF-8, the rows are not the actual lines in the file.
Desired output of a row after reading the file: hello;world
Current output of a row after reading the file: h
Any help is appreciated.

csv.reader takes an iterable that returns lines. Strings, when iterated, yield characters. The fix is simple:
csvfile = csv.reader(data.splitlines(), delimiter=';')
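The difference is easy to demonstrate with an in-memory string standing in for the decoded SMB data:

```python
import csv

# Stand-in for fh.read().decode('utf-8') from the question
data = "hello;world\n1;2;\n"

# Passing the string directly iterates characters, so the first "row" is just ['h']
wrong = next(csv.reader(data, delimiter=';'))
print(wrong)   # ['h']

# Splitting into lines first gives csv.reader the iterable of lines it expects
right = next(csv.reader(data.splitlines(), delimiter=';'))
print(right)   # ['hello', 'world']
```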

Related

Converting a Text .txt document to CSV .csv Using a Delimiter

I'd like to create a CSV from a TXT file. I have a text file with lines (300+ lines) separated by backslashes. I'd like each line to be a separate row, and each backslash to mark a separate new column.
The text file looks like:
example 1\example 2\example 3\example 4
test 1\test 2\test 3\test 4
I'd like the CSV to look like:
Example 1
Example 2
Example 3
Example 4
Test 1
Test 2
Test 3
Test 4
So far I have:

import csv

with open('Report.txt') as report:
    report_txt = report.read()

with open('Report.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(report_txt)
I know I need to use \ as a delimiter, but I'm not sure how. Thanks for any help!
Define your delimiter like this (escape the \):
reader = csv.reader(open("Report.csv"), delimiter="\\")
Code:

import csv

with open('Report.txt') as report:
    reader = csv.reader(report, delimiter="\\")
    with open('Report_output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for line in reader:
            writer.writerow(line)
First you need to split the string on the delimiter. You can do this with the split() method or a regex.

import csv

with open('file.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    # materialize the rows while the file is still open
    lines = [line.split("\\") for line in stripped if line]
Then pretty much write it to the csv.
with open('report.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerows(lines)
Tweak your code accordingly. The concept is pretty much the same. Note the double backslash is to account for the escape character.
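To see the whole split-and-write pipeline work without touching the filesystem, the same logic can be run against in-memory buffers (io.StringIO standing in for the two files):

```python
import csv
import io

# In-memory stand-ins for file.txt and report.csv
in_file = io.StringIO("example 1\\example 2\\example 3\\example 4\ntest 1\\test 2\\test 3\\test 4\n")
out_file = io.StringIO()

# Strip trailing newlines, then split each non-empty line on the backslash
stripped = (line.strip() for line in in_file)
lines = (line.split("\\") for line in stripped if line)

writer = csv.writer(out_file)
writer.writerows(lines)
print(out_file.getvalue())
# example 1,example 2,example 3,example 4
# test 1,test 2,test 3,test 4
```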
If you are just trying to convert that text into CSV, you can just replace every "\" character with ";" and you'll have a valid CSV file.
Else, if you want to do something with the parsed data before re-exporting to CSV, you can read the file line by line, use the split() method with "\", then rejoin and write line by line, like here:
with open('in.txt') as input_file:
    with open('out.csv', 'a') as output_file:
        txt_line = input_file.readline()
        while txt_line:
            cells = txt_line.split("\\")
            # Do something with each cell...
            csv_line = ";".join(cells)
            output_file.write(csv_line)
            txt_line = input_file.readline()

Errors when reading column name from csv files and saving as list

I have a folder that has over 15,000 CSV files. They all have different numbers of columns.
Most files have their first row as column names (the attributes of the data), like this:
Name Date Contact Email
a b c d
a2 b2 c2 d2
What I want to do is read first row of all files, store them as a list, and write that list as new csv file.
Here is what I have done so far :
import csv
import glob

list = []
files = glob.glob('C:/example/*.csv')
for file in files:
    f = open(file)
    a = [file, f.readline()]
    list.append(a)

with open('test.csv', 'w') as testfile:
    csv_writer = csv.writer(testfile)
    for i in list:
        csv_writer.writerow(i)
When I try this code, result comes out like this :
[('C:/example\\example.csv', 'Name,Date,Contact,Email\n'), ('C:/example\\example2.csv', 'Address,Date,Name\n')]
Therefore in the resulting CSV, all the attributes of each file end up in the second column (and for some reason there is an empty row between entries).
Moreover, while going through the files I encountered another error:
UnicodeDecodeError: 'cp949' codec can't decode byte 0xed in position 6: illegal multibyte sequence
So I added this code at the top, but it didn't work; it said the files are invalid:
import codecs
files=glob.glob('C:/example/*.csv')
fileObj = codecs.open( files, "r", "utf-8" )
I read answers on Stack Overflow but I couldn't find one related to my problem. I appreciate your answers.
Ok, so
import csv
import glob

list = []
files = glob.glob('C:/example/*.csv')
for file in files:
    f = open(file)
    a = [file, f.readline()]
    list.append(a)
here you're opening the file and then creating a list containing the file name and the whole header row as a single string (note that means it looks like "Column1,Column2"). So: [("Filename", "Column1,Column2")]
so you're going to need to split that on the ',' like:

for file in files:
    f = open(file)
    a = [file, f.readline().split(',')]

Now we have:

["filename", ["Column1", "Column2"]]
So it's still going to print to the file wrong. We need to concatenate the lists.
a = [file] + f.readline().split(',')
So we get:
["filename", "Column1", "Column2"]
And you should be closing each file after you open it with f.close() or use a context manager inside your loop like:
for file in files:
    with open(file) as f:
        a = [file] + f.readline().split(',')
        list.append(a)
Better solution and how I would write it:
import csv
import glob

files = glob.glob('mydir/*.csv')
lst = list()
for file in files:
    with open(file) as f:
        reader = csv.reader(f)
        lst.append(next(reader))
try:
    with open(file, 'r', encoding='utf8') as f:
        pass  # do things
except UnicodeError:
    # fall back to another encoding (e.g. cp949, the codec named in the error above)
    with open(file, 'r', encoding='cp949') as f:
        pass  # do things
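One catch in the snippet above: open() takes a single path, not the whole files list, and encoding is a keyword argument of open() itself. A self-contained sketch of the fallback, using a throwaway temp file (cp949 as the fallback encoding is an assumption based on the codec named in the original error):

```python
import os
import tempfile

def read_header(path):
    # Try UTF-8 first; fall back to cp949 (the codec that raised the original error)
    try:
        with open(path, encoding='utf-8') as f:
            return f.readline().strip()
    except UnicodeDecodeError:
        with open(path, encoding='cp949') as f:
            return f.readline().strip()

# Quick demonstration with a temporary CSV file
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w', encoding='utf-8') as f:
    f.write('Name,Date,Contact,Email\n')
print(read_header(path))  # Name,Date,Contact,Email
os.remove(path)
```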
a little bit of tidying, proper context managing, and using csv.reader:

import csv
import glob

files = glob.glob('C:/example/*.csv')
with open('test.csv', 'w', newline='') as testfile:
    csv_writer = csv.writer(testfile)
    for file in files:
        with open(file, 'r') as infile:
            reader = csv.reader(infile)
            headers = next(reader)
            lst = [file] + headers
            csv_writer.writerow(lst)
this will write a new csv with one row per infile, each row being filename, column1, column2, ... (passing newline='' to open() also avoids the blank rows between entries that you saw on Windows)

Number formatting a CSV

I have developed a script that produces a CSV file. On inspection of the file, some cells are being interpreted not the way I want.
E.g. in my list in Python, values like '02e4' get automatically formatted as 2.00E+04.

table = [['aa02', 'fb4a82', '0a0009'], ['02e4', '452ca2', '0b0004']]
ofile = open('test.csv', 'wb')
for i in range(0, len(table)):
    for j in range(0, len(table[i])):
        ofile.write(table[i][j] + ",")
    ofile.write("\n")
This gives me:
aa02 fb4a82 0a0009
2.00E+04 452ca2 0b0004
I've tried using csv.writer instead, where writer = csv.writer(ofile, ...), and giving it attributes from the lib (e.g. csv.QUOTE_ALL)... but it's the same output as before.
Is there a way using the CSV lib to automatically format all my values as strings before it's written?
Or is this not possible?
Thanks
Try setting the quoting parameter in your csv writer to csv.QUOTE_ALL.
See the doc for more info:
import csv

with open('myfile.csv', 'wb') as csvfile:
    wtr = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    wtr.writerow(...)
Although it sounds like the problem might lie with your csv viewer. Excel has a rather annoying habit of auto-formatting data like you describe.
If you want the '02e4' to show up in Excel as "02e4" then, annoyingly, you have to write a CSV with triple double quotes: """02e4""". I don't know of a way to do this with the csv writer because it limits your quote character to a single character. However, you can do something similar to your original attempt:
table = [['aa02', 'fb4a82', '0a0009'], ['02e4', '452ca2', '0b0004']]
ofile = open('test.csv', 'wb')
for i in range(0, len(table)):
    for j in range(len(table[i])):
        ofile.write('"""%s""",' % table[i][j])
    ofile.write("\n")
If opened in a text editor your csv file will read:
"""aa02""","""fb4a82""","""0a0009""",
"""02e4""","""452ca2""","""0b0004""",
This makes each value show up in Excel wrapped in quotes, e.g. "02e4".
If you wanted to use any single character quotation you could use the csv module like so:
import csv

table = [['aa02', 'fb4a82', '0a0009'], ['02e4', '452ca2', '0b0004']]
ofile = open('test.csv', 'wb')
writer = csv.writer(ofile, delimiter=',', quotechar='|', quoting=csv.QUOTE_ALL)
for i in range(len(table)):
    writer.writerow(table[i])
The output in the text editor will be:
|aa02|,|fb4a82|,|0a0009|
|02e4|,|452ca2|,|0b0004|
and Excel will show the values between the | quote characters.

Python csv reader // how to ignore enclosing char (because sometimes it's missing)

I am trying to import csv data from files where sometimes the enclosing char " is missing.
So I have rows like this:
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
# In the second row the closing " after 2200.00 is missing
# also the closing " before EUR" is missing
Now I am reading the csv data with this:
csv.reader(
    codecs.open(filename, 'r', encoding='latin-1'),
    delimiter=";",
    dialect=csv.excel_tab)
And the data I get for the second row is this:
["MacBookPro", "2200.00;EUR"]
Aside from pre-processing my CSV files with a Unix command like sed, removing all enclosing " chars and relying on the semicolon to separate the columns, what else can I do?
This might work:
import csv
import io
file = io.StringIO(u'''
"ThinkPad";"2000.00";"EUR"
"MacBookPro";"2200.00;EUR"
'''.strip())
reader = csv.reader((line.replace('"', '') for line in file), delimiter=';', quotechar='"')
for row in reader:
    print(row)
The problem is that if there are any legitimately quoted lines, e.g.
"MacBookPro;Awesome Edition";"2200.00";"EUR"
Or, worse:
"MacBookPro:
Description: Awesome Edition";"2200.00";"EUR"
Your output is going to produce too few/many columns. But if you know that's not a problem then it will work fine. You could pre-screen the file by adding this before the read part, which would give you the malformed line:
for line in file:
    if line.count(';') != 2:
        raise ValueError('No! This file has broken data on line {!r}'.format(line))
file.seek(0)
Or alternatively you could screen as you're reading:
for row in reader:
    if any(';' in _ for _ in row):
        print('Error:')
        print(row)
Ultimately your best option is to fix whatever is producing your garbage csv file.
If you're looping through all the lines/rows of the file, you can use the string .replace() method to get rid of the quotes (if you don't need them later on for other purposes).
>>> import csv, codecs
>>> my_file = csv.reader(
...     codecs.open(filename, 'r', encoding='latin-1'),
...     delimiter=";",
...     dialect=csv.excel_tab)
>>> for row in my_file:
...     (model, price, currency) = row
...     model = model.replace('"', '')      # str.replace returns a new string,
...     price = price.replace('"', '')      # so the result has to be reassigned
...     currency = currency.replace('"', '')
...     print('Model is: %s (costs %s%s).' % (model, price, currency))
...
Model is: MacBookPro (costs 2200.00EUR).

How to parse a single line csv string without the csv.reader iterator in python?

I have a CSV file that I need to rearrange and re-encode. I'd like to run
line = line.decode('windows-1250').encode('utf-8')
on each line before it's parsed and split by the CSV reader. Or I'd like to iterate over the lines myself, run the re-encoding, and use just the single-line parsing from the CSV library, but with the same reader instance.
Is there a way to do that nicely?
Looping over the lines of a file can be done this way:

with open('path/to/my/file.csv', 'r') as f:
    for line in f:
        print(line)  # here you can convert the encoding and save the lines
But if you want to convert the encoding of a whole file, you can also call (note that redirecting back into the same file would truncate it before iconv reads it):
$ iconv -f Windows-1250 -t UTF8 file.csv > file_utf8.csv
Edit: So where the problem is?
with open('path/to/my/file.csv', 'r') as f:
    for line in f:
        line = line.decode('windows-1250').encode('utf-8')
        elements = line.split(",")
Thanks for the answers. The wrapping one gave me an idea:
def reencode(file):
    for line in file:
        yield line.decode('windows-1250').encode('utf-8')

csv_writer = csv.writer(open(outfilepath, 'w'), delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
csv_reader = csv.reader(reencode(open(filepath)), delimiter=";", quotechar='"')
for c in csv_reader:
    l = c  # rearrange columns here
    csv_writer.writerow(l)
That's exactly what I was going for: re-encoding each line just before it gets parsed by the csv_reader.
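For what it's worth, in Python 3 no wrapper is needed: open() decodes on read and encodes on write, so the reader and writer can be pointed at differently-encoded files directly. A sketch with in-memory buffers standing in for open(filepath, encoding='windows-1250') and open(outfilepath, 'w', encoding='utf-8', newline=''):

```python
import csv
import io

# src stands in for the windows-1250 input file, dst for the utf-8 output file
src = io.StringIO('"a";"b";"c"\n"1";"2";"3"\n')
dst = io.StringIO()

csv_reader = csv.reader(src, delimiter=';', quotechar='"')
csv_writer = csv.writer(dst, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
for row in csv_reader:
    csv_writer.writerow(row)  # columns could be rearranged here
print(dst.getvalue())
# a,b,c
# 1,2,3
```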
At the very bottom of the (Python 2) csv module documentation is a set of example classes (UnicodeReader and UnicodeWriter) that implement Unicode support for csv:
rfile = open('input.csv')
wfile = open('output.csv', 'w')
csv_reader = UnicodeReader(rfile, encoding='windows-1250')
csv_writer = UnicodeWriter(wfile, encoding='utf-8')
for c in csv_reader:
    # process Unicode lines
    csv_writer.writerow(c)
rfile.close()
wfile.close()
