DBF - encoding cp1250 - python

I have a DBF database encoded in cp1250 and I am reading it using the following code:
import csv
from dbfpy import dbf
import os
import sys

filename = sys.argv[1]
if filename.endswith('.dbf'):
    print "Converting %s to csv" % filename
    csv_fn = filename[:-4] + ".csv"
    with open(csv_fn, 'wb') as csvfile:
        in_db = dbf.Dbf(filename)
        out_csv = csv.writer(csvfile)
        names = []
        for field in in_db.header.fields:
            names.append(field.name)
        #out_csv.writerow(names)
        for rec in in_db:
            out_csv.writerow(rec.fieldData)
        in_db.close()
        print "Done..."
else:
    print "Filename does not end with .dbf"
The problem is that the final CSV file is wrong: the encoding of the file is ANSI and some characters are corrupted. I would like to ask if you can help me read the DBF file correctly.
EDIT 1
I tried different code from https://pypi.python.org/pypi/simpledbf/0.2.4, but there is an error.
Source 2:
from simpledbf import Dbf5
import os
import sys
dbf = Dbf5('test.dbf', codec='cp1250');
dbf.to_csv('junk.csv');
Output:
python program2.py
Traceback (most recent call last):
  File "program2.py", line 5, in <module>
    dbf = Dbf5('test.dbf', codec='cp1250');
  File "D:\ProgramFiles\Anaconda\lib\site-packages\simpledbf\simpledbf.py", line 557, in __init__
    assert terminator == b'\r'
AssertionError
I really don't know how to solve this problem.

Try using my dbf library:
import dbf
with dbf.Table('test.dbf') as table:
    dbf.export(table, 'junk.csv')

I wrote simpledbf. The line that is causing you problems was from some testing I was doing when developing the module. First of all, you might want to update your installation, as 0.2.6 is the most recent. Then you can try removing that particular line (#557) from the file "D:\ProgramFiles\Anaconda\lib\site-packages\simpledbf\simpledbf.py". If that doesn't work, you can ping me at the GitHub repo for simpledbf, or you could try Ethan's suggestion for the dbf module.

You can decode and encode as necessary. dbfpy assumes strings are UTF-8 encoded, so you can decode them from UTF-8 and then encode them again with the right encoding.
import csv
from dbfpy import dbf
import os
import sys

filename = sys.argv[1]
if filename.endswith('.dbf'):
    print "Converting %s to csv" % filename
    csv_fn = filename[:-4] + ".csv"
    with open(csv_fn, 'wb') as csvfile:
        in_db = dbf.Dbf(filename)
        out_csv = csv.writer(csvfile)
        names = []
        for field in in_db.header.fields:
            names.append(field.name)
        #out_csv.writerow(names)
        for rec in in_db:
            # re-encode string fields before writing them out
            row = [i.decode('utf8').encode('cp1250') if isinstance(i, str) else i for i in rec.fieldData]
            out_csv.writerow(row)
        in_db.close()
        print "Done..."
else:
    print "Filename does not end with .dbf"

Related

Hashing Issue, Non-Text Files

My code works OK except for the hashing. It works fine on text files, but as soon as it encounters a JPG or another non-text file type, it crashes. I know it's some type of encoding error, but I'm stumped on how to handle non-text files properly.
# import libraries
import os
import time
from datetime import datetime
import logging
import hashlib
from prettytable import PrettyTable
from pathlib import Path
import glob

# user input
path = input("Please enter directory: ")
print("===============================================")

# processing input
if os.path.exists(path):
    print("Processing directory: ", (path))
else:
    print("Invalid directory.")
    logging.basicConfig(filename="error.log", level=logging.ERROR)
    logging.error(' The directory is not valid, please run the script again with the correct directory.')
print("===============================================")

# process directory
directory = Path(path)
paths = []
filename = []
size = []
hashes = []
modified = []
files = list(directory.glob('**/*.*'))
for file in files:
    paths.append(file.parents[0])
    filename.append(file.parts[-1])
    size.append(file.stat().st_size)
    modified.append(datetime.fromtimestamp(file.stat().st_mtime))
    with open(file) as f:
        hashes.append(hashlib.md5(f.read().encode()).hexdigest())

# output into table
report = PrettyTable()
column_names = ['Path', 'File Name', 'File Size', 'Last Modified Time', 'MD5 Hash']
report.add_column(column_names[0], paths)
report.add_column(column_names[1], filename)
report.add_column(column_names[2], size)
report.add_column(column_names[3], modified)
report.add_column(column_names[4], hashes)
report.sortby = 'File Size'
print(report)
Change the following lines:
with open(file) as f:
    hashes.append(hashlib.md5(f.read().encode()).hexdigest())
to
with open(file, "rb") as f:
hashes.append(hashlib.md5(f.read()).hexdigest())
This way you read the contents directly as bytes and calculate the hash from them.
Your version tried to read the file as text and re-encode it to bytes.
Reading a file as text means the code tries to decode it with the system's default encoding. For some byte combinations this will fail, because they are not valid code points in that encoding.
So just read everything directly as bytes.
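If the files can be large, you may prefer not to read everything into memory at once; a minimal sketch (the helper name md5_of_file is my own) that hashes in fixed-size chunks:
import hashlib

def md5_of_file(file_path, chunk_size=65536):
    # read the file in binary chunks so large files do not have to fit in memory
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()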

Open file from zip without extracting it in Python?

I am working on a script that fetches a zip file from a URL using the requests library. That zip file contains a CSV file. I'm trying to read that CSV file without saving it, but while parsing it I get this error: _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
import csv
import requests
from io import BytesIO, StringIO
from zipfile import ZipFile

response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
files = zip_file.namelist()
with zip_file.open(files[0]) as csvfile:
    csvreader = csv.reader(csvfile)
    # _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
    for row in csvreader:
        print(row)
Try this:
import pandas as pd
import requests
from io import BytesIO, StringIO
from zipfile import ZipFile

response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
files = zip_file.namelist()
with zip_file.open(files[0]) as csvfile:
    print(pd.read_csv(csvfile, encoding='utf8', sep=","))
As #Aran-Fey alluded to:
import zipfile
import csv
import io

with open('/path/to/archive.zip', 'rb') as f:  # the archive must be opened in binary mode
    with zipfile.ZipFile(f) as zf:
        csv_filename = zf.namelist()[0]  # see namelist() for the list of files in the archive
        with zf.open(csv_filename) as csv_f:
            csv_f_as_text = io.TextIOWrapper(csv_f)
            reader = csv.reader(csv_f_as_text)
csv.reader (and csv.DictReader) require a file-like object opened in text mode. Normally this is not a problem when simply open(...)ing a file in 'r' mode, since, as the Python 3 docs say, text mode is the default: "The default mode is 'r' (open for reading text, synonym of 'rt')". But if you try 'rt' with ZipFile.open(), you'll see an error that ZipFile.open() requires mode "r" or "w":
with zf.open(csv_filename, 'rt') as csv_f:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: open() requires mode "r" or "w"
That's what io.TextIOWrapper is for -- for wrapping byte streams to be readable as text, decoding them on the fly.
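Putting it together with the download from the question, a rough sketch (assuming the CSV inside the zip is UTF-8; pass whatever encoding actually applies to TextIOWrapper, and the URL below is a placeholder):
import csv
import io
import requests
from io import BytesIO
from zipfile import ZipFile

url = 'http://example.com/data.zip'  # placeholder, use your real URL
response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))

with zip_file.open(zip_file.namelist()[0]) as csv_f:
    # wrap the binary stream so csv.reader receives text, decoded on the fly
    text_f = io.TextIOWrapper(csv_f, encoding='utf-8')
    for row in csv.reader(text_f):
        print(row)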

How to read Arabic text from PDF using Python script

I have code written in Python that reads PDF files and converts them to text files.
The problem occurs when I try to read Arabic text from PDF files. I know the error is in the decoding/encoding process, but I don't know how to fix it.
The system converts the Arabic PDF files, but the resulting text file is empty and this error is displayed:
Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)
Code:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path

print "\n"
folder = check_path("Provide absolute path for the folder: ")

list = []
directory = folder
for root, dirs, files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t = os.path.join(directory, filename)
            list.append(t)

m = len(list)
print (m)
i = 0
while i <= m - 1:
    path = list[i]
    print(path)
    head, tail = os.path.split(path)
    var = "\\"
    tail = tail.replace(".pdf", ".txt")
    name = head + var + tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f = open(name, 'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i = i + 1
You have a couple of problems:
content.encode('utf-8') doesn't do anything by itself. The return value is the encoded content, but you have to assign it to something. Better yet, open the file with an encoding and write Unicode strings to that file; content appears to be Unicode data.
Example (works for both Python 2 and 3):
import io
f = io.open(name,'w',encoding='utf8')
f.write(content)
If you don't close the file properly, you may see no content because the file is never flushed to disk. You have f.close, not f.close(). It's better to use with, which ensures the file is closed when the block exits.
Example:
import io

with io.open(name, 'w', encoding='utf8') as f:
    f.write(content)
In Python 3 you don't need to import and use io.open, but it still works; the built-in open is equivalent. Python 2 needs the io.open form.
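For example, in Python 3 the plain built-in open works the same way:
with open(name, 'w', encoding='utf8') as f:
    f.write(content)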
You can use another library called pdfplumber instead of pyPdf or PyPDF2:
import pdfplumber
import arabic_reshaper
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    my_page = pdf.pages[10]
    thepages = my_page.extract_text()
    reshaped_text = arabic_reshaper.reshape(thepages)
    bidi_text = get_display(reshaped_text)
    print(bidi_text)
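If these libraries are not installed yet, they are usually available on PyPI as pdfplumber, arabic-reshaper and python-bidi (the install names differ slightly from the import names).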

Python _csv Error: line contains NULL byte

This is my code:
import csv
import sys

filepath = sys.argv[1]
csvdata = list(csv.reader(open(filepath)))
How can I fix it?
I saved my Excel file as a CSV and received this error:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
An Excel file is not a CSV file. First export / save the file as CSV.
There are differences between Python versions as to whether the file should be opened as binary or text; this affects how newlines are handled.
In Python 2.x, open as binary: open(filepath, 'rb')
In Python 3.x, don't: open('file.csv', 'r')
The second part I learned from this link about reading in csv files.
For some operating systems (Mac OS for sure) you need to open with the mode 'rU'. See: this link with the same problem specifically on Mac OS.
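On Python 3 the csv docs instead recommend opening the file in text mode with newline=''; a minimal sketch along those lines:
import csv
import sys

filepath = sys.argv[1]
with open(filepath, newline='') as f:  # Python 3: let the csv module handle newlines
    csvdata = list(csv.reader(f))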
Try this (put the actual location of your CSV file):
import csv

with open('c:\pytest.csv', 'rb') as csvfile:
    data = csv.reader(csvfile)
    mylist = list(data)
    print mylist
from tkFileDialog import askopenfilename
import csv

filename = askopenfilename()
with open(filename, 'rb') as csvfile:
    data = csv.reader(csvfile)
    mylist = list(data)
    print mylist

Python Special Characters Encoding

I have a Python script that reads a CSV file and writes an XML file. I have been hitting a wall trying to find out how to read special characters such as ç, á, é, í, etc. The script runs perfectly fine without special characters. This is the script header:
# coding=utf-8
'''
#modified by: Julierme Pinheiro
'''
import os
import sys
import unittest
from unittest import skip
import csv
import uuid
import xml
import xml.dom.minidom as minidom
import owslib
from owslib.iso import *
import pyproj
from decimal import *
import logging
The way I retrieve information from the CSV file is shown below:
# add the title
title = data[1]
titleElement = identificationInfo[0].getElementsByTagName('gmd:title')[0]
titleNode = record.createTextNode(title)
titleElement.childNodes[1].appendChild(titleNode)
print "Title:" + title
Note: if data[1], the second column in the CSV file, contains a special character, as in "Navegação", the script fails (it does not write anything to the XML file).
The way a new XML file is created from a blank template XML is shown below:
# write out the gemini record
filename = '../output/%s.xml' % fileId
with open(filename, 'w') as test_xml:
    test_xml.write(record.toprettyxml(newl="", encoding="utf-8"))
except:
    e = sys.exc_info()[1]
    logging.debug("Import failed for entry %s" % data[0])
    logging.debug("Specific error: %s" % e)
#skip('')

def testOWSMetadataImport(self):
    raw_data = []
    with open('../input/metadata_cartapapel.csv') as csvfile:
        reader = csv.reader(csvfile, dialect='excel')
        for columns in reader:
            raw_data.append(columns)

    md = MD_Metadata(etree.parse('gemini-template.xml'))
    md.identification.topiccategory = ['farming', 'environment']
    print md.identification.topiccategory
    outfile = open('mdtest.xml', 'w')
    # crap, can't update the model and write back out - this is badly needed!!
    outfile.write(md.xml)

if __name__ == "__main__":
    unittest.main()
Could someone help to solve this issue, please?
Thank you in advance for your time.
That's Unicode. The csv module can't read Unicode properly if you are on Python 2.7. In Python 3.x you can pass a utf-8 encoding option when opening the file.
In Python 2 you can decode data[1] from UTF-8 like below:
title = data[1].decode('utf-8')
Some legacy Windows components in English environments might use 'cp1252'. If the above decoding fails, try this:
title = data[1].decode('cp1252')
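A small helper that tries UTF-8 first and falls back to cp1252 (the helper name is my own invention) might look like this:
def decode_cell(value):
    # try UTF-8 first, then fall back to Windows cp1252
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        return value.decode('cp1252')

title = decode_cell(data[1])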
