Problems opening DBF files in Python

I am trying to open and transform several DBF files into dataframes. Most of them worked fine, but for one of the files I receive the error:
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 15: invalid start byte"
I have seen this error in other topics about opening csv and xlsx files, where the proposed solution was to pass encoding='utf-8' when reading the file. Unfortunately I haven't found a solution for DBF files, and I have very limited knowledge of the DBF format.
What I have tried so far:
1)
from dbfread import DBF
import pandas as pd

dbf = DBF('file.DBF')
dbf = pd.DataFrame(dbf)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 8: character maps to <undefined>
2)
from simpledbf import Dbf5
dbf = Dbf5('file.DBF')
dbf = dbf.to_dataframe()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 15: invalid start byte
3)
# this block of code copied from https://gist.github.com/ryan-hill/f90b1c68f60d12baea81
import pysal as ps

def dbf2DF(dbfile, upper=True):  #Reads in DBF files and returns Pandas DF
    db = ps.open(dbfile)  #Pysal to open DBF
    d = {col: db.by_col(col) for col in db.header}  #Convert dbf to dictionary
    #pandasDF = pd.DataFrame(db[:])  #Convert to Pandas DF
    pandasDF = pd.DataFrame(d)  #Convert to Pandas DF
    if upper == True:  #Make columns uppercase if wanted
        pandasDF.columns = map(str.upper, db.header)
    db.close()
    return pandasDF

dfb = dbf2DF('file.DBF')
AttributeError: module 'pysal' has no attribute 'open'
And lastly, if I try to install the dbfpy module, I receive:
SyntaxError: invalid syntax
Any suggestions on how to solve this?

Try using my dbf library:
import dbf
table = dbf.Table('file.DBF')
Print it to see if an encoding is present in the file:
print table # print(table) in Python 3
One of my test tables looks like this:
Table: tempy.dbf
Type: dBase III Plus
Codepage: ascii (plain ol ascii)
Status: DbfStatus.CLOSED
Last updated: 2019-07-26
Record count: 1
Field count: 2
Record length: 31
--Fields--
0) name C(20)
1) desc M
The important line is the Codepage line -- it sounds like that is not properly set for your DBF file. If you know what it should be, you can either open the table with that codepage (temporarily):
table = dbf.Table('file.DBF', codepage='...')
Or you can change it permanently (updates the DBF file) with:
table.open()
table.codepage = dbf.CodePage('cp1252') # for example
table.close()
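To get from there into pandas, something like the following should work -- a minimal sketch, assuming the data turns out to be cp1252 (a common Windows codepage); substitute whatever codepage the file really uses:
import dbf
import pandas as pd

table = dbf.Table('file.DBF', codepage='cp1252')  # 'cp1252' is an assumption
table.open()
rows = [{name: record[name] for name in table.field_names} for record in table]
table.close()
df = pd.DataFrame(rows)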

from simpledbf import Dbf5
dbf2 = Dbf5('/Users/.../TCAT_MUNICIPIOS.dbf', codec='latin')
df2 = dbf2.to_dataframe()
df2.head(3)

Install the dbfread library:
conda install -c conda-forge dbfread
from dbfread import DBF
import pandas as pd

db_in_dbf = DBF('path/database.dbf')  # this line loads the database
df = pd.DataFrame(db_in_dbf)  # this line converts it to a pandas dataframe
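If the codepage byte in the file header is wrong or missing, dbfread can also be told the encoding directly. A hedged variant ('cp1252' is an assumption -- use whatever the data was actually written in):
from dbfread import DBF
import pandas as pd

db_in_dbf = DBF('path/database.dbf', encoding='cp1252')
df = pd.DataFrame(iter(db_in_dbf))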

For anyone else facing this: I had to fix a corrupt .dbf file, so the data came from a .dbf and had to be returned to a .dbf. My particular issue was that dates throughout the file were just very wrong. I tried and failed, with many errors, to crack and reassemble it via several methods before succeeding with the code below:
#Modify dbase3 file to recast null date fields as a default date and
#reimport back into dbase3 file
import collections
import datetime
import dbf as dbf1
from simpledbf import Dbf5
from dbfread import DBF, FieldParser
import pandas as pd
import numpy as np

#Default date to overwrite NaN values
blank_date = datetime.date(1900, 1, 1)

#Read in dbase file from old path and write out to new path
old_path = r"C:\...\ex.dbf"
new_path = r"C:\...\newex.dbf"

#Establish 1st rule for resolving corrupted dates
class MyFieldParser(FieldParser):
    def parse(self, field, data):
        try:
            return FieldParser.parse(self, field, data)
        except ValueError:
            return blank_date

#Collect the original .DBF data while stepping over any errors
table = DBF(old_path, None, True, False, MyFieldParser, collections.OrderedDict,
            False, False, False, 'ignore')

#Grab the header name, old-school variable format, and number of characters/length for each variable
dbfh = Dbf5(old_path, codec='utf-8')
headers = dbfh.fields
hdct = {x[0]: x[1:] for x in headers}
hdct.pop('DeletionFlag')
keys = hdct.keys()

#Position of type and length relative to field name
ftype = 0
characters = 1

#Reformat and join all old-school DBF header fields in the format the dbf library requires
fields = list()
for key in keys:
    ftemp = hdct.get(key)
    k1 = str(key)
    res1 = ftemp[ftype]
    res2 = ftemp[characters]
    if k1 == "decimal_field_name":
        fields.append(k1 + " " + res1 + "(" + str(res2) + ",2)")
    elif res1 == 'N':
        fields.append(k1 + " " + res1 + "(" + str(res2) + ",0)")
    elif res1 == 'D':
        fields.append(k1 + " " + res1)
    elif res1 == 'L':
        fields.append(k1 + " " + res1)
    else:
        fields.append(k1 + " " + res1 + "(" + str(res2) + ")")
addfields = '; '.join(str(f) for f in fields)

#Load the records of the .dbf into a dataframe
df = pd.DataFrame(iter(table))

#Reformat date fields to ensure they are in the correct format
df['DATE_FIELD1'] = df['DATE_FIELD1'].replace(np.nan, blank_date)
df['DATE_FIELD1'] = pd.to_datetime(df['DATE_FIELD1'])

#Eliminate further errors in the dataframe
df = df.fillna('0')

#Drop the added "record index" field from the dataframe (inplace, so it actually applies)
df.set_index('existing_primary_key', inplace=True)

#Initialize defaultdict and convert the dataframe into a .DBF-appendable format
dd = collections.defaultdict(list)
records = df.to_dict('records', into=dd)

#Create the new .DBF file
new_table = dbf1.Table(new_path, addfields)

#Append the dataframe records to the new .DBF file
new_table.open(mode=dbf1.READ_WRITE)
for record in records:
    new_table.append(record)
new_table.close()
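As an untested sanity check, assuming the rewrite above succeeded, the new file can be re-read with the same tolerant parser to confirm the dates survived (DATE_FIELD1 is the placeholder field name from above):
check = DBF(new_path, parserclass=MyFieldParser)
check_df = pd.DataFrame(iter(check))
print(check_df['DATE_FIELD1'].head())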

Related

'utf-8' codec can't decode byte 0x8b

There are several folders (named DT_20180102, DT_20180103, ...) in the ComputedTEsCsv folder. Each DT_... folder contains 498 CSV files. I want to read these into a dictionary and store it in a pickle.
I wrote the code below, but it raises an error:
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
How can I correct this?
import os
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm

# Directory containing Joined Datasets of all companies.
_dir = "/Users/admin/Desktop/TransferEntropyEarningsAnnouncements/SP500Data/ComputedTEsCsv/"
# Create directory names
#dates = []#['DT_20180201','DT_20180202']
dates = [i for i in os.listdir(_dir) if 'DT' in i]  #['DT_20180201','DT_20180202']
# Create/populate dictionary to contain all network data
network_dfs = {}
for _date in dates:
    network_dfs[_date] = {}
load_pickle = False  # Process to read in data is costly. Set to True to read in from pickle file
p_path = "SP500Data/NetworkJoinAll.pickle"  # Save all files here ...
#if load_pickle is not True:
for date in tqdm(dates, total=len(dates), desc='JoiningAllNetworkDates'):
    try:
        base_path = "{0}{1}/".format(_dir, date)
        company_files = os.listdir(base_path)
        if '.ipynb_checkpoints' in company_files:
            company_files.remove('.ipynb_checkpoints')
        if '.rda' in company_files:
            company_files.remove('.rda')
        for i, company_file in enumerate(company_files):
            # Only read in 1st 34 columns with 2hr 10 min periods
            tmp_df = pd.read_csv(base_path + company_file)
            if i == 0:
                network_dfs[date] = tmp_df
            else:
                network_dfs[date] = pd.concat([network_dfs[date], tmp_df], ignore_index=True)
        # Clean data: set any negative TE values to NaN
        # (.loc avoids the silent chained-assignment no-op in the original)
        for col in network_dfs[date].columns[3:]:
            network_dfs[date].loc[network_dfs[date][col] < 0, col] = np.nan
    except FileNotFoundError:
        pass
print('Writing Network Data to {0}'.format(p_path))
with open(p_path, 'wb') as f:
    pickle.dump(network_dfs, f, pickle.HIGHEST_PROTOCOL)
print('Done.')
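An aside, not part of the original post: byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so a file that fails this way may be compressed rather than mis-encoded. A hedged helper that checks for that before falling back to a different text codec (latin-1 here is just an assumption):
import pandas as pd

def read_csv_tolerant(path):
    with open(path, 'rb') as f:
        head = f.read(2)
    if head == b'\x1f\x8b':  # gzip magic number
        return pd.read_csv(path, compression='gzip')
    try:
        return pd.read_csv(path)  # utf-8 by default
    except UnicodeDecodeError:
        return pd.read_csv(path, encoding='latin-1')  # assumed single-byte fallback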

Data not decrypting correctly - data from csv

My input ciphertext from my CSV doesn't seem to be decrypting properly; it decrypts to a random string of bytes. I've checked my key and IV and they are exactly the same as at encryption. It's just the decryption that doesn't seem to work.
I wonder if the way I put my encrypted data into the CSV, or retrieved it, is the issue -- maybe it alters the bytes? If not, I'm stumped. I've been on this issue for days, help!
The program works like this:
User inputs credentials -- encrypt -- generate unique ID and hash values and store in db -- store ciphertext in CSV // user inputs ID -- match in db and fetch encryption key stored with ID -- fetch matching ID ciphertext from CSV, put into pandas dataframe and decrypt ciphertext with key
def decoder():
    from Crypto.Cipher import AES
    import hashlib
    from secrets import token_bytes
    cursor.execute(
        '''
        Select enc_key FROM Login where ID = (?);
        ''',
        (L_ID_entry.get(), ))
    row = cursor.fetchone()
    if row is not None:
        keys = row[0]

    #design padding function for encryption
    def padded_text(data_in):
        while len(data_in) % 16 != 0:
            data_in = data_in + b"0"
        return data_in

    #calling stored key from main file and reverting back to bytes
    key_original = keys
    print(key_original)
    print("Key original above")
    mode = AES.MODE_CBC
    #model
    cipher = AES.new(key_original, mode, IV2.encode('utf8'))
    print(IV2)
    print("IV2 above")
    #padding data
    p4 = padded_text(df1.tobytes())
    p5 = padded_text(df2.tobytes())
    p6 = padded_text(df3.tobytes())
    #decrypting data
    d_fname = cipher.decrypt(p4)
    d_sname = cipher.decrypt(p5)
    d_email = cipher.decrypt(p6)
    print(d_fname)
    print(d_sname)
    print(d_email)

#connecting to db
try:
    conn = sqlite3.connect('login_details.db')
    cursor = conn.cursor()
    print("Connected to SQLite")
except sqlite3.Error as error:
    print("Failure, error: ", error)
finally:
    #downloading txt from dropbox and converting to dataframe to operate on
    import New_user
    import ast
    _, res = client.files_download("/user_details/enc_logins.csv")
    with io.BytesIO(res.content) as csvfile:
        with open("enc_logins.csv", 'rb'):
            df = pd.read_csv(csvfile, names=['ID', 'Fname', 'Sname', 'Email'], encoding='utf-8')
            newdf = df[df['ID'] == L_ID_entry.get()]
            print(newdf)
            df1 = newdf['Fname'].values
            df2 = newdf['Sname'].values
            df3 = newdf['Email'].values
            print(df1)
            print(df2)
            print(df3)
            decoder()
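Not the poster's code, but a sketch of the usual fix: CSV is a text format, so writing raw AES ciphertext bytes into it (and reading them back through pandas) mangles them. Base64-encoding the ciphertext before it goes into the CSV and decoding it after reading keeps the bytes intact; ciphertext here stands for whatever bytes cipher.encrypt() produced:
import base64

ct_text = base64.b64encode(ciphertext).decode('ascii')  # safe to store in a CSV cell
# ... later, after reading the CSV back ...
ct_bytes = base64.b64decode(ct_text)  # the exact original ciphertext bytes
plaintext = cipher.decrypt(ct_bytes)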

Store Gtk.Textbuffer in SQL database. Encoding troubles

I'm working on a note-taking app using Python 2/Gtk3/Glade.
The notes are stored in a MySQL database and displayed in a TextView widget.
I can load/store/display plain text fine. However, I want the ability to add images to the note page and store them in the database, so the data has to be serialised, and I'm having trouble figuring out how to encode/decode the serialised data going in and out of the database. I'm getting unicode start-byte errors. If I were working with files I could just open the file in binary mode, but I'm storing it as a string in a database. I've tried encoding/decoding as UTF-8 and ASCII using bytes() and string.encode() [see the sample code below] and a few other ways, but none work.
I am using this function to add the image to the textview buffer:
def _AddImagetoNode(self, oWidget):
    filenm = None
    seliter = self.GetTreeSelection(self.treeview)
    filenm = self.FileOpenDiag("Select an Image To Insert.", "Image", "*.png,*.jpg,*.bmp")
    if filenm == None:
        return()
    #filenm = "/home/drift/Pictures/a.png"
    buf = self.dataview.get_buffer()
    pixbuf = GdkPixbuf.Pixbuf.new_from_file(filenm)
    #pixbuf.scale_simple(dest_width, dest_height, gtk.gdk.INTERP_BILINEAR)
    buf.insert_pixbuf(buf.get_end_iter(), pixbuf)
    self.dataview.set_buffer(buf)
    self.dataview.show()
This is the function that stores the textview buffer:
def SaveDataView(self):
    global DataViewNode
    global DataViewIsImage
    if len(self.GetProjectName()) == 0:
        return()
    buf = self.dataview.get_buffer()
    format = buf.register_serialize_tagset()
    data2 = buf.serialize(buf, format, buf.get_start_iter(), buf.get_end_iter())
    #convert bytes(data) to string
    data = data2.decode(encoding='UTF-8')  #<< i think my problem is here
    print("save b4 decode >>>>>>:%s" % data2)
    sql = "UPDATE " + self.GetProjectName() + " SET tDataPath=%s WHERE tNodeID=%s"
    val = (data, DataViewNode)
    self.cursor.execute(sql, val)
    self.mariadb_connection.commit()
This is the function that loads the Buffer:
def UpdateDataView(self, nodeid):
    global DataViewNode
    #global DataViewIsFile
    DataViewNode = nodeid
    if self.GetProjectName() != None and DataViewNode != None:
        self.dataview.set_sensitive(True)
    else:
        self.dataview.set_sensitive(False)
        self.dataview.show()
        return()
    buf = self.dataview.get_buffer()
    buf.set_text('')
    enc = self.DbGetNodeData(nodeid)
    #convert string(enc) to bytes
    data = enc.encode(encoding='UTF-8')  #<<< i think my problem is here
    print("update after decode >>>>>>>>>: %s" % data)
    ########### load
    format = buf.register_deserialize_tagset()
    buf.deserialize(buf, format, buf.get_end_iter(), data)
    #buf.set_text(enc)
    self.dataview.set_buffer(buf)
    self.dataview.show()
I'm using mysql.connector to connect to a MariaDB database.
This is the connection string:
self.mariadb_connection = mariadb.connect(user='box', password='box', host='localhost', database='Boxer',charset='utf8')
This is the error I'm getting:
Traceback (most recent call last):
  File "Boxer.py", line 402, in _TreeSelectionChanged
    self.SaveDataView()
  File "Boxer.py", line 334, in SaveDataView
    data = data2.decode(encoding='UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 174: invalid start byte

Traceback (most recent call last):
  File "Boxer.py", line 398, in _DataViewLostFocus
    self.SaveDataView()
  File "Boxer.py", line 334, in SaveDataView
    data = data2.decode(encoding='UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 174: invalid start byte
With this code I can add/edit plain text in the TextView and successfully save/load it, but as soon as I add an image I get the encoding errors. Any help would be appreciated.
Here is a more complete example:
def example(self):
    #retrieve info from first textview
    buf = self.builder.get_object('textbuffer1')
    format = buf.register_serialize_tagset()
    data = buf.serialize(buf, format, buf.get_start_iter(), buf.get_end_iter())
    #run db update to prove it can be inserted into a database
    db = psycopg2.connect(database='silrep_restore3', host='192.168.0.101',
                          user='postgres', password='true',
                          port='5432')
    c = db.cursor()
    c.execute("UPDATE products SET byt = %s WHERE id = 1", (psycopg2.Binary(data),))
    #append info to second textview as a proof of concept
    c.execute("SELECT byt FROM products WHERE id = 1")
    data = c.fetchone()[0]
    buf = self.builder.get_object('textbuffer2')
    format = buf.register_deserialize_tagset()
    buf.deserialize(buf, format, buf.get_end_iter(), data)
Since you are using MySQL, I recommend reading this article about inserting and retrieving binary data like you are.
For my example I used a bytea column. In MySQL the equivalent would be a BLOB or BINARY type.
P.S. Sorry for not having a complete MySQL example in my answer. I would have posted a comment, but comments are pathetic for proper formatting.
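In the same spirit, a hedged translation of the example above to mysql.connector, untested and using a BLOB column with the question's schema; the table name project and variable node_id are hypothetical:
import mysql.connector as mariadb

conn = mariadb.connect(user='box', password='box', host='localhost', database='Boxer')
cur = conn.cursor()
cur.execute("UPDATE project SET tDataPath=%s WHERE tNodeID=%s", (bytes(data), node_id))
conn.commit()
cur.execute("SELECT tDataPath FROM project WHERE tNodeID=%s", (node_id,))
data = cur.fetchone()[0]  # returned as bytes, ready for buf.deserialize()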
Got it working, thanks to theGtknerd -- your answer was the key. For anyone else having trouble with this, I ended up using a BLOB type for the MySQL column I'm working with. I tried BINARY (it returned malformed serialized data) and VARBINARY (it wouldn't even let me create the table), so I ended up with LONGBLOB.
Here is the working code for anyone that needs it.
def UpdateDataView(self, nodeid):
    global DataViewNode
    #global DataViewIsFile
    DataViewNode = nodeid
    if self.GetProjectName() != None and DataViewNode != None:
        self.dataview.set_sensitive(True)
    else:
        self.dataview.set_sensitive(False)
        self.dataview.show()
        return()
    buf = self.dataview.get_buffer()
    buf.set_text('')
    data = self.DbGetNodeData(nodeid)
    if data == '':
        return()
    format = buf.register_deserialize_tagset()
    buf.deserialize(buf, format, buf.get_end_iter(), data)
    self.dataview.set_buffer(buf)
    self.dataview.show()
def SaveDataView(self):
    global DataViewNode
    global DataViewIsImage
    if len(self.GetProjectName()) == 0:
        return()
    buf = self.dataview.get_buffer()
    enc = buf.get_text(buf.get_start_iter(), buf.get_end_iter(), False)
    self.AddData2Db(DataViewNode, enc)
    format = buf.register_serialize_tagset()
    data = buf.serialize(buf, format, buf.get_start_iter(), buf.get_end_iter())
    sql = "UPDATE " + self.GetProjectName() + " SET tDataPath=%s WHERE tNodeID=%s"
    val = (data, DataViewNode)
    self.cursor.execute(sql, val)
    self.mariadb_connection.commit()
And I'm using this to create the table:
sql = "CREATE TABLE %s (tParentNodeID TEXT,tNodeTxt TEXT,tNodeID TEXT,tDataPath LONGBLOB)" %pName
self.cursor.execute(sql)
self.mariadb_connection.commit()

Iteratively writing row values to Excel in Python -- what's wrong with my code?

I want to write each root folder's title to Excel column A. My code:
from openpyxl import Workbook
import os

path = "C:/path_to_folder"
#word = '<option value="1.2.0-b.1" key="#SSPVersion#"/>'
os.chdir(path)  #change directory to application notes folder
titlelist = []
for root, dirs, files in os.walk(path):
    title = str(root.split("/")[-1])
    titlelist.append(title)

wb = Workbook()
ws = wb.active
r = 2
for t in titlelist:
    ws.cell(row=r, column=1).value = str(t)
    r += 1
wb.save("row_creation_loop.xlsx")
This does not work; it always shows the error:
Traceback (most recent call last):
  ws[column_cell + str(row + 2)] = str(i)
  self[key].value = value
  self._bind_value(value)
  value = self.check_string(value)
  value = unicode(value, self.encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 17: invalid start byte
Just posting some thoughts here: this code (a copy of yours, without reading in the titles) works just fine:
from openpyxl import Workbook

titlelist = ["title1"]
wb = Workbook()
ws = wb.active
for ind, t in enumerate(titlelist):
    ws.cell(row=ind + 2, column=1).value = str(t)
wb.save("row_creation_loop.xlsx")
So the issue is your titlelist, which contains byte strings that can't be decoded as UTF-8. We need to fix that, probably with an explicit decode and encode.
Share that list with us.
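One guess while waiting on that list: byte 0x92 is a curly apostrophe in Windows-1252, which suggests the folder names come back from os.walk as cp1252 byte strings (this looks like Python 2, judging by the unicode() call in the traceback). Decoding them explicitly before handing them to openpyxl may resolve it -- a sketch, with cp1252 as the assumed source encoding:
title = root.split(os.sep)[-1]
if isinstance(title, bytes):  # Python 2 str is bytes
    title = title.decode('cp1252', errors='replace')
titlelist.append(title)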

Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte

I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked a bit to support Python 3.4); here is the full json-to-csv.py file.
The conversion works if I set
#Base Condition
else:
    reduced_item[str(key)] = (str(value)).encode('utf8', 'ignore')
and
fp = open(json_file_path, 'r', encoding='utf-8')
but when I import the CSV into MS Excel I see mangled cyrillic characters, for example \xe0\xf1; English text is fine.
I experimented with setting encode('cp1251','ignore'), but then I get the error:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
import sys
import json
import csv

##
# This function converts an item like
# {
#   "item_1":"value_11",
#   "item_2":"value_12",
#   "item_3":"value_13",
#   "item_4":["sub_value_14", "sub_value_15"],
#   "item_5":{
#       "sub_item_1":"sub_item_value_11",
#       "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
#   }
# }
# To
# {
#   "node_item_1":"value_11",
#   "node_item_2":"value_12",
#   "node_item_3":"value_13",
#   "node_item_4_0":"sub_value_14",
#   "node_item_4_1":"sub_value_15",
#   "node_item_5_sub_item_1":"sub_item_value_11",
#   "node_item_5_sub_item_2_0":"sub_item_value_12",
#   "node_item_5_sub_item_2_0":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item

    #Reduction Condition 1
    if type(value) is list:
        i = 0
        for sub_item in value:
            reduce_item(key + '_' + str(i), sub_item)
            i = i + 1
    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key + '_' + str(sub_key), value[sub_key])
    #Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251', 'ignore')

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        #Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]

        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)

        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)
            header += reduced_item.keys()
            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

        with open(csv_file_path, 'wt+') as f:  #wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)

        print("Just completed writing csv file with %d columns" % len(header))
How can I convert the cyrillic correctly and also skip bad characters?
You need to know which cyrillic encoding the file you are opening uses.
For example, this is enough in Python 3:
with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
structure = json.loads(data)
In Python 3 the data variable is then automatically utf-8. In Python 2 there might be a problem feeding the input to json.
Also try printing a line in the Python interpreter and see if the symbols look right. Without the input file it is hard to tell if everything is right. And are you sure it is a Python problem and not an Excel-related one? Did you try opening the file in Notepad++ or a similar encoding-respecting editor?
The most important thing when working with encodings is checking that the input and the output are right. I would suggest looking here.
Maybe you could use chardet to detect the file's encoding:
import chardet
import json

File = 'arq.GeoJson'
enc = chardet.detect(open(File, 'rb').read())['encoding']
with open(File, 'r', encoding=enc) as f:
    data = json.load(f)
This avoids having to guess the encoding.
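One more hedged option, beyond the answers above: Excel only detects UTF-8 when the file starts with a BOM, so writing the CSV as utf-8-sig often fixes mangled cyrillic without transcoding to cp1251 at all. This assumes reduce_item stores plain str(value) rather than encoded bytes:
with open(csv_file_path, 'wt+', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
    writer.writeheader()
    for row in processed_data:
        writer.writerow(row)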
