How to obtain the field type using dbfpy? - python

I have some dbf files that I want to add new fields to. To do so, I'm using dbfpy to open the original dbf, copy all fields (or the ones I want to keep) and records, and then create a new file with those fields plus the new ones I want. All is working great, except for one minor detail: I can't manage to keep the original fields' types, since I don't know how to obtain them. What I'm doing is creating all the fields in the new file as "C" (character), which works for what I need right now but might be an issue eventually.
The real problem is that there is no documentation available. I searched through the package files looking for examples, but couldn't find an answer to this question (it might just be that I couldn't find it, given how green I still am with Python... I'm definitely not an expert).
An example of the code:
from dbfpy import dbf
import sys

org_db_file = str(sys.argv[1])
org_db = dbf.Dbf(org_db_file, new=False)
new_db_file = str(sys.argv[2])
new_db = dbf.Dbf(new_db_file, new=True)

# Obtain original field names:
fldnames = []
fldsize = {}
for name in org_db.fieldNames:
    fldnames.append(name)
    fldsize[name] = 0

# Cycle thru table entries:
for rec in org_db:
    # Cycle thru columns to obtain fields' name and value:
    for name in fldnames:
        value = str(rec[name])
        if len(value) > fldsize[name]:
            fldsize[name] = len(value)

# Copy original fields to new table:
for name in fldnames:
    new_db.addField((name, "C", fldsize[name]))

# Add new fields:
new_fieldname = "some_name"
new_db.addField((new_fieldname, "C", 2))

# Copy original entries and store new values:
for rec in org_db:
    # Create new record instance for new table:
    new_rec = new_db.newRecord()
    # Populate fields:
    for field in fldnames:
        new_rec[field] = rec[field]
    # Store value of new field for record i:
    new_rec[new_fieldname] = "some_value"
    new_rec.store()
new_db.close()
Thanks in advance for your time.
Cheers.

I don't have any experience with dbfpy beyond the fact that when I first went looking several years ago, it (and several others) did not meet my needs, so I wrote my own.
Here is how you would accomplish your task using it:
import dbf
import sys

org_db_file = sys.argv[1]
org_db = dbf.Table(org_db_file)
new_db_file = sys.argv[2]
# postpone until we have the field names...
# new_db = dbf.Dbf(new_db_file, new=True)

# Obtain original field list:
fields = org_db.field_names
for field in fields[:]:  # cycle through a separate list
    if field == "something we don't like":
        fields.remove(field)

# now get definitions for the fields we keep
field_defs = org_db.structure(fields)

# Add new fields:
new_fieldname = "some_name"
field_defs.append(new_fieldname + " C(2)")

# now create the new table
new_db = org_db.new(new_db_file, field_specs=field_defs)

# open both tables
with dbf.Tables(org_db, new_db):
    # Copy original entries and store new values:
    for rec in org_db:
        # Create new record instance for new table:
        new_db.append()
        # Populate fields (only the ones kept from the original table):
        with new_db.last_record as new_rec:
            for field in fields:
                new_rec[field] = rec[field]
            # Store value of new field for record i:
            new_rec[new_fieldname] = "some_value"
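For the title question itself (reading the field types with dbfpy rather than switching libraries), the field definitions appear to be exposed on the Dbf object. A minimal sketch, assuming dbfpy 2.x, where each entry of fieldDefs carries name, typeCode, length and decimalCount attributes (worth verifying against your installed version); the file name is a placeholder:
from dbfpy import dbf

org_db = dbf.Dbf("some_file.dbf", new=False)
for fld in org_db.fieldDefs:
    # e.g. NAME C 25 0 -- these values can be fed straight back into addField()
    print(fld.name, fld.typeCode, fld.length, fld.decimalCount)
org_db.close()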

Related

Python unittest: write, read, compare does not work

I'm using Python unittest and SQLAlchemy to test the data models that store a WTForms form in MariaDB.
The test should create a dataset, write this dataset to the db, read it back, and compare whether the original dataset is the same as the stored data.
So the partial test looks like this:
#set data
myForm = NiceForm()
myForm.name = "Ben"
#write data
db.session.add(myForm)
db.session.commit()
#read data
loadedForms = NiceForm.query.all()
#check that only one entry is in db
self.assertEqual(len(loadedForms), 1)
#compare stored data with dataset
self.assertIn(myForm, loadedForms)
The test seems to work fine. Now I tried to find out whether the test fails if dataset != stored data. So I changed the dataset before comparing it, like this:
#set data
myForm = NiceForm()
myForm.name = "Ben"
#write data
db.session.add(myForm)
db.session.commit()
#read data
loadedForms = NiceForm.query.all()
#modify dataset
myForm.name = "Foo"
#show content of both
print(myForm.name)
print(loadedForms[0].name)
#check that only one entry is in db
self.assertEqual(len(loadedForms), 1)
#compare stored data with dataset
self.assertIn(myForm, loadedForms)
This test still passed. Why? I printed the contents of myForm.name and loadedForms[0].name; both were set to Foo. That is the reason why self.assertIn(myForm, loadedForms) passed the test, but I don't understand:
Why is the content of loadedForms changed, when Foo was only applied to myForm?
The row identity of myForm does not change when you change one of its values.
Row numbers have no meaning in a table, but to make the issue clear I will still use them.
Row 153 has 2 fields. Field name = "Ben" and field homeruns = 3.
Now we change the home runs (Ben has hit a home run):
Row 153 has 2 fields. Field name = "Ben" and field homeruns = 4.
It is still row 153, so your assertIn will still return True, even though one of the values in the row has changed. You are only testing identity.
If it didn't work that way, changing a field in a table row would have to be saved as an insert into the table instead of an update to the row. That would not be correct, of course: how many Bens do we have? One. And he has 4 home runs, not 3 or 4 depending on which record you look at.
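If the goal is for the test to fail when the stored data differs from what was intended, compare field values against the expected literals instead of asserting membership of the same object. A minimal sketch, assuming the same NiceForm model and db fixture from the question, inside the same test method:
#set data
myForm = NiceForm()
myForm.name = "Ben"
#write data
db.session.add(myForm)
db.session.commit()
#read data
loadedForms = NiceForm.query.all()
#check that only one entry is in db
self.assertEqual(len(loadedForms), 1)
#compare the stored value against the expected literal,
#not against the myForm object itself
self.assertEqual(loadedForms[0].name, "Ben")
Changing myForm.name to "Foo" before the last assertion now makes the test fail, because the comparison no longer involves the object whose value was changed.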

Exporting data into CSV file from Flask-SQLAlchemy

I'm looking to generate (export) a CSV from a Flask-SQLAlchemy app I'm developing. But I'm getting some unexpected outcomes in my CSV, i.e. instead of the actual data from the MySQL DB table being populated in the CSV file, I get the declarative class model entries (placeholders??). The issue could be the way I structured the query, or even the entire function.
Oddly enough, judging from the CSV output (pic), it would seem I'm on the right track since the row/column count is the same as the DB table, but the actual data is just not populated. I'm fairly new to SQLAlchemy ORM and Flask, so I'm looking for some guidance here to pull through. Constructive feedback appreciated.
# class declaration with DB object (divo)
class pearl(divo.Model):
    __tablename__ = 'users'
    work_id = divo.Column(divo.Integer, primary_key=True)
    user_fname = divo.Column(divo.String(length=255))
    user_lname = divo.Column(divo.String(length=255))
    user_category = divo.Column(divo.String(length=255))
    user_status = divo.Column(divo.String(length=1))
    login_id = divo.Column(divo.String(length=255))
    login_passwd = divo.Column(divo.String(length=255))

# user report function
@app.route("/reports/users")
def users_report():
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
        x15 = pearl.query.all()
        for i in x15:
            # x16 = tuple(x15)
            csv_out = csv.writer(s_key)
            csv_out.writerow(x15)
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))  # return redirect(url_for('other_tasks'))

# csv outcome (see attached pic)
instead of the actual data from the MySQL DB table being populated in the CSV file, I get the declarative class model entries (placeholders??)
Each object in the list
x15 = pearl.query.all()
represents a row in your users table.
What you're seeing in the spreadsheet are not placeholders, but string representations of each row object (see object.__repr__).
You can get the value of a column for a particular row object via the column-name attribute, for example:
x15[0].work_id # Assumes there is at least one row object in x15
What you could do instead is something like this:
with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
    x15 = divo.session.query(pearl.work_id, pearl.user_fname)  # Add columns to query as needed
    csv_out = csv.writer(s_key)
    for i in x15:
        csv_out.writerow(i)
i in the code above is a tuple of the form:
('work_id value', 'user_fname value')
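Putting it together, the whole route might look like the sketch below; it assumes the same app, divo, pearl, and reports_landing endpoint from the question, creates the writer once, and adds a header row so the CSV is self-describing:
import csv
from flask import flash, redirect, url_for

@app.route("/reports/users")
def users_report():
    # query only the columns needed, so each result row is a plain tuple
    rows = divo.session.query(pearl.work_id, pearl.user_fname, pearl.user_lname)
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w', newline='') as s_key:
        csv_out = csv.writer(s_key)                                 # create the writer once
        csv_out.writerow(["work_id", "user_fname", "user_lname"])   # header row
        for row in rows:
            csv_out.writerow(row)
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))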

What's the most efficient way to add (new) documents from a Dataframe to MongoDB?

In this use case, I am trying to add documents to a MongoDB collection using pymongo that are retrieved from various RSS news feeds based on the date (not datetime), title, and article summary in dataframe format (the date being the index to the dataframe).
When I store the dataframe to the database, they are stored with the schema of _id, date, title, summary which is fine.
So what I'm trying to do is only upload those rows in the dataframe which haven't been stored as documents in the collection. There are a few ways I've tried:
Get the last document in the database and compare it to the dataframe. Create a new DF which excludes all previous rows plus the row it's being compared to. This should work; however, it is still uploading roughly 20% of the rows which have been previously stored, and I have no idea why.
Store the entire dataframe, then aggregate the collection and remove the duplicates: sounds good in theory; however, all of the examples of doing this are in JS and not Python, so I haven't been able to get this to work.
Create a unique index on the title: again, this should work in theory, but I haven't gotten it to work.
One thing that I don't want to do is to query the entire collection and store it as a DF, concatenate them, drop the duplicates, delete the collection, and re-create it from the new DF. It wouldn't be an issue now since I'm working with 30 or so documents, but when I'm working with multiple collections and millions of documents, well... not very efficient at all.
Anyone have any suggestions I can look into / research / code examples?
Here is the code I'm working with now:
Download RSS Feed
def getSymbolNews(self, symbol):
    self.symbol = symbol
    self.dbName = 'db_' + self.symbol
    self.columnName = 'col_News'
    self.topics = ['$' + self.symbol]
    self.sa = getNews().parseNews(fn.SeekingAlpha(topics=self.topics))
    self.yfin = getNews().parseNews(fn.Yahoo(topics=self.topics))
    self.wb_news = getNews().getWebullNews(self.symbol)
    self.df = pd.concat([self.sa, self.yfin, self.wb_news], axis=0, ignore_index=False)
    self.df.drop_duplicates(inplace=True)
    self.df.sort_index(ascending=True, inplace=True)
    del self.symbol, self.topics, self.sa, self.yfin, self.wb_news
    getNews().uploadRecords(self.dbName, self.columnName, self.df)
    return self.df
Upload to Collection:
def uploadRecords(self, dbName, columnName, data):
    self.data = data
    self.dbName = dbName
    self.columnName = columnName
    self.data.reset_index(inplace=True)
    self.data.rename(columns={'index': 'Date'}, inplace=True)
    mongoFunctions.insertRecords(self.dbName, self.columnName, self.data)
    del self.data
    gc.collect()
    return
PyMongo function to upload:
def insertRecords(dbName: str, collectionName: str, data: object):
    """Inserts a pandas dataframe object into a MongoDB collection (table)

    Args:
        dbName (str): Database name
        collectionName (str): Collection name
        data (object): Pandas dataframe object
    """
    collection = getCollection(dbName, collectionName)
    query = queryAllRecords(dbName, collectionName)
    if query.shape == (0, 0):
        record = data.to_dict(orient="records")
        collection.insert(record)
    else:
        query.drop(["_id"], axis=1, inplace=True)
        if query.equals(data):
            return
        else:
            df_temp = pd.concat([query, data]).drop_duplicates(keep=False)
            records = df_temp.to_dict(orient="records")
            collection.insert_many(records)
    return
I'd be minded to take an MD5 hash of each document and store that as the _id; then you can just use insert_many() with ordered=False to insert any items that aren't duplicates. You can run this as often as you like and only new items will be added. Bear in mind that if any field is even slightly changed, a new item is added; if this isn't the behaviour you want, then tweak what you pass to md5().
The code ends up being fairly straightforward:
from pymongo import MongoClient
from pymongo.errors import BulkWriteError
import feedparser
from hashlib import md5
from json import dumps

db = MongoClient()['mydatabase']
entries = feedparser.parse("http://feeds.bbci.co.uk/news/world/rss.xml")['entries']

for item in entries:
    item['_id'] = md5(dumps(item).encode("utf-8")).hexdigest()

try:
    db.news.insert_many(entries, ordered=False)
except BulkWriteError:
    pass
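If you would rather keep MongoDB's own _id values and pursue option 3 from the question (a unique index on the title), a sketch along these lines should also work; the database and collection names are placeholders, and the small dataframe stands in for the real news dataframe:
import pandas as pd
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

collection = MongoClient()['mydatabase']['col_News']

# One-time setup: the unique index makes MongoDB reject repeated titles.
collection.create_index("title", unique=True)

# Example dataframe shaped like the one in the question (Date index, title, summary).
df = pd.DataFrame(
    {"title": ["Headline A", "Headline B"], "summary": ["...", "..."]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02"]).rename("Date"),
)

records = df.reset_index().to_dict(orient="records")
try:
    # ordered=False keeps inserting past duplicate-key errors,
    # so only the genuinely new rows are added on each run.
    collection.insert_many(records, ordered=False)
except BulkWriteError:
    pass  # duplicates were rejected by the unique index; new rows went in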

Is the only way to add a column in PyTables to create a new table and copy?

I am searching for a persistent data storage solution that can handle heterogeneous data stored on disk. PyTables seems like an obvious choice, but the only information I can find on how to append new columns is a tutorial example. The tutorial has the user create a new table with the added column, copy the old table into the new table, and finally delete the old table. This seems like a huge pain. Is this how it has to be done?
If so, what are better alternatives for storing mixed data on disk that can accommodate new columns with relative ease? I have looked at sqlite3 as well and the column options seem rather limited there, too.
Yes, you must create a new table and copy the original data. This is because Tables are a dense format. That gives them huge performance benefits, but one of the costs is that adding new columns is somewhat expensive.
Thanks to Anthony Scopatz for his answer.
Searching the web and GitHub, I found examples of how to add columns in PyTables:
Example showing how to add a column in PyTables
The original version, "Example showing how to add a column in PyTables", is somewhat difficult to migrate.
The revised version isolates the copying logic, but some of its terms are deprecated and it has a minor error when adding new columns.
Based on their contributions, I updated the code for adding a new column in PyTables (Python 3.6, Windows).
# -*- coding: utf-8 -*-
"""
PyTables, append a column
"""
import tables as tb

pth = 'd:/download/'


# Describe a water class
class Water(tb.IsDescription):
    waterbody_name = tb.StringCol(16, pos=1)   # 16-character string
    lati = tb.Int32Col(pos=2)                  # integer
    longi = tb.Int32Col(pos=3)                 # integer
    airpressure = tb.Float32Col(pos=4)         # float (single-precision)
    temperature = tb.Float64Col(pos=5)         # double (double-precision)


# Open a file in "w"rite mode
# if you don't include pth, the file is created next to the code.
fileh = tb.open_file(pth + "myadd-column.h5", mode="w")

# Create a table in the root directory and append data...
tableroot = fileh.create_table(fileh.root, 'root_table', Water,
                               "A table at root", tb.Filters(1))
tableroot.append([("Mediterranean", 10, 0, 10*10, 10**2),
                  ("Mediterranean", 11, -1, 11*11, 11**2),
                  ("Adriatic", 12, -2, 12*12, 12**2)])
print("\nContents of the table in root:\n",
      fileh.root.root_table[:])

# Create a new table in the newgroup group and append several rows
group = fileh.create_group(fileh.root, "newgroup")
table = fileh.create_table(group, 'original_table', Water, "A table", tb.Filters(1))
table.append([("Atlantic", 10, 0, 10*10, 10**2),
              ("Pacific", 11, -1, 11*11, 11**2),
              ("Atlantic", 12, -2, 12*12, 12**2)])
print("\nContents of the original table in newgroup:\n",
      fileh.root.newgroup.original_table[:])

# close the file
fileh.close()

#%% Open it again in append mode
fileh = tb.open_file(pth + "myadd-column.h5", "a")
group = fileh.root.newgroup
table = group.original_table


# Isolated copying logic
def append_column(table, group, name, column):
    """Returns a copy of `table` with an empty `column` appended named `name`."""
    description = table.description._v_colObjects.copy()
    description[name] = column
    copy = tb.Table(group, table.name + "_copy", description)
    # Copy the user attributes
    table.attrs._f_copy(copy)
    # Fill the rows of the new table with default values
    for i in range(table.nrows):
        copy.row.append()
    # Flush the rows to disk
    copy.flush()
    # Copy the columns of the source table to the destination
    for col in table.colnames:
        getattr(copy.cols, col)[:] = getattr(table.cols, col)[:]
    # choose whether to remove the original table
    # table.remove()
    return copy


# Get a description of the table in dictionary format
descr = table.description._v_colObjects
descr2 = descr.copy()
# Add a column to the description
descr2["hot"] = tb.BoolCol(dflt=False)
# append original and added data to table2
table2 = append_column(table, group, "hot", tb.BoolCol(dflt=False))
# Fill the new column
table2.cols.hot[:] = [row["temperature"] > 11**2 for row in table]
# Move table2 to table; you can use the same name as the original one.
table2.move('/newgroup', 'new_table')

# Print the new table
print("\nContents of the table with column added:\n",
      fileh.root.newgroup.new_table[:])

# Finally, close the file
fileh.close()
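A quick way to confirm that the column really was added with the intended type is to reopen the file and inspect the table's column metadata; a small sketch, assuming the file produced by the script above:
import tables as tb

with tb.open_file('d:/download/myadd-column.h5', 'r') as fileh:
    tbl = fileh.root.newgroup.new_table
    print(tbl.colnames)   # now includes 'hot'
    print(tbl.coltypes)   # column name -> PyTables type, e.g. 'hot': 'bool'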

Populate Unique ID field after Sorting, Python

I am trying to create a new unique id field in an access table. I already have one field called SITE_ID_FD, but it is historical. The format of the unique value in that field isn't what our current format is, so I am creating a new field with the new format.
Old Format = M001, M002, K003, K004, S005, M006, etc
New format = 12001, 12002, 12003, 12004, 12005, 12006, etc
I wrote the following script:
fc = r"Z:\test.gdb\testfc"
x = 12001

cursor = arcpy.UpdateCursor(fc)
for row in cursor:
    row.setValue("SITE_ID", x)
    cursor.updateRow(row)
    x += 1
This works fine, but it populates the new id field based on the default sorting by ObjectID. I need to sort on 2 fields first and then populate the new id field based on that sorting (I want to sort by a field called SITE and then by the old id field SITE_ID_FD).
I tried manually sorting the 2 fields in hopes that Python would honor the sort, but it doesn't. I'm not sure how to do this in Python. Can anyone suggest a method?
A possible solution is available when you create your update cursor: you can specify the fields by which you wish the cursor to be sorted (sorry for my English..). They explain this in the documentation: http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//000v0000003m000000
so it goes like this:
UpdateCursor(dataset, {where_clause}, {spatial_reference}, {fields}, {sort_fields})
You are interested only in the sort_fields, so assuming that your code works well on a sorted table and that you want the table ordered ascending, the second part of your code should look like this:
fc = r"Z:\test.gdb\testfc"
x = 12001

cursor = arcpy.UpdateCursor(fc, "", "", "", "SITE A; SITE_ID_FD A")
# if you want to sort descending you need to write it with a D
# >> cursor = arcpy.UpdateCursor(fc, "", "", "", "SITE D; SITE_ID_FD D")
for row in cursor:
    row.setValue("SITE_ID", x)
    cursor.updateRow(row)
    x += 1
I hope this helps.
Added a link to the arcpy docs in a comment, but from what I can tell, this will create a new, sorted dataset:
import arcpy
from arcpy import env

env.workspace = r"z:\test.gdb"
arcpy.Sort_management("testfc", "testfc_sort", [["SITE", "ASCENDING"],
                                                ["SITE_ID_FD", "ASCENDING"]])
And this will, on the sorted dataset, do what you want:
fc = r"Z:\test.gdb\testfc_sort"
x = 12001

cursor = arcpy.UpdateCursor(fc)
for row in cursor:
    row.setValue("SITE_ID", x)
    cursor.updateRow(row)
    x += 1
I'm assuming there's some way to just copy the sorted/modified dataset back over the original, so it's all good?
I'll admit, I don't use arcpy, and the docs could be a lot more explicit.
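Another option, not covered in the answers above, is the newer arcpy.da.UpdateCursor (ArcGIS 10.1+), whose sql_clause parameter accepts an ORDER BY postfix so the rows come back already sorted and no temporary sorted copy is needed. A sketch under the assumption that the data lives in a geodatabase (ORDER BY is not supported for shapefiles or dBASE tables, so verify it works for your data source):
import arcpy

fc = r"Z:\test.gdb\testfc"
x = 12001

# sql_clause is a (prefix, postfix) pair; the postfix is appended after the WHERE clause.
with arcpy.da.UpdateCursor(fc, ["SITE_ID"],
                           sql_clause=(None, "ORDER BY SITE, SITE_ID_FD")) as cursor:
    for row in cursor:
        row[0] = x
        cursor.updateRow(row)
        x += 1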
