Exporting data into a CSV file from Flask-SQLAlchemy - Python

I'm looking to generate (export) a CSV file from a Flask-SQLAlchemy app I'm developing, but I'm getting unexpected output in the CSV: instead of the actual data from the MySQL DB table, the file is populated with the declarative class model entries (placeholders??). The issue could be the way I structured the query, or even the entire function.
Oddly enough, judging from the CSV output (pic), it would seem I'm on the right track, since the row/column count matches the DB table, but the actual data just isn't populated. I'm fairly new to the SQLAlchemy ORM and Flask, so I'm looking for some guidance here to pull through. Constructive feedback appreciated.
# class declaration with DB object (divo)
class pearl(divo.Model):
    __tablename__ = 'users'

    work_id = divo.Column(divo.Integer, primary_key=True)
    user_fname = divo.Column(divo.String(length=255))
    user_lname = divo.Column(divo.String(length=255))
    user_category = divo.Column(divo.String(length=255))
    user_status = divo.Column(divo.String(length=1))
    login_id = divo.Column(divo.String(length=255))
    login_passwd = divo.Column(divo.String(length=255))

# user report function
@app.route("/reports/users")
def users_report():
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
        x15 = pearl.query.all()
        for i in x15:
            # x16 = tuple(x15)
            csv_out = csv.writer(s_key)
            csv_out.writerow(x15)
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))  # return redirect(url_for('other_tasks'))
#csv outcome (see attached pic)

instead of the actual data from the MySQL DB table, the file is populated with the declarative class model entries (placeholders??)
Each object in the list
x15 = pearl.query.all()
represents a row in your users table.
What you're seeing in the spreadsheet are not placeholders, but string representations of each row object (see object.__repr__).
You could get the value of a column for a particular row object by the column name attribute, for example:
x15[0].work_id # Assumes there is at least one row object in x15
What you could do instead is something like this:
with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
    x15 = divo.session.query(pearl.work_id, pearl.user_fname)  # Add columns to the query as needed
    csv_out = csv.writer(s_key)
    for i in x15:
        csv_out.writerow(i)
i in the code above is a tuple of the form:
('work_id value', 'user_fname value')
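For completeness, here is a slightly fuller sketch along the same lines (not part of the original answer; it assumes the same pearl model, divo session and Flask helpers shown in the question, and adds a header row plus newline='' so Windows doesn't write blank lines between rows):
import csv

@app.route("/reports/users")
def users_report():
    # Query only the columns that belong in the report
    columns = [pearl.work_id, pearl.user_fname, pearl.user_lname,
               pearl.user_category, pearl.user_status]
    rows = divo.session.query(*columns)

    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w', newline='') as s_key:
        csv_out = csv.writer(s_key)
        csv_out.writerow([c.key for c in columns])  # header row with the column names
        for row in rows:
            csv_out.writerow(row)  # each row is a tuple of plain column values

    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))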

Related

Dynamically Adding a Column to a SQLAlchemy Table

I have a SQLAlchemy Table which is defined as:
class Student(db.Model):
    id = db.Column(db.Integer(), primary_key=True)
    name = db.Column(db.Text(), nullable=False)
I am loading this table with data from a .csv file, and I may have columns which are not defined statically in the Student class. Let's say while loading the file, I find a new column that is not in the database, so I need to add it to the Student class.
So far I have tried this:
from alembic.migration import MigrationContext
from alembic.operations import Operations
from sqlalchemy import Column, Text
from sqlalchemy.exc import OperationalError

op = Operations(MigrationContext.configure(db.engine.connect()))
return_col_list = []
for col_name in headers_list:
    # If the column is not already in the database, then create it.
    if col_name not in COLUMN_HEADER_NAMES:
        lower_col_name = col_name.lower()
        try:
            op.add_column('students', Column(lower_col_name, Text))        # Add the column to the database table.
            Student.__table__.append_column(Column(lower_col_name, Text))  # Append the column to the table metadata.
            return_col_list.append(col_name)
        except OperationalError:
            # The column already exists in the table.
            pass
return return_col_list
Then when I loop over each row of my .csv file, I do this to set the data to the attribute.
setattr(new_student, col, val)
However, when I set the data on a given Student object, the data is not reflected in the database. It appears there is no "link" between the column I made and the SQLite table itself. Does anyone know of a better way to do this?
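One way to work around the missing ORM mapping, sketched here as an untested illustration rather than a confirmed answer: after the column has been added as above, write the dynamic values through SQLAlchemy Core (the Table object already knows about the appended column) instead of setattr on the model. lower_col_name, val and new_student are reused from the question.
students = Student.__table__  # Core Table; append_column() above made it aware of the new column

# Insert a row that includes the dynamic column, bypassing the stale ORM mapping
db.session.execute(
    students.insert().values(name="Some Student", **{lower_col_name: val})
)

# Or update an existing row's dynamic column
db.session.execute(
    students.update()
    .where(students.c.id == new_student.id)
    .values(**{lower_col_name: val})
)
db.session.commit()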

Python unittest: write, read, compare does not work

I'm using Python unittest and SQLAlchemy to test the data models that store a WTForms form in MariaDB.
The test should create a dataset, write this dataset to the DB, read the set back, and compare whether the original dataset is the same as the stored data.
So the partial test looks like this:
# set data
myForm = NiceForm()
myForm.name = "Ben"
# write data
db.session.add(myForm)
db.session.commit()
# read data
loadedForms = NiceForm.query.all()
# check that only one entry is in the db
self.assertEqual(len(loadedForms), 1)
# compare stored data with the dataset
self.assertIn(myForm, loadedForms)
The test seems to work fine. Next I tried to find out whether the test fails when dataset != stored data, so I changed the dataset before comparing it, like this:
# set data
myForm = NiceForm()
myForm.name = "Ben"
# write data
db.session.add(myForm)
db.session.commit()
# read data
loadedForms = NiceForm.query.all()
# modify dataset
myForm.name = "Foo"
# show content of both
print(myForm.name)
print(loadedForms[0].name)
# check that only one entry is in the db
self.assertEqual(len(loadedForms), 1)
# compare stored data with the dataset
self.assertIn(myForm, loadedForms)
This test still passed. Why? I printed the content of myForm.name and loadedForms[0].name; both were set to Foo. That is why self.assertIn(myForm, loadedForms) passed the test, but I don't understand:
Why is the content of loadedForms changed, when Foo was only applied to myForm?
The row identity of myForm does not change when you change one of its values.
Row numbers have no meaning in a table, but to make the issue clear I will still use them.
Row 153 has 2 fields: field name = "Ben" and field homeruns = 3.
Now we change the home runs (Ben has hit a home run):
Row 153 has 2 fields: field name = "Ben" and field homeruns = 4.
It is still row 153, so your assertIn will still return True, even though one of the values in the row has changed. You only test identity.
If it didn't work that way, changing a field in a table row would have to be saved as an insert into the table rather than an update of the row. That is not correct, of course: how many Bens do we have? One. And he has 4 home runs, not 3 or 4 depending on which record you look at.
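To actually exercise persistence rather than object identity, the test can detach the in-memory objects and compare column values after re-querying. A minimal sketch, assuming the same NiceForm model and test class as above (not part of the original answer):
# set and write data
myForm = NiceForm()
myForm.name = "Ben"
db.session.add(myForm)
db.session.commit()

# detach all objects so the next query builds fresh instances
# instead of handing back the very same in-memory object
db.session.expunge_all()

# read data back and compare values, not identity
loadedForms = NiceForm.query.all()
self.assertEqual(len(loadedForms), 1)
self.assertEqual(loadedForms[0].name, "Ben")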

What's the most efficient way to add (new) documents from a Dataframe to MongoDB?

In this use case, I am trying to add documents to a MongoDB collection using pymongo. The documents are retrieved from various RSS news feeds and consist of a date (not datetime), title, and article summary held in dataframe format (the date being the index of the dataframe).
When I store the dataframe in the database, the documents are stored with the schema _id, date, title, summary, which is fine.
What I'm trying to do is upload only those rows of the dataframe which haven't already been stored as documents in the collection. There are a few ways I've tried:
1. Get the last document in the database and compare it to the dataframe, then create a new DF which excludes all previous rows plus the row being compared. This should work; however, it is still uploading roughly 20% of the rows which have been previously stored, and I have no idea why.
2. Store the entire dataframe, then aggregate the collection and remove the duplicates. Sounds good in theory, however all of the examples of doing this are in JS and not Python, so I haven't been able to get it to work.
3. Create a unique index on the title. Again, this should work in theory, but I haven't gotten it to work.
One thing that I don't want to do is query the entire collection and store it as a DF, concatenate them, drop the duplicates, delete the collection, and re-create it from the new DF. It wouldn't be an issue now, since I'm working with 30 or so documents, but when I'm working with multiple collections and millions of documents... well, not very efficient at all.
Anyone have any suggestions I can look into / research / code examples?
Here is the code I'm working with now:
Download RSS feed:
def getSymbolNews(self, symbol):
    self.symbol = symbol
    self.dbName = 'db_' + self.symbol
    self.columnName = 'col_News'
    self.topics = ['$' + self.symbol]
    self.sa = getNews().parseNews(fn.SeekingAlpha(topics=self.topics))
    self.yfin = getNews().parseNews(fn.Yahoo(topics=self.topics))
    self.wb_news = getNews().getWebullNews(self.symbol)
    self.df = pd.concat([self.sa, self.yfin, self.wb_news], axis=0, ignore_index=False)
    self.df.drop_duplicates(inplace=True)
    self.df.sort_index(ascending=True, inplace=True)
    del self.symbol, self.topics, self.sa, self.yfin, self.wb_news
    getNews().uploadRecords(self.dbName, self.columnName, self.df)
    return self.df
Upload to Collection:
def uploadRecords(self, dbName, columnName, data):
    self.data = data
    self.dbName = dbName
    self.columnName = columnName
    self.data.reset_index(inplace=True)
    self.data.rename(columns={'index': 'Date'}, inplace=True)
    mongoFunctions.insertRecords(self.dbName, self.columnName, self.data)
    del self.data
    gc.collect()
    return
PyMongo function to upload:
def insertRecords(dbName: str, collectionName: str, data: object):
    """Inserts a pandas dataframe object into a MongoDB collection (table)

    Args:
        dbName (str): Database name
        collectionName (str): Collection name
        data (object): Pandas dataframe object
    """
    collection = getCollection(dbName, collectionName)
    query = queryAllRecords(dbName, collectionName)
    if query.shape == (0, 0):
        record = data.to_dict(orient="records")
        collection.insert(record)
    else:
        query.drop(["_id"], axis=1, inplace=True)
        if query.equals(data):
            return
        else:
            df_temp = pd.concat([query, data]).drop_duplicates(keep=False)
            records = df_temp.to_dict(orient="records")
            collection.insert_many(records)
    return
I'd be minded to take an md5 hash of each document and store that as the _id; then you can just use insert_many() with ordered=False to insert any items that aren't duplicates. You can run this as often as you like and only new items will be added. Bear in mind that if any field is even slightly changed, a new item is added; if this isn't the behaviour you want, tweak what you pass to md5().
The code ends up being fairly straightforward:
from pymongo import MongoClient
from pymongo.errors import BulkWriteError
import feedparser
from hashlib import md5
from json import dumps

db = MongoClient()['mydatabase']
entries = feedparser.parse("http://feeds.bbci.co.uk/news/world/rss.xml")['entries']

for item in entries:
    item['_id'] = md5(dumps(item).encode("utf-8")).hexdigest()

try:
    db.news.insert_many(entries, ordered=False)
except BulkWriteError:
    pass
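The same idea carries over to the dataframe from the question. A sketch, not part of the original answer: the helper name upsert_new_rows is made up for illustration, and it assumes the dataframe has already been reset_index'ed into Date/title/summary columns as in uploadRecords.
from hashlib import md5
from json import dumps

from pymongo.errors import BulkWriteError

def upsert_new_rows(collection, df):
    # default=str makes dates/timestamps serialisable for hashing;
    # sort_keys keeps the hash stable regardless of column order
    records = df.to_dict(orient="records")
    for rec in records:
        rec["_id"] = md5(dumps(rec, default=str, sort_keys=True).encode("utf-8")).hexdigest()
    try:
        # ordered=False keeps inserting past duplicate-_id errors
        collection.insert_many(records, ordered=False)
    except BulkWriteError:
        pass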

SQLAlchemy MySQL - Optimal method to update a table that frequently needs to be entirely refreshed. Questions on implementation

My use case:
My code runs multiple Scrapy spiders on different US counties to collect property data on every property. This is done by looping through a list of PINs/parcels (100k to 200k) which are appended to the same URLs over and over, collecting sales data on each parcel or property, and storing that data in its respective county table one row at a time. My use case involves updating these tables frequently (once a week or so) to collect trends in sales data. Out of 100k properties, it may be that only a few acquired new sales records, but I would not know unless I went through all of them.
I began implementing this via the pipeline below, which essentially accomplishes getting the data into the table on the first run, when the table is a clean slate. However, when re-running to refresh data, I'm obviously unable to insert rows that contain the same unique ID and would need to update those rows instead. My unique ID for each data point is its parcel number.
My questions:
1. What is the optimal method to update a database table that requires a full refresh (all rows) frequently?
My guess so far, based on the research I've done, is to replace the old table with a new temporary table, because it would be quicker (I think) to insert all data into a new table than to query each item in the old table, see if it has changed, and, if changed, modify that row. This can be accomplished by inserting all data into the temporary table first, then replacing the old table with the new one.
2. If my method of implementation is optimal, how would I go about implementing it?
Should I use some kind of data migration module (pandas?)? What would happen if I dropped the old table and the program was interrupted at that point, prior to the new table replacing it?
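On that last concern, one detail worth noting (an editorial aside, not from the original post): MySQL's RENAME TABLE swaps several tables in one atomic statement, so the live table is never missing even if the process dies around the swap. A minimal sketch, assuming the engine from db_connect() and a staging table named pierce_county_property_data_new that already holds the fresh rows:
from sqlalchemy import text

engine = db_connect()
with engine.begin() as conn:
    # Atomic swap: the old table moves aside and the staging table takes its name
    conn.execute(text(
        "RENAME TABLE pierce_county_property_data TO pierce_county_property_data_old, "
        "pierce_county_property_data_new TO pierce_county_property_data"
    ))
    # Only now is the previous copy dropped
    conn.execute(text("DROP TABLE pierce_county_property_data_old"))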
class PierceDataPipeline(object):

    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates tables.
        """
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """
        This method is called for every item pipeline component.
        """
        session = self.Session()
        propertyDataTable = PierceCountyPropertyData()
        propertyDataTable.parcel = item["parcel"]
        propertyDataTable.mailing_address = item["mailing_address"]
        propertyDataTable.owner_name = item["owner_name"]
        propertyDataTable.county = item["county"]
        propertyDataTable.site_address = item["site_address"]
        propertyDataTable.property_type = item["property_type"]
        propertyDataTable.occupancy = item["occupancy"]
        propertyDataTable.year_built = item["year_built"]
        propertyDataTable.adj_year_built = item["adj_year_built"]
        propertyDataTable.units = item["units"]
        propertyDataTable.bedrooms = item["bedrooms"]
        propertyDataTable.baths = item["baths"]
        propertyDataTable.siding_type = item["siding_type"]
        propertyDataTable.stories = item["stories"]
        propertyDataTable.lot_square_footage = item["lot_square_footage"]
        propertyDataTable.lot_acres = item["lot_acres"]
        propertyDataTable.current_balance_due = item["current_balance_due"]
        propertyDataTable.tax_year_1 = item["tax_year_1"]
        propertyDataTable.tax_year_2 = item["tax_year_2"]
        propertyDataTable.tax_year_3 = item["tax_year_3"]
        propertyDataTable.tax_year_1_assessed = item["tax_year_1_assessed"]
        propertyDataTable.tax_year_2_assessed = item["tax_year_2_assessed"]
        propertyDataTable.tax_year_3_assessed = item["tax_year_3_assessed"]
        propertyDataTable.sale1_price = item["sale1_price"]
        propertyDataTable.sale1_date = item["sale1_date"]
        propertyDataTable.sale2_date = item["sale2_date"]
        propertyDataTable.sale2_price = item["sale2_price"]
        try:
            session.add(propertyDataTable)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
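For the re-run case, a common alternative to a full table swap is MySQL's INSERT ... ON DUPLICATE KEY UPDATE, which SQLAlchemy exposes through its MySQL dialect. This is a sketch rather than the poster's code; it assumes parcel carries a UNIQUE index on the PierceCountyPropertyData table and that the Scrapy item keys match the column names:
from sqlalchemy.dialects.mysql import insert

def process_item(self, item, spider):
    session = self.Session()
    table = PierceCountyPropertyData.__table__
    values = dict(item)

    stmt = insert(table).values(**values)
    # On a duplicate parcel (unique key), update every other column in place instead of failing
    stmt = stmt.on_duplicate_key_update(
        {key: stmt.inserted[key] for key in values if key != "parcel"}
    )
    try:
        session.execute(stmt)
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
    return item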

How to obtain the field type using dbfpy?

I have some dbf files that I want to add new fields to. To do so, I'm using dbfpy to open the original dbf, copy all fields (or the ones I want to keep) and records, and then create a new file with those fields plus the new ones that I want. All is working great, except for one minor detail: I can't manage to keep the original fields' types, since I don't know how to obtain them. What I'm doing is creating all the fields in the new file as "C" (character), which works for what I need right now but might become an issue eventually.
The real problem is that there is no documentation available. I searched through the package files looking for examples, but couldn't find an answer to this question (it might just be that I couldn't find it, given how "green" I still am with Python... I'm definitely not an expert).
An example of the code:
from dbfpy import dbf
import sys

org_db_file = str(sys.argv[1])
org_db = dbf.Dbf(org_db_file, new=False)
new_db_file = str(sys.argv[2])
new_db = dbf.Dbf(new_db_file, new=True)

# Obtain original field names:
fldnames = []
fldsize = {}
for name in org_db.fieldNames:
    fldnames.append(name)
    fldsize[name] = 0

# Cycle thru table entries:
for rec in org_db:
    # Cycle thru columns to obtain fields' name and value:
    for name in fldnames:
        value = str(rec[name])
        if len(value) > fldsize[name]:
            fldsize[name] = len(value)

# Copy original fields to new table:
for name in fldnames:
    new_db.addField((name, "C", fldsize[name]))

# Add new fields:
new_fieldname = "some_name"
new_db.addField((new_fieldname, "C", 2))

# Copy original entries and store new values:
for rec in org_db:
    # Create new record instance for new table:
    new_rec = new_db.newRecord()
    # Populate fields:
    for field in fldnames:
        new_rec[field] = rec[field]
    # Store value of new field for record i:
    new_rec[new_fieldname] = "some_value"
    new_rec.store()

new_db.close()
Thanks in advance for your time.
Cheers.
I don't have any experience with dbfpy other than that, when I first went looking several years ago, it (and several others) did not meet my needs, so I wrote my own.
Here is how you would accomplish your task using it:
import dbf
import sys

org_db_file = sys.argv[1]
org_db = dbf.Table(org_db_file)
new_db_file = sys.argv[2]
# postpone until we have the field names...
# new_db = dbf.Dbf(new_db_file, new=True)

# Obtain original field list:
fields = org_db.field_names
for field in fields[:]:  # cycle through a separate list
    if field == "something we don't like":
        fields.remove(field)

# now get definitions (name, type, size) for the fields we keep
field_defs = org_db.structure(fields)

# Add new fields:
new_fieldname = "some_name"
field_defs.append("some_name C(2)")

# now create the new table
new_db = org_db.new(new_db_file, field_specs=field_defs)

# open both tables
with dbf.Tables(org_db, new_db):
    # Copy original entries and store new values:
    for rec in org_db:
        # Create new record instance for new table:
        new_db.append()
        # Populate fields:
        with new_db.last_record as new_rec:
            for field in fields:
                new_rec[field] = rec[field]
            # Store value of new field for record i:
            new_rec[new_fieldname] = "some_value"
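As a quick check of the original concern (keeping field types), one could print the specs that structure() returns, since each one already carries the field's name, type and size. A sketch using only the calls shown above:
# Each spec string includes the original type and size, e.g. "OWNER C(25)" or "ACRES N(10,2)"
for spec in org_db.structure(org_db.field_names):
    print(spec)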
