SQLAlchemy MySQL - Optimal method to update a table that needs frequent full updates. Questions on implementation - python

My use case:
My code runs multiple Scrapy spiders on different US counties to collect property data on every property. It does this by looping through a list of PINs/parcels (100k to 200k), appending each to the same URLs over and over, collecting sales data on each parcel or property, and storing that data in its respective county table one row at a time. My use case involves refreshing these tables frequently (once a week or so) to track trends in sales data. Out of 100k properties, it may be that only a few acquired new sales records, but I would not know unless I went through all of them.
I began implementing this via the pipeline below, which accomplishes getting the data into the table on the first run, when the table is a clean slate. However, when re-running to refresh the data, I'm obviously unable to insert rows that contain the same unique ID and would need to update those rows instead. My unique ID for each data point is its parcel number.
My questions:
1. What is the optimal method to update a database table that requires a full refresh (all rows) frequently?
My guess so far, based on the research I've done, is replacing the old table with a new temporary table. It should be quicker (I think) to insert all the data into a new table than to query each item in the old table, check whether it has changed, and modify the row if it has. This can be accomplished by inserting all the data into the temporary table first, then replacing the old table with the new one.
2. If my method of implementation is optimal, how would I go about implementing it? Should I use some kind of data migration module (pandas?)? And what would happen if I dropped the old table and the program was interrupted at that point, before the new table replaced it?
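On the interruption concern: MySQL can rename several tables in one statement, and that statement is atomic, so there is never a moment when the main table is missing. A minimal sketch of the swap, assuming SQLAlchemy Core and hypothetical table names:

from sqlalchemy import text

def swap_tables(engine):
    # Scrape everything into pierce_property_data_new first, then swap.
    with engine.begin() as conn:
        # A multi-name RENAME TABLE runs as a single atomic statement in MySQL,
        # so an interruption never leaves the main table missing.
        conn.execute(text(
            "RENAME TABLE pierce_property_data TO pierce_property_data_old, "
            "pierce_property_data_new TO pierce_property_data"
        ))
        conn.execute(text("DROP TABLE pierce_property_data_old"))

The pipeline I have so far: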
class PierceDataPipeline(object):

    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates tables.
        """
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """
        This method is called for every item pipeline component.
        """
        session = self.Session()
        propertyDataTable = PierceCountyPropertyData()
        propertyDataTable.parcel = item["parcel"]
        propertyDataTable.mailing_address = item["mailing_address"]
        propertyDataTable.owner_name = item["owner_name"]
        propertyDataTable.county = item["county"]
        propertyDataTable.site_address = item["site_address"]
        propertyDataTable.property_type = item["property_type"]
        propertyDataTable.occupancy = item["occupancy"]
        propertyDataTable.year_built = item["year_built"]
        propertyDataTable.adj_year_built = item["adj_year_built"]
        propertyDataTable.units = item["units"]
        propertyDataTable.bedrooms = item["bedrooms"]
        propertyDataTable.baths = item["baths"]
        propertyDataTable.siding_type = item["siding_type"]
        propertyDataTable.stories = item["stories"]
        propertyDataTable.lot_square_footage = item["lot_square_footage"]
        propertyDataTable.lot_acres = item["lot_acres"]
        propertyDataTable.current_balance_due = item["current_balance_due"]
        propertyDataTable.tax_year_1 = item["tax_year_1"]
        propertyDataTable.tax_year_2 = item["tax_year_2"]
        propertyDataTable.tax_year_3 = item["tax_year_3"]
        propertyDataTable.tax_year_1_assessed = item["tax_year_1_assessed"]
        propertyDataTable.tax_year_2_assessed = item["tax_year_2_assessed"]
        propertyDataTable.tax_year_3_assessed = item["tax_year_3_assessed"]
        propertyDataTable.sale1_price = item["sale1_price"]
        propertyDataTable.sale1_date = item["sale1_date"]
        propertyDataTable.sale2_date = item["sale2_date"]
        propertyDataTable.sale2_price = item["sale2_price"]
        try:
            session.add(propertyDataTable)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
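Alternatively, the insert-or-update can be done row by row with MySQL's ON DUPLICATE KEY UPDATE, which SQLAlchemy exposes through its MySQL dialect (SQLAlchemy 1.2+). A minimal sketch, assuming parcel is the table's unique key and reusing the model above:

from sqlalchemy.dialects.mysql import insert

def upsert_item(session, item):
    table = PierceCountyPropertyData.__table__
    stmt = insert(table).values(**dict(item))
    # On a duplicate parcel number, update every non-key column instead of failing.
    update_cols = {c.name: stmt.inserted[c.name]
                   for c in table.columns if c.name != "parcel"}
    session.execute(stmt.on_duplicate_key_update(**update_cols))
    session.commit()

This avoids the table swap entirely, at the cost of one round trip per row; batching statements per spider run would cut that down.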

Related

Exporting data into CSV file from Flask-SQLAlchemy

I'm looking to generate (export) a CSV from a Flask-SQLAlchemy app I'm developing, but I'm getting some unexpected outcomes in my CSV: instead of the actual data from the MySQL DB table being written to the CSV file, I get the declarative class model entries (placeholders??). The issue could be the way I structured the query, or even the entire function.
Oddly enough, judging from the CSV output (pic), it would seem I'm on the right track, since the row/column count is the same as in the DB table, but the actual data is just not populated. I'm fairly new to SQLAlchemy ORM and Flask, so I'm looking for some guidance here to pull through. Constructive feedback appreciated.
# class declaration with DB object (divo)
class pearl(divo.Model):
    __tablename__ = 'users'

    work_id = divo.Column(divo.Integer, primary_key=True)
    user_fname = divo.Column(divo.String(length=255))
    user_lname = divo.Column(divo.String(length=255))
    user_category = divo.Column(divo.String(length=255))
    user_status = divo.Column(divo.String(length=1))
    login_id = divo.Column(divo.String(length=255))
    login_passwd = divo.Column(divo.String(length=255))

# user report function
@app.route("/reports/users")
def users_report():
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
        x15 = pearl.query.all()
        for i in x15:
            # x16 = tuple(x15)
            csv_out = csv.writer(s_key)
            csv_out.writerow(x15)
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))  # return redirect(url_for('other_tasks'))
#csv outcome (see attached pic)
Instead of the actual data from the MySQL DB table being written to the CSV file, I get the declarative class model entries (placeholders??)
Each object in the list
x15 = pearl.query.all()
represents a row in your users table.
What you're seeing in the spreadsheet are not placeholders but the string representations of each row object (see object.__repr__).
You can get the value of a column for a particular row object via its column-name attribute, for example:
x15[0].work_id  # assumes there is at least one row object in x15
What you could do instead is something like this:
with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w', newline='') as s_key:
    csv_out = csv.writer(s_key)  # create the writer once, outside the loop
    x15 = divo.session.query(pearl.work_id, pearl.user_fname)  # add columns to the query as needed
    for i in x15:
        csv_out.writerow(i)
i in the code above is a tuple of the form:
('work_id value', 'user_fname value')
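If every column is wanted, the model's table metadata can drive both the query and a header row. A sketch along the same lines (path and model names as in the question):

import csv

with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w', newline='') as s_key:
    csv_out = csv.writer(s_key)
    columns = pearl.__table__.columns.keys()  # ['work_id', 'user_fname', ...]
    csv_out.writerow(columns)                 # header row
    for row in divo.session.query(*[getattr(pearl, c) for c in columns]):
        csv_out.writerow(row)                 # each row is a tuple of column values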

What's the most efficient way to add (new) documents from a Dataframe to MongoDB?

In this use case, I am trying to add documents to a MongoDB collection using pymongo. The documents are retrieved from various RSS news feeds based on the date (not datetime), title, and article summary, in dataframe format (the date being the dataframe's index).
When I store the dataframe to the database, the documents are stored with the schema _id, date, title, summary, which is fine.
So what I'm trying to do is upload only those rows in the dataframe which haven't already been stored as documents in the collection. There are a few ways I've tried:
Get the last document in the database and compare it to the dataframe, creating a new DF which excludes all previous rows plus the row being compared. This should work; however, it still uploads roughly 20% of the rows which have previously been stored, and I have no idea why.
Store the entire dataframe, then aggregate the collection and remove the duplicates. This sounds good in theory, but all of the examples of doing it are in JS rather than Python, so I haven't been able to get it to work.
Create a unique index on the title. Again, this should work in theory, but I haven't gotten it to work.
One thing that I don't want to do is query the entire collection into a DF, concatenate them, drop the duplicates, delete the collection, and re-create it from the new DF. It wouldn't be an issue now, since I'm working with 30 or so documents, but when I'm working with multiple collections and millions of documents... not very efficient at all.
Does anyone have any suggestions I can look into / research / code examples?
Here is the code I'm working with now:
Download RSS Feed
def getSymbolNews(self, symbol):
    self.symbol = symbol
    self.dbName = 'db_' + self.symbol
    self.columnName = 'col_News'
    self.topics = ['$' + self.symbol]
    self.sa = getNews().parseNews(fn.SeekingAlpha(topics=self.topics))
    self.yfin = getNews().parseNews(fn.Yahoo(topics=self.topics))
    self.wb_news = getNews().getWebullNews(self.symbol)
    self.df = pd.concat([self.sa, self.yfin, self.wb_news], axis=0, ignore_index=False)
    self.df.drop_duplicates(inplace=True)
    self.df.sort_index(ascending=True, inplace=True)
    del self.symbol, self.topics, self.sa, self.yfin, self.wb_news
    getNews().uploadRecords(self.dbName, self.columnName, self.df)
    return self.df
Upload to Collection:
def uploadRecords(self, dbName, columnName, data):
    self.data = data
    self.dbName = dbName
    self.columnName = columnName
    self.data.reset_index(inplace=True)
    self.data.rename(columns={'index': 'Date'}, inplace=True)
    mongoFunctions.insertRecords(self.dbName, self.columnName, self.data)
    del self.data
    gc.collect()
    return
PyMongo function to upload:
def insertRecords(dbName: str, collectionName: str, data: object):
    """Inserts a pandas dataframe object into a MongoDB collection (table)

    Args:
        dbName (str): Database name
        collectionName (str): Collection name
        data (object): Pandas dataframe object
    """
    collection = getCollection(dbName, collectionName)
    query = queryAllRecords(dbName, collectionName)
    if query.shape == (0, 0):
        records = data.to_dict(orient="records")
        # Collection.insert() is deprecated (removed in PyMongo 4); use insert_many()
        collection.insert_many(records)
    else:
        query.drop(["_id"], axis=1, inplace=True)
        if query.equals(data):
            return
        else:
            df_temp = pd.concat([query, data]).drop_duplicates(keep=False)
            records = df_temp.to_dict(orient="records")
            collection.insert_many(records)
    return
I'd be minded to take an MD5 hash of the document and store that as the _id; then you can just use insert_many() with ordered=False to insert any items that aren't duplicates. You can run this as often as you like, and only new items will be added. Bear in mind that if any field is even slightly changed, a new item is added; if this isn't the behaviour you want, then tweak what you pass to md5().
The code ends up being fairly straightforward:
from pymongo import MongoClient
from pymongo.errors import BulkWriteError
import feedparser
from hashlib import md5
from json import dumps

db = MongoClient()['mydatabase']

entries = feedparser.parse("http://feeds.bbci.co.uk/news/world/rss.xml")['entries']
for item in entries:
    # default=str covers fields (e.g. parsed dates) that json can't serialize directly
    item['_id'] = md5(dumps(item, default=str).encode("utf-8")).hexdigest()

try:
    db.news.insert_many(entries, ordered=False)
except BulkWriteError:
    pass  # duplicate _ids are skipped; everything else is inserted
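The unique-index route mentioned in the question works the same way: let MongoDB reject the duplicates rather than detecting them in pandas. A sketch, assuming the title field is the deduplication key and df is the dataframe from the question:

from pymongo import MongoClient, ASCENDING
from pymongo.errors import BulkWriteError

collection = MongoClient()['mydatabase']['news']
# One-time setup: refuse any second document with the same title.
collection.create_index([('title', ASCENDING)], unique=True)

records = df.reset_index().to_dict(orient='records')
try:
    # ordered=False keeps inserting past duplicate-key errors.
    collection.insert_many(records, ordered=False)
except BulkWriteError:
    pass  # duplicates were skipped; only new rows were stored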

Get all rows in Json format with API-Rest Cassandra

I have the following code, which allows me to retrieve the first keyspace:
def Query(str):
    auth_provider = PlainTextAuthProvider(username='admin', password='root')
    cluster = Cluster(['hostname'], auth_provider=auth_provider)
    session = cluster.connect('system')
    rows = session.execute(str)
    keyspaces = []
    row_list = list(rows)
    for x in range(len(row_list)):
        return row_list[0]

@app.route('/keyspaces')
def all():
    return Query('select json * from schema_keyspaces')
I would like to get not only all the keyspaces but also their attributes, all in a JSON document. How can I proceed?
Thanks,
Instead of a loop that only runs once, you need to collect all the elements:

rows = session.execute(str)
return jsonify(list(rows))

Note that you should ideally not be creating a new Cassandra connection for each query you make, but that's unrelated to the current problem.
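Putting those pieces together: since SELECT JSON already returns one JSON-encoded string per row, the strings can be joined directly instead of re-serialized. A sketch, reusing the question's hostname and credentials and building the session once at startup:

from flask import Flask, Response
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

app = Flask(__name__)
auth_provider = PlainTextAuthProvider(username='admin', password='root')
cluster = Cluster(['hostname'], auth_provider=auth_provider)
session = cluster.connect('system')

@app.route('/keyspaces')
def all_keyspaces():
    rows = session.execute('SELECT JSON * FROM schema_keyspaces')
    # Each row carries a single column holding the JSON text for that keyspace.
    body = '[' + ','.join(row[0] for row in rows) + ']'
    return Response(body, mimetype='application/json')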

BigQuery insert/delete to a table

I have a table X in BigQuery with 170,000 rows. The values in this table are based on complex calculations done on the values from a table Y. These are done in Python, so as to automate the ingestion when Y gets updated.
Every time Y updates, I recompute the values needed for X in my script and insert them using the script below, which uses streaming:
def stream_data(table, json_data):
    data = json.loads(str(json_data))
    # Reload the table to get the schema.
    table.reload()
    rows = [data]
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row into {}'.format(table))
    else:
        print('Errors:')
The problem here is that I have to delete all the rows in the table before I insert. I know a query to do this, but it fails because BigQuery does not allow DML while there is a streaming buffer on the table, and this apparently lasts for a day.
Is there a workaround where I can delete all rows in X, recompute based on Y, and then insert the new values using the code above? Possibly turning the streaming buffer off??!!
Another option would be to drop the whole table and recreate it. But my table is huge, with 60 columns, and the JSON for the schema would be huge too. I couldn't find samples where I can create a new table with a schema passed from JSON/a file; some samples of this would be great (see the sketch below).
A third option is to make the streaming insert smart, so that it does an update instead of an insert if the row has changed. This again is a DML operation and runs into the original problem.
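On the schema-from-a-file option: the pre-1.0 google-cloud-bigquery client used in these snippets can build the schema from a JSON file of {"name": ..., "type": ..., "mode": ...} entries. A sketch (file path hypothetical; project/dataset/table names as in the question):

import json
from google.cloud import bigquery

def recreate_table(schema_path):
    with open(schema_path) as f:
        fields = json.load(f)
    schema = [bigquery.SchemaField(fld["name"], fld["type"], mode=fld.get("mode", "NULLABLE"))
              for fld in fields]
    client = bigquery.Client("myproject")
    dataset = client.dataset("mydataset")
    table = dataset.table("test", schema)
    table.delete()  # drop the old table (raises NotFound if it doesn't exist)
    table.create()  # recreate it with the JSON-driven schema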
UPDATE:
Another approach I tried is to delete the table and recreate it. Before the delete I copy the schema so I can set it on the new table:

def stream_data(json_data):
    bigquery_client = bigquery.Client("myproject")
    dataset = bigquery_client.dataset("mydataset")
    table = dataset.table("test")
    data = json.loads(json_data)
    schema = table.schema
    table.delete()
    table = dataset.table("test")
    # Set the table schema
    table = dataset.table("test", schema)
    table.create()
    rows = [data]
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row')
    else:
        print('Errors:')
This gives me an error:
ValueError: Set either 'view_query' or 'schema'.
UPDATE 2:
The key was to call table.reload() before schema = table.schema, which fixes the above!

Put data retrieved from a MySQL query into a pandas DataFrame with a for loop

I have one database with two tables, both of which have a column called barcode. The aim is to retrieve a barcode from one table and search for entries in the other, where extra information on that barcode is stored. I would like both sets of retrieved data to be saved in a DataFrame. The problem is that when I insert the retrieved data from the second query into a DataFrame, it stores only the last entry:
import mysql.connector
import pandas as pd

# connect() is the actual entry point; mysql.connector itself is not callable
cnx = mysql.connector.connect(user=user, password=password, host=host, database=database)

query_barcode = "SELECT barcode FROM barcode_store"
cursor = cnx.cursor()
cursor.execute(query_barcode)
data_barcode = cursor.fetchall()
Up to this point everything works smoothly, and here is the part with the problem:

query_info = "SELECT product_code FROM product_info WHERE barcode=%s"
for each_barcode in data_barcode:
    cursor.execute(query_info, each_barcode)  # each_barcode is a 1-tuple; bind it as a parameter
    pro_info = pd.DataFrame(cursor.fetchall())
pro_info contains only the information for the last matching barcode, while I want to retrieve the information for every match in data_barcode.
That's because you are overwriting the existing pro_info with new data on each loop iteration. You should instead do something like:
query_info = ("SELECT product_code FROM product_info")
cursor.execute(query_info)
pro_info = pd.DataFrame(cursor.fetchall())
Making so many SELECTs is redundant since you can get all records in one SELECT and instantly insert them to your DataFrame.
Edit: However, if you need the WHERE clause to fetch only specific products, you should store the records in a list until you insert them into the DataFrame. Your code would then look like:
pro_list = []
query_info = "SELECT product_code FROM product_info WHERE barcode=%s"
for each_barcode in data_barcode:
    cursor.execute(query_info, each_barcode)  # parameter binding avoids quoting/injection issues
    pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)
Cheers!
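The per-barcode loop can also collapse into a single round trip with an IN clause; a sketch using parameter binding and pandas (table and column names as in the question):

import pandas as pd

placeholders = ', '.join(['%s'] * len(data_barcode))
query_info = ("SELECT product_code FROM product_info "
              "WHERE barcode IN ({})".format(placeholders))
params = [b[0] for b in data_barcode]  # fetchall() returns 1-tuples
pro_info = pd.read_sql(query_info, cnx, params=params)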
