What's the most efficient way to add (new) documents from a Dataframe to MongoDB? - python

In this use case, I am trying to add documents to a MongoDB collection using pymongo. The documents are retrieved from various RSS news feeds and held in a dataframe containing the date (not datetime), title, and article summary, with the date as the dataframe's index.
When I store the dataframe in the database, the documents end up with the schema _id, date, title, summary, which is fine.
What I'm trying to do is upload only those rows of the dataframe which haven't already been stored as documents in the collection. There are a few ways I've tried:
Get the last document in the database and compare it to the dataframe, then create a new DF which excludes all previous rows plus the row it's being compared to. This should work; however, it still uploads roughly 20% of the rows that have been previously stored, and I have no idea why.
Store the entire dataframe, then aggregate the collection and remove the duplicates: sounds good in theory, however all of the examples of doing this are in JS and not Python, so I haven't been able to get it to work (a rough pymongo translation is sketched after this list).
Create a unique index on the title: again, this should work in theory, but I haven't gotten it to work.
One thing that I don't want to do is query the entire collection into a DF, concatenate the two, drop the duplicates, delete the collection, and re-create it from the new DF. It isn't an issue now, since I'm working with 30 or so documents, but once I'm working with multiple collections and millions of documents it won't be efficient at all.
Anyone have any suggestions I can look into / research / code examples?
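For reference, the duplicate-removal aggregation mentioned in the second option translates to pymongo roughly as follows; this is only a sketch, assuming title is the field to deduplicate on and collection is the pymongo collection handle:

# group documents by title, then delete every _id after the first in each duplicate group
pipeline = [
    {"$group": {"_id": "$title", "ids": {"$push": "$_id"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
]
for group in collection.aggregate(pipeline):
    collection.delete_many({"_id": {"$in": group["ids"][1:]}})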
Here is the code I'm working with now:
Download RSS Feed:
def getSymbolNews(self, symbol):
    self.symbol = symbol
    self.dbName = 'db_' + self.symbol
    self.columnName = 'col_News'
    self.topics = ['$' + self.symbol]
    self.sa = getNews().parseNews(fn.SeekingAlpha(topics = self.topics))
    self.yfin = getNews().parseNews(fn.Yahoo(topics = self.topics))
    self.wb_news = getNews().getWebullNews(self.symbol)
    self.df = pd.concat([self.sa, self.yfin, self.wb_news], axis = 0, ignore_index = False)
    self.df.drop_duplicates(inplace = True)
    self.df.sort_index(ascending = True, inplace = True)
    del self.symbol, self.topics, self.sa, self.yfin, self.wb_news
    getNews().uploadRecords(self.dbName, self.columnName, self.df)
    return self.df
Upload to Collection:
def uploadRecords(self, dbName, columnName, data):
    self.data = data
    self.dbName = dbName
    self.columnName = columnName
    self.data.reset_index(inplace=True)
    self.data.rename(columns={'index': 'Date'}, inplace = True)
    mongoFunctions.insertRecords(self.dbName, self.columnName, self.data)
    del self.data
    gc.collect()
    return
PyMongo function to upload:
def insertRecords(dbName: str, collectionName: str, data: object):
    """Inserts a pandas dataframe object into a MongoDB collection (table)

    Args:
        dbName (str): Database name
        collectionName (str): Collection name
        data (object): Pandas dataframe object
    """
    collection = getCollection(dbName, collectionName)
    query = queryAllRecords(dbName, collectionName)
    if query.shape == (0, 0):
        records = data.to_dict(orient="records")
        collection.insert_many(records)  # Collection.insert() is deprecated; insert_many() takes the list of dicts
    else:
        query.drop(["_id"], axis=1, inplace=True)
        if query.equals(data):
            return
        else:
            df_temp = pd.concat([query, data]).drop_duplicates(keep=False)
            records = df_temp.to_dict(orient="records")
            collection.insert_many(records)
    return

I'd be minded to take an md5 hash of the document and store that as the _id; then you can just use insert_many() with ordered=False to insert any items that aren't duplicates. You can run this as often as you like and only new items will be added. Bear in mind that if any field is even slightly changed, a new item is added; if this isn't the behaviour you want, tweak what you pass to md5().
The code ends up being fairly straightforward:
from pymongo import MongoClient
from pymongo.errors import BulkWriteError
import feedparser
from hashlib import md5
from json import dumps

db = MongoClient()['mydatabase']
entries = feedparser.parse("http://feeds.bbci.co.uk/news/world/rss.xml")['entries']
for item in entries:
    item['_id'] = md5(dumps(item).encode("utf-8")).hexdigest()

try:
    db.news.insert_many(entries, ordered=False)
except BulkWriteError:
    pass
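In your case, since a row should count as "already stored" when its date and title match an existing document, you could hash only those fields rather than the whole document. A rough sketch against the dataframe from the question (the Date and title column names, and the db_/col_ naming, are assumptions taken from the posted code; adjust them to your actual schema):

from hashlib import md5

from pymongo import MongoClient
from pymongo.errors import BulkWriteError

collection = MongoClient()['db_AAPL']['col_News']  # example database/collection names

# df is the dataframe produced by getSymbolNews(); reset the index so the date becomes a column
records = df.reset_index().rename(columns={'index': 'Date'}).to_dict(orient='records')
for rec in records:
    # hash only the fields that define "the same article"
    key = f"{rec['Date']}|{rec['title']}"
    rec['_id'] = md5(key.encode('utf-8')).hexdigest()

try:
    collection.insert_many(records, ordered=False)  # duplicate _ids are skipped, new rows are inserted
except BulkWriteError:
    pass

This keeps the "run it as often as you like" property, but an edit to the summary alone no longer creates a second copy of the same article.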

Related

Exporting data into CSV file from Flask-SQLAlchemy

I'm looking to generate (export) a CSV from a Flask-SQLAlchemy app I'm developing, but I'm getting some unexpected outcomes in my CSV: instead of the actual data from the MySQL DB table being populated in the CSV file, I get the declarative class model entries (placeholders??). The issue could be the way I structured the query, or even the entire function.
Oddly enough, judging from the CSV output (pic), it would seem I'm on the right track, since the row/column count is the same as the DB table, but the actual data is just not populated. I'm fairly new to SQLAlchemy ORM and Flask, so I'm looking for some guidance here to pull through. Constructive feedback appreciated.
# class declaration with DB object (divo)
class pearl(divo.Model):
    __tablename__ = 'users'
    work_id = divo.Column(divo.Integer, primary_key=True)
    user_fname = divo.Column(divo.String(length=255))
    user_lname = divo.Column(divo.String(length=255))
    user_category = divo.Column(divo.String(length=255))
    user_status = divo.Column(divo.String(length=1))
    login_id = divo.Column(divo.String(length=255))
    login_passwd = divo.Column(divo.String(length=255))

# user report function
@app.route("/reports/users")
def users_report():
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
        x15 = pearl.query.all()
        for i in x15:
            # x16 = tuple(x15)
            csv_out = csv.writer(s_key)
            csv_out.writerow(x15)
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))  # return redirect(url_for('other_tasks'))

# csv outcome (see attached pic)
instead of the actual data from the MySQL DB table populated in the csv file, i get the declarative class model entries (placeholders??)
Each object in the list
x15 = pearl.query.all()
represents a row in your users table.
What you're seeing in the spreadsheet are not placeholders, but string representations of each row object (see object.__repr__).
You could get the value of a column for a particular row object by the column name attribute, for example:
x15[0].work_id # Assumes there is at least one row object in x15
What you could do instead is something like this:
with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
    x15 = divo.session.query(pearl.work_id, pearl.user_fname)  # Add columns to query as needed
    csv_out = csv.writer(s_key)
    for i in x15:
        csv_out.writerow(i)
i in the code above is a tuple of the form:
('work_id value', 'user_fname value')
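Putting that together inside the route, a sketch of what users_report could end up looking like (the column choice, the header row, and the newline='' argument, which stops csv from writing blank lines between rows on Windows, are additions to adapt; pearl, divo, flash and the rest come from the app in the question):

import csv

from flask import flash, redirect, url_for

@app.route("/reports/users")
def users_report():
    rows = divo.session.query(pearl.work_id, pearl.user_fname, pearl.user_lname)
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w', newline='') as s_key:
        csv_out = csv.writer(s_key)
        csv_out.writerow(['work_id', 'user_fname', 'user_lname'])  # header row
        for row in rows:
            csv_out.writerow(row)  # each row is a tuple of column values
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))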

Select specific columns to read from PostgreSQL based on python list

I have two lists: one contains the column names of categorical variables and the other the numeric ones, as shown below.
cat_cols = ['stat','zip','turned_off','turned_on']
num_cols = ['acu_m1','acu_cnt_m1','acu_cnt_m2','acu_wifi_m2']
These are the column names in a table in Redshift.
I want to pass these as a parameter to pull only the numeric columns from a table in Redshift (PostgreSQL), write that to a CSV, and close the CSV.
Next, I want to pull only cat_cols, open the CSV, append to it, and close it.
My query so far:
#1.Pull num data:
seg = ['seg1','seg2']
sql_data = str(""" SELECT {num_cols} """ + """FROM public.""" + str(seg) + """ order by random() limit 50000 ;""")
df_data = pd.read_sql(sql_data, cnxn)
# Write to csv.
df_data.to_csv("df_sample.csv",index = False)
#2.Pull cat data:
sql_data = str(""" SELECT {cat_cols} """ + """FROM public.""" + str(seg) + """ order by random() limit 50000 ;""")
df_data = pd.read_sql(sql_data, cnxn)
# Append to df_seg.csv and close the connection to csv.
with open("df_sample.csv",'rw'):
## Append to the csv ##
This is the first time I am trying to do selective querying based on Python lists, and hence I'm stuck on how to pass the list as column names to select from the table.
Can someone please help me with this?
If you want to build the query as a string, in your case it is better to use the format method or f-strings (which require Python 3.6+).
Here is an example for your case, using only the built-in format function:
seg = ['seg1', 'seg2']
num_cols = ['acu_m1','acu_cnt_m1','acu_cnt_m2','acu_wifi_m2']
query = """
SELECT {} FROM public.{} order by random() limit 50000;
""".format(', '.join(num_cols), seg)
print(query)
If you only want one item from the seg list, use seg[0] or seg[1] in the format call.
I hope this will help you!
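Wired into the snippet from the question, an f-string version that selects one table from seg and reads the result straight into pandas might look like the sketch below (cnxn is the existing connection from the question; note that the column names are interpolated directly into the SQL, so they should only ever come from your own trusted lists, never from user input):

import pandas as pd

num_cols = ['acu_m1', 'acu_cnt_m1', 'acu_cnt_m2', 'acu_wifi_m2']
cat_cols = ['stat', 'zip', 'turned_off', 'turned_on']
seg = ['seg1', 'seg2']

# 1. Pull the numeric columns and write a fresh CSV
num_sql = f"SELECT {', '.join(num_cols)} FROM public.{seg[0]} ORDER BY random() LIMIT 50000;"
pd.read_sql(num_sql, cnxn).to_csv("df_sample.csv", index=False)

# 2. Pull the categorical columns and append them to the same CSV
cat_sql = f"SELECT {', '.join(cat_cols)} FROM public.{seg[0]} ORDER BY random() LIMIT 50000;"
pd.read_sql(cat_sql, cnxn).to_csv("df_sample.csv", mode='a', index=False)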

SQLAlchemy MySQL - Optimal method to update a table that needs to be frequently entirely updated. Questions on implementation

My use case:
My code runs multiple Scrapy spiders on different US counties to collect property data on every property. This is done by looping through a list of PINs/parcels (100k to 200k) which are appended to the same URLs over and over, collecting sales data on each parcel or property, and storing that data in its respective county table one row at a time. My use case involves updating these tables frequently (once a week or so) to collect trends in the sales data. Out of 100k properties, it may be that only a few acquired new sales records, but I would not know unless I went through all of them.
I began implementing this via the pipeline below, which gets the data into the table on the first run, when the table is a clean slate. However, when re-running to refresh the data, I'm obviously unable to insert rows that contain the same unique ID and would need to update those rows instead. My unique ID for each data point is its parcel number.
My questions:
1. What is the optimal method to update a database table that requires a full refresh (all rows) frequently?
My guess so far, based on the research I've done, is to replace the old table with a new temporary table. This is because it would be quicker (I think) to insert all the data into a new table than to query each item in the old table, see if it has changed, and if changed, modify that row. This can be accomplished by inserting all data into the temporary table first, then replacing the old table with the new one.
2. If that method is optimal, how would I go about implementing it?
3. Should I use some kind of data migration module (pandas?)? And what would happen if I dropped the old table and the program was interrupted at that point, before the new table replaced it? (A sketch of an alternative approach is included after the pipeline code below.)
class PierceDataPipeline(object):

    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates tables.
        """
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """
        This method is called for every item pipeline component
        """
        session = self.Session()
        propertyDataTable = PierceCountyPropertyData()
        propertyDataTable.parcel = item["parcel"]
        propertyDataTable.mailing_address = item["mailing_address"]
        propertyDataTable.owner_name = item["owner_name"]
        propertyDataTable.county = item["county"]
        propertyDataTable.site_address = item["site_address"]
        propertyDataTable.property_type = item["property_type"]
        propertyDataTable.occupancy = item["occupancy"]
        propertyDataTable.year_built = item["year_built"]
        propertyDataTable.adj_year_built = item["adj_year_built"]
        propertyDataTable.units = item["units"]
        propertyDataTable.bedrooms = item["bedrooms"]
        propertyDataTable.baths = item["baths"]
        propertyDataTable.siding_type = item["siding_type"]
        propertyDataTable.stories = item["stories"]
        propertyDataTable.lot_square_footage = item["lot_square_footage"]
        propertyDataTable.lot_acres = item["lot_acres"]
        propertyDataTable.current_balance_due = item["current_balance_due"]
        propertyDataTable.tax_year_1 = item["tax_year_1"]
        propertyDataTable.tax_year_2 = item["tax_year_2"]
        propertyDataTable.tax_year_3 = item["tax_year_3"]
        propertyDataTable.tax_year_1_assessed = item["tax_year_1_assessed"]
        propertyDataTable.tax_year_2_assessed = item["tax_year_2_assessed"]
        propertyDataTable.tax_year_3_assessed = item["tax_year_3_assessed"]
        propertyDataTable.sale1_price = item["sale1_price"]
        propertyDataTable.sale1_date = item["sale1_date"]
        propertyDataTable.sale2_date = item["sale2_date"]
        propertyDataTable.sale2_price = item["sale2_price"]

        try:
            session.add(propertyDataTable)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
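One alternative worth weighing against the temp-table swap: since parcel is the unique key, the pipeline could issue an upsert instead of a plain add(), so existing rows get updated in place and new ones inserted. A sketch using SQLAlchemy's MySQL dialect insert (it assumes SQLAlchemy 1.2+, that parcel is the table's unique/primary key, and that the item keys match the column names; upsert_item is a hypothetical helper, not part of the existing pipeline):

from sqlalchemy.dialects.mysql import insert

def upsert_item(session, item):
    table = PierceCountyPropertyData.__table__
    values = {key: item[key] for key in item.keys()}  # assumes item keys match column names
    stmt = insert(table).values(**values)
    # on a duplicate parcel, overwrite every non-key column with the freshly scraped value
    update_cols = {c.name: stmt.inserted[c.name] for c in table.columns if c.name != "parcel"}
    stmt = stmt.on_duplicate_key_update(**update_cols)
    session.execute(stmt)
    session.commit()

As for the interruption concern with the temp-table approach: in MySQL, RENAME TABLE can swap the old and new tables in a single atomic statement, so the swap either happens completely or not at all.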

Get all rows in Json format with API-Rest Cassandra

I have the following code that allows me to retrieve the first keyspace:
def Query(str):
    auth_provider = PlainTextAuthProvider(username='admin', password='root')
    cluster = Cluster(['hostname'], auth_provider=auth_provider)
    session = cluster.connect('system')
    rows = session.execute(str)
    keyspaces = []
    row_list = list(rows)
    for x in range(len(row_list)):
        return row_list[0]

@app.route('/keyspaces')
def all():
    return Query('select json * from schema_keyspaces')
I would like to get not only all the keyspaces but also their attributes, and all of that as a JSON document. How can I proceed?
Thanks,
Instead of a loop that only runs once, you need to collect all the elements:
rows = session.execute(str)
return jsonify(list(rows))
Note that you should ideally not create a new Cassandra connection for each query you make, but that's unrelated to the current problem.
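A sketch of how the two fixes could fit together, with the cluster and session created once at module level rather than per request (hostname and credentials are copied from the question; Flask and jsonify are assumed from the @app.route decorator):

import json

from flask import Flask, jsonify
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

app = Flask(__name__)

# create the connection once, not on every request
auth_provider = PlainTextAuthProvider(username='admin', password='root')
cluster = Cluster(['hostname'], auth_provider=auth_provider)
session = cluster.connect('system')

@app.route('/keyspaces')
def all_keyspaces():
    rows = session.execute('select json * from schema_keyspaces')
    # each row holds one keyspace as a JSON string; parse them so the response is a real JSON array
    return jsonify([json.loads(row[0]) for row in rows])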

How to obtain the field type using dbfpy?

I have some dbf files that I want to add new fields to. To do so, I'm using dbfpy to open the original dbf, copy all the fields (or the ones I want to keep) and records, and then create a new file with those fields plus the new ones that I want. All is working great, except for one minor detail: I can't manage to keep the original fields' types, since I don't know how to obtain them. What I'm doing is creating all the fields in the new file as "C" (character), which works for what I need right now but might become an issue eventually.
The real problem is that there is no documentation available. I searched through the package files looking for examples, but couldn't find an answer to this question (it might be that I just couldn't find it, given how "greenish" I still am with Python... I'm definitely not an expert).
An example of the code:
from dbfpy import dbf
import sys

org_db_file = str(sys.argv[1])
org_db = dbf.Dbf(org_db_file, new = False)
new_db_file = str(sys.argv[2])
new_db = dbf.Dbf(new_db_file, new = True)

# Obtain original field names:
fldnames = []
fldsize = {}
for name in org_db.fieldNames:
    fldnames.append(name)
    fldsize[name] = 0

# Cycle thru table entries:
for rec in org_db:
    # Cycle thru columns to obtain fields' name and value:
    for name in fldnames:
        value = str(rec[name])
        if len(value) > fldsize[name]:
            fldsize[name] = len(value)

# Copy original fields to new table:
for name in fldnames:
    new_db.addField((name, "C", fldsize[name]))

# Add new fields:
new_fieldname = "some_name"
new_db.addField((new_fieldname, "C", 2))

# Copy original entries and store new values:
for rec in org_db:
    # Create new record instance for new table:
    new_rec = new_db.newRecord()
    # Populate fields:
    for field in fldnames:
        new_rec[field] = rec[field]
    # Store value of new field for record i:
    new_rec[new_fieldname] = "some_value"
    new_rec.store()

new_db.close()
Thanks in advance for your time.
Cheers.
I don't have any experience with dbfpy other than that, when I first went looking several years ago, it (and several others) did not meet my needs, so I wrote my own.
Here is how you would accomplish your task using it:
import dbf
import sys

org_db_file = sys.argv[1]
org_db = dbf.Table(org_db_file)
new_db_file = sys.argv[2]
# postpone until we have the field names...
# new_db = dbf.Dbf(new_db_file, new = True)

# Obtain original field list:
fields = org_db.field_names
for field in fields[:]:  # cycle through a separate list
    if field == "something we don't like":
        fields.remove(field)

# now get definitions for the fields we keep
field_defs = org_db.structure(fields)

# Add new fields:
new_fieldname = "some_name"
field_defs.append("some_name C(2)")

# now create the new table
new_db = org_db.new(new_db_file, field_specs=field_defs)

# open both tables
with dbf.Tables(org_db, new_db):
    # Copy original entries and store new values:
    for rec in org_db:
        # Create new record instance for new table:
        new_db.append()
        # Populate fields:
        with new_db.last_record as new_rec:
            for field in fields:  # only the fields copied from the original
                new_rec[field] = rec[field]
            # Store value of new field for record i:
            new_rec[new_fieldname] = "some_value"
