Efficient Update of All Rows in SQLAlchemy - python

I'm currently trying to update all rows in a table, setting the value of one column, images, based on the value of another column, source_images, in the same row.
for row in Car.query.all():
    # Generate a list of strings
    images = [s for s in row.source_images if 'example.com/' in s]
    # Column 'images' being updated is a sqlalchemy.dialects.postgresql.ARRAY(Text())
    Car.query.filter(Car.id == row.id).update({'images': images})
    db_session.commit()
Problem: This appears to be really slow, especially when applying to 100k+ rows. Is there a more efficient way of updating the rows?
Similar questions:
#1: This question involves updating all the rows by incrementing the value.
Model Class Definition: cars.py
from sqlalchemy import *
from sqlalchemy.dialects import postgresql
from ..Base import Base
class Car(Base):
    __tablename__ = 'cars'

    id = Column(Integer, primary_key=True)
    images = Column(postgresql.ARRAY(Text))
    source_images = Column(postgresql.ARRAY(Text))

You could take the operation to the DB, instead of fetching and updating each row separately:
from sqlalchemy import select, column, func

source_images = select([column('i')]).\
    select_from(func.unnest(Car.source_images).alias('i')).\
    where(column('i').contains('example.com/'))

source_images = func.array(source_images)

Car.query.update({Car.images: source_images},
                 synchronize_session=False)
The correlated subquery unnests the source images, selects those matching the criteria, and the ARRAY() constructor forms the new image array.
Alternatively you could use array_agg():
source_images = select([func.array_agg(column('i'))]).\
    select_from(func.unnest(Car.source_images).alias('i')).\
    where(column('i').contains('example.com/'))
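Putting the pieces together, a minimal sketch of the whole bulk update (this assumes the Car model and db_session from the question, and the SQLAlchemy 1.x select() calling style used above):

from sqlalchemy import select, column, func

# Correlated subquery: unnest source_images, keep only matching URLs,
# and rebuild an array per row with the ARRAY() constructor.
matching = select([column('i')]).\
    select_from(func.unnest(Car.source_images).alias('i')).\
    where(column('i').contains('example.com/'))

# Single UPDATE statement executed in the database, no per-row round trips.
Car.query.update({Car.images: func.array(matching)},
                 synchronize_session=False)
db_session.commit()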

Related

Dynamically Adding a Column to a SQLAlchemy Table

I have a SQLAlchemy Table which is defined as:
class Student(db.Model):
    id = db.Column(db.Integer(), primary_key=True)
    name = db.Column(db.Text(), nullable=False)
I am loading this table with data from a .csv file, and I may have columns which are not defined statically in the Student class. Let's say while loading the file, I find a new column that is not in the database, so I need to add it to the Student class.
So far I have tried this:
from alembic.migration import MigrationContext
from alembic.operations import Operations
from sqlalchemy.exc import OperationalError

op = Operations(MigrationContext.configure(db.engine.connect()))
return_col_list = []
for col_name in headers_list:
    # If the column is not already in the database, then create it.
    if col_name not in COLUMN_HEADER_NAMES:
        lower_col_name = col_name.lower()
        try:
            op.add_column('students', Column(lower_col_name, Text))  # Add the column to the database table.
            Student.__table__.append_column(Column(lower_col_name, Text))  # Append the column to the Table object.
            return_col_list.append(col_name)
        except OperationalError:
            # The column already exists in the table.
            pass
return return_col_list
Then when I loop over each row of my .csv file, I do this to set the data to the attribute.
setattr(new_student, col, val)
However, when I set the data on a given Student object, the data is not reflected in the database. It appears there is no "link" between the column I made and the SQLite table itself. Does anyone know of a better way to do this?
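For context, the kind of loading loop described above would look roughly like this (a sketch only; the file name, the csv.DictReader usage, and the way each Student is constructed are assumptions, not the actual code):

import csv

with open('students.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row_dict in reader:
        new_student = Student()
        for col, val in row_dict.items():
            # Set each CSV column, including dynamically added ones, on the object.
            setattr(new_student, col, val)
        db.session.add(new_student)
db.session.commit()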

Exporting data into CSV file from Flask-SQLAlchemy

I'm looking to generate (export) a CSV from a Flask-SQLAlchemy app I'm developing, but I'm getting some unexpected outcomes in my CSV: instead of the actual data from the MySQL DB table being written to the CSV file, I get the declarative class model entries (placeholders?). The issue could be the way I structured the query, or even the entire function.
Oddly enough, judging from the CSV output (pic), it would seem I'm on the right track, since the row/column count matches the DB table, but the actual data is just not populated. I'm fairly new to SQLAlchemy ORM and Flask, so I'm looking for some guidance here. Constructive feedback appreciated.
# class declaration with DB object (divo)
class pearl(divo.Model):
    __tablename__ = 'users'

    work_id = divo.Column(divo.Integer, primary_key=True)
    user_fname = divo.Column(divo.String(length=255))
    user_lname = divo.Column(divo.String(length=255))
    user_category = divo.Column(divo.String(length=255))
    user_status = divo.Column(divo.String(length=1))
    login_id = divo.Column(divo.String(length=255))
    login_passwd = divo.Column(divo.String(length=255))

# user report function
@app.route("/reports/users")
def users_report():
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
        x15 = pearl.query.all()
        for i in x15:
            # x16 = tuple(x15)
            csv_out = csv.writer(s_key)
            csv_out.writerow(x15)
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))  # return redirect(url_for('other_tasks'))

# csv outcome (see attached pic)
instead of the actual data from the MySQL DB table populated in the csv file, i get the declarative class model entries (placeholders??)
Each object in the list
x15 = pearl.query.all()
represents a row in your users table.
What you're seeing in the spreadsheet are not placeholders, but string representations of each row object (see object.__repr__).
You can get the value of a column for a particular row object via the attribute named after the column, for example:
x15[0].work_id # Assumes there is at least one row object in x15
What you could do instead is something like this:
with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
    csv_out = csv.writer(s_key)
    x15 = divo.session.query(pearl.work_id, pearl.user_fname)  # Add columns to the query as needed
    for i in x15:
        csv_out.writerow(i)
i in the code above is a tuple of the form:
('work_id value', 'user_fname value')
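A slightly fuller sketch along the same lines (it assumes the same pearl model and divo session; the column list and file path are just illustrative). Opening the file with newline='' avoids the blank lines that csv.writer otherwise produces on Windows, and a header row makes the output easier to read:

import csv

columns = ['work_id', 'user_fname', 'user_lname']  # pick whichever columns you need

with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w', newline='') as s_key:
    csv_out = csv.writer(s_key)
    csv_out.writerow(columns)  # header row
    rows = divo.session.query(pearl.work_id, pearl.user_fname, pearl.user_lname)
    for row in rows:
        csv_out.writerow(row)  # each row is a tuple of column values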

How to return specific dictionary keys from within a nested list from a jsonb column in sqlalchemy

I am attempting to return some named columns from a jsonb data set that is stored with PostgreSQL.
I am able to run a raw query that meets my needs directly, however I am trying to run the query utilising SQLAlchemy, in order to ensure that my code is 'pythonic' and easy to read.
The query that returns the correct result (two columns) is:
SELECT
    tmp.item->>'id',
    tmp.item->>'name'
FROM (SELECT jsonb_array_elements(t.data -> 'users') AS item FROM tpeople t) AS tmp
Example json (each user has 20+ columns)
{ "results":247, "users": [
{"id":"202","regdate":"2015-12-01","name":"Bob Testing"},
{"id":"87","regdate":"2014-12-12","name":"Sally Testing"},
{"id":"811", etc etc}
...
]}
The table is simple enough, with a PK, datetime of json extraction, and the jsonb column for the extract
CREATE TABLE tpeople
(
    record_id bigint NOT NULL DEFAULT nextval('"tpeople_record_id_seq"'::regclass),
    scrape_time timestamp without time zone NOT NULL,
    data jsonb NOT NULL,
    CONSTRAINT "tpeople_pkey" PRIMARY KEY (record_id)
);
Additionally I have a People Class that looks as follows:
class people(Base):
    __tablename__ = 'tpeople'

    record_id = Column(BigInteger, primary_key=True,
                       server_default=text("nextval('\"tpeople_record_id_seq\"'::regclass)"))
    scrape_time = Column(DateTime, nullable=False)
    data = Column(JSONB(astext_type=Text()), nullable=False)
Presently my code to return the two columns looks like this:
from db.db_conn import get_session  # generic connector for my db
from model.models import people
from sqlalchemy import func

sess = get_session()
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(sub.c.item).select_entity_from(sub).all()
SQLAlchemy generates the following SQL:
SELECT anon_1.item AS anon_1_item
FROM (SELECT jsonb_array_elements(tpeople.data -> %(data_1)s) AS item
FROM tpeople) AS anon_1
{'data_1': 'users'}
But nothing I seem to do lets me select only certain keys within the item itself, the way I can in the raw SQL. Some of the approaches I have tried are as follows (they all error out):
test = sess.query("sub.item.id").select_entity_from(sub).all()
test = sess.query(sub.item.["id"]).select_entity_from(sub).all()
aas = func.jsonb_to_recordset(people.data["users"])
res = sess.query("id").select_from(aas).all()
sub = select(func.jsonb_array_elements(people.data["users"]).label("item"))
Presently I can extract the columns I need in a simple for loop, but this seems like a hacky way to do it, and I'm sure there is something dead obvious I'm missing.
for row in test:
    print(row.item['id'])
After searching for a few hours, I eventually found someone who had stumbled on this accidentally while trying to get a different result.
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
tmp = sub.c.item.op('->>')('id')
tmp2 = sub.c.item.op('->>')('name')
test = sess.query(tmp, tmp2).all()
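For reference, a minimal end-to-end sketch built on that answer (it assumes the people model and get_session() from the question, and labels the extracted keys so the result rows read naturally):

from sqlalchemy import func

from db.db_conn import get_session  # generic connector from the question
from model.models import people

sess = get_session()

# Expand the 'users' array into one row per user, then pull out two keys as text.
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
rows = sess.query(
    sub.c.item.op('->>')('id').label('id'),
    sub.c.item.op('->>')('name').label('name'),
).all()

for user_id, name in rows:
    print(user_id, name)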

Identify which values in a list don't exist in a table column using SQLAlchemy

I have a list cities = ['Rome', 'Barcelona', 'Budapest', 'Ljubljana']
Then I have a SQLAlchemy model as follows:
class Fly(Base):
    __tablename__ = 'fly'

    pkid = Column('pkid', INTEGER(unsigned=True), primary_key=True, nullable=False)
    city = Column('city', VARCHAR(45), unique=True, nullable=False)
    country = Column('country', VARCHAR(45))
    flight_no = Column('Flight', VARCHAR(45))
I need to check whether ALL the values in the given cities list exist in the fly table, using SQLAlchemy. Return true only if ALL the cities exist in the table. If even a single city is missing, I need to return false along with the list of cities that don't exist. How do I do that? Any ideas/hints/suggestions? I'm using MySQL.
One way would be to create a (temporary) relation based on the given list and take the set difference between it and the cities from the fly table. In other words, create a union of the values from the list [1]:
from sqlalchemy import union, select, literal, exists

cities_union = union(*[select([literal(v)]) for v in cities])
Then take the difference:
sq = cities_union.select().except_(select([Fly.city]))
and check that no rows are left after the difference:
res = session.query(~exists(sq)).scalar()
To get the list of cities missing from the fly table, omit the (NOT) EXISTS and fetch the rows directly:
res = session.execute(sq).fetchall()
[1] Other database vendors may offer alternative methods for producing relations from arrays, such as PostgreSQL and its unnest().
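Putting the pieces together, a minimal sketch under the same assumptions (a configured session and the Fly model above):

from sqlalchemy import union, select, literal, exists

cities = ['Rome', 'Barcelona', 'Budapest', 'Ljubljana']

# Build a relation from the candidate cities and subtract those already in fly.
cities_union = union(*[select([literal(v)]) for v in cities])
missing_sq = cities_union.select().except_(select([Fly.city]))

all_present = session.query(~exists(missing_sq)).scalar()
if not all_present:
    missing = [row[0] for row in session.execute(missing_sq)]
    print("Missing cities:", missing)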

Keep smallest value for each unique ID with arcpy/numpy

I've got an ESRI Point shapefile with (amongst others) an nMSLINK field and a DIAMETER field. The MSLINK is not unique, because of a spatial join. What I want to achieve is to keep only the features in the shapefile that have a unique MSLINK and the smallest DIAMETER value, together with the corresponding values in the other fields. I can use a search cursor to achieve this (looping through all features and removing each feature that does not comply), but this takes ages (> 75,000 features). I was wondering if e.g. numpy could do the trick faster in ArcMap/arcpy.
I think that kind of processing would definitely be a lot faster if you work in memory instead of interacting with ArcGIS, for example by first putting all the rows into a Python object (a namedtuple is probably a good option here). Then you can work out which rows you want to delete or insert.
Which approach is fastest depends on the data: a) if you have many repeated (MSLINK) rows, inserting just the ones you need into a new layer is fastest; b) if the rows to be deleted are only a few compared to the total, deleting them is faster.
For a) you'll need to fetch all fields into the tuple, including the point coordinates, so that you can create a new feature class and insert the new rows.
# Example of Variant a:
import arcpy
from collections import namedtuple

# assuming the following:
source_fc   # contains the name of the feature class
the_path    # contains the path to the shape
cleaned_fc  # the name of the cleaned feature class

# use all fields of source_fc plus the shape token to get a tuple with the xy
# coordinates (using 'mslink' and 'diam' here to simplify the example)
fields = ['mslink', 'diam', 'field3', ... ]
all_fields = fields + ['SHAPE@XY']

# define a namedtuple to hold and work with the rows, using the name 'point' to
# hold the coordinates tuple
Row = namedtuple('Row', fields + ['point'])

data = []
with arcpy.da.SearchCursor(source_fc, all_fields) as sc:
    for r in sc:
        # unpack the values from each row into a new Row (namedtuple) and
        # append it to data
        data.append(Row(*r))

# now just drop the rows we don't want; the easiest way is probably to sort
# the tuples first by MSLINK and then by the diameter...
data = sorted(data, key=lambda x: (x.mslink, x.diam))

# ... and keep only the first one for each mslink
to_keep = []
last_mslink = None
for d in data:
    if last_mslink != d.mslink:
        last_mslink = d.mslink
        to_keep.append(d)

# create a new feature class with the same fields as source_fc
arcpy.CreateFeatureclass_management(
    out_path=the_path, out_name=cleaned_fc, template=source_fc)

with arcpy.da.InsertCursor(cleaned_fc, all_fields) as ic:
    for r in to_keep:
        ic.insertRow(r)
And for alternative b) I would fetch just three fields: a unique ID, MSLINK, and the diameter. Then build a delete list (here you only need the unique IDs), loop through the feature class again, and delete the rows whose ID is on the delete list. Just to be safe, I would duplicate the feature class first and work on a copy.
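A rough sketch of variant b (an illustration, not the answerer's code; the field names nMSLINK and DIAMETER come from the question, and the path is hypothetical):

import arcpy

source_fc = r"C:\data\pipes.shp"  # hypothetical path to a copy of the data

# First pass: remember, per MSLINK, the OID of the row with the smallest diameter.
best = {}  # mslink -> (diameter, oid)
with arcpy.da.SearchCursor(source_fc, ['OID@', 'nMSLINK', 'DIAMETER']) as sc:
    for oid, mslink, diam in sc:
        if mslink not in best or diam < best[mslink][0]:
            best[mslink] = (diam, oid)

keep_oids = {oid for _, oid in best.values()}

# Second pass: delete every row whose OID is not the keeper for its MSLINK.
with arcpy.da.UpdateCursor(source_fc, ['OID@']) as uc:
    for (oid,) in uc:
        if oid not in keep_oids:
            uc.deleteRow()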
There are a few steps you can take to accomplish this task more efficiently. First and foremost, using the data access (arcpy.da) cursors instead of the older cursors will speed up the process; this assumes you are working in 10.1 or later. Then you can employ Summary Statistics, namely its ability to find a minimum value based on a case field. In your case, the case field would be nMSLINK.
The code below first creates a statistics table with all unique nMSLINK values and their corresponding minimum DIAMETER values. I then use Table Select to keep only the rows whose FREQUENCY field is not 1. From there I iterate through the new table and build a list of strings that will make up the final SQL statement. After this iteration, I use Python's join to create an SQL string that looks something like this:
("nMSLINK" = 'value1' AND "DIAMETER" <> 624.0) OR ("nMSLINK" = 'value2' AND "DIAMETER" <> 1302.0) OR ("nMSLINK" = 'value3' AND "DIAMETER" <> 1036.0) ...
The SQL selects rows where the nMSLINK value is not unique and the DIAMETER value is not the minimum. Using this SQL, I select by attributes and delete the selected rows.
The SQL statement is written assuming your feature class is in a file geodatabase and that nMSLINK is a string field and DIAMETER is a numeric field.
The code has the following inputs:
Feature: the feature class to be analyzed
Workspace: a folder that will temporarily store a couple of intermediate tables
TempTableName1: a name for the first temporary table
TempTableName2: a name for the second temporary table
Field1: the non-unique field
Field2: the field with the numeric values you wish to find the lowest of
Code:
# Import modules
from arcpy import *
import os

# Local variables

# Feature class to analyze
Feature = r"C:\E1B8\ScriptTesting\Workspace\Workspace.gdb\testfeatureclass"
# Workspace to export table of identicals
Workspace = r"C:\E1B8\ScriptTesting\Workspace"
# Names of the temp DBF table files
TempTableName1 = "Table1"
TempTableName2 = "Table2"
# Field names
Field1 = "nMSLINK"   # non-unique field
Field2 = "DIAMETER"  # field with numeric values

# Make a layer to allow selection
MakeFeatureLayer_management(Feature, "lyr")

# Path for the first temp table
Table = os.path.join(Workspace, TempTableName1)

# Create a statistics table with the minimum value per case field
Statistics_analysis(Feature, Table, [[Field2, "MIN"]], [Field1])

# SQL to select rows with a frequency not equal to one
sql = '"FREQUENCY" <> 1'

# Path for the second temp table
Table2 = os.path.join(Workspace, TempTableName2)

# Select rows with FREQUENCY not equal to one
TableSelect_analysis(Table, Table2, sql)

# Empty list for the SQL bits
li = []

# Iterate through the second table
cursor = da.SearchCursor(Table2, [Field1, "MIN_" + Field2])
for row in cursor:
    # Add an SQL bit to the list
    sqlbit = '("' + Field1 + '" = \'' + row[0] + '\' AND "' + Field2 + '" <> ' + str(row[1]) + ")"
    li.append(sqlbit)
del row
del cursor

# Create the SQL for the selection of unwanted features
sql = " OR ".join(li)
print sql

# Select based on the SQL
SelectLayerByAttribute_management("lyr", "", sql)

# Delete the selected features
DeleteFeatures_management("lyr")

# Delete temp files
Delete_management("lyr")
Delete_management(Table)
Delete_management(Table2)
This should be quicker than a straight-up cursor. Let me know if this makes sense. Good luck!
