Dynamically Adding a Column to a SQLAlchemy Table - python

I have a SQLAlchemy Table which is defined as:
class Student(db.Model):
    id = db.Column(db.Integer(), primary_key=True)
    name = db.Column(db.Text(), nullable=False)
I am loading this table with data from a .csv file, and I may have columns which are not defined statically in the Student class. Let's say while loading the file, I find a new column that is not in the database, so I need to add it to the Student class.
So far I have tried this:
from alembic.migration import MigrationContext
from alembic.operations import Operations
from sqlalchemy import Column, Text
from sqlalchemy.exc import OperationalError

op = Operations(MigrationContext.configure(db.engine.connect()))
return_col_list = []
for col_name in headers_list:
    # If the column is not already in the database, then create it.
    if col_name not in COLUMN_HEADER_NAMES:
        lower_col_name = col_name.lower()
        try:
            op.add_column('students', Column(lower_col_name, Text))  # Add the column to the database table.
            Student.__table__.append_column(Column(lower_col_name, Text))  # Append the column to the Table object.
            return_col_list.append(col_name)
        except OperationalError:
            # The column already exists in the table.
            pass
return return_col_list
Then, when I loop over each row of my .csv file, I do this to set the data on the attribute:
setattr(new_student, col, val)
However, when I set the data on a given Student object, the data is not reflected in the database. It appears there is no "link" between the column I made and the SQLite table itself. Does anyone know of a better way to do this?
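One thing worth noting about the approach above: appending a Column to Student.__table__ does not give the mapped class an instrumented attribute, so setattr(new_student, col, val) only sets a plain Python attribute that the ORM never flushes. A minimal sketch of one possible fix, assuming a Flask-SQLAlchemy declarative class like Student above (untested, names taken from the question):
from sqlalchemy import Column, Text

def add_dynamic_column(col_name):
    lower_col_name = col_name.lower()
    # Issue the ALTER TABLE via Alembic as before.
    op.add_column('students', Column(lower_col_name, Text))
    # Assign the new Column to the declarative class rather than only to
    # __table__; declarative intercepts the assignment and adds both the
    # table column and the mapped attribute, so the ORM can persist it.
    setattr(Student, lower_col_name, Column(lower_col_name, Text))
With the attribute actually mapped, setting it on a Student instance and committing should include the new column in the INSERT.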

Related

Exporting data into CSV file from Flask-SQLAlchemy

I'm looking to generate (export) a CSV from a Flask-SQLAlchemy app I'm developing, but I'm getting some unexpected outcomes in my CSV: instead of the actual data from the MySQL DB table being populated in the CSV file, I get the declarative class model entries (placeholders??). The issue could possibly be the way I structured the query, or even the entire function.
Oddly enough, judging from the CSV output (pic), it would seem I'm on the right track, since the row/column count is the same as the DB table, but the actual data is just not populated. I'm fairly new to SQLAlchemy ORM and Flask, so I'm looking for some guidance here to pull through. Constructive feedback appreciated.
# class declaration with DB object (divo)
class pearl(divo.Model):
    __tablename__ = 'users'

    work_id = divo.Column(divo.Integer, primary_key=True)
    user_fname = divo.Column(divo.String(length=255))
    user_lname = divo.Column(divo.String(length=255))
    user_category = divo.Column(divo.String(length=255))
    user_status = divo.Column(divo.String(length=1))
    login_id = divo.Column(divo.String(length=255))
    login_passwd = divo.Column(divo.String(length=255))
# user report function
@app.route("/reports/users")
def users_report():
    with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
        x15 = pearl.query.all()
        for i in x15:
            # x16 = tuple(x15)
            csv_out = csv.writer(s_key)
            csv_out.writerow(x15)
    flash("Report generated. Please check designated repository.", "green")
    return redirect(url_for('reports_landing'))  # return redirect(url_for('other_tasks'))
#csv outcome (see attached pic)
"instead of the actual data from the MySQL DB table being populated in the CSV file, I get the declarative class model entries (placeholders??)"
Each object in the list
x15 = pearl.query.all()
represents a row in your users table.
What you're seeing in the spreadsheet are not placeholders, but the string representation of each row object (see object.__repr__).
You could get the value of a column for a particular row object by the column name attribute, for example:
x15[0].work_id # Assumes there is at least one row object in x15
What you could do instead is something like this:
with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w') as s_key:
    x15 = divo.session.query(pearl.work_id, pearl.user_fname)  # Add columns to the query as needed
    for i in x15:
        csv_out = csv.writer(s_key)
        csv_out.writerow(i)
i in the code above is a tuple of the form:
('work_id value', 'user_fname value')
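A slightly fuller sketch of the same idea (my own variation, not part of the original answer): create the writer once, write a header row, and open the file with newline='' as the csv module docs recommend. It assumes the pearl model and the questioner's output path shown above.
import csv

with open(r'C:\Users\Xxxxxxx\Projects\_repository\zzz.csv', 'w', newline='') as s_key:
    csv_out = csv.writer(s_key)
    csv_out.writerow(['work_id', 'user_fname'])  # header row
    rows = divo.session.query(pearl.work_id, pearl.user_fname)
    for row in rows:
        csv_out.writerow(row)  # each row is a (work_id, user_fname) tuple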

Efficient Update of All Rows in SQLAlchemy

I'm currently trying to update all rows in a table, by setting the value of one column images based on the value of another column source_images in the same row.
for row in Cars.query.all():
    # Generate a list of strings
    images = [s for s in row.source_images if 'example.com/' in s]
    # Column 'images' being updated is a sqlalchemy.dialects.postgresql.ARRAY(Text())
    Cars.query.filter(Cars.id == row.id).update({'images': images})
db_session.commit()
Problem: This appears to be really slow, especially when applied to 100k+ rows. Is there a more efficient way of updating the rows?
Similar questions:
#1: This question involves updating all the rows by incrementing the value.
Model Class Definition: cars.py
from sqlalchemy import *
from sqlalchemy.dialects import postgresql

from ..Base import Base


class Car(Base):
    __tablename__ = 'cars'

    id = Column(Integer, primary_key=True)
    images = Column(postgresql.ARRAY(Text))
    source_images = Column(postgresql.ARRAY(Text))
You could take the operation to the DB, instead of fetching and updating each row separately:
from sqlalchemy import select, column, func

source_images = select([column('i')]).\
    select_from(func.unnest(Car.source_images).alias('i')).\
    where(column('i').contains('example.com/'))
source_images = func.array(source_images)

Car.query.update({Car.images: source_images},
                 synchronize_session=False)
The correlated subquery unnests the source images, selects those matching the criteria, and the ARRAY() constructor forms the new image array.
Alternatively you could use array_agg():
source_images = select([func.array_agg(column('i'))]).\
    select_from(func.unnest(Car.source_images).alias('i')).\
    where(column('i').contains('example.com/'))
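The array_agg() variant is applied the same way as before; a short sketch under that assumption (the explicit as_scalar() call and the commit are my additions, keeping to the 1.x-style select([...]) used in the rest of the answer):
# Apply it exactly as with the ARRAY() constructor version.
Car.query.update({Car.images: source_images.as_scalar()},
                 synchronize_session=False)
db_session.commit()
Note that array_agg() returns NULL when no elements match, whereas the ARRAY(subquery) constructor yields an empty array, so the two forms differ for rows with no matching source images.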

How to return specific dictionary keys from within a nested list from a jsonb column in sqlalchemy

I am attempting to return some named columns from a jsonb data set that is stored with PostgreSQL.
I am able to run a raw query that meets my needs directly, however I am trying to run the query utilising SQLAlchemy, in order to ensure that my code is 'pythonic' and easy to read.
The query that returns the correct result (two columns) is:
SELECT
    tmp.item->>'id',
    tmp.item->>'name'
FROM (SELECT jsonb_array_elements(t.data -> 'users') AS item FROM tpeople t) AS tmp
Example json (each user has 20+ columns)
{ "results":247, "users": [
{"id":"202","regdate":"2015-12-01","name":"Bob Testing"},
{"id":"87","regdate":"2014-12-12","name":"Sally Testing"},
{"id":"811", etc etc}
...
]}
The table is simple enough, with a PK, datetime of json extraction, and the jsonb column for the extract
CREATE TABLE tpeople
(
    record_id bigint NOT NULL DEFAULT nextval('"tpeople_record_id_seq"'::regclass),
        -- sequence: INCREMENT 1 START 1 MINVALUE 1 MAXVALUE 9223372036854775807 CACHE 1
    scrape_time timestamp without time zone NOT NULL,
    data jsonb NOT NULL,
    CONSTRAINT "tpeople_pkey" PRIMARY KEY (record_id)
);
Additionally I have a People Class that looks as follows:
class people(Base):
    __tablename__ = 'tpeople'

    record_id = Column(BigInteger, primary_key=True,
                       server_default=text("nextval('\"tpeople_record_id_seq\"'::regclass)"))
    scrape_time = Column(DateTime, nullable=False)
    data = Column(JSONB(astext_type=Text()), nullable=False)
Presently my code to return the two columns looks like this:
from db.db_conn import get_session  # Generic connector for my db
from model.models import people
from sqlalchemy import func

sess = get_session()

sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(sub.c.item).select_entity_from(sub).all()
SQLAlchemy generates the following SQL:
SELECT anon_1.item AS anon_1_item
FROM (SELECT jsonb_array_elements(tpeople.data -> %(data_1)s) AS item
FROM tpeople) AS anon_1
{'data_1': 'users'}
But nothing I do seems to let me get only certain keys within the item itself, the way the raw SQL does. Some of the approaches I have tried are as follows (they all error out):
test = sess.query("sub.item.id").select_entity_from(sub).all()
test = sess.query(sub.item.["id"]).select_entity_from(sub).all()
aas = func.jsonb_to_recordset(people.data["users"])
res = sess.query("id").select_from(aas).all()
sub = select(func.jsonb_array_elements(people.data["users"]).label("item"))
Presently I can extract the columns I need in a simple for loop, but this seems like a hacky way to do it, and I'm sure there is something dead obvious I'm missing.
for row in test:
    print(row.item['id'])
After searching for a few hours I eventually found someone who arrived at this accidentally while trying to get a different result.
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
tmp = sub.c.item.op('->>')('id')
tmp2 = sub.c.item.op('->>')('name')
test = sess.query(tmp, tmp2).all()
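As a side note, the same ->> behaviour can usually be spelled without the raw .op('->>') by telling SQLAlchemy that the subquery column is JSONB; a sketch of that idea (the type_coerce call is my assumption, since func.jsonb_array_elements() returns an untyped expression):
from sqlalchemy import type_coerce
from sqlalchemy.dialects.postgresql import JSONB

sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
item = type_coerce(sub.c.item, JSONB)
# item['id'].astext renders as item ->> 'id', matching the raw SQL above.
test = sess.query(item['id'].astext.label("id"),
                  item['name'].astext.label("name")).all()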

Bulk Upsert with SQLAlchemy Postgres

I'm following the SQLAlchemy documentation here to write a bulk upsert statement with Postgres. For demonstration purposes, I have a simple table MyTable:
class MyTable(base):
    __tablename__ = 'mytable'

    id = Column(types.Integer, primary_key=True)
    test_value = Column(types.Text)
Creating a generic insert statement is simple enough:
from sqlalchemy.dialects import postgresql
values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)
The problem I run into is when I try to add the "on conflict" part of the upsert.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
Trying to execute this statement yields a ProgrammingError:
from sqlalchemy import create_engine
engine = create_engine('postgres://localhost/db_name')
engine.execute(update_stmt)
>>> ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'
I think my misunderstanding is in constructing the statement with the on_conflict_do_update method. Does anyone know how to construct this statement? I have looked at other questions on StackOverflow (eg. here) but I can't seem to find a way to address the above error.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
index_elements should be either a list of strings or a list of column objects, so either [MyTable.id] or ['id'] works (this part is correct).
set_ should be a dictionary with column names as keys and valid SQL update expressions as values. You can reference values from the insert block using the excluded attribute, so to get the result you are hoping for here you would want set_={'test_value': insert_stmt.excluded.test_value}. (The mistake is that data= in the documentation example isn't a magic argument; it was the name of the column on their example table.)
So, the whole thing would be
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_={'test_value': insert_stmt.excluded.test_value}
)
Of course, in a real-world example I usually want to change more than one column. In that case I would do something like...
update_columns = {col.name: col for col in insert_stmt.excluded if col.name not in ('id', 'datetime_created')}
update_statement = insert_stmt.on_conflict_do_update(index_elements=['id'], set_=update_columns)
(This example would overwrite every column except for the id and datetime_created columns)
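To actually run the finished statement, executing it on a connection works as usual; a small sketch assuming the same engine URL as in the question (using a connection-level execute inside a transaction rather than engine.execute):
from sqlalchemy import create_engine

engine = create_engine('postgres://localhost/db_name')

# engine.begin() opens a transaction and commits it on success.
with engine.begin() as conn:
    conn.execute(update_statement)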

Postgresql: Insert from huge csv file, collect the ids and respect unique constraints

In a postgresql database:
class Persons(models.Model):
    person_name = models.CharField(max_length=10, unique=True)
The persons.csv file, contains 1 million names.
$cat persons.csv
Name-1
Name-2
...
Name-1000000
I want to:
Create the names that do not already exist
Query the database and fetch the id for each name contained in the csv file.
My approach:
Use the COPY command or the django-postgres-copy application that implements it.
Also take advantage of the new Postgresql-9.5+ upsert feature.
Now, all the names in the csv file, are also in the database.
I need to get their ids from the database, either in memory or in another csv file, in an efficient way:
Use Q objects
list_of_million_q = <iterate csv and append Qs>
million_names = Names.objects.filter(list_of_million_q)
or
Use __in to filter based on a list of names:
list_of_million_names = <iterate csv and append strings>
million_names = Names.objects.filter(
    person_name__in=list_of_million_names
)
or
?
I do not feel that any of the above approaches for fetching the ids is efficient.
Update
There is a third option, along the lines of this post that should be a great solution which combines all the above.
Something like:
SELECT * FROM persons;
make a name: id dictionary out of the names received from the database:
db_dict = {'Harry': 1, 'Bob': 2, ...}
Query the dictionary:
ids = []
for name in list_of_million_names:
    if name in db_dict:
        ids.append(db_dict[name])
This way you're using the quick dictionary indexing as opposed to the slower if x in list approach.
But the only way to really know for sure is to benchmark these 3 approaches.
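A compact way to build that dictionary with the Django ORM, sketched under the assumption that the model is the Persons class above (values_list keeps the query to the two columns needed instead of full model instances):
# Fetch only (person_name, id) pairs and turn them into a lookup dict.
db_dict = dict(Persons.objects.values_list('person_name', 'id'))

# The lookup loop then collapses to a comprehension.
ids = [db_dict[name] for name in list_of_million_names if name in db_dict]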
This post describes how to use RETURNING with ON CONFLICT so that, while the contents of the csv file are being inserted into the database, the ids are saved in another table, both when an insertion succeeds and when, due to the unique constraint, the insertion is skipped.
I have tested it on SQL Fiddle, using a setup that resembles the one used for the COPY command, which inserts into the database straight from a csv file while respecting the unique constraint.
The schema:
CREATE TABLE IF NOT EXISTS label (
    id serial PRIMARY KEY,
    label_name varchar(200) NOT NULL UNIQUE
);

INSERT INTO label (label_name) VALUES
    ('Name-1'),
    ('Name-2');

CREATE TABLE IF NOT EXISTS ids (
    id serial PRIMARY KEY,
    label_ids varchar(12) NOT NULL
);
The script:
CREATE TEMP TABLE tmp_table
    (LIKE label INCLUDING DEFAULTS)
    ON COMMIT DROP;

INSERT INTO tmp_table (label_name) VALUES
    ('Name-2'),
    ('Name-3');

WITH ins AS (
    INSERT INTO label
    SELECT *
    FROM tmp_table
    ON CONFLICT (label_name) DO NOTHING
    RETURNING id
)
INSERT INTO ids (label_ids)
SELECT id FROM ins
UNION ALL
SELECT l.id
FROM tmp_table
JOIN label l USING (label_name);
The output:
SELECT * FROM ids;
SELECT * FROM label;
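For completeness, one way to feed the real persons.csv into the temp table from Python is psycopg2's copy_expert(), mirroring the COPY approach the question mentions; a sketch under the assumption that the table and column names match the example schema above (the connection string is hypothetical):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TEMP TABLE tmp_table
            (LIKE label INCLUDING DEFAULTS)
            ON COMMIT DROP;
    """)
    with open('persons.csv') as f:
        # Stream the csv straight into the temp table.
        cur.copy_expert("COPY tmp_table (label_name) FROM STDIN WITH (FORMAT csv)", f)
    # ...then run the WITH ins AS (...) INSERT shown above on the same cursor,
    # inside the same transaction, so the ids end up in the ids table.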
