I am using Scrapy to scrape data from a web forum and storing it in a PostgreSQL database with SQLAlchemy. The table and columns are created fine; however, I cannot get SQLAlchemy to create an index on one of the columns. I am trying to create a trigram index (pg_trgm) using GIN.
The PostgreSQL code that would create this index is:
CREATE INDEX description_idx ON table USING gin (description gin_trgm_ops);
The SQLAlchemy code I have added to my models.py file is:
desc_idx = Index('description_idx', text("description gin_trgm_ops"), postgresql_using='gin')
I have added this line to my models.py, but when I check in PostgreSQL, the index is never created.
Below are my full models.py and pipelines.py files. Am I going about this all wrong?
Any help would be greatly appreciated!
models.py:
from sqlalchemy import create_engine, Column, Integer, String, DateTime, Index, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings

DeclarativeBase = declarative_base()

def db_connect():
    return create_engine(URL(**settings.DATABASE))

def create_forum_table(engine):
    DeclarativeBase.metadata.create_all(engine)

class forumDB(DeclarativeBase):
    __tablename__ = "table"

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    desc = Column('description', String, nullable=True)
    desc_idx = Index('description_idx', text("description gin_trgm_ops"), postgresql_using='gin')
pipelines.py:
from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker

from models import forumDB, db_connect, create_forum_table

class ScrapeforumToDB(object):
    def __init__(self):
        engine = db_connect()
        create_forum_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        forumitem = forumDB(**item)
        try:
            session.add(forumitem)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item
The proper way to reference an operator class in SQLAlchemy (such as gin_trgm_ops) is to use the postgresql_ops parameter. This will also allow tools like Alembic to understand how to use it when auto-generating migrations.
Index('description_idx',
'description', postgresql_using='gin',
postgresql_ops={
'description': 'gin_trgm_ops',
})
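Note that gin_trgm_ops only exists once the pg_trgm extension is installed in the target database. As a minimal sketch (assuming the connecting role is allowed to create extensions), the extension can be ensured just before the schema is created with a DDL event listener:
from sqlalchemy import DDL, event

# Ensure pg_trgm exists before tables and indexes are created;
# this is a no-op if the extension is already installed.
event.listen(
    DeclarativeBase.metadata,
    'before_create',
    DDL('CREATE EXTENSION IF NOT EXISTS pg_trgm')
)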
Since the Index definition uses a text() expression, it holds no reference to the Table "table", which was implicitly created by the declarative class forumDB. Compare that to using a Column as the expression, or some derivative of it, like this:
Index('some_index_idx', forumDB.title)
In the above definition the index will know about the table and the other way around.
What this means in your case is that the Table "table" has no idea that such an index exists. Adding it as an attribute of the declarative class is the wrong way to do it; it should be passed to the implicitly created Table instance instead. The attribute __table_args__ exists just for that:
class forumDB(DeclarativeBase):
    __tablename__ = "table"

    # Note: this used to use `text('description gin_trgm_ops')` instead of the
    # `postgresql_ops` parameter, which should be used.
    __table_args__ = (
        Index('description_idx', "description",
              postgresql_ops={"description": "gin_trgm_ops"},
              postgresql_using='gin'),
    )

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    desc = Column('description', String, nullable=True)
With the modification in place, a call to create_forum_table(engine) resulted in:
> \d "table"
Table "public.table"
Column | Type | Modifiers
-------------+-------------------+----------------------------------------------------
id | integer | not null default nextval('table_id_seq'::regclass)
title | character varying |
description | character varying |
Indexes:
"table_pkey" PRIMARY KEY, btree (id)
"description_idx" gin (description gin_trgm_ops)
I have an object:
class Summary():
    __tablename__ = 'employeenames'

    name = Column('employeeName', String(128, collation='utf8_bin'))
    date = Column('dateJoined', Date)
I want to patch Summary with a mock object:
class Summary():
    __tablename__ = 'employeenames'

    name = Column('employeeName', String)
    date = Column('dateJoined', Date)
or just patch the name field to name = Column('employeeName', String).
The reason I'm doing this is that I run my tests in SQLite, and some queries that are meant only for MySQL interfere with my tests.
I think it would be difficult to mock the column. However, you could instead conditionally compile the String type for SQLite, removing the collation.
import sqlalchemy as sa
from sqlalchemy import orm
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.types import String

@compiles(String, 'sqlite')
def compile_varchar(element, compiler, **kw):
    # Drop the collation before rendering the type for SQLite.
    type_expression = kw['type_expression']
    type_expression.type.collation = None
    return compiler.visit_VARCHAR(element, **kw)

Base = orm.declarative_base()

class Summary(Base):
    __tablename__ = 'employeenames'

    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column('employeeName', sa.String(128, collation='utf8_bin'))
    date = sa.Column('dateJoined', sa.Date)

urls = ['mysql:///test', 'sqlite://']

for url in urls:
    engine = sa.create_engine(url, echo=True, future=True)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
This script produces the expected output for MySQL:
CREATE TABLE employeenames (
id INTEGER NOT NULL AUTO_INCREMENT,
`employeeName` VARCHAR(128) COLLATE utf8_bin,
`dateJoined` DATE,
PRIMARY KEY (id)
)
but removes the collation for SQLite:
CREATE TABLE employeenames (
id INTEGER NOT NULL,
"employeeName" VARCHAR(128),
"dateJoined" DATE,
PRIMARY KEY (id)
)
I'm trying to do a large number of inserts with one call, and the way someone here recommended was to give .insert() a list of dictionaries. This is using SQLAlchemy Core.
As an example:
try:
    engine = db.create_engine(f"postgres://user:pass@myip/addressbook", connect_args={'connect_timeout': 5})
    connection = engine.connect()
    metadata = db.MetaData()
except exc.OperationalError:
    print_error(f":: Could not connect to myip!")
    sys.exit()

table_addressbook = db.Table('addressbook', metadata, autoload=True, autoload_with=engine)

list = []
list.append({'firstname': "John", 'lastname': "Doe"})
list.append({'firstname': "Jane", 'lastname': "Doe"})

query = db.insert(table_addressbook).values(list)
connection.execute(query)
But I'm getting an error saying the column id violates a non-null constraint, even though insert normally auto-generates the primary-key id. How do I use this method but specify that id should be auto-generated? Or is there a different method I should use?
Edit
The table name is addressbook.
Column id is of type integer with the default sequence 'untitled_table_id_seq', and its constraint is PRIMARY KEY. This was autogenerated by Postico for Mac, but I've always been able to insert without including id, and it auto-increments from the last inserted ID.
Columns firstname and lastname are of type text, with no default and no constraints.
Without any information on your model and/or connection, it is a bit difficult to answer your question. Please find below a piece of code that uses insert without throwing non-null constraint errors. Hopefully it helps you.
from sqlalchemy import create_engine, Column, Integer, String, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.sql import insert

engine = create_engine('sqlite:///:memory:', echo=True)
Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    firstname = Column(String)
    lastname = Column(String)

Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)
Session.configure(bind=engine)  # once engine is available
session = Session()

new_users = []
new_users.append({'firstname': "John", 'lastname': "Doe"})
new_users.append({'firstname': "Jane", 'lastname': "Doe"})

i = insert(User).values(new_users)
session.execute(i)
session.commit()  # persist the inserted rows
PS: most of this is coming from the tutorial on: https://docs.sqlalchemy.org/en/13/orm/tutorial.html
from sqlalchemy import Column
from sqlalchemy import create_engine
from sqlalchemy import Integer
from sqlalchemy import String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session

engine = create_engine('sqlite:///:memory:', echo=True)
Base = declarative_base()

# Example Model definition for the illustration
class Customer(Base):
    __tablename__ = "customer"

    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    description = Column(String(255))

Base.metadata.create_all(engine)

######################################################
# Bulk insert using dictionaries.
######################################################

# Insert test records into the `customer` table.
def bulk_insert_customers(n):
    session = Session(bind=engine)
    session.bulk_insert_mappings(
        Customer,
        [
            dict(
                name="customer name %d" % i,
                description="customer description %d" % i,
            )
            for i in range(n)
        ],
    )
    session.commit()
Refer to these for more examples of how to do bulk inserts in different ways:
https://docs.sqlalchemy.org/en/13/_modules/examples/performance/bulk_inserts.html
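Since the question uses Core rather than the ORM, here is a minimal Core-only sketch under the same assumptions (the connection URL is the placeholder from the question): pass the list of dictionaries as a separate argument to execute() so it runs as an executemany, leave id out entirely, and the server-side sequence default fills it in.
import sqlalchemy as db

engine = db.create_engine("postgresql://user:pass@myip/addressbook")
metadata = db.MetaData()
table_addressbook = db.Table('addressbook', metadata, autoload_with=engine)

rows = [
    {'firstname': "John", 'lastname': "Doe"},
    {'firstname': "Jane", 'lastname': "Doe"},
]

# engine.begin() commits automatically on success; `id` is never mentioned,
# so the column default (the sequence) generates it on the server.
with engine.begin() as connection:
    connection.execute(db.insert(table_addressbook), rows)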
As per the SQLAlchemy documentation on relationship loading:
When the given collection or reference is first accessed on a particular object, an additional SELECT statement is emitted such that the requested collection is loaded.
How do I achieve loading behavior such that only the single elements of a relationship collection that I access are loaded, rather than the entire collection all at once?
I have heard of deferred column loading; this would be more like "deferred row loading". Rather than deferring loading of attributes, I'd like to defer loading of relationship collection elements.
Desired use case:
# Persist instance.
coln = Collection([1, 2, 3])
session.add(coln)
session.commit()
# Test lazy loading.
print('data' in coln.__dict__)
# Lazy loads the entire collection. I'd like only one element.
print(coln.data[1])
# Will output: "True 3". I'd like: "True 1".
print('data' in coln.__dict__, len(coln.__dict__['data']))
Class definitions and other setup:
from sqlalchemy import Column, Integer, ForeignKey
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()
engine = create_engine('sqlite:///:memory:')

# Define classes.
class Collection(Base):
    __tablename__ = 'collection'

    id = Column(Integer, primary_key=True)
    data = relationship('Element')

    def __init__(self, list_):
        self.data = [Element(e) for e in list_]

class Element(Base):
    __tablename__ = 'element'

    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey('collection.id'))
    value = Column(Integer)

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return 'Element({})'.format(self.value)

# Create schema.
Base.metadata.create_all(engine)

# Create session.
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
session = Session()
Use the lazy parameter with the value 'dynamic':
data = relationship('Element', lazy='dynamic')
https://docs.sqlalchemy.org/en/13/orm/collections.html#dynamic-relationship
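With lazy='dynamic', coln.data is no longer a plain list but a query object scoped to this collection's elements, so single elements can be fetched without loading everything. A small sketch under that assumption:
# Each access issues its own narrow SELECT instead of loading the
# whole collection into memory.
first = coln.data.filter(Element.value == 1).first()  # fetches one row
count = coln.data.count()                             # server-side COUNT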
I'm making a web service that sends specific tables in JSON.
I use SQLAlchemy to communicate with the database.
I'd like to retrieve just the columns the user has the right to see.
Is there a way to tell SQLAlchemy not to retrieve some columns?
It's not valid SQL, but something like this:
SELECT * EXCEPT column1 FROM table;
I know it is possible to specify just some columns in the SELECT statement, but that's not exactly what I want, because I don't know all the table columns. I just want all the columns but some.
I also tried to get all the columns and delete the column attribute I don't want, like this:
result = db_session.query(Table).all()
for row in result:
    row.__delattr__(column1)
but it seems SQLAlchemy doesn't allow this.
I get the warning:
Warning: Column 'column1' cannot be null
cursor.execute(statement, parameters)
What would be the most optimized way to do it?
Thank you
You can pass in all columns in the table, except the ones you don't want, to the query method.
session.query(*[c for c in User.__table__.c if c.name != 'password'])
Here is a runnable example:
#!/usr/bin/env python
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)

    def __init__(self, name, fullname, password):
        self.name = name
        self.fullname = fullname
        self.password = password

    def __repr__(self):
        return "<User('%s','%s', '%s')>" % (self.name, self.fullname, self.password)

engine = create_engine('sqlite:///:memory:', echo=True)
Base.metadata.create_all(engine)
session = Session(bind=engine)

ed_user = User('ed', 'Ed Jones', 'edspassword')
session.add(ed_user)
session.commit()

result = session.query(*[c for c in User.__table__.c if c.name != 'password']).all()
print(result)
You can make the column a deferred column. This feature allows particular columns of a table to be loaded only upon direct access, instead of when the entity is queried with Query.
See Deferred Column Loading
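A minimal sketch of that approach, reusing the User model from the previous answer; deferred() is the documented wrapper, while the model itself is illustrative:
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import deferred

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    # Not loaded with the entity; a separate SELECT runs only when accessed.
    password = deferred(Column(String))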
This worked for me:
users = db.query(models.User).filter(models.User.email != current_user.email).all()
return users
Hmm, is there any reason why SA tries to add Nones for varchar columns that have defaults set in the database schema? It doesn't do that for floats or ints (I'm using reflection).
So when I try to add a new row, like:
u = User()
u.foo = 'a'
u.bar = 'b'
SA issues a query that has a lot more columns, with None values assigned to them, and the database obviously barfs and doesn't perform default substitution.
What version do you use, and what is the actual code? Below is sample code showing that the server_default parameter works fine for string fields:
from sqlalchemy import *
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

metadata = MetaData()
Base = declarative_base(metadata=metadata)

class Item(Base):
    __tablename__ = "items"

    id = Column(String, primary_key=True)
    int_val = Column(Integer, nullable=False, server_default='123')
    str_val = Column(String, nullable=False, server_default='abc')

engine = create_engine('sqlite://', echo=True)
metadata.create_all(engine)
session = sessionmaker(engine)()

item = Item(id='foo')
session.add(item)
session.commit()

print(item.int_val, item.str_val)
The output is:
<...>
<...> INSERT INTO items (id) VALUES (?)
<...> ['foo']
<...>
123 abc
I've found it's a bug in SA: this happens only for string fields, which don't get the server_default property for some unknown reason. I've already filed a ticket for it.