My Scrapy-based spider scrapes a URL (which is updated every minute with new items) and saves the news list items to a database. The list is updated every hour, and I am trying to avoid adding duplicates of these news items by using a "class DuplicatesPipeline(object):" in my pipelines.py.
My script does save news items into the db, however it still saves duplicates.
DuplicatesPipeline is probably the wrong way to go, since it does not seem to check against existing records in the database; it only checks against duplicates seen in the current session.
Very thankful for your help
Model:
class Listitem(DeclarativeBase):
    """Sqlalchemy deals model"""
    __tablename__ = "newsitems"

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    description = Column('description', String, nullable=True)
    link = Column('link', String, nullable=True)
    date = Column('date', String, nullable=True)
Pipelines.py:
from sqlalchemy.orm import sessionmaker
from models import Presstv, db_connect, create_presstv_table
from scrapy import signals
from scrapy.exceptions import DropItem


class PressTvPipeline(object):
    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates deals table.
        """
        engine = db_connect()
        create_presstv_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save deals in the database.

        This method is called for every item pipeline component.
        """
        session = self.Session()
        deal = Presstv(**item)

        try:
            session.add(deal)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item
I think you should:
add a UNIQUE constraint on link in your database
check something like session.query(Listitem).filter_by(link=item['link']).first() is None in the pipeline before adding the item
Your DuplicatesPipeline mechanism can still be used as an optimization: if you find a copy in your in-memory cache, there is no need to run the query; only query the database when the cache has no copy.
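A minimal sketch of how the two suggestions could look together, assuming the Listitem model above and dict-like Scrapy items (the DeduplicatingPipeline name and the reuse of db_connect/create_presstv_table are illustrative, not from the question):

# In the model: make link unique so the database itself also rejects duplicates.
link = Column('link', String, unique=True)

# In pipelines.py -- a sketch, not the original PressTvPipeline:
from sqlalchemy.orm import sessionmaker
from models import Listitem, db_connect, create_presstv_table


class DeduplicatingPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_presstv_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        try:
            # Query for an existing row with the same link before inserting.
            if session.query(Listitem).filter_by(link=item['link']).first() is None:
                session.add(Listitem(**item))
                session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
        return item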
Apparently use of ORM and Core in tandem is possible, but I haven't been able to find any solid explanation of a strategy for this.
Here's the use case class:
class DataHolder(Base):
    __tablename__ = 'data_holder'

    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, ForeignKey('data_set.id'))
    name = Column(String)

    _dataset_table = Table('data_set', Base.metadata,
        Column('id', Integer, primary_key=True),
    )
    _datarows_table = Table('data_rows', Base.metadata,
        Column('id', Integer, primary_key=True),
        Column('dataset_id', None, ForeignKey('data_set.id')),
        Column('row', Integer),
        Column('col_0', Integer),
        Column('col_1', Integer),
        Column('col_2', Integer),
    )

    def __init__(self, name=None, data=None):
        self.name = name
        self.data = data

    def _pack_info(self):
        # Return __class__ and other info needed for packing.
        ...

    def _unpack_info(self):
        # Return info needed for unpacking.
        ...
name should be persisted via the ORM. data, which would be a large NumPy array (or similar type), should be persisted via the Core.
There is a go-between table 'data_set' that exists for the purpose of a many-to-one relationship between DataHolder and the data. This allows data sets to exist independently within some library. (The sole purpose of this table is to generate IDs for new data sets.)
Actual persistence would be accomplished through a class that implements some listeners, such as the following.
from contextlib import contextmanager

from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker, Session


class PersistenceManager:
    def __init__(self):
        self.init_db()
        self.init_listeners()

    def init_db(self):
        engine = create_engine('sqlite:///path/to/database.db')
        self.sa_engine = engine
        self.sa_sessionmaker = sessionmaker(bind=engine)
        Base.metadata.create_all(engine)

    def init_listeners(self):
        @event.listens_for(Session, 'transient_to_pending')
        def pack_data(session, instance):
            try:
                pack_info = instance._pack_info()
                # Use Core to execute INSERT for bulky data.
            except AttributeError:
                pass

        @event.listens_for(Session, 'loaded_as_persistent')
        def unpack_data(session, instance):
            try:
                unpack_info = instance._unpack_info()
                # Use Core to execute SELECT for bulky data.
            except AttributeError:
                pass

    def persist(self, obj):
        with self.session_scope() as session:
            session.add(obj)

    def load(self, class_, spec):
        with self.session_scope() as session:
            obj = session.query(class_).filter_by(**spec).all()[-1]
        return obj

    @contextmanager
    def session_scope(self):
        session = self.sa_sessionmaker()
        try:
            yield session
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()
The idea is that whenever a DataHolder is persisted, its data is also persisted at the same (or nearly the same) time.
Listening for 'transient_to_pending' (for "packing") and 'loaded_as_persistent' (for "unpacking") events will work for simple saving and loading. However, it seems care should be taken to also listen for the 'pending_to_transient' event. In the case of a rollback, the data added via Core will not be pulled back out of the database in the same way the ORM-related data will.
Is there another, better way to handle this besides listening for 'pending_to_transient'? That approach could cause problems when two different DataHolders reference the same data set: one DataHolder could roll back, removing the data set from the database so that the other DataHolder can no longer use it.
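One suggestion of my own (not something from the post): if the Core statements are executed through Session.execute() instead of directly on the engine, they run on the session's connection and join its transaction, so a single rollback undoes the ORM rows and the bulky-data rows together and no 'pending_to_transient' cleanup is needed. A rough sketch with illustrative values:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///path/to/database.db')
Session = sessionmaker(bind=engine)

holder = DataHolder(name='example', data=None)
rows = [{'dataset_id': 1, 'row': 0, 'col_0': 1, 'col_1': 2, 'col_2': 3}]  # illustrative

session = Session()
try:
    session.add(holder)                                          # ORM insert
    session.execute(DataHolder._datarows_table.insert(), rows)   # Core insert, same transaction
    session.commit()     # commits both ...
except Exception:
    session.rollback()   # ... or rolls back both together
    raise
finally:
    session.close()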
I'm running into something and was hoping to get some ideas/help.
I have a database with a tree structure where a leaf can participate in several parents as a foreign key. The typical example is a city, which belongs to a country and to a continent. Needless to say, countries and continents should not be duplicated, so before adding another city I need to look the object up in the DB. If it doesn't exist I have to create it; but if, for instance, the country doesn't exist yet, then I have to check for the continent, and if that doesn't exist either I have to create it as well.
So far I have gotten away with creating a whole bunch of items when I run everything from a single file, but the story changes once I push the SQLAlchemy code into a module. For some reason the metadata scope becomes limited, and if a table doesn't exist yet the code starts throwing ProgrammingError exceptions when I query for the foreign key's presence (from the city to the country). I intercept that exception, and in the __init__ constructor of the class I am looking for (the country) I check whether the table exists and create it if it doesn't. There are two things I have a problem with and need advice on:
1) Verifying the table is inefficient: I work with the Base.metadata.sorted_tables list, which I have to loop through to figure out whether a table matches my class's __tablename__. Such as:
for table in Base.metadata.sorted_tables:
    # Find the right table in the list of tables
    if table.name == self.__tablename__:
        if __DEBUG__:
            print 'DEBUG: Found table {} that equal to the class table {}'.format(table.name, self.__tablename__)
        if not table.exists():
            table.create(session.get_bind())
Needless to say, this takes time; I am looking for a more efficient way to do the same.
2) The second issue is with inheriting from the declarative base (declarative_base()) with respect to OOP in Python. I want to remove some of the code repetition and pull it into one class from which the other classes will be derived. For instance, the code above can be taken out into a separate method, giving something like this:
Base = declarative_base()


class OnDemandTables(Base):
    __tablename__ = 'no_table'
    # id = Column(Integer, Sequence('id'), nullable=False, unique=True, primary_key=True, autoincrement=True)

    def create_my_table(self, session):
        if __DEBUG__:
            print 'DEBUG: Creating tables for the class {}'.format(self.__class__)
            print 'DEBUG: Base.metadata.sorted_tables exists returns {}'.format(Base.metadata.sorted_tables)
        for table in Base.metadata.sorted_tables:
            # Find the right table in the list of tables
            if table.name == self.__tablename__:
                if __DEBUG__:
                    print 'DEBUG: Found table {} that equal to the class table {}'.format(table.name, self.__tablename__)
                if not table.exists():
                    table.create(session.get_bind())


class Continent(OnDemandTables):
    __tablename__ = 'continent'

    id = Column(Integer, Sequence('id'), nullable=False, unique=True, primary_key=True, autoincrement=True)
    name = Column(String(64), unique=True, nullable=False)

    def __init__(self, session, continent_description):
        if type(continent_description) != dict:
            raise AttributeError('Continent should be described by the dictionary!')
        else:
            self.create_my_table(session)
            if 'continent' not in continent_description:
                raise ReferenceError('No continent can be created without a name! Dictionary is {}'.
                                     format(continent_description))
            else:
                self.name = continent_description['continent']
                print 'DEBUG: Continent name is {} '.format(self.name)
The problem here is that the declarative metadata tries to link the unrelated classes together and requires __tablename__ and some primary key column to be present in the parent OnDemandTables class, which doesn't make any sense to me.
Any ideas?
Cheers
I wanted to post the solution here for the rest of the gang to keep in mind. Apparently, SQLAlchemy doesn't see the classes in a module if they are not actually used, so to speak. After a couple of days of trying to work around things, the simplest solution I found was to do it in a semi-manual way: don't rely on the ORM to construct and build up the database for you, but rather handle that part with class methods. The code is:
__DEBUG__ = True

from sqlalchemy import String, Integer, Column, ForeignKey, BigInteger, Float, Boolean, Sequence
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from sqlalchemy.orm.exc import MultipleResultsFound, NoResultFound
from sqlalchemy.exc import ProgrammingError
from sqlalchemy import create_engine, schema
from sqlalchemy.orm import sessionmaker

Base = declarative_base()
engine = create_engine("mysql://test:test123@localhost/test", echo=True)
Session = sessionmaker(bind=engine, autoflush=False)
session = Session()
schema.MetaData.bind = engine


class TemplateBase(object):
    __tablename__ = None

    @classmethod
    def create_table(cls, session):
        if __DEBUG__:
            print 'DEBUG: Creating tables for the class {}'.format(cls.__class__)
            print 'DEBUG: Base.metadata.sorted_tables exists returns {}'.format(Base.metadata.sorted_tables)
        for table in Base.metadata.sorted_tables:
            # Find the right table in the list of tables
            if table.name == cls.__tablename__:
                if __DEBUG__:
                    print 'DEBUG: Found table {} that equal to the class table {}'.format(table.name, cls.__tablename__)
                if not table.exists():
                    if __DEBUG__:
                        print 'DEBUG: Session is {}, engine is {}, table is {}'.format(session, session.get_bind(), dir(table))
                    table.create()

    @classmethod
    def is_provisioned(cls):
        for table in Base.metadata.sorted_tables:
            # Find the right table in the list of tables
            if table.name == cls.__tablename__:
                if __DEBUG__:
                    print 'DEBUG: Found table {} that equal to the class table {}'.format(table.name, cls.__tablename__)
                return table.exists()


class Continent(Base, TemplateBase):
    __tablename__ = 'continent'

    id = Column(Integer, Sequence('id'), nullable=False, unique=True, primary_key=True, autoincrement=True)
    name = Column(String(64), unique=True, nullable=False)

    def __init__(self, session, provision, continent_description):
        if type(continent_description) != dict:
            raise AttributeError('Continent should be described by the dictionary!')
        else:
            if 'continent' not in continent_description:
                raise ReferenceError('No continent can be created without a name! Dictionary is {}'.
                                     format(continent_description))
            else:
                self.name = continent_description['continent']
                if __DEBUG__:
                    print 'DEBUG: Continent name is {} '.format(self.name)
This gives the following:
1. The class methods is_provisioned and create_table can be called during initial startup and reflect the database state.
2. The inheritance comes from a second class that holds these methods and does not interfere with the ORM classes, hence it is not linked into the metadata.
Since the result of the Base.metadata.sorted_tables loop is just the class's own table, the code can be optimized even further by removing the loop, as sketched below. The next step would be to organize the classes whose tables need to be checked and possibly created into a list, keeping their linkages in mind, and then loop through them using the is_provisioned and, if necessary, create_table methods.
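A minimal sketch of that optimization (my own illustration, assuming the TemplateBase above): Base.metadata.tables is a dictionary keyed by table name, so the table can be looked up directly, and checkfirst=True turns the create into a no-op when the table already exists.

class TemplateBase(object):
    __tablename__ = None

    @classmethod
    def create_table(cls, session):
        # Direct dictionary lookup instead of scanning sorted_tables.
        table = Base.metadata.tables[cls.__tablename__]
        # checkfirst=True only issues CREATE TABLE when the table is missing.
        table.create(session.get_bind(), checkfirst=True)

    @classmethod
    def is_provisioned(cls):
        return Base.metadata.tables[cls.__tablename__].exists(engine)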
Hope it helps the others.
Regards
I am using Flask (Python) and SQLAlchemy to connect to a huge database where a lot of stats are saved. I need to create some useful insights from these stats, so I only need to read/get the data and never modify it.
The issue I have now is the following:
Before I can access a table I need to replicate it in my models file. For example, I see the table Login_Data in the DB, so I go into my models and recreate the exact same table:
class Login_Data(Base):
    __tablename__ = 'login_data'

    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    new_users = Column(Integer, nullable=True)

    def __init__(self, date=None, new_users=None):
        self.date = date
        self.new_users = new_users

    def get(self, id):
        if self.id == id:
            return self
        else:
            return None

    def __repr__(self):
        return '<%s(%r, %r, %r)>' % (self.__class__.__name__, self.id, self.date, self.new_users)
I do this because otherwise I can't query it using:
some_data = Login_Data.query.limit(10)
But this feels unnecessary; there must be a better way. What's the point of recreating the models if they are already defined? What should I use here:
some_data = [SOMETHING HERE SO I DONT NEED TO RECREATE THE TABLE].query.limit(10)
Simple question but I have not found a solution yet.
Thanks to Tryph for the right sources.
To access the data of an existing DB with SQLAlchemy you need to use automap. In your configuration file, where you load/declare your DB type, you need to use automap_base(). After that you can create your models and use the correct table names of the DB without specifying everything yourself:
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import stats_config
Base = automap_base()
engine = create_engine(stats_config.DB_URI, convert_unicode=True)
# reflect the tables
Base.prepare(engine, reflect=True)
# mapped classes are now created with names by default
# matching that of the table name.
LoginData = Base.classes.login_data
db_session = Session(engine)
After this is done you can now use all your known sqlalchemy functions on:
some_data = db_session.query(LoginData).limit(10)
You may be interested in reflection and automap.
Unfortunately, since I have never used either of those features, I am not able to tell you much more about them. I just know that they allow you to use the database schema without explicitly declaring it in Python.
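For completeness, a minimal sketch of plain reflection (as opposed to automap), assuming the same login_data table and an illustrative connection URI; the column definitions are read from the database instead of being declared by hand:

from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine('postgresql://user:password@localhost/stats')  # illustrative URI
metadata = MetaData()

# Reflect the existing table: columns are loaded from the database itself.
login_data = Table('login_data', metadata, autoload=True, autoload_with=engine)

with engine.connect() as conn:
    first_rows = conn.execute(select([login_data]).limit(10)).fetchall()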
Description
I have a Flask application with plain SQLAlchemy. The application is intended to be used internally in a company to make saving measurement data to MySQL easier.
On one page I have a table with all devices used for measurement and a form that is used to add, remove or modify measurement devices.
Problem
The problem is that when I enter a new device into the database, the page is automatically refreshed to fetch the new data from the DB, and the new device is sometimes shown and sometimes not when I refresh the page. In other words, the added row keeps appearing and disappearing from the table even though it is visible in the database. The same goes when I try to delete a device from the database: the row is sometimes shown and sometimes not when refreshing the page, even though it has been deleted from the DB.
The same problem appears for all examples similar to this one (adding, deleting and modifying data).
What I have tried
Below is the code for the table model:
class DvDevice(Base):
    __tablename__ = "dvdevice"

    id = Column("device_id", Integer, primary_key=True, autoincrement=True)
    name = Column("device_name", String(50), nullable=True)
    code = Column("device_code", String(10), nullable=True, unique=True)
    hw_ver = Column("hw_ver", String(10), nullable=True)
    fw_ver = Column("fw_ver", String(10), nullable=True)
    sw_ver = Column("sw_ver", String(10), nullable=True)
And here is the code that inserts/deletes data from table.
# Insertion
device = DvDevice()
device.code = self.device_code
device.name = self.device_name
device.hw_ver = self.hw_ver
device.fw_ver = self.fw_ver
device.sw_ver = self.sw_ver

ses.add(device)
ses.commit()
ses.expire_all()  # Should this be here?

# Deletion
ses.query(DvDevice).filter_by(id=self.device_id).delete()
ses.commit()
ses.expire_all()  # Should this be here?
I have read in some posts on Stack Overflow that I should include the following decorator function in models.py:
@app.teardown_appcontext
def shutdown_session(exception=None):
    ses.expire_all()  # ses being the database session object.
I tried this and it still doesn't work as it should. Should I put the decorator function somewhere else?
The second thing I tried was to put ses.expire_all() after all commits, and it still doesn't work.
What should I do to prevent this from happening?
Edit 1
from sqlalchemy import create_engine, update
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool
from config import MYSQLCONNECT
engine = create_engine(MYSQLCONNECT)
Session = sessionmaker(bind=engine)
session = Session()
I solved the problem by using the following function from http://docs.sqlalchemy.org/en/latest/orm/session_basics.html#when-do-i-construct-a-session-when-do-i-commit-it-and-when-do-i-close-it:
from contextlib import contextmanager

@contextmanager
def session_scope():
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except:
        session.rollback()
        raise
    finally:
        session.close()


with session_scope() as session:
    ...  # code that uses session
The problem was that I created the session object at the beginning and then never closed it.
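Applied to the insertion and deletion code from the question, the pattern would look roughly like this (my own adaptation, not from the original answer); each operation gets a short-lived session, so no stale identity map survives between page loads:

with session_scope() as ses:
    device = DvDevice()
    device.code = self.device_code
    device.name = self.device_name
    device.hw_ver = self.hw_ver
    device.fw_ver = self.fw_ver
    device.sw_ver = self.sw_ver
    ses.add(device)
    # commit happens automatically when the block exits without an error

with session_scope() as ses:
    ses.query(DvDevice).filter_by(id=self.device_id).delete()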
I am currently thinking about a good way to save my scraped data into a database.
App flow:
Run spider (data scraper), file located in spiders/
When the data has been collected successfully, save the data/items (title, link, pubDate) to the database by means of the class in pipeline.py
I would like your help on how to save the scraped data (title, link, pubDate) from spider.py into the database through pipeline.py; currently I have nothing connecting these two files together. When the data has been successfully scraped, the pipeline needs to be notified, receive the data and save it.
I'm very thankful for your help
Spider.py
import urllib.request
import lxml.etree as ET

opener = urllib.request.build_opener()
tree = ET.parse(opener.open('https://nordfront.se/feed'))

items = [{'title': item.find('title').text, 'link': item.find('link').text, 'pubdate': item.find('pubDate').text} for item in tree.xpath("/rss/channel/item")]

for item in items:
    print(item['title'], item['link'], item['pubdate'])
Models.py
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
from sqlalchemy import UniqueConstraint
import datetime
import settings


def db_connect():
    """
    Performs database connection using database settings from settings.py.
    Returns sqlalchemy engine instance
    """
    return create_engine(URL(**settings.DATABASE))


DeclarativeBase = declarative_base()

# <--snip-->

def create_presstv_table(engine):
    DeclarativeBase.metadata.create_all(engine)

def create_nordfront_table(engine):
    DeclarativeBase.metadata.create_all(engine)

def _get_date():
    return datetime.datetime.now()


class Nordfront(DeclarativeBase):
    """Sqlalchemy deals model"""
    __tablename__ = "nordfront"

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    description = Column('description', String, nullable=True)
    link = Column('link', String, unique=True)
    date = Column('date', String, nullable=True)
    created_at = Column('created_at', DateTime, default=_get_date)
Pipeline.py
from sqlalchemy.orm import sessionmaker
from models import Nordfront, db_connect, create_nordfront_table


class NordfrontPipeline(object):
    """Pipeline for storing scraped items in the database"""

    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates deals table.
        """
        engine = db_connect()
        create_nordfront_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save data in the database.

        This method is called for every item pipeline component.
        """
        session = self.Session()
        deal = Nordfront(**item)

        try:
            # Only insert the item if no row with the same link exists yet.
            if session.query(Nordfront).filter_by(link=item['link']).first() is None:
                session.add(deal)
                session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item
Settings.py
DATABASE = {'drivername': 'postgres',
            'host': 'localhost',
            'port': '5432',
            'username': 'toothfairy',
            'password': 'password123',
            'database': 'news'}
As far as I understand, this is a Scrapy-specific question. If so, you just need to activate your pipeline in settings.py:
ITEM_PIPELINES = {
    'myproj.pipeline.NordfrontPipeline': 100
}
This lets the engine know to send the crawled items to the pipeline (see the Scrapy architecture overview for the control flow).
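Note that for the pipeline to be invoked this way, the scraping code has to run as an actual Scrapy spider that yields items, rather than the standalone script shown in Spider.py. A rough sketch of what that could look like (class name and XPath details are illustrative, not from the question):

import scrapy


class NordfrontSpider(scrapy.Spider):
    name = 'nordfront'
    start_urls = ['https://nordfront.se/feed']

    def parse(self, response):
        # Yielded dicts are passed to the enabled item pipelines.
        for node in response.xpath('/rss/channel/item'):
            yield {
                'title': node.xpath('title/text()').extract_first(),
                'link': node.xpath('link/text()').extract_first(),
                'date': node.xpath('pubDate/text()').extract_first(),
            }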
If we are not talking about Scrapy, then call process_item() directly from your spider:
from pipeline import NordfrontPipeline
...
pipeline = NordfrontPipeline()
for item in items:
    pipeline.process_item(item, None)
You may also remove the spider argument from the process_item() pipeline method since it is not used.