Structuring dedupe results in a database

I am using the Python project dedupe to find duplicate organization names in my data. Most of the examples focus on how to process the data, not on how to store and use the results. Are there any best practices for taking the results, putting them into a database, and querying to group records that are duplicates?
My thoughts so far are to structure the two tables like this (using SQLAlchemy), but I feel like something is off about it:
class Organization(Base):
    __tablename__ = 'organization'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    cluster_id = Column(Integer, ForeignKey('duplicate_organization.cluster_id'))

class DuplicateOrganzation(Base):
    __tablename__ = 'duplicate_organization'
    id = Column(Integer, primary_key=True)
    cluster_id = Column(Integer)
    name = Column(String)
    organizations = relationship("Organization")
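One thing that may feel off in the layout above is that the foreign key points at duplicate_organization.cluster_id, which is neither a primary key nor unique. A minimal sketch of one alternative, not an established best practice: keep one row per cluster and let each organization reference that row's primary key. The Cluster class, the canonical_name and confidence columns, and the sample data are all hypothetical. The clustered list is hard-coded and only assumes that the clustering step yields (record ids, scores) pairs; check the dedupe documentation for the exact output of the version you use. The code assumes SQLAlchemy 1.4 or later.

```python
from collections import defaultdict

from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Cluster(Base):
    # Hypothetical table: one row per group of duplicate organizations.
    __tablename__ = "cluster"
    id = Column(Integer, primary_key=True)
    canonical_name = Column(String)
    organizations = relationship("Organization", back_populates="cluster")


class Organization(Base):
    __tablename__ = "organization"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # The foreign key targets the cluster's primary key.
    cluster_id = Column(Integer, ForeignKey("cluster.id"))
    confidence = Column(Float)  # hypothetical: per-record cluster score
    cluster = relationship("Cluster", back_populates="organizations")


engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

# Assumed shape of the clustering output: (record ids, scores) per cluster.
clustered = [((1, 2), (0.97, 0.95)), ((3,), (1.0,))]
names = {1: "Acme Inc", 2: "ACME Incorporated", 3: "Globex"}

with Session(engine) as session:
    session.add_all([Organization(id=i, name=n) for i, n in names.items()])
    session.flush()

    for record_ids, scores in clustered:
        # Use the first member's name as a stand-in canonical name.
        cluster = Cluster(canonical_name=names[record_ids[0]])
        session.add(cluster)
        for record_id, score in zip(record_ids, scores):
            org = session.get(Organization, record_id)
            org.cluster = cluster
            org.confidence = score
    session.commit()

    # Group organization names by the cluster they were assigned to.
    groups = defaultdict(list)
    for org in session.query(Organization).order_by(Organization.id):
        groups[org.cluster.canonical_name].append(org.name)
    groups = dict(groups)

print(groups)
```

With this layout, "all duplicates of organization X" is a lookup on cluster_id, and re-running the clustering only means rewriting the cluster rows and the cluster_id column.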

Related

django query to get data

I am in a confusing situation: I have a one-to-many relationship, and I want to query all rows from the parent table but get only those rows from the child table that satisfy the condition site_id = 100.
class Policy(Base):
    """Table containing details for policies."""
    __tablename__ = "UmbrellaPolicy"
    id = Column(Integer, primary_key=True)
    policy_id = Column(Integer, nullable=False, index=True)
    user_defined_name = Column(String(255), nullable=True)

and the child is like this:

class Site(Base):
    __tablename__ = "Site"
    id = Column(Integer, primary_key=True)
    # The foreign key must name the parent table ("UmbrellaPolicy"), not the class.
    policy_id = Column(Integer, ForeignKey("UmbrellaPolicy.id"))
    site_id = Column(String(32), nullable=False, index=True)
    policy = relationship("Policy", backref="sites")

You should be able to filter across the relation like this:
parents = Policy.objects.filter(site__site_id=100)
Note that this is Django ORM syntax, so it assumes Policy and Site are Django models; the classes in the question are SQLAlchemy models and have no objects manager. You can find more information in the Django query API documentation. Lookups that span a relation generally take the form relationname__fieldname, and the docs describe many other ways to filter and query.
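Since the models in the question are SQLAlchemy classes, here is a minimal sketch of how the same request might look there, assuming SQLAlchemy 1.4 or later. It puts the site_id condition into the ON clause of an outer join, so every Policy row is returned, and uses contains_eager so that Policy.sites is filled only from the joined (matching) rows. The sample data and the in-memory SQLite engine are made up for illustration, and the foreign key is pointed at the actual table name "UmbrellaPolicy". Because site_id is a String column in the question, the comparison uses the string "100".

```python
from sqlalchemy import Column, ForeignKey, Integer, String, and_, create_engine
from sqlalchemy.orm import Session, contains_eager, declarative_base, relationship

Base = declarative_base()


class Policy(Base):
    __tablename__ = "UmbrellaPolicy"
    id = Column(Integer, primary_key=True)
    user_defined_name = Column(String(255), nullable=True)


class Site(Base):
    __tablename__ = "Site"
    id = Column(Integer, primary_key=True)
    policy_id = Column(Integer, ForeignKey("UmbrellaPolicy.id"))
    site_id = Column(String(32), nullable=False, index=True)
    policy = relationship("Policy", backref="sites")


engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    a = Policy(id=1, user_defined_name="a")
    b = Policy(id=2, user_defined_name="b")
    session.add_all([
        a,
        b,
        Site(site_id="100", policy=a),
        Site(site_id="200", policy=a),
        Site(site_id="300", policy=b),
    ])
    session.commit()

with Session(engine) as session:
    policies = (
        session.query(Policy)
        # The filter lives in the ON clause, so policies without a
        # matching site are still returned (with no joined site row).
        .outerjoin(Site, and_(Site.policy_id == Policy.id, Site.site_id == "100"))
        # Populate Policy.sites from the joined rows only.
        .options(contains_eager(Policy.sites))
        .order_by(Policy.id)
        .all()
    )
    result = {p.user_defined_name: [s.site_id for s in p.sites] for p in policies}

print(result)
```

The collections loaded this way are deliberately incomplete, so it is safer to treat these objects as read-only and to run the query in a fresh session.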

Linking two tables where the same value exists, without Primary key - SQLAlchemy

I have the following tables defined (very simplified version):
class Orders(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    order_id = db.Column(db.Integer, nullable=False)
    date_created = db.Column(db.DateTime, nullable=False)

class ProductOrders(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    order_id = db.Column(db.Integer, nullable=False)
    product_id = db.Column(db.Integer, nullable=False)
    base_price = db.Column(db.Float, nullable=False)
I am using the BigCommerce API and have multiple order_ids in both tables. The order_id is not unique globally but is unique per store. I am trying to work out how to link the two tables. I do have a Store table (shown below) that holds the store.id for each store, but I cannot work out how to join the Orders and ProductOrders tables so that I can access both tables' data where the store.id is the same. For example, I want to query a set of Orders.order_id or Orders.date_created values and get ProductOrders.base_price as well.
class Store(db.Model):
    id = db.Column(db.Integer, primary_key=True)
Any ideas?

Assuming id in both tables is the store ID and order_id is unique per store, you will have to join on both columns with an AND condition.
For example, in SQL:
Orders JOIN ProductOrders ON Orders.id = ProductOrders.id AND Orders.order_id = ProductOrders.order_id
This answer is based on my understanding of your question; apologies if it is not what you need.
Edit:
In SQLAlchemy it would be something like this:
from sqlalchemy import and_
session.query(Orders, ProductOrders).filter(
    and_(Orders.id == ProductOrders.id,
         Orders.order_id == ProductOrders.order_id)
).all()
References:
https://www.tutorialspoint.com/sqlalchemy/sqlalchemy_orm_working_with_joins.htm
Using OR in SQLAlchemy
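In the models as posted, id is each table's own primary key, so the answer's assumption that it holds the store ID may not apply. A minimal sketch of the same two-column join under a different assumption: both tables carry a hypothetical store_id column referencing Store.id. That column, the table names, and the sample data are not in the question, and plain SQLAlchemy (1.4 or later) is used instead of Flask-SQLAlchemy so the example is self-contained.

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, and_, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Store(Base):
    __tablename__ = "store"
    id = Column(Integer, primary_key=True)


class Orders(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    # Assumed column: which store the order belongs to.
    store_id = Column(Integer, ForeignKey("store.id"), nullable=False)
    order_id = Column(Integer, nullable=False)
    date_created = Column(DateTime, nullable=False)


class ProductOrders(Base):
    __tablename__ = "product_orders"
    id = Column(Integer, primary_key=True)
    # Assumed column: which store the order line belongs to.
    store_id = Column(Integer, ForeignKey("store.id"), nullable=False)
    order_id = Column(Integer, nullable=False)
    product_id = Column(Integer, nullable=False)
    base_price = Column(Float, nullable=False)


engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    created = datetime(2020, 1, 1)
    session.add_all([
        Store(id=1),
        Store(id=2),
        # The same order_id appears in two different stores.
        Orders(store_id=1, order_id=100, date_created=created),
        Orders(store_id=2, order_id=100, date_created=created),
        ProductOrders(store_id=1, order_id=100, product_id=7, base_price=9.5),
        ProductOrders(store_id=2, order_id=100, product_id=8, base_price=20.0),
    ])
    session.commit()

    rows = (
        session.query(Orders.store_id, Orders.order_id, ProductOrders.base_price)
        # Join on the (store_id, order_id) pair so orders from different
        # stores that share an order_id are not mixed up.
        .join(ProductOrders, and_(
            Orders.store_id == ProductOrders.store_id,
            Orders.order_id == ProductOrders.order_id,
        ))
        .order_by(Orders.store_id)
        .all()
    )
    pairs = [tuple(row) for row in rows]

print(pairs)
```

If (store_id, order_id) really is unique in Orders, declaring it as a unique constraint would also let ProductOrders reference it with a composite foreign key.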

"has_many :through" construct in sqlalchemy

In Rails we can define relationships with the has_many :through syntax in order to access 2nd, 3rd, ... nth degree relations.
In SQLAlchemy, however, this seems to be more difficult. I'm trying to avoid writing joins by hand, as I consider them an anti-pattern when trying to keep a clean code base.
My tables look like the following:
class Message(db.Model):
    __tablename__ = 'message'
    id = db.Column(db.Integer, primary_key=True)
    text = db.Column(db.String())
    user_id = db.Column(db.ForeignKey("user.id"))
    user = db.relationship('User', backref="messages")

class User(db.Model):
    __tablename__ = 'user'
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String())

class Level(db.Model):
    __tablename__ = 'level'
    number = db.Column(db.Integer, nullable=False, primary_key=True)
    name = db.Column(db.String(), nullable=False, primary_key=True)
    users = db.relationship(
        "User",
        secondary="user_level",
        backref="levels")

class UserLevel(db.Model):
    __tablename__ = 'user_level'
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'), primary_key=True)
    number = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(), primary_key=True)
    __table_args__ = (
        db.ForeignKeyConstraint(
            ['number', 'name'],
            ['level.number', 'level.name']
        ),
    )
The idea is that a user can have multiple authorisation levels (e.g. a user can be at level 1, 3 and 6 at the same time). As the data I have does not contain unique sequence numbers for available levels, I had to resort to the use of composite keys to keep the data consistent with future updates.
To get all messages for a level I can currently do something like this:
results = []
users = Level.query[0].users
for user in users:
    results.append(user.messages)
return results
This gives me all users on a level. But in order to get all messages for a certain level, I have to loop through these users and append their messages to a results list.
What I'd like to do is:
return Level.query[0].users.messages
This is more like the syntax I am used to from Rails. How would one accomplish this in Flask-SQLAlchemy?
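The thread has no answer for this one. One possible approach, sketched below, keeps the join but hides it in a single helper that follows the existing relationships (Message.user, then the User.levels backref), so calling code stays close to the one-liner the question asks for. The helper name messages_for_level and the sample data are hypothetical, and plain SQLAlchemy (1.4 or later) is used instead of Flask-SQLAlchemy so the example is self-contained; with Flask-SQLAlchemy the same chain should work starting from Message.query.

```python
from sqlalchemy import (Column, ForeignKey, ForeignKeyConstraint, Integer,
                        String, create_engine)
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Message(Base):
    __tablename__ = "message"
    id = Column(Integer, primary_key=True)
    text = Column(String)
    user_id = Column(ForeignKey("user.id"))
    user = relationship("User", backref="messages")


class Level(Base):
    __tablename__ = "level"
    number = Column(Integer, primary_key=True)
    name = Column(String, primary_key=True)
    users = relationship("User", secondary="user_level", backref="levels")


class UserLevel(Base):
    __tablename__ = "user_level"
    user_id = Column(Integer, ForeignKey("user.id"), primary_key=True)
    number = Column(Integer, primary_key=True)
    name = Column(String, primary_key=True)
    __table_args__ = (
        ForeignKeyConstraint(["number", "name"], ["level.number", "level.name"]),
    )


def messages_for_level(session, number, name):
    """Return all messages written by users who hold the given level."""
    return (
        session.query(Message)
        .join(Message.user)   # message -> user
        .join(User.levels)    # user -> user_level -> level
        .filter(Level.number == number, Level.name == name)
        .order_by(Message.id)
        .all()
    )


engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    alice, bob = User(name="alice"), User(name="bob")
    session.add_all([
        Level(number=1, name="admin", users=[alice]),
        Level(number=2, name="guest", users=[alice, bob]),
        Message(id=1, text="hi", user=alice),
        Message(id=2, text="yo", user=bob),
    ])
    session.commit()

    admin_texts = [m.text for m in messages_for_level(session, 1, "admin")]
    guest_texts = [m.text for m in messages_for_level(session, 2, "guest")]

print(admin_texts, guest_texts)
```

This runs one SELECT instead of one query per user, which the loop in the question would trigger through lazy loading.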

SQLAlchemy Many-to-many table with multiple foreign key entries

I'm new to SQLAlchemy and I want to do this as simply as possible, yet correctly. I want to track domain use across multiple companies on a monthly basis, so I set up the following tables:
class Company(Base):
    __tablename__ = 'company'
    id = Column(Integer, primary_key=True)
    name = Column('name', String)

class Domains(Base):
    __tablename__ = 'domains'
    id = Column(Integer, primary_key=True)
    name = Column('name', String, unique=True)

class MonthlyUsage(Base):
    '''
    Track domain usage across all
    companies on a monthly basis.
    '''
    __tablename__ = 'monthlyusage'
    month = Column(DateTime)
    company_id = Column(Integer, ForeignKey('company.id'))
    domain_id = Column(Integer, ForeignKey('domains.id'))
    # <...other columns snipped out...>
    company = relationship('Company', backref='company_assoc')
    domain = relationship('Domains', backref='domain_assoc')
This works fine, until I add usage details for the second month. Then I get duplicate key value errors:
*sqlalchemy.exc.IntegrityError: (IntegrityError) duplicate key value violates unique constraint "monthlyusage_pkey"*
Does this mean I have to split out the "monthlyusage" into a third table? That seems unnecessarily complicated, since all that needs to be unique is the month, company_id, and domain_id fields.
Any suggestions for my layout here, to keep it as simple as possible, yet still correct?
TIA!

OK, I needed to add a primary key column to MonthlyUsage so that the month is part of the key. The code below now works:
class MonthlyUsage(Base):
    '''
    Track domain usage across all
    companies on a monthly basis.
    '''
    __tablename__ = 'monthlyusage'
    month = Column(DateTime)
    month_id = Column(Integer, primary_key=True)
    company_id = Column(Integer, ForeignKey('company.id'), primary_key=True)
    domain_id = Column(Integer, ForeignKey('domains.id'), primary_key=True)
    # <...other columns snipped out...>
    company = relationship('Company', backref='company_assoc')
    domain = relationship('Domains', backref='domain_assoc')
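The question notes that only month, company_id and domain_id together need to be unique. A minimal sketch of a variant that encodes exactly that, without the extra month_id column: make the three columns a composite primary key. This is an alternative to the accepted fix, not a correction of it. Using Date instead of DateTime for month, the sample data, and the in-memory SQLite engine are assumptions made for the sketch, which targets SQLAlchemy 1.4 or later.

```python
from datetime import date

from sqlalchemy import Column, Date, ForeignKey, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Company(Base):
    __tablename__ = "company"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Domains(Base):
    __tablename__ = "domains"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)


class MonthlyUsage(Base):
    """Track domain usage across all companies on a monthly basis."""
    __tablename__ = "monthlyusage"
    # Composite primary key: one row per (month, company, domain).
    month = Column(Date, primary_key=True)
    company_id = Column(Integer, ForeignKey("company.id"), primary_key=True)
    domain_id = Column(Integer, ForeignKey("domains.id"), primary_key=True)
    company = relationship("Company", backref="company_assoc")
    domain = relationship("Domains", backref="domain_assoc")


engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    acme = Company(name="acme")
    site = Domains(name="example.com")
    # Two different months for the same company and domain are fine.
    session.add_all([
        MonthlyUsage(month=date(2024, 1, 1), company=acme, domain=site),
        MonthlyUsage(month=date(2024, 2, 1), company=acme, domain=site),
    ])
    session.commit()
    row_count = session.query(MonthlyUsage).count()

# A second row for the same month, company and domain is rejected.
duplicate_rejected = False
with Session(engine) as session:
    session.add(MonthlyUsage(month=date(2024, 1, 1), company_id=1, domain_id=1))
    try:
        session.commit()
    except IntegrityError:
        session.rollback()
        duplicate_rejected = True

print(row_count, duplicate_rejected)
```

If a surrogate key is preferred, a single autoincrementing id plus a UniqueConstraint on the three columns would enforce the same rule.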

query to hierarchical mapper

I have two models in a many-to-many relationship, and one of them is hierarchical:
# hierarchical model
class Tag(Base):
    __tablename__ = "tags"
    id = Column(Integer, primary_key=True)
    name = Column(String)

Tag.parent_id = Column(Integer, ForeignKey(Tag.id, ondelete='CASCADE'))
Tag.childs = relationship(Tag, backref=backref('parent', remote_side=[Tag.id]),
                          cascade="all, delete")

class Subject(Base):
    __tablename__ = "subjects"
    id = Column(Integer, primary_key=True, doc="ID")
    name = Column(String)
    tags = relationship(Tag, secondary="tags_subjects", backref="subjects")

# many-to-many relations model
class TagsSubjects(Base):
    __tablename__ = "tags_subjects"
    id = Column(Integer, primary_key=True)
    tag_id = Column(Integer, ForeignKey("tags.id"))
    subject_id = Column(Integer, ForeignKey("subjects.id"))
I want to write one (or several) queries that find all Subject objects
whose 'name' field is like 'foo', OR that have related tags whose names are like 'foo',
OR that have related tags with one or more ancestors (a parent or anything higher in the hierarchy) whose 'name' is like 'foo'.
I've tried something like this:
subjects = session.query(Subject).filter(or_(
    Subject.name.ilike('%{0}%'.format('foo')),
    Subject.tags.any(
        Tag.name.ilike('%{0}%'.format('foo')))
)).order_by(Subject.name).all()
But this isn't correct: it is a "flat" query that ignores the hierarchy :(
How can I do this with SQLAlchemy's API?
Thanks!
P.S. I'm using SQLite backend
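This question is unanswered in the thread. One way to add the hierarchical part, sketched here under assumptions, is a recursive common table expression: start from the tags whose own name matches, repeatedly add their children, and then select subjects linked to any tag in that set. The sketch assumes SQLAlchemy 1.4 or later and an SQLite build with recursive CTE support. The function name search_subjects, the CTE name, and the sample data are hypothetical, and the self-referential relationship is declared from the parent side to keep the example short.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, or_, select
from sqlalchemy.orm import Session, aliased, declarative_base, relationship

Base = declarative_base()


class Tag(Base):
    __tablename__ = "tags"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    parent_id = Column(Integer, ForeignKey("tags.id", ondelete="CASCADE"))
    # Same parent/childs pair as in the question, declared from the parent side.
    parent = relationship("Tag", remote_side=[id], backref="childs")


class Subject(Base):
    __tablename__ = "subjects"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    tags = relationship("Tag", secondary="tags_subjects", backref="subjects")


class TagsSubjects(Base):
    __tablename__ = "tags_subjects"
    id = Column(Integer, primary_key=True)
    tag_id = Column(Integer, ForeignKey("tags.id"))
    subject_id = Column(Integer, ForeignKey("subjects.id"))


def search_subjects(session, term):
    """Find subjects matching by name, by tag name, or by an ancestor tag's name."""
    pattern = "%{0}%".format(term)
    # Anchor: tags whose own name matches.
    matching = (
        session.query(Tag.id)
        .filter(Tag.name.ilike(pattern))
        .cte(name="matching_tags", recursive=True)
    )
    # Recursive step: add every child of a tag that is already in the set.
    child = aliased(Tag)
    matching = matching.union_all(
        session.query(child.id).filter(child.parent_id == matching.c.id)
    )
    # Subjects linked to any tag in the set.
    tagged = select(TagsSubjects.subject_id).where(
        TagsSubjects.tag_id.in_(select(matching.c.id))
    )
    return (
        session.query(Subject)
        .filter(or_(Subject.name.ilike(pattern), Subject.id.in_(tagged)))
        .order_by(Subject.name)
        .all()
    )


engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    root = Tag(name="foo")
    mid = Tag(name="bar", parent=root)
    leaf = Tag(name="baz", parent=mid)
    other = Tag(name="qux")
    session.add_all([
        Subject(name="by name foo"),
        Subject(name="direct", tags=[root]),
        Subject(name="inherited", tags=[leaf]),
        Subject(name="unrelated", tags=[other]),
    ])
    session.commit()
    found = [s.name for s in search_subjects(session, "foo")]

print(found)
```

The subject tagged only with "baz" is found because "baz" descends from "foo" two levels up, which the flat query in the question would miss.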
