SQLAlchemy: Select count of related many-to-many elements - python

I have a many to many relationship between two tables/objects: Tag and Content. Tag.content is the relationship from a tag to all content which has this tag.
Now I'd like to find out the number of content objects assigned to a tag (for all tags, otherwise I'd simply use len()). The following code almost works:
cnt = db.func.count()
q = db.session.query(Tag, cnt) \
    .outerjoin(Tag.content) \
    .group_by(Tag) \
    .order_by(cnt.desc())
However, it will never return a zero count for obvious reasons - there is at least one row per tag after all due to the LEFT JOIN used. This is a problem though since I'd like to get the correct count for all tags - i.e. 0 if a tag is orphaned.
So I wonder if there's a way to achieve this - obviously without sending n+1 queries to the database. A pure-SQL solution might be ok too, usually it's not too hard to map such a solution to SA somehow.
.filter(Tag.content.any()) removes the results with the incorrect count, but it will do so by removing the rows from the resultset altogether which is not what I want.

Solved it. I needed to use DISTINCT in the COUNTs:
cnt = db.func.count(db.distinct(Content.id))
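A minimal sketch of why this works, using the standard-library sqlite3 module rather than SQLAlchemy (the table and column names here are made up for illustration). COUNT(*) counts the NULL-padded row that the LEFT JOIN produces for an orphaned tag, while COUNT over a column of the joined table skips NULLs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE content (id INTEGER PRIMARY KEY);
    CREATE TABLE tag_content (tag_id INTEGER, content_id INTEGER);
    INSERT INTO tag VALUES (1, 'used'), (2, 'orphan');
    INSERT INTO content VALUES (10), (11);
    INSERT INTO tag_content VALUES (1, 10), (1, 11);
""")

def tag_counts(count_expr):
    # count_expr is the aggregate to compare, e.g. "COUNT(*)"
    sql = f"""
        SELECT tag.name, {count_expr} AS cnt
        FROM tag
        LEFT JOIN tag_content ON tag_content.tag_id = tag.id
        LEFT JOIN content ON content.id = tag_content.content_id
        GROUP BY tag.id
        ORDER BY cnt DESC, tag.name
    """
    return conn.execute(sql).fetchall()

star_counts = tag_counts("COUNT(*)")                        # orphan gets 1
distinct_counts = tag_counts("COUNT(DISTINCT content.id)")  # orphan gets 0
print(star_counts)
print(distinct_counts)
```

The DISTINCT only matters if the join can yield the same content row more than once for a tag; the zero for orphaned tags comes from counting a nullable column instead of rows.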


lxml xpath exponential performance behavior

I'm trying to use XPath to query a large HTML document with multiple tables and extract only the few tables that contain a specific pattern in one of their cells. I'm running into performance problems.
I've tried to minimize my issue as much as possible.
Code setup: creates 10 tables (300x15) with random values between 0 and 99
import pandas as pd
import numpy as np
dataframes = [pd.DataFrame(np.random.randint(0,100, (300, 15)), columns=[f"Col-{i}" for i in range(15)]) for k in range(10)]
html_strings = [df.to_html() for df in dataframes]
combined_html = '\n'.join(html_strings)
source_html = f'<html><body>{combined_html}</body></html>'
code execution: I want to extract all tables that have the value "80" in them (in this case it will be all 10 tables)
from lxml import etree
root = etree.fromstring(source_html.encode())
PAT = '80' # this should result in returning all 10 tables as 80 will definitely be there in all of them (pandas index)
# method 1: query directly using xpath - this takes a long time to run and seems to exhibit exponential time behavior
xpath_query = f"//table//*[text() = '{PAT}']/ancestor::table"
tables = root.xpath(xpath_query)
# method 2: this runs in under a second. First get all the tables, then run the same xpath expression within each table's context
all_tables = root.xpath('//table')
table_xpath_individual = f".//*[text() = '{PAT}']/ancestor::table"
selected_tables = [table for table in all_tables if table.xpath(table_xpath_individual)]
method 1 takes 40-50s to finish
method 2 takes <1s
I'm not sure whether it's the xpath expression in method 1 that's problematic or it's an lxml issue here. I've switched to using method 2 for now - but unsure whether it's a 100% equivalent behavior
I don't know if this is relevant (but I suspect so). You can simplify these XPaths by eliminating the //* step and the trailing /ancestor::table step.
//table[descendant::text() = '{PAT}']
Note that in your problematic XPath, for each table you will find every descendant element whose text is 80 (there might be many within a table) and for each one, return all of that element's table ancestors (again, because in theory there might be more than one if you had a table containing a table, the XPath processor is going to have to laboriously traverse all those ancestor paths). That will return a potentially large number of results which the XPath processor will then have to deduplicate, so that it doesn't return multiple instances of any given table (an XPath 1.0 nodeset is guaranteed not to contain duplicates).
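To illustrate the "test each table once" idea without the ancestor step, here is a rough sketch that uses the standard-library xml.etree.ElementTree as a stand-in for lxml. The helper name and the sample markup are assumptions for illustration. It is also not an exact equivalent of the XPath: `el.text` only covers the text before an element's first child, whereas `text()` tests every text node.

```python
import xml.etree.ElementTree as ET

def tables_containing(root, pat):
    """Return each <table> that has a descendant whose leading text equals pat."""
    return [
        table
        for table in root.iter("table")
        # any() stops at the first matching cell, so each table is tested once
        if any(el.text == pat for el in table.iter())
    ]

source = (
    "<html><body>"
    "<table id='a'><tr><td>80</td><td>80</td></tr></table>"
    "<table id='b'><tr><td>79</td></tr></table>"
    "<table id='c'><tr><th>80</th></tr></table>"
    "</body></html>"
)
root = ET.fromstring(source)
matched_ids = [t.get("id") for t in tables_containing(root, "80")]
print(matched_ids)
```

Table 'a' has two matching cells but is returned only once, so there is nothing to deduplicate afterwards.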

How can I count across several relationships in django

For a small project I have a registry of matches and results. Every match is between teams (could be a single-player team), and has a winner. So I have Match and Team models, joined by a MatchTeam model. This looks like so (simplified; see below for notes):
class Team(models.Model):
    ...

class Match(models.Model):
    teams = models.ManyToManyField(Team, through='MatchTeam')
    ...

class MatchTeam(models.Model):
    match = models.ForeignKey(Match, related_name='matchteams',)
    team = models.ForeignKey(Team)
    winner = models.NullBooleanField()
    ...
Now I want to do some stats on the matches, starting with looking up who is the person that beats you the most. I'm not completely sure how to do this, at least, not efficiently.
In SQL (just approximating here), I would mean something like this:
SELECT their_mt.team_id, COUNT(*) AS cnt
FROM matchteam AS your_mt
JOIN matchteam AS their_mt ON your_mt.match_id = their_mt.match_id
WHERE your_mt.team_id IN <<:your teams>>
  AND your_mt.winner = false
GROUP BY their_mt.team_id
ORDER BY cnt DESC
(this also needs a "their_mt is not your_mt" clause btw, but the concept is clear, right?)
While I have not tested this as SQL, it's just to give an insight to what I'm looking for: I want to find this result via a Django aggregation.
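For reference, a runnable sketch of that query against an in-memory SQLite database, using only the standard-library sqlite3 module. The sample rows are invented, the table layout is a guess at what the models above would produce, and two clauses are assumptions added here: the "their_mt is not your_mt" condition and a filter that keeps only opponents who actually won.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matchteam (
        id INTEGER PRIMARY KEY,
        match_id INTEGER,
        team_id INTEGER,
        winner INTEGER  -- 1 = won, 0 = lost
    );
    -- matches 1 and 2: team 2 beats team 1; match 3: team 3 beats team 1;
    -- match 4: team 1 beats team 3
    INSERT INTO matchteam (match_id, team_id, winner) VALUES
        (1, 1, 0), (1, 2, 1),
        (2, 1, 0), (2, 2, 1),
        (3, 1, 0), (3, 3, 1),
        (4, 1, 1), (4, 3, 0);
""")

def nemesis_counts(your_team_ids):
    """Return (team_id, wins_against_you) rows, most frequent winner first."""
    placeholders = ", ".join("?" for _ in your_team_ids)
    sql = f"""
        SELECT their_mt.team_id, COUNT(*) AS cnt
        FROM matchteam AS your_mt
        JOIN matchteam AS their_mt ON your_mt.match_id = their_mt.match_id
        WHERE your_mt.team_id IN ({placeholders})
          AND your_mt.winner = 0
          AND their_mt.winner = 1          -- assumption: count winners only
          AND their_mt.id <> your_mt.id    -- "their_mt is not your_mt"
        GROUP BY their_mt.team_id
        ORDER BY cnt DESC
    """
    return conn.execute(sql, list(your_team_ids)).fetchall()

result = nemesis_counts([1])
print(result)
```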
According to the manual I can annotate results with an aggregation, in this case a Count. Joining MatchTeams straight on MatchTeams as I'm doing in the SQL is a bit of a shortcut maybe, as there 'should' be a Match in between? At least, I wouldn't know how to translate that into Django
So maybe I need to find certain matches for my team, and then annotate them with the count of the other team? But what is 'the other team'?
Quick write-up would look like:
nemesis = Match.objects \
    .filter(matchteams__in=yourteams) \
    .annotate(cnt=Count('<<otherteam>>')).order_by('-cnt')[0]
If this is the right track, how should I define the Count here?
And if it's not the right track, what is?
As is, this is all about teams instead of users. This is just to keep things simple :)
An additional question might be: should I even do this with that Django ORM stuff, or am I better off just adding SQL? That has the obvious disadvantage that you're stuck with writing very generic code (is this even possible?) or fixing your DB-backend. If not needed, I'd like to avoid that.
About the model: I really want to understand what I can change about the model to make it better, but I can't really see a solution without downsides. Let me try to explain:
I want to support matches with arbitrary amount of teams, so for instance a 5-team-match. This means I have many-to-many relationship and not one that is for instance 1 match to 2 teams. If that was the case, you could denormalize and put the winners/scores in the team table. But this is not the case.
Extra data about the results of one team (e.g. their final score, their time) is by definition a property of the relation. It cannot go into the team table (as it would be per match and you can have an undefined amount of matches), and it cannot go in the match table for the same reason mutatis mutandis.
Example: I have teams A,B,C,D and E playing a match. Team A and Team B have 10 points, the rest all have 0 points. I want to save the amount of points, and that Team A and Team B are the winners of this match.
So to the comments suggesting I need a 'better' design, by all means, if you have one I would gladly see it, but if you want to support what I support, it's going to be hard.
And as a final remark: This data can be easily retrieved in SQL, so the model seems fine to me: I'm just too much of a beginner in Django to be able to do it in Django's ORM!
Funny problem! I think I have the answer (get the team that beats yourteams the most):
Team.objects.get(  # the expected result is a team
    pk=list(  # filter out yourteams
        filter(lambda x: x not in [y.id for y in yourteams],
            list(
                Match.objects  # search matches
                .filter(matchteams__in=yourteams)  # in which you were involved
                .filter(matchteams__winner=False)  # that you lose
                .annotate(cnt=Count('teams'))  # and count them
                .order_by('-cnt')  # sort appropriately
                .values_list('teams__id', flat=True)  # finally get only pks
            )
        )
    )[0]  # take the first item, which should be the super winner
)
I did not test it explicitly, but even if it does not work as is, I think it may be the right track.
You can do something like this
# IDs of the matches that my team lost
lost_match_ids = MatchTeam.objects.filter(team=my_team, winner=False).values_list('match_id', flat=True)
# IDs of the teams that won those matches
teams_won_matches_against_my_team = MatchTeam.objects.filter(match_id__in=lost_match_ids, winner=True).values_list('team_id', flat=True)
But as suggested you can probably model better.
I would hold two fields in the MatchModel: home_team, away_team.
Simpler and more indicative.

Removing Paragraph From Cell In Python-Docx

I am attempting to create a table with a two row header that uses a simple template format for all of the styling. The two row header is required because I have headers that are the same under two primary categories. It appears that the only way to handle this within Word so that a document will format and flow with repeating header across pages is to nest a two row table into the header row of a main content table.
In Python-DocX a table cell is always created with a single empty paragraph element. For my use case I need to be able to remove this empty paragraph element entirely not simply clear it with an empty string. Or else I have line break above my nested table that ruins my illusion of a single table.
So the question is how do you remove the empty paragraph?
If you know of a better way to handle the two row header implementation... that would also be appreciated info.
While Paragraph.delete() is not implemented yet in python-docx, there is a workaround function documented here: https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
Note that a table cell must always end with a paragraph. So you'll need to add an empty one after your table otherwise I believe you'll get a so-called "repair-step" error when you try to load the document.
Probably worth a try without the extra paragraph just to confirm; I'd expect it would look better without it, but last time I tried that I got the error.
As @scanny said, you can delete a paragraph by passing it to a self-defined delete function.
I just want to add a supplement, in case you want to delete multiple paragraphs.
import docx

def delete_paragraph(paragraph):
    # Detach the underlying XML element from its parent
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None

def remove_multiple_para(doc):
    i = 0
    while i < len(doc.paragraphs):
        if 'xxxx' in doc.paragraphs[i].text:
            # Assumes the match is neither the first paragraph nor one of the last two
            for j in range(i+2, i-2, -1):
                # delete the related 4 lines
                delete_paragraph(doc.paragraphs[j])
        i += 1
    doc.save('outputDoc.docx')

doc = docx.Document('./inputDoc.docx')
remove_multiple_para(doc)

Select attributes from different joined tables with Flask and SQLAlchemy

Can't get myself to do something as easy as good ol'
SELECT phrase.content, meaning.content
FROM phrase JOIN meaning
ON phrase.id = meaning.phrase_id
All the examples I can find in the documentation/SO are variations of
a = Phrase.query.join(Meaning).all()
which doesn't really work cause then a is a list of Phrase objects, whereas I want to select one attribute from Phrase and one from Meaning.
Anybody? Thanks
q = db.session.query(Phrase.content, Meaning.content).join(Meaning).all()

django conjunctive filter __in query

Consider an array of Tags, T.
Each PhotoSet has a many-to-many relationship to Tags.
We also have a filter, F (consisting of a set of Tags), and we want to return all PhotoSets that have ALL the tags contained in F.
i.e., if F = ['green', 'dogs', 'cats'], we want every PhotoSet instance that has all the tags in F.
Naturally
PhotoSet.objects.filter(tags__in=F)
does not do the trick, since it returns every PhotoSet containing any member of F.
I see it's possible to do similar things using "Q" expressions, but that only seemed to work for a fixed number of conjunctive parameters. Is this something that can be done using a list comprehension?
Thanks in advance!
EDIT -- SOLUTION:
I found the solution using an obvious way. Simply chaining filters...
results = PhotoSet.objects
for f in F:
    results = results.filter(tags__in=[f])
results = results.all()
Was staring me in the face the whole time!
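Little quick and dirty, but it'll do the trick:
query = None
for tag in F:
    if query is None:
        query = Q(tags=tag)
    else:
        query &= Q(tags=tag)
PhotoSet.objects.filter(query)
Be aware that all conditions inside a single filter() call must be satisfied by the same related Tag row, so on a many-to-many field this combined Q only matches when F contains a single tag. Chaining one filter() call per tag, as in the edit above, avoids that.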
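A pure-SQL way to express "has all tags in F" is to group by photo set and require that the number of distinct matching tags equals the size of F. The sketch below is a different technique from the chained filters, shown with the standard-library sqlite3 module. The single flattened photoset_tags table and its contents are assumptions for illustration, not the schema Django would generate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photoset_tags (photoset_id INTEGER, tag TEXT);
    INSERT INTO photoset_tags VALUES
        (1, 'green'), (1, 'dogs'), (1, 'cats'),
        (2, 'green'), (2, 'dogs'),
        (3, 'cats'), (3, 'dogs'), (3, 'green'), (3, 'birds');
""")

def photosets_with_all_tags(tags):
    """Return ids of photo sets that carry every tag in `tags`."""
    wanted = sorted(set(tags))  # de-duplicate so the count comparison is exact
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT photoset_id
        FROM photoset_tags
        WHERE tag IN ({placeholders})
        GROUP BY photoset_id
        HAVING COUNT(DISTINCT tag) = ?
        ORDER BY photoset_id
    """
    return [row[0] for row in conn.execute(sql, wanted + [len(wanted)])]

matching = photosets_with_all_tags(['green', 'dogs', 'cats'])
print(matching)
```

Photo set 2 is excluded because it lacks 'cats'; photo set 3 still matches even though it has an extra tag.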
