Efficient Django full-text search without Haystack

Efficient Django full-text search without Haystack - python

What's the next best option for database-agnostic full-text search for Django without Haystack?
I have a model like:
class Paper(models.Model):
title = models.CharField(max_length=1000)
class Person(models.Model):
name = models.CharField(max_length=100)
class PaperReview(models.Model):
paper = models.ForeignKey(Paper)
person = models.ForeignKey(Person)
I need to search for papers by title and reviewer name, but I also want to search from the perspective of a person and find which papers they have and haven't reviewed. With Haystack, it's trivial to implement a full-text index to search by title and name fields, but as far as I can tell, there's no way to do the "left outer join" necessary to find papers without a review by a specific person.

Haystack is just a wrapper that exposes a few different search engine backends:
Solr
ElasticSearch
Whoosh
Xapian
There might be other backends as well available as plugins.
So the real question here is, is there a search backend that gives me the desired functionality, and does haystack expose that functionality?
The answer to that is, you can probably use elasticsearch*, but note the asterix.
Generally, when creating a search index, it's a good idea to think about the documents in the same way you might if you were creating a no-rel database and you want those documents to be as flat as possible.
So one possibility might be to have an array of char fields on a paperreview index. The array would contain all of the related foreign key references.
Another might be to use "nested documents" in elasticsearch.
And lastly, to use "parent/child documents" in elasticsearch.
You can still use haystack for indexing, with some hacking, but you will probably want to use one of the raw backends directly, such as pyelasticsearch or pyes.
http://www.elasticsearch.org/guide/reference/mapping/nested-type/
http://www.elasticsearch.org/guide/reference/mapping/parent-field/
http://pyelasticsearch.readthedocs.org/en/latest/
http://pyes.readthedocs.org/en/latest/

I know this question is older, but I spent some time investigation this recently and answered this as well here but it is actually not too hard to implement this yourself, and wanted to share.
I found the SearchVector/SearchQuery approach actually does not catch all cases, for example partial words (see https://www.fusionbox.com/blog/detail/partial-word-search-with-postgres-full-text-search-in-django/632/ for reference). You can implement your own without much trouble, depending on your constraints.
example, within a viewsets' get_queryset method:
...other params...
search_terms = self.request.GET.get('q')
if search_terms:
# remove possible other delimiters and other chars
# that could interfere
cleaned_terms = re.sub(r'[!\'()|&;,]', ' ', search_terms).strip()
if cleaned_terms:
# Check against all the params we want
# apply to previous terms' filtered results
q = reduce(
lambda p, n: p & n,
map(
lambda word:
Q(your_property__icontains=word) | Q(
second_property__icontains=word) | Q(
third_property__icontains=word)
cleaned_terms.split()
)
)
qs = YourModel.objects.filter(q)
return qs

I use Haystack + elastic search and so far its working pretty well. Dont think its trivial . You can easily implement your requirement, if theres a association between paper and person.

I ended up using djorm-ext-pgfulltext, which provides a simple Django interface for PostgreSQL's built-in full text search features.

Related

Django PostgreSQL - Fuzzy Searching on single words

I'm using Django's builtin trigram_similar lookup and TrigramSimilarity class to create a fuzzy searching using a PostgreSQL database. This way, I filter titles of articles. The problem is, this filters based on the whole title. I want to filter it also on parts on title.
Example:
Title of Article: "We are the world | This is the best article ever".
My search: "we"
In this example, my function returns nothing, but I want it to return this article. How can I do that?
This is the code I use:
def search(qs, term, fields=["title"], treshold=.3):
if len(fields) >= 2:
return qs.annotate(
similarity=Greatest(
*[TrigramSimilarity(field, term) for field in fields]
)
).filter(similarity__gte=treshold).order_by("-similarity")
return qs.filter(
**{"{}__trigram_similar".format(fields[0]): term}
)

I think the problem is that you are using trigram similarity with a two letter word. If you try to do this with a word that is three letters, maybe it will work?

Django SearchVector using icontains

I am trying to search for a list of values in multiple columns in postgres (via django). I was able to use SearchQuery and SearchVector and this works great if one of the search values matches a full word. I was hoping to use icontains so that partial strings could also be used in the search. Is this possible and if so could someone point me in the right direction. Here is an example of my approach below.
Example Data:
Superhero.objects.create(
superhero='Batman',
publisher='DC Comics',
alter_ego='Bruce Wayne',
)
Superhero.objects.create(
superhero='Hulk',
publisher='Marvel Comics',
alter_ego='Bruce Banner',
)
Django filter:
from django.contrib.postgres.search import SearchQuery, SearchVector
query = SearchQuery('man') | SearchQuery('Bruce')
vector = SearchVector('superhero', 'alter_ego', 'publisher')
queryset = queryset.annotate(search=vector).filter(search=query)
This would return the Hulk record but I am hoping I can somehow use like 'icontains' so that when searching for 'man' the Batman record would also be returned. Any help is appreciated!

You can apply icontains to the filter like:
queryset = queryset.annotate(search=vector).filter(search__icontains=query)

So SearchQuery and SearchVector are a part of Django's Full Text searching functionality and it doesnt look like you can achieve what I was wanting to do with these functions. I have taken a different approach thanks to Julian Phalip's approach here.. https://www.julienphalip.com/blog/adding-search-to-a-django-site-in-a-snap/

How can I count across several relationships in django

For a small project I have a registry of matches and results. Every match is between teams (could be a single player team), and has a winner. So I have Match and Team models, joined by a MatchTeam model. This looks like so (simplified)see below for notes
class Team(models.Model):
...
class Match(models.Model):
teams = ManyToManyField(Team, through='MatchTeam')
...
class MatchTeam(models.Model):
match = models.ForeignKey(Match, related_name='matchteams',)
team = models.ForeignKey(Team)
winner = models.NullBooleanField()
...
Now I want to do some stats on the matches, starting with looking up who is the person that beats you the most. I'm not completely sure how to do this, at least, not efficiently.
In SQL (just approximating here), I would mean something like this:
SELECT their_matchteam.id, COUNT(*) as cnt
FROM matchteam AS your_mt
JOIN matchteam AS their_mt ON your_mt.match_id = their_mt.match_id
WHERE your.matchteam.id IN <<:your teams>>
your_matchteam.winner = false
GROUP BY their_matchteam.team_id
ORDER BY cnt DESC
(this also needs a "their_mt is not your_mt" clause btw, but the concept is clear, right?)
While I have not tested this as SQL, it's just to give an insight to what I'm looking for: I want to find this result via a Django aggregation.
According to the manual I can annotate results with an aggregation, in this case a Count. Joining MatchTeams straight on MatchTeams as I'm doing in the SQL is a bit of a shortcut maybe, as there 'should' be a Match in between? At least, I wouldn't know how to translate that into Django
So maybe I need to find certain matches for my team, and then annotate them with the count of the other team? But what is 'the other team'?
Quick write-up would look like:
nemesis = Match.objects \
.filter(matchteams__in=yourteams) \
.annotate(cnt=Count('<<otherteam>>')).order_by('-cnt')[0]
If this is the right track, how should I define the Count here.
And if it's not the right track, what is?
As is, this is all about teams instead of users. This is just to keep things simple :)
An additional question might be: should I even do this with that Django ORM stuff, or am I better off just adding SQL? That has the obvious disadvantage that you're stuck with writing very generic code (is this even possible?) or fixing your DB-backend. If not needed, I'd like to avoid that.
About the model: I really want to understand what I can change about the model to make it better, but I can't really see a solution without downsides. Let me try to explain:
I want to support matches with arbitrary amount of teams, so for instance a 5-team-match. This means I have many-to-many relationship and not one that is for instance 1 match to 2 teams. If that was the case, you could denormalize and put the winners/scores in the team table. But this is not the case.
Extra data about the results of one team (e.g. their final score, their time) is by definition a property of the relation. It cannot go into the team table (as it would be per match and you can have an undefined amount of matches), and it cannot go in the match table for the same reason mutatis mutandis.
Example: I have teams A,B,C,D and E playing a match. Team A and Team B have 10 points, the rest all have 0 points. I want to save the amount of points, and that Team A and Team B are the winners of this match.
So to the comments suggesting I need a 'better' design, by all means, if you have one I would gladly see it, but if you want to support what I support, it's going to be hard.
And as a final remark: This data can be easilly retrieved in SQL, so the model seems fine to me: I'm just too much of a beginner in Django to be able to do it in Django's ORM!

Funny problem ! I think I have the answer (get the team that beats yourteams the most):
Team.objects.get( # the expected result is a team
pk=list( # filter out yourteams
filter(lambda x: x not in [ y.id for y in yourteams ],
list(
Match.objects # search matches
.filter(matchteams__in=yourteams) # in which you were involved
.filter(matchteams__winner=False) # that you loose
.annotate(cnt=Count('teams')) # and count them
.order_by('-cnt') # sort appropriately
.values_list('teams__id', flat=True) # finally get only pks
)
)
)[0] # take the first item that should be the super winner
)
I did not test it explicitly, but if does not work, I think it may be the right track.

You can do something like this
matches_won_aginst_my_team = MatchTeam.objects.filter(team=my_team, winner=False).select_related(matches)
teams_won_matches_aginst_my_team = matches_won_aginst_my_team.filter(winner=True).values_list('matchteams__team')
But as suggested you can probably model better.
I would hold two fields in the MatchModel: home_team, away_team.
Simpler and more indicative.

Django, SQLite - Accurate ordering of strings with accented letters

Main problem:
I have a Python (3.4) Django (1.6) web app using an SQLite (3) database containing a table of authors. When I get the ordered list of authors some names with accented characters like ’Čapek’ and ’Örkény’ are the end of list instead of at (or directly after) section ’c’ and ’o’ of the list.
My 1st try:
SQLite can accept collation definitions. I searched for one that was made to order UTF-8 strings correctly for example Localized and Unicode collation in Android (Accented Search in sqlite (android)) but found none.
My 2nd try:I found an old closed Django ticket about my problem: https://code.djangoproject.com/ticket/8384 It suggests sorting with Python as workaround. I found it quite unsatisfying. Firstly if I sort with a Python method (like below) instead of ordering at model level I cannot use generic views. Secondly ordering with a Python method returns the very same result as the SQLite order_by does: ’Čapek’ and ’Örkény’ are placed after section 'z'.
author_list = sorted(Author.objects.all(), key=lambda x: (x.lastname, x.firstname))
How could I get the queryset ordered correctly?

Thanks to the link CL wrote in his comment, I managed to overcome the difficulties that I replied about. I answer my question to share the piece of code that worked because using Pyuca to sort querysets seems to be a rare and undocumented case.
# import section
from pyuca import Collator
# Calling Collator() takes some seconds so you should create it as reusable variable.
c = Collator()
# ...
# main part:
author_list = sorted(Author.objects.all(), key=lambda x: (c.sort_key(x.lastname), c.sort_key(x.firstname)))
The point is to use sort_key method with the attribute you want to sort by as argument. You can sort by multiple attributes as you see in the example.
Last words: In my language (Hungarian) we use four different accented version of the Latin letter ‘o’: ‘o’, ’ó’, ’ö’, ’ő’. ‘o’ and ‘ó’ are equal in sorting, and ‘ö’ and ‘ő’ are equal too, and ‘ö’/’ő’ are after ‘o’/’ó’. In the default collation table the four letters are equal. Now I try to find a way to define or find a localized collation table.

You could create a new field in the table, fill it with the result of unidecode, then sort according to it.
Using a property to provide get/set methods could help in keeping the fields in sync.

QuerySet: LEFT JOIN with AND

I use old Django version 1.1 with hack, that support join in extra(). It works, but now is time for changes. Django 1.2 use RawQuerySet so I've rewritten my code for that solution. Problem is, that RawQuery doesn't support filters etc. which I have many in code.
Digging through Google, on CaktusGroup I've found, that I could use query.join().
It would be great, but in code I have:
LEFT OUTER JOIN "core_rating" ON
("core_film"."parent_id" = "core_rating"."parent_id"
AND "core_rating"."user_id" = %i
In query.join() I've written first part "core_film"."parent_id" = "core_rating"."parent_id" but I don't know how to add the second part after AND.
Does there exist any solution for Django, that I could use custom JOINs without rewritting all the filters code (Raw)?
This is our current fragment of code in extra()
top_films = top_films.extra(
select=dict(guess_rating='core_rating.guess_rating_alg1'),
join=['LEFT OUTER JOIN "core_rating" ON ("core_film"."parent_id" = "core_rating"."parent_id" and "core_rating"."user_id" = %i)' % user_id] + extra_join,
where=['core_film.parent_id in (select parent_id from core_film EXCEPT select film_id from filmbasket_basketitem where "wishlist" IS NOT NULL and user_id=%i)' % user_id,
'( ("core_rating"."type"=1 AND "core_rating"."rating" IS NULL) OR "core_rating"."user_id" IS NULL)',
' "core_rating"."last_displayed" IS NULL'],
)

Unfortunately, the answer here is no.
The Django ORM, like most of Django, follows a philosophy that easy things should be easy and hard things should be possible. In this case, you are definitely in the "hard things" area and the "possible" solution is to simply write the raw query. There are definitely situations like this where writing the raw query can be difficult and feels kinda gross, but from the project's perspective situations like this are too rare to justify the cost of adding such functionality.

Try this patch: https://code.djangoproject.com/ticket/7231

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Efficient Django full-text search without Haystack - python

I use Haystack + elastic search and so far its working pretty well. Dont think its trivial . You can easily implement your requirement, if theres a association between paper and person.

I ended up using djorm-ext-pgfulltext, which provides a simple Django interface for PostgreSQL's built-in full text search features.

Related

Django PostgreSQL - Fuzzy Searching on single words

Django SearchVector using icontains

How can I count across several relationships in django

Django, SQLite - Accurate ordering of strings with accented letters

QuerySet: LEFT JOIN with AND

Categories

Resources