making a search engine in python django - python

I've made a search feature using the python 2.7 toned package but to make it more scalable, I want to use ElasticSearch.
I want to do boolean searches like
(blue or small) purse and not leather
Do I need Haystack, or is an Elasticsearch client enough on its own?
How can I do complex unpredictable boolean search like the example above (the boolean structure of the words is unknown)?
All I find in the docs is SearchQuery which requires me to know the search combination prior to run time.

After investigating, I figured out the following:
I do not need Haystack at all.
Boolean search can be done with Elasticsearch's simple_query_string query. It uses "+", "-" and "|" instead of "AND", "NOT" and "OR", so it is mostly a matter of word replacement.
You can override the admin page's search to use Elasticsearch, then apply the filter query on top of that. However, with default settings Elasticsearch returns no more than 10,000 results through normal paging (the index.max_result_window setting). You can read further results in multiple batches, but I ended up retrieving only the first 10,000 ids (if there are more than 10,000 results) and passing them to the admin to run mymodel.objects.filter(id__in=my_ids).
I'm not very happy about doing this, so if someone knows a better way, let me know.
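The word replacement described above can be sketched roughly as follows. This is only an illustration, not the author's actual code: the function names and the field names "name" and "description" are hypothetical, and setting default_operator to "and" is an assumption made so that adjacent words such as "(blue or small) purse" are combined with AND.

```python
import re

# Map the user's boolean words to simple_query_string operators.
_OPERATORS = {"and": "+", "or": "|"}


def to_simple_query_string(text):
    """Replace AND / OR / NOT words with the +, | and - operators."""
    # NOT negates the token that follows it, so attach "-" directly to it.
    text = re.sub(r"\bnot\s+", "-", text, flags=re.IGNORECASE)
    # Replace the remaining standalone "and" / "or" words.
    return re.sub(
        r"\b(?:and|or)\b",
        lambda match: _OPERATORS[match.group(0).lower()],
        text,
        flags=re.IGNORECASE,
    )


def build_search_body(text, fields):
    """Build a request body for a simple_query_string search."""
    return {
        "query": {
            "simple_query_string": {
                "query": to_simple_query_string(text),
                "fields": list(fields),  # hypothetical field names
                "default_operator": "and",  # assumption, see lead-in
            }
        }
    }


body = build_search_body(
    "(blue or small) purse and not leather", ["name", "description"]
)
```

Note that this naive replacement also rewrites the words "and", "or" and "not" when they appear inside a quoted phrase, so a real implementation would probably need to tokenize the input first.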

Related

Full text mysql database search in Django

We have been using a MySQL database for our project and Django as the backend framework. We want to support full-text search on a particular table and return the QuerySet in Django. We know that Django supports full-text search on a Postgres database, but we can't move to another database now.
From what we have gathered till now -
Using inbuilt search functionality - Here we check on every field if the value exists and then take an OR to combine the results. Similar to the link (Django Search query within multiple fields in same data table).
This approach, however straightforward, may be inefficient for us because we have a huge amount of data.
Using a library or package - From what we have read Django haystack is something a lot of people are talking about when it comes to full text search.
Django Haystack - https://django-haystack.readthedocs.io/en/master/tutorial.html#installation
We haven't checked the library completely yet because we are trying to avoid using any library for this purpose. Let us know if you people have worked with this and have any views.
Any help is appreciated. Thanks.

Querying a model based on user input in Django

I am new to Django and I am trying to build a basic search/filter feature; for example, a basic version of the refine/filter part on amazon while searching for products. (I am using Sqlite3 in development)
I think I could implement a filter where clicking part of a form returns a page with the database items that match the query. However, I am not sure how to do this when the search has more than one part. For example, I am unsure how to find a book that was published before 2009 and costs more than £4.99.
I am looking to build a checkbox type of filter rather than a search like google.
All help is appreciated, Thank You.
https://django-filter.readthedocs.io
This is what you're looking for.
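Without an extra package, one common pattern for a multi-part filter is to collect the conditions into a dictionary of field lookups and pass them to a single filter() call. A minimal sketch, assuming hypothetical form parameters published_before and min_price and hypothetical model fields pub_date and price (none of these names come from the question):

```python
from decimal import Decimal


def build_filters(params):
    """Translate optional form values into Django field lookups."""
    filters = {}
    if params.get("published_before"):
        # Books published before the given year.
        filters["pub_date__year__lt"] = int(params["published_before"])
    if params.get("min_price"):
        # Books that cost more than the given amount.
        filters["price__gt"] = Decimal(params["min_price"])
    return filters


filters = build_filters({"published_before": "2009", "min_price": "4.99"})
```

In a view this could then be used as Book.objects.filter(**build_filters(request.GET)), where Book is again a hypothetical model. Parts of the form that were left empty simply add no condition.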

Efficient substring searching in Python with MySQL

I'm trying to implement a live search for my website. One that identifies words, or parts of a word, in a given string. The instant results are then underlined where they match the query.
For example, a query of "Fried green tomatoes" would yield:
SELECT *
FROM articles
WHERE (title LIKE '%fried%' OR
       title LIKE '%green%' OR
       title LIKE '%tomatoes%')
This works perfectly with a very small dataset. However, once the number of records in the database increases, this query quickly becomes inefficient because it can't utilize indices.
I know this is technically what FULLTEXT searching in MySQL is for, but the quality of results just isn't as good.
What are some alternatives to get a very high quality substring search while keeping the query efficient?
Thanks.
Sphinx will help you search quickly through large amounts of data.
There are many full-text search engines you can use, such as Sphinx, Apache Solr, Whoosh (which is pure Python) and Xapian. If you are using Django, django-haystack can interface with the last three.
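For reference, the LIKE query from the question can be generated for any number of words with placeholders instead of string concatenation. The sketch below uses an in-memory SQLite database purely as a stand-in so that it runs on its own. The question uses MySQL, the sample rows are made up, and this does nothing about the index problem the question describes.

```python
import sqlite3


def build_like_query(words):
    """Return (sql, params) matching titles that contain any of the words."""
    if not words:
        raise ValueError("at least one search word is required")
    clauses = " OR ".join("title LIKE ?" for _ in words)
    sql = "SELECT title FROM articles WHERE " + clauses + " ORDER BY title"
    params = ["%" + word + "%" for word in words]
    return sql, params


# Stand-in database with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?)",
    [("Fried Green Tomatoes",), ("Green Eggs and Ham",), ("Blue Velvet",)],
)

sql, params = build_like_query("Fried green tomatoes".lower().split())
matches = [row[0] for row in conn.execute(sql, params)]
```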

A good django search app? — How to perform fuzzy search with Haystack?

I'm using django-haystack at the moment
with apache-solr as the backend.
Problem is I cannot get the app to perform the search functionality I'm looking for
Searching for sub-parts in a word
eg. Searching for "buntu" does not give me "ubuntu"
Searching for similar words
eg. Searching for "ubantu" would give "ubuntu"
Any help would be very much appreciated.
This is really about how you pass the query back to Haystack (and therefore to Solr). You can do a 'fuzzy' search in Solr/Lucene by using a ~ after the word:
ubuntu~
would return both buntu and ubantu. See the Lucene documentation on this.
How you pass this through via Haystack depends on how you're using it at the moment. Assuming you're using the default SearchForm, the best thing would be to either override the form's clean_q method to add the tilde to the end of every word in the search query, or override the search method to do the same thing there before passing it to the SearchQuerySet.
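The "add a tilde to every word" step could look something like this. It is a plain string helper with a hypothetical name, shown on its own so it does not depend on how the form is wired up:

```python
def fuzzify(query):
    """Append "~" to every word that does not already end with one."""
    return " ".join(
        word if word.endswith("~") else word + "~" for word in query.split()
    )


fuzzy_query = fuzzify("ubantu linux")
```

The result would then be what gets handed on from clean_q or the overridden search method.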

Django objects.filter, how "expensive" would this be?

I am trying to make a search view in Django. It is a search form with freetext input + some options to select, so that you can filter on years and so on. This is some of the code I have in the view so far, the part that does the filtering. And I would like some input on how expensive this would be on the database server.
import re
from django.db.models import Q

soknad_list = Soknad.objects.all()
# Optional filters selected in the form
if var1:
    soknad_list = soknad_list.filter(pub_date__year=var1)
if var2:
    soknad_list = soknad_list.filter(muncipality__name__exact=var2)
if var3:
    soknad_list = soknad_list.filter(genre__name__exact=var3)
# TEXT SEARCH
stop_word_list = re.compile(STOP_WORDS, re.IGNORECASE)
search_term = '%s' % request.GET['q']
cleaned_search_term = stop_word_list.sub('', search_term)
cleaned_search_term = cleaned_search_term.strip()
if len(cleaned_search_term) != 0:
    soknad_list = soknad_list.filter(
        Q(dream__icontains=cleaned_search_term)
        | Q(tags__icontains=cleaned_search_term)
        | Q(name__icontains=cleaned_search_term)
        | Q(school__name__icontains=cleaned_search_term)
    )
So what I do is first make a list of all objects, then check which variables exist (I fetch these with GET at an earlier point), and then filter the results if they exist. But this doesn't seem too elegant, and it probably does a lot of queries to achieve the result.
It does exactly what I want, but I guess there is a better/smarter way to do this. Any ideas?
filter itself doesn't execute a query. No query is executed until you explicitly fetch items from the queryset (e.g. with get); list(query) also executes it.
You can see the query that will be generated by using the following (on recent Django versions, print(soknad_list.query) shows similar output):
soknad_list.query.as_sql()[0]
You can then put that into your database shell to see how long the query takes, or use EXPLAIN (if your database backend supports it) to see how expensive it is.
As Aaron mentioned, you should get hold of the query text that is going to be run against the database and use EXPLAIN (or some other method) to view the query execution plan. Once you have the execution plan for the query, you can see what is going on in the database itself. There are a lot of operations that seem very expensive to run through procedural code but are trivial for any database to run, especially if you provide indexes that the database can use to speed up your query.
If I read your question correctly, you're retrieving a result set of all rows in the Soknad table. Once you have these results back, you use the filter() method to trim them down to meet your criteria. From looking at the Django documentation, it looks like this will do an in-memory filter rather than re-query the database (of course, this really depends on which data access layer you're using and not on Django itself).
The most optimal solution would be to use a full-text search engine (Lucene, ferret, etc.) to handle this for you. If that is not available or practical, the next best option would be to construct a query predicate (WHERE clause) before issuing your query to the database and let the database perform the filtering.
However, as with all things that involve the database, the real answer is 'it depends.' The best suggestion is to try out several different approaches using data that is close to production and benchmark them over at least 3 iterations before settling on a final solution to the problem. It may be just as fast, or even faster, to filter in memory rather than filter in the database.
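One detail of the text-search step in the question: the STOP_WORDS pattern is not shown, and substituting an empty string can leave doubled spaces or clip words if the pattern has no word boundaries. A minimal sketch of that cleaning step, where the stop-word list is purely an assumption for illustration:

```python
import re

# Hypothetical stop-word pattern; the question does not show the real one.
STOP_WORDS = r"\b(?:the|a|an|of|and)\b"
stop_word_re = re.compile(STOP_WORDS, re.IGNORECASE)


def clean_search_term(raw):
    """Remove whole-word stop words and collapse the leftover whitespace."""
    without_stop_words = stop_word_re.sub(" ", raw)
    return " ".join(without_stop_words.split())


cleaned = clean_search_term("The dream of a school")
```

An empty result from clean_search_term means the text filter can be skipped, which matches the length check in the original view code.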
