I'm trying to implement a live search for my website, one that identifies words, or parts of words, in a given string. The instant results are then underlined where they match the query.
For example, a query of "Fried green tomatoes" would yield:
SELECT *
FROM articles
WHERE (title LIKE '%fried%' OR
title LIKE '%green%' OR
title LIKE '%tomatoes%')
This works perfectly with a very small dataset. However, once the number of records in the database increases, this query quickly becomes inefficient: a LIKE pattern with a leading wildcard can't use an index, so every row must be scanned.
I know this is technically what FULLTEXT searching in MySQL is for, but the quality of results just isn't as good.
What are some alternatives to get a very high quality substring search while keeping the query efficient?
Thanks.
Sphinx will help you search quickly within huge amounts of data.
There are many full-text search engines you can use, such as Sphinx, Apache Solr, Whoosh (it's pure Python), and Xapian. If you are using Django, django-haystack can interface with the last three.
I've made a search feature using the Python 2.7 toned package, but to make it more scalable I want to use Elasticsearch.
I want to do boolean searches like
(blue or small) purse and not leather
Do I need haystack or just using an ElasticSearch client is enough?
How can I do complex, unpredictable boolean searches like the example above, where the boolean structure of the query is not known in advance?
All I can find in the docs is SearchQuery, which requires me to know the search combination prior to run time.
Here's what I figured out after investigating:
I do not need haystack at all.
Boolean search can be done via Elasticsearch's simple_query_string query; it uses the operators +, -, and | instead of AND, NOT, and OR, so it's just a matter of word replacement.
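For illustration, here is a minimal sketch of that word-replacement idea with the official elasticsearch Python client; the index name "products" and field "title" are placeholders, not anything from the question:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def boolean_search(user_query, size=10000):
    # Naive word replacement: AND -> +, OR -> |, NOT -> -.
    # (A real implementation should tokenize instead of string-replace.)
    translated = (user_query
                  .replace(" AND ", " + ")
                  .replace(" OR ", " | ")
                  .replace(" NOT ", " -"))
    return es.search(index="products", body={
        "size": size,  # Elasticsearch caps a single page at 10,000 hits
        "query": {
            "simple_query_string": {
                "query": translated,
                "fields": ["title"],
            }
        }
    })

# "(blue OR small) purse AND NOT leather" -> "(blue | small) purse + -leather"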
You can override the admin page's search to use Elasticsearch, then apply the filter query over that. However, Elasticsearch will not return more than 10,000 results per page. You can read multiple pages, but I ended up retrieving only the first 10,000 ids (when there are more than 10,000 results) and passing them to the admin to run a query: mymodel.objects.filter(id__in=my_ids).
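And a hedged sketch of that admin override using Django's documented ModelAdmin.get_search_results hook; MyModel and boolean_search (from the sketch above) are illustrative:
from django.contrib import admin
from .models import MyModel  # hypothetical model

class MyModelAdmin(admin.ModelAdmin):
    def get_search_results(self, request, queryset, search_term):
        if not search_term:
            return queryset, False
        # boolean_search() is the Elasticsearch helper sketched above;
        # it returns at most the first 10,000 hits, as described.
        response = boolean_search(search_term)
        ids = [hit["_id"] for hit in response["hits"]["hits"]]
        return queryset.filter(id__in=ids), False

admin.site.register(MyModel, MyModelAdmin)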
I'm not very happy about doing this, so if someone knows a better way, let me know.
I have a search feature in my application which allows users to search for products. Currently the query is
select * from products where title like '%search_term%'
This was a quick and hacky way of implementing this. I now want to improve this and wondering how I can do this.
Three short examples:
Being able to search for plurals.
My product title might be Golden Delicious Apple; if a user then searches for apples, the row will not be returned because of the plural.
When some words could be one or two words
My product title might be Lemon Cupcakes, but a user might search for cup cakes.
If a user searches for apples and lemons, should I return the rows from both examples 1 and 2, or should I return nothing? What is considered best practice?
FYI I am using python and peewee. I can think of ways to do this, but it all gets very complicated very fast.
Well, depending on what database you are using, you have a couple options.
SQLite has a very good full-text search extension that supports stemming (normalizes plural forms, etc). Peewee has rich support for the SQLite FTS:
http://docs.peewee-orm.com/en/latest/peewee/playhouse.html#FTSModel
http://charlesleifer.com/blog/using-sqlite-full-text-search-with-python/
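For example, a minimal sketch of the SQLite FTS route with peewee (the Product model is illustrative; the porter tokenizer supplies the stemming that makes "apples" match "apple"):
from peewee import Model, TextField
from playhouse.sqlite_ext import SqliteExtDatabase, FTSModel, SearchField

db = SqliteExtDatabase('products.db')

class Product(Model):
    title = TextField()

    class Meta:
        database = db

class ProductIndex(FTSModel):
    title = SearchField()

    class Meta:
        database = db
        options = {'tokenize': 'porter'}  # stemming: "apples" matches "apple"

db.create_tables([Product, ProductIndex])

# Index a row, then search with stemming.
p = Product.create(title='Golden Delicious Apple')
ProductIndex.insert({ProductIndex.docid: p.id, ProductIndex.title: p.title}).execute()

matches = (Product
           .select()
           .join(ProductIndex, on=(Product.id == ProductIndex.docid))
           .where(ProductIndex.match('apples')))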
Postgresql has full-text as well via the tsvector data type. Peewee also supports this:
http://docs.peewee-orm.com/en/latest/peewee/playhouse.html#TSVectorField
Good post on postgresql search: http://blog.lostpropertyhq.com/postgres-full-text-search-is-good-enough/
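A comparable sketch on the Postgres side, assuming illustrative names:
from peewee import Model, TextField
from playhouse.postgres_ext import PostgresqlExtDatabase, TSVectorField

db = PostgresqlExtDatabase('mydb', user='postgres')

class Article(Model):
    title = TextField()
    search_content = TSVectorField()  # holds the preprocessed search terms

    class Meta:
        database = db

# match() compiles to @@ to_tsquery(...), which stems its input,
# so searching "apples" finds rows indexed from "apple".
query = Article.select().where(Article.search_content.match('apples'))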
Finally, MySQL also supports full-text search. Though I have not experimented with it using Peewee, I'm pretty sure it should work out of the box:
https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
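Since peewee's MySQL full-text support is untested here (per the above), a raw-SQL escape hatch would be a cautious sketch; it assumes a FULLTEXT index already exists on an illustrative title column:
from peewee import Model, TextField, MySQLDatabase, SQL

db = MySQLDatabase('mydb', user='root')

class Product(Model):
    title = TextField()

    class Meta:
        database = db

# MATCH ... AGAINST goes through the FULLTEXT index (which must already
# exist on title; peewee does not create it for you here).
query = Product.select().where(SQL("MATCH(title) AGAINST(%s)", ('apples',)))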
Regarding question 2, "cup cakes" -> "cupcakes", I'm not sure what the best solution is going to be in that case.
With question 3, I know SQLite will correctly handle boolean expressions in queries, e.g. "apples AND lemons" will match documents containing both, whereas "apples OR lemons" will match documents containing one or the other. I imagine postgres and mysql do the same.
We have scanned thousands of old documents and entered key data into a database. One of the fields is author name.
We need to search for documents by a given author but the exact name might have been entered incorrectly as on many documents the data is handwritten.
I thought of searching for only the first few letters of the surname and then presenting a list for the user to select from. I don't know at this stage how many distinct authors there are; I suspect it will be in the hundreds rather than hundreds of thousands. There will be hundreds of thousands of documents.
Is there a better way? Would an SQL database handle it better?
The software is python, and there will be a list of documents each with an author.
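For concreteness, the prefix-then-pick approach I have in mind would look something like this with just the standard library (the names below are made up):
from difflib import get_close_matches

# Illustrative list of distinct author names pulled from the database.
authors = ["Blackwood", "Blakemore", "Black", "Blech", "Bloch"]

def candidate_authors(query, limit=5):
    # Narrow by the first few letters of the surname, then rank the
    # survivors by string similarity for the user to pick from.
    prefix = query[:3].lower()
    pool = [a for a in authors if a.lower().startswith(prefix)] or authors
    return get_close_matches(query, pool, n=limit, cutoff=0.6)

print(candidate_authors("Blaek"))  # -> ['Black']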
I think you can use MongoDB, where you can store a list field with all possible spellings of an author's name. For example, if you have the handwritten name "black" and you can't tell whether a letter is a "c" or an "e", you can store the original name as "black" and add "blaek" to the list of possible names.
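A minimal pymongo sketch of that alias-list idea (database, collection, and field names are made up):
from pymongo import MongoClient

docs = MongoClient().archive.documents

# Store the name as read, plus every plausible alternative reading.
docs.insert_one({
    "title": "Ledger, 1898",
    "author": "black",
    "author_aliases": ["black", "blaek"],
})

# Querying the array field matches the document under either spelling.
found = docs.find_one({"author_aliases": "blaek"})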
You could use Sunburnt, a Python library for accessing Solr, which is built on top of Lucene.
An excerpt of what Solr is:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
It will give you all you want for searching documents, including partial hits and potential matches on whatever your search criteria is.
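A rough sketch of what querying through Sunburnt looks like; the Solr URL and schema fields are assumptions:
import sunburnt

si = sunburnt.SolrInterface("http://localhost:8983/solr/")

# Query the author field; execute() runs the search and returns documents.
for doc in si.query(author="black").execute():
    print(doc["author"], doc["title"])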
I have a MySQL database created using a custom Python script. I need to implement full-text search on a table in the database. I could use SELECT * FROM myTable WHERE (title LIKE '%hello%' OR title LIKE '%world%'), however I don't think that is a very efficient way of implementing search, since the table has nearly one million rows.
I am using InnoDB tables, so the built-in MySQL full-text search for MyISAM will not work. Any suggestions on methods or tutorials that will point me in the right direction?
If your data is content-like, you could use a dedicated full-text search engine like Lucene:
http://lucene.apache.org/pylucene/
If you are doing Django you have Haystack:
http://haystacksearch.org/
Solr is also a full-text search related technology you might read about:
http://wiki.apache.org/solr/SolPython
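For a sense of scale, a Haystack query looks roughly like this once a backend and a search index for your model are configured (both assumed here):
from haystack.query import SearchQuerySet

# Searches whatever backend (Solr, Whoosh, Xapian, ...) is configured
# in settings.HAYSTACK_CONNECTIONS.
results = SearchQuerySet().filter(content="hello world")
for result in results[:10]:
    print(result.object)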
I am no expert with MySQL, but I can immediately say that you should not be selecting everything that matches a LIKE value. If the user types in "and" and there are thousands of results, it may be better to select only a certain number of rows from the database and then load more using the LIMIT clause when the user goes to the next page, e.g.:
SELECT * FROM `myTable` WHERE (`title` LIKE '%hello%' OR `title` LIKE '%world%') LIMIT startingAtRowNumber, numberOfValues
So to answer your question: the query is not efficient, and you should use something like what I suggested above.
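Concretely, a parameterized version of that pagination from Python could look like this; pymysql is just one possible driver, and the connection details are placeholders:
import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="mydb")
page, per_page = 1, 50
offset = (page - 1) * per_page

with conn.cursor() as cursor:
    # MySQL's LIMIT takes the offset first, then the row count.
    cursor.execute(
        "SELECT * FROM myTable "
        "WHERE title LIKE %s OR title LIKE %s "
        "ORDER BY id LIMIT %s, %s",
        ("%hello%", "%world%", offset, per_page),
    )
    rows = cursor.fetchall()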
Take a look at: Fulltext Search with InnoDB. They suggest using an external search engine, since there is no really good option for searching within InnoDB tables.
I am building a website using Django, and this website uses blocks which are enabled for a certain page.
Right now I use a TextField containing the paths where a block is enabled. When a page is requested, Django retrieves all blocks from the database and runs re.search on the TextField.
However, in terms of overhead, I was wondering whether it would be better to use a separate DB table for block/path pairs, where each row contains a single path and a reference to a block.
A separate DB table is definitely the "right" way to do it, because MySQL has to send all the data from your TEXT fields every time you query. As you add more rows and the TEXT fields get bigger, you'll start to notice performance issues and could eventually crash the server. Also, you'll be able to use VARCHAR and add a unique index to the paths, making lookups lightning fast.
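A hedged sketch of that design as Django models; the names are illustrative:
from django.db import models

class Block(models.Model):
    name = models.CharField(max_length=100)

class BlockPath(models.Model):
    block = models.ForeignKey(Block, on_delete=models.CASCADE,
                              related_name="paths")
    # VARCHAR plus an index makes exact path lookups fast.
    path = models.CharField(max_length=255, db_index=True)

    class Meta:
        unique_together = ("block", "path")

# At request time: one indexed lookup instead of regex-scanning TEXT blobs.
enabled_blocks = Block.objects.filter(paths__path="/about/")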
I am not exactly familiar with Django, but if I am understanding the situation correctly, you should use a table.
In fact this is exactly the kind of use that DB software is designed and optimized for.
No worries. It will actually be faster.
By doing the search yourself, you are trying to implement part of the DB logic on your own. Fun, certainly, but not so fast. :)
Here are some nice links on designing a database:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://en.wikipedia.org/wiki/Third_normal_form
Hope this helps. Good luck. :-)