How to implement a search engine using Python and MySQL?

I have a MySQL database created using a custom Python script, and I need to implement full-text search on one of its tables. I could use SELECT * FROM myTable WHERE (title LIKE '%hello%' OR title LIKE '%world%'), but I don't think that is a very efficient way to implement search, since the table has nearly one million rows.
I am using InnoDB tables, so the built-in MySQL full-text search (which only works on MyISAM tables) will not work. Any suggestions on methods or tutorials that will point me in the right direction?

If your data is content-like, you could use a dedicated full-text search engine such as Lucene:
http://lucene.apache.org/pylucene/
If you are using Django, there is Haystack:
http://haystacksearch.org/
Solr is another full-text search technology you might read about:
http://wiki.apache.org/solr/SolPython
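For the Haystack route, here is a minimal sketch of an index definition (the Article model, its fields, and the app layout are assumptions, not from the question):

# myapp/search_indexes.py
from haystack import indexes
from myapp.models import Article  # hypothetical model

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    # the main document field Haystack searches; built from a template
    # that lists which model fields to index
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')

    def get_model(self):
        return Article

Queries then go through SearchQuerySet, e.g. SearchQuerySet().filter(content='hello world'), regardless of which backend (Solr, Whoosh, Xapian) is configured.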

I am no expert with MySQL, but I can immediately say that you should not select every row that matches a pattern. If the user types in "and" and there are thousands of results, it is better to select only a limited number of rows from the database, and load more with the LIMIT clause when the user goes to the next page, e.g.:
SELECT * FROM `myTable` WHERE (`title` LIKE '%hello%' OR `title` LIKE '%world%') LIMIT startingRowOffset, numberOfRows
So to answer your question: the query is not efficient as written, and you should paginate it with LIMIT as suggested above.
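From Python, that pagination might look like the following with the MySQLdb driver (connection details are placeholders; note that MySQL's two-argument LIMIT takes the offset first, or you can use the LIMIT ... OFFSET ... form shown here):

import MySQLdb  # assumes the MySQLdb / mysqlclient driver

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()

page = 2        # 1-based page number from the request
page_size = 20  # rows per page

cur.execute(
    "SELECT * FROM myTable"
    " WHERE (title LIKE %s OR title LIKE %s)"
    " LIMIT %s OFFSET %s",
    ("%hello%", "%world%", page_size, (page - 1) * page_size),
)
rows = cur.fetchall()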

Take a look at: Fulltext Search with InnoDB. The answers there suggest using an external search engine, since there is no really good option for full-text search within InnoDB tables.

Related

How to add SQL Server full-text search in Django?

In Django, one can use full-text search natively when using Postgres. However, when using it with SQL Server (and django-pyodbc-azure) there is no simple way to do it (as far as I know).
To do a full-text search in SQL Server you use the CONTAINS(column, word) function, as described in the docs, but Django's ORM contains lookup translates to LIKE '%text%'.
I did find two alternative methods to work around this problem: one is using raw SQL, the other is using Django's extra().
Snippet using Django raw SQL:
sql = "SELECT id FROM table WHERE CONTAINS(text_field, 'term')"
table = Table.objects.raw(sql)
Using django extra:
where = "CONTAINS(text_field, 'term')"
table = Table.objects.extra(where=[where])
There are two problems with this:
Raw queries are harder to maintain.
The Django docs recommend against using extra().
So I want to know if there is a better way to do this, using "pure" Django ORM if possible.
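For reference, the raw-SQL variant can at least be parameterized, so the search term is never spliced into the SQL string by hand (a sketch; the table and field names come from the snippets above):

term = "term"  # user-supplied search term
# raw() requires the primary key to appear in the SELECT list
results = Table.objects.raw(
    "SELECT id FROM table WHERE CONTAINS(text_field, %s)", [term]
)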

Python: RE vs. Query

I am building a website using Django, and this website uses blocks which are enabled for a certain page.
Right now I use a TextField containing the paths where a block is enabled. When a page is requested, Django retrieves all blocks from the database and runs re.search against each TextField.
However, I was wondering whether, in terms of overhead, it would not be a better idea to use a separate DB table for block/path pairs, where each row contains a single path and a reference to a block.
A separate DB table is definitely the "right" way to do it, because MySQL has to send all the data from your TEXT fields every time you query. As you add more rows and the TEXT fields get bigger, you'll start to notice performance issues and eventually crash the server. Also, you'll be able to use VARCHAR and add a unique index to the paths, making lookups lightning fast.
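As a sketch, such a table could look like this as a Django model (the Block model and all names here are assumptions):

from django.db import models

class BlockPath(models.Model):
    # one row per path on which a block is enabled
    block = models.ForeignKey('Block', on_delete=models.CASCADE)
    path = models.CharField(max_length=255, db_index=True)

    class Meta:
        unique_together = ('block', 'path')

A page request then becomes a single indexed lookup, e.g. BlockPath.objects.filter(path=request.path).select_related('block').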
I am not exactly familiar with Django, but if I am understanding the situation correctly, you should use a table.
In fact this is exactly the kind of use that DB software is designed and optimized for.
No worries. It will actually be faster.
By doing the search yourself, you are trying to implement part of the DB logic on your own. Fun, certainly, but not so fast. :)
Here are some nice links on designing a database:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://en.wikipedia.org/wiki/Third_normal_form
Hope this helps. Good luck. :-)

Select Data from Table and Insert into a different DB

I'm using Python and psycopg2 to remotely query some PostgreSQL databases, and I'm trying to figure out the best way to select the data I need from a remote table and insert it into a table on a separate DB (the local application server).
Most of the stuff I've read has directed me to avoid executemany and look toward COPY operations, but I'm unsure how to implement this on a specific select statement as opposed to the entire table. Should I be headed this way or am I completely off?
but I'm unsure how to implement this on a specific select statement as opposed to the entire table
COPY isn't limited to tables; you can use a query as the source as well. Check out the examples in the manual, which show how to use COPY to create a text file based on a query:
http://www.postgresql.org/docs/current/static/sql-copy.html#AEN59055
(3rd example)
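A sketch of that pattern with psycopg2's copy_expert, streaming the result of a SELECT on the remote server into a local table (connection strings, table, and column names are all assumptions):

import io
import psycopg2

remote = psycopg2.connect("host=remote.example.com dbname=src user=me")
local = psycopg2.connect("host=localhost dbname=app user=me")

buf = io.StringIO()
with remote.cursor() as cur:
    # COPY accepts a parenthesized query, not just a table name
    cur.copy_expert(
        "COPY (SELECT id, title FROM articles WHERE updated_at > now() - interval '1 day')"
        " TO STDOUT WITH CSV",
        buf,
    )

buf.seek(0)
with local.cursor() as cur:
    cur.copy_expert("COPY articles_local (id, title) FROM STDIN WITH CSV", buf)
local.commit()

For very large result sets you would stream through a temporary file (or a pipe) instead of an in-memory StringIO.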
Take a look at http://ryrobes.com/featured-articles/using-a-simple-python-script-for-end-to-end-data-transformation-and-etl-part-1/
Granted, this is pulling from Oracle and inserting into SQL Server, but the concepts should be the same.

Efficient substring searching in Python with MySQL

I'm trying to implement a live search for my website. One that identifies words, or parts of a word, in a given string. The instant results are then underlined where they match the query.
For example, a query of "Fried green tomatoes" would yield:
SELECT *
FROM articles
WHERE (title LIKE '%fried%' OR
       title LIKE '%green%' OR
       title LIKE '%tomatoes%')
This works perfectly with a very small dataset. However, once the number of records in the database increases, this query quickly becomes inefficient: a LIKE pattern with a leading wildcard cannot use an index, so every row must be scanned.
I know this is technically what FULLTEXT searching in MySQL is for, but the quality of results just isn't as good.
What are some alternatives to get a very high quality substring search while keeping the query efficient?
Thanks.
Sphinx will help you search quickly within a huge amount of data.
There are many full-text search engines you can use, such as Sphinx, Apache Solr, Whoosh (pure Python), and Xapian. If you are using Django, django-haystack can interface with the last three.
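Since Whoosh is pure Python, here is a minimal sketch of an n-gram index, which is what makes partial-word matches ("tomat" finding "tomatoes") fast; the directory and field names are assumptions:

import os
from whoosh import index
from whoosh.fields import Schema, ID, NGRAMWORDS
from whoosh.qparser import QueryParser

# NGRAMWORDS indexes every 3-10 character slice of each word, so
# substring queries hit the index instead of scanning every row
schema = Schema(
    id=ID(stored=True, unique=True),
    title=NGRAMWORDS(minsize=3, maxsize=10, stored=True),
)

os.makedirs("search_index", exist_ok=True)
ix = index.create_in("search_index", schema)

writer = ix.writer()
writer.add_document(id="1", title="Fried Green Tomatoes")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("title", ix.schema).parse("tomat")
    for hit in searcher.search(query):
        print(hit["id"], hit["title"])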

How to make table partitions?

I am not very familiar with databases, and so I do not know how to partition a table using SQLAlchemy.
Your help would be greatly appreciated.
There are two kinds of partitioning: Vertical Partitioning and Horizontal Partitioning.
From the docs:
Vertical Partitioning
Vertical partitioning places different kinds of objects, or different tables, across multiple databases:
engine1 = create_engine('postgres://db1')
engine2 = create_engine('postgres://db2')
Session = sessionmaker(twophase=True)
# bind User operations to engine 1, Account operations to engine 2
Session.configure(binds={User:engine1, Account:engine2})
session = Session()
Horizontal Partitioning
Horizontal partitioning partitions the rows of a single table (or a set of tables) across multiple databases.
See the "sharding" example in attribute_shard.py.
Just ask if you need more information on those, preferably providing more information about what you want to do.
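For the horizontal case, here is a sketch in the spirit of the attribute_shard.py example, using SQLAlchemy's horizontal sharding extension (the chooser functions are naive placeholders, and this uses the older query_chooser hook; newer SQLAlchemy versions use execute_chooser instead):

from sqlalchemy import create_engine
from sqlalchemy.ext.horizontal_shard import ShardedSession
from sqlalchemy.orm import sessionmaker

shards = {
    "odd": create_engine("postgresql://db1"),   # placeholder URLs
    "even": create_engine("postgresql://db2"),
}

def shard_chooser(mapper, instance, clause=None):
    # decide which shard a new or loaded instance belongs to
    return "even" if instance.id % 2 == 0 else "odd"

def id_chooser(query, ident):
    # given a primary key, list the shards it could live on
    return ["even"] if ident[0] % 2 == 0 else ["odd"]

def query_chooser(query):
    # no routing hint available: fan out to every shard
    return list(shards)

Session = sessionmaker(class_=ShardedSession)
Session.configure(
    shards=shards,
    shard_chooser=shard_chooser,
    id_chooser=id_chooser,
    query_chooser=query_chooser,
)
session = Session()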
It's quite an advanced subject for somebody not familiar with databases, but try Essential SQLAlchemy (you can read the key parts on Google Book Search, pp. 122-124; the example on pp. 125-126 is not freely readable online, so you'd have to purchase the book, or read it on a commercial service such as O'Reilly's Safari, perhaps on a free trial, if you want to see the example).
Perhaps you can get better answers if you mention whether you're talking about vertical or horizontal partitioning, why you need partitioning, and what underlying database engines you are considering for the purpose.
Automatic partitioning is a very database-engine-specific concept, and SQLAlchemy doesn't provide any generic tools to manage partitioning, mostly because they wouldn't provide anything really useful while being yet another API to learn. If you want database-level partitioning, issue the CREATE TABLE statements yourself using Oracle's partitioning DDL (see the Oracle documentation on how to create partitioned tables and migrate data to them). You can use a partitioned table in SQLAlchemy just like a normal table; you just need the table declaration so that SQLAlchemy knows what to query. You can reflect the definition from the database, or duplicate the table declaration in SQLAlchemy code.
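A sketch of that approach: issue the partitioning DDL yourself, then let SQLAlchemy reflect the table like any other (the connection URL, table, and partition bounds are assumptions):

from sqlalchemy import create_engine, MetaData, Table, text

engine = create_engine("oracle://scott:tiger@dsn")  # placeholder URL

ddl = """
CREATE TABLE measurements (
    id NUMBER PRIMARY KEY,
    taken_on DATE NOT NULL,
    value NUMBER
)
PARTITION BY RANGE (taken_on) (
    PARTITION p2010 VALUES LESS THAN (DATE '2011-01-01'),
    PARTITION p2011 VALUES LESS THAN (DATE '2012-01-01')
)
"""

with engine.begin() as conn:
    conn.execute(text(ddl))

# reflect it; SQLAlchemy queries it like any non-partitioned table
metadata = MetaData()
measurements = Table("measurements", metadata, autoload_with=engine)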
Very large datasets are usually time-based, with older data becoming read-only or read-mostly and queries usually only look at data from a time interval. If that describes your data, you should probably partition your data using the date field.
There's also application level partitioning, or sharding, where you use your application to split data across different database instances. This isn't all that popular in the Oracle world due to the exorbitant pricing models. If you do want to use sharding, then look at SQLAlchemy documentation and examples for that, for how SQLAlchemy can support you in that, but be aware that application level sharding will affect how you need to build your application code.
