We have scanned thousands of old documents and entered key data into a database. One of the fields is author name.
We need to search for documents by a given author, but the exact name may have been entered incorrectly, since on many documents the data is handwritten.
I thought of searching for only the first few letters of the surname and then presenting a list for the user to select from. I don't know at this stage how many distinct authors there are, I suspect it will be in the hundreds rather than hundreds of thousands. There will be hundreds of thousands of documents.
Is there a better way? Would an SQL database handle it better?
The software is written in Python, and there will be a list of documents, each with an author.
I think you could use MongoDB, where you can give each document a list field containing all the possible spellings of the author's name. For example, if the handwritten name is "black" and you can't tell whether one letter is a "c" or an "e", you can store the original name as "black" and add "blaek" to the list of possible names.
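A rough sketch of that idea with pymongo (the database, collection, and field names, and the example document, are hypothetical):

from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB instance on localhost
docs = client.archive.documents  # hypothetical database/collection names

# Store the best-guess transcription plus every plausible variant.
docs.insert_one({
    "title": "Parish survey, 1843",
    "author": "black",
    "author_variants": ["black", "blaek"],
})

# Querying the list field matches any of the stored spellings.
for doc in docs.find({"author_variants": "blaek"}):
    print(doc["title"], doc["author"])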
You could use Sunburnt, a Python library for Solr, which in turn is built on top of Lucene.
An excerpt of what Solr is:
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
It will give you everything you need for searching documents, including partial hits and potential matches for whatever your search criteria are.
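As a minimal sketch (assuming a Solr instance at localhost:8983 whose schema has id, title, and author fields), indexing and querying through Sunburnt looks roughly like this:

import sunburnt

si = sunburnt.SolrInterface("http://localhost:8983/solr/")

# Index a document; the field names must match the Solr schema.
si.add({"id": "doc-1", "title": "Survey of the parish", "author": "Black"})
si.commit()

# Query by author; Solr handles ranking and partial matches.
for result in si.query(author="black").execute():
    print(result["id"], result.get("title"))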
I have a search feature in my application which allows users to search for products. Currently the query is
select * from products where title like '%search_term%'
This was a quick and hacky way of implementing it. I now want to improve it and am wondering how best to do so.
Three short examples
Being able to search for plurals.
My title for the product might be Golden Delicious Apple, but a user might search for apples. Because of the plural, the row will not get returned.
When some words could be one/two words
My title for the product might be Lemon Cupcakes, but a user might search for cup cakes.
If a user searches for apples and lemons, should I return both rows from examples 1 and 2, or should I return nothing? What is considered best practice?
FYI, I am using Python and peewee. I can think of ways to do this, but it all gets very complicated very fast.
Well, depending on which database you are using, you have a couple of options.
SQLite has a very good full-text search extension that supports stemming (normalizes plural forms, etc). Peewee has rich support for the SQLite FTS:
http://docs.peewee-orm.com/en/latest/peewee/playhouse.html#FTSModel
http://charlesleifer.com/blog/using-sqlite-full-text-search-with-python/
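For example, a minimal sketch using peewee's SQLite extension (the model and field names are made up; the porter tokenizer provides the stemming so that "apples" matches "apple"):

from playhouse.sqlite_ext import SqliteExtDatabase, FTSModel, SearchField

db = SqliteExtDatabase('products.db')

class ProductIndex(FTSModel):
    title = SearchField()

    class Meta:
        database = db
        options = {'tokenize': 'porter'}  # stemming: "apples" matches "apple"

db.create_tables([ProductIndex])
ProductIndex.create(title='Golden Delicious Apple')

# Full-text query; boolean expressions like "apples AND lemons" also work.
for hit in ProductIndex.select().where(ProductIndex.match('apples')):
    print(hit.title)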
Postgresql has full-text as well via the tsvector data type. Peewee also supports this:
http://docs.peewee-orm.com/en/latest/peewee/playhouse.html#TSVectorField
Good post on postgresql search: http://blog.lostpropertyhq.com/postgres-full-text-search-is-good-enough/
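A comparable sketch with peewee's Postgres extension (names are again made up; the tsvector column needs to be kept in sync with the title, e.g. via a trigger or when saving):

from peewee import CharField, Model
from playhouse.postgres_ext import PostgresqlExtDatabase, TSVectorField

db = PostgresqlExtDatabase('shop')  # hypothetical database name

class Product(Model):
    title = CharField()
    search_content = TSVectorField()  # populate from title on save

    class Meta:
        database = db

# tsvector search with stemming, so "apples" will match "apple".
hits = Product.select().where(Product.search_content.match('apples'))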
Finally, MySQL also supports full-text search. Though I have not experimented with it using Peewee, I'm pretty sure it will work out of the box:
https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Regarding question 2 ("cup cakes" -> "cupcakes"), I'm not sure what the best solution is going to be in that case.
With question 3, I know SQLite will correctly handle boolean expressions in queries, e.g. "apples AND lemons" will match documents containing both, whereas "apples OR lemons" will match documents containing one or the other. I imagine Postgres and MySQL do the same.
Let's assume I am developing a service that provides a user with articles. Users can favourite articles and I am using Solr to store these articles for search purposes.
However, when the user adds an article to their favourites list, I would like to be able to figure out which articles the user has added to favourites so that I can highlight the favourite button.
I am thinking of two approaches:
Fetch articles from Solr and then loop through each article to fetch the "favourite-status" of this article for this specific user from MySQL.
Whenever a user favourites an article, add this user's ID to a multi-valued column in Solr and check whether the ID of the current user is in this column or not.
I don't know the capacity of the multivalued column... and I also don't think the second approach would be a "good practice" (saving user-related data in index).
What other options do I have, if any? Is approach 2 a correct approach?
I'd go with a modified version of the first one: for now it keeps user-specific data that's not going to be used for search out of the index (although if you foresee a case where you want to search within favourited articles, it would probably be an interesting field to have in the index). For pure display purposes like this, I'd take all the ids returned from Solr, fetch them in one SQL statement from the database, and then set the UI values depending on that. It's a fast and easy solution.
If you foresee "search only in my favourited articles" as a use case, I would try to get that information into the index as well (or some other way of filtering on whether a specific user has added the article as a favourite). In that case I'd try to avoid indexing anything more than the id of the user who favourited the article.
Both solutions would work, however; the latter just requires more code, and the response from Solr could grow large if a large number of users favourite an article, so in that case (many favourites for a single article) I'd try to avoid having to return a set of user ids.
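As a sketch of that first approach (the favourites table and column names are hypothetical, and cursor is assumed to be a DB-API cursor), the favourite status for a whole page of Solr results can be fetched in a single query:

def favourited_ids(cursor, user_id, article_ids):
    """Return the subset of article_ids this user has favourited, in one query."""
    if not article_ids:
        return set()
    placeholders = ', '.join(['%s'] * len(article_ids))
    cursor.execute(
        "SELECT article_id FROM favourites "
        "WHERE user_id = %%s AND article_id IN (%s)" % placeholders,
        [user_id] + list(article_ids),
    )
    return {row[0] for row in cursor.fetchall()}

# article_ids comes straight from the Solr response; highlight the
# favourite button for every article whose id is in the returned set.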
We are developing an e-commerce portal that enables users to list their items (name, description, tags) on the site.
However, we realized that users do not understand item tags very well: some of them write arbitrary words, others leave the field blank. So we decided to deal with it. I thought about using an entity extractor to generate tags, and first tried passing this listing to Calais:
I'm a Filipino Male looking for Office Assistant job,with knowledge in MS Word,Excel,Power Point & Internet Browsing,i'm a quick learner with clear & polite communicative skills,immense flexibility in terms of work assignments and work hours,and performing my duties with full dedication,integrity and honesty.
and I got these tags: Religion Belief, Positive psychology, Integrity, Evaluation, Behavior, Psychology, Skill.
Then I tried Stanford NER and got: Excel, Power, Point, &, Internet, Browsing.
After that, I stopped trying these solutions, as I thought they would not fit, and started thinking about an e-commerce-related thesaurus that might contain product/brand names and trade-related terms, which I could use to filter user-generated posts and find the proper tags, but I couldn't find one.
So, first question: did I miss something?
Second question: are there better scenarios for this (i.e. generating the tags)?
In my current project, users can like songs, and now I'm going to add a song search so that a user can search for some song she has liked before.
I have implemented search engines using xapian before, which involves building indexes of documents periodically.
In my case, do I have to build indexes for every user's songs independently?
If I want the search results to be more real-time, does this mean that I need to build indexes incrementally every short period of time?
To take your questions separately.
Do I have to build indexes for every user's songs independently?
No; a common technique for this kind of situation is to index each like separately, with both the information about the song and the identifier of the user. Then when you search, you filter the results of the user's free-text search by the identifier of the user who's actually logged in.
In Xapian you'd do this by adding a term representing the user (with a suitable prefix, so you might have XU175 for a user with id 175, perhaps), and then using OP_FILTER to restrict the search to just likes by the logged-in user.
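A rough sketch of both halves in Python (the database path, user id, and query text are placeholders):

import xapian

# Indexing a like: the song text plus a boolean term for the user.
db = xapian.WritableDatabase('likes.db', xapian.DB_CREATE_OR_OPEN)
doc = xapian.Document()
termgen = xapian.TermGenerator()
termgen.set_stemmer(xapian.Stem('en'))
termgen.set_document(doc)
termgen.index_text('Some Song Title')
doc.add_term('XU175')  # user 175 liked this song
db.add_document(doc)

# Searching: parse the user's query, then filter to their own likes.
qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem('en'))
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
text_query = qp.parse_query('song title words')
query = xapian.Query(xapian.Query.OP_FILTER, text_query, xapian.Query('XU175'))
enquire = xapian.Enquire(xapian.Database('likes.db'))
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print(match.docid)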
Do I need to build indexes incrementally every short period of time [to support real-time indexing]?
This is entirely dependent on the search system you're using. With Xapian you can either do that and periodically 'compact' the databases generated into one base one; or you can index live into the database -- although since Xapian is single-writer, you'd want to find a way of serialising this, such as by putting new likes onto a queue and having a single process that pops them off and indexes into the database. One largely off-the-shelf solution to this would be to use Restpose, written by one of the Xapian developers, which fills the same sort of role as Solr does for Lucene.
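A minimal sketch of that queue idea, where the worker is the only thread that ever opens the database for writing (names and the in-process queue are assumptions; in practice you might use an external queue instead):

import queue
import threading
import xapian

like_queue = queue.Queue()

def index_worker(db_path):
    """Single writer: pops likes off the queue and indexes them one by one."""
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    while True:
        user_id, song_title = like_queue.get()
        doc = xapian.Document()
        termgen = xapian.TermGenerator()
        termgen.set_document(doc)
        termgen.index_text(song_title)
        doc.add_term('XU%d' % user_id)
        db.add_document(doc)
        like_queue.task_done()

threading.Thread(target=index_worker, args=('likes.db',), daemon=True).start()

# Elsewhere in the app, recording a new like is just a queue put:
like_queue.put((175, 'Some Song Title'))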
You can also get fancier by indexing into one database, then replicating that to another and searching the replicated version, which also gives you options to scale horizontally in future. There's a discussion of replication in the Xapian documentation.
I'm trying to implement a live search for my website. One that identifies words, or parts of a word, in a given string. The instant results are then underlined where they match the query.
For example, a query of "Fried green tomatoes" would yield:
SELECT *
FROM articles
WHERE (title LIKE '%fried%' OR
       title LIKE '%green%' OR
       title LIKE '%tomatoes%')
This works perfectly with a very small dataset. However, once the number of records in the database increases, this query quickly becomes inefficient because it can't utilize indices.
I know this is technically what FULLTEXT searching in MySQL is for, but the quality of results just isn't as good.
What are some alternatives to get a very high quality substring search while keeping the query efficient?
Thanks.
Sphinx will help you search quickly within a huge amount of data.
There are many full-text search engines you can use, such as Sphinx, Apache Solr, Whoosh (which is pure Python), and Xapian. If you are using Django, django-haystack can interface with the last three.
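For instance, Whoosh can index character n-grams, which gives you indexed substring matching instead of LIKE scans. A minimal sketch (the index directory and field names are made up):

import os
from whoosh.fields import Schema, ID, NGRAMWORDS
from whoosh.index import create_in
from whoosh.qparser import QueryParser

os.makedirs("indexdir", exist_ok=True)
schema = Schema(id=ID(stored=True),
                title=NGRAMWORDS(minsize=3, maxsize=10, stored=True))
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(id=u"1", title=u"Fried Green Tomatoes")
writer.commit()

# Partial words match too, because the index stores 3-10 character grams.
with ix.searcher() as searcher:
    query = QueryParser("title", ix.schema).parse(u"fried green tomato")
    for hit in searcher.search(query):
        print(hit["title"])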