How can I make this Python function faster?

The following code that I wrote takes a set of 68,000 items and tries to find similar items based on where text appears in the strings. The process takes a while on this i3 4130 I'm temporarily using to code on - is there any way to speed this up? I'm making a type of 'did you mean?' function, so I need to sort results on the fly based on what the user enters.
I'm not trying to compare for similarity against a dictionary that's already been built from keywords; I'm trying to compare the user's input on the fly against all existing keys. The user may mistype a key, so that's why it would say "did you mean?", like Google search does.
Sorting does not affect the time, according to averaged tests.
def similar_movies(movie):
    start = time.clock()
    movie = capitalize(movie)
    similarmovies = {}
    allmovies = all_movies()  # returns set of all 68000 movies
    for item in allmovies:
        '''if similar(movie.lower(), item.lower()) > .5 or movie in item:  # older algorithm
            similarmovies[item] = similar(movie.lower(), item.lower())'''
        if movie in item:  # newer algorithm
            similarmovies[item] = 1.0
            print item
        else:
            similarmovies[item] = similar(movie.lower(), item.lower())
    similarmovieshigh = sorted(similarmovies, key=similarmovies.get, reverse=True)[:10]
    print time.clock() - start
    return similarmovieshigh
Other functions used:
from difflib import SequenceMatcher

def similar(a, b):
    output = SequenceMatcher(None, a, b).ratio()
    return output
def all_movies():  # returns set of all keys in sub dicts (movies)
    people = list(ratings.keys())
    allmovies = []
    for item in people:
        for i in ratings[item]:
            allmovies.append(i)
    allmovies = set(allmovies)
    return allmovies
The dictionary is in this format, except with thousands of names:
ratings={'Shane': {'Avatar': 4.2, '127 Hours': 4.7}, 'Joe': {'Into The Wild': 4.5, 'Unstoppable': 3.0}}

Your algorithm is going to be O(n²), since within every title, the in operator has to check every sub-string of the title to determine if the entered text is within it. So yeah, I can understand why you would want this to run faster.
An i3 doesn't provide much compute power, so pre-computing as much as possible is the only solution, and running extra software such as a database is probably going to give poor results, again because of the limited hardware.
You might consider using a dictionary of title words (possibly with pre-computed phonetic or stemmed variants to catch the most common misspellings - the Porter stemmer algorithm provides some helpful reduction rules, e.g. allowing "unstop" to match "unstoppable").
So, for example, one key in your dictionary would be "wild" (or a phonetic adjustment), and the value associated with that key would be a list of all titles that contain "wild"; you would have the same for "the", "into", "avatar", "hours", "127", and all other words in your list of 68,000 titles. Just as an example, your dictionary's "wild" entry might look like:
"wild": ["Into The Wild", "Wild Wild West", "Wild Things"]
(Yes, I searched for "wild" on IMDB just so this list could have more entries - probably not the best choice, but not many titles have "avatar", "unstoppable", or "hours" in them).
Common words such as "the" might have enough entries that you would want to exclude them, so a persistent copy of the dictionary might be helpful to allow you to make specific adjustments, although it isn't necessary, and the compute time should be relatively quick at start-up.
When the user types in some text, you split the text into words, apply any phonetic reductions if you choose to use them, and then concatenate all of the title lists for all of the words from the user, including duplicates.
Then, count the duplicates and sort by how many times a title was matched. If a user types "The Wild", you'd have two matches on "Into The Wild" ("the" and "wild"), so it should sort higher than titles with only "the" or "wild" but not both in them.
Your list of ratings can be searched after the final sorted list is built, with ratings appended to each entry; this operation should be quick, since your ratings are already within a dictionary, keyed by name.
This turns an O(n²) scan into what is essentially a constant-time dictionary lookup per entered word, which should make a big difference in performance, if it suits your needs.
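As a rough illustration of the word-index idea, here is a minimal sketch (it assumes the titles come from your existing all_movies() set and skips the phonetic/stemming step entirely):
from collections import defaultdict, Counter

def build_word_index(titles):
    # map each lower-cased word to the set of titles containing it
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index

def did_you_mean(query, index, limit=10):
    # count how many of the query's words each candidate title contains
    counts = Counter()
    for word in query.lower().split():
        for title in index.get(word, ()):
            counts[title] += 1
    return [title for title, _ in counts.most_common(limit)]

The index is built once at start-up (index = build_word_index(all_movies())) and each query then only touches the buckets for its own words.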

In all_movies(): instead of appending to a list you could add to a set and not cast keys() to a list:
def all_movies():
    allmovies = set()
    for item in ratings.keys():
        for i in ratings[item]:
            allmovies.add(i)
    return allmovies
EDIT: or using only one for-loop:
def all_movies():
    result = []
    for rating_dict in ratings.values():
        result += rating_dict.keys()
    return set(result)  # wrap in set() to keep the original unique-titles behaviour
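For what it's worth, the same thing can also be written as a single set comprehension (a sketch, assuming the same ratings structure as in the question):
def all_movies():
    # one pass; the set comprehension removes duplicates by itself
    return {title for user_ratings in ratings.values() for title in user_ratings}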
Nothing I could spot in similar_movies.
Also have a look at celery: http://docs.celeryproject.org/en/latest/ for multi-processing,
especially the chunks concept: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks

If you're developing for a production system, I'd suggest using a full text search engine like Whoosh (Python), Elasticsearch (Java), or Apache Solr (Java). A full text search engine is a server that builds an index to implement full text search, including fuzzy or proximity searches, efficiently. Many popular database systems also feature full text search, like PostgreSQL FTS and MySQL FTS, which may be an acceptable alternative if you are already using these database engines.
If this code is developed mostly for self-learning and you want to learn how to implement fuzzy searches, you may want to look at normalizing the movie titles in the index and the search terms. There are methods like Soundex and Metaphone that normalize search terms based on how they likely sound in English, and these normalized terms can be used to build the search index. PostgreSQL has implementations of these algorithms. Note that these algorithms are very basic building blocks; a proper full text search engine will take into account misspellings, synonyms, stop words, language-specific quirks, and optimizations like parallel/distributed processing.
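To give a feel for what such normalization does, here is a small, simplified Soundex-style encoder (a sketch only; real engines use much more elaborate rules):
def soundex(word, length=4):
    # Simplified American Soundex: first letter plus up to three digit codes.
    if not word:
        return ""
    word = word.upper()
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    encoded = [word[0]]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            encoded.append(digit)
        if ch not in "HW":  # H and W do not reset the previous code
            prev = digit
    return "".join(encoded)[:length].ljust(length, "0")

With this, soundex("Unstoppable") and the misspelling soundex("Unstopable") both come out as "U523", so an index keyed on such codes can still find the right title for a mistyped query.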

Related

Match Names of the Companies approximately

I have 12 million company names in my db. I want to match them against a list offline.
I want to know the best algorithm to do so. I have tried Levenshtein distance, but it is not giving the expected results. Could you please suggest some algorithms for this? The problem is matching companies like
G corp. ---- this needs to be mapped to G Corporation
water Inc ----- Water Incorporated
You should probably start by expanding the known suffixes in both lists (the database and the list). This will take some manual work to figure out the correct mapping, e.g. with regexps:
\s+inc\.?$ -> Incorporated
\s+corp\.?$ -> Corporation
You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.
You can then use Levenshtein distance or another fuzzy matching algorithm.
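A minimal sketch of that pipeline using only the standard library (difflib stands in here for a dedicated Levenshtein package, and the names and threshold are made up for illustration):
import re
from difflib import SequenceMatcher

SUFFIXES = [
    (re.compile(r'\s+inc\.?$', re.IGNORECASE), ' incorporated'),
    (re.compile(r'\s+corp\.?$', re.IGNORECASE), ' corporation'),
]

def normalize(name):
    # lower-case, expand known suffixes, strip remaining punctuation
    name = name.lower().strip()
    for pattern, replacement in SUFFIXES:
        name = pattern.sub(replacement, name)
    return re.sub(r'[^\w\s]', '', name)

def best_match(name, candidates, threshold=0.8):
    # return the most similar candidate above the threshold, or None
    normalized = normalize(name)
    score, match = max((SequenceMatcher(None, normalized, normalize(c)).ratio(), c)
                       for c in candidates)
    return match if score >= threshold else None

For example, best_match('G corp.', ['G Corporation', 'Water Incorporated']) returns 'G Corporation' because both names normalize to 'g corporation'.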
You can use fuzzyset: put all your company names in the fuzzy set and then match a new term to get matching scores. An example:
import fuzzyset

fz = fuzzyset.FuzzySet()
# Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)

# Now see if our sample term fuzzy matches any of those specified terms
sample_term = 'Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
Also, if you want to work with semantics instead of just the string (which works better in such scenarios), have a look at spaCy similarity. An example from the spaCy docs:
import spacy

nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
Interzoid's Company Name Match Advanced API generates similarity keys to help solve this. You call the API to generate a similarity key that eliminates all of the noise using known synonyms, Soundex, ML, etc., and you then match on the similarity key rather than the data itself for much higher match rates (commercial API; disclaimer: I work for Interzoid).
https://interzoid.com/services/getcompanymatchadvanced
Use MatchKraft to fuzzy match company names on two lists.
http://www.matchkraft.com/
Levenshtein distance is not enough to solve this problem. You also need the following:
Heuristics to improve execution time
Information retrieval (Lucene) and SQL
Company names database
It is better to use an existing tool rather than creating your program in Python.

Normalize all words in a document

I need to normalize all words in a huge corpus. Any ideas how to optimize this code? It's too slow...
texts = [[list(morph.normalize(word.upper()))[0] for word in document.split()]
         for document in documents]
documents is a list of strings, where each string is a single book's text.
morph.normalize only works on upper-case input, so I apply .upper() to all words. Moreover, it returns a set with one element, which is the normalized word (a string).
The first and obvious thing I'd do would be to cache the normalized words in a local dict, so as to avoid calling morph.normalize() more than once for a given word.
A second optimization is to alias methods to local variables - this avoids going through the whole attribute lookup + function descriptor invocation + method object instantiation on each turn of the loop.
Then, since it's a "huge" corpus, you probably want to avoid creating a full list of lists at once, which might eat all your RAM, make your computer start to swap (which is guaranteed to make it snail slow) and finally crash with a memory error. I don't know what you're supposed to do with this list of lists nor how huge each document is, but as an example I iterate over a per-document result and write it to stdout - what should really be done depends on the context and concrete use case.
NB: untested code, obviously, but at least this should get you started
def iterdocs(documents, morph):
    # keep track of already normalized words
    # beware this dict might get too big if you
    # have lots of different words. Depending on
    # your corpus, you may want to either use an LRU
    # cache instead and/or use a per-document cache
    # and/or any other appropriate caching strategy...
    cache = {}
    # aliasing methods as local variables
    # is faster for tight loops
    normalize = morph.normalize
    def norm(word):
        upw = word.upper()
        if upw in cache:
            return cache[upw]
        nw = cache[upw] = normalize(upw).pop()
        return nw
    for doc in documents:
        words = [norm(word) for word in doc.split() if word]
        yield words

for text in iterdocs(documents, morph):
    # if you need all the texts for further use,
    # at least write them to disk or some other persistence
    # mechanism and re-read them when needed.
    # Here I just write them to sys.stdout as an example
    print(text)
Also, I don't know where you get your documents from but if they are text files, you may want to avoid loading them all in memory. Just read them one by one, and if they are themselves huge don't even read a whole file at once (you can iterate over a file line by line - the most obvious choice for text).
Finally, once you've made sure your code doesn't eat up too much memory for a single document, the next obvious optimisation is parallelisation - run a process per available core and split the corpus between processes (each writing its results to a given place). Then you just have to sum up the results if you need them all at once...
Oh and yes: if that's still not enough you may want to distribute the work with some map-reduce framework - your problem looks like a perfect fit for map-reduce.
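A rough sketch of that per-core parallelisation with the standard multiprocessing module (assuming morph and documents from the question are available at module level and the per-document work can be expressed as a plain picklable function):
from multiprocessing import Pool

def normalize_document(document):
    # module-level function so it can be pickled and sent to worker processes
    cache = {}
    words = []
    for word in document.split():
        upw = word.upper()
        if upw not in cache:
            cache[upw] = morph.normalize(upw).pop()
        words.append(cache[upw])
    return words

if __name__ == '__main__':
    pool = Pool()  # one worker per available core by default
    texts = pool.map(normalize_document, documents, chunksize=10)
    pool.close()
    pool.join()

The per-worker cache is rebuilt in each process, so this trades some redundant work for simplicity; a map-reduce framework would split the corpus in much the same way.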

How to iterate through a list of lists in python?

I have a list of lists like this.
documents = [['Human machine interface for lab abc computer applications', '4'],
             ['A survey of user opinion of computer system response time', '3'],
             ['The EPS user interface management system', '2']]
Now I need to iterate through the above list and output a list of strings, as shown below (without the numbers in the original list):
documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system']
The simplest solution for doing exactly what you specified is:
documents = [sub_list[0] for sub_list in documents]
This is basically equivalent to the iterative version:
temp = []
for sub_list in documents:
    temp.append(sub_list[0])
documents = temp
This is, however, not really a general way of iterating through a multidimensional list with an arbitrary number of dimensions, since nested list comprehensions / nested for loops can get ugly; you should be safe doing it for 2- or 3-d lists, though.
If you do decide you need to flatten more than 3 dimensions, I'd recommend implementing a recursive traversal function which flattens all non-flat layers.
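Something like this recursive flattener would do it (a sketch; it treats anything that is a list as a layer to descend into):
def flatten(nested):
    # recursively yield the non-list leaves of an arbitrarily nested list
    for element in nested:
        if isinstance(element, list):
            for leaf in flatten(element):
                yield leaf
        else:
            yield element

For example, list(flatten([[1, [2, 3]], [4, [5, [6]]]])) gives [1, 2, 3, 4, 5, 6].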
If you want to simply iterate over the list and do things with the elements (rather than producing the specific result requested in the question), you could use a basic for loop:
for row in documents:
    # do stuff with the row
    print(row)
    for column in row:
        # do stuff with the columns for a particular row
        print(column)
    if int(row[1]) > 10:  # the second element is a string, so convert before comparing
        print('The value is much too large!!')
This is a language feature known as "flow control".
Note that if you only want the result given in the question, a list comprehension like the one machine yearning provided is the best way to do it.
documents = [doc[0] for doc in documents]
Note that it discards your original documents list (you are overwriting the original variable) so use the following if you want to have a copy of the first column as well as a copy of your original list:
document_first_row = [doc[0] for doc in documents]
As explained in http://docs.python.org/library/operator.html#operator.itemgetter, you can also try
from operator import itemgetter
documents = map(itemgetter(0), documents)
that should be faster than using an explicit loop.
**edit: thanks DSM. This is wrong, as it just flattens the lists. I didn't notice the extra data inside the list, after the text, that the OP wants to ignore.
Ok I'll make it really easy for you!
itertools.chain.from_iterable(documents)
As others have said, it depends on what final behavior you need. So if you need something more complex than that, use recursive traversal or, if you are like me, use an iterative traversal. I can help you with that if you need it.
The question is dead but still knowing one more way doesn't hurt:
documents = [['Human machine interface for lab abc computer applications', '4'],
             ['A survey of user opinion of computer system response time', '3'],
             ['The EPS user interface management system', '2']]

document = []
for first, *remaining in documents:
    document.append(first)
print(document)

['Human machine interface for lab abc computer applications',
 'A survey of user opinion of computer system response time',
 'The EPS user interface management system']
You can also use zip with argument unpacking to transform a list of "rows" into a list of columns:
rows = [[1, 'a', 'foo'],
        [2, 'b', 'bar'],
        [3, 'c', 'baz']]

columns = zip(*rows)

print columns
#[(1, 2, 3),
# ('a', 'b', 'c'),
# ('foo', 'bar', 'baz')]

print columns[0]
#(1, 2, 3)
the * operator passes all the rows in as separate arguments to zip
zip(*rows) == zip(row1,row2,row3,...)
zip takes all the rows and assembles columns with one item from each list
You could use a numpy array, for instance:
document = [['the quick brown fox', '2'], ['jumped over the lazy fox ', '3']]

import numpy as np
document = np.array(document)
document = document[:, 0]

Merging two datasets in Python efficiently

What would anyone consider the most efficient way to merge two datasets using Python?
A little background - this code will take 100K+ records in the following format:
{user: aUser, transaction: UsersTransactionNumber}, ...
and using the following data
{transaction: aTransactionNumber, activationNumber: assoiciatedActivationNumber}, ...
to create
{user: aUser, activationNumber: assoiciatedActivationNumber}, ...
N.B. These are not Python dictionaries, just the closest thing to portraying record format cleanly.
So in theory, all I am trying to do is create a view of two lists (or tables) joining on a common key - at first this points me towards sets (unions etc), but before I start learning these in depth, are they the way to go? So far I felt this could be implemented as:
Create a list of dictionaries and iterate over the list comparing the key each time, however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?
Manipulate the data as an in-memory SQLite table? Preferably not, as although there is no strict requirement for Python 2.4, it would make life easier.
Some kind of Set based magic?
Clarification
The whole purpose of this script, to summarise: the actual data sets are coming from two different sources. The user and transaction numbers come in the form of a CSV output from a performance test that is testing email activation code throughput. The second dataset comes from parsing the test mailboxes, which contain the transaction id and activation code. The output of this test is then a CSV that will get pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.
Apologies if my notation for the records was misleading, I have updated them accordingly.
Thanks for the replies, I am going to give two ideas a try:
Sorting the lists first (I don't know how expensive this is)
Creating a dictionary with the transactionCodes as the key, then storing the user and activation code in a list as the value
Performance isn't overly paramount for me, I just want to try and get into good habits with my Python Programming.
Here's a radical approach.
Don't.
You have two CSV files; one (users) is clearly the driver. Leave this alone.
The other -- transaction codes for a user -- can be turned into a simple dictionary.
Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".
Write your application to simply do lookups in the other collection.
Create a list of dictionaries and iterate over the list comparing the key each time,
Close. It looks like this. Note: No Sort.
import csv

with open('activations.csv', 'rb') as act_data:
    rdr = csv.DictReader(act_data)
    activations = dict((row['user'], row) for row in rdr)

with open('users.csv', 'rb') as user_data:
    rdr = csv.DictReader(user_data)
    with open('users_2.csv', 'wb') as updated_data:
        wtr = csv.DictWriter(updated_data, ['some', 'list', 'of', 'columns'])
        for user in rdr:
            user['some_field'] = activations[user['user_id_column']]['some_field']
            wtr.writerow(user)
This is fast and simple. Save the dictionaries (use shelve or pickle).
however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?
False.
One list is the "driving" list. The other is the lookup list. You'll drive by iterating through users and lookup appropriate values for transaction. This is O( n ) on the list of users. The lookup is O( 1 ) because dictionaries are hashes.
Sort the two data sets by transaction number. That way, you always only need to keep one row of each in memory.
This looks like a typical use for dictionaries with transaction number as key. But you don't have to create the common structure, just build the lookup dictionaries and use them as needed.
I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber} and then iterate over the {user: myUser, transaction: myTransactionNumber} entries and look up the needed myTransactionNumber in the map. A lookup should be O(log N) for a tree-based map (or O(1) on average for a Python dict), where N is the number of entries. So the overall complexity would be O(M log N), where M is the number of user entries.
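A minimal sketch of that map-then-iterate approach (the variable and field names below are made up to mirror the record format in the question):
# build the lookup map: transaction number -> activation record
activation_by_txn = {rec['transaction']: rec for rec in activation_records}

# drive over the user records and join on the transaction number
merged = []
for rec in user_records:
    activation = activation_by_txn.get(rec['transaction'])
    if activation is not None:
        merged.append({'user': rec['user'],
                       'activationNumber': activation['activationNumber']})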

A question on python sorting efficiency

Alright so I am making a commandline based implementation of a website search feature. The website has a list of all the links I need in alphabetical order.
Usage would be something like
./find.py LinkThatStartsWithB
So it would navigate to the webpage associated with the letter B.
My question is: what is the most efficient/smartest way to use the user's input to navigate to the webpage?
What I was thinking at first was something along the lines of using a list, taking the first letter of the word, and using its numeric position to index into the list.
(A = 1, B = 2...)
Example code:
# Use base url as starting point then add extension on end.
Base_URL = "http://www.website.com/"

# Use list index as representation of letter
Alphabetic_Urls = [
    "/extensionA.html",
    "/extensionB.html",
    "/extensionC.html",
]
Or would Dictionary be a better bet?
Thanks
How are you getting this list of URLS?
If your commandline app is crawling the website for links, and you are only looking for a single item, building a dictionary is pointless. It will take at least as long to build the dict as it would to just check as you go! E.g., just search as:
for link in mysite.getallLinks():
    if link[0] == firstletter:
        print link
If you are going to be doing multiple searches (rather than just a single commandline parameter), then it might be worth building a dictionary using something like:
import collections

d = collections.defaultdict(list)
for link in mysite.getallLinks():
    d[link[0]].append(link)  # Dict of first letter -> list of links

# Print all links starting with firstletter
for link in d[firstletter]:
    print link
Though given that there are just 26 buckets, it's not going to make that much of a difference.
The smartest way here will be whatever makes the code simplest to read. When you've only got 26 items in a list, who cares what algorithm it uses to look through it? You'd have to use something really, really stupid to make it have an impact on performance.
If you're really interested in the performance though, you'd need to benchmark different options. Looking at just the complexity doesn't tell the whole story, because it hides the factors involved. For instance, a dictionary lookup will involve computing the hash of the key, looking that up in tables, then checking equality. For short lists, a simple linear search can sometimes be more efficient, depending on how costly the hashing algorithm is.
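For illustration, here is a quick timeit sketch comparing a linear scan of a 26-entry list against a dict lookup (the URL list here is just a stand-in built from the pattern in the question):
import timeit

setup = """
urls = ["/extension%s.html" % chr(c) for c in range(ord('A'), ord('Z') + 1)]
lookup = dict((u[10], u) for u in urls)  # index 10 is the letter after the base path
"""

linear = timeit.timeit("[u for u in urls if u[10] == 'Q']", setup=setup, number=100000)
hashed = timeit.timeit("lookup['Q']", setup=setup, number=100000)
print linear, hashed

Either way the absolute numbers will be tiny for 26 items, which is really the point made above.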
If your example is really accurate though, can't you just take the first letter of the input string and predict the URL from that? ("/extension" + letter + ".html")
Dictionary!
O(1)
A dictionary would be a good choice if you have (and will always have) a small number of items. If the list of URLs is going to expand in the future, you will probably actually want to sort the URLs by their letter and then match the input against that instead of hard-coding the dictionary for each one.
Since it sounds like you're only talking about 26 total items, you probably don't have to worry too much about efficiency. Anything you come up with should be fast enough.
In general, I recommend trying to use the data structure that is the best approximation of your problem domain. For example, it sounds like you are trying to map letters to URLs: this is the "A" URL and this is the "B" URL. In that case, a mapping data structure like a dict sounds appropriate:
html_files = {
    'a': '/extensionA.html',
    'b': '/extensionB.html',
    'c': '/extensionC.html',
}
Although in this exact example you could actually cheat it and skip the data structure altogether -- '/extension%s.html' % letter.upper() :)
