A question on python sorting efficiency

Alright so I am making a commandline based implementation of a website search feature. The website has a list of all the links I need in alphabetical order.
Usage would be something like
./find.py LinkThatStartsWithB
So it would navigate to the webpage associated with the letter B.
My question is: what is the most efficient/smartest way to take the user's input and navigate to the matching webpage?
What I was thinking at first was to use a list, take the first letter of the word, and use that letter's numeric position in the alphabet as the list index.
(A = 1, B = 2...)
Example code:
#Use base url as starting point then add extension on end.
Base_URL = "http://www.website.com/"
#Use list index as representation of letter
Alphabetic_Urls = [
"/extensionA.html",
"/extensionB.html",
"/extensionC.html",
]
Or would Dictionary be a better bet?
Thanks

How are you getting this list of URLS?
If your commandline app is crawling the website for links, and you are only looking for a single item, building a dictionary is pointless. It will take at least as long to build the dict as it would to just check as you go! eg, just search as:
for link in mysite.getallLinks():
    if link[0] == firstletter:
        print(link)
If you are going to be doing multiple searches (rather than just a single commandline parameter), then it might be worth building a dictionary using something like:
import collections
d = collections.defaultdict(list)
for link in mysite.getallLinks():
    d[link[0]].append(link)  # Dict of first letter -> list of links
# Print all links starting with firstletter
for link in d[firstletter]:
    print(link)
Though given that there are just 26 buckets, it's not going to make that much of a difference.

The smartest way here will be whatever makes the code simplest to read. When you've only got 26 items in a list, who cares what algorithm it uses to look through it? You'd have to use something really, really stupid to make it have an impact on performance.
If you're really interested in the performance though, you'd need to benchmark different options. Looking at just the complexity doesn't tell the whole story, because it hides the factors involved. For instance, a dictionary lookup will involve computing the hash of the key, looking that up in tables, then checking equality. For short lists, a simple linear search can sometimes be more efficient, depending on how costly the hashing algorithm is.
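One way to run such a benchmark is the standard library's timeit module. The following is only a sketch: the 26-entry list and dict are built from the hypothetical /extensionX.html pattern in the question, the function names are made up for illustration, and the timings will vary by machine and Python version.

```python
import string
import timeit

# Hypothetical data: one URL extension per letter, as in the question.
letters = string.ascii_uppercase
url_list = ["/extension%s.html" % c for c in letters]
url_dict = dict(zip(letters, url_list))

def via_list(letter):
    # Index lookup: position of the letter in the alphabet.
    return url_list[ord(letter) - ord("A")]

def via_dict(letter):
    # Hash lookup keyed by the letter itself.
    return url_dict[letter]

def via_format(letter):
    # No data structure at all: build the string directly.
    return "/extension%s.html" % letter

timings = {}
for func in (via_list, via_dict, via_format):
    timings[func.__name__] = timeit.timeit(lambda: func("Q"), number=100000)

# Fastest first; expect the differences to be tiny at this size.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("%-10s %.4f s" % (name, seconds))
```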
If your example is really accurate though, can't you just take the first letter of the input string and predict the URL from that? ("/extension" + letter + ".html")

Dictionary!
O(1)

A dictionary would be a good choice if you have (and will always have) a small number of items. If the list of URLs is going to expand in the future, you will probably want to sort the URLs by their letter and match the input against that, instead of hard-coding a dictionary entry for each one.

Since it sounds like you're only talking about 26 total items, you probably don't have to worry too much about efficiency. Anything you come up with should be fast enough.
In general, I recommend trying to use the data structure that is the best approximation of your problem domain. For example, it sounds like you are trying to map letters to URLs. E.g., this is the "A" url and this is the "B" url. In that case, a mapping data structure like a dict sounds appropriate:
html_files = {
'a': '/extensionA.html',
'b': '/extensionB.html',
'c': '/extensionC.html',
}
Although in this exact example you could actually cheat it and skip the data structure altogether -- '/extension%s.html' % letter.upper() :)
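The "skip the data structure" idea could be sketched roughly as follows. It assumes the URLs really do follow the extensionX.html pattern from the question; url_for is a hypothetical helper name, and the host is the question's placeholder.

```python
BASE_URL = "http://www.website.com/"  # placeholder host from the question

def url_for(term):
    """Return the page URL for the first letter of `term`."""
    if not term or not term[0].isalpha():
        raise ValueError("expected a term starting with a letter: %r" % (term,))
    # BASE_URL already ends with "/", so no extra slash is added here.
    return "%sextension%s.html" % (BASE_URL, term[0].upper())

print(url_for("banana"))  # http://www.website.com/extensionB.html
```

In a script like find.py, the argument would come from sys.argv[1] instead of a literal.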

Related

A more efficient way of finding value in dictionary and its position

I have a dictionary which contains (roughly) 6 elements, each of which looks like the following:
What I want to do is find a particular domain (that I pass into a method) and, if it exists, store the keyword and its position within an object. I have tried the following
def parseGoogleResponse(response, website):
    i = 0
    for item in response['items']:
        if item['formattedUrl'] == website:
            print(i)
            break
        i += 1
This approach seems a bit tedious, and i also stays the same at i = 10; I'm pretty sure there is a more efficient way. I also have to take into account that if the website is not found the first time, the code then queries the API for up to 5 pages, each containing 6 search results, so I somehow have to calculate the position if it is on a different page.
Any ideas

Dictionaries in Python have no positional index (and before Python 3.7 they did not preserve any order at all). Unlike with list-type objects, there is no direct way to ask for something's position in a dictionary.
You can rather easily check for the existence of a value in the dictionary with something like:
if website in response['items'].values():
    pass  # If you enter this section, you know it's in the dictionary
else:
    pass  # If you end up here, it isn't in the dictionary
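Since the question's loop treats response['items'] as a list of results, the position can also be tracked with enumerate. The sketch below is only an illustration: find_position and the sample pages are hypothetical, and it assumes each page holds 6 results as stated in the question.

```python
RESULTS_PER_PAGE = 6  # from the question: each page holds 6 results

def find_position(pages, website):
    """Return the 1-based overall position of `website`, or None.

    `pages` is an iterable of response dicts, one per API page, in order.
    """
    for page_number, response in enumerate(pages):
        for index, item in enumerate(response.get("items", [])):
            if item["formattedUrl"] == website:
                return page_number * RESULTS_PER_PAGE + index + 1
    return None

# Hypothetical responses standing in for two pages of API output.
page1 = {"items": [{"formattedUrl": "a.example/%d" % n} for n in range(6)]}
page2 = {"items": [{"formattedUrl": "b.example/%d" % n} for n in range(6)]}

print(find_position([page1, page2], "b.example/2"))  # 9
```

If pages is a generator that fetches one page per step, the search stops requesting pages as soon as a match is found.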

Python: Nested for loops or "next" statement

I'm a rookie hobbyist and I nest for loops when I write python, like so:
dict = {
    key1: {subkey/value1: value2}
    ...
    keyn: {subkeyn/valuen: valuen+1}
}
for key in dict:
    for subkey/value in key:
        do it to it
I'm aware of a "next" keyword that would accomplish the same goal in one line (I asked a question about how to use it but I didn't quite understand it).
So to me, a nested for loop is much more readable. Why, then, do people use "next"? I read somewhere that Python is a dynamically typed, interpreted language, and because + both concatenates strings and sums numbers, it must check variable types on each loop iteration in order to know what the operators do, etc. Does using "next" prevent this in some way, speeding up execution, or is it just a matter of style/preference?

next is precious to advance an iterator when necessary, without that advancement controlling an explicit for loop. For example, if you want "the first item in S that's greater than 100", next(x for x in S if x > 100) will give it to you, no muss, no fuss, no unneeded work (as everything terminates as soon as a suitable x is located) -- and you get an exception (StopIteration) if unexpectedly no x matches the condition. If a no-match is expected and you want None in that case, next((x for x in S if x > 100), None) will deliver that. For this specific purpose, it might be clearer to you if next was actually named first, but that would betray its much more general use.
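A small runnable illustration of those two calls, using a made-up list S:

```python
S = [12, 250, 7, 101, 400]

# First item greater than 100; scanning stops as soon as one is found.
first_big = next(x for x in S if x > 100)
print(first_big)  # 250

# With a default, no exception is raised when nothing matches.
none_found = next((x for x in S if x > 1000), None)
print(none_found)  # None

# Without a default, an exhausted iterator raises StopIteration.
try:
    next(x for x in S if x > 1000)
except StopIteration:
    print("no match")
```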
Consider, for example, the task of merging multiple sequences (e.g., a union or intersection of sorted sequences -- say, sorted files, where the items are lines). Again, next is just what the doctor ordered, because none of the sequences can dominate over the others by controlling a "main for loop". So, assuming for simplicity no duplicates can exist (a condition that's not hard to relax if needed), you keep pairs (currentitem, itsfile) in a list controlled by heapq, and the merging becomes easy... but only thanks to the magic of next to advance the correct file once its item has been used, and that file only.
import heapq
def merge(*theopentextfiles):
    theheap = []
    for afile in theopentextfiles:
        theitem = next(afile, '')
        if theitem: theheap.append((theitem, afile))
    heapq.heapify(theheap)
    while theheap:
        theitem, afile = heapq.heappop(theheap)
        yield theitem
        theitem = next(afile, '')
        if theitem: heapq.heappush(theheap, (theitem, afile))
Just try to do anything anywhere this elegant without next...!-)
One could go on for a long time, but the two use cases "advance an iterator by one place (without letting it control a whole for loop)" and "get just the first item from an iterator" account for most important uses of next.
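For comparison, the standard library's heapq.merge performs the same kind of lazy merge of already-sorted inputs. A minimal sketch, with io.StringIO objects standing in for open text files and made-up contents:

```python
import heapq
import io

# Two already-sorted "files"; io.StringIO stands in for open text files.
file_a = io.StringIO("apple\ncherry\nfig\n")
file_b = io.StringIO("banana\ndate\ngrape\n")

# heapq.merge pulls one line at a time from whichever input is next in order.
merged = [line.strip() for line in heapq.merge(file_a, file_b)]
print(merged)
# ['apple', 'banana', 'cherry', 'date', 'fig', 'grape']
```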

Excel CSV into Nested Dictionary; List Comprehensions

I have an Excel CSV file with employee records in it. Something like this:
mail,first_name,surname,employee_id,manager_id,telephone_number
blah#blah.com,john,smith,503422,503423,+65(2)3423-2433
foo#blah.com,george,brown,503097,503098,+65(2)3423-9782
....
I'm using DictReader to put this into a nested dictionary:
import csv
gd_extract = csv.DictReader(open('filename 20100331 original.csv'), dialect='excel')
employees = dict([(row['employee_id'], row) for row in gd_extract])
Is the above the proper way to do it - it does work, but is it the Right Way? Something more efficient? Also, the funny thing is, in IDLE, if I try to print out "employees" at the shell, it seems to cause IDLE to crash (there's approximately 1051 rows).
2. Remove employee_id from inner dict
The second issue: I'm putting it into a dictionary indexed by employee_id, with the value being a nested dictionary of all the values. However, employee_id is also a key:value pair inside the nested dictionary, which is a bit redundant. Is there any way to exclude it from the inner dictionary?
3. Manipulate data in comprehension
Thirdly, we need to do some manipulations on the imported data. For example, all the phone numbers are in the wrong format, so we need to do some regex there. Also, we need to convert manager_id to the actual manager's name and email address. Most managers are in the same file, while others are in an external_contractors CSV, which is similar but not quite the same format; I can import that into a separate dict though.
Are these two items things that can be done within the single list comprehension, or should I use a for loop? Or do multiple comprehensions work? (Sample code would be really awesome here.) Or is there a smarter way to do it in Python?
Cheers,
Victor

Your first part has one simple issue (which might not even be an issue). You don't handle key collisions at all (unless you intend to simply overwrite).
>>> dict([('a', 'b'), ('a', 'c')])
{'a': 'c'}
If you're guaranteed that employee_id is unique, there isn't an issue though.
2) Sure you can exclude it, but no real harm done. Actually, especially in python, if employee_id is a string or int (or some other primitive), the inner dict's reference and the key actually reference the same thing. They both point to the same spot in memory. The only duplication is in the reference (which isn't that big). If you're worried about memory consumption, you probably don't have to.
3) Don't try to do too much in one list comprehension. Just use a for loop after the first list comprehension.
To sum it all up, it sounds like you're really worried about the performance of iterating over the loop twice. Don't worry about performance initially. Performance problems come from algorithm problems, not specific language constructs like for loops vs list comprehensions.
If you're familiar with Big O notation, the list comprehension and for loop after (if you decide to do that) both have a Big O of O(n). Add them together and you get O(2n), but as we know from Big O notation, we can simplify that to O(n). I've over simplified a lot here, but the point is, you really don't need to worry.
If there are performance concerns, raise them after you've written the code, and prove them to yourself with a code profiler.
response to comments
As for your #2 reply, python really doesn't have a lot of mechanisms for making one-liners cute and extra snazzy. It's meant to force you into simply writing the code out rather than sticking it all in one line. That being said, it's still possible to do quite a bit of work in one line. My suggestion is to not worry about how much code you can stick in one line. Python looks a lot more beautiful (IMO) when it's written out, not jammed into one line.
As for your #1 reply, you could try something like this:
employees = {}
for row in gd_extract:
    if row['employee_id'] in employees:
        ... handle duplicates in employees dictionary ...
    else:
        employees[row['employee_id']] = row
As for your #3 reply, not sure what you're looking for and what about the telephone numbers you'd like to fix, but... this may give you a start:
import re
retelephone = re.compile(r'[-\(\)\s]')  # remove dashes, open/close parens, and spaces
for empid, row in employees.items():
    # sub() returns a new string, so store the result back in the row
    row['telephone_number'] = retelephone.sub('', row['telephone_number'])
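Putting the three points together, one possible shape is a plain loop for loading and cleaning, followed by a second pass for the manager lookup. This is a sketch under assumptions: the sample rows, addresses, and IDs are made up, the phone clean-up is just the character stripping shown above, and managers missing from the file are left as None rather than looked up in a contractors file.

```python
import csv
import io
import re

# Hypothetical sample in the question's column layout.
SAMPLE = """mail,first_name,surname,employee_id,manager_id,telephone_number
john@example.com,john,smith,1001,1002,+65(2)3423-2433
mary@example.com,mary,jones,1002,,+65(2)3423-9782
"""

PHONE_JUNK = re.compile(r"[-()\s]")  # dashes, parentheses, whitespace

employees = {}
for row in csv.DictReader(io.StringIO(SAMPLE)):
    emp_id = row.pop("employee_id")  # used as the key, dropped from the inner dict
    if emp_id in employees:
        raise ValueError("duplicate employee_id: %s" % emp_id)
    row["telephone_number"] = PHONE_JUNK.sub("", row["telephone_number"])
    employees[emp_id] = row

# Second pass: managers can only be resolved once every row is loaded.
for row in employees.values():
    manager = employees.get(row["manager_id"])
    row["manager_name"] = manager["first_name"] if manager else None
    row["manager_mail"] = manager["mail"] if manager else None

print(employees["1001"]["telephone_number"])  # +65234232433
print(employees["1001"]["manager_name"])      # mary
```

With a real file, io.StringIO(SAMPLE) would be replaced by the opened CSV file.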

How do I know what data type to use in Python?

I'm working through some tutorials on Python and am at a position where I am trying to decide what data type/structure to use in a certain situation.
I'm not clear on the differences between arrays, lists, dictionaries and tuples.
How do you decide which one is appropriate - my current understanding doesn't let me distinguish between them at all - they seem to be the same thing.
What are the benefits/typical use cases for each one?

How do you decide which data type to use? Easy:
You look at which are available and choose the one that does what you want. And if there isn't one, you make one.
In this case a dict is a pretty obvious solution.

Tuples first. These are list-like things that cannot be modified. Because the contents of a tuple cannot change, you can use a tuple as a key in a dictionary. That's the most useful place for them in my opinion. For instance if you have a list like item = ["Ford pickup", 1993, 9995] and you want to make a little in-memory database with the prices you might try something like:
db = {}
ikey = (item[0], item[1])  # tuple() takes a single iterable, so build the pair directly
idata = item[2]
db[ikey] = idata
Lists are like arrays or vectors in other programming languages and are usually used for the same types of things in Python. However, they are more flexible in that you can put different types of things into the same list. Generally, they are the most flexible data structure, since you can put a whole list into a single element of another list, but for real data crunching they may not be efficient enough.
a = [1,"fred",7.3]
b = []
b.append(1)
b[0] = "fred"
b.append(a) # now the second element of b is the whole list a
Dictionaries are often used a lot like lists, but you can use any hashable (typically immutable) object as the index. However, unlike lists, dictionaries are not sequences and can't be sorted in place (since Python 3.7 they preserve insertion order, but older versions gave no ordering guarantee). Of course you can create your own class that incorporates a sorted list and a dictionary in order to make a dict behave like an Ordered Dictionary, or use collections.OrderedDict. There are examples on the Python Cookbook site.
c = {}
d = ("ford pickup",1993)
c[d] = 9995
Arrays are getting closer to the bit level for when you are doing heavy duty data crunching and you don't want the frills of lists or dictionaries. They are not often used outside of scientific applications. Leave these until you know for sure that you need them.
Lists and Dicts are the real workhorses of Python data storage.
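A compact side-by-side of the four types may help. It assumes that "arrays" here means the standard library's array module (the answer could equally have had NumPy arrays in mind), and the values are made up.

```python
from array import array

# A tuple is immutable, so it can serve as a dictionary key.
prices = {("Ford pickup", 1993): 9995}

# A list can hold mixed types and grow in place.
mixed = [1, "fred", 7.3]
mixed.append(("Ford pickup", 1993))

# An array holds one fixed numeric type ('d' means C double).
samples = array("d", [1.0, 2.5, 4.0])
samples.append(3)           # stored as the float 3.0
try:
    samples.append("fred")  # not a number, so this is rejected
except TypeError as exc:
    print("rejected:", exc)

print(prices[("Ford pickup", 1993)])  # 9995
print(samples[-1])                    # 3.0
```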

Best type for counting elements like this is usually defaultdict
from collections import defaultdict
s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
d = defaultdict(int)
for c in s:
    d[c] += 1
print(d['a'])  # prints 7
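The standard library also ships collections.Counter, a dict subclass built for exactly this kind of tallying. A short sketch with the same string:

```python
from collections import Counter

s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
counts = Counter(s)           # counts every character in one call
print(counts['a'])            # 7
print(counts['z'])            # 0 (missing keys count as zero)
print(counts.most_common(3))  # the three most frequent characters
```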

Do you really require speed/efficiency? Then go with a pure and simple dict.

Personal:
I mostly work with lists and dictionaries.
It seems that this satisfies most cases.
Sometimes:
Tuples can be helpful if you want to pair/match elements. Besides that, I don't really use them.
However:
I write high-level scripts that don't need to drill down into the core "efficiency" where every byte and every memory/nanosecond matters. I don't believe most people need to drill this deep.

How can I make this python function faster?

The following code that I wrote takes a set of 68,000 items and tries to find similar items based on text location in the strings. The process takes a bit on this i3 4130 I'm temporarily using to code on - is there any way to speed this up? I'm making a type of 'did you mean?' function, so I need to sort on the spot of what the user enters.
I'm not trying to compare by similarity in a dictionary that's already created using keywords, I'm trying to compare the similar between the user's input on the fly and all existing keys. The user may mistype a key, so that's why it would say "did you mean?", like Google search does.
Sorting does not affect the time, according to averaged tests.
def similar_movies(movie):
    start=time.clock()
    movie=capitalize(movie)
    similarmovies={}
    allmovies=all_movies() #returns set of all 68000 movies
    for item in allmovies:
        '''if similar(movie.lower(),item.lower())>.5 or movie in item: #older algorithm
            similarmovies[item]=similar(movie.lower(),item.lower())'''
        if movie in item: #newer algorithm,
            similarmovies[item]=1.0
            print(item)
        else:
            similarmovies[item]=similar(movie.lower(),item.lower())
    similarmovieshigh=sorted(similarmovies, key=similarmovies.get, reverse=True)[:10]
    print(time.clock()-start)
    return similarmovieshigh
Other functions used:
from difflib import SequenceMatcher
def similar(a, b):
    output=SequenceMatcher(None, a, b).ratio()
    return output
def all_movies(): #returns set of all keys in sub dicts(movies)
    people=list(ratings.keys())
    allmovies=[]
    for item in people:
        for i in ratings[item]:
            allmovies.append(i)
    allmovies=set(allmovies)
    return allmovies
The dictionary is in this format, except with thousands of names:
ratings={'Shane': {'Avatar': 4.2, '127 Hours': 4.7}, 'Joe': {'Into The Wild': 4.5, 'Unstoppable': 3.0}}

Your algorithm is going to be O(n^2), since within every title, the in operator has to check every substring of the title to determine if the entered text is within it. So yeah, I can understand why you would want this to run faster.
An i3 doesn't provide much compute power, so pre-computing as much as possible is the only solution, and running extra software such as a database is probably going to provide poor results, again due to the capability.
You might consider using a dictionary of title words (possibly with pre-computed phonetic changes to eliminate most common misspellings - the Porter Stemmer algorithm should provide some helpful reduction rules, e.g. to allow "unstop" to match "unstoppable").
So, for example, one key in your dictionary would be "wild" (or a phonetic adjustment), and the value associated with that key would be a list of all titles that contain "wild"; you would have the same for "the", "into", "avatar", "hours", "127", and all other words in your list of 68,000 titles. Just as an example, your dictionary's "wild" entry might look like:
"wild": ["Into The Wild", "Wild Wild West", "Wild Things"]
(Yes, I searched for "wild" on IMDB just so this list could have more entries - probably not the best choice, but not many titles have "avatar", "unstoppable", or "hours" in them).
Common words such as "the" might have enough entries that you would want to exclude them, so a persistent copy of the dictionary might be helpful to allow you to make specific adjustments, although it isn't necessary, and the compute time should be relatively quick at start-up.
When the user types in some text, you split the text into words, apply any phonetic reductions if you choose to use them, and then concatenate all of the title lists for all of the words from the user, including duplicates.
Then, count the duplicates and sort by how many times a title was matched. If a user types "The Wild", you'd have two matches on "Into The Wild" ("the" and "wild"), so it should sort higher than titles with only "the" or "wild" but not both in them.
Your list of ratings can be searched after the final sorted list is built, with ratings appended to each entry; this operation should be quick, since your ratings are already within a dictionary, keyed by name.
This replaces the O(n^2) scan with a dictionary lookup for each word entered, which takes roughly constant time on average (plus the cost of counting the matched titles). That should make a big difference in performance, if it suits your needs.
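The word-index idea described above could be sketched roughly like this. The names index and suggest are hypothetical, no phonetic reduction or stop-word filtering is applied, and the titles are just the examples already mentioned.

```python
from collections import Counter, defaultdict

# Titles taken from the examples above; a real list would have ~68,000.
titles = ["Into The Wild", "Wild Wild West", "Wild Things",
          "Avatar", "127 Hours", "Unstoppable"]

# Built once at start-up: word -> set of titles containing that word.
# A set (an assumption here) keeps "Wild Wild West" from counting twice.
index = defaultdict(set)
for title in titles:
    for word in title.lower().split():
        index[word].add(title)

def suggest(query, limit=10):
    """Rank titles by how many of the query's words they contain."""
    hits = Counter()
    for word in query.lower().split():
        hits.update(index.get(word, ()))
    # Most matched words first; ties broken alphabetically.
    return sorted(hits, key=lambda t: (-hits[t], t))[:limit]

print(suggest("The Wild"))
# ['Into The Wild', 'Wild Things', 'Wild Wild West']
```

Only whole-word matches are found this way, so a misspelled word still needs a fuzzy step (or a phonetic key) before the lookup.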

In all_movies(): instead of appending to a list you could add to a set and not cast keys() to a list:
def all_movies():
    allmovies = set()
    for item in ratings.keys():
        for i in ratings[item]:
            allmovies.add(i)
    return allmovies
EDIT: or only using one for-loop:
def all_movies():
    result = set()  # a set, so each title appears only once
    for rating_dict in ratings.values():
        result.update(rating_dict.keys())
    return result
Nothing I could spot in similar_movies.
Also have a look at celery: http://docs.celeryproject.org/en/latest/ for multi-processing,
especially the chunks concept: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks

If you're developing for a production system, I'd suggest using a full text search engine like Whoosh (Python), Elastic Search (Java), or Apache Solr (Java). A full text search engine is a server that builds an index to implement full text search, including fuzzy or proximity searches, efficiently. Many popular database systems also include a full text search engine, such as PostgreSQL FTS and MySQL FTS, which may be an acceptable alternative if you are already using one of these databases.
If this code is developed mostly for self-learning and you want to learn how to implement fuzzy searches, you may want to look at normalizing the movie titles in the index and the search terms. There are methods like Soundex and Metaphone that normalize search terms based on how they are likely to sound in English, and the normalized term can be used to create the search index. PostgreSQL has implementations of these algorithms. Note that these algorithms are very basic building blocks; a proper full text search engine will take into account misspellings, synonyms, stop words, language-specific quirks, and optimizations like parallel/distributed processing, etc.
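As a starting point for the self-learning route, the standard library's difflib.get_close_matches gives a simple "did you mean?" baseline. Note that this is a different technique from Soundex or Metaphone: it still compares the input against every title (it is built on the same SequenceMatcher the question uses), so it shows the normalization step and not a speed-up. The did_you_mean helper and the title list are hypothetical.

```python
import difflib

titles = ["Avatar", "127 Hours", "Into The Wild", "Unstoppable"]

# Normalise once: lower-cased title -> original title.
by_lower = {t.lower(): t for t in titles}

def did_you_mean(text, limit=3):
    """Return up to `limit` titles that closely resemble `text`."""
    close = difflib.get_close_matches(text.lower(), list(by_lower),
                                      n=limit, cutoff=0.6)
    return [by_lower[c] for c in close]

print(did_you_mean("avatr"))       # ['Avatar']
print(did_you_mean("unstopable"))  # ['Unstoppable']
```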
