Improve speed for list concatenation? - python

I have a list called L inside a loop that must iterate through millions of lines. The salient features are:
for line in lines:
    L = ['a', 'list', 'with', 'lots', 'of', 'items']
    L[3] = 'prefix_text_to_item3' + L[3]
    # Do more stuff with L...
Is there a better approach to adding text to a list item that would speed up my code? Can .join be used? Thanks.

In performance-oriented code it is not a good idea to build a string by repeatedly adding pieces together; it is preferable to use a single "".join(_items2join_) instead. (I found some benchmarks here: http://www.skymind.com/~ocrow/python_string/)
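For example, here is a minimal sketch (the names are illustrative, not from the question) of the difference between repeated + and a single join:
parts = ['lots', 'of', 'small', 'strings'] * 1000
# Repeated concatenation: may build a new string object on every +=
s = ''
for p in parts:
    s += p
# Single join: computes the total size once and copies each piece once
s2 = ''.join(parts)
assert s == s2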

Since accessing an element in a Python list is O(1) and appending an item to a list is amortized O(1), the code you have provided is running about as fast as it can, as far as I can tell. :) You probably can't afford to do this, but when I need to process that much information quickly I go to C++ or some other compiled language; things run much faster. For the time complexity of list operations in Python, you may consult this web site: http://wiki.python.org/moin/TimeComplexity and here: What is the runtime complexity of python list functions?

Don't actually create list objects.
Use generator functions and generator expressions.
def appender(some_list, some_text):
    for item in some_list:
        yield item + some_text
This appender function does not actually create a new list. It avoids some of the memory management overheads associated with creating a new list.
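As a hypothetical usage sketch (the list and suffix below are illustrative, not from the question), the generator is consumed lazily and only turned into a list if one is really needed:
L = ['a', 'list', 'with', 'lots', 'of', 'items']
for text in appender(L, '_suffix'):
    print(text)                       # items are produced one at a time
new_L = list(appender(L, '_suffix'))  # only if an actual list is required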

There may be a better approach depending on what you are doing with list L.
For instance, if you are printing it, something like this may be faster.
print "{0} {1} {2} {3}{4} {5}".format(L[0], L[1], L[2], 'prefix_text_to_item3', L[3], L[4])
What happens to L later in the program?

Related

str.split() in the for-loop instantiation, does it cause slower execution?

I'm a sucker for reducing code to its bare minimum and love keeping it short and slim, but occasionally I get into the dilemma of whether I'm doing more harm than good. Below is an example of a situation I frequently encounter and where I start pondering if I am minifying at the expense of speed.
str = "my name is john"
##Alternative 1
for el in str.split(" "):
print(el)
##Alternative 2
splittedStr = str.split(" ")
for el in splittedStr:
print(el)
Which one is faster? I'd assume it's the second, because we don't split the string on every iteration (I'm not even sure that happens in the first one).
str.split(" ") does the exact same thing in both cases. It creates an anonymous list of the split strings. In the second case you have the minor overhead of assigning it to a variable and then fetching the value of the variable. Its wasted time if you don't need to keep the object for other reasons. But this is a trivial amount of time compared to other object referencing taking place in the same loop. Alternative 2 also leaves the data in memory which is another small performance issue.
The real reason Alternative 1 is better than 2, IMHO, is that it doesn't leave the hint that splittedStr is going to be needed later.
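If you want to convince yourself that the split only happens once in Alternative 1, a quick sketch (illustrative, not from the answer) makes it visible:
def noisy_split(s):
    print("splitting")       # printed once per call
    return s.split(" ")

for el in noisy_split("my name is john"):
    print(el)
# "splitting" appears only once: the iterable in a for loop is evaluated once,
# not on every iteration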
Look, my friend, if you want to actually reduce running time in general, loop over a tuple instead of a list. Assigning the result to a variable and then looping over the variable is not the best approach, since you reserve a memory location just to store the value; still, you may sometimes do it for the sake of clean code, for example when more than one operation is packed into a single line, like
min(my_str.split(" ")[3:10])
In that case it is better to introduce a variable called min_value, for example, just to keep things readable.
Returning to the performance question, you can notice a difference when looping over a tuple versus a list.
This loops over a tuple:
for i in (1, 2, 3):
    print(i)
and this loops over a list:
for i in [1, 2, 3]:
    print(i)
You will find that the tuple version is faster!
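If you want to measure that claim yourself, a rough timeit sketch (illustrative, not from the answer) would look something like this:
import timeit

# Times the whole loop, including building the literal on each run
print(timeit.timeit('for i in (1, 2, 3): pass', number=1000000))   # tuple
print(timeit.timeit('for i in [1, 2, 3]: pass', number=1000000))   # list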

Efficient solution for merging 2 sorted lists in Python

I am teaching myself Python by starting with the crash course published by Google. One of the practice problems is to write a function that takes 2 sorted lists, merges them together, and returns a sorted list. The most obvious solution is:
def linear_merge(list1, list2):
    list = list1 + list2
    list.sort()
    return list
Obviously the above is not very efficient, or so I thought, because behind the scenes the sort function has to run over the entire combined list again. The problem asks for an efficient way of implementing this function, presumably one that works well on huge lists. My code was similar to Google's answer, but I tweaked it a bit to make it faster:
def linear_merge_goog(list1, list2):
    result = []
    while len(list1) and len(list2):
        if list1[-1] > list2[-1]:
            result.append(list1.pop())
        else:
            result.append(list2.pop())
    result.extend(list1)
    result.extend(list2)
    return result[::-1]
The original Google code popped from the front of the list, but even they note that it is much more efficient to pop from the back of the list and then reverse the result.
I tried running both functions with large 20-million-entry lists, and the simple combine-and-sort function comes out on top by a margin of 3x+ every time: under 1 second versus over 3 seconds for what should be the more efficient method.
Any ideas? Am I missing something? Does it have to do with the built-in sort function being compiled while my code is interpreted (doesn't sound likely)? Any other ideas?
It's because of the Python implementation of .sort(). Python uses something called Timsort.
Timsort is a kind of mergesort. Its distinguishing characteristic is that it identifies "runs" of presorted data that it uses for the merge. In real-world data, sorted runs inside unsorted data are very common, and two already-sorted arrays can be merged in O(n) time. This can cut down tremendously on sort times, which are typically O(n log n).
So what's happening is that when you call list.sort() in Python, it identifies the two sorted runs list1 and list2 and merges them in O(n) time. Additionally, this implementation is compiled C code, which will be faster than an interpreted Python implementation doing the same thing.
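As an aside (not from the original answers, and the function name here is just for illustration), the standard library also exposes an O(n) merge of already-sorted iterables directly, which avoids both the sort call and the manual popping:
import heapq

def linear_merge_heapq(list1, list2):
    # heapq.merge walks both sorted inputs lazily and yields a sorted stream
    return list(heapq.merge(list1, list2))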

list membership test or set

Is it more efficient to check if an item is already in a list before adding it:
for word in open('book.txt','r').read().split():
    if word in list:
        pass
    else:
        list.append(word)
or to add everything and then run set() on it, like this:
for word in open('book.txt','r').read().split():
    list.append(word)
list = set(list)
If the ultimate intention is to construct a set, construct it directly and don't bother with the list:
words = set(open('book.txt','r').read().split())
This will be simple and efficient.
Like your original code, this has the downside of first reading the entire file into memory. If that's an issue, it can be solved by reading one line at a time:
words = set(word for line in open('book.txt', 'r') for word in line.split())
(Thanks @Steve Jessop for the suggestion.)
Definitely don't take the first approach in your question, unless you know the list to be short, as it will need to scan the entire list on every single word.
A set is a hash table while a list is an array. set membership tests are O(1) while list membership tests are O(n). If anything, you should be filtering the list using a set, not filtering a set using a list.
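As a sketch of that last point (the stop-word set is illustrative, not from the question): keep the collection you test against as a set and iterate over the list:
stop_words = {'the', 'a', 'of'}                   # set: O(1) membership tests
words = open('book.txt', 'r').read().split()      # list: cheap to iterate over
kept = [w for w in words if w not in stop_words]  # filter the list with the set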
It's worth testing to find out; but I frequently use comprehensions to filter my lists, and I find that works well; particularly if the code is experimental and subject to change.
l = list( open( 'book.txt', 'r').read().split() )
unique_l = list(set( l ))
# maybe something else:
good_l = [ word for word in l if not word in naughty_words ]
I have heard that this helps with efficiency; but as I said, a test tells more.
The word in list test is an expensive operation. Why? Because to see if an item is in the list, you have to check every item in the list, every time. It's a Shlemiel the Painter algorithm: every lookup is O(n), and you do it n times. There's no startup cost, but it gets expensive very quickly, and you end up looking at each item far more than once - on average, len(list)/2 times per lookup.
Looking to see whether things are in a set is (usually) MUCH cheaper. Items are hashed, so you calculate the hash, look there, and if it's not there, it's not in the set - O(1). You do have to create the set the first time, so you'll look at every item once; then you look at each item one more time to see if it's already in your set. Still O(n) overall.
So, doing list(set(mylist)) is definitely preferable to your first solution.
#NPE's answer doesn't close the file explicitly. It's better to use a context manager
with open('book.txt','r') as fin:
    words = set(fin.read().split())
For normal text files this is probably adequate. If it's an entire DNA sequence for example you probably don't want to read the entire file into memory at once.
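Combining both points, a sketch that keeps the context manager and still avoids loading the whole file at once:
words = set()
with open('book.txt', 'r') as fin:
    for line in fin:                # read one line at a time
        words.update(line.split())  # add that line's words to the set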

Improving Python Snippet Performance

This statement is running quite slowly, and I have run out of ideas to optimize it. Could someone help me out?
[dict(zip(small_list1, small_list2)) for small_list2 in really_huge_list_of_list]
The small_lists contain only about 6 elements.
A really_huge_list_of_list of size 209,510 took approximately 16.5 seconds to finish executing.
Thank you!
Edit:
really_huge_list_of_list is a generator. Apologies for any confusion.
The size is obtained from the result list.
Possible minor improvement:
[dict(itertools.izip(small_list1, small_list2)) for small_list2 in really_huge_list_of_list]
Also, you may consider using a generator instead of a list comprehension.
To expand on what the comments are trying to say, you should use a generator instead of that list comprehension. Your code currently looks like this:
[dict(zip(small_list1, small_list2)) for small_list2 in really_huge_list_of_list]
and you should change it to this instead:
def my_generator(input_list_of_lists):
    small_list1 = ["wherever", "small_list1", "comes", "from"]
    for small_list2 in input_list_of_lists:
        yield dict(zip(small_list1, small_list2))
What you're doing right now is taking ALL the results of iterating over your really huge list, and building up a huge list of the results, before doing whatever you do with that list of results. Instead, you should turn that list comprehension into a generator so that you never have to build up a list of 200,000 results. It's building that result list that's taking up so much memory and time.
... Or better yet, just turn that list comprehension into a generator comprehension by changing its outer brackets into parentheses:
(dict(zip(small_list1, small_list2)) for small_list2 in really_huge_list_of_list)
That's really all you need to do. The syntax for list comprehensions and generator comprehensions is almost identical, on purpose: if you understand a list comprehension, you'll understand the corresponding generator comprehension. (In this case, I wrote out the generator in "long form" first so that you'd see what that comprehension expands to).
For more on generator comprehensions, see here, here and/or here.
Hope this helps you add another useful tool to your Python toolbox!
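As a hypothetical usage sketch (process() here stands in for whatever you actually do with each dict):
results = (dict(zip(small_list1, small_list2)) for small_list2 in really_huge_list_of_list)
for d in results:
    process(d)   # each dict is built on demand; no 200,000-element list is materialized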

An ALGORITHM to create one list from "X" nested lists in Python

What is the simplest way to create this list in Python?
First, suppose I have this nested list:
oldList = [ [{'letter':'a'}], [{'letter':'b'}], [{'letter':'c'}] ]
I want a function to spit out:
newList = [ {'letter':'a'}, {'letter':'b'}, {'letter':'c'} ]
Well, this could be done manually. However, what if the lists are nested three levels deep? ...or X levels deep?
Tricky? :)
Recursive solutions are simplest, but limited to at most a few thousand levels of nesting before you get an exception about too-deep recursion. For real generality, you can eliminate recursion by keeping your own stack; iterators are good things to keep on said stack, and the whole function's best written as a generator (just call list(flatten(thelist)) if you really want a huge list result).
def flatten(alist):
    stack = [iter(alist)]
    while stack:
        current = stack.pop()
        for item in current:
            if isinstance(item, list):
                stack.append(current)
                stack.append(iter(item))
                break
            yield item
Now this should let you handle as many levels of nesting as you have virtual memory for;-).
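Applied to the example from the question, a quick usage sketch (the expected output is shown as a comment):
oldList = [[{'letter': 'a'}], [{'letter': 'b'}], [{'letter': 'c'}]]
newList = list(flatten(oldList))
# [{'letter': 'a'}, {'letter': 'b'}, {'letter': 'c'}]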
http://www.daniel-lemire.com/blog/archives/2006/05/10/flattening-lists-in-python/
From that link (with a couple of minor changes):
def flatten(l):
    if isinstance(l, list):
        return sum(map(flatten, l), [])
    else:
        return [l]
The simplest answer is this:
Only you can prevent nested lists
Do not create a list of lists using append. Create a flat list using extend.
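A minimal sketch of the difference (the values are illustrative, not from the question):
rows = [[1, 2], [3, 4]]
nested = []
flat = []
for row in rows:
    nested.append(row)   # builds a list of lists
    flat.extend(row)     # builds one flat list
# nested == [[1, 2], [3, 4]]
# flat   == [1, 2, 3, 4]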
The Python Cookbook (Martelli, Ravenscroft and Ascher, 2005, O'Reilly) also offers a solution to this flattening problem.
See 4.6, Flattening a Nested Sequence.
That solution uses generators, which can be a good thing if the lists are long.
Also, it deals equally well with lists or tuples.
Note: Oops... I rushed a bit. I'm unsure of the legality of reproducing this snippet here...
let me look for policy / precedents in this area.
Edit: later found reference as a Google Books preview.
Here's a link to this section of the book in Google books
I prefer Alex Martelli's post as usual. Just want to add a not recommended trick:
from Tkinter import _flatten
print _flatten(oldList)
