I have 40,000 documents, with 93.08 words per document on average, where every word is a number (which can index a dictionary) and every word has a count (frequency). Read more here.
I am deciding between two data structures to store the data and was wondering which one I should choose, i.e. which one Python people would choose!
Triple-list:
A list, where every node:
- is a list, where every node:
  - is a list of two values: word_id and count.
Double-dictionary:
A dictionary whose keys are doc_ids and whose values are dictionaries.
Each inner dictionary would have a word_id as a key and the count as a value.
I feel that the first will require less space (since it doesn't store the doc_id), while the second will be easier to handle and access. I mean, accessing the i-th element in the list is O(n), while it is constant in the dictionary, I think. Which one should I choose?
You should use a dictionary. It will make your code easier to understand and to write, and it will have a lower complexity as well.
The only reason you would use a list is if you cared about the order of the documents.
If you don't care about the order of the items, you should definitely use a dictionary, because dictionaries are used to group associated data while lists are generally used to group more generic items.
Moreover, lookups in dictionaries are faster than lookups in a list.
Lookups in lists are O(n) while lookups in dictionaries are O(1), though dictionaries use considerably more memory than lists.
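For illustration, here is a minimal sketch of the nested-dictionary layout described in the question; the doc ids, word ids and counts are made up:
corpus = {
    0: {17: 3, 42: 1, 305: 7},   # doc_id 0: {word_id: count}
    1: {42: 2, 99: 4},           # doc_id 1
}

# Average-case constant-time access to the count of word 42 in document 1:
count = corpus[1].get(42, 0)
print(count)  # 2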
Essentially you just want to store a large amount of numbers, for which the most space efficient choice is an array. These are one-dimensional so you could write a class which takes in three indices (the last being 0 for word_id and 1 for count) and does some basic addition and multiplication to find the correct 1D index.
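A hedged sketch of the index arithmetic described above, assuming a fixed number of word slots per document (padded with zeros); the class name and numbers are invented for illustration:
from array import array

class TripleArray:
    # Flat array of unsigned ints indexed as (doc, word_slot, field),
    # where field 0 is word_id and field 1 is count.
    def __init__(self, num_docs, words_per_doc):
        self.words_per_doc = words_per_doc
        self.data = array("I", [0]) * (num_docs * words_per_doc * 2)

    def _index(self, doc, word_slot, field):
        return (doc * self.words_per_doc + word_slot) * 2 + field

    def set(self, doc, word_slot, word_id, count):
        self.data[self._index(doc, word_slot, 0)] = word_id
        self.data[self._index(doc, word_slot, 1)] = count

    def get(self, doc, word_slot):
        i = self._index(doc, word_slot, 0)
        return self.data[i], self.data[i + 1]

ta = TripleArray(num_docs=3, words_per_doc=4)
ta.set(0, 0, word_id=17, count=3)
print(ta.get(0, 0))  # (17, 3)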
Related
I have an input of about 2-5 million strings of about 400 characters each, coming from a stored text file.
I need to check for duplicates before adding them to the list that I check (doesn't have to be a list, can be any other data type, the list is technically a set since all items are unique).
I can expect about 0.01% at max of my data to be non-unique and I need to filter them out.
I'm wondering if there is any faster way for me to check if the item exists in the list rather than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table for all the values to check first.
Is there any way I'm missing?
I'm on Python 2 and can at most go up to Python 3.5.
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections package). The code just above is the most memory-efficient, though.
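As a minimal sketch of that ordered-dictionary alternative (on Python 3.7+ a plain dict.fromkeys behaves the same way):
from collections import OrderedDict

def unique_in_order(data):
    # Keys keep insertion order, so taking them removes duplicates
    # while preserving the order of first appearance.
    return list(OrderedDict.fromkeys(data))

print(unique_in_order(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']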
You can try this,
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is an unordered collection of distinct elements.
Context: I am trying to speed up the execution time of k-means. For that, I pre-compute the means before the k-means execution. These means are stored in a dictionary called means_dict, whose keys are the point ids sorted in ascending order and joined by underscores, and whose values are the means of those points.
When I want to access the mean of a given set of points in means_dict during the k-means execution, I have to generate the key for that set of points, i.e. sort the point ids in ascending order and join them with underscores.
The key generation instruction takes a long time because a key may contain thousands of integers.
For each key I have a sequence of integers joined by underscores. I have to sort the sequence of integers before joining them in order to make the key unique, and I finally obtain a string key. The problem is that this process is very slow. I want to use another type of key that avoids sorting the sequence, and that key type should be faster than a string in terms of access, comparison and search.
# means_dict is the dictionary whose keys are strings (sequences of
# integers joined by underscores, for example key = "3_76_45_78_344")
# points is a dictionary containing a list of integers for each value
for k in keys:
    # this joining instruction is what takes so long
    key = "_".join(str(c) for c in sorted(points[k]))
    if key in means_dict:
        newmu.append(means_dict[key])
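A hedged sketch of the kind of alternative key the question asks about: a frozenset of the point ids is hashable and needs neither sorting nor string conversion; the names and values mirror the snippet above and are purely illustrative:
means_dict = {}
points = {0: [3, 76, 45, 78, 344]}

key = frozenset(points[0])      # no sort, no join
means_dict[key] = 12.5          # a made-up precomputed mean

if frozenset(points[0]) in means_dict:
    print(means_dict[frozenset(points[0])])  # 12.5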
Computing the means is cheap.
Did you profile your program? How much of the time is spent recomputing the means? With proper numpy arrays instead of lists of boxed Python numbers, this should be extremely cheap - definitely cheaper than constructing any such key!
The reason why computing the key is expensive is simple: it means constructing an object of varying size. Based on your description, it seems you will first build a list of boxed integers, then a tuple of boxed integers, then serialize that into a string, and then copy the string again to append the underscores. There is no way this is going to be faster than the simple - vectorizable - aggregation of computing the actual mean...
You could even use MacQueen's approach to update the means rather than recomputing them. But even that is often slower than recomputing them.
I wouldn't be surprised if your approach ends up being 10x slower than regular k-means... and probably 1000x slower than clever k-means algorithms such as Hartigan and Wong's.
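A rough sketch of the vectorized aggregation mentioned above; the shapes and ids are invented for illustration:
import numpy as np

points = np.random.rand(5000, 2)             # 5000 points in 2D
member_ids = np.array([3, 76, 45, 78, 344])  # ids of one cluster's members

# A single vectorized call, typically far cheaper than sorting thousands
# of ids and building a string key for a cache lookup.
cluster_mean = points[member_ids].mean(axis=0)
print(cluster_mean)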
Let's say I have a big list:
word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()]
# complexity O(n) --> proportional to list length "n"
I have learned that the hash function used for building up dictionaries makes lookups much faster, like so:
word_dict = dict((elt, 1) for elt in word_list)
# complexity O(1) --> constant.
Given word_list, is there a recommended, more efficient way to reduce the complexity of my code?
The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).
Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.
What is different with a dict?
Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).
In a dict, data is organized in "buckets". When a key:value pair is saved to a dict, the hash of the key is calculated and that number is used to identify the bucket into which the data is stored. Later on, when the key is looked up, hash(key) is calculated again to identify the bucket, and then only that bucket is searched. There is typically only one key:value pair per bucket, so the search can be done in O(1).
For more details, see the article about DictionaryKeys on python.org.
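A simplified illustration of that bucket idea (CPython's real dict uses open addressing, but the principle is the same): the hash of the key, reduced modulo the table size, picks where to look:
key = "apple"
num_buckets = 8
bucket_index = hash(key) % num_buckets
print(bucket_index)   # within a single run, the same key always maps to the same bucket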
How about a set?
A set is something like a dictionary with only keys and no values. The question contains this code:
word_dict = dict((elt, 1) for elt in word_list)
That is obviously a dictionary which does not need values, so a set would be more appropriate.
BTW, there is no need to create a word_list which is a list first and convert it to set or dict. The first step can be skipped:
set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
Are there any drawbacks?
Always ;)
A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.
A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.
Objects saved to sets have to be hashable, which kind-of implies that they are immutable. If it was possible to modify the object, then its hash would change, so it would be in the wrong bucket and searching for it would fail. Anyway, str, int, float, and tuple objects are immutable, so at least those can go into sets.
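As a small illustration of the hashability point: a tuple can go into a set, a list cannot:
s = set()
s.add((1, 2, 3))        # fine: tuples are immutable and hashable
try:
    s.add([1, 2, 3])    # raises TypeError: unhashable type: 'list'
except TypeError as e:
    print(e)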
Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.
Reading everything from a set is also going to be a bit slower than reading everything from a list.
All of these apply to dict as well as to set.
Some examples with timings
Writing to list vs. set:
>>> timeit.timeit('[n for n in range(1000000)]', number=10)
0.7802875302271843
>>> timeit.timeit('{n for n in range(1000000)}', number=10)
1.025623542189976
Reading from list vs. set:
>>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
0.19846207875508526
>>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
3.5699193290383846e-06
So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.
I have a list of ~30 floats. I want to see if a specific float is in my list. For example:
1 >> # For the example below my list has integers, not floats
2 >> list_a = range(30)
3 >> 5.5 in list_a
False
4 >> 1 in list_a
True
The bottleneck in my code is line 3. I search if an item is in my list numerous times, and I require a faster alternative. This bottleneck takes over 99% of my time.
I was able to speed up my code by making list_a a set instead of a list. Are there any other ways to significantly speed up this line?
The best possible time to check whether an element is in an unsorted list is O(n), because the element may be anywhere and you need to look at each item and check whether it is what you are looking for.
If the list were sorted, you could use binary search for O(log n) lookup time. You can also use a hash map for average O(1) lookup time (or the built-in set, which is basically a dictionary that accomplishes the same task).
That does not make much sense for a list of length 30, though.
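A small sketch of the two alternatives mentioned above; list_a is invented to mirror the question:
from bisect import bisect_left

list_a = list(range(30))

# O(1) average-case membership test with a set:
set_a = set(list_a)
print(5.5 in set_a)   # False
print(1 in set_a)     # True

# O(log n) membership test with binary search on the sorted list:
def contains_sorted(sorted_list, x):
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

print(contains_sorted(list_a, 5.5))  # False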
In my experience, Python indeed slows down when we search something in a long list.
To complement the suggestion above, my suggestion is to subset the list, but of course only if the list can be subset and the query can easily be assigned to the correct subset.
An example is searching for a word in an English dictionary: first subset the dictionary into 26 sections based on each word's initial letter. If the query is "apple", you only need to search the "A" section. The advantage of this is that you have greatly limited the search space, hence the speed boost.
For a numerical list, subset it either by range or by the first digit.
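A sketch of that subsetting idea: group words by first letter, then search only the relevant group (the word list here is made up):
from collections import defaultdict

words = ["apple", "apricot", "banana", "cherry", "avocado"]

by_initial = defaultdict(set)
for w in words:
    by_initial[w[0]].add(w)

query = "apple"
# Only the "a" bucket is searched:
print(query in by_initial[query[0]])   # True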
Hope this helps.
I understand that a list is different from an array. But still, O(1)? That would mean accessing an element in a list would be as fast as accessing an element in a dict, which we all know is not true.
My question is based on this document:
list
----------------------------
| Operation | Average Case |
|-----------|--------------|
| ... | ... |
|-----------|--------------|
| Get Item | O(1) |
----------------------------
and this answer:
Lookups in lists are O(n), lookups in dictionaries are amortized O(1),
with regard to the number of items in the data structure.
If the first document is true, then why is accessing a dict faster than accessing a list if they have the same complexity?
Can anybody give a clear explanation on this please? I would say it always depends on the size of the list/dict, but I need more insight on this.
Get item means getting the item at a specific index, while lookup means searching whether some element exists in the list. To do so, unless the list is sorted, you will need to iterate over all elements, performing O(n) get-item operations, which leads to O(n) lookup.
A dictionary maintains a smart data structure (a hash table) under the hood, so you do not need to query O(n) times to find whether the element exists, but only a constant number of times (in the average case), leading to O(1) lookup.
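A quick way to see the difference, timing index access against membership search on a list of one million integers:
import timeit

setup = "values = list(range(1000000))"

# Get item at an index: constant time regardless of position.
print(timeit.timeit("values[999999]", setup=setup, number=1000))
# Lookup by value: has to scan the list, so it is far slower.
print(timeit.timeit("999999 in values", setup=setup, number=1000))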
Accessing a list l at index n, i.e. l[n], is O(1) because a list is not implemented as a vanilla linked list, where one needs to follow pointers (value, next -->) n times to reach cell index n.
If the memory is contiguous and the entry size were fixed, reaching a specific entry would be trivial, as we know to jump n times the entry size (like classic arrays in C).
However, since list entries vary in size, the Python implementation uses a contiguous block of memory just for the pointers to the values. This makes indexing a list (l[n]) an operation whose cost is independent of the size of the list or the value of the index.
For more information see http://effbot.org/pyfaq/how-are-lists-implemented.htm
That is because Python stores the address of each element of a list in a separate array. When we want the element at the nth position, all we have to do is look up the nth entry of the address array, which gives us the address of the nth element of the list, from which we can get its value in O(1).
Python does some neat tricks to make these arrays expandable as the list grows. Thus we get the flexibility of lists and the speed of arrays. The trade-off here is that the interpreter has to reallocate memory for the address array whenever the list grows past a certain point.
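A small illustration of that growth behaviour: the size of the list object (which includes the pointer array) jumps in steps as items are appended, showing that CPython over-allocates and only reallocates occasionally:
import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # The pointer array was reallocated with extra headroom.
        print("len=%d size=%d bytes" % (len(lst), size))
        last_size = size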
amit has explained in his answer to this question why lookups in dictionaries are faster than lookups in lists.
The minimum possible I can see is O(log n) from a strict computer science standpoint