I am looking for an efficient python method to utilise a hash table that has two keys:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use a comprehension (strictly speaking, a generator expression), i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from a collection of key:value pairs. However, looking up by an incomplete tuple key is different from querying the whole key, where I either get a value or nothing. Can Python still return the values in time proportional to the number of key:value pairs that match? Or, alternatively, is the tuple dictionary (plus a comprehension) better than using pandas' DataFrame.groupby() (which would take up rather a lot of memory)?
The "standard" way would be something like
from random import randint
d = {(randint(1, 10), i): "something" for i in range(200)}
def byfilter(n, d):
    # keep only the keys whose first element equals n
    return list(filter(lambda x: x[0] == n, d.keys()))
byfilter(5, d)  ## returns a list of tuples where x[0] == 5
In similar situations, though, I have often used next() to iterate manually when I didn't need the full list.
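For the next() variant, a minimal sketch could look like the following. The helper name first_match is made up for illustration, and the sample dict is the one from the question; it relies on dicts keeping insertion order (Python 3.7+).

```python
# Sample data taken from the question.
d = {(1, 5): {"a"}, (2, 3): {"b", "c"}, (2, 4): {"d"}}

def first_match(n, d):
    # Hypothetical helper: lazily scan the keys and stop at the first one
    # whose first element equals n. None is returned when nothing matches.
    return next((key for key in d if key[0] == n), None)

match = first_match(2, d)     # first key that starts with 2
missing = first_match(9, d)   # no key starts with 9
```

This only touches as many keys as it takes to reach the first hit, which is the point of using next() instead of building the whole list.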
However, there may be some use cases where we can optimize that. Suppose you need to look up entries by the first element of the key several times, and you know the dict keys are not changing in the meantime. Then you can extract the keys into a list, sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile
ls = list(d.keys())
ls.sort()  ## I do not know why but this seems faster than ls = sorted(d.keys())
def bysorted(n, ls):
    # skip keys until x[0] == n, then take keys for as long as x[0] == n
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))
bysorted(5, ls)  ## returns the same list as above
This can be up to 10x faster in the best case (n=1 in my example) and takes more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1]; you just need to add a key parameter to the sort() call.
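A rough sketch of the x[1] variant, assuming the same kind of tuple keys. The name bysorted_second and the extra (7, 3) entry are invented for the example; the other entries come from the question.

```python
from itertools import dropwhile, takewhile

# Keys from the question plus one hypothetical extra entry, (7, 3).
d = {(1, 5): {"a"}, (2, 3): {"b", "c"}, (2, 4): {"d"}, (7, 3): {"e"}}

# Sort the keys by their second element instead of the first.
ls_by_second = sorted(d.keys(), key=lambda x: x[1])

def bysorted_second(n, ls):
    # Skip keys until x[1] == n, then take keys for as long as x[1] == n.
    return list(takewhile(lambda x: x[1] == n, dropwhile(lambda x: x[1] != n, ls)))

result = bysorted_second(3, ls_by_second)
```

Because the sort is stable, keys with the same second element keep their original relative order.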
I have a list of ordered dictionaries. These ordered dictionaries have different sizes, and several can have the same size (for example, 10 dictionaries can have a length of 30 and 20 dictionaries can have a length of 32). I want to find the maximum number of items that any dictionary in the list has. I have tried this, which gets me the correct maximum length:
maximum_len= max(len(dictionary_item) for dictionary_item in item_list)
But how can I find the dictionary fields for which the maximum_len is given? Say that the maximum_len is 30, I want to also have the dictionary with the 30 keys printed. It can be any dictionary with the size 30, not a specific one. I just need the keys of that dictionary.
Well, you can always use filter:
output_dics = filter(lambda x: len(x) == maximum_len, item_list)
Then you have all the dictionaries that satisfy the condition; pick a random one or the first one.
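To spell out the "pick the first one" step, here is a small sketch. The contents of item_list are made-up sample data. Note that in Python 3 filter() returns a lazy iterator rather than a list, so next() is used to pull out the first match.

```python
from collections import OrderedDict

# Hypothetical sample data standing in for item_list from the question.
item_list = [
    OrderedDict(a=1, b=2),
    OrderedDict(a=1, b=2, c=3),
    OrderedDict(x=1, y=2, z=3),
]

maximum_len = max(len(dictionary_item) for dictionary_item in item_list)

# filter() is lazy in Python 3; next() returns the first dictionary
# whose length equals maximum_len.
output_dics = filter(lambda x: len(x) == maximum_len, item_list)
first_longest = next(output_dics)

print(list(first_longest.keys()))
```

If only one dictionary is needed, max(item_list, key=len) should give the same first-longest dictionary in a single pass, though that is a different approach from the filter shown above.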
I don't know if this is the easiest or most elegant way to do it, but you could just write a simple function that returns two values: the max_length you already calculated, and the dict that has that length, which you can look up via the .index method on the list of lengths.
I'm talking about something like this:
def get_max(list_of_dict):
    plot = []
    for dictionary in list_of_dict:
        plot.append(len(dictionary))
    # return the largest length and the first dictionary that has it
    return max(plot), list_of_dict[plot.index(max(plot))]
maximum_len, max_dict = get_max(test)  # test is my list of dicts
Tested it and it works for my case, although I only made myself a test list with just 5 dicts of different lengths.
EDIT:
Changed the variable "dict" to "dictionary" to keep it from shadowing the built-in dict.
I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that contain the same elements would be collapsed into single lists. Collapsing them is easy, once I find lists to combine I can make the lists into sets and take their union, but I'm not sure how to compare the lists. Do I need to do a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, if yes, merge the lists and then start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups but my data isn't structured. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsy" in Python.
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]
result = []
for d in data:
    d = set(d)
    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]
print(result)
# [{5, 6, 7}, {1, 2, 3, 4, 8, 9, 10}]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
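As a quick illustration of the two building blocks used above (the star-unpacked union and the truthiness of an intersection), here is a small sketch with made-up sample sets:

```python
matched = [{1, 2, 3}, {3, 4}, {1, 8, 9, 10}]

# The star unpacks the list, so this is the same as
# set().union({1, 2, 3}, {3, 4}, {1, 8, 9, 10}).
merged = set().union(*matched)

# A non-empty intersection is truthy; an empty one is falsy.
overlaps = bool({1, 2, 3} & {3, 4})
disjoint = bool({1, 2, 3} & {5, 6, 7})
```

Starting from set() means the call also works when matched holds a single set.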
I assume that you want to merge lists that contain any common element.
Here is a function that checks (efficiently, to the best of my knowledge) whether two lists contain at least one common element (according to the == operator):
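import functools  # functools.reduce needs Python 2.6+
def seematch(X, Y):
    # compare every element of X with every element of Y, then OR the results together
    return functools.reduce(lambda x, y: x | y, functools.reduce(lambda x, y: x + y, [[k == l for k in X] for l in Y]))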
It would be even faster if you used a reduce that can be interrupted as soon as it finds a True value, as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
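One way to get that early exit without a custom reduce is the built-in any(), which stops at the first True. This is a swapped-in technique rather than the reduce-based code above, and the function name is made up:

```python
def seematch_short_circuit(X, Y):
    # any() consumes the generator lazily and returns as soon as one
    # comparison is True, so not every pair has to be checked.
    return any(k == l for k in X for l in Y)

common = seematch_short_circuit([1, 2, 3], [3, 4])
nothing_shared = seematch_short_circuit([1, 2, 3], [5, 6, 7])
```

Unlike the reduce version, this also returns False for empty inputs instead of raising an error.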
I was trying to find an elegant way to iterate quickly once that is in place, but I think a good way would be simply to loop once and build another container that holds the "merged" lists. You loop once over the lists contained in the original list and, for each of them, over the lists created so far in the proxy list.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!
So I have a 2D list and want to sort it using a second file of keys. Does anyone know how I would go about doing that?
Here's an example input file:
first_nm,last_nm,gender,cwid,cred_hrs,qual_pts,gpa
John,Roe,M,44444444,40,150,3.75
Jane,Roe,F,66666666,100,260,2.6
John,Doe,M,22222222,50,140,2.8
Jane,Doe,F,88888888,80,280,3.5
Penny,Lowe,F,55555555,40,140,3.5
Lenny,Lowe,M,11111111,100,280,2.8
Denny,Lowe,M,99999999,80,260,3.25
Benny,Lowe,M,77777777,120,90,0.75
Jenny,Lowe,F,33333333,50,90,1.8
Zoe,Coe,F,0,50,130,2.6
Here are the keys to sort it by (there could be more or fewer, depending on how you want to sort it):
gender,ascend,string
gpa,descend,float
last_nm,ascend,string
And here would be the output for that input and keys:
first_nm,last_nm,gender,cwid,cred_hrs,qual_pts,gpa
Jane,Doe,F,88888888,80,280,3.5
Penny,Lowe,F,55555555,40,140,3.5
Zoe,Coe,F,00000000,50,130,2.6
Jane,Roe,F,66666666,100,260,2.6
Jenny,Lowe,F,33333333,50,90,1.8
John,Roe,M,44444444,40,150,3.75
Denny,Lowe,M,99999999,80,260,3.25
John,Doe,M,22222222,50,140,2.8
Lenny,Lowe,M,11111111,100,280,2.8
Benny,Lowe,M,77777777,120,90,0.75
I was thinking of just using the built in sort() but was not sure if I would be able to use it if I am sorting 3 different times. I think I would have to sort backwards? (last_nm, then gpa, then gender)
You can return a tuple from your key function to create complex sorts. And as a quick trick, multiply numeric values by -1 for a reverse sort. Your example would look something like this:
lists.sort(key=lambda x: (x[2], float(x[6]) * -1, x[1]))  # float() in case the gpa was read from the file as a string
The list sort() method takes a boolean parameter reverse, but it applies to the whole key; you can't say that you want some parts of the key to use ascending sort and others to use descending. Sadly, there isn't a simple way to extend g.d.d.c's trick of multiplying by -1 to non-numeric data.
So if you need to handle arbitrary combinations of ascending and descending then yes, you will have to sort multiple times, working backwards over your list of keys, like you mention in your question. The built-in Python sorting algorithm, timsort, is a stable sort, which means each time you sort your 2D list with a different key the previous sort results won't get scrambled.
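A sketch of the multiple-pass approach, using a few rows from the question. The converters mapping and the way the key file is represented as a list of tuples are assumptions made for the example, not something given in the question.

```python
import csv
import io

# A subset of the rows from the question, read as strings like a CSV file.
raw = """first_nm,last_nm,gender,cwid,cred_hrs,qual_pts,gpa
John,Roe,M,44444444,40,150,3.75
Jane,Roe,F,66666666,100,260,2.6
Jane,Doe,F,88888888,80,280,3.5
Penny,Lowe,F,55555555,40,140,3.5
Zoe,Coe,F,0,50,130,2.6
"""
reader = csv.reader(io.StringIO(raw))
header = next(reader)
rows = list(reader)

# Sort keys in priority order, mirroring the key file from the question.
sort_keys = [("gender", "ascend", "string"),
             ("gpa", "descend", "float"),
             ("last_nm", "ascend", "string")]
converters = {"string": str, "float": float}  # hypothetical type mapping

# Work backwards over the keys: the least significant key is sorted first.
# Each pass is stable, so ties keep the order set up by the earlier passes.
for name, direction, kind in reversed(sort_keys):
    col = header.index(name)
    convert = converters[kind]
    rows.sort(key=lambda row: convert(row[col]), reverse=(direction == "descend"))

for row in rows:
    print(",".join(row))
```

Passing reverse=True keeps the sort stable as well, which is what lets ascending and descending keys be mixed freely.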
I am fairly new to Python and I am interested in listing duplicates within a list. I know how to remove the duplicates ( set() ) within a list and how to list the duplicates within a list by using collections.Counter; however, for the project that I am working on this wouldn't be the most efficient method to use since the run time would be n(n-1)/2 --> O(n^2) and n is anywhere from 5k-50k+ string values.
So, my idea is that since python lists are linked data structures and are assigned to the memory when created that I begin counting duplicates from the very beginning of the creation of the lists.
List is created and the first index value is the word 'dog'
Second index value is the word 'cat'
Now, it would check if the second index is equal to the first index, if it is then append to another list called Duplicates.
Third index value is assigned 'dog', and the third index would check if it is equal to 'cat' then 'dog'; since it matches the first index, it is appended to Duplicates.
Fourth index is assigned 'dog', but it would check the third index only, and not the second and first, because now you can assume that since the third and second are not duplicates that the fourth does not need to check before, and since the third/first are equal, the search stops at the third index.
My project gives me these values and appends them to a list, so I would want to implement the above algorithm because I don't care how many duplicates there are; I just want to know if there are duplicates.
I can't think of how to write the code. I have figured out its basic structure, but I might be completely off (using a random number generator for easier use):
for x in xrange(0,10):
    list1.append(x)
    for rev, y in enumerate(reversed(list1)):
        while x is not list1(y):
            cond()
            if ???
I really don't think you'll get better than a collections.Counter for this:
from collections import Counter
c = Counter(mylist)
duplicates = [x for x, y in c.items() if y > 1]
Building the Counter should be O(n) (unless you're using keys that are particularly bad for hashing, but in my experience you need to try pretty hard to make that happen), and then getting the duplicates list is also O(n), giving you a total complexity of O(2n) == O(n) (for typical uses).
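Since the question only needs to know whether any duplicate exists, a set-based scan that stops at the first repeat may also be worth a look. This is a sketch of a different technique from the Counter above, and the function name is made up:

```python
def has_duplicates(values):
    # Remember every value seen so far; set membership tests are O(1) on
    # average, so the whole scan is O(n) for hashable values.
    seen = set()
    for value in values:
        if value in seen:
            return True  # stop at the first repeated value
        seen.add(value)
    return False

found = has_duplicates(["dog", "cat", "dog", "dog"])
none_found = has_duplicates(["dog", "cat"])
```

The same check can be run as values arrive, by keeping the seen set alongside the list while it is being built.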