How to split a dataframe and select all possible pairs? - python

I have a dataframe that I want to separate in order to apply a certain function.
I have the fields df['beam'], df['track'], df['cycle'] and want to separate it by unique values of each of these three. Then, I want to apply this function (it works between two individual dataframes) to each pair for which df['track'] differs between the two. Also, the result doesn't change if you switch the order of the pair, so I'd like to avoid unnecessary calls to the function if possible.
I currently work through it with four nested for loops and an if conditional, but I'm absolutely sure there's a better, cleaner way.
I'd appreciate all help!
Edit: I ended up solving it like this:
I split the original dataframe into multiple by using df.groupby()
dfsplit = dict(tuple(df.groupby(['beam','track','cycle'])))
This gives a dictionary where the keys are all the unique ['beam','track','cycle'] combinations as tuples and the values are the corresponding sub-dataframes (a bare groupby object isn't indexable by group key, hence the conversion to a dict).
I combined all possible ['beam','track','cycle'] pairs with the use of itertools.combinations()
keys=list(itertools.combinations(dfsplit.keys(),2))
This generates a list of 2-element tuples where each element is one ['beam','track','cycle'] tuple itself, and it doesn't include the tuple with the order swapped, so I avoid calling the function twice for what would be the same case.
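As a quick illustration of what itertools.combinations() produces (using throwaway elements rather than the real key tuples):
from itertools import combinations
list(combinations(['a', 'b', 'c'], 2))
# [('a', 'b'), ('a', 'c'), ('b', 'c')] -- each pair appears once, never in reversed order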
I removed the combinations where 'track' was the same through a for loop
for k in keys.copy():
    if k[0][1] == k[1][1]:
        keys.remove(k)
Now I can call my function by looping through the list of combinations
for k in keys:
    function(dfsplit[k[0]], dfsplit[k[1]])
Step 3 is taking a long time, probably because I have a very large number of unique ['beam','track','cycle'] combinations so the list is very long, but also probably because I'm doing it sub-optimally. I'll keep the question open in case someone realizes a better way to do this last step.
EDIT 2:
Solved the problem with step 3, once again with itertools, just by doing
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
itertools.filterfalse returns all elements of the list that return false to the function defined, so it's doing the same as the previous for loop but selecting the false instead of removing the true. It's very fast and I believe this solves my problem for good.
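A tiny example of filterfalse on throwaway data, just to show the behaviour:
from itertools import filterfalse
list(filterfalse(lambda x: x % 2 == 0, range(6)))
# [1, 3, 5] -- only the elements for which the predicate is False are kept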

I don't know how to mark the question as solved so I'll just repeat the solution here:
import itertools

dfsplit = dict(tuple(df.groupby(['beam','track','cycle'])))
keys = list(itertools.combinations(dfsplit.keys(), 2))
keys = list(itertools.filterfalse(lambda k: k[0][1] == k[1][1], keys))
for k in keys:
    function(dfsplit[k[0]], dfsplit[k[1]])

Related

Tuple-key dictionary in python: Accessing a whole block of entries

I am looking for an efficient python method to utilise a hash table that has two keys:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from an object of key:value pairs. However, querying with an incomplete tuple key is a bit different from querying the whole key, where I either get a value or nothing. Can Python still return the matching values in time proportional to the number of key:value pairs that match? Or alternatively, is the tuple-keyed dictionary (plus list comprehension) better than using pandas.df.groupby() (but that would occupy a fair bit of memory)?
The "standard" way would be something like
from random import randint

d = {(randint(1, 10), i): "something" for i in range(200)}

def byfilter(n, d):
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d)  ## returns a list of tuples where x[0] == 5
Although in similar situations I often used next() to iterate manually, when I didn't need the full list.
However there may be some use cases where we can optimize that. Suppose you need to do a couple or more accesses by key first element, and you know the dict keys are not changing meanwhile. Then you can extract the keys in a list and sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = list(d.keys())
ls.sort()  ## I do not know why but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls)  ## returns the same list as above
This can be up to 10x faster in the best case (searching for n=1 in my example) and take more or less the same time in the worst case (n=10), because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1], you just need to add a key parameter to the sort() call
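For instance, a rough sketch of that variant (bysorted2 is just an illustrative name):
from itertools import dropwhile, takewhile

ls = list(d.keys())
ls.sort(key=lambda x: x[1])  # order by the second element of each key tuple

def bysorted2(n, ls):
    # same trimming idea as above, but matching on x[1] instead of x[0]
    return list(takewhile(lambda x: x[1] == n, dropwhile(lambda x: x[1] != n, ls)))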

How to find maximum element from a list and its index?

I have a list of ordered dictionaries. These ordered dictionaries have different sizes and can also have the same size (for example, 10 dictionaries can have the length of 30 and 20 dictionaries can have the length of 32). I want to find the maximum number of items a dictionary from the list has. I have tried this, which gets me the correct maximum length:
maximum_len = max(len(dictionary_item) for dictionary_item in item_list)
But how can I find the dictionary fields for which the maximum_len is given? Say that the maximum_len is 30, I want to also have the dictionary with the 30 keys printed. It can be any dictionary with the size 30, not a specific one. I just need the keys of that dictionary.
Well, you can always use filter:
output_dics = filter(lambda x: len(x) == maximum_len, item_list)
Then you have all the dictionaries that satisfy the condition; pick a random one or the first one.
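For example, a quick sketch of grabbing the first match (keeping in mind that in Python 3 filter() returns a lazy iterator):
first_max_dict = next(output_dics)   # first dictionary whose length equals maximum_len
print(list(first_max_dict.keys()))   # the keys of that dictionary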
Don't know if this is the easiest or most elegant way to do it, but you could just write a simple function that returns two values: the max length you already calculated, and also the corresponding dict, which you can look up via the list's .index() method using that max length.
I'm talking about something like this:
def get_max(list_of_dict):
    plot = []
    for dictionary in list_of_dict:
        plot.append(len(dictionary))
    return max(plot), list_of_dict[plot.index(max(plot))]

maximum_len, max_dict = get_max(test)
Tested it, and it works for my case, although I have only tried it on a test list of just 5 dicts of different lengths.
EDIT:
Changed the variable "dict" to "dictionary" to prevent it from shadowing the built-in name.

Collapse list of lists to eliminate redundancy

I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that contain the same elements would be collapsed into single lists. Collapsing them is easy, once I find lists to combine I can make the lists into sets and take their union, but I'm not sure how to compare the lists. Do I need to do a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, if yes, merge the lists and then start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups but my data isn't structured. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsey" in Python.
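For example, with throwaway sets:
{1, 2, 3} & {3, 4}   # {3}    -- non-empty, so it is truthy
{1, 2, 3} & {5, 6}   # set()  -- empty set, treated as False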
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]
result = []
for d in data:
    d = set(d)
    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]
print(result)
# [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
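As a tiny illustration of that idiom (with throwaway sets):
matched = [{1, 2, 3}, {3, 4}, {1, 8}]
merged = set().union(*matched)  # the * unpacks the list into separate arguments
print(merged)                   # {1, 2, 3, 4, 8}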
I assume that you want to merge lists that contain any common element.
Here is a function that checks efficiently (to the best of my knowledge) whether two lists contain at least one common element (according to the == operator):
import functools  # python 2.5+

def seematch(X, Y):
    return functools.reduce(lambda x, y: x | y,
                            functools.reduce(lambda x, y: x + y,
                                             [[k == l for k in X] for l in Y]))
It would be even faster if you used a reduce that can be interrupted when it finds "true", as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
I was trying to find an elegant way to iterate fast after having that in place, but I think a good way would be simply looping once and creating another container that will hold the "merged" lists: you loop once over the lists contained in the original list and, for each one, over the new lists created in the proxy container.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!

Python Sorting 2D List with custom Key

So I have a 2D list and want to sort it using a second file of keys. Does anyone know how I would go about doing that?
Here is an example input file:
first_nm,last_nm,gender,cwid,cred_hrs,qual_pts,gpa
John,Roe,M,44444444,40,150,3.75
Jane,Roe,F,66666666,100,260,2.6
John,Doe,M,22222222,50,140,2.8
Jane,Doe,F,88888888,80,280,3.5
Penny,Lowe,F,55555555,40,140,3.5
Lenny,Lowe,M,11111111,100,280,2.8
Denny,Lowe,M,99999999,80,260,3.25
Benny,Lowe,M,77777777,120,90,0.75
Jenny,Lowe,F,33333333,50,90,1.8
Zoe,Coe,F,0,50,130,2.6
Here are the keys to sort it by (there could be more or fewer, depending on how you want to sort it):
gender,ascend,string
gpa,descend,float
last_nm,ascend,string
And here would be the output for that input and keys:
first_nm,last_nm,gender,cwid,cred_hrs,qual_pts,gpa
Jane,Doe,F,88888888,80,280,3.5
Penny,Lowe,F,55555555,40,140,3.5
Zoe,Coe,F,00000000,50,130,2.6
Jane,Roe,F,66666666,100,260,2.6
Jenny,Lowe,F,33333333,50,90,1.8
John,Roe,M,44444444,40,150,3.75
Denny,Lowe,M,99999999,80,260,3.25
John,Doe,M,22222222,50,140,2.8
Lenny,Lowe,M,11111111,100,280,2.8
Benny,Lowe,M,77777777,120,90,0.75
I was thinking of just using the built-in sort(), but was not sure if I would be able to use it if I am sorting 3 different times. I think I would have to sort backwards? (last_nm, then gpa, then gender)
You can return a tuple from your key function to create complex sorts. And as a quick trick, multiply numeric values by -1 for a reverse sort. Your example would look something like this:
lists.sort(key=lambda x: (x[2], float(x[6]) * -1, x[1]))  # gender, gpa descending, last_nm
The list sort() method takes a boolean parameter reverse, but it applies to the whole key; you can't say that you want some parts of the key to use ascending sort and others to use descending. Sadly, there isn't a simple way to extend g.d.d.c's trick of multiplying by -1 to non-numeric data.
So if you need to handle arbitrary combinations of ascending and descending then yes, you will have to sort multiple times, working backwards over your list of keys, like you mention in your question. The built-in Python sorting algorithm, timsort, is a stable sort, which means each time you sort your 2D list with a different key the previous sort results won't get scrambled.
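A rough sketch of that multi-pass approach, assuming the data rows have been read into a list of string fields in the column order of the example file (last_nm at index 1, gender at index 2, gpa at index 6):
# apply the sort keys in reverse priority order; stability preserves the earlier passes
rows.sort(key=lambda r: r[1])                       # last_nm, ascending
rows.sort(key=lambda r: float(r[6]), reverse=True)  # gpa, descending
rows.sort(key=lambda r: r[2])                       # gender, ascending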

Python: Listing the duplicates in a list

I am fairly new to Python and I am interested in listing duplicates within a list. I know how to remove the duplicates within a list (set()) and how to list the duplicates within a list by using collections.Counter; however, for the project that I am working on this wouldn't be the most efficient method to use, since the run time would be n(n-1)/2 --> O(n^2), and n is anywhere from 5k to 50k+ string values.
So my idea is that, since Python lists are linked data structures and are assigned to memory when created, I could begin counting duplicates from the very beginning of the creation of the list.
List is created and the first index value is the word 'dog'
Second index value is the word 'cat'
Now, it would check if the second index is equal to the first index, if it is then append to another list called Duplicates.
Third index value is assigned 'dog', and the third index would check if it is equal to 'cat' then 'dog'; since it matches the first index, it is appended to Duplicates.
Fourth index is assigned 'dog', but it would check the third index only, and not the second and first, because now you can assume that since the third and second are not duplicates that the fourth does not need to check before, and since the third/first are equal, the search stops at the third index.
My project gives me these values and appends them to a list, so I would want to implement the above algorithm, because I don't care how many duplicates there are, I just want to know if there are duplicates.
I can't think of how to write the code, but I figured out the basic structure of it, though I might be completely off (using a random number generator for easier use):
for x in xrange(0,10):
    list1.append(x)
    for rev, y in enumerate(reversed(list1)):
        while x is not list1(y):
            cond()
            if ???
I really don't think you'll get better than a collections.Counter for this:
from collections import Counter

c = Counter(mylist)
duplicates = [x for x, y in c.items() if y > 1]
Building the Counter should be O(n) (unless you're using keys which are particularly bad for hashing -- but in my experience you need to try pretty hard to make that happen), and then getting the duplicates list is also O(n), giving you a total complexity of O(2n) == O(n) (for typical uses).
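Since the question mentions only needing to know whether any duplicate exists at all, a small sketch of an early-exit check (also O(n) on average) could look like this:
def has_duplicates(items):
    seen = set()
    for item in items:
        if item in seen:
            return True   # stop at the first duplicate found
        seen.add(item)
    return False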
