matrix holes comprehension - python

This is an offshoot of a previous question which started to snowball. If I have a matrix A and I want to use the mean/average of each row's [1:] values to create another matrix B, but keep the row headings intact, this list comprehension works.
# matrix A with row headings and values
A = [('Apple', 0.95, 0.99, 0.89, 0.87, 0.93),
     ('Bear', 0.33, 0.25, 0.85, 0.44, 0.33),
     ('Crab', 0.55, 0.55, 0.10, 0.43, 0.22)]
#List Comprehension
def average(lst):
    return sum(lst) / len(lst)
B = [(a[0], average(a[1:])) for a in A]
Expected outcome
B = [('Apple', 0.926), ('Bear', 0.44), ('Crab', 0.37)]
However, if the dataset has holes in it (symbolized by 'x'), the analysis won't run, i.e.
# matrix A with row headings and values
A = [('Apple', 0.95, x, 0.89, 0.87, 0.93),
     ('Bear', 0.33, 0.25, 0.85, 0.44, 0.33),
     ('Crab', x, 0.55, 0.10, x, 0.22)]
In a matrix where the relative placement of each row and column means something, I can't just delete the "blank" entries, so how can I fill or skip over them and make this work, again? In retrospect, my data has more holes than an old bed sheet.
Also, how would I introduce the filters suggested below into the following definitions (which choke when they hit something that isn't a number) so that hitting an 'x' value would return another 'x' value?
def plus(matrix, i):
    return [row[i] for row in matrix]

def minus(matrix, i):
    return [1.00 - row[i] for row in matrix]

Try this:
B = [(a[0], average(filter(lambda elt: elt != x, a[1:]))) for a in A]
Performance could be improved by using ifilter from itertools, especially for large matrices. This should give you the expected result without changing the average function or modifying A.
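For example, a minimal sketch of the lazy variant (assuming x is whatever sentinel object marks the holes; note that ifilter returns an iterator, so average has to materialize it before calling len):
from itertools import ifilter  # Python 2; in Python 3 the built-in filter is already lazy

x = object()  # hypothetical sentinel for holes; substitute whatever you actually use

A = [('Apple', 0.95, x, 0.89, 0.87, 0.93),
     ('Bear', 0.33, 0.25, 0.85, 0.44, 0.33),
     ('Crab', x, 0.55, 0.10, x, 0.22)]

def average(iterable):
    lst = list(iterable)  # materialize: len() needs a sequence
    return sum(lst) / len(lst)

B = [(a[0], average(ifilter(lambda elt: elt != x, a[1:]))) for a in A]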
EDIT
You may want to consider implementing your matrix differently if it is sparse. If you want to keep your current implementation, you should use the value None to represent missing values. This is the Python equivalent to null that you may be familiar with from other languages.
How you implement the matrix drastically changes how you implement the functions you want, and I'll try to cover your way and an alternate method that could be more efficient for sparse matrices.
For both I'll use your example matrix with holes:
# matrix A with row headings and values
A = [('Apple', 0.95, x, 0.89, 0.87, 0.93),
     ('Bear', 0.33, 0.25, 0.85, 0.44, 0.33),
     ('Crab', x, 0.55, 0.10, x, 0.22)]
List of lists (or tuples, or whatever)
Like I said before, use None for an empty value:
A = [('Apple', 0.95, None, 0.89, 0.87, 0.93),
     ('Bear', 0.33, 0.25, 0.85, 0.44, 0.33),
     ('Crab', None, 0.55, 0.10, None, 0.22)]
B is similar to what I posted earlier:
B = [(a[0], average(filter(lambda x: x is not None, a[1:]))) for a in A]
Define column as a generator (iterable) that returns only the filled values:
def column(M, i):
    i += 1  # this will allow you to use zero-based indices if you want
    return (row[i] for row in M if row[i] is not None)
Then you can implement minus more easily and efficiently:
from operator import sub
from itertools import imap, repeat  # Python 2; in Python 3, use the built-in map

def minus(M, i):
    return list(imap(sub, repeat(1.0), column(M, i)))
Dictionaries
Another way to represent your matrix is with Python dicts. There are some advantages here, especially that you don't waste storage space if you have a lot of holes in the matrix. A con to this method is that it can be more of a pain to create the matrix depending on how you construct it.
Your example might become (whitespace for clarity):
A = [('Apple', dict([(0, 0.95),            (2, 0.89), (3, 0.87), (4, 0.93)])),
     ('Bear',  dict([(0, 0.33), (1, 0.25), (2, 0.85), (3, 0.44), (4, 0.33)])),
     ('Crab',  dict([            (1, 0.55), (2, 0.10),           (4, 0.22)]))]
This is an ugly way to construct it for sure, but if you are constructing the matrix from other data with a loop it can be a lot nicer.
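For instance, a sketch of deriving the dict form from the original tuple rows (to_dict_row and raw_rows are names I made up, and x is assumed to be the hole sentinel):
def to_dict_row(row, hole):
    # row[0] is the heading; keep only the filled values, keyed by column index
    return (row[0], dict((i, v) for i, v in enumerate(row[1:]) if v is not hole))

A = [to_dict_row(row, x) for row in raw_rows]  # raw_rows: tuples like ('Crab', x, 0.55, ...)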
Now,
B = [(a[0], sum(a[1].itervalues()) / len(a[1])) for a in A]
This is uglier than it should be but I'm not so good at Python and I can't get it to do exactly what I want...
You can define a column function which returns a generator that will be more efficient than a list comprehension:
def column(M, i):
    return (row[1][i] for row in M if i in row[1])
minus is done exactly as in the other example.
I have a feeling that there is something I'm not getting about what you want, so feel free to let me know what needs fixing. Also, my lack of Python codez probably didn't do the dictionary version justice, but it can be efficient for sparse matrices. This whole example would be easier if you created a matrix class, then you could switch implementations and see which is better for you. Good luck.

This doesn't work because x is not necessarily a number (you don't tell us what it is either).
So you probably have to write your own summing function that checks whether an item is an x or something else (maybe you'll have to use isinstance(element, int) or isinstance(element, float)).

In average(), use a loop to remove all x values from the list with lst.remove(x) before you calculate the average (you'll have to catch the ValueError that remove() raises when it can't find what it's looking for).
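A sketch of that approach (assuming every hole is the same marker value):
def average(lst, hole):
    values = list(lst)      # work on a copy so the caller's row isn't mutated
    while True:
        try:
            values.remove(hole)
        except ValueError:  # raised once no holes are left
            break
    return sum(values) / len(values)

# e.g. average(a[1:], 'x') if your holes are the string 'x'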
I recommend using something like "" for representing holes, unless you have something made up already.

Related

In Python, efficiently filter subsequences of possible combinations

Is there a way to efficiently filter items from the list of subsequences of a list?
For a minimal example, consider the list l = [-2, -1, 1, 2] and the subsequences in itertools.combinations(l, r=2). Suppose I'd like to filter the subsequences of this list so that for every number i represented in the list, at most one of (-i, i) should be in any subsequence. In this example, the desired output is [(-2, -1), (-2, 1), (-1, 2), (1, 2)].
Here's a naive way of filtering:
from itertools import combinations
def filtered(iterable, r):
    for i in combinations(iterable, r):
        for x in i:
            if x * -1 in i:  # condition line
                break
        else:
            yield i
But with increasing input size, this rejection approach wastes a lot of work. For example, filtered(list(range(-20, 20)), 5) will reject about 35% of the original subsequences. It's also much slower (about six times) than combinations().
Ideally, I'd like to keep a configurable condition so that filtering on non-numeric data remains possible.
I think that AKX is somewhat right. You definitely need to iterate through all the possible combinations, and you are already doing that with the outer for loop. However, the checking for duplicates in each combination is not optimized.
The inner for loop
for x in i:
    if x * -1 in i:  # condition line
        break
is not optimized: it is O(n^2), since the in operator scans the whole tuple.
You can make this inner for loop O(n) by using a hashset for each tuple (set() or dict() in python).
nums = set()
for num in my_tuple:
    if num * -1 in nums:
        break
    nums.add(num)
The difference is that a hashset has a lookup of O(1), so the in operator is O(1) instead of O(n). We just sacrifice a little bit of space complexity.
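Dropped into the original generator, that check might look like this (a sketch):
from itertools import combinations

def filtered(iterable, r):
    for combo in combinations(iterable, r):
        seen = set()
        for num in combo:
            if num * -1 in seen:  # O(1) membership test
                break
            seen.add(num)
        else:
            yield combo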
The last thing you could do is re-implement the combination method yourself to include this type of checking, so that tuples that violate your condition aren't produced in the first place.
If you want to do this, here is some source code for the itertools.combinations() function, and you just have to add some validation to it.
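For illustration, here is one possible shape for such a generator; the names and the conflicts predicate are my own, not itertools API:
def filtered_combinations(pool, r, conflicts):
    # yield r-combinations of the list pool, skipping any combination that
    # contains two elements a, b for which conflicts(a, b) is true
    def rec(start, chosen):
        if len(chosen) == r:
            yield tuple(chosen)
            return
        for i in range(start, len(pool)):
            cand = pool[i]
            if any(conflicts(cand, c) for c in chosen):
                continue  # prune before any bad tuple is produced
            chosen.append(cand)
            for res in rec(i + 1, chosen):
                yield res
            chosen.pop()
    return rec(0, [])

# at most one of (-i, i) per subsequence:
print(list(filtered_combinations([-2, -1, 1, 2], 2, lambda a, b: a == -b)))
# [(-2, -1), (-2, 1), (-1, 2), (1, 2)]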

Python: How to generate all combinations of lists of tuples without repeating contents of the tuple

I'm working with a bit of a riddle:
Given a dictionary with tuples for keys: dictionary = {(p,q):n}, I need to generate a list of new dictionaries of every combination such that neither p nor q repeat within the new dictionary. And during the generation of this list of dictionaries, or after, pick one of the dictionaries as the desired one based on a calculation using the dictionary values.
example of what I mean (but much smaller):
dictionary = {(1,1): 1.0, (1,2): 2.0, (1,3): 2.5, (1,4): 5.0, (2,1): 3.5, (2,2): 6.0, (2,3): 4.0, (2,4): 1.0}
becomes
listofdictionaries = [{(1,1): 1.0, (2,2): 6.0}, {(1,1): 1.0, (2,3): 4.0}, {(1,1): 1.0, (2,4): 1.0}, {(1,2): 2.0, (2,1): 3.5}, {(1,2): 2.0, (2,3): 4.0}, etc.
a dictionary like: {(1,1): 1.0, (2,1): 3.5} is not allowable because q repeats.
Now my sob story: I'm brand new to coding... but I've been trying to write this script to analyze some of my data. But I also think it's an interesting algorithm riddle. I wrote something that works with very small dictionaries but when I input a large one, it takes way too long to run (copied below). In my script attempt, I actually generated a list of combinations of tuples instead that I use to refer to my master dictionary later on in the script. I'll copy it below:
The dictionary tuple keys were generated using two lists: "ExpList1" and "ExpList2"
#first, I generate all the tuple combinations from my ExpDict dictionary
combos = itertools.combinations(ExpDict, min(len(ExpList1), len(ExpList2)))
#then I generate a list of only the combinations that don't repeat p or q
uniquecombolist = []
for foo in combos:
    counter = 0
    listofp = []
    listofq = []
    for bar in foo:
        if bar[0] in listofp or bar[1] in listofq:
            counter += 1
            break
        else:
            listofp.append(bar[0])
            listofq.append(bar[1])
    if counter == 0:
        uniquecombolist.append(foo)
After generating this list, I apply a function to all of the dictionary combinations (iterating through the tuple lists and calling their respective values from the master dictionary) and pick the combination with the smallest resulting value from that function.
I also tried to apply the function while iterating through the combinations picking the unique p,q ones and then checking whether the resulting value is smaller than the previous and keeping it if it is (this is instead of generating that list "uniquecombolist", I end up generating just the final tuple list) - still takes too long.
I think the solution lies in embedding the p,q-no-repeat and the final selecting function DURING the generation of combinations. I'm just having trouble wrapping my head around how to actually do this.
Thanks for reading!
Sara
EDIT:
To clarify, I wrote an alternative to my code that incorporates the final function (basically root mean squares) to the sets of pairs.
combos = itertools.combinations(ExpDict, min(len(ExpList1), len(ExpList2)))
prevRMSD = float('inf')
for foo in combos:
    counter = 0
    distanceSUM = 0
    listofp = []
    listofq = []
    for bar in foo:
        if bar[0] in listofp or bar[1] in listofq:
            counter += 1
            break
        else:
            listofp.append(bar[0])
            listofq.append(bar[1])
            distanceSUM = distanceSUM + RMSDdict[bar]
    RMSD = math.sqrt(distanceSUM**2 / len(foo))
    if counter == 0 and RMSD < prevRMSD:
        chosencombo = foo
        prevRMSD = RMSD
So if I could incorporate the RMS calculation during the set generation and only keep the smallest one, I think that will solve my combinatorial problem.
If I understood your problem, you are interested in all the possible combinations of pairs (p,q) with unique p's and q's respecting a given set of possible values for p's and q's. In my answer I assume those possible values are, respectively, in list_p and list_q (I think this is what you have in ExpList1 and ExpList2, am I right?)
min_size = min(len(list_p), len(list_q))
combos_p = itertools.combinations(list_p, min_size)
combos_q = itertools.permutations(list_q, min_size)
prod = itertools.product(combos_p, combos_q)
uniquecombolist = [tuple(zip(i[0], i[1])) for i in prod]
Let me know if this is what you're looking for. By the way welcome to SO, great question!
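For instance, a quick check with small lists of my own:
import itertools

list_p = [1, 2]
list_q = ['a', 'b', 'c']

min_size = min(len(list_p), len(list_q))
combos_p = itertools.combinations(list_p, min_size)
combos_q = itertools.permutations(list_q, min_size)
prod = itertools.product(combos_p, combos_q)
print([tuple(zip(i[0], i[1])) for i in prod])
# [((1, 'a'), (2, 'b')), ((1, 'a'), (2, 'c')), ((1, 'b'), (2, 'a')),
#  ((1, 'b'), (2, 'c')), ((1, 'c'), (2, 'a')), ((1, 'c'), (2, 'b'))]
Every pairing uses each p and each q at most once.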
Edit:
If you're concerned that your list may become enormous, you can always use a generator expression and apply whatever function you desire to it, e.g.,
min_size = min(len(list_p), len(list_q))
combos_p = itertools.combinations(list_p, min_size)
combos_q = itertools.permutations(list_q, min_size)
prod = itertools.product(combos_p, combos_q)
uniquecombo = (tuple(zip(y[0], y[1])) for y in prod) # this is now a generator expression, not a list -- observe the parentheses
def your_function(x):
    # do whatever you want with the values; here I'm just printing and returning
    print(x)
    return x
# now prints the minimum value
print(min(itertools.imap(your_function, uniquecombo)))
When you use generators instead of lists, the values are computed as they are needed. Here since we're interested in the minimum value, each value is computed and is discarded right away unless it is the minimum.
This answer assumes that you are trying to generate sets with |S| elements, where S is the smaller pool of tuple coordinates. The larger pool will be denoted L.
Since the set will contain |S| pairs with no repeated elements, each element from S must occur exactly once. From here, match up the permutations of L where |S| elements are chosen with the ordered elements of S. This will generate all requested sets exhaustively and without repetition.
Note that P(|L|, |S|) is equal to |L|!/(|L|-|S|)!
Depending on the sizes of the tuple coordinate pools, there may be too many permutations to enumerate.
Some code to replicate this enumeration might look like:
from itertools import permutations

S, L = range(2), range(4)  # or ExpList1, ExpList2
for p in permutations(L, len(S)):
    print(zip(S, p))
In total, your final code might look something like:
import math
import itertools

S, L = ExpList1, ExpList2
pairset_maker = lambda p: zip(S, p)
if len(S) > len(L):
    S, L = L, S
    pairset_maker = lambda p: zip(p, S)
n = len(S)
get_perm_value = lambda p: math.sqrt(sum(RMSDdict[t] for t in pairset_maker(p))**2 / n)
min_pairset = min(itertools.permutations(L, n), key=get_perm_value)
If this doesn't get you to within an order of magnitude or two of your desired runtime, then you might need to consider an algorithm that doesn't produce an optimal solution.

Pandas Equivalent of R's which()

Variations of this question have been asked before, but I'm still having trouble understanding how to actually slice a python series/pandas dataframe based on conditions that I'd like to set.
In R, what I'm trying to do is:
df[which(df[,colnumber] > somenumberIchoose),]
The which() function finds indices of row entries in a column in the dataframe which are greater than somenumberIchoose, and returns this as a vector. Then, I slice the dataframe by using these row indices to indicate which rows of the dataframe I would like to look at in the new form.
Is there an equivalent way to do this in python? I've seen references to enumerate, which I don't fully understand after reading the documentation. My attempt to get the row indices currently looks like this:
indexfuture = [ x.index(), x in enumerate(df['colname']) if x > yesterday]
However, I keep on getting an invalid syntax error. I can hack a workaround by looping through the values and doing the search manually, but that seems extremely non-pythonic and inefficient.
What exactly does enumerate() do? What is the pythonic way of finding indices of values in a vector that fulfill desired parameters?
Note: I'm using Pandas for the dataframes
I may not understand the question clearly, but it looks like the answer is easier than you think:
using pandas DataFrame:
df['colname'] > somenumberIchoose
returns a pandas series with True / False values and the original index of the DataFrame.
Then you can use that boolean series on the original DataFrame and get the subset you are looking for:
df[df['colname'] > somenumberIchoose]
should be enough.
See http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
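For example, with some made-up data:
import pandas as pd

df = pd.DataFrame({'colname': [1, 5, 3, 8], 'other': list('abcd')})
mask = df['colname'] > 4  # boolean Series: [False, True, False, True]
print(df[mask])           # keeps only the rows where mask is True
#    colname other
# 1        5     b
# 3        8     d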
From what I know of R, you might be more comfortable working with numpy, a scientific computing package similar to MATLAB.
If you want the indices of an array whose values are divisible by two, then the following would work.
import numpy

arr = numpy.arange(10)
truth_table = arr % 2 == 0
indices = numpy.where(truth_table)
values = arr[indices]
It's also easy to work with multi-dimensional arrays
arr2d = arr.reshape(2, 5)
col_index = 0  # hypothetical: whichever row you want to inspect
col_indices = numpy.where(arr2d[col_index] % 2 == 0)
col_values = arr2d[col_index, col_indices]
enumerate() returns an iterator that yields an (index, item) tuple in each iteration, so you can't (and don't need to) call .index() again.
Furthermore, your list comprehension syntax is wrong:
indexfuture = [(index, x) for (index, x) in enumerate(df['colname']) if x > yesterday]
Test case:
>>> [(index, x) for (index, x) in enumerate("abcdef") if x > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
Of course, you don't need to unpack the tuple:
>>> [tup for tup in enumerate("abcdef") if tup[1] > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
unless you're only interested in the indices, in which case you could do something like
>>> [index for (index, x) in enumerate("abcdef") if x > "c"]
[3, 4, 5]
And if you need an additional condition, pandas.Series supports elementwise operations between Series (+, -, /, *).
Just multiply the boolean indexes:
idx1 = df['lat'] == 49
idx2 = df['lng'] > 15
idx = idx1 * idx2
new_df = df[idx]
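For what it's worth, the same selection is usually spelled with the elementwise & operator, which is equivalent for boolean Series (my preference, not part of the original answer):
new_df = df[(df['lat'] == 49) & (df['lng'] > 15)]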
Instead of enumerate, I usually just use .iteritems. This saves a .index(). Namely,
[k for k, v in (df['c'] > t).iteritems() if v]
Otherwise, one has to do
df[df['c'] > t].index
This duplicates the typing of the data frame name, which can be very long and painful to type.
A nice simple and neat way of doing this is the following:
SlicedData1 = df[df.colname > somenumber]
This can easily be extended to include other criteria, such as non-numeric data:
SlicedData2 = df[(df.colname1 > somenumber) & (df.colname2 == '24/08/2018')]
And so on...

Average of two consecutive elements in the list in Python

I have a list of even number of float numbers:
[2.34, 3.45, 4.56, 1.23, 2.34, 7.89, ...].
My task is to calculate the average of the 1st and 2nd elements, the 3rd and 4th, the 5th and 6th, etc. What is a short way to do this in Python?
data = [2.34, 3.45, 4.56, 1.23, 2.34, 7.89]
print [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
Explanation:
data[::2] is the elements 2.34, 4.56, 2.34
data[1::2] is the elements 3.45, 1.23, 7.89
zip combines them into 2-tuples: (2.34, 3.45), (4.56, 1.23), (2.34, 7.89)
If the list is not too long, Paul Draper's answer is easy. If it is really long, you probably want to consider one of two other options.
First, using iterators, you can avoid copying around giant temporary lists:
avgs = [(a + b) / 2 for a, b in zip(*[iter(data)]*2)]
This does effectively the same thing, but lazily, meaning it only has to store one value at a time in memory (well, three values—a, b, and the average) instead of all of them.
iter(data) creates a lazy iterator over the data.
[iter(data)]*2 creates a list with two references to the same iterator, so when one advances, the other does as well.
Then we're using the same zip and list comprehension that Paul already explained so well. (In Python 2.x, as opposed to 3.x, zip is not lazy, so you'll want to use itertools.izip rather than zip.)
If you don't actually need the result list, but just something you can iterate over, change the outer square brackets to parentheses and it becomes a generator expression, meaning it gives you an iterator instead of a list, and you're not storing anything at all.
Notice that the itertools docs have a recipe for a grouper that does the tricky bit (and you can also find it in the third-party module more-itertools), so you can just write grouper(data, 2) instead of zip(*[iter(data)]*2), which is certainly more readable if you're doing it frequently. If you want more explanation, see How grouper works.
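For reference, the recipe is essentially the following (Python 3 spelling; Python 2 uses itertools.izip_longest):
from itertools import zip_longest  # izip_longest in Python 2

def grouper(iterable, n, fillvalue=None):
    # collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

data = [2.34, 3.45, 4.56, 1.23, 2.34, 7.89]
avgs = [(a + b) / 2 for a, b in grouper(data, 2)]  # data has even length, so no fill needed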
Alternatively, you could use NumPy arrays instead of lists:
data_array = np.array(data)
And then you can just do this:
avg_array = (data_array[::2] + data_array[1::2]) / 2
That's not only simpler (no need for explicit loops), it's also about 10x faster, and takes about 1/4th the memory.
If you want to generalize this to arbitrary-length groups…
For the iterator solution, it's trivial:
[sum(group) / size for group in zip(*[iter(data)]*size)]
For the NumPy solution, it's a bit trickier. You have to dynamically create something to iterate over data[::size], data[1::size], …, data[size-1::size], like this:
sum(data[x::size] for x in range(size)) / size
There are other ways to do this in NumPy, but as long as size isn't too big, this will be fine—and it has the advantage that the exact same trick will work for Paul Draper's solution:
[sum(group) / size for group in zip(*(data[x::size] for x in range(size)))]
s = [2.34, 3.45, 4.56, 1.23, 2.34, 7.89]
res = [(s[i] + s[i+1]) / 2 for i in range(0, len(s) - 1, 2)]
Using NumPy to find the mean/average of each two consecutive values; this is more efficient in terms of time and space complexity:
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
k = 2  # group size; in your case, 2
data1 = np.mean(data.reshape(-1, k), axis=1)  # note: len(data) must be divisible by k
Just use indexing for the task.
For a simple example:
avg = []
list1 = [2.34, 3.45, 4.56, 1.23, 2.34, 7.89]
for i in range(len(list1)):
    if i + 1 < len(list1):
        avg.append((list1[i] + list1[i+1]) / 2.0)
avg2 = [j for j in avg[::2]]
avg2 is what you want. This may be easier to understand.

What should itertools.product() yield when supplied an empty list?

I guess it's an academic question, but the second result does not make sense to me. Shouldn't it be as thoroughly empty as the first? What is the rationale for this behavior?
from itertools import product
one_empty = [ [1,2], [] ]
all_empty = []
print [ t for t in product(*one_empty) ] # []
print [ t for t in product(*all_empty) ] # [()]
Updates
Thanks for all of the answers -- very informative.
Wikipedia's discussion of the Nullary Cartesian Product provides a definitive statement:
The Cartesian product of no sets ... is the singleton set containing the empty tuple.
And here is some code you can use to work through the insightful answer from sth:
from itertools import product

def tproduct(*xss):
    return (sum(rs, ()) for rs in product(*xss))

def tup(x):
    return (x,)
xs = [ [1, 2], [3, 4, 5] ]
ys = [ ['a', 'b'], ['c', 'd', 'e'] ]
txs = [ map(tup, x) for x in xs ] # [[(1,), (2,)], [(3,), (4,), (5,)]]
tys = [ map(tup, y) for y in ys ] # [[('a',), ('b',)], [('c',), ('d',), ('e',)]]
a = [ p for p in tproduct( *(txs + tys) ) ]
b = [ p for p in tproduct( tproduct(*txs), tproduct(*tys) ) ]
assert a == b
From a mathematical point of view the product over no elements should yield the neutral element of the operation product, whatever that is.
For example on integers the neutral element of multiplication is 1, since 1 ⋅ a = a for all integers a. So an empty product of integers should be 1. When implementing a python function that returns the product of a list of numbers, this happens naturally:
def iproduct(lst):
    result = 1
    for i in lst:
        result *= i
    return result
For the correct result to be calculated with this algorithm, result needs to be initialized with 1. This leads to a return value of 1 when the function is called on an empty list.
This return value is also very reasonable for the purpose of the function. With a good product function it shouldn't matter if you first concat two lists and then build the product of the elements, or if you first build the product of both individual lists and then multiply the results:
iproduct(xs + ys) == iproduct(xs) * iproduct(ys)
If xs or ys is empty that only works if iproduct([]) == 1.
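A quick check of both claims:
assert iproduct([]) == 1
xs, ys = [2, 3], [4]
assert iproduct(xs + ys) == iproduct(xs) * iproduct(ys)  # 24 == 6 * 4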
Now the more complicated product() on iterators. Here also, from a mathematical point of view, product([]) should return the neutral element of that operation, whatever that is. It is not [], since product([], xs) == [], while a neutral element e would have to satisfy product(e, xs) == xs. It turns out, though, that [()] also isn't a neutral element:
>>> list(product([()], [1,2,3]))
[((), 1), ((), 2), ((), 3)]
In fact, product() is not really a very nice mathematical product at all, since this above equation doesn't hold:
product(*(xs + ys)) != product(product(*xs), product(*ys))
Each application of product generates an additional layer of tuples and there is no way around that, so there can't even be a real neutral element. [()] comes pretty close though, it doesn't add or remove any elements, it just adds an empty tuple to each.
[()] would in fact be the neutral element of this slightly adapted product function, which only operates on lists of tuples and doesn't add additional tuple layers on each application:
def tproduct(*xss):
    # the parameters have to be lists of tuples
    return (sum(rs, ()) for rs in product(*xss))
For this function the above product equation holds:
def tup(x): return (x,)
txs = [map(tup, x) for x in xs]
tys = [map(tup, y) for y in ys]
tproduct(*(txs + tys)) == tproduct(tproduct(*txs), tproduct(*tys))
With the additional preprocessing step of packing the input lists into tuples, tproduct() gives the same result as product(), but behaves more nicely from a mathematical point of view. Also, its neutral element is [()].
So [()] makes some sense as the neutral element of this kind of list multiplication. Even if it doesn't exactly fit product(), it is a good choice for this function, since it, for example, allows tproduct() to be defined without a special case for empty input.
As #sth already indicated, this behaviour is correct from a mathematical viewpoint. All you really need to convince yourself of is that list(itertools.product()) should have exactly one element, since once you know that it's clear what that element should be: it's got to be (for consistency) a tuple of length 0, and there's only one of those.
But the number of elements of itertools.product(l1, l2, l3, ...) should just be the product of the lengths of l1, l2, l3, ... . So the number of elements of itertools.product() should be the size of the empty product, and there's no shortage of internet sources that should persuade you that the empty product is 1.
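A quick sanity check of both counts (my own example):
from itertools import product

l1, l2, l3 = [1, 2], 'ab', [None, None, None]
assert len(list(product(l1, l2, l3))) == len(l1) * len(l2) * len(l3)  # 2 * 2 * 3 = 12
assert list(product()) == [()]  # zero arguments: exactly one element, the empty tuple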
I just wanted to point out that this is the correct practical definition as well as the correct mathematical one; that is, it's the definition that's most likely to 'just work' in boundary cases. For an example, suppose that you want to generate all strings of length n consisting of decimal digits, with the first digit nonzero. You might do something like:
import itertools
def decimal_strings(n):
    """Generate all digit strings of length n that don't start with 0."""
    for lead_digit in '123456789':
        for tail in itertools.product('0123456789', repeat=n-1):
            yield lead_digit + ''.join(tail)
What should this produce when n = 1? Well, in that case, you end up calling itertools.product with an empty product (repeat = 0). If it returned nothing, then the body of the inner for loop above would never be executed, so decimal_strings(1) would be an empty iterator; almost certainly not what you want. But since itertools.product('0123456789', repeat=0) returns a single tuple, you get the expected result:
>>> list(decimal_strings(1))
['1', '2', '3', '4', '5', '6', '7', '8', '9']
(When n = 0, of course, this function correctly raises a ValueError.)
So in short, the definition is mathematically sound, and more often than not it's also what you want. It's definitely not a Python bug!
