I have a NumPy array of values, and I need to identify groups of consecutive values within the array.
I've tried writing a "for" loop to do this, but I'm running into a host of issues. So I've looked at the documentation for groupby in itertools. I've never used this before, and I'm more than a little confused by the documentation, so here I am.
Could someone give a more "layman's" explanation of how to use groupby? I don't need sample code, per se, just a more thorough explanation of the documentation.
A good way to handle this is to use a generator to do the grouping (it might not be the fastest method):
def groupings(a):
    g = []
    for val in a:
        if not g:
            g.append(val)
        elif abs(g[-1] - val) <= 1.00001:
            g.append(val)
        else:
            yield g        # the current run is finished
            g = [val]      # start the next group with the current value
    if g:
        yield g            # don't drop the final group

print(list(groupings(my_numpy_array)))
I know this doesn't give you a layman's explanation of groupby (group consecutive items that match some criterion... it would be somewhat painful for this type of application).
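For completeness, here is a minimal sketch of how itertools.groupby can group runs of consecutive integers, using the value-minus-index trick as the grouping key (the sample data is my own assumption; it does not handle the float tolerance used above):

from itertools import groupby

data = [1, 2, 3, 7, 8, 10]  # assumed integer sample data
groups = [
    [value for _, value in run]
    for _, run in groupby(enumerate(data), key=lambda pair: pair[1] - pair[0])
]
print(groups)  # [[1, 2, 3], [7, 8], [10]]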
I have a list of elements with a certain attribute/variable. First of all, I need to check if all attributes have the same value; if not, I need to adjust this attribute for every element to the highest value.
The thing is, it is not difficult to program such a loop. However, I would like to know the most efficient way to do so.
My current approach works fine, but it loops through the list 2 times, has two local variables and just does not feel efficient enough.
I simplified the code. This is basically what I've got:
biggest_value = 0
re_calc = 0
# pass 1: find the largest value and count how often a new maximum appears
for _, element in enumerate(element_list):
    if element.value > biggest_value:
        biggest_value = element.value
        re_calc += 1

# pass 2: only adjust if the values were not all equal
if re_calc > 1:
    for i, element in enumerate(element_list):
        element.value = adjust_value(biggest_value)
        element_list[i] = element
The thing annoying me is the necessity of the "re_calc" variable. A simple check for the biggest value is no big deal. But this task consists of 3 steps:
"Compare Attributes --> Find the Biggest Value --> Possibly Adjust the Others". However, I do not want to loop over this list 3 times, or even twice as my current approach does.
There has to be a more efficient way. Any ideas? Thanks in advance.
The first loop is just determining the largest value of the element_list. So an approach can be:
Transform the element_list into a NumPy array. Unfortunately you do not say what the list looks like, but if it contains numbers then
L = np.array(element_list)
can probably do it.
After that, use np.max(L). NumPy operations without Python for loops are usually much faster.
import numpy as np
nl = 10
L = np.random.rand(nl)
biggest_value = np.max(L)
L, biggest_value
gives
(array([0.70047074, 0.14160459, 0.75061621, 0.89013494, 0.70587705,
0.50218377, 0.31197993, 0.42670057, 0.67869183, 0.04415816]),
0.8901349369179461)
In the second for-loop it is not obvious what you want to achieve. Unfortunately you do not give an input and a desired output, and do not tell what adjust_value has to do. A minimal running code with data would be helpful for giving support.
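If the intent is just to find the largest value and only adjust when the values differ, a plain-Python sketch could look like this (my own sketch, assuming each element exposes a numeric value attribute and an adjust_value helper exists, as in the question):

values = [e.value for e in element_list]
biggest_value = max(values)

# adjust only when the values are not already all equal
if any(v != biggest_value for v in values):
    for element in element_list:
        element.value = adjust_value(biggest_value)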
I have two lists of rendered image names as input to a deep learning training algorithm, and I need to first group them into related groups (there are several files in each group, with differing numbers of samples, but also several groups for one scene, differing in camera angle).
The problem appears when I see the following types of filenames:
Desired sorting:
Scene_1_Camera_000001.exr
Scene_1_Camera_131072.exr
Scene_1_Camera_0_000001.exr
Scene_1_Camera_0_131072.exr
or:
Scene_1_Camera_0_000001.exr
Scene_1_Camera_0_131072.exr
Scene_1_Camera_000001.exr
Scene_1_Camera_131072.exr
but, actual sorting:
Scene_1_Camera_000001.exr
Scene_1_Camera_0_000001.exr
Scene_1_Camera_0_131072.exr
Scene_1_Camera_131072.exr
The trouble is that the sort goes character by character and has no concept of the fact that there may be both a Camera and a Camera_0 (I don't have control over these names; the scenes are historical), so it sorts the camera's _0 ahead of the sample number, breaking my two groups up into three.
I am currently using the following code in another place (minus error checking for clarity), and could conceivably use something similar in a custom sorting function, using the prefix as primary sorting key, and the sample number as secondary, but I am afraid that it will be horribly inefficient.
import re

res = re.search("(.*_)([0-9]{4,6}).([a-zA-Z]{3})", beauty_file)
prefix = res.group(1)
# sample = res.group(2)
# suffix = res.group(3)
Is there some way to use a custom sorting function, and to do so efficiently (there are 32000 5MB files)?
[Edit 1]
It seems that I was not quite clear enough on the desired sort order: it needs to be sorted first on scene/camera, and only then on sample number, i.e. the last six digits are the secondary key. Otherwise I would get all sample numbers together, regardless of scenes and cameras, which wouldn't allow me to gulp up grouped files together.
[Edit 2]
I prefer a solution which works with standard Python, as I may not be able to install packages on the machine on which the script will run. I am developing on Windows, due to the great debugger. I had in mind something similar to the comparison customisation typically available in templated C++ sorting functions.
32000 entries isn't a lot - the file size doesn't matter, since you're not editing the files themselves.
There are a few options I could think of:
Rename the entries in your list. Not the files themselves, but the list you've collected. Append _0 to entries that are missing it, and you'll be able to sort easily.
Sort by the last 6 characters (or all characters after the last '_'). There is already a recognizable pattern in your filenames; just omit the part that doesn't match that pattern (see the sketch after this list).
Use a sorting algorithm that can handle that criterion. natsort is a popular option.
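A rough sketch of the second option (my own sketch, assuming the sample number is always the run of digits between the last underscore and the extension, and that filenames is your collected list):

import os

def sort_key(name):
    stem, _ext = os.path.splitext(name)       # drop ".exr"
    prefix, _, sample = stem.rpartition("_")   # split off the trailing sample number
    return (prefix, int(sample))

filenames.sort(key=sort_key)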
I am not quite sure how to handle this reply when I am answering my own question, but here goes:
jmcampbell's answer led me to the following:
def compare(item1, item2):
    res1 = re.search("(.*_)([0-9]{4,6}).([a-zA-Z]{3})", item1)
    if res1 is None or len(res1.groups()) != 3:
        return (item1 > item2) - (item1 < item2)  # plain comparison as a fallback
    prefix1 = res1.group(1)
    sample1 = res1.group(2)
    res2 = re.search("(.*_)([0-9]{4,6}).([a-zA-Z]{3})", item2)
    if res2 is None or len(res2.groups()) != 3:
        return (item1 > item2) - (item1 < item2)
    prefix2 = res2.group(1)
    sample2 = res2.group(2)
    if prefix1 < prefix2:
        return -1
    elif prefix1 > prefix2:
        return 1
    elif sample1 < sample2:   # samples are zero-padded, so string comparison works
        return -1
    elif sample1 > sample2:
        return 1
    else:
        return 0
Then call the sorting function as follows:
beauty_files.sort(cmp=compare)
This gives the desired sort order.
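(For anyone on Python 3, where list.sort() no longer accepts cmp=, the same comparator can be reused by wrapping it with functools.cmp_to_key:)

from functools import cmp_to_key

beauty_files.sort(key=cmp_to_key(compare))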
Thanks for the brainstorming everyone!
If you can be sure that the numbers will always be six digits long, you could use:
listOfFilenames.sort(key=(lambda s: s[-10:-4]))
This takes a slice of the string's last six characters before the extension and sorts by those.
Please, is there any way to replace "x-y" by "x,x+1,x+2,...,y" in every row of a data frame? (Where x and y are integers.)
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc
I know that by looping through lines and using regular expressions we can do that, but the table is quite big and it takes quite some time, so I think using pandas might be faster.
Thanks a lot.
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):
    lst = s.split(',')
    newlst = []
    for i in lst:
        if "-" in i:
            start, end = (int(j) for j in i.split("-"))
            newlst.extend(range(start, end + 1))  # include the upper bound
        else:
            newlst.append(int(i))
    return newlst
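Applying it to a column could then look like this (the column name and the join back into a comma-separated string are my assumptions, based on your examples):

import pandas as pd

df = pd.DataFrame({"ranges": ["1-3,7", "1,4,6-9,11-13,5"]})
df["expanded"] = df["ranges"].apply(
    lambda s: ",".join(str(n) for n in make_list(s))
)
# "1-3,7" -> "1,2,3,7"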
I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that share elements would be collapsed into single lists. Collapsing them is easy; once I find lists to combine I can make the lists into sets and take their union, but I'm not sure how to compare the lists. Do I need to do a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, if yes, merge the lists and then start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups but my data isn't structured. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsey" in Python
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]
result = []
for d in data:
    d = set(d)
    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]
print(result)
# [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
I assume that you want to merge lists that contain any common element.
Here is a function that checks efficiently (to the best of my knowledge) whether any two lists contain at least one common element (according to the == operator):
import functools  # Python 2.5+

def seematch(X, Y):
    return functools.reduce(lambda x, y: x | y,
                            functools.reduce(lambda x, y: x + y,
                                             [[k == l for k in X] for l in Y]))
It would be even faster if you used a reduce that can be interrupted when it finds "true", as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
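For reference, the built-in any() already stops at the first match, so an equivalent check (my own variant, not taken from the linked question) can short-circuit without a custom reduce:

def have_common_element(X, Y):
    # returns True as soon as one pair of equal elements is found
    return any(k == l for l in Y for k in X)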
I was trying to find an elegant way to iterate quickly once that is in place, but I think a good way would be simply looping once and creating another container that will hold the "merged" lists: you loop once over the lists contained in the original list, and check each one against every new list created in the proxy list.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!
I want to loop through a database of documents and calculate a pairwise comparison score.
A simplistic, naive method would nest a loop within another loop. This would result in the program comparing each pair of documents twice and also comparing each document to itself.
Is there a name for the algorithm for doing this task efficiently?
Is there a name for this approach?
Thanks.
Assume all items have a number ItemNumber
Simple solution -- always have the 2nd element's ItemNumber greater than the first item.
eg
for (firstitem = 1 to maxitemnumber)
    for (seconditem = firstitem + 1 to maxitemnumber)
        compare(firstitem, seconditem)
Visual note: if you think of the comparison as a matrix (the item number of one document on one axis, the item number of the other on the other axis), this looks at one of the triangles:
........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.
I don't think it's complicated enough to qualify for a name.
You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.
Unique pairings:
SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id
Potentially there are a lot of ways in which the comparison operation can be used to generate data summaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.
C.
You can keep track of which documents you have already compared, e.g. (with numbers ;))
compared = set()
for i in [1,2,3]:
    for j in [1,2,3]:
        pair = frozenset((i, j))
        if i != j and pair not in compared:
            compared.add(pair)
            compare(i, j)
Another idea would be to create the combinations of documents first and iterate over them. But in order to generate them, you have to iterate over both lists and then iterate over the result list again, so I don't think that it has any advantage.
Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:
docs = [1,2,3]
n = len(docs)
for i in range(n):
    for j in range(i + 1, n):
        compare(docs[i], docs[j])
Something like this?
src = [1,2,3]
for i, x in enumerate(src):
    for y in src[i + 1:]:  # skip x itself and everything before it
        compare(x, y)
Or you might wish to generate a list of pairs instead:
pairs = [(x, y) for i, x in enumerate(src) for y in src[i + 1:]]
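For what it's worth, itertools.combinations(src, 2) yields exactly these pairs (each unordered pair once, with no self-pairs), so the same thing can also be written as:

from itertools import combinations

for x, y in combinations(src, 2):
    compare(x, y)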