I'm not sure if 'hierarchical' is the correct way to label this problem, but I have a series of lists of integers that I intend to keep in a 2D numpy array, and I need to keep them sorted in the following way:
array[0,:] = [1, 1, 1, 1, 2, 2, 2, 2, ...]
array[1,:] = [1, 1, 2, 2, 1, 1, 2, 2, ...]
array[2,:] = [1, 2, 1, 2, 1, 2, 1, 2, ...]
...
...
array[n,:] = [...]
So the first list is sorted; then the second list is broken into subsections of elements that all have the same value in the first list, and those subsections are sorted; and so on down all the lists.
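Put differently (if I'm describing it right), the columns read top to bottom should end up in lexicographic order, i.e. something like this invariant should hold (just a sketch, not my actual code):
cols = [tuple(array[:, j]) for j in range(array.shape[1])]
assert cols == sorted(cols)  # columns are in lexicographic order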
Initially each list will contain only one integer, and I'll then receive new columns that I need to insert into the array in such a way that it remains sorted as discussed above.
The purpose of keeping the lists in this order is that, given a new column of integers, I need to check as efficiently as possible whether an exact copy of that column already exists in the array, and I assume this ordering will help me do that. It may be that there is a better way to make that check than keeping the lists like this - if you have thoughts about that, please mention them!
I assume the correct position for a new column can be found by a series of binary searches but my attempts have been messy - any thoughts on doing this in a tidy and efficient way?
thanks!
If I understand your problem correctly, you have a bunch of sequences of numbers that you need to process, but you need to be able to tell if the latest one is a duplicate of one of the sequences you've processed before. Currently you're trying to insert the new sequences as columns in a numpy array, but that's awkward since numpy is really best with fixed-size arrays (concatenating or inserting things is always going to be slow).
A much better data structure for your needs is a set. Membership tests and the addition of new items on a set are both very fast (amortized O(1) time complexity). The only limitation is that a set's items must be hashable (which is true for tuples, but not for lists or numpy arrays).
Here's the outline of some code you might be able to use:
seen = set()
for seq in sequences:
    tup = tuple(seq)  # you only need to make a tuple if seq is not already hashable
    if tup not in seen:
        seen.add(tup)
        # do whatever you want with seq here; it has not been seen before
    else:
        pass  # if you want to do something with duplicated sequences, do it here
You can also look at the unique_everseen recipe in the itertools documentation, which does basically the same as the above, but as a well-optimized generator function.
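For reference, that recipe is roughly the following (slightly simplified from the itertools docs, without the key argument):
def unique_everseen(iterable):
    # yield unique elements, preserving order; remembers all elements ever seen
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element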
I'm trying to write a function that generates all possible configurations of a list by swapping certain allowable pairs of elements.
For example, if we have the list:
lst = [1, 2, 3, 4, 5]
And we only allow the swapping of the following pairs of elements:
pairs = [[0, 2], [4, 1]]
i.e., we can only swap the 0th element of the list with the 2nd, and the 4th element with the 1st (there can be any number of allowed pairs of swaps).
I would like the function to return the number of distinct configurations of the list given the allowable swaps.
Since I'm planning on running this for large lists and many allowable swaps, it would be preferable for the function to be as efficient as possible.
I've found examples that generate permutations by swapping all the elements, two at a time, but I can't find a way to specify certain pairs of allowable swaps.
You've been lured off other productive paths by the common term "swap". Switch your attack. Instead, note that you need the product of the two orderings of [a[0], a[2]] with the two orderings of [a[1], a[4]] to get all the possible permutations. You take each of these products (four of them) and distribute the elements into your result sequence in the proper places. It will look vaguely like this ... I'm using Python as pseudo-code, to some extent.
import itertools

a = [1, 2, 3, 4, 5]  # the example list from the question

seq = itertools.product(itertools.permutations([a[0], a[2]]),
                        itertools.permutations([a[1], a[4]]))
for soln in seq:
    # each solution "soln" is two ordered pairs, i.e. 4 values to be distributed.
    # Construct a permutation "b" by putting each in its proper place:
    # map the first pair to b[0] and b[2], and the second pair to b[1] and b[4].
    (b0, b2), (b1, b4) = soln
    b = [b0, b1, b2, a[3], b4]
Can you take it from there? That's the idea; I'll leave you to generalize the algorithm.
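If it helps, here is one way the generalization might look, assuming the allowed pairs are disjoint (the function name and the set-based deduplication are my own additions, not part of the answer above):
import itertools

def distinct_configurations(lst, pairs):
    # for each allowed pair, the two positions can hold their values
    # in either order; take the product of those independent choices
    choices = [itertools.permutations([lst[i], lst[j]]) for i, j in pairs]
    results = set()
    for combo in itertools.product(*choices):
        b = list(lst)
        for (i, j), (x, y) in zip(pairs, combo):
            b[i], b[j] = x, y
        results.add(tuple(b))  # dedupe in case a pair holds equal values
    return len(results)

print(distinct_configurations([1, 2, 3, 4, 5], [[0, 2], [4, 1]]))  # 4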
I am trying to implement some sort of local search. For this I have two arrays representing a list (an ordering) that I am trying to optimize.
So I have one array holding the current best order, and one that I am currently analyzing.
At the beginning I shuffle the array and set the current best array to the shuffled array.
random.shuffle(self.matchOrder)
self.bestMatchOrder = self.matchOrder
Then I swap a random neighbouring pair in the array.
Now the problem I have is that when I swap the values in self.matchOrder, the values in self.bestMatchOrder get swapped too.
a = self.matchOrder[index]
self.matchOrder[index] = self.matchOrder[index + 1]
self.matchOrder[index + 1] = a
"index" is given to the function as a parameter; it is just a randomly generated number.
I guess I did something wrong assigning the variables, but I can't figure out what. So what can I do to copy only the values of one array to the other, so that changes to one are not applied to the other as well?
When you use self.bestMatchOrder = self.matchOrder, Python doesn't allocate a new memory location for self.bestMatchOrder; instead, both names point to the same memory location. And since lists are a mutable data type, any changes made to self.matchOrder are reflected in self.bestMatchOrder.
import copy
self.bestMatchOrder = copy.deepcopy(self.matchOrder)
However, if you are using linear (non-nested) lists, you can also use self.bestMatchOrder = self.matchOrder[:]. But if you are using nested lists, then deepcopy() is the correct choice.
If you want to copy a list, you can use slice operation:
list_a = [1, 2, 3]
list_b = list_a[:]
list_a[0] = 42  # mutate the original in place
print(list_a)  # prints [42, 2, 3]
print(list_b)  # prints [1, 2, 3], i.e. copied values
But if you have more complex (nested) structures, you should use deepcopy, as @anmol_uppal advised above.
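To see why the slice copy is not enough for nested lists, a quick sketch:
import copy

nested_a = [[1, 2], [3, 4]]
nested_b = nested_a[:]             # shallow copy: inner lists are shared
nested_a[0][0] = 99
print(nested_b[0][0])              # 99, the shallow copy was affected

nested_c = copy.deepcopy(nested_a)
nested_a[0][0] = -1
print(nested_c[0][0])              # 99, the deep copy is independent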
I want to analyze some data (one x-value, several y-values). Unfortunately not every x-value has all y-values filled; some values are empty. I want to put all values into lists, so that I have an x-value list ([1, 2, 3, 4]) and a y-value list ([[1, 2], [1, 4], [5, 2]]). But any element I add to a list has to be a number (since my lists are float lists). Later I want to use these lists to plot the data. So I have the problem that I have to add a value to the list while parsing the data, but later I have to omit these values again for plotting, otherwise I get wrong results. My first idea was to simply add an empty space in the list, so that the plotting program skips this value. But that is not allowed in Python.
What is the best way to circumvent my problem?
Create a DataPoint class that can hold the x and y values, and put the points in a single list. Then you can set the y value to None (or an empty list) for the points that have missing values.
This also ensures you have a valid set of points at all times; with separate x and y lists you always run the risk of them getting out of sync.
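A minimal sketch of what such a DataPoint class could look like (the dataclass form and the example data are my own):
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DataPoint:
    x: float
    y: Optional[List[float]] = None   # None marks missing y-values

points = [DataPoint(1, [1, 2]),
          DataPoint(2, None),         # no y-values for this x
          DataPoint(3, [1, 4]),
          DataPoint(4, [5, 2])]

# for plotting, simply skip the points with missing values:
plottable = [(p.x, p.y) for p in points if p.y is not None]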
We can create multi-dimensional arrays in Python by using nested lists, such as:
A = [[1,2,3],
[2,1,3]]
etc.
In this case it is simple: nRows = len(A) and nCols = len(A[0]). However, when I have more than three dimensions it becomes complicated.
A = [[[1, 1, [1, 2, 3, 4]], 2, [3, [2, [3, 4]]]],
     [2, 1, 3]]
etc.
These lists are legal in Python, and the number of dimensions is not known a priori.
In this case, how do I determine the number of dimensions and the number of elements in each dimension?
I'm looking for an algorithm and, if possible, an implementation. I believe it involves something similar to DFS. Any suggestions?
P.S.: I'm not looking for any existing packages, though I would like to know about them.
I believe I have solved the problem myself.
It is just a simple DFS.
For the example given above:
A = [[[1, 1, [1, 2, 3, 4]], 2, [3, [2, [3, 4]]]],
     [2, 1, 3]]
the answer is as follows:
[[3, 2, 2, 2, 3, 4], [3]]
The total number of dimensions is 7.
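For reference, a minimal sketch of the kind of DFS involved (my own formulation; the traversal order of the lengths may differ from the output shown above):
def list_lengths(obj):
    # collect len() of every list in a nested structure, depth-first
    if not isinstance(obj, list):
        return []
    lengths = [len(obj)]
    for item in obj:
        lengths.extend(list_lengths(item))
    return lengths

A = [[[1, 1, [1, 2, 3, 4]], 2, [3, [2, [3, 4]]]],
     [2, 1, 3]]
result = [list_lengths(sub) for sub in A]
print(result)                        # one list of lengths per top-level element
print(sum(len(r) for r in result))   # 7 "dimensions" in total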
I guess I was overthinking... thanks anyway...!
My example code is in Python, but I'm asking about the general principle.
If I have a set of data in time-value pairs, should I store these as a 2D array or as a list of tuples? For instance, if I have this data:
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
Is it generally better to store it like this:
data=[v,t]
or as a list of tuples:
data=[(1,1), (4,2), (4,3), ...]
Is there a "standard" way of doing this?
If speed is your biggest concern, in Python, look at Numpy.
In general, you should choose a data structure that makes dealing with the data natural and easy. Worry about speed later, after you know it works!
As for an easy data structure, how about a list of tuples:
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
data=[(1,1), (4,2), (4,3), ...]
Then you can unpack like so:
v,t=data[1]
#v,t are 4,2
An aggregate numpy array is probably the best choice. Assuming that your time points are not regularly spaced (and therefore you need to keep track of them rather than just using the index), this allows you to take slices of your entire data set like:
import numpy as np
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
data = np.array([v,t])
Then you could slice it to get a subset of the data easily:
data[:,2:4] #array([[4, 4],[3, 4]])
ii = [1,2,5] # Fancy indexing
data[:,ii] # array([[4, 4, 4],
# [2, 3, 6]])
You could try a dictionary? In other languages this may be known as a hash-map, hash-table, associative array, or some other term which means the same thing. Of course it depends on how you intend to access your data.
Instead of:
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
you'd have:
v_with_t_as_key = {1: 1,   # excuse the name...
                   2: 4,
                   3: 4,
                   4: 4,
                   5: 23,
                   6: 4}
This is a fairly standard construct in python, although if order is important you might want to look at the ordered dictionary in collections.
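For example, a quick sketch of the ordered variant (same hypothetical data as above):
from collections import OrderedDict

v_with_t_as_key = OrderedDict([(1, 1), (2, 4), (3, 4),
                               (4, 4), (5, 23), (6, 4)])
print(v_with_t_as_key[5])  # 23
(In recent Python versions plain dicts preserve insertion order as well, but OrderedDict makes the intent explicit.)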
I've found that for exploring and prototyping, it's more convenient to store as a list/jagged array of columns, where the first column is the observational index and each column after that is a variable.
data=[(1,2,3,4,5,6),(1,4,4,4,23,4)]
Most of the time I'm loading many observations with many variables, and then sorting, formatting, or displaying one or more of those variables, or even joining two sets of data using columns as parameters. It's a lot rarer that I need to pull out a subset of observations. Even when I do, it's more convenient to use a function that returns a subset of the data given a column of observation indexes.
Having said that, I still use functions to convert jagged arrays to 2d arrays and to transpose 2d arrays.
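For what it's worth, transposing a 2D array of equal-length rows is nearly a one-liner with zip (a sketch; jagged arrays would need padding first):
def transpose(rows):
    # zip(*rows) groups the i-th element of every row into a column
    return [list(col) for col in zip(*rows)]

data = [(1, 2, 3, 4, 5, 6), (1, 4, 4, 4, 23, 4)]
print(transpose(data))  # [[1, 1], [2, 4], [3, 4], [4, 4], [5, 23], [6, 4]]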