Time-varying data: list of tuples vs 2D array? - python

My example code is in python but I'm asking about the general principle.
If I have a set of data in time-value pairs, should I store it as a 2D array or as a list of tuples? For instance, if I have this data:
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
Is it generally better to store it like this:
data=[v,t]
or as a list of tuples:
data=[(1,1),(4,2),(4,3),...]
Is there a "standard" way of doing this?

If speed is your biggest concern, in Python, look at Numpy.
In general, you should choose a data structure that makes dealing with the data natural and easy. Worry about speed later, after you know it works!
As for an easy data structure, how about a list of tuples:
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
data=[(1,1),(4,2),(4,3),...]
Then you can unpack like so:
v,t=data[1]
#v,t are 4,2
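If you already have the separate v and t lists from the question, a minimal sketch (my addition) of building the pairs with zip, keeping the (value, time) order used above:
v = [1, 4, 4, 4, 23, 4]
t = [1, 2, 3, 4, 5, 6]
# Pair each value with its time; zip keeps the two lists aligned by position.
data = list(zip(v, t))  # [(1, 1), (4, 2), (4, 3), ...]
v1, t1 = data[1]  # v1, t1 are 4, 2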

The aggregate array container is probably the best choice. Assuming that your time points are not regularly spaced (and therefore you need to keep track of them rather than just use the indexing), this lets you take slices of your entire data set, like:
import numpy as np
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
data = np.array([v,t])
Then you could slice it to get a subset of the data easily:
data[:,2:4] #array([[4, 4],[3, 4]])
ii = [1,2,5] # Fancy indexing
data[:,ii] # array([[4, 4, 4],
# [2, 3, 6]])
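Building on that, a small sketch (my addition, not from the original answer) of using a boolean mask on the time row to pull out a time window:
mask = (data[1] >= 2) & (data[1] <= 4)  # times between 2 and 4 inclusive
data[:, mask]  # array([[4, 4, 4],
               #        [2, 3, 4]])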

You could try a dictionary. In other languages this may be known as a hash map, hash table, associative array, or some other term that means the same thing. Of course, it depends on how you intend to access your data.
Instead of:
v=[1,4,4,4,23,4]
t=[1,2,3,4,5,6]
you'd have:
v_with_t_as_key = {1: 1,   # excuse the name...
                   2: 4,
                   3: 4,
                   4: 4,
                   5: 23,
                   6: 4}
This is a fairly standard construct in Python, although if order is important you might want to look at OrderedDict in the collections module.
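Rather than typing the literal out, a minimal sketch (my addition) of building the same mapping from the two lists with zip:
v = [1, 4, 4, 4, 23, 4]
t = [1, 2, 3, 4, 5, 6]
v_with_t_as_key = dict(zip(t, v))  # {1: 1, 2: 4, 3: 4, 4: 4, 5: 23, 6: 4}
v_with_t_as_key[5]  # 23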

I've found that for exploring and prototyping, it's more convenient to store the data as a list/jagged array of columns, where the first column is the observation index and each column after that is a variable.
data=[(1,2,3,4,5,6),(1,4,4,4,23,4)]
Most of the time I'm loading many observations with many variables, then sorting, formatting, or displaying one or more of those variables, or even joining two sets of data with columns as parameters. It's a lot rarer that I need to pull out a subset of observations. Even when I do, it's more convenient to use a function that returns a subset of the data given a column of observation indexes.
Having said that, I still use functions to convert jagged arrays to 2d arrays and to transpose 2d arrays.
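The answer doesn't show those helper functions, but a minimal sketch (my addition) of what the transpose step might look like for the equal-length columns above; truly jagged columns would need itertools.zip_longest with a fill value:
data = [(1, 2, 3, 4, 5, 6), (1, 4, 4, 4, 23, 4)]
# Transpose the list of columns into a list of (index, value) rows.
rows = list(zip(*data))  # [(1, 1), (2, 4), (3, 4), (4, 4), (5, 23), (6, 4)]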

Related

Quickest way to sort a numpy array of tuples by indexes

I am trying to sort a numpy array of tuples. I came across a solution like the following, which gives me the flexibility to control the index and sort order of each key.
m = mm[mm[:,0].argsort()] # First sort doesn't need to be stable.
m = mm[mm[:,1].argsort(kind='mergesort')[::-1][:100000]]
m = mm[mm[:,3].argsort(kind='mergesort')[::-1][:100000]]
m = mm[mm[:,2].argsort(kind='mergesort')[::-1][:100000]]
but I received an error like:
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Does this mean this kind of solution applies to lists but not tuples?
Could someone suggest a better way to speed up the sorting procedure than np.argsort(mm, order=('a','b','c','d'))?
And how could order cope with something like np.argsort(mm, order=('a','b',-'c','d')), where I want the key 'c' sorted in reverse order?
Thank you.
You talk about an "array of tuples", which, without context, is vague.
Your use of:
np.argsort(mm, order=('a','b','c','d'))
suggests that mm is a structured array, with fields named, 'a','b', etc.
That would also account for the dimension error when doing mm[:,0]. You can't treat a 1d structured array as though it were 2d with columns!
You should have told us mm.shape and mm.dtype.
As for reversing the sort on the 'c' field, I'd make a copy of the array, and assign negative values to the 'c' field.
I don't think there's an alternative to doing this sort of 'ordered' sort. There is a np.lexsort which you might apply to an "unstructured" copy of your "structured" array (assuming the fields have uniform dtypes). But I don't know if the speed's any different.
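A minimal sketch of the negated-field idea, assuming mm really is a structured array with numeric fields 'a'..'d' (the question never shows mm.dtype, so the dtype below is hypothetical):
import numpy as np

dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'i4')])
mm = np.array([(1, 2, 3, 4), (1, 2, 5, 0), (0, 9, 1, 1)], dtype=dt)

tmp = mm.copy()
tmp['c'] = -tmp['c']  # negate 'c' so ascending order on the copy means descending on the original
idx = np.argsort(tmp, order=('a', 'b', 'c', 'd'))
m = mm[idx]  # sorted by 'a', 'b' ascending; 'c' descending; 'd' ascending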

What are the differences between ndarray and list in Python?

Last week my teacher asked us: when storing the integers from one to one hundred, what are the differences between using a list and using an ndarray? I had never used NumPy before, so I searched for this question on the web.
But all my search results told me only about the dimension difference: an ndarray can store N-dimensional data, while a list stores one dimension. That doesn't satisfy me. Is it really that simple, am I overthinking it, or did I just not find the right keywords to search for?
I need help.
There are several differences:
-You can append elements to a list, but you can't change the size of a `numpy.ndarray` without making a full copy.
-Lists can contain just about anything; in NumPy arrays all the elements must have the same type.
-In practice, NumPy arrays are faster for vectorized operations than mapping functions over lists.
-Modification time is not really the issue; iteration over the elements is.
NumPy arrays also have many array-related methods (`argmin`, `min`, `sort`, etc.).
I prefer to use NumPy arrays when I need to do mathematical operations (sum, average, array multiplication, etc.) and lists when I need to iterate over 'items' (strings, files, etc.).
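A small sketch (my addition) illustrating the append and vectorization differences for the 1..100 case from the question:
import numpy as np

nums_list = list(range(1, 101))   # Python list of the integers 1..100
nums_arr = np.arange(1, 101)      # NumPy ndarray of the same values

nums_list.append(101)                # a list grows in place
nums_arr = np.append(nums_arr, 101)  # np.append returns a new, copied array

doubled_list = [x * 2 for x in nums_list]  # element-by-element mapping
doubled_arr = nums_arr * 2                 # vectorized: one operation on the whole array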
A one-dimensional array is like one row of graph paper: you can store one thing inside each box.
[Picture: an example of a 2-dimensional array]
Two-dimensional arrays have rows and columns. (I should have changed the numbers; when drawing the picture I just copied the first row many times. The numbers can be completely different on each row.)
import numpy as np
lol = [[1, 2, 3], [4, 5, 6]]
# `lol` is a list of lists
arr_har = np.array(lol, np.int32)
print(type(arr_har)) # <class 'numpy.ndarray'>
print("BEFORE:")
print(arr_har)
# change the value in row 0 and column 2.
arr_har[0][2] = 999
print("\n\nAFTER arr_har[0][2] = 999:")
print(arr_har)
[Picture: an example of a 3-dimensional array]
Summary/Conclusion:
A list in Python acts like a one-dimensional array.
ndarray is an abbreviation of "n-dimensional array" or "multi-dimensional array".
The difference between a Python list and an ndarray is that an ndarray can have two or more dimensions, while a plain list is one-dimensional (nested lists only emulate extra dimensions).

Shorthand for extracting a high-dimensional sub-array, i.e. array[beg[0]:end[0],beg[1]:end[1],...]

I am looking for a concise way of extracting a subarray from a high-dimensional array.
For example, let's take an array a. I want to extract a subarray for which I have the beginning and end coordinates stored in two arrays b and e.
Currently, to extract the desired subarray, I type
a[b[0]:e[0],b[1]:e[1],b[2]:e[2],b[3]:e[3],...]
I was wondering if there is a built-in, concise way of slicing such an array. I would love to call something like a[b:e], but this does not work.
Zip the two arrays and construct a tuple of slice objects from the resulting pairs:
from itertools import starmap
a[tuple(starmap(slice, zip(b, e)))]
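For concreteness, a quick usage sketch (my addition) with a hypothetical 3-D array:
import numpy as np
from itertools import starmap

a = np.arange(4 * 5 * 6).reshape(4, 5, 6)
b, e = [1, 0, 2], [3, 4, 5]

sub = a[tuple(starmap(slice, zip(b, e)))]  # same as a[1:3, 0:4, 2:5]
sub.shape  # (2, 4, 3)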
If the subclassing np.s_ approach shown here is too heavy for your liking, you can use:
a[tuple(slice(*idx) for idx in np.broadcast(start, stop, step))]
The beauty of using np.broadcast here is that you can for example mix scalars and lists as in start=None, stop=[1,2,3]. This would be equivalent to a[:1, :2, :3].
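Again as an illustration (my addition), the mixed scalar/list case described in the answer:
import numpy as np

a = np.arange(4 * 5 * 6).reshape(4, 5, 6)

# Mixing a scalar (None) with a list broadcasts into three (start, stop, step) triples.
sub = a[tuple(slice(*idx) for idx in np.broadcast(None, [1, 2, 3], None))]
# equivalent to a[:1, :2, :3]
sub.shape  # (1, 2, 3)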

Put ordered data back into a dictionary

I have a (normal, unordered) dictionary that is holding my data and I extract some of the data into a numpy array to do some linear algebra. Once that's done I want to put the resulting ordered numpy vector data back into the dictionary with all of data. What's the best, most Pythonic, way to do this?
Joe Kington suggests in his answer to "Writing to numpy array from dictionary" that two solutions include:
Using Ordered Dictionaries
Storing the sorting order in another data structure, such as a dictionary
Here are some (possibly useful) details:
My data is in nested dictionaries. The outer is for groups: {groupKey: groupDict}, and group keys start at 0 and count up in order to the total number of groups. groupDict contains information about items: {itemKey: itemDict}. itemDict has keys for the actual data, and these keys typically start at 0 but can skip numbers, as not all "item locations" are populated. itemDict keys include things like 'name', 'description', 'x', 'y', ...
Getting to the data is easy, dictionaries are great:
data[groupKey][itemKey]['x'] = 0.12
Then I put data such as x and y into a numpy vectors and arrays, something like this:
xVector = numpy.empty(xLength)
vectorIndex = 0
for groupKey, groupDict in dataDict.items():
    for itemKey, itemDict in groupDict.items():
        xVector[vectorIndex] = itemDict['x']
        vectorIndex += 1
Then I go off and do my linear algebra and calculate a z vector that I want to add back into dataDict. The issue is that dataDict is unordered, so I don't have any way of getting the proper index.
The Ordered Dict method would allow me to know the order and then index through the dataDict structure and put the data back in.
Alternatively, I could create another dictionary while inside the inner for loop above that stores the relationship between vectorIndex, groupKey and itemKey:
sortingDict[vectorIndex]['groupKey'] = groupKey
sortingDict[vectorIndex]['itemKey'] = itemKey
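For concreteness, a sketch (my addition, reusing the names from the question) of building that mapping alongside the vector:
sortingDict = {}
vectorIndex = 0
for groupKey, groupDict in dataDict.items():
    for itemKey, itemDict in groupDict.items():
        xVector[vectorIndex] = itemDict['x']
        sortingDict[vectorIndex] = {'groupKey': groupKey, 'itemKey': itemKey}
        vectorIndex += 1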
Later, when it's time to put the data back, I could just loop through the vectors and add the data:
vectorIndex = 0
for z in numpy.nditer(zVector):
    # take the scalar value out of the 0-d array that nditer yields
    dataDict[sortingDict[vectorIndex]['groupKey']][sortingDict[vectorIndex]['itemKey']]['z'] = z.item()
    vectorIndex += 1
Both methods seem equally straightforward to me. I'm not sure if changing dataDict to an ordered dictionary will have any other effects elsewhere in my code, but probably not. Adding the sorting dictionary also seems pretty easy, as it will get created at the same time as the numpy arrays and vectors. Left on my own, I think I would go with the sortingDict method.
Is one of these methods better than the others? Is there a better way I'm not thinking of? My data structure works well for me, but if there's a way to change that to improve everything else I'm open to it.
I ended up going with option #2 and it works quite well.

What's the best way to iterate over a multidimensional array while tracking/doing operations on the iteration index?

I need to do a lot of operations on multidimensional numpy arrays and am therefore experimenting to find the best approach.
So let's say I have an array like this:
A = np.random.uniform(0, 1, size = 100).reshape(20, 5)
My goal is to get the maximum value (numpy.amax()) of each row and its index. So A[0] may be something like this:
A[0] = [ 0.64570441 0.31781716 0.07268926 0.84183753 0.72194227]
I want to get the maximum and the index of that maximum: [0.84183753] at [0, 3]. No specific representation of the results is needed, this is just an example. In fact, I only need the horizontal (column) index.
I tried using numpy's nditer object:
A_it = np.nditer(A, flags=['multi_index'], op_flags=['readwrite'])
while not A_it.finished:
    print(np.amax(A_it.value))
    print(A_it.multi_index[1])
    A_it.iternext()
I can access every element of the array and its index over the iterations that way, but I don't seem to be able to bring the numpy.amax() function and the index together for each element, syntax-wise. Can I even do it using the nditer object?
Also, in Numpy: Beginner nditer I read that using nditer, or using iteration in numpy at all, usually means I am doing something wrong. But I can't find another convenient way to achieve my goal here without any iteration. Obviously I am a total beginner in numpy and Python in general, so any keyword to search for or hint is very much appreciated.
A major problem with nditer is that it iterates over each element, not each row. It's best used as a stepping stone toward a Cython or C rewrite of your code.
If you just want the maximum for each row of your array, a simple iteration or list comprehension will do nicely.
for row in A: print(np.amax(row))
or to turn it back into an array:
np.array([np.amax(row) for row in A])
But you can get the same values by giving amax an axis parameter:
np.amax(A,axis=1)
np.argmax identifies the location of the maximum.
np.argmax(A,axis=1)
With the argmax values you could then select the max values as well,
ind=np.argmax(A,axis=1)
A[np.arange(A.shape[0]),ind]
(speed's about the same as repeating the np.amax call).
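Putting the pieces together, a quick sketch (my addition) that pairs each row's maximum with its column index:
import numpy as np

A = np.random.uniform(0, 1, size=100).reshape(20, 5)

ind = np.argmax(A, axis=1)              # column index of each row's maximum
maxes = A[np.arange(A.shape[0]), ind]   # the maxima themselves (== np.amax(A, axis=1))
pairs = list(zip(maxes, ind))           # e.g. [(0.84183753, 3), ...]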
