Python list or pandas dataframe arbitrary indexing and slicing - python

I have used both R and Python extensively in my work, and at times I get the syntax between them confused.
In R, if I wanted to create a model from only some features of my data set, I can do something like this:
subset = df[1:1000, c(1,5,14:18,24)]
This would take the first 1000 rows (yes, R starts on index 1), and it would take the 1st, 5th, 14th through 18th, and 24th columns.
I have tried to do any combination of slice, range, and similar sorts of functions, and have not been able to duplicate this sort of flexibility. In the end, I just enumerated all of the values.
How can this be done in Python?
That is, how do I pick an arbitrary subset of elements from a list, some of which are selected individually (as with the commas shown above) and some sequentially (as with the colons shown above)?

In its index_tricks module, numpy defines a class instance, np.r_, that converts scalars and slices into a single enumerated array:
In [560]: np.r_[1,5,14:18,24]
Out[560]: array([ 1, 5, 14, 15, 16, 17, 24])
It's an instance with a __getitem__ method, so it uses the indexing syntax. It expands 14:18 into np.arange(14,18), so the stop value 18 is excluded. It can also expand values with linspace (when the slice step is an imaginary number such as 5j).
So I think you'd rewrite
subset = df[1:1000, c(1,5,14:18,24)]
as
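df.iloc[:1000, np.r_[0,4,13:18,23]]
Every index shifts down by one because Python is 0-based, but the slice end stays at 18: R's 14:18 includes column 18, while Python's 13:18 stops before index 18 and so covers the same five columns.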
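To check the translation end to end, here is a minimal sketch on a made-up frame. The DataFrame, its size (2000 rows by 30 columns), and the names cols and subset are assumptions for illustration only; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2000 rows, 30 columns labelled 0..29.
df = pd.DataFrame(np.arange(2000 * 30).reshape(2000, 30))

# R's c(1, 5, 14:18, 24) is 1-based and inclusive; the 0-based,
# half-open equivalent is 0, 4, 13:18, 23.
cols = np.r_[0, 4, 13:18, 23]

# First 1000 rows, selected columns, all by integer position.
subset = df.iloc[:1000, cols]
print(subset.shape)
```

Because iloc works purely by position, the same cols array should work regardless of what the column labels actually are.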

You can use iloc for integer indexing in pandas:
df.iloc[0:1000, [0, 4] + range(13,18) + [23]]
As commented by #root, in Python 3 you need to explicitly convert range() to a list: df.iloc[0:1000, [0, 4] + list(range(13,18)) + [23]]

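Try this. The first set of square brackets selects the columns; the second slices the rows. Note that plain df[...] selects columns by label, not by position, so this only works when the column labels are the integers 0, 1, 2, and so on.
df[[0,4] + list(range(13,18)) + [23]][:1000]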

Related

2d Array column slicing in Pure Python without for loops

Is it possible to slice a column off a 2d array in pure Python without a for loop or list comprehension? Say for instance you have a 4x4 array of ints:
grid = [[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]]
and let's say you'd like to return the grid without the first row and the last column [[5,6,7],[9,10,11],[13,14,15]]
Is there a slicing syntax that allows you to do this? Excluding the first row is easily achieved with grid = grid[1:4]
However doing something like grid = grid[1:4][0:2] seems like it should work but results in [[5, 6, 7, 8], [9, 10, 11, 12]]. If at all possible, I'd like to avoid having to iterate through it in a for loop/list comprehension. I know that would work, but I'm wondering if there's a more elegant syntax.
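To ddejohn's point, this can't be done with slicing notation alone. One approach that avoids both list comprehensions and for loops is to transpose with list(zip(*matrix)), where matrix is the input list. The rows of the transposed result are the columns of the original, so they can be sliced like rows and then transposed back.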
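A small sketch of the transpose idea applied to the grid from the question. The intermediate names (without_first_row, columns, trimmed) are made up here; the only technique used is zip(*...) applied twice.

```python
grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]

# Drop the first row with an ordinary slice.
without_first_row = grid[1:]

# Transpose so that each original column becomes a row (a tuple).
columns = list(zip(*without_first_row))

# Drop the last column, transpose back, and turn the tuples into lists.
trimmed = list(map(list, zip(*columns[:-1])))

print(trimmed)  # [[5, 6, 7], [9, 10, 11], [13, 14, 15]]
```

zip and map still iterate internally, so this only avoids writing the loop by hand; it is not expected to be faster than a list comprehension.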

Grouping elements of a NumPy array by sum of indices

I have several large numpy arrays of dimensions 30*30*30. I need to traverse each array, compute the sum of each index triplet, and bin the elements by this sum. For example, consider this simple 2*2 array:
test = np.array([[2,3],[0,1]])
This array has the indices [0,0],[0,1],[1,0] and [1,1]. This routine would return the list: [2,[3,0],1], because 2 in array test has index sum 0, 3 and 0 have index sum 1 and 1 has index sum 2. I know the brute force method of iterating through the NumPy array and checking the sum would work, but it is far too inefficient for my actual case with large N(=30) and several arrays. Any inputs on using NumPy routines to accomplish this grouping would be appreciated. Thank you in advance.
Here is one way that should be reasonably fast, but not super-fast: 30x30x30 takes 20 ms on my machine.
import numpy as np
# make example
dims = 2,3,4
a = np.arange(np.prod(dims),0,-1).reshape(dims)
# create and sort indices
idx = sum(np.ogrid[tuple(map(slice,dims))])
srt = idx.ravel().argsort(kind='stable')
# use order to arrange and split data
asrt = a.ravel()[srt]
spltpts = idx.ravel().searchsorted(np.arange(1,np.sum(dims)-len(dims)+1),sorter=srt)
out = np.split(asrt,spltpts)
# admire
out
# [array([24]), array([23, 20, 12]), array([22, 19, 16, 11, 8]), array([21, 18, 15, 10, 7, 4]), array([17, 14, 9, 6, 3]), array([13, 5, 2]), array([1])]
You could procedurally create a list of index tuples and use that, but you may end up with a code constant that's too large to be efficient.
[(0,0),[(1,0),(0,1)],(1,1)],
So you need a function to generate these indexes on the fly for an n-dimensional array.
For one dimension, a trivial count/increment
[(0),(1),(2),...]
For the second, use the one-dimension strategy for the first dimension, then decrement the first and increment the second to fill in.
[(0...)...,(1...)...,(2...)...,...]
[[(0,0)],[(1,0),(0,1)],[(2,0),(1,1),(0,2)],[...],...]
Notice that some of these would be outside the example array, so your generator would need to include a bounds check.
Then for three dimensions, give the first two dimensions the treatment above, but at the end decrement the first dimension, increment the third, and repeat until done.
[[(0,0,0),...],[(1,0,0),(0,1,0),...],[(2,0,0),(1,1,0),(0,2,0),...],[...],...]
[[(0,0,0)],[(1,0,0),(0,1,0),(0,0,1)],[(2,0,0),(1,1,0),(0,2,0),(1,0,1),(0,1,1),(0,0,2)],...]
Again, you need bounds checks or cleverer start/end points to avoid indexing outside the array, but this general algorithm is how you'd go about generating the indexes on the fly rather than having two large arrays compete for cache and I/O.
Generating the Python or NumPy equivalent is left as an exercise to the user.
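One possible take on that exercise, as a plain-Python sketch rather than a tuned solution: instead of walking the anti-diagonals by hand, it lets np.ndindex enumerate only the valid index tuples (so no bounds check is needed) and bins each element by the sum of its indices. The function name group_by_index_sum is made up for this example.

```python
from collections import defaultdict

import numpy as np


def group_by_index_sum(arr):
    """Bin the elements of arr by the sum of their indices."""
    groups = defaultdict(list)
    # np.ndindex yields every valid index tuple in C order,
    # so nothing outside the array is ever visited.
    for idx in np.ndindex(arr.shape):
        groups[sum(idx)].append(arr[idx])
    # Return the bins ordered by index sum: 0, 1, 2, ...
    return [groups[s] for s in sorted(groups)]


test = np.array([[2, 3], [0, 1]])
result = group_by_index_sum(test)
print(result)
```

This is still a Python-level loop over every element, so it will likely be slower than the argsort-based answer above for many 30x30x30 arrays; its main appeal is that it works unchanged for any number of dimensions.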

Access elements of a Matrix by a list of indices in Python to apply a max(val, 0.5) to each value without a for loop

I know how to access elements in a vector by indices doing:
test = numpy.array([1,2,3,4,5,6])
indices = list([1,3,5])
print(test[indices])
which gives the correct answer : [2 4 6]
But I am trying to do the same thing using a 2D matrix, something like:
currentGrid = numpy.array( [[0, 0.1],
[0.9, 0.9],
[0.1, 0.1]])
indices = list([(0,0),(1,1)])
print(currentGrid[indices])
this should display me "[0.0 0.9]" for the value at (0,0) and the one at (1,1) in the matrix. But instead it displays "[ 0.1 0.1]". Also if I try to use 3 indices with :
indices = list([(0,0),(1,1),(0,2)])
I now get the following error:
Traceback (most recent call last):
File "main.py", line 43, in <module>
print(currentGrid[indices])
IndexError: too many indices for array
I ultimately need to apply a simple max() operation on all the elements at these indices and need the fastest way to do that for optimization purposes.
What am I doing wrong ? How can I access specific elements in a matrix to do some operation on them in a very efficient way (not using list comprehension nor a loop).
The problem is the arrangement of the indices you're passing to the array. If your array is two-dimensional, your indices must be two sequences, one containing the row (vertical) indices and the other the column (horizontal) indices. For instance:
idx_i, idx_j = zip(*[(0, 0), (1, 1), (0, 2)])
print(currentGrid[idx_j, idx_i])
# [0.0, 0.9, 0.1]
Note that the first index when indexing a 2D array selects the row and the second selects the column, i.e. (y, x). I assume you defined your points as (x, y), which is why they are swapped here; otherwise you'll get an IndexError.
There are already some great answers to your problem. Here just a quick and dirty solution for your particular code:
for i in indices:
    print(currentGrid[i[0],i[1]])
Edit:
If you do not want to use a for loop you need to do the following:
Assume you have 3 values of your 2D matrix (with the dimensions x1 and x2) that you want to access. The values have the "coordinates" (indices) V1(x11|x21), V2(x12|x22), V3(x13|x23). Then, for each dimension of your matrix (2 in your case), you need to create a list with the indices of your points along that dimension. In this example, you would create one list with the x1 indices, [x11,x12,x13], and one list with the x2 indices, [x21,x22,x23]. Then you combine these lists and use them as the index for the matrix:
indices = [[x11,x12,x13],[x21,x22,x23]]
or how you write it:
indices = list([(x11,x12,x13),(x21,x22,x23)])
Now with the points that you used ((0,0),(1,1),(2,0)) - please note you need to use (2,0) instead of (0,2), because it would be out of range otherwise:
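indices = list([(0,1,2),(0,1,0)])
print(currentGrid[tuple(indices)])
This will give you 0, 0.9, 0.1. The tuple() call matters: recent NumPy versions treat a plain list of sequences as a single index array rather than as one index per dimension. On this result you can then apply the max() command if you like (just to consider your whole question):
maxValue = max(currentGrid[tuple(indices)])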
Edit2:
Here is an example of how you can transform your original index list to get it into the correct shape:
originalIndices = [(0,0),(1,1),(2,0)]
x1 = []
x2 = []
for i in originalIndices:
    x1.append(i[0])
    x2.append(i[1])
newIndices = [x1,x2]
# convert to a tuple so NumPy reads it as one index list per dimension
print(currentGrid[tuple(newIndices)])
Edit3:
I don't know if you can apply max(x,0.5) to a numpy array without using a loop. But you could use Pandas instead. You can cast your values into a pandas Series and then apply a lambda function:
import pandas as pd
maxValues = pd.Series(currentGrid[tuple(newIndices)]).apply(lambda x: max(x,0.5))
This will give you a pandas Series containing 0.5, 0.9, 0.5, which you can simply cast back to a list with maxValues = list(maxValues).
Just one note: in the background there will always be some kind of loop running, even with this command, so I doubt that you will get much better performance this way. If you really want to boost performance, use a for loop together with numba (you simply need to add a decorator to your function) and execute it in parallel. Or you can use the multiprocessing library and its Pool class, see here. Just to give you some inspiration.
Edit4:
By chance I came across this page today, which shows how to do exactly what you want with NumPy. The solution to your problem (using the newIndices list from my Edit2) is:
maxfunction = numpy.vectorize(lambda i: max(i,0.5))
print(maxfunction(currentGrid[tuple(newIndices)]))
2D indices have to be accessed like this (assuming indices is a NumPy integer array of shape (n, 2), for example numpy.array([(0,0),(1,1),(2,0)])):
print(currentGrid[indices[:,0], indices[:,1]])
The row indices and the column indices are to be passed separately.
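As a possible alternative to the pandas and numpy.vectorize routes above, NumPy's element-wise np.maximum can apply max(val, 0.5) to the selected values directly. This is a sketch under the assumption that the points are given as (row, column) pairs; the names points, rows, cols, clipped and updated are made up for the example.

```python
import numpy as np

currentGrid = np.array([[0, 0.1],
                        [0.9, 0.9],
                        [0.1, 0.1]])

# (row, column) pairs; (2, 0) is used because (0, 2) is out of range.
points = [(0, 0), (1, 1), (2, 0)]

# Split the pairs into one index sequence per dimension.
rows, cols = zip(*points)

# Fancy indexing picks the values; np.maximum clips them from below.
clipped = np.maximum(currentGrid[rows, cols], 0.5)
print(clipped)

# Optionally write the results back into a copy of the grid.
updated = currentGrid.copy()
updated[rows, cols] = clipped
```

np.maximum is a compiled ufunc, so the per-element work happens outside the Python interpreter, unlike numpy.vectorize, which calls the Python lambda once per element.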

maintaining hierarchically sorted lists in python

I'm not sure if 'hierarchical' is the correct way to label this problem, but I have a series of lists of integers that I intend to keep in a 2D numpy array, and I need to keep them sorted in the following way:
array[0,:] = [1, 1, 1, 1, 2, 2, 2, 2, ...]
array[1,:] = [1, 1, 2, 2, 1, 1, 2, 2, ...]
array[2,:] = [1, 2, 1, 2, 1, 2, 1, 2, ...]
...
...
array[n,:] = [...]
So the first list is sorted, then the second list is broken into subsections of elements which all have the same value in the first list and those subsections are sorted, and so on down all the lists.
Initially each list will contain only one integer, and I'll then receive new columns that I need to insert into the array in such a way that it remains sorted as discussed above.
The purpose of keeping the lists in this order is that if I'm given a new column of integers I need to check whether an exact copy of that column exists in the array or not as efficiently as possible, and I assume this ordering will help me do it. It may be that there is a better way to make that check than keeping the lists like this - if you have thoughts about that please mention them!
I assume the correct position for a new column can be found by a series of binary searches but my attempts have been messy - any thoughts on doing this in a tidy and efficient way?
thanks!
If I understand your problem correctly, you have a bunch of sequences of numbers that you need to process, but you need to be able to tell if the latest one is a duplicate of one of the sequences you've processed before. Currently you're trying to insert the new sequences as columns in a numpy array, but that's awkward since numpy is really best with fixed-sized arrays (concatenating or inserting things is always going to be slow).
A much better data structure for your needs is a set. Membership tests and the addition of new items on a set are both very fast (amortized O(1) time complexity). The only limitation is that a set's items must be hashable (which is true for tuples, but not for lists or numpy arrays).
Here's the outline of some code you might be able to use:
seen = set()
for seq in sequences:
    tup = tuple(seq)  # you only need to make a tuple if seq is not already hashable
    if tup not in seen:
        seen.add(tup)
        # do whatever you want with seq here, it has not been seen before
    else:
        pass  # if you want to do something with duplicated sequences, do it here
You can also look at the unique_everseen recipe in the itertools documentation, which does basically the same as the above, but as a well-optimized generator function.

Accessing elements with offsets in Python's for .. in loops

I've been mucking around a bit with Python, and I've gathered that it's usually better (or 'pythonic') to use
for x in SomeArray:
rather than the more C-style
for i in range(0, len(SomeArray)):
I do see the benefits in this, mainly cleaner code, and the ability to use the nice map() and related functions. However, I am quite often faced with the situation where I would like to simultaneously access elements of varying offsets in the array. For example, I might want to add the current element to the element two steps behind it. Is there a way to do this without resorting to explicit indices?
The way to do this in Python is:
for i, x in enumerate(SomeArray):
    print(i, x)
The enumerate function produces a sequence of 2-tuples, each containing the array index and the element.
List indexing and zip() are your friends.
Here's my answer for your more specific question:
I might want to add the current element to the element two steps behind it. Is there a way to do this without resorting to explicit indices?
arr = range(10)
[i+j for i,j in zip(arr[:-2], arr[2:])]
You can also use the module numpy if you intend to work on numerical arrays. For example, the above code can be more elegantly written as:
import numpy
narr = numpy.arange(10)
narr[:-2] + narr[2:]
Adding the nth element to the (n-2)th element is equivalent to adding the mth element to the (m+2)th element (for the mathematically inclined, we performed the substitution n->m+2). The range of n is [2, len(arr)) and the range of m is [0, len(arr)-2). Note the square brackets and parentheses: both intervals are half-open. The elements from 0 to len(arr)-3 (excluding the last two elements) are indexed as [:-2], while the elements from 2 to len(arr)-1 (excluding the first two elements) are indexed as [2:].
I assume that you already know list comprehensions.
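The slicing approach above needs a sequence that supports arr[2:]. For a general iterable (a generator, a file, and so on), a similar pairing can be sketched with itertools.tee and itertools.islice instead of slices. The helper name pairs_with_offset is hypothetical, not a standard function.

```python
from itertools import islice, tee


def pairs_with_offset(iterable, offset):
    """Yield (element, element `offset` steps ahead) pairs."""
    behind, ahead = tee(iterable)
    # Advance the second copy by `offset` items, then walk both together.
    return zip(behind, islice(ahead, offset, None))


# Add each element to the one two steps behind it.
sums = [x + y for x, y in pairs_with_offset(range(10), 2)]
print(sums)  # [2, 4, 6, 8, 10, 12, 14, 16]
```

tee buffers the items between the two copies, so memory use grows with the offset rather than with the length of the input.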
