Pandas Equivalent of R's which() - python

Variations of this question have been asked before, I'm still having trouble understanding how to actually slice a python series/pandas dataframe based on conditions that I'd like to set.
In R, what I'm trying to do is:
df[which(df[,colnumber] > somenumberIchoose),]
The which() function finds indices of row entries in a column in the dataframe which are greater than somenumberIchoose, and returns this as a vector. Then, I slice the dataframe by using these row indices to indicate which rows of the dataframe I would like to look at in the new form.
Is there an equivalent way to do this in python? I've seen references to enumerate, which I don't fully understand after reading the documentation. My sample in order to get the row indices right now looks like this:
indexfuture = [ x.index(), x in enumerate(df['colname']) if x > yesterday]
However, I keep on getting an invalid syntax error. I can hack a workaround by for looping through the values, and manually doing the search myself, but that seems extremely non-pythonic and inefficient.
What exactly does enumerate() do? What is the pythonic way of finding indices of values in a vector that fulfill desired parameters?
Note: I'm using Pandas for the dataframes

I may not understand the question clearly, but it looks like the answer is easier than you think:
using pandas DataFrame:
df['colname'] > somenumberIchoose
returns a pandas series with True / False values and the original index of the DataFrame.
Then you can use that boolean series on the original DataFrame and get the subset you are looking for:
df[df['colname'] > somenumberIchoose]
should be enough.
See http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
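As a concrete sketch (the column name "colname" and the threshold are made-up stand-ins for the OP's names), boolean indexing preserves the original index, and `df.index[mask]` plays the role of R's which():

```python
import pandas as pd

# made-up example data; "colname" and the threshold stand in for the OP's names
df = pd.DataFrame({"colname": [1, 5, 3, 8]})

mask = df["colname"] > 3        # boolean Series, aligned on df's index
subset = df[mask]               # rows where the condition holds
which = df.index[mask]          # the matching row labels, like R's which()

print(subset)
print(list(which))              # [1, 3]
```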

From what I know of R, you might be more comfortable working with numpy -- a scientific computing package similar to MATLAB.
If you want the indices of an array whose values are divisible by two, the following would work:
import numpy
arr = numpy.arange(10)
truth_table = arr % 2 == 0
indices = numpy.where(truth_table)
values = arr[indices]
It's also easy to work with multi-dimensional arrays; for example, to find the even values in one row of a 2-D array:
arr2d = arr.reshape(2, 5)
row_index = 0
col_indices = numpy.where(arr2d[row_index] % 2 == 0)
col_values = arr2d[row_index, col_indices]
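One detail worth knowing: with a single condition argument, numpy.where returns a tuple of index arrays (one per dimension), which is why the result can be fed straight back in as a fancy index. A minimal sketch:

```python
import numpy as np

arr = np.arange(10)
idx = np.where(arr % 2 == 0)    # a tuple holding one index array per dimension
print(idx[0].tolist())          # [0, 2, 4, 6, 8]
print(arr[idx].tolist())        # [0, 2, 4, 6, 8]
```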

enumerate() returns an iterator that yields an (index, item) tuple in each iteration, so you can't (and don't need to) call .index() again.
Furthermore, your list comprehension syntax is wrong:
indexfuture = [(index, x) for (index, x) in enumerate(df['colname']) if x > yesterday]
Test case:
>>> [(index, x) for (index, x) in enumerate("abcdef") if x > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
Of course, you don't need to unpack the tuple:
>>> [tup for tup in enumerate("abcdef") if tup[1] > "c"]
[(3, 'd'), (4, 'e'), (5, 'f')]
unless you're only interested in the indices, in which case you could do something like
>>> [index for (index, x) in enumerate("abcdef") if x > "c"]
[3, 4, 5]

And if you need an additional condition, pandas.Series allows you to do operations between Series (+, -, /, *).
Just multiply the boolean indexes:
idx1 = df['lat'] == 49
idx2 = df['lng'] > 15
idx = idx1 * idx2
new_df = df[idx]
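Multiplying the masks works because True/False behave as 1/0, but the idiomatic pandas operators are & (and), | (or), and ~ (not), with parentheses around each comparison. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"lat": [49, 49, 50], "lng": [10, 20, 20]})

idx = (df["lat"] == 49) & (df["lng"] > 15)
new_df = df[idx]
print(new_df)          # only row 1 satisfies both conditions
```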

Instead of enumerate, I usually just use .iteritems (called .items in recent pandas versions). This saves a .index(). Namely,
[k for k, v in (df['c'] > t).iteritems() if v]
Otherwise, one has to do
df[df['c'] > t].index
This duplicates the typing of the data frame name, which can be very long and painful to type.

A nice, simple and neat way of doing this is the following:
SlicedData1 = df[df.colname > somenumber]
This can easily be extended to include other criteria, such as non-numeric data (note that each condition needs its own parentheses, because & binds more tightly than the comparison operators):
SlicedData2 = df[(df.colname1 > somenumber) & (df.colname2 == '24/08/2018')]
And so on...
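For several conditions, DataFrame.query can be easier to read; it takes the filter as a single string (column names below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"colname1": [1, 10, 20],
                   "colname2": ["a", "b", "b"]})

subset = df.query("colname1 > 5 and colname2 == 'b'")
print(subset)          # rows 1 and 2
```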

Related

find first element and index matching condition in list

Consider this simple example
mylist = [-1,-2,3,4,5,6]
for idx, el in enumerate(mylist):
    if el > 0:
        myidx, myel = idx, el
        break
myidx, myel
Out[20]: (2, 3)
I am interested in finding the first index and the corresponding first element in a python list that matches a specific condition (here, this is simply > 0).
In the code above, I loop over the elements using enumerate and then use the if clause to find the correct elements. This looks very cumbersome to me. Is there a better way to do this? Using a native python function for instance?
Thanks!
Something like this should work:
l = [-1,-2,3,4,5,6]
list(x > 0 for x in l).index(True)
# Output: 2
To find all matches, we can use itertools.filterfalse from the standard library:
from itertools import filterfalse
f = filterfalse(lambda x: x[1] <= 0, enumerate(l))
print(list(f))
# [(2, 3), (3, 4), (4, 5), (5, 6)]
You could do it in a list comprehension. This is basically the same as your code but condensed into one line, and it builds a list of results that match the criteria.
The first way gets all the matches
mylist = [-1,-2,3,4,5,6]
results = [(i, el) for i, el in enumerate(mylist) if el > 0]
Another way would be to use a generator expression which is probably faster, and just unpack it. This gets the first one.
myidx, myel = next((i, el) for i, el in enumerate(mylist) if el > 0)
This loops the list and checks the condition, then puts the index and element into a tuple. Doing this inside parentheses turns it into a generator, which is much faster because it doesn't have to hold everything in memory; it just generates the responses as you need them. Using next() you can iterate through them. As we only use next() once here, it just generates the first match, and we unpack the resulting tuple into myidx and myel.
As there are two other valid answers here I decided to use timeit module to time each of them and post the results. For clarity I also timed the OP's method. Here is what I found:
import timeit
# Method 1 Generator Expression
print(timeit.timeit('next((i, el) for i, el in enumerate([-1,-2,3,4,5,6]) if el > 0)', number=100000))
0.007089499999999999
# Method 2 Getting index of True
print(timeit.timeit('list(x > 0 for x in [-1,-2,3,4,5,6]).index(True)', number=100000))
0.008104599999999997
# Method 3 filter and lambda
print(timeit.timeit('myidx , myel = list(filter(lambda el: el[1] > 0, enumerate([-1,-2,3,4,5,6])))[0]', number=100000))
0.0155314
statement = """
for idx, el in enumerate([-1,-2,3,4,5,6]):
    if el > 0:
        myidx, myel = idx, el
        break
"""
print(timeit.timeit(statement, number=100000))
0.04074070000000002
You can make use of the combination of lambda and filter like this:
mylist = [-1,-2,3,4,5,6]
myidx, myel = list(filter(lambda el: el[1] > 0, enumerate(mylist)))[0]
print("({}, {})".format(myidx, myel))
Explanation:
The filter() function, which offers an elegant way to filter out elements, takes a function and an iterable as arguments. Here they are the lambda and enumerate(mylist). Since you want the corresponding index as well, we wrap the list in enumerate.
Basically, enumerate(mylist) yields tuples of an index and the corresponding value. Our condition compares the value with 0, which is why we use el[1] instead of el[0] in the comparison.
The result is cast to a list. This list includes all the pairs (index, value) that meet our condition. Here we want the first pair, which is why we have [0] at the end.
Output:
(2, 3)

How to find a unique element in the three-element array?

I have a 3 element tuple where one of the elements is dissimilar from the other two. For example, it could be something like: (0.456, 0.768, 0.456).
What is the easiest way to find the index of this dissimilar element? One way I can think of is to compare the pairs at indices (0, 1) and (1, 2); one of these pairs will be dissimilar. If it is (0, 1), compare those elements to the element at index 2; otherwise, compare the elements at (1, 2) to the element at index 0 to find the dissimilar element.
Feels like I am missing a pythonic way to do this.
A simple approach:
def func(arr):
    x, y, z = arr
    return 2 * (x == y) + (x == z)
Test:
func(['B', 'A', 'A'])
# 0
func(['A', 'B', 'A'])
# 1
func(['A', 'A', 'B'])
# 2
You could count the occurrences of each element in the list and then find the index of the position where only one occurrence exists, but I have a feeling this may not be as performant as your solution. It also wouldn't work if all 3 values are distinct.
[my_tuple.count(x) for x in my_tuple].index(1)
You could try this:
index = [my_tuple.index(i) for i in my_tuple if my_tuple.count(i) == 1][0]
I'm not sure it is great performance-wise though.
This may look like huge overkill in Python 3, but I couldn't help posting:
import collections
a = (0.768, 0.456, 0.456)
print("Dissimilar object index: ", a.index(list(collections.Counter(a).keys())[list(collections.Counter(a).values()).index(1)]))
Explanation:
collections.Counter(a) will return a frequency dict, e.g. {0.768: 1, 0.456: 2}. Then we just create lists of its keys and values so we can use index(1) on the values to find the key that occurs only once. Finally we use a.index(odd_one_out_val) to find its position.
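A somewhat shorter Counter-based sketch of the same idea: most_common() orders entries by frequency, so (assuming exactly one dissimilar element, as in the question) the last entry is the odd one out:

```python
from collections import Counter

a = (0.768, 0.456, 0.456)
odd_value, _ = Counter(a).most_common()[-1]   # least frequent value
print(a.index(odd_value))                     # 0
```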

How can I improve this Pandas DataFrame construction?

I wrote this ugly piece of code. It does the job, but it is not elegant. Any suggestion to improve it?
Function returns a dict given i, j.
pairs = [dict({"i":i, "j":j}.items() + function(i, j).items()) for i,j in my_iterator]
pairs = pd.DataFrame(pairs).set_index(['i', 'j'])
The dict({}.items() + function(i, j).items()) is supposed to merge both dicts into one, as dict.update() does not return the merged dict.
A favourite trick* to return an updated, newly created dictionary:
dict(i=i, j=j, **function(i, j))
*and of much debate on whether this is actually "valid"...
Perhaps also worth mentioning the DataFrame from_records method:
In [11]: my_iterator = [(1, 2), (3, 4)]
In [12]: df = pd.DataFrame.from_records(my_iterator, columns=['i', 'j'])
In [13]: df
Out[13]:
i j
0 1 2
1 3 4
I suspect there would be a more efficient method by vectorizing your function (but it's hard to say what makes more sense without more specifics of your situation)...
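Since `function(i, j)` returns a dict, the merged dicts can also go straight into the DataFrame constructor; a sketch with a stand-in for the OP's function:

```python
import pandas as pd

def function(i, j):                      # stand-in: returns a dict, like the OP's
    return {"total": i + j}

my_iterator = [(1, 2), (3, 4)]
pairs = pd.DataFrame([dict(i=i, j=j, **function(i, j)) for i, j in my_iterator])
pairs = pairs.set_index(["i", "j"])
print(pairs)
```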

Performance for finding the maximum value in a dictionary versus numpy array

I have a large (in thousands) collection of word : value (float) pairs. I need to find the best of the value and extract the corresponding associated word. For example, I have (a,2.4),(b,5.2),(c,1.2),(d,9.2),(e,6.3),(f,0.4). I would like (d,9.2) as the output.
Currently, I am using a dictionary to store these tuples and use the max operator to retrieve the maximum key value in the dictionary. I was wondering if a numpy array would be more efficient. Soliciting expert opinions here.
I don't see how a numpy array would help you in this case.
In particular, any conversion of a data structure into another (in your case a list of tuples into a numpy array or a heapq) will be much slower than finding the maximum value by iterating over each tuple. This is because converting the data structure also requires iterating over the original one, plus instantiating an object for the new structure, plus storing the values into the new structure, plus using the new structure to get the requested value.
Using a built-in function or method of your list will most probably result in a faster computation. The most trivial implementation I can think of:
>>> li = [('a', 10), ('b', 30), ('c', 20)]
>>> max(li, key=lambda e : e[1])[0]
'b'
Other possibilities, if you are also interested in things like the lowest value or popping the found value off the list, go through sorting (so you examine the original list only once!):
>>> li = [('a', 10), ('b', 30), ('c', 20)]
>>> li.sort(key=lambda e : e[1])
>>> li
[('a', 10), ('c', 20), ('b', 30)]
>>> li[-1][0]
'b'
Or:
>>> sorted(li, key=lambda e: e[1])[-1][0]
'b'
HTH!
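If only the top entry (or the top k) is needed, heapq.nlargest avoids sorting the whole list; a minimal sketch:

```python
import heapq

li = [('a', 10), ('b', 30), ('c', 20)]
best = heapq.nlargest(1, li, key=lambda e: e[1])[0]   # single largest by value
print(best[0])   # 'b'
```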
Using Numpy here would involve keeping the float values in a separate ndarray. Find the index of max value using argmax and get the word from a separate list. This is very fast, but constructing the ndarray only to find the max is not. Example:
import numpy as np
import operator
names = [str(x) for x in range(10000)]
values = [float(x) for x in range(10000)]
tuples = list(zip(names, values))
dic = dict(tuples)
npvalues = np.fromiter(values, np.float64)
def fa():
    return names[npvalues.argmax()]
def fb():
    return max(tuples, key=operator.itemgetter(1))[0]
def fc():
    return max(dic, key=dic.get)
def fd():
    v = np.fromiter((x[1] for x in tuples), np.float64)
    return tuples[v.argmax()][0]
Timings: fa 67 µs, fb 2300 µs, fc 2580 µs, fd 3780 µs.
So, using Numpy (fa) is over 30 times faster than using a plain list (fb) or dictionary (fc), when the time to construct the Numpy array is not taken into account. (fd takes it into account)

Python List indexed by tuples

I'm a Matlab user needing to use Python for some things, I would really appreciate it if someone can help me out with Python syntax:
(1) Is it true that lists can be indexed by tuples in Python? If so, how do I do this? For example, I would like to use that to represent a matrix of data.
(2) Assuming I can use a list indexed by tuples, say, data[(row,col)], how do I remove an entire column? I know in Matlab, I can do something like
new_data = [data(:,1:x-1) data(:,x+1:end)];
if I wanted to remove column x from data.
(3) How can I easily count the number of non-negative elements in each row? For example, in Matlab, I can do something like this:
sum(data>=0,1)
this would give me a column vector that represents the number of non-negative entries in each row.
Thanks a lot!
You should look into numpy, it's made for just this sort of thing.
(1) No, but dicts can.
(2) Sounds like you want a "2d array", matrix type, or something else. Have you looked at numpy yet?
(3) Depends on what you choose from #2, but Python does have sum and other functions that work directly on iterables. Look at gen-exprs (generator expressions) and list comprehensions. For example:
row_count_of_non_neg = sum(1 for n in row if n >= 0)
# or:
row_count_of_non_neg = sum(n >= 0 for n in row)
# "abusing" True == 1 and False == 0
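Applied row by row to a plain list-of-lists matrix, the same expression reproduces the Matlab-style count per row; a sketch:

```python
# made-up 2x3 matrix as a list of lists
data = [[1, -2, 3],
        [-4, -5, 6]]

# count non-negative entries in each row, "abusing" True == 1
counts = [sum(n >= 0 for n in row) for row in data]
print(counts)   # [2, 1]
```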
I agree with everyone. Use Numpy/Scipy. But here are specific answers to your questions.
(1) Yes. And the index can either be a built-in list or a Numpy array. Suppose x = numpy.array([10, 11, 12, 13]) and y = numpy.array([0, 2]). Then x[[0, 2]] and x[y] both return the same thing.
(2) new_data = numpy.delete(data, x, axis=1)
(3) (data >= 0).sum(axis=1)
Careful: the axis argument is a common pitfall with Numpy/Scipy. axis=0 operates along the first dimension of an array (down the rows), axis=1 along the second (across the columns), and so on. So deleting column x in Example 2 and counting within each row in Example 3 both need axis=1.
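A tiny end-to-end sketch of the three answers on made-up data (np.delete with axis=1 removes a column; sum with axis=1 counts within each row):

```python
import numpy as np

data = np.array([[1, -2, 3],
                 [-4, 5, 6]])

x = 1                                    # column index to drop
new_data = np.delete(data, x, axis=1)    # columns 0 and 2 remain
counts = (data >= 0).sum(axis=1)         # non-negative entries per row

print(new_data.tolist())   # [[1, 3], [-4, 6]]
print(counts.tolist())     # [2, 2]
```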
Here's an example of how to easily create an array (matrix) in numpy:
>>> import numpy
>>> a = numpy.array([[1,2,3],[4,5,6],[7,8,9]])
Here is how it is displayed:
>>> a
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
and how to get a row or column:
>>> a[0,:]
array([1, 2, 3])
>>> a[:,0]
array([1, 4, 7])
Hope the syntax is clear from the example! Numpy is rather powerful.
You can expand list functionality to allow indexing with tuples by overloading the __getitem__ and __setitem__ methods of the built-in list. Try the following code:
class my_list(list):
    def __getitem__(self, key):
        if isinstance(key, tuple) and len(key) > 0:
            temp = []
            for k in key:
                temp.append(list.__getitem__(self, k))
            return temp
        else:
            return list.__getitem__(self, key)

    def __setitem__(self, key, data):
        if isinstance(key, tuple) and len(key) > 0:
            for k in key:
                list.__setitem__(self, k, data)
        else:
            list.__setitem__(self, key, data)

if __name__ == '__main__':
    L = my_list([1, 2, 3, 4, 5])
    T = (1, 3)
    print(L[T])
(1)
I don't think you can use a tuple as an index of a python list. You may use a list of lists (e.g. a[i][j]), but it seems that's not your point. You may use a dictionary whose key is a tuple:
d = { (1,1): 1, (2,1): 2 ... }
(2)
If you don't mind about the performance:
d = { k: v for k, v in d.items() if k[1] != col_number }
(3)
You can also use filter to do that:
sum(1 for k, v in filter(lambda kv: kv[0][0] == row_num and kv[1] >= 0, d.items()))
No, it isn't the case that a list can be indexed by anything but an integer. A dictionary, however, is another case. A dictionary is a hash table consisting a key-value pairs. Keys must be unique and immutable. The value can be objects of any type, including integers, tuples, lists, or other dictionaries. For your example, tuples can serve as keys, since they are immutable. Lists, on the other hand, aren't and, thus, can't be dictionary keys.
Some of the capabilities you've asked about could be implemented as a combination of a dictionary and list comprehensions. Others would require subclassing the dictionary and adding methods to implement your desired functionality.
Using native python you could use:
my_list = [0, 1, 2, 3]
index_tuple = (1,2)
x = [item for idx, item in enumerate(my_list) if idx in index_tuple]
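operator.itemgetter gives a compact way to pull several positions at once (it returns a tuple when given multiple indices); a sketch:

```python
from operator import itemgetter

my_list = [0, 1, 2, 3]
index_tuple = (1, 2)
x = itemgetter(*index_tuple)(my_list)   # the items at positions 1 and 2
print(x)   # (1, 2)
```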
