query dataframe column on array values - python

traj0
Out[52]:
state action reward
0 [1.0, 4.0, 6.0] 3.0 4.0
1 [4.0, 6.0, 11.0] 4.0 5.0
2 [6.0, 7.0, 3.0] 3.0 22.0
3 [3.0, 3.0, 2.0] 1.0 10.0
4 [2.0, 9.0, 5.0] 2.0 2.0
Suppose I have a pandas dataframe that looks like this, where the state column's entries are 3-element numpy arrays.
How can I query for the row that has state as np.array([3.0,3.0,2.0]) here?
I know traj0.query("state == '[3.0,3.0,2.0]'") works, but I don't want to hardcode the array value in my query.
I'm looking for something like
x = np.array([3.0,3.0,2.0])
traj0.query('state ==' + x)
=============
It's not a duplicate question: my previous question, pandas query with a column consisting of array entries, only covered the case where each array held a single value. Here I'm asking about arrays that hold multiple values.

You can do this with df.loc and a lambda function using numpy.array_equal:
x = [1., 4., 6.]
df.loc[df.state.apply(lambda a: np.array_equal(a, x))]
Basically this checks each element of the state column for equivalence to x and returns only those rows where the column matches.
Example
df = pd.DataFrame(data={'state': [[1., 4., 6.], [4., 5., 6.]],
'value': [5, 6]})
print(df.loc[df.state.apply(lambda a: np.array_equal(a, x))])
state value
0 [1.0, 4.0, 6.0] 5

import numpy as np
import pandas as pd
df = pd.DataFrame([[np.array([1.0, 4.0, 6.0]), 3.0, 4.0],
[np.array([4.0, 6.0, 11.0]), 4.0, 5.0],
[np.array([6.0, 7.0, 3.0]), 3.0, 22.0],
[np.array([3.0, 3.0, 2.0]), 1.0, 10.0],
[np.array([2.0, 9.0, 5.0]), 2.0, 2.0]
], columns=['state','action','reward'])
x = str(np.array([3.0, 3.0, 2.0]))
df[df.state.astype(str) == x]
# to use pd.query
df['state_str'] = df.state.astype(str)
df.query("state_str == '{}'".format(x))
Output
state action reward
3 [3.0, 3.0, 2.0] 1.0 10.0

Best not to use pd.DataFrame.query here. You can perform a vectorised comparison and then use Boolean indexing:
x = [3, 3, 2]
mask = (np.array(df['state'].values.tolist()) == x).all(1)
res = df[mask]
print(res)
state action reward
3 [3.0, 3.0, 2.0] 1.0 10.0
In general, you shouldn't store lists or arrays within a Pandas series. This is inefficient and removes the possibility of direct vectorised operations. Here, we've had to convert to a NumPy array explicitly for a simple comparison.
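As a sketch of that advice (the s0/s1/s2 column names are hypothetical, not from the question), you could expand the state arrays into ordinary scalar columns once and then compare rows with vectorised operations:
states = pd.DataFrame(traj0['state'].tolist(),
                      columns=['s0', 's1', 's2'],  # hypothetical names
                      index=traj0.index)
x = np.array([3.0, 3.0, 2.0])
traj0[(states == x).all(axis=1)]  # rows whose state equals x
The comparison then runs on a flat numeric frame, so no per-row Python-level apply is needed.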

Related

Keep nan in result when performing statsmodels OLS regression in python

I want to perform OLS regression using python's statsmodels package, but my dataset has nans in it. I know I can use the missing='drop' option when performing OLS regression, but then some of the results (fitted values or residuals) have a different length than the original y variable.
I have the following code as an example:
import numpy as np
import statsmodels.api as sm
yvars = np.array([1.0, 6.0, 3.0, 2.0, 8.0, 4.0, 5.0, 2.0, np.nan, 3.0])
xvars = np.array(
[
[1.0, 8.0],
[8.0, np.nan],
[np.nan, 3.0],
[3.0, 6.0],
[5.0, 3.0],
[2.0, 7.0],
[1.0, 3.0],
[2.0, 2.0],
[7.0, 9.0],
[3.0, 1.0],
]
)
res = sm.OLS(yvars, sm.add_constant(xvars), missing='drop').fit()
res.resid
The result is as follows:
array([-0.71907958, -1.9012464 , 1.78811122, 1.18983701, 2.63854267,
-1.45254075, -1.54362416])
My question is that the result is an array of length 7 (after dropping nans), but the length of yvars is 10. What if I want to return residuals of the same length as yvars, with nan in every position where there is at least one nan in either yvars or xvars?
Basically, the result I want to get is:
array([-0.71907958, nan , nan , -1.9012464 , 1.78811122, 1.18983701, 2.63854267,
-1.45254075, nan , -1.54362416])
That's too difficult to implement in statsmodels. So users need to handle it themselves.
The results attributes like fittedvalues and resid are for the actual sample used.
The predict method of the results instance preserves nans in the provided predict data exog array, but other methods and attributes do not.
results.predict(xvars_all)
One workaround:
Use a pandas DataFrame for the data.
Then, AFAIR, resid and fittedvalues of the results instance are pandas Series with the appropriate index.
This can then be used to add those to the original index or DataFrame.
That's what the predict method does.
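A minimal sketch of that workaround, assuming the example arrays above and that resid really does come back indexed by the surviving rows:
import pandas as pd
import statsmodels.api as sm

y = pd.Series(yvars)
X = sm.add_constant(pd.DataFrame(xvars))

res = sm.OLS(y, X, missing='drop').fit()

# resid only covers the rows that survived the nan drop;
# reindexing against the original index reinserts nan elsewhere
resid_full = res.resid.reindex(y.index)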

Extract longest block of continuous non-null values from each row in Pandas data frame

Suppose I have a Pandas data frame structured similarly to the following:
data = {
'A' : [5.0, np.nan, 1.0],
'B' : [7.0, np.nan, np.nan],
'C' : [9.0, 2.0, 6.0],
'D' : [np.nan, 4.0, 9.0],
'E' : [np.nan, 6.0, np.nan],
'F' : [np.nan, np.nan, np.nan],
'G' : [np.nan, np.nan, 8.0]
}
df = pd.DataFrame(
data,
index=['11','22','33']
)
From each row, I would like to extract the longest continuous block of non-null values and append them to a list.
So the following values from these rows:
row11: [5,7,9]
row22: [2,4,6]
row33: [6,9]
Giving me a list of values:
[5.0, 7.0, 9.0, 2.0, 4.0, 6.0, 6.0, 9.0]
My current approach uses iterrows(), first_valid_index(), and last_valid_index():
mylist = []
for i, r in df.iterrows():
    start = r.first_valid_index()
    end = r.last_valid_index()
    mylist.extend(r[start:end].values)
This works fine when the valid digits are blocked together such as row11 and row22. However my approach falls down when digits are interspersed with null values such as in row33. In this case, my approach extracts the entire row as the first and last index contain non-null values. My solution (incorrectly) outputs a final list of:
[5.0, 7.0, 9.0, 2.0, 4.0, 6.0, 1.0, nan, 6.0, 9.0, nan, nan, 8.0]
I have the following questions:
1.) How can I combat the error I'm facing in the example of row33?
2.) Is there a more efficient approach than using iterrows()? My actual data has many thousands of rows. While it isn't necessarily too slow, I'm always wary of resorting to iteration when using Pandas.
One option using a groupby to get the stretches of non-NA and max to filter the longest:
def get_longest(s):
    m = s.isna()
    return max(s[~m].groupby(m.cumsum()),
               key=lambda x: len(x[1]))[1].dropna().tolist()
out = df.apply(get_longest, axis=1)
Output:
11 [5.0, 7.0, 9.0]
22 [2.0, 4.0, 6.0]
33 [6.0, 9.0]
dtype: object
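If you then want the single flat list from the question rather than a Series of lists, one way (using the out Series above) is:
mylist = [v for row in out for v in row]
# [5.0, 7.0, 9.0, 2.0, 4.0, 6.0, 6.0, 9.0]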
With the numpy.ma.masked_invalid and numpy.ma.clump_unmasked functions, you can split a row into continuous slices of non-nan values and select the slice with the largest length:
res = df.apply(lambda x: x[max(np.ma.clump_unmasked(np.ma.masked_invalid(x.values)),
key=lambda sl: sl.stop - sl.start)].tolist(), axis=1)
11 [5.0, 7.0, 9.0]
22 [2.0, 4.0, 6.0]
33 [6.0, 9.0]

Add values when there is a gap between elements

I have a list defined as
A = [1.0, 3.0, 6.0, 7.0, 8.0]
I am trying to fill the gaps between the elements of the list with zero values. A gap is an increment between elements that is more than one. So, for instance, between 1.0 and 3.0 there is one gap (2.0), and between 3.0 and 6.0 there are two gaps (4.0 and 5.0).
I am working with this code, but it is not complete: it does not add multiple values when the gap is bigger than one increment.
B = []
cnt = 0
for i in range(len(A)-1):
    if A[i] == A[i+1] - 1:
        B.append(A[cnt])
        cnt += 1
    if A[i] != A[i+1] - 1:
        B.append(A[cnt])
        B.append(0.0)
        cnt += 1
The output of this code is:
B = [1.0, 0.0, 3.0, 0.0, 6.0, 7.0]
But since there are two gaps between 3.0 and 6.0 I need B to look like this:
B = [1.0, 0.0, 3.0, 0.0, 0.0, 6.0, 7.0]
I am a bit stuck on how to do this and I already have a feeling that my code is not very optimized. Any help is appreciated!
You can use a list comprehension. Assuming your list is ordered, you can extract the first and last elements of A. We use a set for O(1) lookup within the comprehension.
A = [1.0, 3.0, 6.0, 7.0, 8.0]
A_set = set(A)
res = [i if i in A_set else 0 for i in range(int(A[0]), int(A[-1])+1)]
print(res)
[1, 0, 3, 0, 0, 6, 7, 8]
However, for larger arrays I'd recommend you use a specialist library such as NumPy:
import numpy as np
A = np.array([1.0, 3.0, 6.0, 7.0, 8.0]).astype(int)
B = np.zeros(A.max())
B[A-1] = A
print(B)
[1. 0. 3. 0. 0. 6. 7. 8.]
Based on comments to the question, I can suggest the following solution:
B = [float(x) if x in A else 0.0 for x in range(int(min(A)), int(max(A)) + 1)]
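If you would rather stay close to the original loop, here is a sketch (my own, assuming A is sorted and holds whole-number floats) that appends one zero per missing increment:
A = [1.0, 3.0, 6.0, 7.0, 8.0]
B = []
for i in range(len(A) - 1):
    B.append(A[i])
    # one 0.0 for every missing integer between consecutive elements
    B.extend([0.0] * int(A[i + 1] - A[i] - 1))
B.append(A[-1])
print(B)  # [1.0, 0.0, 3.0, 0.0, 0.0, 6.0, 7.0, 8.0]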

Return first non NaN value in python list

What would be the best way to return the first non nan value from this list?
testList = [nan, nan, 5.5, 5.0, 5.0, 5.5, 6.0, 6.5]
edit:
nan is a float
You can use next, a generator expression, and math.isnan:
>>> from math import isnan
>>> testList = [float('nan'), float('nan'), 5.5, 5.0, 5.0, 5.5, 6.0, 6.5]
>>> next(x for x in testList if not isnan(x))
5.5
>>>
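If the list could be all nan, next raises StopIteration; passing a default as the second argument avoids that:
>>> next((x for x in testList if not isnan(x)), None)
5.5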
It would be very easy if you were using NumPy:
array[numpy.isfinite(array)][0]
... returns the first finite (non-NaN and non-inf) value in the NumPy array 'array'.
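For example, converting the question's list to an array first:
>>> import numpy as np
>>> arr = np.array(testList)
>>> arr[np.isfinite(arr)][0]
5.5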
If you're doing it a lot, put it into a function to make it readable and easy:
import math
t = [float('nan'), float('nan'), 5.5, 5.0, 5.0, 5.5, 6.0, 6.5]
def firstNonNan(listfloats):
    for item in listfloats:
        if not math.isnan(item):
            return item
firstNonNan(t)
5.5
one line lambda below:
from math import isnan
lst = [float('nan'), float('nan'), 5.5, 5.0, 5.0, 5.5, 6.0, 6.5]
lst
[nan, nan, 5.5, 5.0, 5.0, 5.5, 6.0, 6.5]
first non nan value
lst[lst.index(next(filter(lambda x: not isnan(x), lst)))]
5.5
index of first non nan value
lst.index(next(filter(lambda x: not isnan(x), lst)))
2

Computing mean for non-unique elements of numpy array pairs

I have three arrays, all of the same size:
arr1 = np.array([1.4, 3.0, 4.0, 4.0, 7.0, 9.0, 9.0, 9.0])
arr2 = np.array([2.3, 5.0, 2.3, 2.3, 4.0, 6.0, 5.0, 6.0])
data = np.array([5.4, 7.1, 9.5, 1.9, 8.7, 1.8, 6.1, 7.4])
arr1 can take on any float value and arr2 only a few float values. I want to obtain the unique pairs of arr1 and arr2, e.g.
arr1unique = np.array([1.4, 3.0, 4.0, 7.0, 9.0, 9.0])
arr2unique = np.array([2.3, 5.0, 2.3, 4.0, 6.0, 5.0])
For each non-unique pair I need to average the corresponding elements in the data array, e.g. averaging the values 9.5 and 1.9 since the pairs (arr1[2], arr2[2]) and (arr1[3], arr2[3]) are equal. The same holds for the values in data at indices 5 and 7. The data array therefore becomes
dataunique = np.array([5.4, 7.1, 5.7, 8.7, 4.6, 6.1])
Here is a 'pure numpy' solution to the problem. Pure numpy in quotes because it relies on a numpy enhancement proposal which I am still working on, but you can find the full code here:
http://pastebin.com/c5WLWPbp
group_by((arr1, arr2)).mean(data)
Voila, problem solved. Way faster than any of the posted solutions; and much more elegant too, if I may say so myself ;).
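Since the linked code isn't reproduced here, a plain-NumPy sketch of the same idea using np.unique and np.bincount (my own illustration; note it returns the unique pairs in sorted rather than first-seen order):
pairs = np.column_stack((arr1, arr2))
uniq, inv = np.unique(pairs, axis=0, return_inverse=True)  # group id per row

# per-group sums divided by group sizes give the per-pair means
dataunique = np.bincount(inv, weights=data) / np.bincount(inv)
arr1unique, arr2unique = uniq[:, 0], uniq[:, 1]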
defaultdict can help you here:
>>> import numpy as np
>>> arr1 = np.array([1.4, 3.0, 4.0, 4.0, 7.0, 9.0, 9.0, 9.0])
>>> arr2 = np.array([2.3, 5.0, 2.3, 2.3, 4.0, 6.0, 5.0, 6.0])
>>> data = np.array([5.4, 7.1, 9.5, 1.9, 8.7, 1.8, 6.1, 7.4])
>>> from collections import defaultdict
>>> dd = defaultdict(list)
>>> for x1, x2, d in zip(arr1, arr2, data):
...     dd[x1, x2].append(d)
...
>>> arr1unique = np.array([x[0] for x in dd.iterkeys()])
>>> arr2unique = np.array([x[1] for x in dd.iterkeys()])
>>> dataunique = np.array([np.mean(x) for x in dd.itervalues()])
>>> print arr1unique
[ 1.4  7.   4.   9.   9.   3. ]
>>> print arr2unique
[ 2.3  4.   2.3  5.   6.   5. ]
>>> print dataunique
[ 5.4  8.7  5.7  6.1  4.6  7.1]
This method gives your answer, but destroys the ordering. If the ordering is important, you can do basically the same thing with collections.OrderedDict
Make a dictionary with arr1 as the key and store its equivalent arr2 as the value. For each item saved to the dictionary, generate its dataunique entry. If the key already exists, skip that iteration and continue.
All you have to do is create an OrderedDict that stores the keys as pairs of elements from (arr1, arr2) and the values as lists of elements from data. For any duplicate key (pair of arr1 and arr2), the duplicate entries are stored in the list. You can then traverse the values in the dictionary and compute the average. To get the unique keys, just iterate over the keys and split the tuples.
Try the following
>>> import collections
>>> d = collections.OrderedDict()
>>> for k1, k2, v in zip(arr1, arr2, data):
...     d.setdefault((k1, k2), []).append(v)
>>> np.array([np.mean(v) for v in d.values()])
array([ 5.4, 7.1, 5.7, 8.7, 4.6, 6.1])
>>> arr1unique = np.array([e[0] for e in d])
>>> arr2unique = np.array([e[1] for e in d])
