Accessing index of particular values in Python

I have a matrix of coordinates (a numpy array)
arr = [[a, b, c],
       [d, e, f],
       ...]
where every row is unique, but a, b, c, d, e, f are not.
I'm wondering how to obtain the index at which
arr == [d,e,f]
I'm using
np.where(arr==[d,e,f])
but it returns a whole mess of values at which other individual elements are true.
For example,
vals = arr==[d,e,f]
returns
vals = [[False, False, False],
        [True, True, True],
        ...]
But doing
np.where(vals==[True,True,True])
returns the other elements that contain only one or two trues, as well as the three trues. I just want the one tuple with all three trues.

You can get the indices of the rows that are all True by using numpy.all along the first axis:
>>> import numpy as np
>>> arr1 = np.array(['d', 'e', 'f'])
>>> arr2 = np.array([['a', 'b', 'c'],
...                  ['d', 'e', 'f'],
...                  ['g', 'h', 'i']])
>>> np.all(arr2==arr1, axis=1)
array([False, True, False], dtype=bool)
# Now get the indices using `numpy.where`
>>> np.where(np.all(arr2==arr1, axis=1))[0]
array([1])
>>> arr2[_]
array([['d', 'e', 'f']], dtype='|S1')
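If you need this lookup more than once, the pattern is easy to wrap in a small helper that also handles the no-match case (a minimal sketch; find_row is a hypothetical name, not part of numpy):

import numpy as np

def find_row(arr2d, row):
    # indices of the rows of arr2d equal to row; empty array if none match
    return np.where(np.all(arr2d == np.asarray(row), axis=1))[0]

arr2 = np.array([['a', 'b', 'c'],
                 ['d', 'e', 'f'],
                 ['g', 'h', 'i']])
print(find_row(arr2, ['d', 'e', 'f']))  # [1]
print(find_row(arr2, ['x', 'y', 'z']))  # [] (no match)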

Related

Python: How to assign elements of an array to multiple columns in DataFrame?

So, I have a function like below:
def do_something(row, args):
    # doing something
    return arr
where arr = array([1, 2, 3, 4, 5])
and in my main function I have a pandas DataFrame df with columns A, B, C, D, E, along with other columns.
I use do_something as below, trying to assign each element of the returned array to one of those columns:
df[['A', 'B', 'C', 'D', 'E']] = df.apply(lambda row: do_something(row, args), axis=1)
However, it gave me an error:
ValueError: Must have equal len keys and values when setting with an
iterable
I think it means I'm trying to assign the whole array as a single value to multiple columns in my df, hence the unequal lengths trigger the error. But I'm not sure how to achieve my goal.
The error is telling you that you are trying to assign a single sequence of values to a sequence of keys.
One way around it is to zip the column names with the columns of the returned values to build a dictionary, and then assign the columns from that dictionary.
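A minimal sketch of that idea (the stub below just stands in for the question's do_something; np.stack and the cols list are illustrative assumptions):

import numpy as np
import pandas as pd

def do_something(row, args):
    # stub standing in for the question's function
    return np.array([1, 2, 3, 4, 5])

df = pd.DataFrame(np.ones((2, 6)), columns=['A', 'B', 'C', 'D', 'E', 'other'])
cols = ['A', 'B', 'C', 'D', 'E']

# stack the per-row arrays into one (n_rows, 5) array, then zip the column
# names with that array's columns and assign them all at once via a dict
results = np.stack(df.apply(lambda row: do_something(row, None), axis=1).to_list())
df = df.assign(**dict(zip(cols, results.T)))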
This is because the return value of apply is a pd.Series, which is a single column, while you try to assign it to 5 columns. See this example:
import numpy as np
import pandas as pd

def do_something(row, args):
    arr = np.array([1, 2, 3, 4, 5])
    return arr

df = pd.DataFrame(np.ones((2, 6)), columns=['A', 'B', 'C', 'D', 'E', 'some other column'])
df.apply(lambda row: do_something(row, 2), axis=1)
0    [1, 2, 3, 4, 5]
1    [1, 2, 3, 4, 5]
dtype: object
The solution is to convert the result to a list of rows, so the values can be spread across the 5 columns:
df[['A', 'B', 'C', 'D', 'E']] = df.apply(lambda row: do_something(row, 2), axis=1).to_list()
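If your pandas is recent enough, apply can also do the expansion itself via result_type='expand' (pandas 0.23+); converting to a numpy array keeps the assignment positional rather than label-aligned (a sketch under those assumptions; .to_numpy() needs pandas 0.24+):

expanded = df.apply(lambda row: do_something(row, 2), axis=1, result_type='expand')
df[['A', 'B', 'C', 'D', 'E']] = expanded.to_numpy()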

Process a list of lists, finding all lists that have matching last values?

Given a list of lists
lol = [[0,a], [0,b],
       [1,b], [1,c],
       [2,d], [2,e],
       [2,g], [2,b],
       [3,e], [3,f]]
I would like to extract all sublists that have the same last element (lol[n][1]) and end up with something like below:
[0,b]
[1,b]
[2,b]
[2,e]
[3,e]
I know that given two lists we can use an intersection. What is the right way to go about a problem like this, other than incrementing an index in a for-each loop?
1. Using collections.defaultdict
You can use defaultdict to first group your items by their last element, then iterate over dict.items() to keep the groups with more than one occurrence.
from collections import defaultdict

lol = [[0,'a'], [0,'b'],
       [1,'b'], [1,'c'],
       [2,'d'], [2,'e'],
       [2,'g'], [2,'b'],
       [3,'e'], [3,'f']]

d = defaultdict(list)
for v, k in lol:
    d[k].append(v)

# d looks like -
# defaultdict(list,
#             {'a': [0],
#              'b': [0, 1, 2],
#              'c': [1],
#              'd': [2],
#              'e': [2, 3],
#              'g': [2],
#              'f': [3]})

result = [[v, k] for k, vs in d.items() for v in vs if len(vs) > 1]
print(result)
[[0, 'b'], [1, 'b'], [2, 'b'], [2, 'e'], [3, 'e']]
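Note the result comes out ordered group by group. If you want to preserve the original order of lol instead, counting the last elements first with collections.Counter works too (a small sketch):

from collections import Counter

counts = Counter(k for _, k in lol)
result = [pair for pair in lol if counts[pair[1]] > 1]
print(result)
[[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]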
2. Using pandas.duplicated
Here is how you can do this with Pandas -
Convert to pandas dataframe
For key column, find the duplicates and keep all of them
Convert to list of records while ignoring index
import pandas as pd
df = pd.DataFrame(lol, columns=['val','key'])
dups = df[df['key'].duplicated(keep=False)]
result = list(dups.to_records(index=False))
print(result)
[(0, 'b'), (1, 'b'), (2, 'e'), (2, 'b'), (3, 'e')]
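If you want plain lists rather than numpy record tuples, .values.tolist() on the filtered frame gives that:

result = dups.values.tolist()
print(result)
[[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]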
3. Using numpy.unique
You can solve this in a vectorized manner using numpy -
Convert to a numpy array arr
Find the unique elements u and their counts c
Filter to the unique elements that occur more than once, dup
Use broadcasting to compare the second column of arr against dup, and take any over axis=0 to get a boolean mask that is True for duplicated rows
Filter arr with this mask
import numpy as np
arr = np.array(lol)
u, c = np.unique(arr[:,1], return_counts=True)
dup = u[c > 1]
result = arr[(arr[:,1]==dup[:,None]).any(0)]
result
array([['0', 'b'],
       ['1', 'b'],
       ['2', 'e'],
       ['2', 'b'],
       ['3', 'e']], dtype='<U21')
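The broadcasting step can also be written with np.isin (available since numpy 1.13), which reads a little more directly:

result = arr[np.isin(arr[:, 1], dup)]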

python: sort array when sorting other array

I have two arrays:
a = np.array([1,3,4,2,6])
b = np.array(['c', 'd', 'e', 'f', 'g'])
These two arrays are linked (there is a 1-1 correspondence between their elements), so when I sort a in decreasing order I would like b sorted in the same order.
For instance, when I do:
a = np.sort(a)[::-1]
I get:
a = [6, 4, 3, 2, 1]
and I would like to be able to get also:
b = ['g', 'e', 'd', 'f', 'c']
I would do something like this:
import numpy as np
a = np.array([1,3,4,2,6])
b = np.array(['c', 'd', 'e', 'f', 'g'])
idx_order = np.argsort(a)[::-1]
a = a[idx_order]
b = b[idx_order]
output:
a = [6 4 3 2 1]
b = ['g' 'e' 'd' 'f' 'c']
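For numeric keys you can also argsort the negated array instead of reversing, which gives the descending order directly (ties may land in a different order than argsort()[::-1]):

idx_order = np.argsort(-a)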
I don't know how, or even if, you can do this with numpy arrays. However, there is a way using standard lists, albeit slightly convoluted. Consider this:
a = [1, 3, 4, 2, 6]
b = ['c', 'd', 'e', 'f', 'g']
assert len(a) == len(b)

c = []
for i in range(len(a)):
    c.append((a[i], b[i]))

r = sorted(c)
for i in range(len(r)):
    a[i], b[i] = r[i]

print(a)
print(b)
In your problem statement, there is no explicit relationship between the two lists. What happens here is that we create one by grouping the related data from each list into a temporary list of tuples. In this scenario, sorted() carries out an ascending sort on the first element of each tuple. We then just rebuild our original lists.
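The pairing step is usually spelled with zip, and sorted(..., reverse=True) gives the descending order the question asked for (same idea, more compact):

a = [1, 3, 4, 2, 6]
b = ['c', 'd', 'e', 'f', 'g']

pairs = sorted(zip(a, b), reverse=True)  # sort pairs by a, descending
a = [x for x, _ in pairs]
b = [y for _, y in pairs]
print(a)  # [6, 4, 3, 2, 1]
print(b)  # ['g', 'e', 'd', 'f', 'c']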

pandas data frame / numpy array - roll without aggregate function

rolling in python aggregates data:
import pandas as pd

x = pd.DataFrame([[1,'a'],[2,'b'],[3,'c'],[4,'d']], columns=['a','b'])
y = x.rolling(2).mean()
print(y)
gives:
     a  b
0  NaN  a
1  1.5  b
2  2.5  c
3  3.5  d
What I need is a 3-dimensional DataFrame (or numpy array): windows of 3 samples, shifted by 1 step each time (in this example):
[
[[1,'a'],[2,'b'],[3,'c']],
[[2,'b'],[3,'c'],[4,'d']]
]
What's the right way to do it for windows of 900 samples, shifting by 1 each step?
Using np.concatenate (shown here for windows of length 2):
np.concatenate([x.values[:-1],
                x.values[1:]], axis=1)\
  .reshape([x.shape[0] - 1, x.shape[1], -1])
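For an arbitrary window length, such as the 3 samples the question asks for, numpy 1.20+ offers sliding_window_view (a sketch under that assumption):

from numpy.lib.stride_tricks import sliding_window_view

# windows of 3 rows stepping by 1; the window axis comes out last, so move it
windows = sliding_window_view(x.to_numpy(), 3, axis=0).transpose(0, 2, 1)
# windows[0] -> [[1, 'a'], [2, 'b'], [3, 'c']]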
You can try concatenating shifted copies of the dataframe, one per step of the chosen window length (2 here; df below is the question's x):
length = df.dropna().shape[0] - 1
cols = len(df.columns)
pd.concat([df.shift(1), df], axis=1).dropna() \
  .astype(int, errors='ignore').values.reshape((length, cols, 2))
Out:
array([[[1, 'a'],
        [2, 'b']],

       [[2, 'b'],
        [3, 'c']],

       [[3, 'c'],
        [4, 'd']]], dtype=object)
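The same shift-and-concat idea generalizes to any window length by concatenating one shifted copy per step (a sketch; window of 3 to match the question, df as above):

window = 3
shifted = pd.concat([df.shift(window - 1 - i) for i in range(window)], axis=1).dropna()
result = shifted.values.reshape(len(shifted), window, len(df.columns))
# result[0] -> [[1.0, 'a'], [2.0, 'b'], [3.0, 'c']] (shift upcasts the ints to floats)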
Let me know whether this solution suits your question.
p = x[['a','b']].values.tolist()  # create a list of lists, [row.a, row.b] for every row in x
#### Output ####
[[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']]
# iterate through the list except the last two items; for every i, gather p[i], p[i+1], p[i+2] into a list
list_of_3 = [[p[i],p[i+1],p[i+2]] for i in range(len(p)-2)]
#### Output ####
[
[[1, 'a'], [2, 'b'], [3, 'c']],
[[2, 'b'], [3, 'c'], [4, 'd']]
]
# Use this if you need the result as a numpy ndarray
from numpy import array
a = array(list_of_3)
#### Output ####
[[['1' 'a']
  ['2' 'b']
  ['3' 'c']]

 [['2' 'b']
  ['3' 'c']
  ['4' 'd']]]
Since pandas 1.1 you can iterate over rolling objects:
[window.values.tolist() for window in x.rolling(3) if window.shape[0] == 3]
The if makes sure we only get full windows. This solution has the advantage that you can use any parameter of the handy rolling function of pandas.

Remove rows of dataframe of values in a list present in another list

I'm attempting to remove the rows of df whose lists in Column1 contain any value present in lst.
I'm aware of using df[df[x].isin(y)] for single strings but am not sure how to adjust the method to work with lists inside a dataframe.
lst = ['f','a']
df:
  Column1     Out1
0 ['x', 'y']  a
1 ['a', 'b']  i
2 ['c', 'd']  o
3 ['e', 'f']  u
etc.
I've attempted to use a list comprehension, but it doesn't seem to work the same way with Pandas:
df = df[[i for x in list for i in df['Column1']]]
Error:
TypeError: unhashable type: 'list'
My expected output would be as follows, removing the rows whose lists contain any of the values in lst:
  Column1     Out1
0 ['x', 'y']  a
1 ['c', 'd']  o
etc.
You can convert the values to sets and intersect them with &; to invert the mask, use ~:
import pandas as pd

df = pd.DataFrame({'Column1': [['x','y'], ['a','b'], ['c','d'], ['e','f']],
                   'Out1': list('aiou')})
lst = ['f','a']

df1 = df[~df['Column1'].apply(lambda x: bool(set(x) & set(lst)))]
print (df1)
  Column1 Out1
0  [x, y]    a
2  [c, d]    o
Solution with a nested list comprehension - build a list of booleans, one per row, using all to check that every value in the row's list passes:
df1 = df[[all(x not in lst for x in i) for i in df['Column1']]]
print (df1)
  Column1 Out1
0  [x, y]    a
2  [c, d]    o
print ([[x not in lst for x in i] for i in df['Column1']])
[[True, True], [False, True], [True, True], [True, False]]
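On pandas 0.25+ the same mask can be built with explode, which flattens the lists so isin can test each element, then collapses back per row (a sketch):

mask = df['Column1'].explode().isin(lst).groupby(level=0).any()
df1 = df[~mask]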
