pandas data frame / numpy array - roll without aggregate function - python

rolling in python aggregates data:
x = pd.DataFrame([[1,'a'],[2,'b'],[3,'c'],[4,'d']], columns=['a','b'])
y = x.rolling(2).mean()
print(y)
gives:
     a  b
0  NaN  a
1  1.5  b
2  2.5  c
3  3.5  d
What I need instead is a 3-dimensional structure (a list of dataframes or a numpy array) holding windows of 3 samples, each shifted by 1 step (in this example):
[
[[1,'a'],[2,'b'],[3,'c']],
[[2,'b'],[3,'c'],[4,'d']]
]
What's the right way to do this for 900 samples, shifting by 1 at each step?

Using np.concatenate (this builds windows of length 2):
np.concatenate([x.values[:-1],
x.values[1:]], axis=1)\
.reshape([x.shape[0] - 1, x.shape[1], -1])
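The snippet above pairs each row with its successor, so it is tied to a window length of 2. A minimal sketch for an arbitrary window length, assuming the data fits in memory, is to build a matrix of row positions by broadcasting and index with it. The names `window`, `n_windows` and `idx` are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']], columns=['a', 'b'])

window = 3  # assumed window length; the question asks for 3 samples per window
values = x.to_numpy()
n_windows = len(values) - window + 1

# Row i of idx holds the row positions i, i+1, ..., i+window-1.
idx = np.arange(n_windows)[:, None] + np.arange(window)[None, :]

# Indexing with a 2-D integer array yields shape (n_windows, window, n_columns).
windows = values[idx]
print(windows.shape)
```

With 900 rows and the same window length, the result would have shape (898, 3, 2).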

You can concatenate shifted copies of the dataframe, one per position in the window (a window length of 2 is used here), and then reshape:
length = df.dropna().shape[0]-1
cols = len(df.columns)
pd.concat([df.shift(1),df],axis=1).dropna().astype(int,errors='ignore').values.reshape((length,cols,2))
Out:
array([[[1, 'a'],
[2, 'b']],
[[2, 'b'],
[3, 'c']],
[[3, 'c'],
[4, 'd']]], dtype=object)

Let me know whether this solution suits your question.
p = x[['a','b']].values.tolist() # create a list of list ,as [i.a,i.b] for every i row in x
#### Output ####
[[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd']]
#iterate through list except last two and for every i, fetch p[i],p[i+1],p[i+2] into a list
list_of_3 = [[p[i],p[i+1],p[i+2]] for i in range(len(p)-2)]
#### Output ####
[
[[1, 'a'], [2, 'b'], [3, 'c']],
[[2, 'b'], [3, 'c'], [4, 'd']]
]
# Use this if you need the result as a numpy ndarray
from numpy import array
a = array(list_of_3)
#### Output ####
[[['1' 'a']
['2' 'b']
['3' 'c']]
[['2' 'b']
['3' 'c']
['4' 'd']]
]

Since pandas 1.1 you can iterate over rolling objects:
[window.values.tolist() for window in x.rolling(3) if window.shape[0] == 3]
The if makes sure we only get full windows. This solution has the advantage that you can use any parameter of the handy rolling function of pandas.

Related

Python Equivalent for R's order function

According to this post np.argsort() would be the function I am looking for.
However, this is not giving me my desired result.
Below is the R code that I am trying to convert to Python and my current Python code.
R Code
data.frame %>% select(order(colnames(.)))
Python Code
dataframe.iloc[numpy.array(dataframe.columns).argsort()]
The dataframe I am working with is 1,000,000+ rows and 42 columns, so I can not exactly re-create the output.
But I believe I can re-create the order() outputs.
From my understanding each number represents the original position in the columns list
order(colnames(data.frame)) returns
3,2,5,6,8,4,7,10,9,11,12,13,14,15,16,17,18,19,23,20,21,22,1,25,26,28,24,27,38,29,34,33,36,30,31,32,35,41,42,39,40,37
numpy.array(dataframe.columns).argsort() returns
2,4,5,7,3,6,9,8,10,11,12,13,14,15,16,17,18,22,19,20,21,0,24,25,27,23,26,37,28,33,32,35,29,30,31,34,40,41,38,39,36,1
I know R does not have 0 index like python, so I know the first two numbers 3 and 2 are the same.
I am looking for Python code that could potentially return the same ordering as the R code.
Do you have mixed case? This is handled differently in python and R.
R:
order(c('a', 'b', 'B', 'A', 'c'))
# [1] 1 4 2 3 5
x <- c('a', 'b', 'B', 'A', 'c')
x[order(c('a', 'b', 'B', 'A', 'c'))]
# [1] "a" "A" "b" "B" "c"
Python:
np.argsort(['a', 'b', 'B', 'A', 'c'])+1
# array([4, 3, 1, 2, 5])
x = np.array(['a', 'b', 'B', 'A', 'c'])
x[np.argsort(x)]
# array(['A', 'B', 'a', 'b', 'c'], dtype='<U1')
You can mimic R's behavior using numpy.lexsort, sorting by lowercase first and then by the original array with swapped case:
x = np.array(['a', 'b', 'B', 'A', 'c'])
x[np.lexsort([np.char.swapcase(x), np.char.lower(x)])]
# array(['a', 'A', 'b', 'B', 'c'], dtype='<U1')
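To connect this back to reordering dataframe columns, here is a minimal sketch. The helper name `order_like_r` and the one-row dataframe are hypothetical, and the approach only imitates R's ordering for plain ASCII letters; R's actual result depends on the locale in use.

```python
import numpy as np
import pandas as pd

def order_like_r(names):
    """Return 0-based positions that sort names ignoring case, lowercase first on ties."""
    arr = np.asarray(names, dtype=str)
    # np.lexsort treats the LAST key as the primary sort key.
    return np.lexsort([np.char.swapcase(arr), np.char.lower(arr)])

# Hypothetical dataframe with mixed-case column names.
df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['a', 'b', 'B', 'A', 'c'])

pos = order_like_r(df.columns)
reordered = df.iloc[:, pos]  # select columns by position
print(list(reordered.columns))
```

Note that `df.iloc[:, pos]` selects columns, whereas `df.iloc[pos]` would select rows. Adding 1 to `pos` gives positions comparable to R's 1-based output.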
For numeric data, np.argsort behaves like R's order, apart from returning 0-based indices.
Just experiment
> x=c(1,2,3,10,20,30,5,15,25,35)
> x
[1] 1 2 3 10 20 30 5 15 25 35
> order(x)
[1] 1 2 3 7 4 8 5 9 6 10
>>> x=np.array([1,2,3,10,20,30,5,15,25,35])
>>> x
array([ 1, 2, 3, 10, 20, 30, 5, 15, 25, 35])
>>> x.argsort()+1
array([ 1, 2, 3, 7, 4, 8, 5, 9, 6, 10])
The +1 is only there to make the indices start at 1, since argsort returns 0-based indices.
So maybe the problem comes from your columns (shot in the dark: you have 2d-arrays, and are passing lines to R and columns to python, or something like that).
But np.argsort is R's order.

Pandas df return indices of duplicated elements as a list

I'd like to have the indexes of duplicated column elements as a list. So far, the way I found is
test = ['a', 'a', 'b', 'c', 'b']
testdf = pd.DataFrame(test, columns=['test'])
np.asarray(np.where(list(testdf['test'].duplicated()))).tolist()[0]
# [1, 4]
Which seems ridiculously convoluted.
Any better way?
you can use .duplicated() with .tolist()
testdf.index[testdf.test.duplicated()].tolist()
Try this just indexing the index:
testdf.index[testdf['test'].duplicated()]
add to_list:
testdf.index[testdf['test'].duplicated()].to_list()
Output:
[1, 4]
%%time
test = ['a', 'a', 'b', 'c', 'b']
testdf = pd.DataFrame(test, columns=['test'])
testdf[testdf.test.duplicated()].index.to_list()
# Wall time: 2 ms
# [1, 4]
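By default, duplicated() flags only the repeats and leaves the first occurrence unflagged, which is why index 0 and index 2 are missing from the result. A small sketch of the difference, assuming you might also want every row that takes part in a duplicate (the variable names are illustrative):

```python
import pandas as pd

testdf = pd.DataFrame(['a', 'a', 'b', 'c', 'b'], columns=['test'])

# Default keep='first': the first occurrence of each value is not flagged.
repeats = testdf.index[testdf['test'].duplicated()].tolist()

# keep=False: every row whose value occurs more than once is flagged.
all_dups = testdf.index[testdf['test'].duplicated(keep=False)].tolist()

print(repeats, all_dups)
```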

Process a list of lists, finding all lists that have matching last values?

Given a list of lists
lol = [[0,a], [0,b],
[1,b], [1,c],
[2,d], [2,e],
[2,g], [2,b],
[3,e], [3,f]]
I would like to extract all sublists that have the same last element (lol[n][1]) and end up with something like below:
[0,b]
[1,b]
[2,b]
[2,e]
[3,e]
I know that given two lists we can use an intersection, what is the right way to go about a problem like this, other than incrementing the index value in a for each loop?
1. Using collections.defaultdict
You can use defaultdict to first group the items by their last element, then iterate over d.items() and keep only the groups with more than one occurrence.
from collections import defaultdict
lol = [[0,'a'], [0,'b'],
[1,'b'], [1,'c'],
[2,'d'], [2,'e'],
[2,'g'], [2,'b'],
[3,'e'], [3,'f']]
d = defaultdict(list)
for v, k in lol:
    d[k].append(v)
# d looks like -
# defaultdict(list,
# {'a': [0],
# 'b': [0, 1, 2],
# 'c': [1],
# 'd': [2],
# 'e': [2, 3],
# 'g': [2],
# 'f': [3]})
result = [[v,k] for k,vs in d.items() for v in vs if len(vs)>1]
print(result)
[[0, 'b'], [1, 'b'], [2, 'b'], [2, 'e'], [3, 'e']]
2. Using pandas.duplicated
Here is how you can do this with Pandas -
Convert to pandas dataframe
For key column, find the duplicates and keep all of them
Convert to list of records while ignoring index
import pandas as pd
df = pd.DataFrame(lol, columns=['val','key'])
dups = df[df['key'].duplicated(keep=False)]
result = list(dups.to_records(index=False))
print(result)
[(0, 'b'), (1, 'b'), (2, 'e'), (2, 'b'), (3, 'e')]
3. Using numpy.unique
You can solve this in a vectorized manner using numpy -
Convert to numpy matrix arr
Find unique elements u and their counts c
Filter list of unique elements that occur more than once dup
Use broadcasting to compare the second column of the array and take any over axis=0 to get a boolean which is True for duplicated rows
Filter the arr based on this boolean
import numpy as np
arr = np.array(lol)
u, c = np.unique(arr[:,1], return_counts=True)
dup = u[c > 1]
result = arr[(arr[:,1]==dup[:,None]).any(0)]
result
array([['0', 'b'],
['1', 'b'],
['2', 'e'],
['2', 'b'],
['3', 'e']], dtype='<U21')
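As a plain-Python alternative to the approaches above, a collections.Counter can count the last elements in one pass and a second pass can filter on those counts. This is a sketch rather than one of the original answers; it keeps the input order and leaves the integers as integers.

```python
from collections import Counter

lol = [[0, 'a'], [0, 'b'],
       [1, 'b'], [1, 'c'],
       [2, 'd'], [2, 'e'],
       [2, 'g'], [2, 'b'],
       [3, 'e'], [3, 'f']]

# Count how often each last element occurs.
counts = Counter(key for _, key in lol)

# Keep the sublists whose last element occurs more than once.
result = [pair for pair in lol if counts[pair[1]] > 1]
print(result)
```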

How to add a value to specific columns of a pandas dataframe?

I have to perform the same arithmetic operation on specific columns of a pandas DataFrame. I do it as
c.loc[:,'col3'] += cons
c.loc[:,'col5'] += cons
c.loc[:,'col6'] += cons
There should be a simpler approach to do all of these in one operation. I mean updating col3,col5,col6 in one command.
pd.DataFrame.loc label indexing accepts lists:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.loc[:, ['B', 'C']] += 10
print(df)
   A   B   C
0  1  12  13
1  4  15  16
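Applied to the column names from the question, the same idea might look like the sketch below. The dataframe contents and the value of `cons` are made up for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
c = pd.DataFrame({'col3': [1, 2], 'col4': [10, 20],
                  'col5': [3, 4], 'col6': [5, 6]})
cons = 100  # assumed constant

cols = ['col3', 'col5', 'col6']
c.loc[:, cols] += cons  # updates the three columns in one statement
print(c)
```

Columns not named in `cols` (here `col4`) are left unchanged.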

Efficient way to find similar items in a list in Python

I have a list of list as follows:
list_1 = [[[1,a],[2,b]], [[3,c],[4,d]], [[1,a],[5,d]], [[8,r],[10,u]]]
I am trying to find whether an element in this list is similar to another element. Right now, I'm looping over it twice, i.e. for each element, checking against the rest. My output is:
[[[1,a],[2,b]], [[1,a],[5,d]]]
Is there a way to do this more efficiently?
Thanks.
You can use itertools.combinations and any functions like this
from itertools import combinations
for item in combinations(list_1, 2):
    # item is a pair of elements; report it if they share any [number, str] pair
    if any(i in item[1] for i in item[0]):
        print(item)
Output
([[1, 'a'], [2, 'b']], [[1, 'a'], [5, 'd']])
I'm assuming that, by similar, you mean that the element has at least one matching pair within it. In this case, rather than do a nested loop, you could map each element into a dict of lists twice (once for each [number,str] pair within it). When you finish, each key in the dict will map to the list of elements which contain that key (i.e., are similar).
Example code:
list_1 = [[[1,'a'],[2,'b']], [[3,'c'],[4,'d']], [[1,'a'],[5,'d']], [[8,'r'],[10,'u']]]
d = {}
for elt in list_1:
    # key for the first [number, str] pair of this element
    s0 = '%d%s' % (elt[0][0], elt[0][1])
    if s0 in d:
        d[s0].append(elt)
    else:
        d[s0] = [elt]
    # key for the second [number, str] pair of this element
    s1 = '%d%s' % (elt[1][0], elt[1][1])
    if s1 in d:
        d[s1].append(elt)
    else:
        d[s1] = [elt]
for key in d.keys():
    print(key, ':', d[key])
Example output (Python 3.7+, where dicts keep insertion order):
1a : [[[1, 'a'], [2, 'b']], [[1, 'a'], [5, 'd']]]
2b : [[[1, 'a'], [2, 'b']]]
3c : [[[3, 'c'], [4, 'd']]]
4d : [[[3, 'c'], [4, 'd']]]
5d : [[[1, 'a'], [5, 'd']]]
8r : [[[8, 'r'], [10, 'u']]]
10u : [[[8, 'r'], [10, 'u']]]
Any of the dict entries with length > 1 have similar elements. This will reduce the runtime complexity of your code to O(n), assuming you have a way to obtain a string representation of a, b, c, etc.
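If building a string key is inconvenient, one possible variation is to use tuples as dict keys, since tuples of hashable values are themselves hashable. The sketch below assumes each inner pair contains only hashable values; `groups` and `similar` are illustrative names.

```python
from collections import defaultdict

list_1 = [[[1, 'a'], [2, 'b']], [[3, 'c'], [4, 'd']],
          [[1, 'a'], [5, 'd']], [[8, 'r'], [10, 'u']]]

groups = defaultdict(list)
for elt in list_1:
    # The set avoids adding elt twice when both of its pairs are identical.
    for pair in {tuple(p) for p in elt}:
        groups[pair].append(elt)

# Keep only the pairs shared by more than one element.
similar = {k: v for k, v in groups.items() if len(v) > 1}
print(similar)
```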
