Suppose I have the pd.Series
import pandas as pd
import numpy as np
s = pd.Series(np.arange(10), list('abcdefghij'))
I'd like to "shuffle" this series like a deck of cards by interweaving the top half with the bottom half.
I'd expect results like this:
a 0
f 5
b 1
g 6
c 2
h 7
d 3
i 8
e 4
j 9
dtype: int32
Conclusions
Final function
def perfect_shuffle(s):
    n = s.values.shape[0]              # length of s
    l = (n + 1) // 2 * 2               # round n up to the nearest even number
    # build the interleaving permutation, then keep only the first n positions
    a = np.arange(l).reshape(2, -1).T.ravel()[:n]
    # construct the new series, slicing both values and index
    return pd.Series(s.values[a], s.index.values[a])
Demonstration
s = pd.Series(np.arange(11), list('abcdefghijk'))
print(perfect_shuffle(s))
a 0
g 6
b 1
h 7
c 2
i 8
d 3
j 9
e 4
k 10
f 5
dtype: int64
order='F' vs T
I had suggested using T.ravel() as opposed to ravel(order='F').
After investigation, it hardly matters, but ravel(order='F') is better for larger arrays.
d = pd.DataFrame(dict(T=[], R=[]))
for n in np.power(10, np.arange(1, 8)):
    a = np.arange(n).reshape(2, -1)
    stamp = pd.Timestamp.now()
    for _ in range(100):
        a.ravel(order='F')
    d.loc[n, 'R'] = (pd.Timestamp.now() - stamp).total_seconds()
    stamp = pd.Timestamp.now()
    for _ in range(100):
        a.T.ravel()
    d.loc[n, 'T'] = (pd.Timestamp.now() - stamp).total_seconds()
d
d.plot()
Thanks unutbu and Warren Weckesser
In the special case where the length of the Series is even, you can do a perfect shuffle by reshaping its values into two rows and then using ravel(order='F') to read the items off in Fortran order:
In [12]: pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)
Out[12]:
a 0
b 5
c 1
d 6
e 2
f 7
g 3
h 8
i 4
j 9
dtype: int64
Fortran order makes the left-most axis increment fastest. So in a 2D array the
values are read off by going down the rows of one column before progressing to
the next column. This has the effect of interleaving the values, compared to the
usual C-order.
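A minimal illustration of the two read orders (not part of the original answer):
import numpy as np

a = np.arange(10).reshape(2, -1)
print(a.ravel())            # C order, row by row:       [0 1 2 3 4 5 6 7 8 9]
print(a.ravel(order='F'))   # Fortran order, by columns: [0 5 1 6 2 7 3 8 4 9]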
In the general case where the length of the Series could be odd,
perhaps the fastest way is to reassign the values using shifted slices:
import numpy as np
import pandas as pd
def perfect_shuffle(ser):
    arr = ser.values
    result = np.empty_like(arr)
    N = (len(arr) + 1) // 2       # size of the top half (rounds up for odd lengths)
    result[::2] = arr[:N]         # top half goes to the even positions
    result[1::2] = arr[N:]        # bottom half goes to the odd positions
    return pd.Series(result, index=ser.index)
s = pd.Series(np.arange(11), list('abcdefghijk'))
print(perfect_shuffle(s))
yields
a 0
b 6
c 1
d 7
e 2
f 8
g 3
h 9
i 4
j 10
k 5
dtype: int64
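As a quick sanity check (not in the original answer), the same function reproduces the even-length result shown earlier:
s_even = pd.Series(np.arange(10), list('abcdefghij'))
print(perfect_shuffle(s_even))
# a    0
# b    5
# c    1
# d    6
# e    2
# f    7
# g    3
# h    8
# i    4
# j    9
# dtype: int64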
To add to @unutbu's answer, some benchmarks:
>>> import timeit
>>> import numpy as np
>>>
>>> setup = '''
... import pandas as pd
... import numpy as np
... s = pd.Series(list('abcdefghij'), np.arange(10))
... '''
>>>
>>> funcs = ['s[np.random.permutation(s.index)]',
...          "pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)",
...          's.iloc[np.random.permutation(s.index)]',
...          "s.values.reshape(-1, 2, order='F').ravel()"]
>>>
>>> for f in funcs:
... print(f)
... print(min(timeit.Timer(f, setup).repeat(3, 50)))
...
s[np.random.permutation(s.index)]
0.029795593000017107
pd.Series(s.values.reshape(2,-1).ravel(order='F'), s.index)
0.0035402200010139495
s.iloc[np.random.permutation(s.index)]
0.010904800990829244
s.values.reshape(-1, 2, order='F').ravel()
0.00019640100072138011
The final f in funcs is more than 99% faster than the first np.random.permutation approach, so it's probably your best bet. Note, though, that it returns a bare ndarray rather than a Series; if you need the index preserved, use the second expression, which is still several times faster than the permutation approaches.
Related
Consider I have a dataframe:
data = [[11, 10, 13], [16, 15, 45], [35, 14,9]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C'])
df
The data looks like:
A B C
0 11 10 13
1 16 15 45
2 35 14 9
The real data consists of a hundred columns and a thousand rows.
I have a function whose aim is to count how many values in one column are higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows
Example output from the function:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3, because the minimum value of df['B'] is 10 and three values in df['A'] are higher than 10.
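For reference (not part of the question), the same count can be written without apply as a vectorized one-liner:
# count values in df['A'] strictly greater than the minimum of df['B']
print((df['A'] > df['B'].min()).sum())   # 3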
The problem is that I want to compute this function pairwise over all columns.
I don't know an effective and efficient way to solve this. I want the output in a form similar to a confusion matrix or a correlation matrix.
Example output:
A B C
A X 3 X
B X X X
C X X X
This is O(n²m), where n is the number of columns and m is the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})
Result:
>>> m
A B C
A 2 3 3
B 2 2 3
C 2 2 2
In theory O(n log(n) m) is possible.
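Not from the original answer, but to sketch how such a bound could be approached: sort each column once up front, after which each pairwise count is a binary search against the relevant minimum (np.searchsorted), dropping the per-pair cost from O(m) to O(log m):
import numpy as np
import pandas as pd

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]], columns=list('ABC'))
minima = df.min()
sorted_cols = {c: np.sort(df[c].values) for c in df.columns}   # sort each column once
# m.loc[d, c] = number of values in column d strictly greater than the minimum of column c
m = pd.DataFrame({c: {d: len(df) - np.searchsorted(sorted_cols[d], minima[c], side='right')
                      for d in df.columns}
                  for c in df.columns})
print(m)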
from itertools import product

pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for each_pair in pairs:
    # making sure that we calculate each column's min only once
    min_ = min_value.get(each_pair[1], df[each_pair[1]].min())
    min_value[each_pair[1]] = min_
    count = df[df[each_pair[0]] > min_][each_pair[0]].count()
    output.append(count)
df_desired = pd.DataFrame(
    [output[i: i+len(df.columns)] for i in range(0, len(output), len(df.columns))],
    columns=df.columns, index=df.columns)
print(df_desired)
A B C
A 2 3 3
B 2 2 3
C 2 2 2
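For comparison (not from either answer), the same matrix can also be produced with a single NumPy broadcast; this is still O(n²m) work, but without a Python-level loop over pairs:
import numpy as np
import pandas as pd

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]], columns=list('ABC'))
vals = df.values                # shape (m, n)
mins = df.min().values          # shape (n,)
# counts[i, j] = number of values in column i greater than the minimum of column j
counts = (vals[:, :, None] > mins[None, None, :]).sum(axis=0)
print(pd.DataFrame(counts, index=df.columns, columns=df.columns))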
I have a trivial problem that I have solved using loops, but I am trying to see whether I can vectorize some of it to improve performance.
Essentially I have 2 dataframes (DF_A and DF_B), where each row in DF_B is the sum of the corresponding row in DF_A and the row above it in DF_B. I do have the first row of values in DF_B.
df_a = [
    [1,2,3,4],
    [5,6,7,8],
    [..... more rows]
]
df_b = [
    [1,2,3,4],
    [rows of all 0 values here, so dimensions match df_a]
]
What I am trying to achieve is that, for example, the 2nd row in df_b will be the values of the first row in df_b + the values of the second row in df_a. So in this case:
df_b.loc[1] = [6, 8, 10, 12]
I was able to accomplish this by looping over the range of df_a, saving off the previous row's values, and adding the row at the current index to them. That doesn't seem very efficient.
Here is a numpy solution. This should be significantly faster than a pandas loop, especially since it uses JIT compilation via numba.
from numba import jit

a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    # each row is the previous (already filled) row plus the matching row of a
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))
# 0 1 2 3
# 0 1 2 3 4
# 1 6 8 10 12
# 2 15 18 21 24
# 3 28 32 36 40
# 4 45 50 55 60
Performance benchmarking
import pandas as pd, numpy as np
from numba import jit

df_a = pd.DataFrame(np.arange(1, 1000001).reshape(1000, 1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):
    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)   # the given first row of df_b
    return pd.DataFrame(fill_b(a, b))
%timeit df_a.cumsum() # 16.1 ms
%timeit jp(df_a) # 6.05 ms
You can just create df_b using the cumulative sum over df_a, like so:
df_a = pd.DataFrame(np.arange(1,17).reshape(4,4))
df_b = df_a.cumsum()
0 1 2 3
0 1 2 3 4
1 6 8 10 12
2 15 18 21 24
3 28 32 36 40
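One caveat worth noting (not stated in the answer): the plain cumsum matches only because the given first row of df_b equals the first row of df_a. If it differs, the recurrence b[i] = b[i-1] + a[i] unrolls to b[i] = b[0] + cumsum(a)[i] - a[0], so adding a constant offset still avoids the loop. A sketch with a hypothetical first row b0:
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
b0 = pd.Series([10, 20, 30, 40])            # hypothetical given first row of df_b
# b[i] = b[i-1] + a[i]  =>  b[i] = b0 + cumsum(a)[i] - a[0]
df_b = df_a.cumsum() + (b0 - df_a.iloc[0])
print(df_b)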
I want to transform a matrix into a 3D array, or transform a 3D array into a matrix. How do I input the data, and how do I do the transformation in Python?
I've searched in many places, but there is no answer. Please help me.
matrix a:
a b c
d 1 2 3
e 2 3 4
f 4 3 2
array b:
a d 1
a e 2
a f 4
b d 2
b e 3
b f 3
c d 3
c e 4
c f 2
Can I use stack() to achieve my goal?
Like: Python pandas - pd.melt a dataframe with datetime index results in NaN
So your data is not actually 3-dimensional, but 2-dimensional. You are essentially trying to unpivot your 2D data, which is often called a melt. Your best option is to load the data into a pandas DataFrame.
import pandas as pd
df = pd.DataFrame([['d',1,2,3],['e',2,3,4],['f',4,3,2]], columns=['idx','a','b','c'])
df
# returns:
idx a b c
0 d 1 2 3
1 e 2 3 4
2 f 4 3 2
pd.melt(df, id_vars='idx', value_vars=list('abc'))
# returns:
idx variable value
0 d a 1
1 e a 2
2 f a 4
3 d b 2
4 e b 3
5 f b 3
6 d c 3
7 e c 4
8 f c 2
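To answer the stack() question directly (not covered above): yes, it can. Transposing first makes the stacked output come out grouped by column, matching array b; a sketch:
# set the row labels as the index and transpose, so stacking yields (column, row) pairs
out = df.set_index('idx').T.stack()
print(out)
# a  d    1
#    e    2
#    f    4
# b  d    2
#    e    3
#    f    3
# c  d    3
#    e    4
#    f    2
# dtype: int64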
I'm not very familiar with the pandas library, but here is a rough solution using the Python standard library:
#!/usr/bin/env python3
"""
Convert a matrix to 2D arrays and vice versa
http://stackoverflow.com/questions/43289673
"""
from collections import OrderedDict

TEST_MATRIX = """\
  a b c
d 1 2 3
e 2 3 4
f 4 3 2
"""

def parse_matrix(matrix_string):
    """Parse a matrix string and return a list of tuples representing the data"""
    matrix_string = matrix_string.strip()
    list_of_lines = matrix_string.splitlines()
    parsed_list = []
    y_headers = list_of_lines[0].split()
    data_rows = [i.split() for i in list_of_lines[1:]]
    for y in y_headers:
        for row in data_rows:
            parsed_list.append((y, row[0], row[y_headers.index(y) + 1]))
    return parsed_list

def convert_to_matrix(data):
    """
    Convert a parsed matrix (in the form of a list of tuples) to a matrix
    (string)
    """
    # plain sets would mess up the ordering, so use OrderedDicts as ordered sets
    y_headers = OrderedDict()
    x_headers = OrderedDict()
    for i in data:
        y_headers.setdefault(i[0])
        x_headers.setdefault(i[1])
    matrix_string = " " + " ".join(y_headers)  # header
    for x in x_headers:
        row = [x]
        for y in y_headers:
            val = [i[-1] for i in data if i[0] == y and i[1] == x][0]
            row.append(val)
        row_string = " ".join(row)
        matrix_string += "\n" + row_string
    return matrix_string

def main():
    print("Test matrix:")
    print(TEST_MATRIX)
    # parse the test matrix string to a list of tuples
    parsed_test_matrix = parse_matrix(TEST_MATRIX)
    # print the parsed matrix
    print("Parsed matrix:")
    for row in parsed_test_matrix:
        print(" ".join(row))
    print()
    # convert the parsed matrix back to the original matrix and print it
    print("Convert parsed matrix back to matrix:")
    print(convert_to_matrix(parsed_test_matrix))

if __name__ == "__main__":
    main()
For example, I would like to get the letters indicating the rows where a period of at least two consecutive drops in the other column begins.
Exemplary data:
a b
0 3 a
1 2 b
2 3 c
3 2 d
4 1 e
5 0 f
6 -1 g
7 3 h
8 1 i
9 0 j
An example solution with a simple loop:
import pandas as pd
df = pd.DataFrame({'a': [3,2,3,2,1,0,-1,3,1,0], 'b': list('abcdefghij')})
less = 0
l = []
prev_prev_row = df.iloc[0]
prev_row = df.iloc[1]
if prev_row['a'] < prev_prev_row['a']: less = 1
for i, row in df.iloc[2:len(df)].iterrows():
    if row['a'] < prev_row['a']:
        less = less + 1
    else:
        less = 0
    if less == 2:
        l.append(prev_prev_row['b'])
    prev_prev_row = prev_row
    prev_row = row
This gives list l:
['c', 'h']
Here's one approach, with some help from NumPy and SciPy:
import numpy as np
from scipy.ndimage import binary_closing

arr = df.a.values
# True where a drop starts, padded with False on both ends
mask1 = np.hstack((False, arr[1:] < arr[:-1], False))
# suppress isolated single drops so only runs of at least two consecutive drops remain
mask2 = mask1 & (~binary_closing(~mask1, [1,1]))
# rising edges mark where each surviving run of drops begins
final_mask = mask2[1:] > mask2[:-1]
out = list(df.b[final_mask])
use rolling(2) in reverse
s = df.a[::-1].diff().gt(0).rolling(2).sum().eq(2)
df.b.loc[s & (s != s.shift(-1))]
2 c
7 h
Name: b, dtype: object
if you actually wanted a list
df.b.loc[s & (s != s.shift(-1))].tolist()
['c', 'h']
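A decomposed version of the same chain, with intermediate names added for readability (a sketch, not from the original answer):
import pandas as pd

df = pd.DataFrame({'a': [3,2,3,2,1,0,-1,3,1,0], 'b': list('abcdefghij')})
drops = df.a[::-1].diff().gt(0)    # in reverse, diff > 0 at row i means a[i] > a[i+1], i.e. a drop
s = drops.rolling(2).sum().eq(2)   # True at row i when both i->i+1 and i+1->i+2 are drops
starts = s & (s != s.shift(-1))    # keep only the row where each run of drops begins
print(df.b.loc[starts].tolist())   # ['c', 'h']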
When doing:
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
B = numpy.array([1,2,3,4,5,6])
A[7:7+len(B)] = B  # A[7:7+len(B)] in fact has length 3!
we get this typical error:
ValueError: could not broadcast input array from shape (6) into shape (3)
This is entirely expected, because A[7:7+len(B)] has length 3, not len(B) = 6, and thus cannot receive the contents of B!
How can I prevent this from happening and easily have the contents of B copied into A, starting at A[7]?
A[7:???] = B[???]
# i would like [1 2 3 4 5 6 7 1 2 3]
This could be called "auto-broadcasting", i.e. we wouldn't have to worry about the lengths of the arrays.
Edit: another example, if len(A) = 20:
A = numpy.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
B = numpy.array([1,2,3,4,5,6])
A[7:7+len(B)] = B
A # [ 1 2 3 4 5 6 7 1 2 3 4 5 6 14 15 16 17 18 19 20]
Just tell it where to stop, using len(A).
A[7:7+len(B)] = B[:len(A)-7]
Example:
import numpy
B = numpy.array([1,2,3,4,5,6])
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
A[7:7+len(B)] = B[:len(A)-7]
print(A)  # [1 2 3 4 5 6 7 1 2 3]
A = numpy.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
A[7:7+len(B)] = B[:len(A)-7]
print(A)  # [ 1  2  3  4  5  6  7  1  2  3  4  5  6 14 15 16 17 18 19 20]
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
B = numpy.array([1,2,3,4,5,6])
numpy.hstack((A[0:7],B))[0:len(A)]
On second thought, this fails in the case where B fits entirely inside A, because the tail of A is lost.
So....
import numpy
A = numpy.array([1,2,3,4,5,6,7,8,9,10])
B = numpy.array([1,2,3,4,5,6])
if 7 + len(B) > len(A):
    A = numpy.hstack((A[0:7], B))[0:len(A)]
else:
    A[7:7+len(B)] = B
but this sort of defeats the purpose of the question! I'm sure you'd prefer a one-liner!
Same question, but in 2D:
Numpy - Overlap 2 matrices at a particular position
There I try to make the case that it is better for you to take responsibility for determining which part of B should be copied:
A[7:] = B[:3]
A[7:] = B[-3:]
A[7:] = B[3:6]
np.put will do this sort of clipping for you, but you have to give it an index list, not a slice:
np.put(A, range(7, len(A)), B)
which isn't much better than A[7:] = B[:len(A)-7].
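A quick check of that behavior (a sketch; np.put uses only as many values of B as there are indices, ignoring the rest):
import numpy as np

A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,2,3,4,5,6])
np.put(A, range(7, len(A)), B)   # only the first len(A)-7 = 3 values of B are used
print(A)                         # [ 1  2  3  4  5  6  7  1  2  3]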
The docs for put tell me there are also putmask, place, and copyto functions, and that the counterpart to put is take.
An interesting thing is that while these other functions are more powerful than indexing, with modes like clip and repeat, I don't see them being used much. I think that's because it is easier to write a function that handles your special case than to remember/look up general functions with lots of options.