Creating a Pandas rolling-window series of arrays - python

Suppose I have the following code:
import numpy as np
import pandas as pd
x = np.array([1.0, 1.1, 1.2, 1.3, 1.4])
s = pd.Series(x, index=[1, 2, 3, 4, 5])
This produces the following s:
1 1.0
2 1.1
3 1.2
4 1.3
5 1.4
Now I want to create a rolling window of size n, but I don't want the mean or standard deviation of each window; I just want the arrays themselves. Suppose n = 3. I want a transformation that, given the input s, outputs the following series:
1 array([1.0, nan, nan])
2 array([1.1, 1.0, nan])
3 array([1.2, 1.1, 1.0])
4 array([1.3, 1.2, 1.1])
5 array([1.4, 1.3, 1.2])
How do I do this?

Here's one way to do it:
In [294]: arr = [s.shift(x).values[::-1][:3] for x in range(len(s))[::-1]]
In [295]: arr
Out[295]:
[array([ 1., nan, nan]),
array([ 1.1, 1. , nan]),
array([ 1.2, 1.1, 1. ]),
array([ 1.3, 1.2, 1.1]),
array([ 1.4, 1.3, 1.2])]
In [296]: pd.Series(arr, index=s.index)
Out[296]:
1 [1.0, nan, nan]
2 [1.1, 1.0, nan]
3 [1.2, 1.1, 1.0]
4 [1.3, 1.2, 1.1]
5 [1.4, 1.3, 1.2]
dtype: object

Here's a vectorized approach using NumPy broadcasting -
n = 3 # window length
idx = np.arange(n)[::-1] + np.arange(len(s))[:,None] - n + 1
out = s.values[idx]
out[idx<0] = np.nan
This gets you the output as a 2D array.
To get a series with each element holding each window as a list -
In [40]: pd.Series(out.tolist())
Out[40]:
0 [1.0, nan, nan]
1 [1.1, 1.0, nan]
2 [1.2, 1.1, 1.0]
3 [1.3, 1.2, 1.1]
4 [1.4, 1.3, 1.2]
dtype: object
If you wish to have a list with one array per window, you can use np.split on the output (each piece keeps the 2D shape (1, n), so call .ravel() on it if you need a 1D array), like so -
out_split = np.split(out,out.shape[0],axis=0)
Sample run -
In [100]: s
Out[100]:
1 1.0
2 1.1
3 1.2
4 1.3
5 1.4
dtype: float64
In [101]: n = 3
In [102]: idx = np.arange(n)[::-1] + np.arange(len(s))[:,None] - n + 1
...: out = s.values[idx]
...: out[idx<0] = np.nan
...:
In [103]: out
Out[103]:
array([[ 1. , nan, nan],
[ 1.1, 1. , nan],
[ 1.2, 1.1, 1. ],
[ 1.3, 1.2, 1.1],
[ 1.4, 1.3, 1.2]])
In [104]: np.split(out,out.shape[0],axis=0)
Out[104]:
[array([[ 1., nan, nan]]),
array([[ 1.1, 1. , nan]]),
array([[ 1.2, 1.1, 1. ]]),
array([[ 1.3, 1.2, 1.1]]),
array([[ 1.4, 1.3, 1.2]])]
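The broadcasting steps above can be wrapped in a small helper that returns a Series of 1D arrays and keeps the original index, which is what the question asks for. This is a minimal sketch; the function name rolling_arrays is made up for illustration, and it assumes the values can be represented as floats so that NaN is a valid filler.

```python
import numpy as np
import pandas as pd

def rolling_arrays(s, n):
    """Series whose i-th element holds the last n values up to row i, newest first."""
    values = s.to_numpy(dtype=float)
    # idx[i, j] = i - j: position of the j-th most recent value for row i.
    idx = np.arange(len(s))[:, None] - np.arange(n)
    # Clip so indexing never wraps around; fancy indexing returns a fresh copy.
    out = values[np.clip(idx, 0, None)]
    out[idx < 0] = np.nan  # positions before the start of the series
    # list(out) yields one 1D array per row.
    return pd.Series(list(out), index=s.index)

s = pd.Series([1.0, 1.1, 1.2, 1.3, 1.4], index=[1, 2, 3, 4, 5])
windows = rolling_arrays(s, 3)
```

Here windows.loc[4] would be the 1D array holding 1.3, 1.2, 1.1, and the first rows are padded with NaN on the right.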
Memory-efficiency with strides
For memory efficiency, we can use a strided helper, strided_axis0, similar to @B. M.'s solution but a bit more generic.
So, to get a 2D array of values with NaNs preceding the first element -
In [35]: strided_axis0(s.values, fillval=np.nan, L=3)
Out[35]:
array([[nan, nan, 1. ],
[nan, 1. , 1.1],
[1. , 1.1, 1.2],
[1.1, 1.2, 1.3],
[1.2, 1.3, 1.4]])
To get 2D array of values with NaNs as fillers coming after the original elements in each row and the order of elements being flipped, as stated in the problem -
In [36]: strided_axis0(s.values, fillval=np.nan, L=3)[:,::-1]
Out[36]:
array([[1. , nan, nan],
[1.1, 1. , nan],
[1.2, 1.1, 1. ],
[1.3, 1.2, 1.1],
[1.4, 1.3, 1.2]])
To get a series with each element holding a window as a list, wrap any of the 2D array outputs above as pd.Series(out.tolist()), with out being the 2D array -
In [38]: pd.Series(strided_axis0(s.values, fillval=np.nan, L=3)[:,::-1].tolist())
Out[38]:
0 [1.0, nan, nan]
1 [1.1, 1.0, nan]
2 [1.2, 1.1, 1.0]
3 [1.3, 1.2, 1.1]
4 [1.4, 1.3, 1.2]
dtype: object
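The strided_axis0 helper is used above without its definition being shown. A minimal sketch of what such a helper could look like, assuming a 1D input that is left-padded with L-1 fill values; this is a reconstruction for illustration and not necessarily the answer author's exact code.

```python
import numpy as np

def strided_axis0(a, fillval, L):
    """Sliding windows of length L over the 1D array a, left-padded with fillval.

    Hypothetical reconstruction: the helper body is not given in the answer.
    """
    a = np.asarray(a)
    # Prepend L-1 fill values so that the first window ends at a[0].
    a_ext = np.concatenate((np.full(L - 1, fillval), a))
    n = a_ext.strides[0]  # byte step between consecutive elements
    # Row i starts one element after row i-1; the result is a view, not a copy.
    return np.lib.stride_tricks.as_strided(
        a_ext, shape=(a.shape[0], L), strides=(n, n), writeable=False
    )

windows = strided_axis0(np.array([1.0, 1.1, 1.2, 1.3, 1.4]), fillval=np.nan, L=3)
```

The view is marked read-only here because overlapping windows share memory, so writing to one row would silently change others.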

Your data looks like a strided array:
# strides are in bytes: (8, -8) assumes float64 values
data=np.lib.stride_tricks.as_strided(np.concatenate(([np.nan]*2,s))[2:],(5,3),(8,-8))
"""
array([[ 1. , nan, nan],
[ 1.1, 1. , nan],
[ 1.2, 1.1, 1. ],
[ 1.3, 1.2, 1.1],
[ 1.4, 1.3, 1.2]])
"""
Then transform it into a Series:
pd.Series(list(map(list,data)))
"""
0 [1.0, nan, nan]
1 [1.1, 1.0, nan]
2 [1.2, 1.1, 1.0]
3 [1.3, 1.2, 1.1]
4 [1.4, 1.3, 1.2]
dtype: object
"""

If you attach the missing NaNs at the beginning and the end of the series, you can use a simple sliding window:
def wndw(s, size=3):
    # Pad with NaNs on both sides so windows near the edges keep the full length.
    stretched = np.hstack([
        np.array([np.nan]*(size-1)),
        s.values.T,
        np.array([np.nan]*size)
    ])
    for begin in range(len(stretched)-size):
        end = begin+size
        # Reverse each window so the most recent value comes first.
        yield stretched[begin:end][::-1]
for arr in wndw(s, 3):
    print(arr)
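On NumPy 1.20 or later, the hand-written strides and the manual window loop can likely be replaced by numpy.lib.stride_tricks.sliding_window_view, which works out the strides from the array's dtype. A minimal sketch, assuming the same s and n = 3 as in the question and left padding only:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

s = pd.Series([1.0, 1.1, 1.2, 1.3, 1.4], index=[1, 2, 3, 4, 5])
n = 3  # window length

# Left-pad with n-1 NaNs so that every original element ends a full window.
padded = np.concatenate((np.full(n - 1, np.nan), s.to_numpy()))
# Rows are windows in original order; reverse the columns for newest-first.
data = sliding_window_view(padded, n)[:, ::-1]
result = pd.Series(data.tolist(), index=s.index)
```

Unlike the fixed (8, -8) strides, this does not depend on the values being 8-byte floats.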

Related

Storing multiple arrays in a np.zeros or np.ones

I'm trying to initialize a dummy array of length n using np.zeros(n) with dtype=object. I want to use this dummy array to store n copies of another array of length m.
I'm trying to avoid for loop to set values at each index.
I tried using the below code but keep getting error -
temp = np.zeros(10, dtype=object)
arr = np.array([1.1,1.2,1.3,1.4,1.5])
res = temp * arr
The desired result should be -
np.array([[1.1,1.2,1.3,1.4,1.5], [1.1,1.2,1.3,1.4,1.5], ... 10 copies])
I keep getting the error -
operands could not be broadcast together with shapes (10,) (5,)
I understand that this error arises since the compiler thinks I'm trying to multiply those arrays.
So how do I achieve the task?
np.tile() is a built-in NumPy function that repeats a given array according to reps. It looks like this is exactly what you need. To stack 10 copies as rows of a 2D array, pass a tuple for reps, i.e.:
res = np.tile(arr, (10, 1))
>>> arr = np.array([1.1,1.2,1.3,1.4,1.5])
>>> arr
array([1.1, 1.2, 1.3, 1.4, 1.5])
>>> np.array([arr]*10)
array([[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5]])
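For comparison, a short sketch of two NumPy ways to build the 10 stacked copies without a Python-level list. np.tile makes a real copy, while np.broadcast_to returns a read-only view that shares memory with arr; which one fits depends on whether the result needs to be modified afterwards.

```python
import numpy as np

arr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])

# reps=(10, 1): repeat 10 times along a new first axis, once along the existing one.
res = np.tile(arr, (10, 1))

# Same shape and values, but a read-only view that does not copy the data.
view = np.broadcast_to(arr, (10, arr.size))
```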

python arrays: averaging slope and intercept of datasets

I am having some difficulties achieving the following. Let's say I have two sets of data obtained from a test:
import numpy as np
a = np.array([[0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]]).T
b = np.array([[0.5, 1.5, 2.5, 3.5], [0.5, 1.5, 2.5, 3.5]]).T
where the data in column 0 represents (in my case) displacement and the data in column 1 represents the corresponding measured force values.
(Given data represents two lines with slopes of 2 and 1, both with a y-intercept of 0.)
Now I am trying to program a script that averages those two arrays despite the mismatched x-values, such that it will yield
c = np.array([[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
[0.0, 0.75, 1.5, 2.25, 3.0, 3.75, 4.5, 5.25]]).T
(A line with a slope of 1.5 and a y-intercept of 0.)
I tried my best using slicing and linear interpolation, however it seems like I cannot get my head around it (I am a beginner).
I'd be very glad for any input and tips and hope the information I gave to you is sufficient!
Thanks in advance,
Robert
You can get the coefficients (slope and intercept) of each dataset, obtain the mean, and fit that data to a new array of x values.
Step by Step:
Fit a degree-1 polynomial to each of the arrays a and b using polyfit to get the coefficients of each (slope and intercept):
coef_a = np.polyfit(a[:,0], a[:,1], deg=1)
coef_b = np.polyfit(b[:,0], b[:,1], deg=1)
>>> coef_a
array([ 2.00000000e+00, 2.22044605e-16])
>>> coef_b
array([ 1.00000000e+00, 1.33226763e-15])
Get the mean of those coefficients to use as the coefficients of c:
coef_c = np.mean(np.stack([coef_a,coef_b]), axis=0)
>>> coef_c
array([ 1.50000000e+00, 7.77156117e-16])
Create new x-values for c using np.arange
c_x = np.arange(0,4,0.5)
>>> c_x
array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5])
Use polyval to evaluate the new c coefficients at your new x values:
c_y = np.polyval(coef_c, c_x)
>>> c_y
array([ 7.77156117e-16, 7.50000000e-01, 1.50000000e+00,
2.25000000e+00, 3.00000000e+00, 3.75000000e+00,
4.50000000e+00, 5.25000000e+00])
Put your c_x and c_y values together using stack:
c = np.stack([c_x, c_y])
>>> c
array([[ 0.00000000e+00, 5.00000000e-01, 1.00000000e+00,
1.50000000e+00, 2.00000000e+00, 2.50000000e+00,
3.00000000e+00, 3.50000000e+00],
[ 7.77156117e-16, 7.50000000e-01, 1.50000000e+00,
2.25000000e+00, 3.00000000e+00, 3.75000000e+00,
4.50000000e+00, 5.25000000e+00]])
If you round that to 2 decimals, you'll see it holds the same values as your desired outcome (use c.T if you need them as columns):
>>> np.round(c, 2)
array([[ 0. , 0.5 , 1. , 1.5 , 2. , 2.5 , 3. , 3.5 ],
[ 0. , 0.75, 1.5 , 2.25, 3. , 3.75, 4.5 , 5.25]])
In a single statement:
c = np.stack([np.arange(0, 4, 0.5),
np.polyval(np.mean(np.stack([np.polyfit(a.T[0], a.T[1], 1),
np.polyfit(b.T[0], b.T[1], 1)]),
axis=0),
np.arange(0, 4, 0.5))])
>>> c
array([[ 0.00000000e+00, 5.00000000e-01, 1.00000000e+00,
1.50000000e+00, 2.00000000e+00, 2.50000000e+00,
3.00000000e+00, 3.50000000e+00],
[ 7.77156117e-16, 7.50000000e-01, 1.50000000e+00,
2.25000000e+00, 3.00000000e+00, 3.75000000e+00,
4.50000000e+00, 5.25000000e+00]])
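Since the question mentions linear interpolation, here is a different technique from the polyfit approach above: resample both datasets onto a common x grid with np.interp and average the y values point by point. This is only a sketch. It assumes the x values of each dataset are sorted in ascending order, and it restricts the grid to the range covered by both datasets because np.interp does not extrapolate.

```python
import numpy as np

a = np.array([[0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]]).T
b = np.array([[0.5, 1.5, 2.5, 3.5], [0.5, 1.5, 2.5, 3.5]]).T

# Overlapping x-range of the two datasets: 0.5 to 3.0.
lo = max(a[0, 0], b[0, 0])
hi = min(a[-1, 0], b[-1, 0])
x = np.arange(lo, hi + 0.25, 0.5)  # 0.5, 1.0, ..., 3.0

# Interpolate each force curve onto the shared grid, then average.
ya = np.interp(x, a[:, 0], a[:, 1])
yb = np.interp(x, b[:, 0], b[:, 1])
c_interp = np.column_stack([x, (ya + yb) / 2])
```

For data that is not exactly linear, this keeps the local shape of the curves instead of forcing a straight-line fit.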

How to refine a mesh in python quickly

I have a numpy array([1.0, 2.0, 3.0]), which is actually a mesh in 1 dimension in my problem. What I want to do is to refine the mesh to get this: array([0.8, 0.9, 1, 1.1, 1.2, 1.8, 1.9, 2, 2.1, 2.2, 2.8, 2.9, 3, 3.1, 3.2,]).
The actual array is very large and this procedure costs a lot of time. How to do this quickly (maybe vectorize) in python?
Here's a vectorized approach -
(a[:,None] + np.arange(-0.2,0.3,0.1)).ravel() # a is input array
Sample run -
In [15]: a = np.array([1.0, 2.0, 3.0]) # Input array
In [16]: (a[:,None] + np.arange(-0.2,0.3,0.1)).ravel()
Out[16]:
array([ 0.8, 0.9, 1. , 1.1, 1.2, 1.8, 1.9, 2. , 2.1, 2.2, 2.8,
2.9, 3. , 3.1, 3.2])
Here are a few options (Python 3):
Option 1:
np.array([j for i in arr for j in np.arange(i - 0.2, i + 0.25, 0.1)])
# array([ 0.8, 0.9, 1. , 1.1, 1.2, 1.8, 1.9, 2. , 2.1, 2.2, 2.8,
# 2.9, 3. , 3.1, 3.2])
Option 2:
np.array([j for x, y in zip(arr - 0.2, arr + 0.25) for j in np.arange(x,y,0.1)])
# array([ 0.8, 0.9, 1. , 1.1, 1.2, 1.8, 1.9, 2. , 2.1, 2.2, 2.8,
# 2.9, 3. , 3.1, 3.2])
Option 3:
np.array([arr + i for i in np.arange(-0.2, 0.25, 0.1)]).T.ravel()
# array([ 0.8, 0.9, 1. , 1.1, 1.2, 1.8, 1.9, 2. , 2.1, 2.2, 2.8,
# 2.9, 3. , 3.1, 3.2])
Timing on a larger array:
arr = np.arange(100000)
arr
# array([ 0, 1, 2, ..., 99997, 99998, 99999])
%timeit np.array([j for i in arr for j in np.arange(i-0.2, i+0.25, 0.1)])
# 1 loop, best of 3: 615 ms per loop
%timeit np.array([j for x, y in zip(arr - 0.2, arr + 0.25) for j in np.arange(x,y,0.1)])
# 1 loop, best of 3: 250 ms per loop
%timeit np.array([arr + i for i in np.arange(-0.2, 0.25, 0.1)]).T.ravel()
# 100 loops, best of 3: 1.93 ms per loop
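np.arange with a fractional step can occasionally include or drop the endpoint because of floating-point rounding, which is why the options above use stop values such as 0.25 or 0.3. A hedged variant of the broadcasting approach that states the number of offsets explicitly with np.linspace:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# Five offsets from -0.2 to 0.2 inclusive; the count is explicit.
offsets = np.linspace(-0.2, 0.2, 5)
# Broadcast each mesh point against all offsets, then flatten row by row.
refined = (a[:, None] + offsets).ravel()
```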

how to merge the values of a list of lists and a list into 1 resulting list of lists

I have a list of lists (a) and a list (b) which have the same "length" (in this case "4"):
a = [
[1.0, 2.0],
[1.1, 2.1],
[1.2, 2.2],
[1.3, 2.3]
]
b = [3.0, 3.1, 3.2, 3.3]
I would like to merge the values to obtain the following (c):
c = [
[1.0, 2.0, 3.0],
[1.1, 2.1, 3.1],
[1.2, 2.2, 3.2],
[1.3, 2.3, 3.3]
]
currently I'm doing the following to achieve it:
c = []
for index, elem in enumerate(a):
    x = [a[index], [b[index]]] # x assigned here for better readability
    c.append(sum(x, []))
my feeling is that there is an elegant way to do this...
note: the lists are a lot larger, for simplicity I shortened them. they are always(!) of the same length.
In Python 3.5+, use zip() within a list comprehension with in-place unpacking:
In [7]: [[*j, i] for i, j in zip(b, a)]
Out[7]: [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [1.2, 2.2, 3.2], [1.3, 2.3, 3.3]]
In Python 2 (this form also works in Python 3):
In [8]: [j+[i] for i, j in zip(b, a)]
Out[8]: [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [1.2, 2.2, 3.2], [1.3, 2.3, 3.3]]
Or use numpy.column_stack in numpy:
In [16]: import numpy as np
In [17]: np.column_stack((a, b))
Out[17]:
array([[ 1. , 2. , 3. ],
[ 1.1, 2.1, 3.1],
[ 1.2, 2.2, 3.2],
[ 1.3, 2.3, 3.3]])
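np.column_stack returns a NumPy array, while the question asks for a list of lists. If plain lists are needed, calling .tolist() on the result should be enough; a small sketch with the data from the question:

```python
import numpy as np

a = [[1.0, 2.0], [1.1, 2.1], [1.2, 2.2], [1.3, 2.3]]
b = [3.0, 3.1, 3.2, 3.3]

# Append b as an extra column, then convert the array back to nested lists.
c = np.column_stack((a, b)).tolist()
```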

Convert pandas DataFrame into list of lists [duplicate]

I have a pandas data frame like this:
admit gpa gre rank
0 3.61 380 3
1 3.67 660 3
1 3.19 640 4
0 2.93 520 4
Now I want to get a list of rows in pandas like:
[[0,3.61,380,3], [1,3.67,660,3], [1,3.19,640,4], [0,2.93,520,4]]
How can I do it?
There is a built-in way, which is likely also the fastest: call tolist on the .values NumPy array:
df.values.tolist()
[[0.0, 3.61, 380.0, 3.0],
[1.0, 3.67, 660.0, 3.0],
[1.0, 3.19, 640.0, 4.0],
[0.0, 2.93, 520.0, 4.0]]
You can also do it like this (in Python 3, map returns an iterator, so wrap it in list):
list(map(list, df.values))
EDIT: as_matrix is deprecated since version 0.23.0
You can use the built-in values attribute or the to_numpy method (the recommended option) on the DataFrame:
In [8]:
df.to_numpy()
Out[8]:
array([[ 0.9, 7. , 5.2, ..., 13.3, 13.5, 8.9],
[ 0.9, 7. , 5.2, ..., 13.3, 13.5, 8.9],
[ 0.8, 6.1, 5.4, ..., 15.9, 14.4, 8.6],
...,
[ 0.2, 1.3, 2.3, ..., 16.1, 16.1, 10.8],
[ 0.2, 1.3, 2.4, ..., 16.5, 15.9, 11.4],
[ 0.2, 1.3, 2.4, ..., 16.5, 15.9, 11.4]])
If you explicitly want lists and not a numpy array add .tolist():
df.to_numpy().tolist()
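Note that the output shown earlier contains 0.0 and 380.0 rather than 0 and 380: a DataFrame with mixed integer and float columns is converted to a single float array. A minimal sketch of this effect, with one possible workaround via object dtype; the DataFrame is rebuilt from the values in the question, and the exact element types returned in the object case may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "admit": [0, 1, 1, 0],
    "gpa": [3.61, 3.67, 3.19, 2.93],
    "gre": [380, 660, 640, 520],
    "rank": [3, 3, 4, 4],
})

# Mixed int/float columns are upcast to float64, so every entry becomes a float.
as_floats = df.to_numpy().tolist()

# Converting to object dtype first avoids the upcast to a common float type.
per_column = df.astype(object).to_numpy().tolist()
```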
