How to keep the first window constant in sliding window? - python

I am using the following code to apply a sliding window on time-series data. I want to set up my first window as fixed and then apply the sliding window as shown below in the desired output.
import pandas as pd
df = pd.DataFrame({'B': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
def sliding_window(data, size):
    return [data[x:x + size] for x in range(len(data) - size + 1)]
sliding_window(df, 7)
Output:
[ B
0 0
1 1
2 2
3 3
4 4
5 5
6 6,
B
1 1
2 2
3 3
4 4
5 5
6 6
7 7,
B
2 2
3 3
4 4
5 5
6 6
7 7
8 8,
B
3 3
4 4
5 5
6 6
7 7
8 8
9 9,
B
4 4
5 5
6 6
7 7
8 8
9 9
10 10]
Desired output
Example:
Here I am using a fixed window of size 5. It should always be the first window. After that, the window keeps its start at the first row and grows by one row at each step, like the left figure in the images.
[ B
0 0
1 1
2 2
3 3
4 4,
B
0 0
1 1
2 2
3 3
4 4
5 5,
B
0 0
1 1
2 2
3 3
4 4
5 5
6 6,
B
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7,
B
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8,
B
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9,
B
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10]

Try this:
def rolling_window_maybe(data, initial_size: int):
    # "+ 1" so that the last window covers the whole data set
    return [data[:initial_size + x] for x in range(len(data) - initial_size + 1)]
For example:
data = [1,2,3,4]
size = 2
data[:size + 0] == [1,2]
data[:size + 1] == [1,2,3]
data[:size + 2] == [1,2,3,4]
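The same idea can be sketched as a generator. This is only an illustration: the name expanding_windows and the argument check are assumptions, not part of the answer above. It relies on nothing but len() and slicing, so it should behave the same way on a list or on a DataFrame (where data[:n] takes the first n rows).

```python
def expanding_windows(data, initial_size):
    """Yield data[:initial_size], data[:initial_size + 1], ..., data[:len(data)]."""
    # Hypothetical guard: reject sizes that cannot produce a first window.
    if not 0 < initial_size <= len(data):
        raise ValueError("initial_size must be between 1 and len(data)")
    # The start stays fixed at 0; only the end of the window moves.
    for end in range(initial_size, len(data) + 1):
        yield data[:end]


values = list(range(11))
windows = list(expanding_windows(values, 5))
print(len(windows))   # 7 windows: sizes 5, 6, ..., 11
print(windows[0])     # [0, 1, 2, 3, 4]
print(windows[-1])    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

With 11 rows and a first window of 5, this gives the seven windows listed in the desired output.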

Related

Iteratively pop and append to generate new lists using pandas

I have a list of elements mylist = [1, 2, 3, 4, 5, 6, 7, 8] and would like to iteratively:
copy the list
pop the first element of the copied list
and append it to the end of the copied list
repeat this for the next row, etc.
Desired output:
index A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
I suspect a for loop is needed but am having trouble iteratively generating rows based on the prior row.
I think slicing (see "Understanding slicing") is what you are looking for:
next_iteration = mylist[1:] + [mylist[0]]
and the full loop:
output = []
for i in range(len(mylist)):
    output.append(mylist[i:] + mylist[:i])
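If you want to follow the "pop the first element and append it to the end" wording literally, collections.deque from the standard library can do that step with rotate(-1). This is a sketch under that reading of the question; the function name rotations is made up for the example.

```python
from collections import deque


def rotations(items):
    """Return every left rotation of items, starting with the original order."""
    d = deque(items)
    rows = []
    for _ in range(len(d)):
        rows.append(list(d))  # snapshot of the current order
        d.rotate(-1)          # move the first element to the end
    return rows


mylist = [1, 2, 3, 4, 5, 6, 7, 8]
rows = rotations(mylist)
print(rows[0])  # [1, 2, 3, 4, 5, 6, 7, 8]
print(rows[1])  # [2, 3, 4, 5, 6, 7, 8, 1]
print(rows[7])  # [8, 1, 2, 3, 4, 5, 6, 7]
```

The resulting list of rows can then be passed to pd.DataFrame(rows, columns=list('ABCDEFGH')) to get the table from the question.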
Use this numpy solution, with the shifts created by np.arange:
import numpy as np
import pandas as pd
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
a = np.array(mylist)
rolls = np.arange(0, -8, -1)
print (rolls)
[ 0 -1 -2 -3 -4 -5 -6 -7]
df = pd.DataFrame(a[(np.arange(len(a))[:, None] - rolls) % len(a)],
                  columns=list('ABCDEFGH'))
print (df)
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
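To see why the indexing expression works, it may help to build the index matrix on a smaller array. The sketch below uses a 4-element array (an assumption made for readability) and writes the shift as i + j, which is what subtracting the negative rolls amounts to.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
n = len(a)

# Entry (i, j) is (i + j) % n: row i starts at position i and wraps around.
idx = (np.arange(n)[:, None] + np.arange(n)) % n
print(idx)
# [[0 1 2 3]
#  [1 2 3 0]
#  [2 3 0 1]
#  [3 0 1 2]]

# Fancy indexing with a 2-D integer array returns an array of the same shape.
rotated = a[idx]
print(rotated)
# [[1 2 3 4]
#  [2 3 4 1]
#  [3 4 1 2]
#  [4 1 2 3]]
```

Because the whole matrix is produced by one broadcasted operation, no Python-level loop over the rows is needed.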
If you need a loop solution (slower), you can use numpy.roll:
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
rolls = np.arange(0, -8, -1)
df = pd.DataFrame([np.roll(mylist, i) for i in rolls],
                  columns=list('ABCDEFGH'))
print (df)
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7
Try this (the := operator requires Python 3.8 or later):
mylist = [1, 2, 3, 4, 5, 6, 7, 8]
ar = np.roll(np.array(mylist), 1)
data = [ar := np.roll(ar, -1) for _ in range(ar.size)]
df = pd.DataFrame(data, columns=[*'ABCDEFGH'])
print(df)
>>>
A B C D E F G H
0 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 1
2 3 4 5 6 7 8 1 2
3 4 5 6 7 8 1 2 3
4 5 6 7 8 1 2 3 4
5 6 7 8 1 2 3 4 5
6 7 8 1 2 3 4 5 6
7 8 1 2 3 4 5 6 7

Filter and keep only rows with the same index

I have two data frames: X_oos_top_10 and y_oos_top_10. I need to filter both of them by X_oos_top_10["comm"] == 1.
This works for the first one:
X_oos_top_10_comm1 = X_oos_top_10[X_oos_top_10["comm"] == 1]
But for the second one I get this error: IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
y_oos_top_10_comm1 = y_oos_top_10[X_oos_top_10["comm"] == 1]
I have no idea how to do this.
Assuming X and y have the same length, you can use indexing.
Set up a minimal reproducible example:
import numpy as np
import pandas as pd
X_oos_top_10 = pd.DataFrame({'comm': np.random.randint(1, 10, 10)})
y_oos_top_10 = pd.DataFrame(np.random.randint(1, 10, (10, 4)), columns=list('ABCD'))
print(X_oos_top_10)
# Output:
comm
0 5
1 6
2 2
3 6
4 1
5 6
6 1
7 4
8 5
9 8
print(y_oos_top_10)
# Output:
A B C D
0 2 9 1 6
1 9 8 5 4
2 1 6 7 6
3 6 3 6 5
4 2 6 8 3
5 2 6 6 5
6 4 4 3 5
7 6 3 7 5
8 2 8 8 7
9 4 9 1 4
1st method
idx = X_oos_top_10[X_oos_top_10["comm"] == 1].index
out = y_oos_top_10.loc[idx]
print(out)
# Output:
A B C D
4 2 6 8 3
6 4 4 3 5
2nd method
Xy_oos_top_10 = X_oos_top_10.join(y_oos_top_10)
out = Xy_oos_top_10[Xy_oos_top_10['comm'] == 1]
print(out)
# Output:
comm A B C D
4 1 2 6 8 3
6 1 4 4 3 5
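Both methods above depend on the two frames sharing index labels. If the frames only match by position (same length, same row order, different index), one possible workaround is to convert the boolean Series to a plain NumPy array, so pandas has no index to align. The sketch below rests on that assumption; the small frames X and y are invented for the illustration, and the cause of the error in the question is not known.

```python
import pandas as pd

# Hypothetical data: same number of rows, but different index labels.
X = pd.DataFrame({"comm": [5, 1, 2, 1]}, index=[10, 11, 12, 13])
y = pd.DataFrame({"A": [7, 8, 9, 6]})  # default index 0..3

# A NumPy boolean array carries no index, so rows are matched by position.
mask = (X["comm"] == 1).to_numpy()
y_sel = y[mask]
print(y_sel)
```

Here the second and fourth rows of y are kept, because those are the positions where X["comm"] equals 1. This is only safe when row i of X really corresponds to row i of y.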

Column selection with iloc, with both individual indices and ranges

I wonder why this line returns "invalid syntax", and what's the correct syntax to use for selecting both isolated columns and ranges in one go:
X = f1.iloc[:, [2,5,[10:19]]].values
Btw the same happens with:
X = f1.iloc[:, [2,5,10:19]].values
Thanks.
The second form is almost right; you only need numpy.r_ to concatenate the indices:
import numpy as np
import pandas as pd
np.random.seed(2019)
f1 = pd.DataFrame(np.random.randint(10, size=(5, 25))).add_prefix('a')
print(f1)
a0 a1 a2 a3 a4 a5 ... a19 a20 a21 a22 a23 a24
0 8 2 5 8 6 8 ... 0 1 6 0 2 6
1 6 3 1 3 5 0 ... 4 8 1 0 6 1
2 8 2 3 0 9 2 ... 7 1 0 7 4 4
3 7 0 8 9 0 7 ... 3 0 8 6 0 2
4 7 3 2 4 9 9 ... 0 8 8 1 4 9
X = f1.iloc[:, np.r_[2,5,10:19]].values
print(X)
[[5 8 5 3 0 2 5 7 8 5 4]
[1 0 2 9 8 3 7 7 7 0 3]
[3 2 6 2 1 1 1 1 8 6 2]
[8 7 7 8 0 5 7 4 1 1 4]
[2 9 7 2 9 3 8 5 2 5 5]]
It is also possible to convert the values to a numpy array first; then iloc is not necessary:
X = f1.values[:, np.r_[2,5,10:19]]
print(X)
[[5 8 5 3 0 2 5 7 8 5 4]
[1 0 2 9 8 3 7 7 7 0 3]
[3 2 6 2 1 1 1 1 8 6 2]
[8 7 7 8 0 5 7 4 1 1 4]
[2 9 7 2 9 3 8 5 2 5 5]]
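To make it clearer what np.r_ produces, here is a small sketch that prints the index array on its own. The frame is filled with np.arange instead of random numbers (an assumption made so the result is easy to check by eye).

```python
import numpy as np
import pandas as pd

# np.r_ turns scalars and slices into one flat integer array.
cols = np.r_[2, 5, 10:19]
print(cols)  # [ 2  5 10 11 12 13 14 15 16 17 18]

# Row 0 holds 0..24, so the selected values in row 0 equal the column positions.
f1 = pd.DataFrame(np.arange(5 * 25).reshape(5, 25)).add_prefix('a')
X = f1.iloc[:, cols].to_numpy()
print(X.shape)  # (5, 11)
print(X[0])     # [ 2  5 10 11 12 13 14 15 16 17 18]
```

The slice 10:19 excludes its end point, as usual in Python, so column 19 is not selected.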

Python how to subset a data frame with just the column indices?

I have a huge dataframe with 282 columns and 500K rows. I wish to remove a list of columns from the dataframe using the column indices. The below code works for sequential columns.
df1 = df.ix[:,[0:2]]
The problem is that my column indices are not sequential.
For example, I want to remove columns 0,1,2 and 5 from df. I tried the following code:
df1 = df.ix[:,[0:2,5]]
I am getting the following error:
SyntaxError: invalid syntax
Any suggestions?
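Select the columns other than 0, 1, 2 and 5 with the following (written with .iloc, because .ix was deprecated in pandas 0.20 and removed in 1.0):
df.iloc[:, [3, 4] + list(range(6, 282))]
Or a little more dynamic:
df.iloc[:, [3, 4] + list(range(6, df.shape[1]))]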
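Is it a numpy array you've got? To pick columns 0, 1, 2 and 5 by position (this selects them rather than removing them), try
df1 = df.iloc[:, [0, 1, 2, 5]]
or, for a plain numpy array called data:
data[:, [i for i in range(3)] + [5]]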
Use np.r_[...] to concatenate slices along the first axis.
Sample DataFrame:
In [98]: df = pd.DataFrame(np.random.randint(10, size=(5, 12)))
In [99]: df
Out[99]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0 7 2 9 9 0 7 3 5 8 8 1
1 4 9 0 4 0 2 4 8 8 7 1 9
2 2 1 1 2 7 4 4 6 1 2 9 8
3 1 5 0 8 2 2 4 1 1 4 8 4
4 4 6 3 2 2 4 1 6 2 6 9 0
Solution:
In [107]: df.iloc[:, np.r_[3:5, 6:df.shape[1]]]
Out[107]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
In [108]: np.r_[3:5, 6:df.shape[1]]
Out[108]: array([ 3, 4, 6, 7, 8, 9, 10, 11])
or
In [110]: df.columns.difference([0,1,2,5])
Out[110]: Int64Index([3, 4, 6, 7, 8, 9, 10, 11], dtype='int64')
In [111]: df[df.columns.difference([0,1,2,5])]
Out[111]:
3 4 6 7 8 9 10 11
0 9 9 7 3 5 8 8 1
1 4 0 4 8 8 7 1 9
2 2 7 4 6 1 2 9 8
3 8 2 4 1 1 4 8 4
4 2 2 1 6 2 6 9 0
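When the positions to drop are already in a list, the positions to keep can also be computed instead of typed out. The sketch below uses np.setdiff1d for that; the variable names and the small frame are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(2, 12))
drop_positions = [0, 1, 2, 5]

# All column positions minus the ones to drop, returned in sorted order.
keep = np.setdiff1d(np.arange(df.shape[1]), drop_positions)
print(keep)  # [ 3  4  6  7  8  9 10 11]

df1 = df.iloc[:, keep]
print(df1.shape)  # (2, 8)
```

Because this works on positions rather than labels, it does not depend on what the columns are called.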

Pandas: How to get max and min values and write for every row?

I have data like this:
>> df
A B C
0 1 5 1
1 1 7 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 8 1
6 2 6 1
7 3 7 1
8 3 9 1
9 4 6 1
10 4 7 1
11 4 1 1
For each distinct value in column A, I want to find the minimum and maximum of column B and write the results back to the original table. My code is:
df1 = df.groupby(['A']).B.transform(max)
df1 = df1.rename(columns={'B':'B_max'})
df2 = df.groupby.(['A']).B.transform(min)
df1 = df1.rename(columns={'B':'B_min'})
df3 = df.join(df1['B_max']).join(df2['B_min'])
This is the result.
A B C B_max B_min
0 1 5 1
1 1 7 1 7
2 1 6 1
3 1 4 1 4
4 2 5 1
5 2 8 1 8
6 2 6 1 6
7 3 7 1 7
8 3 9 1 9
9 4 6 1
10 4 7 1 7
11 4 1 1 1
But I want the table to look like this:
A B C B_max B_min
0 1 5 1 7 4
1 1 7 1 7 4
2 1 6 1 7 4
3 1 4 1 7 4
4 2 5 1 8 6
5 2 8 1 8 6
6 2 6 1 8 6
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
How should I change the code so that the result looks like this?
I think you only need to assign the values to new columns, because transform returns a Series with the same length as df:
df = pd.DataFrame({
'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
'B': [5, 7, 6, 7, 5, 8, 6, 7, 9, 6, 7, 1],
'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})
print (df)
A B C
0 1 5 1
1 1 7 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 8 1
6 2 6 1
7 3 7 1
8 3 9 1
9 4 6 1
10 4 7 1
11 4 1 1
df['B_max'] = df.groupby(['A']).B.transform(max)
df['B_min'] = df.groupby(['A']).B.transform(min)
print (df)
A B C B_max B_min
0 1 5 1 7 5
1 1 7 1 7 5
2 1 6 1 7 5
3 1 7 1 7 5
4 2 5 1 8 5
5 2 8 1 8 5
6 2 6 1 8 5
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
Or, reusing the groupby object:
g = df.groupby('A').B
df['B_max'] = g.transform(max)
df['B_min'] = g.transform(min)
print (df)
A B C B_max B_min
0 1 5 1 7 5
1 1 7 1 7 5
2 1 6 1 7 5
3 1 7 1 7 5
4 2 5 1 8 5
5 2 8 1 8 5
6 2 6 1 8 5
7 3 7 1 9 7
8 3 9 1 9 7
9 4 6 1 7 1
10 4 7 1 7 1
11 4 1 1 7 1
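For comparison, the same columns can be produced by aggregating once per group and joining the result back on A. This is an alternative sketch, not the approach used above; it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 7, 6, 7, 5, 8, 6, 7, 9, 6, 7, 1],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})

# agg returns one row per group (indexed by A), unlike transform.
stats = df.groupby('A')['B'].agg(B_max='max', B_min='min')
print(stats)

# Joining on column A broadcasts each group's values back to its rows.
out = df.join(stats, on='A')
print(out)
```

transform is shorter when only a couple of columns are needed; the agg-and-join form keeps the per-group table around, which can be handy if it is needed separately.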
