Dataframe column: find local maxima - python

In the dataframe below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros; these vectors correspond to the non-zero elements of column "Portfolio".
I would like to find the local maximum of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorization or other methods) column "CumRetperTrade" into the column "PeakCumRet" (the desired result), which gives, for every vector contained in "CumRetperTrade", its local max value. A numeric example is below. Thanks in advance.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio": [1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet": [3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
Portfolio CumRetperTrade PeakCumRet
1 3 3
1 2 3
1 1 3
0 0 0
0 0 0
0 0 0
1 4 4
1 2 4
1 1 4

You can use:
df1['PeakCumRet'] = (df1.groupby(df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum())
                        ['CumRetperTrade'].transform('max'))
Output:
Portfolio CumRetperTrade PeakCumRet
0 1 3 3
1 1 2 3
2 1 1 3
3 0 0 0
4 0 0 0
5 0 0 0
6 1 4 4
7 1 2 4
8 1 1 4
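To see why this works, here is a small sketch (using the df1 from the question) of the intermediate group labels the groupby receives: each change in "Portfolio" marks the start of a new run, and cumsum() turns those change points into consecutive labels.
# True wherever "Portfolio" differs from the previous row, i.e. a new run starts
groups = df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum()
print(groups.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3]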

Related

Find the rows that share the value

I need to find the rows where A, B and C all have the value 1, and then create a new column that holds the result.
My idea is to use np.where() with some condition, but I don't know the correct way of dealing with this problem. From what I have read, I'm not supposed to iterate through a dataframe, but to use one of the pandas vectorized methods?
df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
A B C TRUE
0 0 1 0 0
1 1 1 1 1 <----
2 1 0 1 0
4 0 1 1 0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
A B C TRUE
0 0 1 0 0
1 1 1 1 1
2 1 0 1 0
4 0 1 1 0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)

creating index columns with python

As a minimal working example, I have a file.txt containing a list of numbers:
1.1
2.1
3.1
4.1
5.1
6.1
7.1
8.1
which actually should be presented with indices that make it a 3D array:
0 0 1.1
1 0 2.1
0 1 3.1
1 1 4.1
0 2 5.1
1 2 6.1
0 3 7.1
1 3 8.1
I want to import the 3D array into python and have been using bash to generate the indices and then pasting the index to file.txt before importing the resulting full.txt in python using pandas:
for ((y=0;y<=3;y++)); do
    for ((x=0;x<=1;x++)); do
        echo -e "$x\t$y"
    done
done > index.txt
paste index.txt file.txt> full.txt
The writing of index.txt has been slow in my actual code, where x goes up to 9000 and y up to 5000. Is there a way to generate the indices into the first two columns of a 2D python numpy array so I only need to import the data from file.txt as the third column?
I would recommend using pandas for loading the data and managing columns of different types.
We can generate the indices with np.indices for the desired dimensions and reshape them to match your format, then concatenate the values read from 'file.txt'.
Creating the index for (9000, 5000) takes about 950 ms on a Colab instance.
import numpy as np
import pandas as pd
x, y = 2, 4  # dimensions; also works with 9000, 5000 but assumes 'file.txt' has the correct size
pd.concat([
    pd.DataFrame(np.indices((x, y)).ravel('F').reshape(-1, 2), columns=['ind1', 'ind2']),
    pd.read_csv('file.txt', header=None, names=['Value'])
], axis=1)
Out:
ind1 ind2 Value
0 0 0 1.1
1 1 0 2.1
2 0 1 3.1
3 1 1 4.1
4 0 2 5.1
5 1 2 6.1
6 0 3 7.1
7 1 3 8.1
How this works
First create the indices for your desired dimensions with np.indices
np.indices((2,4))
Out:
array([[[0, 0, 0, 0],
[1, 1, 1, 1]],
[[0, 1, 2, 3],
[0, 1, 2, 3]]])
This gives us the right indices, but in the wrong order.
With ravel('F') we can flatten the array in column-first (Fortran) order:
np.indices((2,4)).ravel('F')
Out:
array([0, 0, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 0, 3, 1, 3])
To get the desired columns, reshape into a 2D array with shape (8, 2); with (-1, 2) the first dimension is inferred.
np.indices((2,4)).ravel('F').reshape(-1,2)
Out:
array([[0, 0],
[1, 0],
[0, 1],
[1, 1],
[0, 2],
[1, 2],
[0, 3],
[1, 3]])
Then convert into a dataframe with columns ind1 and ind2.
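That last step, spelled out (this is the same call as in the full snippet above):
pd.DataFrame(np.indices((2,4)).ravel('F').reshape(-1,2), columns=['ind1','ind2'])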
Working with more dimensions
pd.DataFrame(np.indices((2,4,3)).ravel('F').reshape(-1,3)).add_prefix('ind')
Out:
ind0 ind1 ind2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0
6 0 3 0
7 1 3 0
8 0 0 1
9 1 0 1
10 0 1 1
11 1 1 1
12 0 2 1
13 1 2 1
14 0 3 1
15 1 3 1
16 0 0 2
17 1 0 2
18 0 1 2
19 1 1 2
20 0 2 2
21 1 2 2
22 0 3 2
23 1 3 2
Here is a quick example of how to create the 3D array from a 1D array. As dummy data I use random numbers. It then creates tuples of (x, y, value).
It takes about a minute for 45M rows.
from random import randrange

x = 5000
y = 9000
numbers = [randrange(100000, 999999) for i in range(x*y)]
# a*y + b maps each (a, b) pair to a unique position in numbers
array = [(a, b, numbers[a*y + b]) for a in range(x) for b in range(y)]
Output
pd.DataFrame(array)
Out[23]:
0 1 2
0 0 0 878704
1 0 1 524573
2 0 2 943657
3 0 3 496507
4 0 4 802714
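If the loop is still too slow, here is a vectorized numpy sketch (my own alternative, untimed) that builds the same (a, b, value) triples without a Python-level loop:
import numpy as np
vals = np.random.randint(100000, 1000000, size=x*y)  # dummy values, as above
a, b = np.divmod(np.arange(x*y), y)  # a varies slowest, matching the loop order
array = np.column_stack([a, b, vals])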
If you want to stick to bash, you can avoid the inner loop:
Code:
for ((y=0;y<=3;y++)); do
    echo -e "0\t$y\n1\t$y"
done
Output:
0 0
1 0
0 1
1 1
0 2
1 2
0 3
1 3
The above in Python is:
Code:
for y in range(4):
    print(f'0\t{y}\n1\t{y}')
Output:
0 0
1 0
0 1
1 1
0 2
1 2
0 3
1 3
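And if the goal is to write index.txt directly instead of printing (a small sketch):
with open('index.txt', 'w') as f:
    for y in range(4):
        f.write(f'0\t{y}\n1\t{y}\n')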

Data frame mode function

Hi, I want to ask about the df.mode() function, which I am using to find the most common value in each row. It gives me an extra column; how could I get only one column? I am using df.mode(axis=1).
For example, I have a data frame:
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
so I want the output
1 1
2 0
3 0
but I am getting
1 1 NaN
2 0 NaN
3 0 NaN
Does anyone know why?
The code you tried gives the expected output in Python 3.7.6 with Pandas 1.0.3.
import pandas as pd
df = pd.DataFrame(
    data=[[1, 0, 1, 1, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 0]],
    index=[1, 2, 3])
df
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
df.mode(axis=1)
0
1 1
2 0
3 0
There could be different data types in your columns, and mode cannot compare columns of different data types.
Use astype(str) or astype(int) to convert the offending columns to a suitable data type, and make sure the data types are consistent across the df before calling mode(axis=1).
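A hypothetical illustration of that fix: suppose column 2 was read in as strings, so '0' and 0 are counted as different values in the row-wise mode. Casting it first restores the single-column result:
df = pd.DataFrame(
    data=[[1, 0, '1', 1, 1], [0, 1, '0', 0, 1], [0, 0, '1', 1, 0]],
    index=[1, 2, 3])
df[2] = df[2].astype(int)  # make the dtypes consistent
df.mode(axis=1)            # single column: 1, 0, 0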

Pandas create a unique id for each row based on a condition

I've a dataset where one of the columns is as below. I'd like to create a new column based on the following condition.
For the values in column_name: if a 1 is present, create a new id; if a 0 is present, also create a new id. But if 1 is repeated over more than one consecutive row, the id should be the same for all of those rows. The sample output can be seen below.
column_name
1
0
0
1
1
1
1
0
0
1
column_name -- ID
1 -- 1
0 -- 2
0 -- 3
1 -- 4
1 -- 4
1 -- 4
1 -- 4
0 -- 5
0 -- 6
1 -- 7
Say your Series is
s = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
Then you can use:
>>> ((s != 1) | (s.shift(1) != 1)).cumsum()
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
dtype: int64
This checks that either the current entry is not 1, or that the previous entry is not 1, and then performs a cumulative sum on the result.
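Applied to a dataframe (assuming the column is named column_name as in the question; the dataframe name df is mine):
df['ID'] = ((df['column_name'] != 1) | (df['column_name'].shift(1) != 1)).cumsum()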
Essentially this leverages the fact that a 1 in the Series preceded by another 1 should be treated as part of the same group, while everything else calls for an increment. One of four things will happen:
1) 0 with a preceding 0: increment by 1
2) 0 with a preceding 1: increment by 1
3) 1 with a preceding 1: increment by 0
4) 1 with a preceding 0: increment by 1
(
    (df['column_name'] + df['column_name'].shift(1))  # Series with values 0, 1 or 2 (first entry is NaN)
    .fillna(0)       # fill the first entry with 0
    .isin([0, 1])    # True for cases 1, 2 and 4 described above, else False (case 3)
    .astype('int')   # convert the booleans to integers
    .cumsum()        # running total yields the IDs
)
Output:
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
At this stage I would just use a regular Python for loop:
column_name = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
ID = [1]
for i in range(1, len(column_name)):
    ID.append(ID[-1] + ((column_name[i] + column_name[i-1]) < 2))
print(ID)
>>> [1, 2, 3, 4, 4, 4, 4, 5, 6, 7]
And then you can assign ID as a column in your dataframe:
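df['ID'] = ID  # assuming your dataframe is called df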

pandas DataFrame set non-contiguous sections

I have a DataFrame like below and would like B to be 1 for n rows after a 1 appears in column A (below, n = 2):
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix similar to this example, but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that a 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2, these examples:
A = [1, 0, 1, 0, 1], B should be [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
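Not an answer from the thread, but a plain-loop sketch of the ignore rule described in the edit (the helper name fill_b is mine; n defaults to 2):
def fill_b(a, n=2):
    # B gets a 1 for the n rows after each *kept* 1 in A; a 1 in A is
    # ignored when it falls within n rows of the previously kept 1.
    b = [0] * len(a)
    last = -n - 1  # position of the last kept 1
    for i, val in enumerate(a):
        if val == 1 and i - last > n:
            last = i
            for j in range(i + 1, min(i + n + 1, len(a))):
                b[j] = 1
    return b

print(fill_b([1, 0, 1, 0, 1]))  # [0, 1, 1, 0, 0]
print(fill_b([1, 1, 0, 0]))     # [0, 1, 1, 0]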
