I am transitioning from R to Python, and I am having a difficult time replicating the following code:
df = df %>% group_by(ID) %>% slice(seq_len(min(which(F < 1 & D == 8), n())))
Sample Data:
ID  Price    F    D
 1   10.1  1.0  NaN
 1   10.4  1.0  NaN
 1   10.6  0.8    8
 1    8.1  0.8  NaN
 1    8.5  0.8  NaN
 2   22.4  2.0  NaN
 2   22.1  2.0  NaN
 2   21.1  0.9    8
 2   20.1  0.9  NaN
 2   20.1  0.9    6
with the desired output:
ID  Price    F    D
 1   10.1  1.0  NaN
 1   10.4  1.0  NaN
 2   22.4  2.0  NaN
 2   22.1  2.0  NaN
I believe the Python code would involve some combination of np.where, cumcount(), and slicing.
However, I have no idea how I would go about doing this.
Any help would be appreciated, thank you.
EDIT: To anyone who comes to my question in the future hoping to find a solution - yatu's solution worked fine, but I have since worked my way to another solution which I found a bit easier to read:
df['temp'] = np.where((df['F'] < 1) & (df['D'] == 8), 1, 0)
mask = df.groupby('ID')['temp'].cumsum().eq(0)
df[mask]
I've read up on masking a bit, and it really does simplify things quite a bit!
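For reference, here is a runnable version of that approach on the sample data above (a sketch; the frame is rebuilt by hand from the table):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'Price': [10.1, 10.4, 10.6, 8.1, 8.5, 22.4, 22.1, 21.1, 20.1, 20.1],
                   'F':     [1, 1, 0.8, 0.8, 0.8, 2, 2, 0.9, 0.9, 0.9],
                   'D':     [np.nan, np.nan, 8, np.nan, np.nan, np.nan, np.nan, 8, np.nan, 6]})

# flag the first "bad" row per group, then keep everything before it
df['temp'] = np.where((df['F'] < 1) & (df['D'] == 8), 1, 0)
mask = df.groupby('ID')['temp'].cumsum().eq(0)
print(df[mask].drop(columns='temp'))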
You could index the dataframe using the conditions below:
c1 = df.D.eq(8).groupby(df.ID).cumsum().eq(0)
c2 = df.F.lt(1).groupby(df.ID).cumsum().eq(0)
df[c1 & c2]
   ID  Price    F    D
0   1   10.1  1.0  NaN
1   1   10.4  1.0  NaN
5   2   22.4  2.0  NaN
6   2   22.1  2.0  NaN
Note that by taking the cumulative sum of a boolean series you essentially propagate the True values: as soon as a True appears, the cumulative sum becomes nonzero for the rest of the group. Keeping only the rows where the cumulative sum is still zero therefore removes every row from the first match onward.
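A tiny standalone illustration of that behaviour (not tied to the data above):

import pandas as pd

s = pd.Series([False, False, True, False])
print(s.cumsum())        # 0, 0, 1, 1 -> nonzero from the first True onward
print(s.cumsum().eq(0))  # True, True, False, False -> keeps only the rows before the first True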
Details
The following dataframe shows the original data alongside the conditions used to index it. In this case, since the specified criteria occur on the same rows, both conditions behave identically:
df.assign(c1=c1, c2=c2)
   ID  Price    F    D     c1     c2
0   1   10.1  1.0  NaN   True   True
1   1   10.4  1.0  NaN   True   True
2   1   10.6  0.8    8  False  False
3   1    8.1  0.8  NaN  False  False
4   1    8.5  0.8  NaN  False  False
5   2   22.4  2.0  NaN   True   True
6   2   22.1  2.0  NaN   True   True
7   2   21.1  0.9    8  False  False
8   2   20.1  0.9  NaN  False  False
9   2   20.1  0.9    6  False  False
Yo, here's an example of a dataframe I'm currently working on:
   a    b
_____________
0  1    1.1
1  2    1.2
2  3    2.1
3  NaN  2.2
4  NaN  2.3
5  NaN  3.1
6  NaN  3.2
7  NaN  3.3
8  NaN  3.4
However, I want to reshape this dataframe so that it can be displayed as shown below (getting rid of the missing values and grouping the rows into clusters):
a b
_____________
1 1.1
1.2
__________
2 2.1
2.2
2.3
__________
3 3.1
3.2
3.3
3.4
I don't really know how to approach this properly. Any help is appreciated!
You can try
df['new'] = df['b'].astype('int')            # the integer part of b gives the cluster label
df.loc[df['new'].duplicated(), 'new'] = ''   # blank out repeated labels
df = df.drop(columns=['a'])                  # the original a column is no longer needed
df.columns = ['b', 'a']                      # values stay in b, the labels become a
df.index = df['a']                           # move the labels into the index
df_final = df['b'].reset_index()             # back to two columns: a (labels) and b (values)
print(df_final)
Output:
   a    b
0  1  1.1
1     1.2
2  2  2.1
3     2.2
4     2.3
5  3  3.1
6     3.2
7     3.3
8     3.4
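A slightly more compact variant of the same idea (a sketch; it assumes b is already sorted by its integer part, as in the sample):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3] + [np.nan] * 6,
                   'b': [1.1, 1.2, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3, 3.4]})

out = pd.DataFrame({'a': df['b'].astype(int), 'b': df['b']})
out.loc[out['a'].duplicated(), 'a'] = ''   # blank out repeated cluster labels
print(out)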
If this was my dataframe
a    b  c
12   5  0.1
9    7  8
1.1  2  12.9
I can use the following code to get the max values in each row... (12) (9) (12.9)
df = df.max(axis=1)
But I don't know how you would get the max values comparing only columns a & b (12, 9, 2).
Assuming one wants to consider only the columns a and b, and store the maximum value in a new column called max, one can do the following
df['max'] = df[['a', 'b']].max(axis=1)
[Out]:
a b c max
0 12.0 5 0.1 12.0
1 9.0 7 8.0 9.0
2 1.1 2 12.9 2.0
One can also do that with a custom lambda function, as follows
df['max'] = df[['a', 'b']].apply(lambda x: max(x), axis=1)
[Out]:
a b c max
0 12.0 5 0.1 12.0
1 9.0 7 8.0 9.0
2 1.1 2 12.9 2.0
As per the OP's request, if one wants to create a new column, max_of_all, to store the maximum value across all the dataframe columns, one can use the following
df['max_of_all'] = df.max(axis=1)
[Out]:
a b c max max_of_all
0 12.0 5 0.1 12.0 12.0
1 9.0 7 8.0 9.0 9.0
2 1.1 2 12.9 2.0 12.9
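For reference, the frame used above can be rebuilt like this (a sketch of the sample data, not necessarily how the OP created it):

import pandas as pd

df = pd.DataFrame({'a': [12, 9, 1.1], 'b': [5, 7, 2], 'c': [0.1, 8, 12.9]})

df['max'] = df[['a', 'b']].max(axis=1)   # max over columns a and b only
df['max_of_all'] = df.max(axis=1)        # max over every column (the extra max column does not change the result here)
print(df)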
I have a data table:
Index Value
0 NaN
1 1.15
2 2.25
3 2.33
Condition: wherever the previous row's value is not NaN, replace the current row's value with the previous row's value (so the first value of each run gets propagated).
Desired output:
Index Value
0 NaN
1 1.15
2 1.15
3 1.15
Compare the values to find which are not missing, keep only the first value of each consecutive non-missing run with DataFrame.where, forward fill the rest, and finally restore the original missing values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value':[np.nan, 1.15, 2.15, 3.15, np.nan, 2.1, 2.2, 2.3]})

m = df.notna()                                    # True where a value is present
df1 = df.where(m.ne(m.shift())).ffill().where(m)  # keep run starts, forward fill, restore NaNs
print (df1)
Value
0 NaN
1 1.15
2 1.15
3 1.15
4 NaN
5 2.10
6 2.10
7 2.10
Details:
print (m.ne(m.shift()))
Value
0 True
1 True
2 False
3 False
4 True
5 True
6 False
7 False
print (df.where(m.ne(m.shift())))
Value
0 NaN
1 1.15
2 NaN
3 NaN
4 NaN
5 2.10
6 NaN
7 NaN
print (df.where(m.ne(m.shift())).ffill())
Value
0 NaN
1 1.15
2 1.15
3 1.15
4 1.15
5 2.10
6 2.10
7 2.10
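The same steps, wrapped up as a small helper (a sketch; keep_first_of_run is just an illustrative name):

import numpy as np
import pandas as pd

def keep_first_of_run(s):
    # Replace every value in a consecutive non-NaN run with the run's first value.
    m = s.notna()
    # keep only the value at the start of each run, forward fill, then restore the original NaNs
    return s.where(m.ne(m.shift())).ffill().where(m)

s = pd.Series([np.nan, 1.15, 2.15, 3.15, np.nan, 2.1, 2.2, 2.3], name='Value')
print(keep_first_of_run(s))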
Hi all, my dataframe looks like this:
A | B                        | C   | D   | E
  | 'USD'                    |     |     |
  | 'trading expenses-total' |     |     |
  | 8.10                     | 2.3 | 5.5 |
  | 9.1                      | 1.4 | 6.1 |
  | 5.4                      | 5.1 | 7.8 |
I haven't found anything quite like this, so apologies if this is a duplicate. Essentially I am trying to locate the column that contains the string 'total' (column B) and its two adjacent columns (C and D), and turn them into a dataframe. I feel like I am close with the following code:
test.loc[:,test.columns.str.contains('total')]
which isolates the correct column, but I can't quite figure out how to grab the adjacent two columns. My desired output is:
B                        | C   | D
'USD'                    |     |
'trading expenses-total' |     |
8.10                     | 2.3 | 5.5
9.1                      | 1.4 | 6.1
5.4                      | 5.1 | 7.8
OLD answer:
Pandas approach:
In [36]: df = pd.DataFrame(np.random.rand(3,5), columns=['A','total','C','D','E'])
In [37]: df
Out[37]:
A total C D E
0 0.789482 0.427260 0.169065 0.112993 0.142648
1 0.303391 0.484157 0.454579 0.410785 0.827571
2 0.984273 0.001532 0.676777 0.026324 0.094534
In [38]: idx = np.argmax(df.columns.str.contains('total'))
In [39]: df.iloc[:, idx:idx+3]
Out[39]:
total C D
0 0.427260 0.169065 0.112993
1 0.484157 0.454579 0.410785
2 0.001532 0.676777 0.026324
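The key trick above is that np.argmax on a boolean array returns the position of the first True (a standalone illustration):

import numpy as np

has_total = np.array([False, True, False, False, False])
print(np.argmax(has_total))   # 1 -> index of the first matching column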
UPDATE:
In [118]: df
Out[118]:
A B C D E
0 NaN USD NaN NaN NaN
1 NaN trading expenses-total NaN NaN NaN
2 A 8.10 2.3 5.5 10.0
3 B 9.1 1.4 6.1 11.0
4 C 5.4 5.1 7.8 12.0
In [119]: col = df.select_dtypes(['object']).apply(lambda x: x.str.contains('total').any()).idxmax()
In [120]: cols = df.columns.to_series().loc[col:].head(3).tolist()
In [121]: col
Out[121]: 'B'
In [122]: cols
Out[122]: ['B', 'C', 'D']
In [123]: df[cols]
Out[123]:
B C D
0 USD NaN NaN
1 trading expenses-total NaN NaN
2 8.10 2.3 5.5
3 9.1 1.4 6.1
4 5.4 5.1 7.8
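Putting the UPDATE's steps together, a small helper could look like this (a sketch; grab_total_block and the width parameter are illustrative names, and na=False is added to str.contains to guard against NaN cells):

import pandas as pd

def grab_total_block(df, width=3):
    # first object-dtype column whose values contain the string 'total'
    col = (df.select_dtypes(['object'])
             .apply(lambda x: x.str.contains('total', na=False).any())
             .idxmax())
    # that column plus the next width - 1 columns to its right
    cols = df.columns.to_series().loc[col:].head(width).tolist()
    return df[cols]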
Here's one approach -
from scipy.ndimage import binary_dilation as bind
mask = test.columns.str.contains('total')
test_out = test.iloc[:,bind(mask,[1,1,1],origin=-1)]
If you don't have access to SciPy, you can also use np.convolve, like so -
test_out = test.iloc[:,np.convolve(mask,[1,1,1])[:-2]>0]
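To see what the convolution does to the mask (a standalone illustration):

import numpy as np

mask = np.array([False, True, False, False, False])   # pretend 'total' is the second column
print(np.convolve(mask, [1, 1, 1])[:-2] > 0)
# -> [False  True  True  True False]: the match plus the two columns to its right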
Sample runs
Case #1 :
In [390]: np.random.seed(1234)
In [391]: test = pd.DataFrame(np.random.randint(0,9,(3,5)))
In [392]: test.columns = ['P','total001','g','r','t']
In [393]: test
Out[393]:
P total001 g r t
0 3 6 5 4 8
1 1 7 6 8 0
2 5 0 6 2 0
In [394]: mask = test.columns.str.contains('total')
In [395]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[395]:
total001 g r
0 6 5 4
1 7 6 8
2 0 6 2
Case #2 :
This also works if you have multiple matching columns, and also when a match sits near the right edge and doesn't have two columns to its right -
In [401]: np.random.seed(1234)
In [402]: test = pd.DataFrame(np.random.randint(0,9,(3,7)))
In [403]: test.columns = ['P','total001','g','r','t','total002','k']
In [406]: test
Out[406]:
P total001 g r t total002 k
0 3 6 5 4 8 1 7
1 6 8 0 5 0 6 2
2 0 5 2 6 3 7 0
In [407]: mask = test.columns.str.contains('total')
In [408]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[408]:
total001 g r total002 k
0 6 5 4 1 7
1 8 0 5 6 2
2 5 2 6 7 0
I have a dataframe that has the following basic structure:
import numpy as np
import pandas as pd
tempDF = pd.DataFrame({'condition':[0,0,0,0,0,1,1,1,1,1],'x1':[1.2,-2.3,-2.1,2.4,-4.3,2.1,-3.4,-4.1,3.2,-3.3],'y1':[6.5,-7.6,-3.4,-5.3,7.6,5.2,-4.1,-3.3,-5.7,5.3],'decision':[np.nan]*10})
print(tempDF)
condition decision x1 y1
0 0 NaN 1.2 6.5
1 0 NaN -2.3 -7.6
2 0 NaN -2.1 -3.4
3 0 NaN 2.4 -5.3
4 0 NaN -4.3 7.6
5 1 NaN 2.1 5.2
6 1 NaN -3.4 -4.1
7 1 NaN -4.1 -3.3
8 1 NaN 3.2 -5.7
9 1 NaN -3.3 5.3
Within each row, I want to change the value of the 'decision' column to zero if the 'condition' column equals zero and if 'x1' and 'y1' are both the same sign (either positive or negative) - for the purposes of this script zero is considered to be positive. If the signs of 'x1' and 'y1' are different or if the 'condition' column equals 1 (regardless of the signs of 'x1' and 'y1') then the 'decision' column should equal 1. I hope I've explained that clearly.
I can iterate over each row of the dataframe as follows:
for i in range(len(tempDF)):
    if tempDF.loc[i, 'condition'] == 0 and (((tempDF.loc[i, 'x1'] >= 0) and (tempDF.loc[i, 'y1'] >= 0)) or ((tempDF.loc[i, 'x1'] < 0) and (tempDF.loc[i, 'y1'] < 0))):
        tempDF.loc[i, 'decision'] = 0
    else:
        tempDF.loc[i, 'decision'] = 1
print(tempDF)
condition decision x1 y1
0 0 0 1.2 6.5
1 0 0 -2.3 -7.6
2 0 0 -2.1 -3.4
3 0 1 2.4 -5.3
4 0 1 -4.3 7.6
5 1 1 2.1 5.2
6 1 1 -3.4 -4.1
7 1 1 -4.1 -3.3
8 1 1 3.2 -5.7
9 1 1 -3.3 5.3
This produces the right output but it's a bit slow. The real dataframe I have is very large and these comparisons will need to be made many times. Is there a more efficient way to achieve the desired result?
First, use np.sign and the comparison operators to create a boolean array which is True where the decision should be 1:
decision = df["condition"] | (np.sign(df["x1"]) != np.sign(df["y1"]))
Here I've used De Morgan's laws: the decision is 1 unless condition is 0 and x1 and y1 share the same sign.
Then cast to int and put it in the dataframe:
df["decision"] = decision.astype(int)
Giving:
>>> df
condition decision x1 y1
0 0 0 1.2 6.5
1 0 0 -2.3 -7.6
2 0 0 -2.1 -3.4
3 0 1 2.4 -5.3
4 0 1 -4.3 7.6
5 1 1 2.1 5.2
6 1 1 -3.4 -4.1
7 1 1 -4.1 -3.3
8 1 1 3.2 -5.7
9 1 1 -3.3 5.3
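Put together with the question's sample frame, a runnable version of this answer would look like the sketch below:

import numpy as np
import pandas as pd

tempDF = pd.DataFrame({'condition': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                       'x1': [1.2, -2.3, -2.1, 2.4, -4.3, 2.1, -3.4, -4.1, 3.2, -3.3],
                       'y1': [6.5, -7.6, -3.4, -5.3, 7.6, 5.2, -4.1, -3.3, -5.7, 5.3]})

# decision is 1 when condition is 1 or when x1 and y1 have different signs
decision = tempDF['condition'] | (np.sign(tempDF['x1']) != np.sign(tempDF['y1']))
tempDF['decision'] = decision.astype(int)
print(tempDF)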