Pandas convert positive number to 1 and negative number to -1 - python

I have a column of positive and negative number. How to convert this column to a new column to realize convert positive number to 1 and negative number to -1?

You need numpy.sign
df['new'] = np.sign(df['col'])
Sample:
df = pd.DataFrame({ 'col':[-1,3,-5,7,1,0]})
df['new'] = np.sign(df['col'])
print (df)
col new
0 -1 -1
1 3 1
2 -5 -1
3 7 1
4 1 1
5 0 0

It's really easy to perform this task by -
For whole data frame -
df[df < 0] = -1
df[df > 0] = 1
For specific column -
df['column_name'][df['column_name'] < 0] = -1
df['column_name'][df['column_name'] > 0] = 1

df[df < 0] = -1
df[df > 0] = 1
no behaviour defined for df == 0

Related

Replacing positive, negative, and zero values by 1, -1, and 0 respectively

I have a pandas dataframe(100,000 obs) with 11 columns.
I'm trying to assign df['trade_sign'] values based on the df['diff'] (which is a pd.series object of integer values)
If diff is positive, then trade_sign = 1
if diff is negative, then trade_sign = -1
if diff is 0, then trade_sign = 0
What I've tried so far:
pos['trade_sign'] = (pos['trade_sign']>0) <br>
pos['trade_sign'].replace({False: -1, True: 1}, inplace=True)
But this obviously doesn't take into account 0 values.
I also tried for loops with if conditions but that didn't work.
Essentially, how do I fix my .replace function to take account of diff values of 0.
Ideally, I'd prefer a solution that uses numpy over for loops with if conditions.
There's a sign function in numpy:
df["trade_sign"] = np.sign(df["diff"])
If you want integers,
df["trade_sign"] = np.sign(df["diff"]).astype(int)
a = [-1 if df['diff'].values[i] < 0 else 1 for i in range(len(df['diff'].values))]
df['trade_sign'] = a
You could do it this way:
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
The boolean results of the element-wise > and < comparisons automatically get converted to int in order to allow multiplication with 1 and -1, respectively.
This sample input and test code:
import pandas as pd
pos = pd.DataFrame({'diff':[-9,0,9,-8,0,8,-7-6-5,4,3,2,0]})
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
print(pos)
... gives this output:
diff trade_sign
0 -9 -1
1 0 0
2 9 1
3 -8 -1
4 0 0
5 8 1
6 -18 -1
7 4 1
8 3 1
9 2 1
10 0 0
UPDATE: In addition to the solution above, as well as some of the other excellent ideas in other answers, you can use numpy where:
pos['trade_sign'] = np.where(pos['diff'] > 0, 1, np.where(pos['diff'] < 0, -1, 0))

Cumsum column while skipping rows or setting fixed values on a condition based on the result of the actual cumsum

I'm trying to find a vectorized solution in pandas that is quite common in spreadsheets which is to cumsum while skipping or setting fixed values on a condition based on the result of the actual cumsum. I have the following:
A
1 0
2 -1
3 2
4 3
5 -2
6 -3
7 1
8 -1
9 1
10 -2
11 1
12 2
13 -1
14 -2
What I need is to add a second column with the cumsum of 'A' and if one of these sums gives a positive value replace it with 0 and continue the cumsum using that 0. At the same time if the cumsum gives a negative value that is lower than the lowest value in column A recorded after a 0 in column B I will need to replace it with that lowest value in column A. I know this is quite a problem but is there a vectorized solution for this? Maybe using an auxiliary column. The result should look like this:
A B
1 0 0
2 -1 -1 # -1+0 = -1
3 2 0 # -1 + 2 = 1 but 1>0 so this is 0
4 3 0 # same as previous row
5 -2 -2 # -2+0 = -2
6 -3 -3 # -2-3 = -5 but the lowest value in column A since last 0 is -3 so this is replaced by -3
7 1 -2 # 1-3 = -2
8 -1 -3 # -1-2 = -3
9 1 -2 # -3 + 1 = -2
10 -2 -3 # -2-2 = -4 but the lowest value in column A since last 0 is -3 so this is replaced by -3
11 1 -2 # -3 +1 = -2
12 2 0 # -2+2 = 0
13 -1 -1 # 0-1 = -1
14 -2 -2 # -1-2 = -3 but the lowest value in column A since last cap is -2 so this is -2 instead of -3
For the moment I made this but does not work 100% and again is not really efficient:
df['B'] = 0
df['B'][0] = 0
for x in range(len(df)-1):
A = df['A'][x + 1]
B = df['B'][x] + A
if B >= 0:
df['B'][x+1] = 0
elif B < 0 and A < 0 and B < A:
df['B'][x+1] = A
else:
df['B'][x + 1] = B
Using df['A'].expanding(1).apply(function) I could run own function which first get only one row, next 2 rows, next 3 rows, etc. I doesn't give result from previous calculation and it needs to make all calculations again and again but it doesn't need global
variables and hardcoded df['A']
Doc: Series.expanding
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
def function(values):
#print(values)
#print(type(valuse)
#print(len(values))
result = 0
last_zero = 0
for index, value in enumerate(values):
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(values[last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
#print('result:', result)
return result
df['B'] = df['A'].expanding(1).apply(function)
df['B'] = df['B'].astype(int)
print(df)
Result:
A B
0 0 0
1 -1 -1
2 2 0
3 3 0
4 -2 -2
5 -3 -3
6 1 -2
7 -1 -3
8 1 -2
9 -2 -3
10 1 -2
11 2 0
12 -1 -1
13 -2 -2
The same but with normal apply() - it needs global variables and hardcoded df['A']
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
result = 0
last_zero = 0
index = 0
def function(value):
global result
global last_zero
global index
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
index += 1
#print('result:', result)
return result
df['B'] = df['A'].apply(function)
df['B'] = df['B'].astype(int)
print(df)
The same using normal for-loop
A = [0, -1, 2, 3, -2, -3, 1, -1, 1, -2, 1, 2, -1, -2]
import pandas as pd
df = pd.DataFrame({"A": A})
all_values = []
result = 0
last_zero = 0
for index, value in df['A'].iteritems():
result += value
if result >= 0:
result = 0
last_zero = index
else:
minimal = min(df['A'][last_zero:])
#print(index, last_zero, minimal)
#if result < minimal:
# result = minimal
result = max(result, minimal)
all_values.append(result)
df['B'] = all_values
print(df)

Python - how to replace values greater and smaller than a particular value

How can I replace the values of a DataFrame if are smaller or greater than a particular value?
print(df)
name seq1 seq11
0 seq102 -14 -5.99
1 seq103 -5.25 -7.94
I want to set the values < than -8.5 to 1 and > than -8.5 to 0.
I tried this but all the values gets zero;
import pandas as pd
df = pd.read_csv('df.csv')
num = df._get_numeric_data()
num[num < -8.50] = 1
num[num > -8.50] = 0
The desired output should be:
name seq1 seq11
0 seq102 1 0
1 seq103 0 0
Thank you
Try
num.iloc[:,1:] = num.iloc[:,1:].applymap(lambda x: 1 if x < -8.50 else 0)
Note that values equal to -8.50 will be set to zero here.
def thresh(x):
if(x < -8.5):
return 1
elif(x > -8.5):
return 0
return x
print(df[["seq1", "seq2"]].apply(thresh))

Pandas set value if most columns are equal in a dataframe

starting by another my question I've done yesterday Pandas set value if all columns are equal in a dataframe
Starting by #anky_91 solution I'm working on something similar.
Instead of put 1 or -1 if all columns are equals I want something more flexible.
In fact I want 1 if (for example) the 70% percentage of the columns are 1, -1 for the same but inverse condition and 0 else.
So this is what I've wrote:
# Instead of using .all I use .sum to count the occurence of 1 and 0 for each row
m1 = local_df.eq(1).sum(axis=1)
m2 = local_df.eq(0).sum(axis=1)
# Debug print, it work
print(m1)
print(m2)
But I don't know how to change this part:
local_df['enseamble'] = np.select([m1, m2], [1, -1], 0)
m = local_df.drop(local_df.columns.difference(['enseamble']), axis=1)
I write in pseudo code what I want:
tot = m1 + m2
if m1 > m2
if(m1 * 100) / tot > 0.7 # simple percentage calculus
df['enseamble'] = 1
else if m2 > m1
if(m2 * 100) / tot > 0.7 # simple percentage calculus
df['enseamble'] = -1
else:
df['enseamble'] = 0
Thanks
Edit 1
This is an example of expected output:
NET_0 NET_1 NET_2 NET_3 NET_4 NET_5 NET_6
date
2009-08-02 0 1 1 1 0 1
2009-08-03 1 0 0 0 1 0
2009-08-04 1 1 1 0 0 0
date enseamble
2009-08-02 1 # because 1 is more than 70%
2009-08-03 -1 # because 0 is more than 70%
2009-08-04 0 # because 0 and 1 are 50-50
You could obtain the specified output from the following conditions:
thr = 0.7
c1 = (df.eq(1).sum(1)/df.shape[1]).gt(thr)
c2 = (df.eq(0).sum(1)/df.shape[1]).gt(thr)
c2.astype(int).mul(-1).add(c1)
Output
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
dtype: int64
Or using np.select:
pd.DataFrame(np.select([c1,c2], [1,-1], 0), index=df.index, columns=['result'])
result
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
Try with (m1 , m2 and tot are same as what you have):
cond1=(m1>m2)&((m1 * 100/tot).gt(0.7))
cond2=(m2>m1)&((m2 * 100/tot).gt(0.7))
df['enseamble'] =np.select([cond1,cond2],[1,-1],0)
m =df.drop(df.columns.difference(['enseamble']), axis=1)
print(m)
enseamble
date
2009-08-02 1
2009-08-03 -1
2009-08-04 0

How to iterate over a pandas dataframe while referencing previous rows?

I am iterating over a Python dataframe and finding it to be extremely slow. I understand that in Pandas you try to vectorize everything, but in this case I specifically need to iterate (or if it is possible to vectorize, I'm unclear how to do it).
The logic is simple: you have two columns "A" and "B" and a result column "signal." If A equals 1, then you set signal to 1. If B equals 1, then you set signal to 0. Otherwise, signals is whatever it was previously. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
numrows = len(indata)
data = pd.DataFrame(index= range(0,numrows))
data['A'] = indata['A']
data['B'] = indata['B']
data['signal'] = 0
for i in range(1,numrows):
if data['A'].iloc[i] == 1:
data['signal'].iloc[i] = 1
elif data['B'].iloc[i] == 1:
data['signal'].iloc[i] = 0
else:
data['signal'].iloc[i] = data['signal'].iloc[i-1]
return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('head', 1)
df
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
The logic here involves dividing your series into groups based on the inequality between A and B, and every group's value is determined by A.
You dont need to iterate at all you can do some Boolean indexing
#set condition for A
indata.loc[indata.A == 1,'signal'] = 1
#set condition for B
indata.loc[indata.B == 1,'signal'] = 0
#forward fill NaN values
indata.signal.fillna(method='ffill',inplace=True)
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created an array of zeros in numpy, then did my iterative logic in the array. Then I wrote the array to the column in my dataframe.
def signals3(indata):
numrows = len(indata)
data = pd.DataFrame(index= range(0,numrows))
data['A'] = indata['A']
data['B'] = indata['B']
out_signal = np.zeros(numrows)
for i in range(1,numrows):
if data['A'].iloc[i] == 1:
out_signal[i] = 1
elif data['B'].iloc[i] == 1:
out_signal[i] = 0
else:
out_signal[i] = out_signal[i-1]
data['signal'] = out_signal
return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
Both coldspeed and djk's answers were faster than my solution (about 4.5ms) but in practice I'll probably just iterate through series even though that is not optimal.

Categories