Eliminating Negative or Non-Negative values in pandas - python

- I'm working on an automation task in Python where, in each row, the first negative value should be added to the first non-negative value to its left. The result should replace the positive value, and 0 should replace the negative value.
- This process should continue until the entire row contains all negative or all non-negative values.
CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       -2       23          2          3           2            2            -1

Step 1: (>360Days) + (180-360Days) = -1 + 2

CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       -2       23          2          3           2            1            0

Step 2: (<30Days) + (180-360Days) = -2 + 1

CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       0        23          2          3           2            -1           0

Step 3: (180-360Days) + (120-180Days) = -1 + 2

CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC       0        23          2          3           1            0            0

Check this code:
import pandas as pd

# Enter the data (DataFrame.append was removed in pandas 2.0,
# so build the frame from the row directly)
new_row = {'CUSTOMER': 'ABC', '<30Days': -2, '31-60 Days': 23, '61-90Days': 2,
           '91-120Days': 3, '120-180Days': 2, '180-360Days': 2, '>360Days': -1}
df = pd.DataFrame([new_row])
# Keep the column order as per the requirement
df = df[['CUSTOMER', '<30Days', '31-60 Days', '61-90Days', '91-120Days',
         '120-180Days', '180-360Days', '>360Days']]
# Take the column names and reverse the order (work right to left)
ls = list(df.columns)
ls.reverse()
# Remove the non-integer column
ls.remove('CUSTOMER')

for j in range(len(df)):
    # Reset state for every row
    flag = 0
    new_ls = []
    new_ls_index = []
    while True:
        # Stop once the row is entirely negative or entirely non-negative
        # (checking up front also avoids an endless loop on an all-negative row)
        negatives = [c for c in ls if int(df.loc[j, c]) < 0]
        if len(negatives) == 0 or len(negatives) == len(ls):
            break
        # Pair the first negative value with the next non-negative value
        for i in ls:
            if int(df.loc[j, i]) < 0 and flag == 0:
                new_ls.append(int(df.loc[j, i]))
                new_ls_index.append(i)
                flag = 1
            elif flag == 1 and int(df.loc[j, i]) >= 0:
                new_ls.append(int(df.loc[j, i]))
                new_ls_index.append(i)
                flag = 2
            elif flag == 2:
                # Write back into the current row only, not the whole column
                df.loc[j, new_ls_index[1]] = new_ls[0] + new_ls[1]
                df.loc[j, new_ls_index[0]] = 0
                flag = 0
                new_ls = []
                new_ls_index = []
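For reference, the pairing rule can also be sketched as a plain-Python helper applied to one row at a time (right-to-left scan that wraps around, as the loop above effectively does; the `settle_row` name is my own):

```python
def settle_row(vals):
    """Net each negative bucket against the next non-negative one,
    scanning right to left and wrapping around, until the row is
    all-negative or all-non-negative (a sketch of the rule above)."""
    vals = list(vals)
    order = list(range(len(vals) - 1, -1, -1))  # right-to-left column order
    while any(v < 0 for v in vals) and any(v >= 0 for v in vals):
        # first negative value in scan order
        neg = next(i for i in order if vals[i] < 0)
        # continue past it (wrapping around) to the next non-negative value
        start = order.index(neg) + 1
        pos = next(i for i in order[start:] + order[:start] if vals[i] >= 0)
        vals[pos] += vals[neg]
        vals[neg] = 0
    return vals

print(settle_row([-2, 23, 2, 3, 2, 2, -1]))  # reproduces the walkthrough above
```

You could apply it per row with `df[ls] = df[ls].apply(settle_row, axis=1, result_type='expand')` once the `CUSTOMER` column is excluded.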

Related

Creating a trend streak in Pandas

I'm trying to create a trend streak column that displays 1, -1, 0 (win/loss/no movement) from a pandas DataFrame. The streak should increase while positive, reset on 0, and reset into a negative streak on -1. The desired result would be something like this:
win streak
0 0
1 1
1 2
1 3
1 4
0 0
0 0
-1 -1
-1 -2
1 1
Currently I have this that creates the win column.
dataframe.loc[dataframe['close'] > dataframe['close_1h'].shift(1), 'win'] = 1
dataframe.loc[dataframe['close'] < dataframe['close_1h'].shift(1), 'win'] = -1
dataframe.loc[dataframe['close'] == dataframe['close_1h'].shift(1), 'win'] = 0
dataframe['streak'] = numpy.nan_to_num(dataframe['win'].cumsum())
But that doesn't reset the streaks the way I would like. I've played around with groupby, doing dataframe['streak'] = dataframe.groupby([(dataframe['win'] != dataframe['win'].shift()).cumsum()]), but that gave me the error "ValueError: Length of values (927) does not match length of index (1631)".
try this:
df['streak'] = df.groupby(df['win'].diff().ne(0).cumsum())['win'].cumsum()
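A minimal, self-contained check of that one-liner on the sample from the question (only the `win` column is assumed):

```python
import pandas as pd

df = pd.DataFrame({'win': [0, 1, 1, 1, 1, 0, 0, -1, -1, 1]})
# a new group starts whenever 'win' changes; cumsum inside each run is the streak
df['streak'] = df.groupby(df['win'].diff().ne(0).cumsum())['win'].cumsum()
print(df['streak'].tolist())  # [0, 1, 2, 3, 4, 0, 0, -1, -2, 1]
```

`diff().ne(0)` marks the first row of every run (the first `diff` is NaN, which also compares unequal to 0), and `cumsum` over those marks gives each run its own group id.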

How do I count how often a column value changes in a pandas dataframe?

I have a pandas data frame that looks like:
Index Activity
0 0
1 0
2 1
3 1
4 1
5 0
...
1167 1
1168 0
1169 0
I want to count how many times it changes from 0 to 1 and when it changes from 1 to 0, but I do not want to count how many 1's or 0's there are.
For example, if I only wanted to count index 0 to 5, the count for 0 to 1 would be one.
How would I go about this? I have tried using some_value
This is a simple approach that can also tell you the index value when the change happens. Just add the index to a list.
c_1to0 = 0
c_0to1 = 0
for i in range(0, df.shape[0] - 1):
    if df.iloc[i]['Activity'] == 0 and df.iloc[i + 1]['Activity'] == 1:
        c_0to1 += 1
    elif df.iloc[i]['Activity'] == 1 and df.iloc[i + 1]['Activity'] == 0:
        c_1to0 += 1
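If the loop gets slow on larger frames, a vectorized variant with `diff` gives the same counts (a sketch on the question's sample rows, assuming the 0/1 `Activity` column):

```python
import pandas as pd

df = pd.DataFrame({'Activity': [0, 0, 1, 1, 1, 0]})
d = df['Activity'].diff()             # +1 on a 0->1 change, -1 on a 1->0 change
c_0to1 = int((d == 1).sum())
c_1to0 = int((d == -1).sum())
idx_0to1 = df.index[d == 1].tolist()  # index values where a 0->1 change happens
print(c_0to1, c_1to0, idx_0to1)       # 1 1 [2]
```

This matches the example in the question: counting index 0 to 5, the 0-to-1 count is one, at index 2.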

Compare current column value to different column value by row slices

Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a rolling window/slice of 2 on column 0 are greater or equal to the value in col 1 (4).
On the first number 4 at col 1, a slice of 2 on column 0 yields 5 and 1, so the output would be 2 since both numbers are greater than 4. Then on the second 4, the next slice values on col 0 would be 1 and 26, so the output would be 1, because only 26 is greater than 4 but not 1. I can't use a rolling window, since iterating through rolling window values is not implemented.
I need something like a slice of the previous n rows and then I can iterate, compare and count how many times any of the values in that slice are above the current row.
I have done this using lists instead of doing it in the dataframe. Check the code below:
# the example frame has integer column labels, so index with 0 and 1, not '0' and '1'
list1, list2 = df[0].values.tolist(), df[1].values.tolist()
outList = []
for ix in range(len(list1)):
    if ix < len(list1) - 2:
        if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
            outList.append(2)
        elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
            outList.append(1)
        else:
            outList.append(0)
    else:
        outList.append(0)
df['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0
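The same result can also be had without the explicit loop, using `shift` to look two rows ahead (a sketch; zeroing out the last two rows mirrors the `else` branch of the loop, which gives 0 when no full forward window exists):

```python
import pandas as pd

data = pd.DataFrame([[9, 4], [5, 4], [1, 3], [26, 7]])
nxt1 = data[0].shift(-1)  # value of column 0 one row ahead
nxt2 = data[0].shift(-2)  # value of column 0 two rows ahead
count = (nxt1 > data[1]).astype(int) + (nxt2 > data[1]).astype(int)
# the last two rows have no full forward window, so force them to 0
data['2_rows_forward_moving_tag'] = count.where(nxt2.notna(), 0).astype(int)
print(data['2_rows_forward_moving_tag'].tolist())  # [1, 1, 0, 0]
```

Comparisons against the NaN values produced by `shift` evaluate to False, so they never inflate the count.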

How to compare values in a column and create a new column using pandas?

I have a df named value with 567 rows, and it has a column named index as follows:
index
96.875
96.6796875
96.58203125
96.38671875
95.80078125
94.7265625
94.62890625
94.3359375
58.88671875
58.7890625
58.69140625
58.59375
58.49609375
58.3984375
58.30078125
58.203125
I also have 2 additional variables:
mu = 56.80877955613938
sigma= 17.78935620293665
What I want is to check the values in the index column. If the value is greater than, say, mu+3*sigma, a new column named alarm must be added to the value df and a value of 4 must be added.
I tried:
for i in value['index']:
    if (i >= mu + 3*sigma):
        value['alarm'] = 4
    elif ((i < mu + 3*sigma) and (i >= mu + 2*sigma)):
        value['alarm'] = 3
    elif ((i < mu + 2*sigma) and (i >= mu + sigma)):
        value['alarm'] = 2
    elif ((i < mu + sigma) and (i >= mu)):
        value['alarm'] = 1
But it creates an alarm column and fills it completely with 1.
What is the mistake I am doing here?
Expected output:
index alarm
96.875 3
96.6796875 3
96.58203125 3
96.38671875 3
95.80078125 3
94.7265625 3
94.62890625 3
94.3359375 3
58.88671875 1
58.7890625 1
58.69140625 1
58.59375 1
58.49609375 1
58.3984375 1
58.30078125 1
58.203125 1
If you have multiple conditions, you don't want to loop through your dataframe with if/elif/else. Note that each assignment like value['alarm'] = 1 sets the entire column, so after the loop the column holds whatever the last row's branch assigned, which is why it is filled with 1. A better solution is np.select, where we define conditions and, based on those conditions, choices:
import numpy as np

conditions = [
    value['index'] >= mu + 3*sigma,
    (value['index'] < mu + 3*sigma) & (value['index'] >= mu + 2*sigma),
    (value['index'] < mu + 2*sigma) & (value['index'] >= mu + sigma),
]
choices = [4, 3, 2]
value['alarm'] = np.select(conditions, choices, default=1)
value
alarm
index
96.875000 3
96.679688 3
96.582031 3
96.386719 3
95.800781 3
94.726562 3
94.628906 3
94.335938 3
58.886719 1
58.789062 1
58.691406 1
58.593750 1
58.496094 1
58.398438 1
58.300781 1
58.203125 1
If you have 10 min time, here's a good post by CS95 explaining why looping over a dataframe is bad practice.
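Another option worth knowing for this kind of banding is pd.cut (a sketch under the same mu/sigma; `right=False` makes each bin closed on the left, matching the `>=` comparisons above):

```python
import numpy as np
import pandas as pd

mu = 56.80877955613938
sigma = 17.78935620293665
value = pd.DataFrame({'index': [96.875, 94.3359375, 58.88671875, 58.203125]})
# bin edges at mu+sigma, mu+2*sigma, mu+3*sigma; labels are the alarm levels
bins = [-np.inf, mu + sigma, mu + 2*sigma, mu + 3*sigma, np.inf]
value['alarm'] = pd.cut(value['index'], bins=bins,
                        labels=[1, 2, 3, 4], right=False).astype(int)
print(value['alarm'].tolist())  # [3, 3, 1, 1]
```

Like the np.select version with default=1, anything below mu + sigma lands in the lowest band.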

pandas index of data chunk

I want to find the starting index and ending index of every piece of data chunk in the dataset.
The data is like:
index     A  wanted_column1  wanted_column2
2000/1/1  0                  0
2000/1/2  1  2000/1/2        1
2000/1/3  1                  1
2000/1/4  1                  1
2000/1/5  0                  0
2000/1/6  1  2000/1/6        2
2000/1/7  1                  2
2000/1/8  1                  2
2000/1/9  0                  0
As shown in the data, index and A are the given columns and wanted_column1 and wanted_column2 are what I want to get.
The idea is that the data falls into separate continuous chunks. I want to retrieve the starting index of every chunk and keep a running count of how many chunks have appeared so far.
I tried to use shift(-1), but with that it is not possible to differentiate between a starting index and an ending index.
Is this what you need?
df['change'] = df['A'].diff().eq(1)
df['wanted_column1'] = df[['index', 'change']].apply(lambda x: x['index'] if x['change'] else None, axis=1)
df['wanted_column2'] = df['change'].cumsum()
df['wanted_column2'] = df[['wanted_column2', 'A']].apply(lambda x: 0 if x['A'] == 0 else x['wanted_column2'], axis=1)
df.drop('change', axis=1, inplace=True)
That yields :
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8 2000/1/9 0 None 0
Edit : performance comparison
gehbiszumeis's solution : 19.9 ms
my solution : 4.07 ms
Assuming your dataframe to be df, you can find the indices where df['A'] != 1. The indices just before those are the last indices of a chunk, the ones just after are the first indices of a chunk. Afterwards you count the number of found indices to calculate the number of data chunks.
import pandas as pd

# Read your data
df = pd.read_csv('my_txt.txt', sep=',')
df['wanted_column1'] = None  # create the dummy columns up front
df['wanted_column2'] = None
# Find the index just after each index where 'A' is not 1,
# unless it is the last value of the dataframe
first = [x + 1 for x in df[df['A'] != 1].index.values if x != len(df) - 1]
# Find the index just before each index where 'A' is not 1,
# unless it is the first value of the dataframe
last = [x - 1 for x in df[df['A'] != 1].index.values if x != 0]
# Set the first index of each chunk at its corresponding position in your dataframe
df.loc[first, 'wanted_column1'] = df.loc[first, 'index']
# You can also set the last index of each chunk (you only mentioned this in the
# text, not in your expected result). Uncomment for last indices.
# df.loc[last, 'wanted_column1'] = df.loc[last, 'index']
# Count the number of chunks seen so far and fill wanted_column2
for i in df.index:
    df.loc[i, 'wanted_column2'] = sum(df.loc[:i, 'wanted_column1'].notna())
# Some polishing of the df afterwards to match your expected result
df.loc[df['A'] != 1, 'wanted_column2'] = 0
This gives
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8 2000/1/9 0 None 0
and works for any length of df and any number of chunks in your data.
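For completeness, the diff/cumsum idea from the first answer condenses to a few lines; this sketch also handles a frame whose very first row already sits inside a chunk, by filling the first diff (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'index': ['2000/1/%d' % d for d in range(1, 10)],
                   'A': [0, 1, 1, 1, 0, 1, 1, 1, 0]})
# True exactly at the first row of each chunk of 1s; fillna covers the
# case where row 0 itself starts a chunk (diff of the first row is NaN)
start = df['A'].diff().fillna(df['A']).eq(1)
df['wanted_column1'] = df['index'].where(start)            # NaN elsewhere
df['wanted_column2'] = start.cumsum().where(df['A'].ne(0), 0)
print(df['wanted_column2'].tolist())  # [0, 1, 1, 1, 0, 2, 2, 2, 0]
```

`where(start)` leaves NaN rather than None outside chunk starts, which is usually what downstream pandas code expects.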
