find index where element changes sign from 0 to 1 - python

I have a DataFrame like below, where I want to find the index where an element goes from 0 to 1.
The code below gets all the instances I want, however it also includes cases where the sign changes from -1 to 0, which I don't want.
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, 1, 1, 0, 1, 0, -1, 0])
df[np.sign(df[0]).diff(periods=1).eq(1)]
Output:
   0
1  1
5  1
8  0

Just add another condition:
filtered = df[np.sign(df[0]).diff(1).eq(1) & np.sign(df[0]).eq(1)]
Output:
>>> filtered
   0
1  1
5  1
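An equivalent way to express the same filter is to compare each sign with its predecessor directly, which avoids computing np.sign's diff (a sketch on the example frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([0, 1, 1, 1, 0, 1, 0, -1, 0])

# a row qualifies when its sign is 1 and the previous row's sign was exactly 0
s = np.sign(df[0])
mask = s.eq(1) & s.shift().eq(0)
print(df[mask])  # rows 1 and 5
```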

Related

Removing rows and columns if all zeros in non-diagonal entries

I am generating a confusion matrix to get an idea of my text-classifier's predictions vs the ground truth. The purpose is to understand which intents are being predicted as some other intent. But the problem is I have too many classes (more than 160), so the matrix is sparse, where most of the fields are zeros. Obviously, the diagonal elements are likely to be non-zero, as they are basically the indication of correct predictions.
That being the case, I want to generate a simpler version of it, as we only care about non-zero elements if they are non-diagonal. Hence, I want to remove the rows and columns where all the elements are zeros (ignoring the diagonal entries), so that the graph becomes much smaller and manageable to view. How do I do that?
The following is the code snippet I have so far; it produces a mapping for all the intents, i.e. a (#intent, #intent) dimensional plot.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame
import seaborn as sns
%matplotlib inline

sns.set(rc={'figure.figsize': (64, 64)})
confusion_matrix = pd.crosstab(df['ground_truth_intent_name'], df['predicted_intent_name'])
variables = sorted(list(set(df['ground_truth_intent_name'])))
temp = DataFrame(confusion_matrix, index=variables, columns=variables)
sns.heatmap(temp, annot=True)
TL;DR
Here temp is a pandas dataframe. I need to remove all rows and columns where all elements are zeros (ignoring the diagonal elements, even if they are not zero).
You can use any on the comparison result, but first you need to fill its diagonal with False:
# also consider using
# a = np.isclose(confusion_matrix.to_numpy(), 0)
a = confusion_matrix.to_numpy() != 0
# fill diagonal
np.fill_diagonal(a, False)
# columns with at least one non-zero
cols = a.any(axis=0)
# rows with at least one non-zero
rows = a.any(axis=1)
# boolean indexing
confusion_matrix.loc[rows, cols]
Let's take an example:
# random data
np.random.seed(1)
# this would agree with the above
a = np.random.randint(0,2, (5,5))
a[2] = 0
a[:-1,-1] = 0
confusion_matrix = pd.DataFrame(a)
So the data would be:
   0  1  2  3  4
0  1  1  0  0  0
1  1  1  1  1  0
2  0  0  0  0  0
3  0  0  1  0  0
4  0  1  0  0  1
and the code outputs (notice that row 2 and column 4 are gone):
   0  1  2  3
0  1  1  0  0
1  1  1  1  1
3  0  0  1  0
4  0  1  0  0
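The steps above can be packaged into a small helper (a sketch; drop_all_zero is a name I made up, not from the original answer):

```python
import numpy as np
import pandas as pd

def drop_all_zero(cm: pd.DataFrame) -> pd.DataFrame:
    """Drop rows/columns whose off-diagonal entries are all zero."""
    a = cm.to_numpy() != 0
    np.fill_diagonal(a, False)          # ignore the diagonal entries
    return cm.loc[a.any(axis=1), a.any(axis=0)]

# same example data as above
np.random.seed(1)
a = np.random.randint(0, 2, (5, 5))
a[2] = 0
a[:-1, -1] = 0
cm = pd.DataFrame(a)
print(drop_all_zero(cm))  # row 2 and column 4 are dropped
```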

Pandas create a conditional column based on cumulative logic operations of the other columns

I have 2 columns that represent the on-switch and off-switch indicators. I want to create a column called 'last switch' that keeps a record of the last direction of the switch (whether it is on or off). Another condition is that if both the on and off switch values are 1 for a particular row, then the 'last switch' output should return the opposite sign of the previous last switch. I have managed to find a solution that is almost correct, except for the situation where both on and off are 1, which makes my code wrong.
I also attached a screenshot with the desired output. Any help is appreciated.
df=pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],[1,1],[0,1],[1,0],[1,1],[1,1],[0,1]], columns=['on','off'])
df['last_switch']=(df['on']-df['off']).replace(0,method='ffill')
Add the following lines to your existing code:
for i in range(df.shape[0]):
    df['prev'] = df['last_switch'].shift()
    df.loc[(df['on']==1) & (df['off']==1), 'last_switch'] = df['prev'] * (-1)
df.drop('prev', axis=1, inplace=True)
df['last_switch'] = df['last_switch'].astype(int)
Output:
    on  off  last_switch
0    1    0            1
1    1    0            1
2    0    1           -1
3    0    1           -1
4    0    0           -1
5    0    0           -1
6    1    0            1
7    1    1           -1
8    0    1           -1
9    1    0            1
10   1    1           -1
11   1    1            1
12   0    1           -1
Let me know if you need an explanation of the code.
df = pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],[1,1],[0,1],[1,0],[1,1],[1,1],[0,1]], columns=['on','off'])
df['last_switch'] = (df['on'] - df['off']).replace(0, method='ffill')

prev_row = None

def apply_logic(row):
    global prev_row
    if prev_row is not None:
        if (row["on"] == 1) and (row["off"] == 1):
            row["last_switch"] = -prev_row["last_switch"]
    prev_row = row.copy()
    return row

df = df.apply(apply_logic, axis=1)
Personally, I am not a big fan of looping over a DataFrame. shift won't work in this case, as the "last_switch" column is dynamic and subject to change based on the on & off status.
Using your intermediate result with apply, while carrying over the value from the previous row, should do the trick. Hope it makes sense.
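For completeness, the same carry/flip rule can also be written as a single pass with itertools.accumulate (a sketch, not from the answers above; the rule is: take on - off when exactly one switch fires, flip the previous value when both fire, otherwise carry the previous value forward):

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame([[1,0],[1,0],[0,1],[0,1],[0,0],[0,0],[1,0],
                   [1,1],[0,1],[1,0],[1,1],[1,1],[0,1]],
                  columns=['on', 'off'])

def step(prev, pair):
    on, off = pair
    if on == 1 and off == 1:   # both fired: flip the last direction
        return -prev
    if on == off:              # neither fired: carry the last direction
        return prev
    return on - off            # exactly one fired: 1 for on, -1 for off

# accumulate threads the previous result through the rows (Python 3.8+ for initial=)
df['last_switch'] = list(accumulate(zip(df['on'], df['off']), step, initial=0))[1:]
```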

Split an array into several arrays by defined boundaries, python

I have a numpy array which consists of 64 columns and 49 rows. Each row stands for a separate message and contains several pieces of information. Where a piece of information starts or ends can be recognized by a change of value. The following is an excerpt of the numpy array:
[[1 1 0 0 2 2 2 2 1 0 0 0 0 2 ... 2 2 2]
[0 0 0 2 2 2 2 2 2 2 2 2 2 2 ... 2 2 2]
[2 0 0 1 2 0 0 0 0 0 0 0 0 0 ... 1 1 0]
.
.
.
[0 1 0 1 0 1 0 1 0 0 0 0 0 0 ... 2 2 2]]
The first information of the first signal therefore takes the first two positions [11]. The change of value from 1 to 0 tells me that the second information is in the third and fourth positions [00]. The third information occupies the following four positions [2222]. The next information consists only of [1]. And so on...
Once I have identified the positions of each information of a signal I have to apply these boundaries to my signal numpy arrays. My first binary signal numpy array consists of 64 columns and 3031 rows:
[[1 1 0 0 0 0 0 1 0 1 0 1 0 0 ... 1 0 0 1]
[1 0 1 0 1 1 1 1 1 0 0 1 1 0 ... 1 1 1 0]
[0 1 0 1 1 1 0 0 1 0 0 1 1 1 ... 1 1 1 0]
.
.
.
[1 0 1 0 0 1 0 0 0 0 1 1 0 1 ... 1 1 1 0]]
My first array (first information from the first signal) consists of the first two positions as determined by the previous array. The output should look like this:
[[11]
[10]
[01]
.
.
.
[10]]
The output of the second array (third and fourth position) should be the following:
[[00]
[10]
[01]
.
.
.
[10]]
The output of the third array:
[[0001]
[1111]
[1100]
.
.
.
[0100]]
Unfortunately I do not know how to create and apply the initial boundaries of the first array to the binary arrays. Does anyone have a solution for this?
Thanks for the help!
Sorry, I placed the hint of where you should create a loop in the wrong place. See if this code works (I tried to explain numpy slicing a little in the comments, but you can learn more here: Numpy indexing and slicing):
import itertools
import numpy as np

# Def to reshape signals according to message
def reshape(lst1, lst2):
    iterator = iter(lst2)
    return [[next(iterator) for _ in sublist]
            for sublist in lst1]

# Arrays
array_1 = np.array([[1,1,0,0,2,2,2,2,1,0,0,0,0,2],
                    [0,0,0,2,2,2,2,2,2,2,2,2,2,2],
                    [2,0,0,1,2,0,0,0,0,0,0,0,0,0],
                    [0,1,0,1,0,1,0,1,0,0,0,0,0,0]])
array_2 = np.array([[1,1,0,0,0,0,0,1,0,1,0,1,0,0],
                    [1,0,1,0,1,1,1,1,1,0,0,1,1,0],
                    [0,1,0,1,1,1,0,0,1,0,0,1,1,1],
                    [1,0,1,0,0,1,0,0,0,0,1,1,0,1]])

# Group messages into pieces of information
signal_list = []
for lists in array_1:
    signal_list.append([list(group) for key, group in itertools.groupby(lists)])

# Index for all messages
All_messages = {}

# Do this for each message:
for rows in range(len(array_1)):
    # Reshape each signal according to the current message
    signals_reshape = np.array([reshape(signal_list[rows], array_2[i]) for i in range(len(array_2))])
    # Create a list to append all signals in the current message
    all_signal = []
    # Do this for each information block
    for i in range(len(signals_reshape[rows])):
        '''
        Append information blocks
        1st [:]   = retrieve in all signals
        2nd [:]   = retrieve the whole signal
        3rd [:,i] = retrieve the information block from a specific column
        Example: signals_reshape[0][0][0]  retrieves the first element of the first information block of the first signal
                 signals_reshape[0][0][:]  retrieves all the elements of the first information block of the first signal
                 signals_reshape[:][:][:,0] retrieves the first information block from all the signals
        '''
        all_signal.append(signals_reshape[:][:][:, i].flatten())
    # Add message information to the dictionary (+1 so that the names start at Message1 and not Message0)
    All_messages["Message{0}".format(rows + 1)] = all_signal

print(All_messages['Message1'])
print(All_messages['Message2'])
print(All_messages['Message3'])
print(All_messages['Message4'])
See if this can help you. This example returns the information for the 1st message, but you should be able to create a loop for all 49 messages and assign it to a new list.
import itertools
import numpy as np

# Def to reshape signals according to message
def reshape(lst1, lst2):
    iterator = iter(lst2)
    return [[next(iterator) for _ in sublist]
            for sublist in lst1]

# Arrays
array_1 = np.array([[1,1,0,0,2,2,2,2,1,0,0,0,0,2],
                    [0,0,0,2,2,2,2,2,2,2,2,2,2,2],
                    [2,0,0,1,2,0,0,0,0,0,0,0,0,0]])
array_2 = np.array([[1,1,0,0,0,0,0,1,0,1,0,1,0,0],
                    [1,0,1,0,1,1,1,1,1,0,0,1,1,0],
                    [0,1,0,1,1,1,0,0,1,0,0,1,1,1]])

# Group messages into pieces of information
signal_list = []
for lists in array_1:
    signal_list.append([list(group) for key, group in itertools.groupby(lists)])

# Reshape signals for each message to be used
signals_reshape = np.array([reshape(signal_list[0], array_2[i]) for i in range(len(array_2))])
print(signals_reshape[0])

# Get information from the first message (can create a loop to do the same for all 49 messages)
final_list_1 = []
for i in range(len(signals_reshape[0])):
    final_list_1.append(signals_reshape[:][:][:, i].flatten())

print(final_list_1[0])
print(final_list_1[1])
print(final_list_1[2])
Output:
final_list_1[0]
[list([1, 1]) list([1, 0]) list([0, 1])]
final_list_1[1]
[list([0, 0]) list([1, 0]) list([0, 1])]
final_list_1[2]
[list([0, 0, 0, 1]) list([1, 1, 1, 1]) list([1, 1, 0, 0])]
Credits to @Kempie, who solved the problem. I just adapted his code to my needs, shortened it a bit and fixed some small bugs.
import itertools
import numpy as np

# Def to reshape signals according to message
def reshape(lst1, lst2):
    iterator = iter(lst2)
    return [[next(iterator) for _ in sublist]
            for sublist in lst1]

# Arrays
array_1 = np.array([[1,1,0,0,2,2,2,2,1,0,0,0,0,2],
                    [0,0,0,2,2,2,2,2,2,2,2,2,2,2],
                    [2,0,0,1,2,0,0,0,0,0,0,0,0,0],
                    [0,1,0,1,0,1,0,1,0,0,0,0,0,0]])

# In my case, array_2 was a list (see the difference to Kempie's solution)
array_2 = np.array([[1,1,0,0,0,0,0,1,0,1,0,1,0,0],
                    [1,0,1,0,1,1,1,1,1,0,0,1,1,0],
                    [0,1,0,1,1,1,0,0,1,0,0,1,1,1],
                    [1,0,1,0,0,1,0,0,0,0,1,1,0,1]])

# Group messages into pieces of information
signal_list = []
for lists in array_1:
    signal_list.append([list(group) for key, group in itertools.groupby(lists)])

signals_reshape_list = []
# Do this for each message (as array_2 is a list, we must work with indices):
for rows in range(len(array_1)):
    # Reshape each signal according to the current message
    signals_reshape = np.array([reshape(signal_list[rows], array_2[rows][i]) for i in range(len(array_2[rows]))])
    signals_reshape_list.append(signals_reshape)

# Print e.g. the first signal of the third message
print(signals_reshape_list[2][:, 0])
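If the boundaries are all you need, an alternative sketch (not from the answers above) avoids groupby entirely: take the positions where a message row changes value and split the signal array column-wise at those positions, assuming every row of the signal array follows the boundaries of that message row:

```python
import numpy as np

message = np.array([1, 1, 0, 0, 2, 2, 2, 2, 1, 0, 0, 0, 0, 2])
signals = np.array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
                    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0],
                    [0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1]])

# positions where the message value changes mark information boundaries
bounds = np.flatnonzero(np.diff(message)) + 1
# split every signal row at the same column boundaries
blocks = np.split(signals, bounds, axis=1)
print([b.shape[1] for b in blocks])  # → [2, 2, 4, 1, 4, 1]
```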

Python Pandas: Create continuous id values using "flag values" in a df column and displaying it in another column

I have a very big dataframe (20,000,000+ rows) that contains a column called 'sequence' amongst others.
The 'sequence' column is calculated from a time series by applying a few conditional statements. The value "2" flags the start of a sequence, the value "3" flags the end of a sequence, the value "1" flags a datapoint within the sequence, and the value "4" flags datapoints that need to be ignored. (Note: the flag values don't necessarily have to be 1, 2, 3, 4.)
What I want to achieve is a continuous ID value (written in a separate column - see 'desired_Id_Output' in the example below) that labels the slices of sequences from 2 to 3 in a unique fashion (the length of a sequence is variable, ranging from 2 [start + end only] to 5000+ datapoints), to be able to do further groupby calculations on the individual sequences.
index  sequence  desired_Id_Output
0      2         1
1      1         1
2      1         1
3      1         1
4      1         1
5      3         1
6      2         2
7      1         2
8      1         2
9      3         2
10     4         NaN
11     4         NaN
12     2         3
13     3         3
Thanks in advance and BR!
I can't think of anything better than the "dumb" solution of looping through the entire thing, something like this:
import numpy as np

counter = 0
tmp = np.empty_like(df['sequence'].values, dtype=float)
for i in range(len(tmp)):
    if df['sequence'][i] == 4:
        tmp[i] = np.nan
    else:
        if df['sequence'][i] == 2:
            counter += 1
        tmp[i] = counter
df['desired_Id_output'] = tmp
Of course this is going to be pretty slow with a 20M-sized DataFrame. One way to improve this is via just-in-time compilation with numba:
import numba

@numba.njit
def foo(sequence):
    # put in appropriate modification of the above code block
    return tmp
and call this with argument df['sequence'].values.
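A filled-in version of that numba pattern might look like this (a sketch; label_sequences is my name for it, and the try/except fallback just lets the code run without numba installed):

```python
import numpy as np

try:
    from numba import njit          # JIT-compile the loop if numba is available
except ImportError:                 # fall back to plain Python otherwise
    def njit(f):
        return f

@njit
def label_sequences(sequence):
    counter = 0
    out = np.empty(sequence.shape[0], dtype=np.float64)
    for i in range(sequence.shape[0]):
        if sequence[i] == 4:        # flag 4: ignored datapoint
            out[i] = np.nan
        else:
            if sequence[i] == 2:    # flag 2: a new sequence starts
                counter += 1
            out[i] = counter
    return out

labels = label_sequences(np.array([2, 1, 1, 1, 1, 3, 2, 1, 1, 3, 4, 4, 2, 3]))
```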
Does it work to count the sequence starts, and then just set the ignore values (flag 4) afterwards? Like this:
sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
sequence_id = sequence_starts.cumsum().astype(float)
sequence_id[sequence_ignore] = np.nan
(The astype(float) is needed because cumsum on booleans yields integers, which cannot hold NaN.)
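Sanity check of the cumsum idea against the example from the question (a sketch; where is used here to avoid assigning NaN into the integer cumsum result):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sequence': [2, 1, 1, 1, 1, 3, 2, 1, 1, 3, 4, 4, 2, 3]})

starts = df['sequence'].eq(2)    # each 2 opens a new sequence
ignore = df['sequence'].eq(4)    # 4-flagged rows get no ID
df['desired_Id_Output'] = starts.cumsum().where(~ignore)
print(df)
```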

indexing numpy multidimensional arrays

I need to access this numpy array, sometimes with only the rows where the last column is 0, and sometimes the rows where the value of the last column is 1.
y = [0  0  0  0
     1  2  1  1
     2 -6  0  1
     3  4  1  0]
I have to do this over and over, but would prefer to shy away from creating duplicate arrays or having to recalculate each time. Is there some way that I can identify the indices concerned and just call them? So that I can do this:
>>> print y[LAST_COLUMN_IS_0]
[0  0  0  0
 3  4  1  0]
>>> print y[LAST_COLUMN_IS_1]
[1  2  1  1
 2 -6  0  1]
P.S. The number of columns in the array never changes, it's always going to have 4 columns.
You can use numpy's boolean indexing to identify which rows you want to select, and numpy's fancy indexing/slicing to select the whole row.
print y[y[:,-1] == 0, :]
print y[y[:,-1] == 1, :]
You can save y[:,-1] == 0 and ... == 1 as usual, since they are just numpy arrays.
(The y[:,-1] selects the whole of the last column, and the == equality check happens element-wise, resulting in an array of booleans.)
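A short sketch of that saved-mask idea with the example array from the question:

```python
import numpy as np

y = np.array([[0, 0, 0, 0],
              [1, 2, 1, 1],
              [2, -6, 0, 1],
              [3, 4, 1, 0]])

# compute the boolean masks once and reuse them as often as needed
last_is_0 = y[:, -1] == 0
last_is_1 = y[:, -1] == 1

print(y[last_is_0])  # rows 0 and 3
print(y[last_is_1])  # rows 1 and 2
```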
