function on previous N rows in data frame?

I need to go through each row in a dataframe, take the value in a column called 'source_IP_address', and look at the previous 100 rows to find out whether any of them have the same 'source_IP_address' and, in another column, the value 'authentication failure'.
I have written some code that does this, since I couldn't get Pandas rolling to work over two columns. The problem is that it is not very fast. Is there a better way to do it?
# function to count, for each row, the rows in the previous window of n rows
# (current row included) that share its axis value and whose attribute column equals attr_var
def check_window_for_match(df_w, window_size, axis_col, attr_col, attr_var):
    l = []
    n_rows = df_w.shape[0]
    for i in range(n_rows):
        # create a temp dataframe with the previous n rows including current row;
        # clamp the start at 0, because a negative start wraps around and yields an empty slice
        start = max(0, i - (window_size - 1))
        temp_df = df_w.iloc[start:i+1]
        # assign the current axis value as the axis variable
        current_value = df_w[axis_col].iloc[i]
        # given the temp dataframe of previous window of n_rows, check axis matches against fails
        l_temp = temp_df.loc[(temp_df[axis_col] == current_value) & (temp_df[attr_col] == attr_var)].shape[0]
        l.append(l_temp)
    return l
e.g.
df_test = pd.DataFrame({'B': [0, 1, 2, np.nan, 4, 6, 7, 8, 10, 8, 7], 'C': [2, 10, 'fail', np.nan, 6, 7, 8, 'fail', 8, 'fail', 9]})
df_test
matches_list = check_window_for_match(df_test, window_size=3, axis_col='B', attr_col='C', attr_var='fail')
output: [0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0]
I want to know if my code is correct and if it is the best way to do it, or there is a better alternative.
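One possible faster alternative, offered as a sketch rather than a definitive answer: instead of slicing a temporary frame for every row, compare each row with the row k positions earlier for k = 0 .. window_size - 1. That costs window_size vectorized passes instead of one Python iteration per row. The name count_window_matches is made up for this illustration, and the example data is the df_test frame from the question.

```python
import numpy as np
import pandas as pd

def count_window_matches(df, window_size, axis_col, attr_col, attr_var):
    """Hypothetical helper: same counts as the row-by-row loop, using shifts."""
    axis = df[axis_col]
    is_match = df[attr_col] == attr_var          # True where the attribute matches
    counts = np.zeros(len(df), dtype=int)
    for k in range(window_size):
        # row i is compared with row i - k; NaN never equals NaN, as in the loop version
        same_axis = (axis == axis.shift(k)).to_numpy()
        matched_k_back = is_match.shift(k, fill_value=False).to_numpy()
        counts += same_axis & matched_k_back
    return counts.tolist()

df_test = pd.DataFrame({'B': [0, 1, 2, np.nan, 4, 6, 7, 8, 10, 8, 7],
                        'C': [2, 10, 'fail', np.nan, 6, 7, 8, 'fail', 8, 'fail', 9]})
print(count_window_matches(df_test, window_size=3, axis_col='B', attr_col='C', attr_var='fail'))
# [0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0]
```

For a window of 100 rows this runs 100 column-wide comparisons regardless of how many rows the frame has, which is usually cheaper than building a temporary frame per row.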

Related

How to do point biserial correlation for multiple columns in one iteration

I am trying to calculate a point biserial correlation for a set of columns in my dataset. I can do it for an individual variable, but when I try to calculate it for all the columns in one iteration I get an error.
Below is the code:
df = pd.DataFrame({'A':[1, 0, 1, 0, 1], 'B':[6, 7, 8, 9, 10],'C':[9, 4, 6,9,10],'D':[8,9,5,7,10]})
from scipy import stats
corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B','C','D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B','C','D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add

x must be a single column, not a DataFrame. If you pass one column at a time instead of the whole DataFrame, it will work. You can try this:
df = pd.DataFrame({'A':[1, 0, 1, 0, 1], 'B':[6, 7, 8, 9, 10],'C':[9, 4, 6,9,10],'D':[8,9,5,7,10]})
print(df)
from scipy import stats
corr_list = []
y = df['A'].astype(float)
for column in df:
    x=df[column]
    corr = stats.pointbiserialr(list(x), list(y))
    corr_list.append(corr[0])
print(corr_list)
By the way, you can use print(df.corr()), which gives you the correlation matrix of the DataFrame.
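To illustrate the df.corr() remark: the point-biserial coefficient is the Pearson coefficient computed with one binary variable, so the 'A' column of the correlation matrix should carry the same r values, though without p-values. A small sketch on the question's data (the variable name r_with_a is made up here):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10],
                   'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})

# Pearson correlation of every other column with the binary column 'A'
r_with_a = df.corr()['A'].drop('A')
print(r_with_a)
```

If the p-values are needed as well, calling stats.pointbiserialr once per column remains the way to get them.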

You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output has one column per input column; row 0 holds the correlation with the target Series and row 1 the corresponding p-value:
              B         C         D
0  4.547937e-18  0.400066 -0.094916
1  1.000000e+00  0.504554  0.879331

How to iterate through 5 rows, get the max per row of 4 columns, and set the value of a different column accordingly

I have a DataFrame df1. I've made a subset df2 of 4 of its columns and computed the max value of each of the 5 rows. Depending on which column holds the max for a row (column 1, 2, 3, or 4), the label column in df1 should get the int label 1, 2, 3, or 4.
I use df2 because some of the other columns in df1 hold higher values than the 4 I want to compare, which would throw off the max. I'm starting to think it should be a list or a Series?
Code:
df1= pd.DataFrame({'x_1': [xvalues[0][0], xvalues[0][1], xvalues[0][2],
xvalues[0][3], xvalues[0][4]],
'x_2': [yvalues[0][0], yvalues[0][1], yvalues[0][2],
yvalues[0][3], yvalues[0][4]],
'True labels': [truelabels[0], truelabels[1],
truelabels[2],truelabels[3], truelabels[4]],
'g11': [classifier1[0][0],classifier1[0][1],
classifier1[0][2],classifier1[0][3],
classifier1[0][4],],
'g12': [classifier1[1][0],classifier1[1][1],
classifier1[1][2],classifier1[1][3],
classifier1[1][4],],
'g13': [classifier1[2][0],classifier1[2][1],
classifier1[2][2],classifier1[2][3],
classifier1[2][4],],
'g14': [classifier1[3][0],classifier1[3][1],
classifier1[3][2],classifier1[3][3],
classifier1[3][4],],
'L1': [2, 5, 6, 7, 8],
'g21': [classifier2[0][0],classifier2[0][1],
classifier2[0][2],classifier2[0][3],
classifier2[0][4],],
'g22': [classifier2[1][0],classifier2[1][1],
classifier2[1][2],classifier2[1][3],
classifier2[1][4],],
'g23': [classifier2[2][0],classifier2[2][1],
classifier2[2][2],classifier2[2][3],
classifier2[2][4],],
'g24': [classifier2[3][0],classifier2[3][1],
classifier2[3][2],classifier2[3][3],
classifier2[3][4],],
'L2': [0, 0, 0, 0, 0],
'g31': [classifier3[0],classifier3[0],
classifier3[0],classifier3[0],
classifier3[0],],
'g32': [classifier3[1][0],classifier3[1][1],
classifier3[1][2],classifier3[1][3],
classifier3[1][4],],
'g33': [classifier3[2][0],classifier3[2][1],
classifier3[2][2],classifier3[2][3],
classifier3[2][4],],
'g34': [classifier3[3][0],classifier3[3][1],
classifier3[3][2],classifier3[3][3],
classifier3[3][4],],
'L3': [0, 0, 0, 0, 0],
'Assigned L':[1, 1, 1, 1,1]}, index =['Datapoint1', 'D2', 'D3',
'D4', 'D5'])
df2= df1[['g11','g12','g13','g14']]
hdf = df2.max(axis = 1)
g11 = df1['g11'].to_list()
g12 = df1['g12'].to_list()
g13 = df1['g13'].to_list()
g14 = df1['g14'].to_list()
for item, label in zip(hdf, table['L1']):
    if hdf[item] in g11:
        df1['L1'][label] = labels[0]
        print(item, label)
    elif hdf[item] in g12:
        df1['L1'][label] = labels[1]
        print(item, label)
    elif hdf[item] in g13:
        df1['L1'][label] = labels[2]
        print(item, label)
    elif hdf[item] in g14:
        df1['L1'][label] = labels[3]
        print(item, label)
I have tried using .loc and .at, but when they didn't work I scrapped them and tried something else. Maybe those approaches would be better? This is where I'm at so far.
The error comes from the for loop over hdf.
The issue I'm having is "cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.0311272081] of <class 'float'>"
I don't think the other values in the data frame are relevant; they are just there so people know I have made one. The 5 relevant columns in the dataframe are g11, g12, g13, g14 and L1.
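A minimal sketch of the labelling described above, assuming the goal is "label = position of the column holding the row maximum". The score values below are invented stand-ins for the classifier outputs, and the labels list [1, 2, 3, 4] is assumed from the description; idxmax(axis=1) returns the name of the max column per row, which is then mapped to a label without any loop.

```python
import pandas as pd

# Hypothetical scores standing in for the classifier outputs in the question
df1 = pd.DataFrame(
    {
        'g11': [0.10, 0.70, 0.20, 0.05, 0.30],
        'g12': [0.60, 0.10, 0.20, 0.15, 0.10],
        'g13': [0.20, 0.10, 0.50, 0.20, 0.20],
        'g14': [0.10, 0.10, 0.10, 0.60, 0.40],
        'L1': [0, 0, 0, 0, 0],
    },
    index=['Datapoint1', 'D2', 'D3', 'D4', 'D5'],
)

score_cols = ['g11', 'g12', 'g13', 'g14']
labels = [1, 2, 3, 4]                       # assumed label for each score column
col_to_label = dict(zip(score_cols, labels))

# name of the column with the row maximum, translated to its label
df1['L1'] = df1[score_cols].idxmax(axis=1).map(col_to_label)
print(df1['L1'].tolist())  # [2, 1, 3, 4, 4]
```

Restricting idxmax to score_cols plays the role of the df2 subset, so larger values in other columns do not interfere.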

Combine Pandas Columns by Corresponding Dictionary Values

I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns whose names are complements. I currently have a system that:
Gets the complement of a column name
Checks the column names for the complement
Adds the columns together if there is a match
Then deletes the complement column
However, this is slow (it checks every column name) and gives different column names depending on the ordering of the columns (i.e. it deletes different complement columns between runs). I was wondering if there is a way to use a dictionary key:value pair to speed up the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC and CGGG|GCCC are complements).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1],"CGGG" : [0, 2, 1, 4],
"TAAC": [0, 1, 0, 1], "GCCC" : [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
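## Current Pseudocode (compliment() is my own helper, not shown here)
for item in list(df.columns):
    # skip columns that were already merged and deleted earlier in the loop
    if item in df.columns and compliment(item) in df.columns:
        df[item] = df[item] + df[compliment(item)]
        del df[compliment(item)]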
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})

Translate the column names, then rename each column to whichever of the original name and its translation sorts first. Complementary columns then share a name, which lets you group them.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
df.groupby(level=0, axis=1).sum()
#   AAAA  ATTG  CGGG
#0     2     3     4
#1     1     7     4
#2     0     0     1
#3     1     2     4
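A variant of the same idea, sketched under the assumption that grouping columns via axis=1 is unavailable or deprecated in the installed pandas version: transpose, group the rows by a canonical name, and transpose back. The helper canonical is hypothetical; it picks the alphabetically smaller of a name and its complement, so TTTT ends up as AAAA, as in the output above.

```python
import pandas as pd

df = pd.DataFrame({"ATTG": [3, 6, 0, 1], "CGGG": [0, 2, 1, 4],
                   "TAAC": [0, 1, 0, 1], "GCCC": [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})

complement_table = str.maketrans("ATCG", "TAGC")

def canonical(name):
    """Hypothetical helper: one stable name shared by a sequence and its complement."""
    return min(name, name.translate(complement_table))

# group the transposed rows (the original columns) by canonical name, then sum
combined = df.T.groupby(canonical).sum().T
print(combined)
```

Because the canonical name depends only on the sequence itself, the result does not change with the column order of the input.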

How do you increase speed of this for loop with conditions?

I want to count, for each row, how many columns hold a value greater than zero. So if two of the three columns qualify, the score is 2.
I can build this with a for loop, but it seems slow, so I am looking for faster alternatives. How do I do that?
df = pd.DataFrame({'intro': [1, 2, 3], 'quote': [0, 1, 0],'sample': [0, 1, 4]},
                  columns=['intro', 'quote','sample'])
df['score']=0
cols=['intro', 'quote', 'sample']
for i in range(len(df)):
    print(i)
    for col in cols:
        if df.iloc[i][col] >= 1:
            df['score'][i]= df['score'][i]+1
df_expected = pd.DataFrame({'intro': [1, 2, 3], 'quote': [0, 1, 0],'sample': [0, 1, 4],'score': [1, 3, 2]},
                           columns=['intro', 'quote','sample','score'])
df_expected

This will do the trick:
df['score']=(df>0).sum(axis=1)

You can create a True/False frame of values > 0 like this:
df > 0
You can count the True values in each row using
(df > 0).sum(axis=1)
and create a column like this:
df['score'] = (df > 0).sum(axis=1)
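A runnable version of the above on the question's data, as a sketch. It restricts the comparison to the cols list from the question, on the assumption that a real frame may contain other numeric columns that should not be counted.

```python
import pandas as pd

df = pd.DataFrame({'intro': [1, 2, 3], 'quote': [0, 1, 0], 'sample': [0, 1, 4]})
cols = ['intro', 'quote', 'sample']

# True counts as 1 when summed, so this is the number of positive cells per row
df['score'] = (df[cols] > 0).sum(axis=1)
print(df['score'].tolist())  # [1, 3, 2]
```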

How to find the number of rows which satisfy different conditions on two numpy arrays

Let's say we have two numpy arrays, a = [4, 5, 8, 10, 4, 8, 4]
and b = [1, 0, 1, 1, 1, 0, 0].
We have to find the number of rows in which the first array's element is 4 and the second array's element is 1.
4,1 5,0 8,1 10,1 4,1 8,0 4,0
Here the answer is 2, since there are two rows in which the first element is 4 and the second is 1.

You should use something like
np.sum((a == 4) & (b == 1))
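A self-contained sketch of that one-liner on the question's data. It assumes a and b are NumPy arrays; with plain Python lists, a == 4 would compare the whole list to 4 and simply give False.

```python
import numpy as np

a = np.array([4, 5, 8, 10, 4, 8, 4])
b = np.array([1, 0, 1, 1, 1, 0, 0])

# element-wise conditions combined with &, then count the True entries
mask = (a == 4) & (b == 1)
count = int(np.count_nonzero(mask))
print(count)  # 2
```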

You can try it with basic Python:
import numpy as np
a = np.array([4, 5, 8, 10, 4, 8, 4])
b = np.array([1, 0, 1, 1, 1, 0, 0])
new_pair = []
for a_value, b_value in zip(a,b):
    if a_value==4 and b_value==1:
        new_pair.append([a_value,b_value])
print( len(new_pair) )
I hope it may help you.

This amounts to filtering the paired values row by row.
Have you tried the isin() method in pandas?
import pandas as pd
df = pd.DataFrame({'List_1': a, 'List_2': b})
# keep only the rows where List_1 is 4 and List_2 is 1
matches = df.loc[df['List_1'].isin([4]) & df['List_2'].isin([1])]
print(len(matches))
# matches now holds the rows you need, and len(matches) is their count
Hope this helps :))
