Obtaining indices of n max absolute values in a DataFrame row - Python

Suppose I create a Pandas DataFrame as below:
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
As an example, with this seed it generates the frame reproduced in the answer below.
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute terms. For example, for the first row, I would expect [0, 3, 4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.

You can use argsort(axis=1) on the absolute values:
Given dataset:
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
           0          1         2          3          4
0  17.640523   4.001572  9.787380  22.408932  18.675580
1  -9.772779   9.500884 -1.513572  -1.032189   4.105985
2   1.440436  14.542735  7.610377   1.216750   4.438632
3   3.336743  14.940791 -2.051583   3.130677  -8.540957
4 -25.529898   6.536186  8.644362  -7.421650  22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]
array([[3, 4, 0],
       [0, 1, 4],
       [1, 2, 4],
       [1, 4, 0],
       [0, 4, 2]])
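Since the question says the results don't need to be ordered, a partial sort is a possible alternative that avoids fully sorting every row; a small sketch using np.argpartition (not part of the original answer):
import numpy as np
n = 3
# argpartition only guarantees that the n largest absolute values end up in the
# last n positions of each row; their order within those positions is arbitrary
top_idx = np.argpartition(df.abs().values, -n, axis=1)[:, -n:]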

Try this (this is not the optimal code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    # use abs() so the n largest values are taken in absolute terms
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end of that you will have a dictionary with:
as key: the index of the row
as values: the indices of the n largest values (in absolute terms) of that row
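A compact variant of the same idea is a dict comprehension over the rows; as a sketch, for the seed-0 frame above it would yield {0: [3, 4, 0], 1: [0, 1, 4], 2: [1, 2, 4], 3: [1, 4, 0], 4: [0, 4, 2]}:
n = 3
# same mapping as the loop above, built in one expression
idx_nmax = {i: list(row.abs().nlargest(n).index) for i, row in df.iterrows()}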

Related

How to do point biserial correlation for multiple columns in one iteration

I am trying to calculate a point-biserial correlation for a set of columns in my dataset. I am able to do it for an individual variable, but when I try to calculate it for all the columns in one iteration it raises an error.
Below is the code:
df = pd.DataFrame({'A':[1, 0, 1, 0, 1], 'B':[6, 7, 8, 9, 10],'C':[9, 4, 6,9,10],'D':[8,9,5,7,10]})
from scipy import stats
corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B','C','D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B','C','D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add
x must be a column, not a DataFrame. If you pass a single column instead of the whole DataFrame, it will work. You can try this:
df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10], 'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
print(df)
from scipy import stats
corr_list = []
y = df['A'].astype(float)
for column in ['B', 'C', 'D']:   # take one feature column at a time
    x = df[column]
    corr = stats.pointbiserialr(list(x), list(y))
    corr_list.append(corr[0])
print(corr_list)
By the way, you can use print(df.corr()) and this gives you the correlation matrix of the DataFrame.
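Since point-biserial r is mathematically just Pearson r computed against a dichotomous variable, the plain correlation matrix reproduces the r values (though not the p-values); a quick sketch:
# Pearson r of A with B, C and D equals the point-biserial r values
print(df.corr()['A'].drop('A'))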
You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output will be the columns and their corresponding correlations and p-values (rows 0 and 1, respectively) with the target DataFrame or Series (see the pandas docs for corrwith):
              B         C         D
0  4.547937e-18  0.400066 -0.094916
1  1.000000e+00  0.504554  0.879331

Combine Pandas Columns by Corresponding Dictionary Values

I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns where the column names are complements. I currently have a system that
Gets the complement of a column name
Checks the column names for the complement
Adds the columns together if there is a match
Then deletes the complement column
However, this is slow (checking every column name) and gives different column names based on the ordering of the columns (i.e. deletes different complement columns between runs). I was wondering if there was a way to incorporate a dictionary key:value pair to speed up the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC & CGGG|GCCC are complements).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1],"CGGG" : [0, 2, 1, 4],
"TAAC": [0, 1, 0, 1], "GCCC" : [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
for item in df.columns:
    if complement(item) in df.columns:
        df[item] = df[item] + df[complement(item)]
        del df[complement(item)]
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})
Translate the column names, then rename each column to whichever of the original name or its translation sorts first. This lets you group complements together.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
df.groupby(level=0, axis=1).sum()
#   AAAA  ATTG  CGGG
#0     2     3     4
#1     1     7     4
#2     0     0     1
#3     1     2     4
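Note that grouping columns with axis=1 is deprecated in recent pandas releases; an equivalent sketch (assuming the columns have already been renamed as above) transposes instead:
# same column-wise grouping without the deprecated axis=1
df.T.groupby(level=0).sum().T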

numpy get row index where elements in certain columns are zero

I want to find the indices of rows based on criteria over certain columns.
So, something like:
import numpy as np
x = np.random.rand(4, 5)
x[2, 2] = 0
x[2, 3] = 0
x[3, 1] = 0
x[1, 3] = 0
Now, I want to get the indices of the rows where either column 3 or column 4 is zero. How can one do that with numpy? Do I need to make multiple calls to nonzero for each column and combine these indices using a set or something like that?
Using np.where, the first array within the returned tuple gives the row indices:
np.where(x[:,[3,4]]==0)
Out[79]: (array([1, 2], dtype=int64), array([0, 0], dtype=int64))
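If you only want each matching row index once, a small sketch building on the same mask reduces over the selected columns first:
# rows where column 3 or column 4 is zero, each index listed once
rows = np.nonzero((x[:, [3, 4]] == 0).any(axis=1))[0]
# rows -> array([1, 2])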

Apply arithmetic calculations on specific rows of a large dataframe

Suppose that we have a data-frame (df) with a large number of rows (1600000 x 4). Also, we have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
We need to calculate the average of the first and third columns of this data-frame and the median of the second and fourth columns for every list in inx. For example, for the first list in inx, we should do this for the first and second rows and replace all of those rows with a single new row containing the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
   a  b  c  d
0  1  2  3  3
1  4  5  6  1
2  7  8  9  3
3  1  1  1  1
The output for just the first list inside of inx ([1,2]) will be something like this:
     a    b    c  d
0    1    2    3  3
1  5.5  6.5  7.5  2
3    1    1    1  1
As you can see, we don't change the first row (0) because it's not in the list. After that, we're going to do the same for [4,5]. We don't change anything in row 3 either, because it's not in the list. inx is a large list of lists (more than 100000 elements).
EDIT: NEW APPROACH AVOIDING LOOPS
Below you find an approach relying on pandas and avoiding loops.
After generating some fake data of the same size as yours, I basically create a list of group labels from your inx list of rows; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark various groups of rows to operate on.
After proper calculations, the resulting dataframe is joined back with original rows which don't need calculations (in my example above, rows: [0, 1, 4, 8, 9, ...]).
You'll find more comments in the code.
At the end of the answer I also leave my previous approach, for the record.
On my box, the old algo involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half a second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx=[]
while len(inxl) > 3:
    i = random.randint(2, 3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create equivalent list (same length)
# of increasing ints. They'll be used to group corresponding rows
gr=[len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx)+1)))
group_list = [a*[b] for (a,b) in t]
# the group labels are flattened as well
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group retaining
# original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups['groups']
# rows not belonging to a group are marked with 0
df['groups']=df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containing only rows belonging to a group
df_to_operate = df[df['groups']>0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median
# for each group we retain the index of first row of group
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set correct index on dataframe
df_operated=df_operated.set_index('ind')
# finally, join the previous dataframe with the saved
# dataframe of rows which don't need calculations
df_final = df_operated.combine_first(df_untouched)
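As a cross-check against the small example from the question (inx = [[1, 2]]), a condensed sketch of the same grouping idea reproduces the expected frame shown there; the names groups, untouched and grouped are mine, not part of the code above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1],
                            [7, 8, 9, 3], [1, 1, 1, 1]]),
                  columns=['a', 'b', 'c', 'd'])
inx = [[1, 2]]
# map every row index appearing in inx to its (1-based) group number
groups = pd.Series({i: g for g, rows in enumerate(inx, start=1) for i in rows})
df['groups'] = groups.reindex(df.index).fillna(0)
untouched = df[df['groups'] == 0].drop(columns='groups')
grouped = (df[df['groups'] > 0]
           .assign(ind=lambda d: d.index)
           .groupby('groups')
           .agg({'a': 'mean', 'b': 'median', 'c': 'mean', 'd': 'median', 'ind': 'first'})
           .set_index('ind'))
# rows 0 and 3 pass through untouched, rows 1 and 2 collapse into a single row at index 1
result = grouped.combine_first(untouched)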
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algo, involving a loop, gives a correct result but takes too long for such a big amount of data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
inx = [[1,2]]
for l in inx:
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    df.iloc[l[0]] = pd.DataFrame([means, medians]).fillna(method='bfill').iloc[0]
    df.drop(index=l[1:], inplace=True)

Detect Missing Column Labels in Pandas

I'm working with the dataset outlined here:
https://archive.ics.uci.edu/ml/datasets/Balance+Scale
I'm trying to create a general function to parse any categorical data following these two rules:
Must have a column labeled Class containing the class of the object
Each row must have the same numbers of columns
Minimal example of the data that I'm working with:
Class,LW,LD,RW,RD
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
This provides 3 unique classes: B, L, R. It also provides 4 features which pertain to each entry: LW, LD, RW and RD.
The following is a part of my function to handle generic cases, but my issue with it is that I don't know how to check if any column labels are simply missing:
import pandas as pd
import sys

dataframe = pd.read_csv('Balance_Data.csv')
columns = list(dataframe.columns.values)
if "Class" not in columns:
    sys.exit("'Class' is not a column in the data")
if "Class.1" in columns:
    sys.exit("Cannot specify more than one 'Class' column")
columns.remove("Class")
inputX = dataframe.loc[:, columns].as_matrix()
inputY = dataframe.loc[:, ['Class']].as_matrix()
At this point, the correct values are:
inputX = array([[1, 1, 1, 1],
                [1, 2, 1, 1],
                [1, 2, 1, 3],
                [2, 2, 4, 5]])
inputY = array([['B'],
                ['L'],
                ['R'],
                ['R']], dtype=object)
But if I remove the last column label (RD) and reprocess,
Class,LW,LD,RW
B,1,1,1,1
L,1,2,1,1
R,1,2,1,3
R,2,2,4,5
I get:
inputX = array([[1, 1, 1],
                [2, 1, 1],
                [2, 1, 3],
                [2, 4, 5]])
inputY = array([[1],
                [1],
                [1],
                [2]])
This indicates that it reads label values from right to left instead of left to right, which means that if any data that doesn't have the right number of labels is fed into this function, it's not going to work correctly.
How can I check that the dimension of the rows is the same as the number of columns? (It can be assumed that there are no gaps in the data itself, that each row of data beyond the columns always has the same number of elements in it)
I would pull it out as follows:
In [11]: df = pd.read_csv('Balance_Data.csv', index_col=0)
In [12]: df
Out[12]:
       LW  LD  RW  RD
Class
B       1   1   1   1
L       1   2   1   1
R       1   2   1   3
R       2   2   4   5
That way the assertion check can be:
if "Class" in df.columns:
sys.exit("class must be the first and only the column and number of columns must match all rows")
and then check that there are no NaNs in the last column:
In [21]: df.iloc[:, -1].notnull().all()
Out[21]: True
Note: this happens e.g. with the following (bad) csv:
In [31]: !cat bad.csv
A,B,C
1,2
3,4
In [32]: df = pd.read_csv('bad.csv', index_col=0)
In [33]: df
Out[33]:
   B   C
A
1  2 NaN
3  4 NaN
In [34]: df.iloc[:, -1].notnull().all()
Out[34]: False
I think these are the only two failing cases (but I think the error messages can be made clearer)...
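Putting the two checks together, here's a rough sketch of a loader; the function name, the path argument and the error messages are mine, not from the original answer:
import sys
import pandas as pd

def load_checked(path):
    df = pd.read_csv(path, index_col=0)
    # with a well-formed file, 'Class' ends up as the index rather than a column
    if "Class" in df.columns:
        sys.exit("'Class' must be the first column and appear only once")
    # NaNs in the last column mean the labels and the row fields don't line up
    if not df.iloc[:, -1].notnull().all():
        sys.exit("number of column labels does not match the number of fields in each row")
    return df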
