I am fairly new to pandas and have been going around in circles trying to find an easy way to solve the following problem:
I have a large correlation matrix (several thousand rows/columns) as a dataframe and would like to extract the maximum value from each column, excluding the 1 that is of course present in every column (the diagonal of the matrix).
I tried all sorts of variations of .max() and .idxmax(), including the following:
corr.drop(corr.idxmax()).max()
But I only get nonsense results. Any help is highly appreciated.
You can probably use np.fill_diagonal:
# work on a copy so the original frame is untouched
df_values = df.values.copy()
# overwrite the diagonal of 1s with -inf so it never wins the max
np.fill_diagonal(df_values, -np.inf)
# column-wise maximum, now ignoring the diagonal
df_values.max(0)
Or with a one-liner you can use:
df.values[~np.eye(df.shape[0], dtype=bool)].reshape(df.shape[0] - 1, -1).max(0)
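Either way the result is a bare NumPy array; if you want the column labels back, a minimal sketch wrapping it in a Series (assuming df is the correlation frame):
import numpy as np
import pandas as pd

vals = df.values.copy()
np.fill_diagonal(vals, -np.inf)  # mask the self-correlations
col_max = pd.Series(vals.max(axis=0), index=df.columns)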
This will get the 2nd highest values from each column.
As an array:
np.partition(df.values, len(df)-2, axis=0)[len(df)-2]
or in a dataframe:
pd.DataFrame(np.partition(df.values, len(df) - 2, axis=0)[len(df) - 2],
             index=df.columns, columns=['2nd'])
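As a quick sanity check, a toy 3x3 correlation matrix (values made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 0.3, 0.8],
                   [0.3, 1.0, 0.5],
                   [0.8, 0.5, 1.0]],
                  index=list('abc'), columns=list('abc'))

second = np.partition(df.values, len(df) - 2, axis=0)[len(df) - 2]
print(pd.Series(second, index=df.columns))  # a: 0.8, b: 0.5, c: 0.8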
I have a pandas dataframe (df) with columns (say x_1, x_2, ..., x_n as column names). I want to find the correlation (Pearson) between the ith column and the rest of the columns.
One way I can do this is by using the .corr() function
correlation = df.corr(method='pearson')
corr_i = correlation['x_i']
but this method is a bit expensive, since it computes the correlations between all pairs of columns (all I need is one column). The other method I could use is
corr_i = [df['x_i'].corr(df[j], method='pearson') for j in df.columns if j != 'x_i']
but I feel this is not an efficient way of finding the correlation, given the flexibility of dataframes. Can anyone suggest a more efficient method than the two above? Thanks in advance.
corrwith() might be what you are looking for.
Say you had a dataframe with columns c1, c2, c3, c4.
Then you should be able to:
df[['c2','c3','c4']].corrwith(df['c1'])
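Applied to the question's naming, a minimal sketch ('x_i' stands in for whichever column you are correlating against):
corr_i = df.drop(columns=['x_i']).corrwith(df['x_i'])
This computes only the n-1 pairwise correlations you need rather than the full n x n matrix.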
Suppose we have a PySpark dataframe with ~10M rows. Is there a faster way of getting distinct rows compared to df.distinct()? Maybe using df.groupBy()?
If you select only the columns of interest before you do the operation, it will be faster (smaller dataset).
Something like:
columns_to_select = ["col1", "col2"]
df.select(columns_to_select).distinct()
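For completeness, a sketch of two variants (assuming the col1/col2 names above; both select().distinct() and dropDuplicates() are standard PySpark DataFrame methods):
# distinct over a projection: deduplicates the selected columns only
distinct_pairs = df.select("col1", "col2").distinct()
# dropDuplicates with a subset keeps whole rows, deduplicated on col1/col2
# (an arbitrary row is kept per key)
deduped_rows = df.dropDuplicates(["col1", "col2"])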
I have a dataframe which looks as follows:
I want to multiply the elements in each row (except for the "depreciation_rate" column) by the value in that row's "depreciation_rate" column.
I tried df2.iloc[:, 6:26] * df2["depreciation_rate"] as well as df2.iloc[:, 6:26].mul(df2["depreciation_rate"]).
I get the same result with both, which looks as follows: NaN values and additional columns I don't want. I think the row elements are also being matched against values from other rows of the "depreciation_rate" column. What would be a good way to solve this issue?
Try using mul() along axis=0:
df2.iloc[:,6:26].mul(df2["depreciation_rate"], axis=0)
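A minimal sketch of why axis=0 matters (toy frame with made-up column names):
import pandas as pd

df2 = pd.DataFrame({'depreciation_rate': [0.1, 0.2],
                    'a': [100, 100],
                    'b': [50, 50]})

# Without axis=0, pandas aligns the Series with the frame's column labels,
# which yields NaN-filled extra columns; with axis=0 it aligns on the row index:
print(df2[['a', 'b']].mul(df2['depreciation_rate'], axis=0))
#       a     b
# 0  10.0   5.0
# 1  20.0  10.0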
I am fairly new to Python and am trying to implement the following logic in pandas.
I am currently looping through the rows, summing the values in the AMOUNT column over the prior rows, but only back to the last seen TRUE value. This seems inefficient on the actual data (a dataframe of about 5 million rows). What would an efficient way of handling such logic in pandas entail?
Logic:
The logic is that when FLAG is TRUE, I want to sum the values in the AMOUNT column over the prior rows, but only back to the last seen TRUE value. Basically, sum the values in AMOUNT between the rows where FLAG is TRUE.
Check with cumsum and transform('sum'):
df['SUM'] = df.groupby(df['FLAG'].cumsum()).AMOUNT.transform('sum').where(df.FLAG)
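A toy run-through (made-up data): FLAG.cumsum() labels each block of rows starting at a TRUE, transform('sum') totals AMOUNT within each block, and .where(df.FLAG) keeps the total only on the TRUE rows:
import pandas as pd

df = pd.DataFrame({'FLAG': [True, False, False, True, False, True],
                   'AMOUNT': [10, 20, 30, 40, 50, 60]})

df['SUM'] = df.groupby(df['FLAG'].cumsum()).AMOUNT.transform('sum').where(df.FLAG)
print(df)
#     FLAG  AMOUNT   SUM
# 0   True      10  60.0
# 1  False      20   NaN
# 2  False      30   NaN
# 3   True      40  90.0
# 4  False      50   NaN
# 5   True      60  60.0
Note that each TRUE row's total covers that row and the rows after it, up to the next TRUE; if the block should instead end at the TRUE row, shift the flag before taking the cumsum.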
Maybe try something along the following lines:
import pandas as pd
df = pd.read_csv('name of file.csv')
df['AMOUNT'].sum()
I just got an assignment in which I have a lot of features (as columns) and records (as rows) in a CSV file. I am cleaning the data using Python (including pandas):
A,B,C
1,1,1
0,0,0
1,0,1
I would like to delete all the duplicate columns with the same values and keep only one of them. A and B will be the only columns to remain.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do that?
Thanks.
I would like to delete all the duplicate columns with the same values and keep only one of them. A will be the only column to remain.
You mean it's the only one kept among A and C, right? (B doesn't duplicate anything.)
You can use DataFrame.drop_duplicates
df = df.T.drop_duplicates().T
It works on rows, not columns, so I transpose before/after calling it.
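A quick demonstration on the sample data from the question (StringIO is only used to inline the CSV):
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("A,B,C\n1,1,1\n0,0,0\n1,0,1"))

# After transposing, duplicate columns become duplicate rows:
print(df.T.drop_duplicates().T)  # C is dropped as a duplicate of A
#    A  B
# 0  1  1
# 1  0  0
# 2  1  0
Note that transposing copies the data, so on very wide frames this can be memory-hungry.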
I would like to combine the columns that have a high Pearson correlation with the target value. How can I do that?
You can loop over the columns, pairing each one with the target and computing their correlation with Series.corr or numpy.corrcoef.
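Or, building on the corrwith answer above, a sketch (the 'target' column name and the 0.8 threshold are made up):
corr_with_target = df.drop(columns=['target']).corrwith(df['target'])
high_corr = corr_with_target[corr_with_target.abs() > 0.8].index.tolist()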