Finding correlation in dataframe - python

I have a pandas dataframe (df) with columns named, say, x_1, x_2, ..., x_n. I want to find the correlation (Pearson) between the ith column and the rest of the columns.
One way I can do this is by using the .corr() function
correlation = df.corr(method='pearson')
corr_i = correlation['x_i']
but this method is a bit expensive, since it computes correlations between all pairs of columns (I only need the correlations for one column). The other method I could use is
corr_i = [df['x_i'].corr(df[j], method='pearson') for j in df.columns if j != 'x_i']
but I feel this is not an efficient way of finding the correlation, given the flexibility of dataframes. Can anyone suggest a more efficient method than the two above? Thanks in advance.

corrwith() might be what you are looking for.
Say you had a data frame with columns c1, c2, c3, c4.
Then you should be able to:
df[['c2','c3','c4']].corrwith(df['c1'])
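To avoid listing the remaining columns by hand, here is a minimal runnable sketch (the frame and column names are hypothetical; drop removes the target column so corrwith correlates it against everything else):

import pandas as pd
import numpy as np

# Hypothetical numeric data
df = pd.DataFrame(np.random.rand(100, 4), columns=['c1', 'c2', 'c3', 'c4'])

# Correlate every other column with 'c1' without naming each one
corr_with_c1 = df.drop(columns='c1').corrwith(df['c1'])
print(corr_with_c1)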

Related

Pandas - Adding individual Z-Scores to each row based on an id

So I have a pandas dataframe with game_id, player_id, and playtime columns. I would like to add a z-score rating to each row to measure how far from the norm, in terms of playtime, that row is for its game. How would I go about adding each of these scores to a new column of the dataframe? Let me know if there's anything I need to clarify.
A simple way is to use the dataframe's apply function. First define a function that calculates the new column's value, then apply it row by row:
def creating_new_column(row):
    # Adapt this to however you calculate the z-score
    return row["playtime"] / 2

df["z_score"] = df.apply(creating_new_column, axis=1)

Fastest way to update pandas columns based on matching column from other pandas dataframe

I have two pandas dataframes, and one has updated values for a subset of the values in the primary dataframe. The main one is ~2m rows and the column to update is ~20k. This operation runs extremely slowly as I have it below, which is O(m*n) as far as I can tell. Is there a good way to vectorize it or just generally increase the speed? I don't see many other optimizations that could apply to this case. I have also tried making the 'object_id' column the index, but that didn't lead to a meaningful increase in speed.
# df_primary this is 2m rows
# df_updated this is 20k rows
for idx, row in df_updated.iterrows():
    df_primary.loc[df_primary.object_id == row.object_id, ['status', 'category']] = [row.status, row.category]
Let's try DataFrame.update to update df_primary in place using values from df_updated:
df_primary = df_primary.set_index('object_id')
df_primary.update(df_updated.set_index('object_id')[['status', 'category']])
df_primary = df_primary.reset_index()
Alternatively, use join/merge methods (left/right/inner, depending on your requirements). They will be far faster than row-wise iteration.
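As a rough sketch of the merge-based approach (assuming object_id is unique in df_updated; combine_first is one reasonable way to prefer the updated values where they exist):

import pandas as pd

# Bring the updated rows alongside the originals, keyed on object_id
merged = df_primary.merge(
    df_updated[['object_id', 'status', 'category']],
    on='object_id', how='left', suffixes=('', '_new'),
)

# Take the updated value where present, otherwise keep the original
for col in ['status', 'category']:
    merged[col] = merged[col + '_new'].combine_first(merged[col])

df_primary = merged.drop(columns=['status_new', 'category_new'])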

Looking for the "pandas" way to aggregate multiple columns to a single row

I'm working with tables containing a number of image-derived features, with a row of features for every timepoint of an image series, and multiple images. An image identifier is contained in one column. To condense this (primarily to be used in a parallel coordinates plot), I want to reduce/aggregate the columns to a single value. The reduce operation has to be chosen depending on the feature in the column (for example mean, max, or something custom). DataFrame.agg seemed like the natural thing to do, but it gives me a table with multiple rows. Right now I'm doing something like this:
result_df = pd.DataFrame(index=[0])
for col in df.columns:
    if col in ReduceThisColumnByMean:
        result_df[col] = df[col].mean()
    elif col in ReduceThisColumnByMax:
        result_df[col] = df[col].max()
This seems like a detour to me and might not scale well (not a big concern, as the number of reduce operations will most probably not grow beyond a few). Is there a more pandas-esque way to aggregate multiple columns with specific operations into a single row?
You can select the columns by list, compute the mean and the max, join the results together with concat, and finally convert the Series to a one-row DataFrame with Series.to_frame and transpose:
result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
                       df[ReduceThisColumnByMax].max()]).to_frame().T
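If you prefer DataFrame.agg, a minimal sketch (in recent pandas, agg returns a Series when each column maps to exactly one function, which transposes into a single row):

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})

# One reduce operation per column; this mapping is hypothetical
ops = {'a': 'mean', 'b': 'max'}
result_df = df.agg(ops).to_frame().T
print(result_df)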

Pandas dataframe: selecting max by column for subset

I am fairly new to pandas and have been going around in circles trying to find an easy way to solve the following problem:
I have a large correlation matrix (several thousand rows/columns) as a dataframe and would like to extract the maximum value by column, excluding the 1 that is of course present in every column (the diagonal of the matrix).
I have tried all sorts of variations of .max() and .idxmax(), including the following:
corr.drop(corr.idxmax()).max()
But I only get nonsense results. Any help is highly appreciated.
You can probably use np.fill_diagonal:
import numpy as np

df_values = df.values.copy()
np.fill_diagonal(df_values, -np.inf)  # mask the unit diagonal
df_values.max(0)
Or with a one-liner you can use:
df.values[~np.eye(df.shape[0],dtype=bool)].reshape(df.shape[0]-1,-1).max(0)
This will get the 2nd-highest value from each column.
As an array:
np.partition(df.values, len(df)-2, axis=0)[len(df)-2]
or in a dataframe:
pd.DataFrame(np.partition(df.values, len(df)-2, axis=0)[len(df)-2],
             index=df.columns, columns=['2nd'])
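To keep the column labels with the fill_diagonal approach, a small self-contained sketch (the 5x5 matrix here is hypothetical) wraps the result back into a Series:

import numpy as np
import pandas as pd

# Hypothetical correlation matrix with a unit diagonal
df = pd.DataFrame(np.corrcoef(np.random.rand(5, 100)),
                  index=list('abcde'), columns=list('abcde'))

# Mask the diagonal so the trivial self-correlation of 1 never wins
values = df.values.copy()
np.fill_diagonal(values, -np.inf)
col_max = pd.Series(values.max(axis=0), index=df.columns)
print(col_max)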

Turning DataFrameGroupBy.resample hierarchical index into columns

I have a dataset that contains individual observations that I need to aggregate at coarse time intervals, as a function of several indicator variables at each time interval. I assumed the solution here was to do a groupby operation, followed by a resample:
adult_resampled = adult_data.set_index('culture', drop=False).groupby(
    ['over64', 'regioneast', 'pneumo7', 'pneumo13', 'pneumo23', 'pneumononPCV',
     'PENR', 'LEVR', 'ERYTHR', 'PENS', 'LEVS', 'ERYTHS'])['culture'].resample('AS', how='count')
The result is an awkward-looking series with a massive hierarchical index, so perhaps this is not the right approach, but I need to then turn the hierarchical index into columns. The only way I can do that now is to hack the hierarchical index (by pulling out the index labels, which are essentially the contents of the columns I need).
Any tips on what I ought to have done instead would be much appreciated!
I've tried the new Grouper syntax, but it does not allow me to subsequently change the hierarchical indices into data columns: applying unstack to the resulting table still does not give me the indicator values as ordinary data columns.
In order for this dataset to be useful, say in a regression model, I really need the index labels as indicators in columns.
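For what it's worth, the standard way to turn MultiIndex levels into ordinary data columns is reset_index. A minimal sketch under the question's setup (assuming 'culture' holds the culture dates, and spelling the count step as .resample('AS').count() in place of the long-deprecated how='count'):

# Group by the indicators, count cultures per year, then promote
# every level of the hierarchical index to an ordinary column
indicators = ['over64', 'regioneast', 'pneumo7', 'pneumo13', 'pneumo23', 'pneumononPCV',
              'PENR', 'LEVR', 'ERYTHR', 'PENS', 'LEVS', 'ERYTHS']

adult_resampled = (adult_data.set_index('culture', drop=False)
                   .groupby(indicators)['culture']
                   .resample('AS').count()
                   .reset_index(name='count'))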
