I have a pandas dataframe with two or more rows and 42 columns. By transposing and plotting it, I get the profiles of the rows.
df.T.plot()
I want to sort the columns so that the columns where the rows are strongly correlated (similar profile, values moving in the same direction) come first, and the columns where the rows are weakly correlated (opposite profile, values moving in opposite directions) come later.
I could run a clustering algorithm on the columns, but clusters are not exactly what I want.
I think one solution would be to sort by the distance of the points from the linear regression line?
Correlation is a measure that describes the relationship between two variables as a whole, not at specific points. The metric you've described for sorting isn't correlation, but rather the absolute difference between the column values in the two rows. (With the transpose operation the two rows become two columns, and their lines on the graph you're making will 'go in opposite directions' when the values in the two columns are further apart.)
Achieving this with the dataframe you've described would look something like:
df_T = df.T
# panB and panC are assumed to be the labels of the two rows in df;
# after transposing they become columns.
df_T['sort_column'] = (df_T.panB - df_T.panC).abs()
df_T.sort_values('sort_column', inplace=True)
df_T.drop(columns='sort_column', inplace=True)
df_T.plot()
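Sorting ascending on the absolute difference puts the columns where the two profiles nearly coincide first and the columns where they diverge the most last, which is the ordering described in the question.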
I have a data frame of log-return columns (btc_logreturns, gold_logreturns, ewz_logreturns, and so on).
I need to calculate a correlation matrix across all columns. The problem is that when I calculate the correlation of two columns separately, I get a different value than when I calculate it for every pair at once using df.corr().
The way I constructed the data frame was by merging the first column with every other one, and this pairwise merging resulted in two-column data frames of different lengths.
For example: the first column (btc_logreturns) and the second column (gold_logreturns) had 2000 rows originally, while btc_logreturns and ewz_logreturns had 2100 rows. But all columns together have 2459 rows.
Does the function .corr() account for NaNs when calculating the correlation? Is the length of the data frame a potential problem for the different correlation values I get?
The problem is most likely that the indices of these data frames do not align: some index labels that are present in one data frame are not present in the other, and vice versa.
If the indices are indeed meaningful, use the result from the merged data frame. If, on the other hand, the indices are not meaningful, merge the original data frames with pd.concat([df_1, df_2], ignore_index=True), which ignores the indices of the original data frames.
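To answer the NaN question directly: .corr() computes each pairwise correlation over the rows where both columns are non-NaN, so differing amounts of missing data per pair do change the values. A minimal sketch of the alignment effect, with made-up numbers and only the column names taken from the question:

import pandas as pd

# Made-up single-column frames whose indices only partially overlap.
df_1 = pd.DataFrame({'btc_logreturns':  [0.01, -0.02, 0.03, 0.01]}, index=[0, 1, 2, 3])
df_2 = pd.DataFrame({'gold_logreturns': [0.02,  0.01, -0.01, 0.00]}, index=[2, 3, 4, 5])

# Aligning on the index leaves NaNs where a label exists in only one frame;
# .corr() then uses only the rows where both columns are present.
aligned = pd.concat([df_1, df_2], axis=1)
print(aligned.corr())

# Resetting the indices first lines the values up purely by position instead.
positional = pd.concat([df_1.reset_index(drop=True),
                        df_2.reset_index(drop=True)], axis=1)
print(positional.corr())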
I'm working with tables containing a number of image-derived features, with a row of features for every timepoint of an image series, and multiple images. An image identifier is contained in one column. To condense this (primarily to be used in a parallel coordinates plot), I want to reduce/aggregate the columns to a single value. The reduce operation has to be chosen depending on the feature in the column (for example mean, max, or something custom). DataFrame.agg seemed like the natural thing to do, but it gives me a table with multiple rows. Right now I'm doing something like this:
result_df = pd.DataFrame()
for col in df.columns:
    if col in ReduceThisColumnByMean:
        result_df[col] = [df[col].mean()]   # wrap in a list so the result has a single row
    elif col in ReduceThisColumnByMax:
        result_df[col] = [df[col].max()]
This seems like a detour to me, and it might not scale well (not a big concern, as the number of reduce operations will most probably not grow beyond a few). Is there a more pandas-esque way to aggregate multiple columns with specific operations into a single row?
You can select the columns by list, take the mean and the max, join the results together with concat, and finally convert the Series to a one-row DataFrame with Series.to_frame and a transpose:
result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
df[ReduceThisColumnByMax].max()]).to_frame().T
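Alternatively, DataFrame.agg also accepts a mapping from column name to a single function, which should give one value per column (a Series) rather than the multi-row table you get when passing lists of functions; converting it to a one-row frame then works the same way. A minimal sketch, assuming the two column lists from the question hold the actual column names:

agg_map = {col: 'mean' for col in ReduceThisColumnByMean}
agg_map.update({col: 'max' for col in ReduceThisColumnByMax})
# One function per column -> a Series, transposed into a single row.
result_df = df.agg(agg_map).to_frame().T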
I'm trying to make a heat map using a weighted average percentage. I was trying to do something similar to using a calculated field in a pivot table in Excel, but wound up with two grouped data frames with the same index and columns. The grouping was done by two sets of predetermined buckets, one of which I unstacked to be the column headers (e.g. [0,10,20,50,100] and [0,1,2,5,10]).
df7 = df5.groupby(['SomeBuckets','MoreBuckets']).sum().astype(float).unstack(['MoreBuckets'])
df8 = df6.groupby(['SomeBuckets','MoreBuckets']).sum().astype(float).unstack(['MoreBuckets'])
I'm not sure how to divide the two literally cell by cell. Is there a way to do this? I tried
df9 = df7.truediv(df8, axis=0, fill_value='')
but all that gave me was a dtype error: could not convert string to float.
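For what it's worth, since the two frames share the same index and columns, plain element-wise division should already line the cells up; the error presumably comes from fill_value='', because fill_value has to be a number if it is given at all. A minimal sketch:

# df7 and df8 have the same index and columns, so this divides cell by cell.
# fill_value, if used, must be numeric (e.g. 0), not an empty string.
df9 = df7.truediv(df8)   # or simply df7 / df8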
I just got an assignment in which I have a lot of features (as columns) and records (as rows) in a CSV file. I am cleaning the data using Python (including pandas). A sample of the data:
A,B,C
1,1,1
0,0,0
1,0,1
I would like to delete all the duplicate columns that hold the same values and keep only one of each. A and B will be the only columns that remain.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do that?
Thanks.
I would like to delete all the duplicate columns with the same values and keep only one of them. A will be the only column to remain.
You mean that A is the only one of A and C that's kept, right? (B doesn't duplicate anything.)
You can use DataFrame.drop_duplicates
df = df.T.drop_duplicates().T
It works on rows, not columns, so I transpose before/after calling it.
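For example, with the sample data from the question (built inline here for illustration), C is dropped as a duplicate of A:

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1], 'B': [1, 0, 0], 'C': [1, 0, 1]})
df = df.T.drop_duplicates().T
print(df.columns.tolist())   # ['A', 'B'] -- C duplicated A and was removed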
I would like to combine the columns that have a high Pearson correlation with the target value. How can I do it?
You can loop over the columns, pair each one with the target, and compute their correlation with DataFrame.corr or with numpy.corrcoef.
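A minimal sketch, assuming the target lives in a column called 'target' and that 'high correlation' means an absolute Pearson correlation above 0.8 (both the name and the threshold are placeholders); DataFrame.corrwith computes the column-wise correlations in one call, and the selected columns are combined here by simple averaging:

# 'target' and the 0.8 cutoff are hypothetical; adjust to the actual data.
features = df.drop(columns='target')
corr_with_target = features.corrwith(df['target'])   # Pearson by default
high_corr_cols = corr_with_target[corr_with_target.abs() > 0.8].index
# One possible way to "combine" them: average into a single feature.
df['combined'] = df[high_corr_cols].mean(axis=1)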
I have a dataset that contains individual observations that I need to aggregate at coarse time intervals, as a function of several indicator variables at each time interval. I assumed the solution here was to do a groupby operation, followed by a resample:
adult_resampled = (adult_data.set_index('culture', drop=False)
                             .groupby(['over64', 'regioneast', 'pneumo7', 'pneumo13',
                                       'pneumo23', 'pneumononPCV', 'PENR', 'LEVR',
                                       'ERYTHR', 'PENS', 'LEVS', 'ERYTHS'])['culture']
                             .resample('AS')
                             .count())
The result is an awkward-looking series with a massive hierarchical index, so perhaps this is not the right approach, but I need to then turn the hierarchical index into columns. The only way I can do that now is to hack the hierarchical index (by pulling out the index labels, which are essentially the contents of the columns I need).
Any tips on what I ought to have done instead would be much appreciated!
I've tried the new Grouper syntax, but it does not allow me to subsequently change the hierarchical indices to data columns. Applying unstack to the table moves some of the labels into the column headers, but they still are not data columns.
In order for this dataset to be useful, say in a regression model, I really need the index labels as indicators in columns.
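For reference, a minimal sketch of the usual way to pull the hierarchical index levels back out as ordinary data columns, assuming adult_resampled is the count series produced above (the column name 'n_cultures' is just an illustrative label for the counts):

# Every level of the MultiIndex (the indicator grouping keys plus the
# resampled date) becomes a regular column; the counts get their own column.
adult_counts = adult_resampled.reset_index(name='n_cultures')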