Problem calculating correlation in Python

I have the following data frame, and I need to calculate a correlation matrix across all of its columns. The problem is: when I calculate the correlation of two columns separately, I get a different value than when I calculate it for every pair at once using df.corr().
I constructed the data frame by merging the first column with every other one, and this merging process produced two-by-two data frames of different lengths.
For example: the first column (btc_logreturns) and the second column (gold_logreturns) originally had 2000 rows, while btc_logreturns and ewz_logreturns had 2100 rows. All columns together have 2459 rows.
Does .corr() account for NaNs when calculating the correlation? Could the different lengths explain the different correlation values I get?

The problem is most likely that the indices of these data frames do not align: some indices present in one data frame are missing from the other, and vice versa.
If the indices are meaningful, use the result from the merged data frame. If, on the other hand, the indices are not meaningful, merge the original data frames using pd.concat([df_1, df_2], ignore_index=True), which ignores the indices of the original data frames.
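For reference, df.corr() excludes missing values pairwise: each entry of the correlation matrix only uses the rows where both columns are non-null. A minimal sketch of why the merged frame can give different numbers (the series names and toy values below are invented for illustration):
import pandas as pd
# Two series whose indices only partially overlap, like the two-by-two merges.
s1 = pd.Series([0.10, -0.20, 0.30, 0.05], index=[0, 1, 2, 3], name='btc_logreturns')
s2 = pd.Series([0.20, 0.10, -0.40, 0.15], index=[0, 1, 2, 5], name='gold_logreturns')
# Combining by index introduces NaNs wherever an index is missing from one series.
df = pd.concat([s1, s2], axis=1)
# .corr() drops the NaN rows pairwise, so each pair is computed on its own subset of rows.
print(df.corr())
# The same value computed "separately" on explicitly aligned data:
aligned = df.dropna()
print(aligned['btc_logreturns'].corr(aligned['gold_logreturns']))
The two approaches only agree when each pair of columns ends up aligned on the same rows; otherwise every pair is effectively computed on a different subset of the data.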

Related

How to modify column names when combining rows of multiple columns into single rows in a dataframe, based on a categorical value, plus selective sum/mean

I'm using the pandas .groupby function to combine multiple rows into a single row.
Currently I have a dataframe df_clean with 405 rows and 85 columns. Up to 3 rows correspond to a single Batch.
My current code for combining the multiple rows is:
num_V = 84  # number of columns minus 1, excluding the "Batch" column they are grouped by
max_row = df_clean.groupby('Batch').Batch.count().max()
df2 = (
    df_clean.groupby('Batch')
    .apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
    .apply(pd.Series)
)
This code works: it creates a dataframe df2 that groups the rows by Batch. However, the columns in the resulting dataframe are simply numbered (0, 1, 2, 3, ..., 249, 250, 251); note that 84*3 = 252, i.e. (number of columns minus the Batch column)*3 = 252, and Batch becomes the index.
I'm cleaning some data for analysis and I want to combine the data of several (generally 1-3) Sub_Batch values on separate rows into a single row based on their Batch. Ideally I would like to be able to determine which columns are grouped into the row while remaining separate columns, as well as which columns have their average or total value reported.
For example, the desired input/output:
Original dataframe
Output dataframe
Note the naming of the columns, that all columns are copied over, and that the columns are ordered according to which Sub_Batch they belong to, i.e. Weight_2 will always correspond to the second Sub_Batch that is part of that Batch, and Weight_3 to the third.
Ideal output dataframe
Note the naming of the columns, and that in this dataframe there is only a single column recording the Color, since it is identical for all Sub_Batch values within a Batch. The individual Temperature values are recorded, as well as the average of the Temperature values for the Batch. The individual Weight values are recorded, as well as the sum of the weight values in the column 'Total_weight'.
I am 100% okay with the Output dataframe scenario, as I will simply add the values I want afterwards using .mean and .sum. I am simply asking whether it can be done using .groupby, as it is not something I have worked with before, and I know it has some ability to sum or average results.
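One common way to get named columns instead of the numbered 0...251 ones is to number the sub-batches with groupby().cumcount() and then unstack. This is only a sketch with invented column names (Color, Temperature, Weight), not the original data:
import pandas as pd
# Toy stand-in for df_clean; the real frame has 85 columns.
df_clean = pd.DataFrame({
    'Batch':       [1, 1, 1, 2, 2],
    'Color':       ['red', 'red', 'red', 'blue', 'blue'],
    'Temperature': [30, 32, 34, 40, 41],
    'Weight':      [10, 11, 12, 20, 21],
})
# Number the sub-batches within each Batch (1, 2, 3, ...).
df_clean['sub'] = df_clean.groupby('Batch').cumcount() + 1
# One row per Batch, with each value column suffixed by its sub-batch number.
wide = df_clean.set_index(['Batch', 'sub']).unstack('sub')
wide.columns = [f'{col}_{sub}' for col, sub in wide.columns]
# Selective aggregation afterwards, e.g. mean temperature and total weight per Batch.
wide['Avg_Temperature'] = df_clean.groupby('Batch')['Temperature'].mean()
wide['Total_Weight'] = df_clean.groupby('Batch')['Weight'].sum()
print(wide)
The numbered columns produced by the original .apply approach could also simply be renamed afterwards with a similar list comprehension over the original column names.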

Python pandas.get_dummies generates duplicate field names when handling 2D arrays with a different order of values

I have the following DF, for example: a DF built from 2D arrays with missing values, arrays of different lengths, and a different order of values inside each observation.
Data Sample
When I convert it to one-hot using pandas.get_dummies, I get duplicated column names in the one-hot matrix presented below. How can I get unique column names instead?
Note: I have more than 15 million observations and about 35,000 unique field names, so I need a solution that also fits within my memory.
pd.get_dummies(pd.DataFrame(path_Ids[0:10]), prefix='paths_')
Actual Results
I've tried providing the "columns" parameter to supply unique column names.
I've also tried creating the dummies in mini-batches, but that didn't work either, because every mini-batch generates different column names (duplicated or missing ones), and DFs with different columns or column names can't simply be appended or concatenated.
TensorFlow's one-hot encoding also didn't work for me.
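For what it's worth, one pattern that avoids the positional duplicates (a sketch, not from the original thread; path_Ids below is a made-up sample) is to stack the positional columns into a single Series before calling get_dummies, so each unique ID produces exactly one column, then collapse back per row:
import pandas as pd
# Made-up sample standing in for path_Ids: lists of different lengths and value order.
path_Ids = [['a', 'b'], ['b', 'c', 'a'], ['c']]
df = pd.DataFrame(path_Ids)   # positional columns 0, 1, 2 padded with None
stacked = df.stack()          # one long Series of (row, position) -> ID
# One column per unique ID, then collapse the positions back into one row per observation.
onehot = pd.get_dummies(stacked, prefix='paths_').groupby(level=0).max()
print(onehot)
# For ~15M observations and ~35k unique IDs, passing sparse=True to get_dummies
# may be needed to keep this within memory.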

Sort columns by the correlation of the rows

I have a pandas dataframe with two or more rows and 42 columns. By transposing and plotting it, I get the profiles of the rows.
df.T.plot()
I want to sort the columns so that the columns where the rows are strongly correlated (similar profile, values going in the same direction) come first, followed by the columns where the rows are weakly correlated (opposite profile, values going in opposite directions).
I could run a clustering algorithm on the columns, but clusters are not exactly what I want.
I think one solution would be to sort by the distance of the points from the linear regression line?
Correlation is a measure that describes the relationship between two variables as a whole, not at specific points. The metric you've described for sorting isn't correlation, but rather the absolute difference between the column values in the two rows. (With the transpose operation the two rows become two columns, and their lines on the plot you're making 'go in opposite directions' where the values in the two columns are further apart.)
Achieving this with the dataframe you've described would look something like:
df_T = df.T
# Sort by the absolute difference between the two (transposed) rows.
df_T['sort_column'] = (df_T.panB - df_T.panC).abs()
df_T.sort_values('sort_column', inplace=True)
df_T.drop(columns='sort_column', inplace=True)  # drop the helper column, not a row
df_T.plot()

Random Sampling of Pandas data frame (both rows and columns)

I know how to randomly sample a few rows from a pandas data frame. Let's say I had a data frame df; then, to get a fraction of the rows, I can do:
df_sample = df.sample(frac=0.007)
However, what I need is random rows as above AND also random columns from the same data frame.
The df is currently 56K x 8.5K. If I want, say, 500 x 1000, where both the 500 rows and the 1000 columns are randomly sampled, how do I do this?
I think one approach would be to use df.columns to get a list of column names, then randomly sample indices from that list and use those indices to keep only the corresponding columns?
Just call sample twice, with corresponding axis parameters:
df.sample(n=500).sample(n=1000, axis=1)
For the first call, axis=0 is the default, so it samples rows; the second call, with axis=1, samples columns.
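If reproducibility matters, the same two-step sample can be seeded with random_state. A small usage sketch (the stand-in frame below is much smaller than the real 56K x 8.5K one):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(600, 120))
# 500 random rows, then 100 random columns of those rows; random_state makes both draws repeatable.
df_sample = df.sample(n=500, random_state=0).sample(n=100, axis=1, random_state=0)
print(df_sample.shape)  # (500, 100)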

Remove duplicate columns only by their values

I just got an assignment in which I have a lot of features (as columns) and records (as rows) in a CSV file, and I'm cleaning the data using Python (including pandas):
A,B,C
1,1,1
0,0,0
1,0,1
I would like to delete all the duplicate columns with the same values and keep only one of them; A and B would be the only columns to remain.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do that?
Thanks.
I would like to delete all the duplicate columns with the same values and keep only one of them. A will be the only column to remain.
You mean that's the only one kept among A and C, right? (B doesn't duplicate anything.)
You can use DataFrame.drop_duplicates
df = df.T.drop_duplicates().T
It works on rows, not columns, so I transpose before/after calling it.
I would like to combine the columns that have high Pearson correlation with the target value, how can I do it?
You can loop over the columns, pairing each with the target, and compute the correlation with DataFrame.corr or with numpy.corrcoef.
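A sketch of both steps together, using the toy data from the question plus a hypothetical target column (DataFrame.corrwith stands in for the explicit loop; the 0.8 threshold is arbitrary):
import pandas as pd
# Toy data from the question, plus an invented target column for illustration.
df = pd.DataFrame({'A': [1, 0, 1], 'B': [1, 0, 0], 'C': [1, 0, 1], 'target': [3, 1, 2]})
# Drop duplicate columns by value (C duplicates A and is removed).
df = df.T.drop_duplicates().T
# Pearson correlation of every remaining feature column with the target.
correlations = df.drop(columns='target').corrwith(df['target'])
print(correlations)
# Keep only the columns whose absolute correlation with the target is high.
high_corr_cols = correlations[correlations.abs() > 0.8].index.tolist()
print(high_corr_cols)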
