I am trying to do something quite simple: compute a Pearson correlation matrix of several variables that are given as columns of a DataFrame. I want it to ignore NaNs and also provide the p-values. scipy.stats.pearsonr is insufficient because it works only for two variables and cannot handle NaNs. There should be something better than that...
For example,
df = pd.DataFrame([[1,2,3],[6,5,4],[1,None,9]])
0 1 2
0 1 2.0 3
1 6 5.0 4
2 1 NaN 9
The columns of df are the variables and the rows are observations. I would like a command that returns a 3x3 correlation matrix, along with a 3x3 matrix of corresponding p-values, omitting the None values pairwise. That is, the correlation between [1,6,1] and [2,5,NaN] should be the correlation between [1,6] and [2,5].
There must be a nice Pythonic way to do that; can anyone please suggest one?
If you have your data in a pandas DataFrame, you can simply use df.corr().
From the docs:
DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values
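df.corr() handles the NaNs pairwise, but it does not return p-values. Here is a minimal sketch for the p-value matrix, running scipy.stats.pearsonr on each pair of columns after masking out rows where either value is missing (corr_with_pvalues is my own helper name, not a pandas function):

import numpy as np
import pandas as pd
from scipy import stats

def corr_with_pvalues(df):
    # pairwise Pearson r and p-values, dropping NaNs per pair of columns
    cols = df.columns
    r = pd.DataFrame(np.nan, index=cols, columns=cols)
    p = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i in cols:
        for j in cols:
            mask = df[i].notna() & df[j].notna()
            if mask.sum() >= 2:  # pearsonr needs at least two observations
                r.loc[i, j], p.loc[i, j] = stats.pearsonr(df.loc[mask, i], df.loc[mask, j])
    return r, p

df = pd.DataFrame([[1, 2, 3], [6, 5, 4], [1, None, 9]])
r, p = corr_with_pvalues(df)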
Naive problem here: let's say I have a dataframe that is divided into df1 and df2.
Now, df1 is composed of the following variables:
categ_var_1  categ_var_2  binary_target
A            1            1
B            1            1
C            4            0
B            5            0
B            5            1
...          ...          ...
Goal: I want to use df1 to fill in the missing column in df2, which has the same categorical variables (with different data), but the binary_target is completely missing:
categ_var_1  categ_var_2
F            3
B            2
A            5
A            5
B            1
...          ...
Which would be the best approach to resolve this in a simple manner?
My first guess was to use a Machine Learning model (using categorical variables as predictors), but I wouldn't be able to contrast the results, since df2 has no target variable.
My second guess was to merge both sets and do a column-imputation, though the final result wouldn't be accurate.
What do you think? Any help would be highly appreciated! (The only restriction is to use Python!)
If you have all possible pairs of "categ_var_1" and "categ_var_2", you can map category values to target values and use the mapper on df2:
Something like:
mapper = df1.set_index(['categ_var_1','categ_var_2'])['binary_target'].to_dict()
out = df2.apply(tuple, axis=1).map(mapper)
Then it gives:
0 NaN
1 NaN
2 NaN
3 NaN
4 1.0
As you can see, there are lots of NaN values because those combinations are not present in the mapper.
Then what you can do is create dummy variables from each category and use a classification algorithm, such as logistic regression (scikit-learn has a class for it). Use df1 as the training set and df2 as the prediction set. This will obviously give you a prediction, not an exact value. Also note that depending on the number of options in each category, the number of dummy variables might become enormous, so df1 must be large as well, otherwise the prediction won't make much sense.
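A minimal sketch of that second approach, assuming df1 and df2 as above (casting to str forces get_dummies to encode the numeric category too, and the reindex aligns df2's dummy columns with the training columns):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# one-hot encode the categorical predictors in both frames
X_train = pd.get_dummies(df1[['categ_var_1', 'categ_var_2']].astype(str))
X_new = pd.get_dummies(df2[['categ_var_1', 'categ_var_2']].astype(str))

# give df2 exactly the same dummy columns as df1, filling unseen ones with 0
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

model = LogisticRegression()
model.fit(X_train, df1['binary_target'])
df2['binary_target'] = model.predict(X_new)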
In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, df.sum(axis=0) should be called.
But it returns the sum of values in each column, and vice versa. Why???
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
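A quick check of that interpretation on the example frame:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

df.sum(axis=0)  # sum over the rows -> one total per column: x 15, y 30
df.sum(axis=1)  # sum across the columns -> one total per row: 3, 6, 9, 12, 15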
I was reading the source code of the pandas project, and I think this comes from NumPy: that library uses the axis argument the same way (0 sums vertically, 1 horizontally), and pandas uses NumPy under the hood to perform the sum.
In this link you can check that pandas uses the numpy.cumsum function to perform the sum.
And this link is for the NumPy documentation.
If you are looking for a way to remember how to use the axis parameter, the 'anant' answer is a good approach: interpret the sum as being over the axis rather than across it. So when 0 is specified you are computing the sum over the rows (iterating over the index, to be more pandas-doc compliant). When axis is 1 you are iterating over the columns.
I would like to compute a quantity called "downside beta".
Let's suppose I have a dataframe df:
df = pd.DataFrame({'A': [-0.1,0.3,-0.4, 0.8,-0.5],'B': [-0.2,0.5,0.3,-0.5,0.1]},index=[0, 1, 2, 3,4])
I would like to add a column, 'C' that computes this downside beta defined as the covariance between the columns A and B considering only the negative values of column A with the corresponding values of B. This covariance should be then divided by the variance of column A considering only the negative values.
In the above example, it should be equivalent to computing the covariance between the two series [-0.1,-0.4,-0.5] and [-0.2,0.3,0.1], divided by the variance of the series [-0.1,-0.4,-0.5].
Next step would be to roll this metric over the index of an initial large dataframe df.
Is there an efficient, vectorized way to do that? I guess combining pd.rolling_cov and np.where?
Thank you!
Is this what you're looking for? You can filter out positive values and then call pandas cov and var functions accordingly:
v = df[df.A.lt(0)]
v.cov() / v.A.var()
A B
A 1.000000 -0.961538
B -0.961538 1.461538
If you just want the single value below the diagonal (the beta itself),
np.diag(v.cov() / v.A.var(), k=-1)
array([-0.96153846])
For a rolling window, you may need to jump through a few hoops, but this should be doable:
v = df[df.A.lt(0)]
i = v.rolling(3).cov().A.groupby(level=0).last()
j = v.rolling(3).A.var()
i / j
0 NaN
2 NaN
4 -0.961538
Name: A, dtype: float64
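One caveat: filtering before rolling makes each window span three negative-A observations rather than three consecutive rows of the original frame. If the window should be defined on the original index, a plain (non-vectorized) reference implementation is straightforward; downside_beta below is my own helper, not a pandas method:

import numpy as np
import pandas as pd

def downside_beta(window):
    # keep only the observations where A is negative, then cov(A, B) / var(A)
    neg = window[window['A'] < 0]
    if len(neg) < 2:
        return np.nan
    return neg['A'].cov(neg['B']) / neg['A'].var()

window = 3
beta = pd.Series(
    [downside_beta(df.iloc[max(0, i - window + 1):i + 1]) for i in range(len(df))],
    index=df.index,
)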
I have a dataframe with 4 columns: an ID and three categories that the results fell into.
<80% 80-90 >90
id
1 2 4 4
2 3 6 1
3 7 0 3
I would like to convert it to percentages ie:
<80% 80-90 >90
id
1 20% 40% 40%
2 30% 60% 10%
3 70% 0% 30%
This seems like it should be within pandas' capabilities, but I just can't figure it out.
Thanks in advance!
You can do this using basic pandas operators .div and .sum, using the axis argument to make sure the calculations happen the way you want:
cols = ['<80%', '80-90', '>90']
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0).multiply(100)
Calculate the sum of each row (df[cols].sum(axis=1)); axis=1 makes the summation occur across the columns, giving one total per row rather than one per column.
Divide the dataframe by the resulting series (df[cols].div(df[cols].sum(axis=1), axis=0)); axis=0 makes the division align on the row index, so each row is divided by its own total.
To finish, multiply the results by 100 so they are percentages between 0 and 100 instead of proportions between 0 and 1 (or you can skip this step and store them as proportions).
df/df.sum()
If you want to divide by the sum of each row instead, transpose it first.
You could use the .apply() method:
df = df.apply(lambda x: x/sum(x)*100, axis=1)
Tim Tian's answer pretty much worked for me, but maybe this helps if you have a df with several columns and want to compute a % column-wise.
df_pct = df/df[df.columns].sum()*100
I was having trouble because I wanted to have the result of a pd.pivot_table expressed as a %, but couldn't get it to work. So I just used that code on the resulting table itself and it worked.
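A minimal sketch of that case with made-up data (the column names grp and cat are placeholders, and pd.crosstab is used here just to build a small table of counts):

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b', 'b'],
                   'cat': ['x', 'y', 'x', 'x', 'y']})

# counts per (grp, cat) combination
pivot = pd.crosstab(df['grp'], df['cat'])

# express each column as a percentage of its column total
pct = pivot / pivot.sum() * 100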
I have a time series of counts obtained with
data_g = data.groupby(['A', 'B'])
incidence = data_g.size()
incidence
And I get something like this:
A B
A30.9 200801 4
200802 9
200803 9
200804 5
200805 3
...
A is a categorical variable with two values. B is yearweek. So this is weekly counts of two time-series (named after the values of A) over many years.
What I want is simply to plot these series. I did not find a simple way to do this. What do you suggest?
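A minimal sketch of one way to do this, assuming incidence is the MultiIndex Series shown above: unstack the A level into columns so pandas draws one line per category.

import matplotlib.pyplot as plt

# one column per value of A, indexed by yearweek B
incidence.unstack('A').plot()
plt.show()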