I am trying to calculate the covariance between two columns by group. I am doing the following:
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
B = A.groupby('group')
B['value1'].cov(B['value2'])
Ideally, I would like to get just the covariance between value1 and value2, not the whole variance-covariance matrix, since I only have two columns.
Thank you,
You are almost there; the missing piece is a clear understanding of the groupby object. See Pandas-GroupBy for more details.
For your problem, if I understand correctly, you would like to calculate the covariance between two columns within each group.
The simplest approach is the groupby.cov function, which gives the pairwise covariance matrix within each group.
A.groupby('group').cov()
value1 value2
group
A value1 1.666667 -2.666667
value2 -2.666667 4.666667
B value1 1.000000 0.500000
value2 0.500000 0.333333
If you only need cov(grouped_v1, grouped_v2):
grouped = A.groupby('group')
grouped.apply(lambda x: x['value1'].cov(x['value2']))
group
A -2.666667
B 0.500000
Here, grouped is a groupby object. The grouped.apply function takes a callback function as its argument and calls it once per group. In this case the callback is a lambda function, and its argument x is one group (a DataFrame).
Hope this will be helpful for your understanding of groupby.
The following code gives you the grouped variance-covariance matrix. You can subset it as you wish to just get the covariances.
import pandas as pd
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
print(A.groupby('group').cov())
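For example, here is a minimal sketch of one way to subset that matrix down to just cov(value1, value2) per group, slicing the inner index level with .xs:

# the grouped cov matrix has a MultiIndex (group, variable) on its rows;
# .xs picks the 'value1' row of each block, and its 'value2' column is
# cov(value1, value2) for every group
cov_matrix = A.groupby('group').cov()
print(cov_matrix.xs('value1', level=1)['value2'])
# group
# A   -2.666667
# B    0.500000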
Here is an alternative solution that estimates cov(value1, value2) within each group, but doesn't use .apply():
import pandas as pd
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'],
'value1':[1,2,3,4,5,6,7],
'value2':[8,5,4,3,7,8,8]})
B = A.groupby('group')
cov_a_b = B[['value1', 'value2']].cov(ddof=0)['value1'].unstack()['value2']
As an additional note somewhat related to the question, be careful when using the NumPy/pandas implementations of variance and covariance: pandas applies a degrees-of-freedom correction of 1 by default (confusingly, NumPy's variance defaults to ddof=0). This is why I included ddof=0.
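A quick sketch to make the difference concrete:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.var())           # 1.666..., pandas defaults to ddof=1 (sample variance)
print(np.var(s.values))  # 1.25, NumPy's var defaults to ddof=0
print(s.var(ddof=0))     # 1.25, matching NumPy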
If you're looking for the covariance of two specific columns, you can use df.Age.cov(df.Salary), assuming that Age and Salary are two of the many columns in the DataFrame. This is useful when you only need those two columns.
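A runnable sketch (Age and Salary here are made-up columns purely for illustration):

import pandas as pd

df = pd.DataFrame({'Age': [25, 32, 47, 51],
                   'Salary': [40000, 52000, 81000, 90000]})
print(df.Age.cov(df.Salary))  # a single scalar, not a matrix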
It would be great to understand how this actually works; perhaps there is something in Python/Pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
My numerator and denominator are both 10 x 9. Why is the output 10 x 10, with the first column being NaNs?
Why are the results all 0, besides the first column being NaNs? I.e., why wasn't the calculation actually performed?
Thanks.
When we do the division, pandas first aligns the index and columns of df_price.iloc[:,1:] and df_price.iloc[:,:-1]. Adding .values to one operand strips its index and columns, so the division is done positionally, and the output is what we expected.
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0 NaN # index 0 exists only in s.iloc[:-1]
1 1.0
2 NaN # index 2 exists only in s.iloc[1:]
dtype: float64
From the above we can see that pandas objects align on the index first, and the alignment behaves like an outer join.
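To see the contrast, here is the same division done purely positionally, continuing the example above:

s.iloc[1:].values / s.iloc[:-1].values
# array([2. , 1.5]); elements 4/2 and 6/4 line up by position, no labels involved

As an aside, pandas also has pct_change, which computes (x_t / x_{t-1}) - 1 directly; depending on your pandas version it may accept axis=1 for the column-wise case.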
I'm successfully using the groupby() function to compute statistics on grouped data, however, I'd now like to do the same for subsets of each group.
I can't seem to understand how to generate a subset for each group (as a groupby object) that can then be applied to a groupby function such as mean(). The following line works as intended:
d.groupby(['X','Y'])['Value'].mean()
How can I subset the values of the individual groups to then supply to the mean function? I suspect transform() or filter() might be useful though I can't figure out how.
EDIT to add reproducible example:
import numpy as np
import pandas as pd

np.random.seed(881)
value = np.random.randn(15)
letter = np.random.choice(['a','b','c'],15)
date = np.repeat(pd.date_range(start = '1/1/2001', periods=3), 5)
data = {'date':date,'letter':letter,'value':value}
df = pd.DataFrame(data)
df.groupby(['date','letter'])['value'].mean()
date letter
2001-01-01 a -0.039407
b -0.350787
c 1.221200
2001-01-02 a -0.688744
b 0.346961
c -0.702222
2001-01-03 a 1.320947
b -0.915636
c -0.419655
Name: value, dtype: float64
Here's an example of calculating the mean of the multi-level group. Now I'd like to find the mean of a subset of each group, for example the mean of each group's data that is below the group's 10th percentile. The key takeaway is that the subsets must be computed on the groups, not on the entire df first.
I think the function you're looking for is quantile(), which you can use inside a groupby().apply() call. For the tenth percentile, use quantile(.1):
df.groupby(['date','letter'])['value'].apply(lambda g: g[g <= g.quantile(.1)].mean())
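The same thing with a named function can be easier to read; a sketch (mean_below_p10 is just a name I made up):

def mean_below_p10(g):
    # keep only the values at or below the group's 10th percentile, then average
    return g[g <= g.quantile(.1)].mean()

df.groupby(['date', 'letter'])['value'].apply(mean_below_p10)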
I'm looking for help with the Pandas .corr() method.
As is, I can use the .corr() method to calculate a heatmap of every possible combination of columns:
corr = data.corr()
sns.heatmap(corr)
Which, on my dataframe of 23,000 columns, may terminate near the heat death of the universe.
I can also do the more reasonable correlation between a subset of columns:
data2 = data[list_of_column_names]
corr = data2.corr(method="pearson")
sns.heatmap(corr)
That gives me something that I can use; here's an example of what that looks like:
[example heatmap of the column subset]
What I would like to do is compare a list of 20 columns with the whole dataset. The normal .corr() function can give me a 20x20 or 23,000x23,000 heatmap, but essentially I would like a 20x23,000 heatmap.
How can I add more specificity to my correlations?
Thanks for the help!
Make a list of the subset that you want (in this example it is A, B, and C), create an empty dataframe, then fill it with the desired values using a nested loop.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50, 7), columns=list('ABCDEFG'))

# initialise an empty DataFrame
corr = pd.DataFrame()
for a in list('ABC'):
    for b in df.columns:
        # compute only the pairs we need instead of the full matrix
        corr.loc[a, b] = df[a].corr(df[b])
corr
corr
Out[137]:
A B C D E F G
A 1.000000 0.183584 -0.175979 -0.087252 -0.060680 -0.209692 -0.294573
B 0.183584 1.000000 0.119418 0.254775 -0.131564 -0.226491 -0.202978
C -0.175979 0.119418 1.000000 0.146807 -0.045952 -0.037082 -0.204993
sns.heatmap(corr)
After working through this last night, I came to the following answer:
import scipy.stats
import pandas as pd
import seaborn as sns

# DataFrame imported earlier as 'data'
# Create a new dictionary
plotDict = {}
# Loop across each of the two lists that contain the items you want to compare
for gene1 in list_1:
    for gene2 in list_2:
        # Do a Pearson r comparison between the two items you want to compare
        tempDict = {(gene1, gene2): scipy.stats.pearsonr(data[gene1], data[gene2])}
        # Update the dictionary each time you do a comparison
        plotDict.update(tempDict)
# Unstack the dictionary into a DataFrame
dfOutput = pd.Series(plotDict).unstack()
# Optional: take just the Pearson r value out of the (r, p-value) tuple
dfOutputPearson = dfOutput.apply(lambda col: col.apply(lambda v: v[0]))
# Optional: generate a heatmap
sns.heatmap(dfOutputPearson)
Much like the other answers, this generates a heatmap, but it scales to a 20,000x30 matrix without computing the correlation between all 20,000x20,000 combinations (and therefore finishes much sooner).
Usually, calculating correlation coefficients pairwise for all variables makes the most sense; DataFrame.corr() is a convenience function that does exactly that, for all pairs.
With scipy you can also compute it only for specified pairs, within a loop.
Example:
d=pd.DataFrame([[1,5,8],[2,5,4],[7,3,1]], columns=['A','B','C'])
One pair in pandas could be:
d.corr().loc['A','B']
-0.98782916114726194
Equivalent in scipy:
import scipy.stats
scipy.stats.pearsonr(d['A'].values,d['B'].values)[0]
-0.98782916114726194
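If you do need the full subset-vs-all block (e.g. 20 x 23,000) without building the complete square matrix, one vectorised sketch is to standardise the columns and take a matrix product (this assumes all-numeric columns with no NaNs; subset_corr is a hypothetical helper name):

import pandas as pd

def subset_corr(df, subset):
    # z-score every column using the population std (ddof=0) ...
    z = (df - df.mean()) / df.std(ddof=0)
    # ... then Pearson correlation is the mean product of z-scores
    out = z[subset].T.values @ z.values / len(df)
    return pd.DataFrame(out, index=subset, columns=df.columns)

Used as subset_corr(data, list_of_column_names), this would give the 20 x 23,000 frame asked about, which sns.heatmap can plot.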
I'd like to do some math on a series vector. I'd like to take the difference between two rows in a vector. My first intuition was:
def row_diff(prev, next):
return(next - prev)
and then using it
my_col_vec.apply(row_diff)
but this doesn't do what I'd like. It appears apply is row-wise, which is fine, but I can't seem to find an equivalent operation that will allow me to easily create a new vector from the old one by subtracting the previous row from the next.
Is there a better way to do this? I've been reading this document and it doesn't look like it.
Thanks!
To calculate inter-row differences use diff:
In [6]:
df = pd.DataFrame({'a':np.random.rand(5)})
df
Out[6]:
a
0 0.525220
1 0.031826
2 0.260853
3 0.273792
4 0.281368
In [7]:
df['diff'] = df['a'].diff()
df
Out[7]:
a diff
0 0.525220 NaN
1 0.031826 -0.493394
2 0.260853 0.229027
3 0.273792 0.012940
4 0.281368 0.007576
Also, please try to avoid using apply, as there is usually a vectorised method available.
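For reference, diff is equivalent to subtracting a shifted copy of the series, which makes the row-to-row alignment explicit:

df['diff2'] = df['a'] - df['a'].shift(1)  # same values as df['a'].diff()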
Is there a way to apply a list of functions to each column in a DataFrame like the DataFrameGroupBy.agg function does? I found an ugly way to do it like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(one=np.random.uniform(0,10,100), two=np.random.uniform(0,10,100)))
df.groupby(np.ones(len(df))).agg(['mean','std'])
one two
mean std mean std
1 4.802849 2.729528 5.487576 2.890371
For Pandas 0.20.0 or newer, use df.agg (thanks to ayhan for pointing this out):
In [11]: df.agg(['mean', 'std'])
Out[11]:
one two
mean 5.147471 4.964100
std 2.971106 2.753578
For older versions, you could use
In [61]: df.groupby(lambda idx: 0).agg(['mean','std'])
Out[61]:
one two
mean std mean std
0 5.147471 2.971106 4.9641 2.753578
Another way would be:
In [68]: pd.DataFrame({col: [getattr(df[col], func)() for func in ('mean', 'std')] for col in df}, index=('mean', 'std'))
Out[68]:
one two
mean 5.147471 4.964100
std 2.971106 2.753578
In the general case where you have arbitrary functions and column names, you could do this:
df.apply(lambda r: pd.Series({'mean': r.mean(), 'std': r.std()})).transpose()
mean std
one 5.366303 2.612738
two 4.858691 2.986567
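As a follow-up, df.agg also accepts custom callables mixed in with the names, which covers many arbitrary-function cases; a quick sketch, reusing the df above (the lambda's output row is labelled '<lambda>'):

df.agg(['mean', 'std', lambda s: s.max() - s.min()])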
I applied three functions to a column and it works:
import re

# removing the newline character
rem_newline = lambda x: re.sub('\n', ' ', x).strip()
# lowercasing and removing surrounding spaces
lower_strip = lambda x: x.lower().strip()

df = df['users_name'].apply(lower_strip).apply(rem_newline).str.split('(', n=1, expand=True)
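The two element-wise lambdas could also be collapsed into a single apply, if preferred; a sketch:

df['users_name'].apply(lambda x: re.sub('\n', ' ', x.lower().strip()).strip()).str.split('(', n=1, expand=True)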
I am using pandas to analyze Chilean legislation drafts. In my dataframe, the list of authors are stored as a string. The answer above did not work for me (using pandas 0.20.3). So I used my own logic and came up with this:
df.authors.apply(eval).apply(len).sum()
Concatenated applies! A pipeline!! The first apply transforms
"['Barros Montero: Ramón', 'Bellolio Avaria: Jaime', 'Gahona Salazar: Sergio']"
into the obvious list, and the second apply counts the number of lawmakers involved in the project. I want the size of every (lawmaker, project number) pair, so I can presize an array in which I will study which parties work on what.
Interestingly, this works! Even more interestingly, that last call fails if one gets too ambitious and does this instead:
df.authors.apply(eval).apply(len).apply(sum)
with an error:
TypeError: 'int' object is not iterable
coming from deep within /site-packages/pandas/core/series.py in apply.
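The reason: Series.apply calls its function once per element, so .apply(sum) attempts sum(3) on a plain integer, and an integer is not iterable; the .sum() method instead aggregates the whole Series. A minimal sketch of the distinction:

lengths = df.authors.apply(eval).apply(len)  # a Series of ints
lengths.sum()         # the Series method aggregates: works
# lengths.apply(sum)  # calls sum(3), sum(1), ... element-wise: TypeError

One more caveat: eval will execute arbitrary code embedded in the strings. If the column is not fully trusted, ast.literal_eval parses only Python literals and is a safer drop-in here:

import ast

df.authors.apply(ast.literal_eval).apply(len).sum()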