I have a time series of counts obtained with
data_g = data.groupby(['A', 'B'])
incidence = data_g.size()
incidence
And I get something like this:
A B
A30.9 200801 4
200802 9
200803 9
200804 5
200805 3
...
A is a categorical variable with two values. B is yearweek. So this is weekly counts of two time-series (named after the values of A) over many years.
What I want is simply to plot these series. I did not find a simple way to do this. What do you suggest?
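One approach (a sketch, assuming the grouped counts from above and matplotlib): unstack the 'A' level of the MultiIndex so each category becomes its own column, then let pandas draw one line per column.
import matplotlib.pyplot as plt

# incidence is a Series with an (A, B) MultiIndex; unstacking A gives one column per category
incidence = data.groupby(['A', 'B']).size()
ax = incidence.unstack('A').plot()
ax.set_xlabel('yearweek (B)')
ax.set_ylabel('weekly count')
plt.show()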
I basically have two problems which are very simple, yet I'm stuck because I'm not used to working with statistics in Python.
The first problem is that I have a dataset with a large number of rows and a few labels, and I'm interested in two of them. I want to plot a bar chart (in either of the two ways shown below) that shows the number of rows belonging to a particular class of the first label, broken down by the class of the second label. For example, if Group A has 160 rows and 40 of them also belong to Series 1, I want it plotted like below (there's a link to an image)
Group Series
0 Group A Series 2
1 Group B Series 1
2 Group B Series 5
3 Group A Series 4
0 Group A Series 1
1 Group B Series 3
2 Group B Series 3
3 Group A Series 2
Image
The second problem is that I would also like to know if there is any function that, given two labels, tells me for each class of the first label the percentage of each of the labels of the second column. Like in the problem above, but instead of visually, I want it numerically. The output would be something like
Group A : 24% Series 1, 17% Series 2, 11% Series 3...
Group B: 27% Series 1, 13% Series 2, 10% Series 3...
I've tried plotting with the hue argument, but I don't know how to count the number of observations.
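A sketch of one way to do both with pandas.crosstab, assuming the two labels live in columns named 'Group' and 'Series' of a DataFrame df (names taken from the sample above):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Group':  ['Group A', 'Group B', 'Group B', 'Group A',
               'Group A', 'Group B', 'Group B', 'Group A'],
    'Series': ['Series 2', 'Series 1', 'Series 5', 'Series 4',
               'Series 1', 'Series 3', 'Series 3', 'Series 2'],
})

# Count rows per (Group, Series) pair and draw a grouped bar chart
counts = pd.crosstab(df['Group'], df['Series'])
counts.plot(kind='bar')
plt.show()

# Percentage of each Series within each Group, numerically
pct = pd.crosstab(df['Group'], df['Series'], normalize='index') * 100
print(pct.round(1))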
In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, df.sum(axis=0) is called.
But it returns the sum of values in each column, and vice versa. Why?
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
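To make this concrete with the frame from the question:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

# axis=0: sum over the rows, one total per column
print(df.sum(axis=0))   # x 15, y 30

# axis=1: sum across the columns, one total per row (the expected output above)
print(df.sum(axis=1))   # 0 3, 1 6, 2 9, 3 12, 4 15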
I was reading the source code of the pandas project, and I think this comes from NumPy: that library uses the axis parameter the same way (0 sums vertically and 1 horizontally), and pandas uses NumPy under the hood to perform the sum.
In this link you can check that pandas uses the numpy.cumsum function to perform the sum.
And this link is for the NumPy documentation.
If you are looking for a way to remember how to use the axis parameter, anant's answer is a good approach: interpret the sum as being over the axis rather than in its direction. So when you specify 0 you are computing the sum over the rows (iterating over the index, to be more compliant with the pandas docs). When axis is 1 you are iterating over the columns.
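For comparison, NumPy follows the same convention, which is where the pandas behaviour comes from:
import numpy as np

arr = np.array([[1, 2], [3, 4], [5, 6]])
print(arr.sum(axis=0))  # [ 9 12] -> collapses the rows, one total per column
print(arr.sum(axis=1))  # [ 3  7 11] -> collapses the columns, one total per row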
I have a pandas DataFrame which consists of three columns, A, B, and C, and I need to sum up values based on row values.
Below is the scenario
A B C
Distance_a distance_b 5
Distance_a distance_c 6
distance_b distance_c 7
distance_b distance_d 7
distance_d Distance_a 9
If I want to find the cumulative distance from distance_a, my code needs to add 5 and 6, and it should also consider the last row (distance_d, Distance_a) and add the 9 as well.
So the cumulative distance from distance_a will be 5 + 6 + 9 = 20.
Hongpei's answer is certainly more efficient, but if you just want the sum for distance_a, you can do the following as well:
import pandas as pd

# Example data from the question
data = {'A': ['distance_a', 'distance_a', 'distance_b', 'distance_b', 'distance_d'],
        'B': ['distance_b', 'distance_c', 'distance_c', 'distance_d', 'distance_a'],
        'C': [5, 6, 7, 7, 9]}

# Create the pandas DataFrame
df = pd.DataFrame(data)

# Sum column C grouped by column A and by column B separately
col_A_groupby = df.groupby('A')['C'].sum()
col_B_groupby = df.groupby('B')['C'].sum()

# Add the two partial sums for distance_a together
dist_a_sum = col_A_groupby.loc['distance_a'] + col_B_groupby.loc['distance_a']
print(dist_a_sum)  # 20
There can be an easy workaround. Suppose your original DataFrame is df; then you only need to:
pd.concat([df[['A', 'C']],
           df[['B', 'C']].rename(columns={'B': 'A'})],
          sort=False).groupby('A').sum()
Basically what I did was concat df[['A','C']] and df[['B','C']] together (while renaming the second frame's columns to ['A','C']), and then groupby 'A'.
IIUC, a melt and sum are enough
s = df.melt('C').groupby('value').C.sum()
print(s)
Out[113]:
value
Distance_a 20
distance_b 19
distance_c 13
distance_d 16
Name: C, dtype: int64
I would like to compute a quantity called "downside beta".
Let's suppose I have a dataframe df:
df = pd.DataFrame({'A': [-0.1,0.3,-0.4, 0.8,-0.5],'B': [-0.2,0.5,0.3,-0.5,0.1]},index=[0, 1, 2, 3,4])
I would like to add a column 'C' that computes this downside beta, defined as the covariance between columns A and B considering only the negative values of column A together with the corresponding values of B. This covariance should then be divided by the variance of column A, again considering only its negative values.
In the above example, this is equivalent to computing the covariance between the two series [-0.1, -0.4, -0.5] and [-0.2, 0.3, 0.1], divided by the variance of the series [-0.1, -0.4, -0.5].
Next step would be to roll this metric over the index of an initial large dataframe df.
Is there an efficient, vectorized way to do that? I guess combining pd.rolling_cov and np.where?
Thank you!
Is this what you're looking for? You can filter out positive values and then call pandas cov and var functions accordingly:
v = df[df.A.lt(0)]
v.cov() / v.A.var()
A B
A 1.000000 -0.961538
B -0.961538 1.461538
If you just want the single value (the off-diagonal entry),
np.diag(v.cov() / v.A.var(), k=-1)
array([-0.96153846])
For a rolling window, you may need to jump through a few hoops, but this should be doable:
v = df[df.A.lt(0)]
i = v.rolling(3).cov().A.groupby(level=0).last()
j = v.rolling(3).A.var()
i / j
0 NaN
2 NaN
4 -0.961538
Name: A, dtype: float64
I am trying to do something quite simple: compute a Pearson correlation matrix of several variables that are given as columns of a DataFrame. I want it to ignore NaNs and also provide the p-values. scipy.stats.pearsonr is insufficient because it works only for two variables and cannot account for NaNs. There should be something better than that...
For example,
df = pd.DataFrame([[1,2,3],[6,5,4],[1,None,9]])
0 1 2
0 1 2.0 3
1 6 5.0 4
2 1 NaN 9
The columns of df are the variables and the rows are observations. I would like a command that returns a 3x3 correlation matrix, along with a 3x3 matrix of corresponding p-values. I want it to omit the None values. That is, the correlation between [1,6,1] and [2,5,NaN] should be the correlation between [1,6] and [2,5].
There must be a nice Pythonic way to do that; can anyone please suggest one?
If you have your data in a pandas DataFrame, you can simply use df.corr().
From the docs:
DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values
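df.corr() handles the NaN part but does not return p-values. A sketch of one way to get both, assuming scipy is available, computing the p-values pairwise with scipy.stats.pearsonr after dropping NaNs for each pair:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame([[1, 2, 3], [6, 5, 4], [1, None, 9]])

# Pairwise Pearson correlations, excluding NA/null values pair by pair
corr = df.corr(method='pearson')

# p-values computed pairwise with the same NaN handling; diagonal left at zero
pvals = pd.DataFrame(np.zeros(corr.shape), index=corr.index, columns=corr.columns)
for i in corr.index:
    for j in corr.columns:
        if i == j:
            continue
        pair = df[[i, j]].dropna()
        if len(pair) >= 2:
            pvals.loc[i, j] = stats.pearsonr(pair[i], pair[j])[1]

print(corr)
print(pvals)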