Let us assume we created a dataframe df using the code below. I have created a bin frequency count based on the 'value' column in df. Now how do I get the frequency count of the label = 1 samples based on the previously created bins? Obviously, I should not run qcut again on just the label = 1 samples to get the count, since the bin positions would not be the same as before.
import numpy as np
import pandas as pd
mu, sigma = 0, 0.1
theta = 0.3
s = np.random.normal(mu, sigma, 100)
group = np.random.binomial(1, theta, 100)
df = pd.DataFrame(np.vstack([s,group]).transpose())
df.columns = ['value','label']
factor = pd.qcut(df['value'], 5)
factor_bin_count = pd.value_counts(factor)
Update: I took the solution from Jeff:
df.groupby(['label',factor]).value.count()
If I understand your question, you want to take a grouping factor (e.g. the one you created using qcut to bin the continuous values) and another grouper (e.g. 'label'), then perform an operation (count in this case).
In [36]: df.groupby(['label',factor]).value.count()
Out[36]:
label value
0 [-0.248, -0.0864] 14
(-0.0864, -0.0227] 15
(-0.0227, 0.0208] 15
(0.0208, 0.0718] 17
(0.0718, 0.24] 13
1 [-0.248, -0.0864] 6
(-0.0864, -0.0227] 5
(-0.0227, 0.0208] 5
(0.0208, 0.0718] 3
(0.0718, 0.24] 7
Name: value, dtype: int64
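If you only need the label == 1 counts from that result, you can slice or reshape the grouped Series; a small follow-up sketch reusing df and factor from above (note that label is stored as a float here, since df was built from a single float array):
counts = df.groupby(['label', factor]).value.count()
counts.loc[1]               # counts for the label == 1 samples, per bin
counts.unstack('label')     # or view the same result as a bins-by-label table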
I have the following matrix in pandas:
import numpy as np
import pandas as pd
df_matrix = pd.DataFrame(np.random.random((10, 10)))
I need to get a vector that contains 10 median values, one value across each of the blue lines shown in the picture below (the anti-diagonals of the lower-right triangle).
The last number in the output vector comes from a single element (the bottom-right corner), so it is basically one number rather than a true median.
X = np.random.random((10, 10))
fX = np.fliplr(X)  # flip left-right so the anti-diagonals of X become the diagonals of fX
np.array([np.median(np.diag(fX, k=-k)) for k in range(X.shape[0])])  # k = 0..9 walks down the lower triangle
The anti-diagonals are such that row_num + col_num = constant. So you can stack the frame, sum the row and column index levels, and group by that sum:
(df_matrix.stack().reset_index(name='val')
.assign(diag=lambda x: x.level_0+x.level_1) # enumerate the diagonals
.groupby('diag')['val'].median() # median by diagonal
 .loc[len(df_matrix) - 1:]             # lower-triangle anti-diagonals (diag >= 9)
)
Output (for np.random.seed(42)):
diag
9 0.473090
10 0.330898
11 0.531382
12 0.440152
13 0.548075
14 0.325330
15 0.580145
16 0.427541
17 0.248817
18 0.107891
Name: val, dtype: float64
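Both answers compute the same ten medians; a quick consistency sketch, rebuilding the seeded matrix so the check is self-contained:
import numpy as np
import pandas as pd

np.random.seed(42)
df_matrix = pd.DataFrame(np.random.random((10, 10)))

# numpy route: anti-diagonals become ordinary diagonals after fliplr
fX = np.fliplr(df_matrix.to_numpy())
medians_np = np.array([np.median(np.diag(fX, k=-k)) for k in range(len(df_matrix))])

# pandas route: group by the sum of row and column indices
medians_pd = (df_matrix.stack().reset_index(name='val')
              .assign(diag=lambda x: x.level_0 + x.level_1)
              .groupby('diag')['val'].median()
              .loc[len(df_matrix) - 1:])

assert np.allclose(medians_np, medians_pd.to_numpy())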
I am looking for a way to make n (e.g. 20) groups in a dataframe by percentile of a specific column (the data type is float). I am not sure whether a groupby/quantile function can take care of this, and if it can, what the code should look like.
There are 3 columns: a, b, c.
I.e. the data are sorted by column 'a' and split into 20 groups:
Group 1 = 0 to 5 percentile
Group 2 = 5 to 10 percentile
.
.
.
Group 20 = 95 to 100 percentile.
Would there also be a way to find the mean of a, b, and c for each group, and collect those means into another dataframe?
You can create 20 equal-sized bins using this (21 quantile edges give 20 bins, and duplicates='drop' protects against repeated edges):
df['newcol'] = pd.qcut(df.a, np.linspace(0, 1, 21), duplicates='drop')
Then you can group by newcol to find the summary stats of the a, b and c columns:
df.groupby(['newcol']).mean()
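For completeness, here is a self-contained sketch of the whole workflow on made-up data (the column names a, b, c follow the question; the random data, seed, and the 'group' column name are just for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(size=1000),
                   'b': rng.normal(size=1000),
                   'c': rng.normal(size=1000)})

# 20 equal-sized percentile bins on 'a'; labels=False returns bin numbers 0..19
df['group'] = pd.qcut(df['a'], 20, labels=False) + 1   # shift to 1..20 as in the question

summary = df.groupby('group')[['a', 'b', 'c']].mean()  # mean of a, b and c per group
print(summary.head())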
# group by percentile
profitdf['quantile_a'] = pd.qcut(profitdf['a'], 20)
profitdf['quantile_b'] = pd.qcut(profitdf['b'], 20)
quantile_a = profitdf.groupby(['quantile_a']).mean()
quantile_b = profitdf.groupby(['quantile_b']).mean()
Solved. Thank you everyone.
In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, I call df.sum(axis=0).
But it returns the sum of values in each column, and vice versa. Why???
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is as the axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
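To make that concrete with the question's dataframe, a quick sketch of both calls:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

print(df.sum(axis=0))   # sum over the rows -> one total per column: x 15, y 30
print(df.sum(axis=1))   # sum across the columns -> one total per row: 3, 6, 9, 12, 15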
I was reading the source code of the pandas project, and I think this behaviour comes from NumPy, where the axis argument is used in the same way (0 sums vertically and 1 horizontally); pandas uses NumPy under the hood to perform the sum.
In this link you can check that pandas uses the numpy.cumsum function to perform the sum.
And this link points to the NumPy documentation.
If you are looking for a way to remember how to use the axis parameter, the answer by 'anant' is a good approach: interpret the sum as taken over the axis rather than across it. So when 0 is specified you are computing the sum over the rows (iterating over the index, to stay closer to the pandas docs). When axis is 1 you are iterating over the columns.
I have time series data in dataframe named "df", and, my code for calculating the z-score is given below:
mean = df.mean()
standard_dev = df.std()
z_score = (df - mean) / standard_dev
I would like to calculate the z-score for each observation using the respective observation and data that was known at the point of recording the observation. i.e. I do not want to use a standard deviation and mean that incorporates data that occurs after a specific point in time. I just want to use data from time t, t-1, t-2....
How do I do this?
Use .expanding(), where col is the column you want to compute your statistics for (drop [col] if you wish to compute it for the whole dataframe).
You might need to sort values by the time column first, denoted here as time_col (in case it's not sorted already):
df=df.sort_values("time_col", axis=0)
Then:
df[col].sub(df[col].expanding().mean()).div(df[col].expanding().std())
Ref:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html
For the sample data:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqrstuv"), "b": [6,5,7,1,-9,0,3,5,2,8]})
df["c"]=df["b"].sub(df["b"].expanding().mean()).div(df["b"].expanding().std())
Outputs:
a b c
0 x 6 NaN
1 y 5 -0.707107
2 z 7 1.000000
3 p 1 -1.425880
4 q -9 -1.677484
5 r 0 -0.281450
6 s 3 0.210502
7 t 5 0.534207
8 u 2 -0.046142
9 v 8 1.062430
You could assign two new columns containing the mean and std of all items up to and including the current one. I assume here that your time series data is in the column 'time_series_data':
len_ = len(df)
df['mean_past'] = [np.mean(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['std_past'] = [np.std(df['time_series_data'][0:lv+1]) for lv in range(len_)]  # note: np.std defaults to ddof=0, unlike pandas .std(), which uses ddof=1
df['z_score'] = (df['time_series_data'] - df['mean_past']) / df['std_past']
Edit: if you want to z-score all columns, you could define a function, that computes the z-score and apply it on all columns of your dataframe:
def z_score_column(column):
len_ = len(column)
mean = [np.mean(column[0:lv+1]) for lv in range(0,len_)]
std = [np.std(column[0:lv+1]) for lv in range(0,len_)]
return [(c-m)/s for c,m,s in zip(column, mean, std)]
df = pd.DataFrame(np.random.rand(10,5))
df.apply(z_score_column)
I would like to compute a quantity called "downside beta".
Let's suppose I have a dataframe df:
df = pd.DataFrame({'A': [-0.1,0.3,-0.4, 0.8,-0.5],'B': [-0.2,0.5,0.3,-0.5,0.1]},index=[0, 1, 2, 3,4])
I would like to add a column 'C' that contains this downside beta, defined as the covariance between columns A and B computed only over the rows where A is negative (using the corresponding values of B), divided by the variance of column A over those same rows.
In the above example, it should be equivalent to computing the covariance between the two series [-0.1, -0.4, -0.5] and [-0.2, 0.3, 0.1], divided by the variance of the series [-0.1, -0.4, -0.5].
The next step would be to roll this metric over the index of the initial, larger dataframe df.
Is there an efficient, vectorized way to do that? I guess it would combine pd.rolling_cov and np.where?
Thank you!
Is this what you're looking for? You can filter out the non-negative values of A and then call the pandas cov and var functions accordingly:
v = df[df.A.lt(0)]
v.cov() / v.A.var()
A B
A 1.000000 -0.961538
B -0.961538 1.461538
If you just want the single off-diagonal value,
np.diag(v.cov() / v.A.var(), k=-1)
array([-0.96153846])
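Equivalently, if you only need the scalar downside beta, you can skip the full covariance matrix and use the pairwise Series methods; a minimal sketch reusing v from above:
downside_beta = v['A'].cov(v['B']) / v['A'].var()   # -0.9615... for the sample df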
For a rolling window, you may need to jump through a few hoops, but this should be doable:
v = df[df.A.lt(0)]
i = v.rolling(3).cov().A.groupby(level=0).last()
j = v.rolling(3).A.var()
i / j
0 NaN
2 NaN
4 -0.961538
Name: A, dtype: float64
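If you want the metric rolled over the original index of df (rather than only over the negative rows, as above), a straightforward, not fully vectorized sketch is to slide a window and apply the same filter-then-cov/var logic inside it; the sample df is rebuilt here so the snippet stands alone, and the window length of 3 is an arbitrary choice:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-0.1, 0.3, -0.4, 0.8, -0.5],
                   'B': [-0.2, 0.5, 0.3, -0.5, 0.1]})

window = 3  # hypothetical window length

def downside_beta(win):
    neg = win[win['A'] < 0]          # keep only rows where A is negative
    if len(neg) < 2:                 # cov/var need at least two observations
        return np.nan
    return neg['A'].cov(neg['B']) / neg['A'].var()

rolling_db = pd.Series(
    [downside_beta(df.iloc[max(0, i - window + 1): i + 1]) for i in range(len(df))],
    index=df.index,
)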