pandas dataframe using different mean for standard deviation - python

Is there a way to perform a standard deviation on an array where you specify the xi and xn variables? It appears that the standard deviation function uses the mean of the respective series. For example, if I have a dataframe DF with 2 columns d and c, I would like the standard deviation function to perform `stddev = sqrt(1/n * cumsum((DF.d - DF.c)^2))`.
Edit: Here is the dataframe
d c
0 1.8740 1.874000
1 1.8762 1.876114
2 1.8735 1.874886
3 1.8740 1.874633
4 1.8754 1.874746
5 1.8716 1.874110
6 1.8696 1.873351
7 1.8732 1.873324
8 1.8656 1.871752
9 1.8613 1.870247
In the pandas std method, standard deviation is calculated using the mean of column d. I would like to perform the calculation using an alternate, pre-calculated mean in column c. Basically, column c is a weighted average. I wanted to use the expanding_std function, but there is no way, that I can see, to define the mean variable.
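One possible sketch, assuming the goal is exactly the formula in the question (an expanding standard deviation of d around the pre-computed means in c, rather than around the series mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'd': [1.8740, 1.8762, 1.8735, 1.8740, 1.8754,
          1.8716, 1.8696, 1.8732, 1.8656, 1.8613],
    'c': [1.874000, 1.876114, 1.874886, 1.874633, 1.874746,
          1.874110, 1.873351, 1.873324, 1.871752, 1.870247],
})

# Squared deviations of d from the alternate mean in c,
# accumulated and divided by the number of observations so far
sq_dev = (df['d'] - df['c']) ** 2
custom_std = np.sqrt(sq_dev.cumsum() / (df.index + 1))
print(custom_std)
```

Since `df.index` is 0-based, `df.index + 1` supplies the n in `sqrt(1/n * cumsum(...))` at each row.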

Related

How to create calculated column off variable result of same row? Pandas & Python 3

Fairly new to Python, I have been struggling with creating a calculated column based on the variable values of each item.
I have this table below, with DF being the dataframe name.
I am trying to create a 'PE Comp' column that gets the PE value for each ticker and divides it by the Industry average PE Ratio.
My most successful attempt required creating a .groupby industry dataframe (y) which has calculated the mean per industry. These numbers are correct. Once I did that, I created this code block:
for i in DF['Industry']:
    DF['PE Comp'] = DF['PE Ratio'] / y.loc[i, 'PE Ratio']
However the numbers are coming out incorrect. I've tested this and the y.loc divisor is working fine with the right numbers, meaning that the issue is coming from the dividend.
Any suggestions on how I can overcome this?
Thanks in advance!
You can use the Pandas Groupby transform:
The following takes the PE Ratio column and divides it by the mean of the grouped industries (expressed three different ways in order of speed of calculation):
import pandas as pd
df = pd.DataFrame({"PE Ratio": [1, 2, 3, 4, 5, 6, 7],
                   "Industry": list("AABCBBC")})
# option 1
df["PE Comp"] = df["PE Ratio"] / df.groupby("Industry")["PE Ratio"].transform("mean")
# option 2
df["PE Comp"] = df.groupby("Industry")["PE Ratio"].transform(lambda x: x/x.mean())
# option 3
import numpy as np
df["PE Comp"] = df.groupby("Industry")["PE Ratio"].transform(lambda x: x/np.mean(x))
df
#Out[]:
# PE Ratio Industry PE Comp
#0 1 A 0.666667
#1 2 A 1.333333
#2 3 B 0.642857
#3 4 C 0.727273
#4 5 B 1.071429
#5 6 B 1.285714
#6 7 C 1.272727
First, you MUST NOT ITERATE through a dataframe. It is not optimized at all and it is a misuse of Pandas' DataFrame.
Creating a new dataframe containing the averages is a good approach in my opinion. I think the line you want to write after it is:
df['PE comp'] = df['PE ratio'] / y.loc[df['Industry']].values
I just have a doubt about y.loc[df['Industry']].values: maybe you don't need .values, or maybe you need to cast the values; I didn't test. But the spirit is that your new y DataFrame is like a dict containing the average of each Industry.
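That "y as a dict" idea can be expressed without iteration by mapping each row's industry onto the per-industry mean. A sketch, assuming y is the groupby-mean frame indexed by industry as in the question:

```python
import pandas as pd

df = pd.DataFrame({"PE Ratio": [1, 2, 3, 4, 5, 6, 7],
                   "Industry": list("AABCBBC")})

# y plays the role of the precomputed per-industry means
y = df.groupby("Industry").mean()

# Map each row's industry to its industry mean, then divide elementwise
df["PE Comp"] = df["PE Ratio"] / df["Industry"].map(y["PE Ratio"])
print(df)
```

This produces the same values as the transform-based options above, since `.map` on a Series looks up each industry label in y's index.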

Rolling mean and standard deviation without zeros

I have a data frame in which one of the columns represents how much corn was produced at each time stamp.
For example:
timestamp corns_produced another_column
1 5 4
2 0 1
3 0 3
4 3 4
The dataframe is big, 100,000+ rows.
I want to calculate a moving average and std over 1000 time stamps of corns_produced.
Luckily it is pretty easy using rolling :
my_df.rolling(1000).mean()
my_df.rolling(1000).std()
But the problem is I want to ignore the zeros, meaning if in the last 1000 timestamps there are only 5 instances in which corn was produced, I want to do the mean and std on those 5 elements.
How do I ignore the zeros ?
Just to clarify, I don't want to do the following: x = my_df[my_df['corns_produced'] != 0] and then do rolling on x, because that ignores the time stamps and doesn't give me the result I need.
You can use Rolling.apply:
print(my_df.rolling(1000).apply(lambda x: x[x != 0].mean()))
print(my_df.rolling(1000).apply(lambda x: x[x != 0].std()))
A faster solution: first set all zeros to np.nan, then take a rolling mean. If you are dealing with large data, this will be much faster.
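A sketch of that NaN-based approach, using a small window for illustration. Rolling aggregations in pandas skip NaN values, and min_periods controls how many non-NaN observations a window needs before producing a result:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'corns_produced': [5, 0, 0, 3, 7, 0, 2, 4]})

# Replace zeros with NaN so rolling() ignores them;
# min_periods=1 keeps a result even when few nonzero values exist
masked = my_df['corns_produced'].replace(0, np.nan)
rolling_mean = masked.rolling(4, min_periods=1).mean()
rolling_std = masked.rolling(4, min_periods=1).std()
print(rolling_mean)
```

Unlike filtering the zeros out first, this keeps the original time stamps aligned, so each window still spans exactly 4 rows of the original frame.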

Standard deviation of time series

I wanted to calculate the mean and standard deviation of a sample. The sample has two columns: the first is a time, and the second, separated by a space, is a value. I don't know how to calculate the mean and standard deviation of the second column of values using Python, maybe SciPy? I want to use the method for large sets of data.
I also want to check which numbers in the set are seven times higher than the standard deviation.
Thanks for help.
time value
1 1.17e-5
2 1.27e-5
3 1.35e-5
4 1.53e-5
5 1.77e-5
The mean is 1.418e-5 and the standard deviation is 2.369e-6.
To answer your first question, assuming your sample's dataframe is df, the following should work:
import pandas as pd
df = pd.DataFrame({'time': [1, 2, 3, 4, 5], 'value': [1.17e-5, 1.27e-5, 1.35e-5, 1.53e-5, 1.77e-5]})
df will be something like this:
>>> df
time value
0 1 0.000012
1 2 0.000013
2 3 0.000013
3 4 0.000015
4 5 0.000018
Then to obtain the standard deviation and mean of the value column respectively, run the following and you will get the outputs:
>>> df['value'].std()
2.368966019173766e-06
>>> df['value'].mean()
1.418e-05
To answer your second question, try the following:
std = df['value'].std()
df = df[(df.value > 7*std)]
I am assuming you want to obtain the rows at which value is greater than 7 times the sample standard deviation. If you actually want greater than or equal to, just change > to >=. You should then be able to obtain the following:
>>> df
time value
4 5 0.000018
Also, following @Mad Physicist's suggestion of adding Delta Degrees of Freedom ddof=0 (if you are unfamiliar with this, check out the Delta Degrees of Freedom Wiki), doing so results in the following:
std = df['value'].std(ddof=0)
df = df[(df.value > 7*std)]
with output:
>>> df
time value
3 4 0.000015
4 5 0.000018
P.S. If I am not wrong, it's a convention here to stick to one question per post, not two.

Calculate a value in Pandas that is based on a product of past values without looping

I have a dataframe that represents time series probabilities. Each value in column 'Single' represents the probability of that event in that time period (where each row represents one time period). Each value in column 'Cumulative' represents the probability of that event occurring every time period until that point (ie it is the product of every value in 'Single' from time 0 until now).
A simplified version of the dataframe looks like this:
Single Cumulative
0 0.990000 1.000000
1 0.980000 0.990000
2 0.970000 0.970200
3 0.960000 0.941094
4 0.950000 0.903450
5 0.940000 0.858278
6 0.930000 0.806781
7 0.920000 0.750306
8 0.910000 0.690282
9 0.900000 0.628157
10 0.890000 0.565341
In order to calculate the 'Cumulative' column based on the 'Single' column I am looping through the dataframe like this:
for index, row in df.iterrows():
    df['Cumulative'][index] = df['Single'][:index].prod()
In reality, there is a lot of data and looping is a drag on performance, is it at all possible to achieve this without looping?
I've tried to find a way to vectorize this calculation or even use the pandas.DataFrame.apply function, but I don't believe I'm able to reference the current index value in either of those methods.
There's a built-in function for this in Pandas:
df.cumprod()
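One detail worth noting: in the sample above, 'Cumulative' at row n is the product of 'Single' up to row n-1 (row 0 is 1.0), while cumprod includes the current row. A sketch that reproduces the sample exactly by shifting the cumulative product:

```python
import pandas as pd

df = pd.DataFrame({'Single': [0.99, 0.98, 0.97, 0.96, 0.95]})

# cumprod gives the running product including the current row;
# shifting by one with a fill of 1.0 matches the sample's convention
df['Cumulative'] = df['Single'].cumprod().shift(fill_value=1.0)
print(df)
```

This yields 1.000000, 0.990000, 0.970200, 0.941094, 0.903450, matching the table in the question.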

How to plot a graph for correlation co-efficient between each attributes of a dataset and target attribute using Python

I am new to Python and I need to plot a graph of the correlation coefficient between each attribute and the target value. I have an input dataset with a huge number of values; a sample is provided below. We need to predict whether a particular consumer will leave the company or not, and hence the Result column is the target variable.
SALARY DUE RENT CALLSPERDAY CALL DURATION RESULT
238790 7 109354 0 6 YES
56004 0 204611 28 15 NO
671672 27 371953 0 4 NO
786035 1 421999 19 11 YES
89684 2 503335 25 8 NO
904285 3 522554 0 13 YES
12072 4 307649 4 11 NO
23621 19 389157 0 4 YES
34769 11 291214 1 13 YES
945835 23 515777 0 5 NO
Here, if you see, the Result column is a string whereas the rest of the columns are integers. Similar to Result, I also have a few other columns (not mentioned in the sample) which have string values. Here, I need to compute values for columns which have both string and integer values.
Using a dictionary, I have assigned a value to each of the columns which has a string value.
Example: the Result column has Yes or No, hence I assigned values as below:
D = {'NO': 0, 'YES': 1}
and using a lambda function, looped through each column of the dataset and replaced NO with 0 and YES with 1.
I tried to calculate the correlation coefficient using the formula:
pearsonr(S.SALARY,targetVarible)
Where S is the dataframe which holds all values.
Similarly, I will loop through all the columns of the dataset and calculate the correlation coefficient of each column against the target variable.
Is this an efficient way of calculating correlation coefficient?
Because I am getting values like
(0.088327739664096655, 1.1787456108540725e-25)
and 1e-25 seems to be too small.
Is there any other way to calculate it? Would you suggest any other way to input string values so that they can be treated as integers when compared with other columns that have integer values (other than the dictionaries and lambdas which I used)?
Also, I need to plot a bar graph using the same code. I am planning to use the from matplotlib import pyplot as plt library.
Would you suggest any other function to plot a bar graph? Mostly I'm using the sklearn, numpy and pandas libraries to use existing functions from them.
It would be great, if someone helps me. Thanks.
As mentioned in the comments, you can use df.corr() to get the correlation matrix of your data. (Note that pearsonr returns a pair of values, the correlation coefficient and the p-value, which is why the second number you got is so small.) Assuming the name of your DataFrame is df, you can plot the correlation with:
df_corr = df.corr()
df_corr[['RESULT']].plot(kind='bar')
Pandas DataFrames have a plot function that uses matplotlib. You can learn more about it here: http://pandas.pydata.org/pandas-docs/stable/visualization.html
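Putting the pieces together, a minimal sketch (column names taken from the sample; the data here is just the first few rows) that encodes the string target, computes each attribute's correlation against RESULT, and could then be drawn as a bar chart:

```python
import pandas as pd

df = pd.DataFrame({
    'SALARY': [238790, 56004, 671672, 786035, 89684],
    'DUE': [7, 0, 27, 1, 2],
    'RESULT': ['YES', 'NO', 'NO', 'YES', 'NO'],
})

# Encode the string target as integers so corr() can use it
df['RESULT'] = df['RESULT'].map({'NO': 0, 'YES': 1})

# Correlation of every attribute against the target column
corr_with_target = df.corr()['RESULT'].drop('RESULT')
print(corr_with_target)

# Bar chart of the correlations (requires matplotlib installed):
# corr_with_target.plot(kind='bar')
```

The .map call replaces the dictionary-plus-lambda loop from the question in one step per string column.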
