I want to calculate the mean and standard deviation of a sample. The sample has two space-separated columns: the first is a time and the second is a value. I don't know how to calculate the mean and standard deviation of the second column of values using Python, maybe with scipy? I want to use that method for large sets of data.
I also want to check which numbers in the set are more than seven times the standard deviation.
Thanks for the help.
time value
1 1.17e-5
2 1.27e-5
3 1.35e-5
4 1.53e-5
5 1.77e-5
The mean is 1.418e-5 and the standard deviation is 2.369e-6.
To answer your first question, assuming your sample's dataframe is df, the following should work:
import pandas as pd
df = pd.DataFrame({'time':[1,2,3,4,5], 'value':[1.17e-5,1.27e-5,1.35e-5,1.53e-5,1.77e-5]})
df will be something like this:
>>> df
time value
0 1 0.000012
1 2 0.000013
2 3 0.000013
3 4 0.000015
4 5 0.000018
Then to obtain the standard deviation and mean of the value column respectively, run the following and you will get the outputs:
>>> df['value'].std()
2.368966019173766e-06
>>> df['value'].mean()
1.418e-05
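Since the question describes a space-separated file rather than an in-memory dictionary, here is a minimal sketch of loading it with pandas.read_csv. The io.StringIO object is only a stand-in for the real file, whose name the question does not give; with a file on disk you would pass its path instead.

```python
import io

import pandas as pd

# Stand-in for the real file (assumption: one header line, whitespace-separated columns).
raw = io.StringIO(
    "time value\n"
    "1 1.17e-5\n"
    "2 1.27e-5\n"
    "3 1.35e-5\n"
    "4 1.53e-5\n"
    "5 1.77e-5\n"
)

# sep=r"\s+" splits on any run of whitespace.
df = pd.read_csv(raw, sep=r"\s+")

mean = df["value"].mean()
std = df["value"].std()  # sample standard deviation (ddof=1)
print(mean, std)
```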
To answer your second question, try the following:
std = df['value'].std()
df = df[(df.value > 7*std)]
I am assuming you want to obtain the rows at which value is greater than 7 times the sample standard deviation. If you actually want greater than or equal to, just change > to >=. You should then be able to obtain the following:
>>> df
time value
4 5 0.000018
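Also, following @Mad Physicist's suggestion, you can pass ddof=0 (Delta Degrees of Freedom) to get the population standard deviation instead of the pandas default ddof=1 (if you are unfamiliar with this, check out the Delta Degrees of Freedom wiki). Applied to the original, unfiltered df, this results in the following: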
std = df['value'].std(ddof=0)
df = df[(df.value > 7*std)]
with output:
>>> df
time value
3 4 0.000015
4 5 0.000018
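The two thresholds can be compared side by side without overwriting df. This is only a sketch: the names above_sample and above_population are made up for illustration, and the data is the five-row sample from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5],
    "value": [1.17e-5, 1.27e-5, 1.35e-5, 1.53e-5, 1.77e-5],
})

sample_std = df["value"].std()            # ddof=1, the pandas default
population_std = df["value"].std(ddof=0)  # matches the default of np.std

# Keep df intact and store each filtered view under its own (hypothetical) name.
above_sample = df[df["value"] > 7 * sample_std]
above_population = df[df["value"] > 7 * population_std]

print(above_sample)
print(above_population)
```

Keeping the original df untouched means both filters can be run in any order.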
P.S. If I am not wrong, it's a convention here to stick to one question per post, not two.
Related
I have a data frame in which one column records how many corns were produced at each time stamp.
For example:
timestamp corns_produced another_column
1 5 4
2 0 1
3 0 3
4 3 4
The dataframe is big: 100,000+ rows.
I want to calculate a moving average and std over 1000 time stamps of corns_produced.
Luckily it is pretty easy using rolling:
my_df.rolling(1000).mean()
my_df.rolling(1000).std()
But the problem is I want to ignore the zeros, meaning if in the last 1000 timestamps there are only 5 instances in which corn was produced, I want to do the mean and std on those 5 elements.
How do I ignore the zeros?
Just to clarify, I don't want to do x = my_df[my_df['corns_produced'] != 0] and then do rolling on x, because that ignores the time stamps and doesn't give me the result I need.
You can use Rolling.apply:
print (my_df.rolling(1000).apply(lambda x: x[x!= 0].mean()))
print (my_df.rolling(1000).apply(lambda x: x[x!= 0].std()))
A faster solution: first set all zeros to np.nan, then take a rolling mean, passing min_periods so that windows containing NaN still return a value. If you are dealing with large data, this will be much faster than Rolling.apply.
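A minimal sketch of the NaN approach follows. The sample rows extend the question's example with two made-up rows, and the window is shrunk from 1000 to 4 so the result is easy to check by hand. min_periods=1 is an assumption about the desired behaviour: without it, a fixed-size window returns NaN whenever it contains fewer non-NaN values than the window length.

```python
import numpy as np
import pandas as pd

# First four rows follow the question; the last two are hypothetical.
my_df = pd.DataFrame({
    "timestamp": [1, 2, 3, 4, 5, 6],
    "corns_produced": [5, 0, 0, 3, 0, 4],
})

window = 4  # the question uses 1000

# Zeros become NaN, which the rolling aggregations skip.
nonzero = my_df["corns_produced"].replace(0, np.nan)
rolling = nonzero.rolling(window, min_periods=1)

my_df["rolling_mean"] = rolling.mean()
my_df["rolling_std"] = rolling.std()  # NaN where only one non-zero value is in the window
print(my_df)
```

The rows stay aligned with their timestamps, so each window still covers the last `window` time stamps rather than the last `window` non-zero values.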
I have a dataframe that represents time series probabilities. Each value in column 'Single' represents the probability of that event in that time period (where each row represents one time period). Each value in column 'Cumulative' represents the probability of that event occurring every time period until that point (i.e. it is the product of every value in 'Single' from time 0 up to, but not including, the current row).
A simplified version of the dataframe looks like this:
Single Cumulative
0 0.990000 1.000000
1 0.980000 0.990000
2 0.970000 0.970200
3 0.960000 0.941094
4 0.950000 0.903450
5 0.940000 0.858278
6 0.930000 0.806781
7 0.920000 0.750306
8 0.910000 0.690282
9 0.900000 0.628157
10 0.890000 0.565341
In order to calculate the 'Cumulative' column based on the 'Single' column I am looping through the dataframe like this:
for index, row in df.iterrows():
df['Cumulative'][index] = df['Single'][:index].prod()
In reality, there is a lot of data and looping is a drag on performance. Is it at all possible to achieve this without looping?
I've tried to find a way to vectorize this calculation or even use the pandas.DataFrame.apply function, but I don't believe I'm able to reference the current index value in either of those methods.
There's a built-in function for this in pandas:
df.cumprod()
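Note that cumprod includes the current row, whereas the 'Cumulative' column in the question starts at 1.0 and excludes it. One way to reproduce the question's table, sketched here on its first five rows, is to shift the cumulative product down by one row and fill the gap with 1.0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Single": [0.99, 0.98, 0.97, 0.96, 0.95]})

# Inclusive running product: row i holds Single[0] * ... * Single[i].
inclusive = df["Single"].cumprod()

# Shift by one row so row i holds the product of all earlier rows only;
# the empty product for the first row is 1.0.
df["Cumulative"] = inclusive.shift(1, fill_value=1.0)
print(df)
```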
I have a pandas dataframe and I'd like to add a new column that has the contents of an existing column, but shifted relative to the rest of the data frame. I'd also like the value that drops off the bottom to get rolled around to the top.
For example if this is my dataframe:
>>> myDF
coord coverage
0 1 1
1 2 10
2 3 50
I want to get this:
>>> myDF_shifted
coord coverage coverage_shifted
0 1 1 50
1 2 10 1
2 3 50 10
(This is just a simplified example - in real life, my dataframes are larger and I will need to shift by more than one unit)
This is what I've tried and what I get back:
>>> myDF['coverage_shifted'] = myDF.coverage.shift(1)
>>> myDF
coord coverage coverage_shifted
0 1 1 NaN
1 2 10 1
2 3 50 10
So I can create the shifted column, but I don't know how to roll the bottom value around to the top. From internet searches I think that numpy lets you do this with "numpy.roll". Is there a pandas equivalent?
Pandas probably doesn't provide an off-the-shelf method to do exactly what you described. However, if you can move a little bit out of the box, numpy has exactly that.
In your case it is:
import numpy as np
myDF['coverage_shifted'] = np.roll(myDF.coverage, 1)  # use a larger shift count to roll further
You can pass an additional argument to shift() to achieve what you want, although the previous answer is more helpful in most cases:
last_value = myDF.iloc[-1]['coverage']
myDF['coverage_shifted'] = myDF.coverage.shift(1, fill_value=last_value)
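You have to supply the value for fill_value manually. Every vacated position receives that same value, so this only reproduces a true roll when shifting by one.
The same can be applied for reverse rolling: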
first_value = myDF.iloc[0]['coverage']
myDF['coverage_back_shifted'] = myDF.coverage.shift(-1, fill_value=first_value)
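Because the question mentions shifting by more than one unit, here is a small sketch comparing the two approaches for a shift of 2 on the question's example frame. The column names rolled_2 and shifted_2 are made up for illustration.

```python
import numpy as np
import pandas as pd

myDF = pd.DataFrame({"coord": [1, 2, 3], "coverage": [1, 10, 50]})

shift_by = 2

# np.roll wraps every value that falls off the end back to the start.
myDF["rolled_2"] = np.roll(myDF["coverage"].to_numpy(), shift_by)

# shift() fills every vacated slot with the single fill_value, so it does not wrap.
last_value = myDF["coverage"].iloc[-1]
myDF["shifted_2"] = myDF["coverage"].shift(shift_by, fill_value=last_value)

print(myDF)
```

Here rolled_2 is [10, 50, 1] while shifted_2 is [50, 50, 1], which is why np.roll is the safer choice for larger shifts.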
Is there a way to compute a standard deviation on an array where you specify the xi and xn variables yourself? It appears that the standard deviation function uses the mean of the respective series. For example, if I have a dataframe DF with 2 columns d and c, I would like the standard deviation function to compute stddev = sqrt(1/df.index * cumsum((DF.d - DF.c)^2)).
Edit: Here is the dataframe:
d c
0 1.8740 1.874000
1 1.8762 1.876114
2 1.8735 1.874886
3 1.8740 1.874633
4 1.8754 1.874746
5 1.8716 1.874110
6 1.8696 1.873351
7 1.8732 1.873324
8 1.8656 1.871752
9 1.8613 1.870247
The pandas std method calculates the standard deviation using the mean of column d. I would like to perform the calculation using an alternate calculated mean in column c. Basically, column c is a weighted average. I wanted to use the expanding_std function, but there is no way that I can see to define the mean variable.
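One possible way to do this, sketched on the first five rows of the frame above, is to build the squared deviations from column c by hand and take an expanding mean of them. This assumes the divisor should be the number of rows seen so far (1, 2, 3, ...) rather than the raw index, since dividing by an index of 0 would fail on the first row; the column name custom_std is made up for illustration.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({
    "d": [1.8740, 1.8762, 1.8735, 1.8740, 1.8754],
    "c": [1.874000, 1.876114, 1.874886, 1.874633, 1.874746],
})

# Squared deviation of d from the user-supplied mean in c, row by row.
sq_dev = (DF["d"] - DF["c"]) ** 2

# Running count of rows: 1, 2, ..., n (assumption: population-style divisor).
n = np.arange(1, len(DF) + 1)

DF["custom_std"] = np.sqrt(sq_dev.cumsum() / n)
print(DF)
```

The same result can be written as np.sqrt(sq_dev.expanding().mean()), which uses the expanding() method of current pandas versions.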
I have a data file with fields separated by commas that I received from someone. I have to systematically go through each column to understand things like the usual descriptive statistics:
-Min
-Max
-Mean
-25th percentile
-50th percentile
-75th percentile
or if it's text:
-number of distinct values
but also I need to find
-number of null or missing values
-number of zeroes
Sometimes the oddities of a feature mean something, i.e. contains information. And I might need to circle back with the client about oddities I find. Or if I'm going to replace values I have to make sure I'm not steamrolling over something recklessly.
So my question is this: Is there a package in python that will find this for me without my presupposing the data type? And if it did exist, would pandas be a good home for it?
I see that pandas makes it easy peezy to replace values but in the beginning I just want to look.
You can use the describe method:
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
In [2]: df
Out[2]:
A B C
0 1.389738 -0.205485 -0.775810
1 -1.166596 -0.898761 -1.805333
2 -1.016509 -0.816037 0.169265
3 -0.440860 -1.147164 1.558606
4 0.763012 1.068694 -0.711795
5 0.075961 -0.597715 0.699023
6 3.006095 -0.354879 -0.718440
7 -1.249588 -0.372235 1.611717
8 0.518770 -0.742766 1.956372
9 1.304080 -0.803262 -0.609970
In [3]: df.describe()
Out[3]:
A B C
count 10.000000 10.000000 10.000000
mean 0.318410 -0.486961 0.137363
std 1.360633 0.616566 1.266616
min -1.249588 -1.147164 -1.805333
25% -0.872596 -0.812843 -0.716779
50% 0.297366 -0.670240 -0.220352
75% 1.168813 -0.359218 1.343710
max 3.006095 1.068694 1.956372
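Older pandas versions had a percentile_width argument, which defaulted to 50. Current versions take a percentiles argument instead, which defaults to [.25, .5, .75].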
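describe alone does not report missing values or zeros, and by default it skips text columns in a mixed frame. A rough sketch of covering the rest of the question's checklist is below; the small frame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type data with missing values and zeros.
df = pd.DataFrame({
    "amount": [0.0, 1.5, np.nan, 3.0, 0.0],
    "label": ["a", "b", "a", None, "c"],
})

# include="all" summarises numeric and text columns together.
summary = df.describe(include="all")

missing = df.isna().sum()                          # null or missing values per column
zeros = df.select_dtypes("number").eq(0).sum()     # zeros, numeric columns only
distinct = df.nunique()                            # distinct non-null values per column

print(summary)
print(missing, zeros, distinct, sep="\n")
```

None of these calls modify df, so they are safe to run before deciding whether to replace anything.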