I am using pandas.rolling_corr to calculate the correlation of two pandas Series.
pd.rolling_corr(x, y, 10)
x and y have very little variation. For instance
x[0] = 1.3342323
x[1] = 1.3342317
Since correlation is covariance divided by the product of the standard deviations, the correlation should only be inf or -inf if a standard deviation is 0. In the data set that I have, the values are very close, but no two values are exactly the same. Yet for some reason I'm getting inf or -inf values in the correlation.
Here is my question: is there a limit in pandas.rolling_corr at which a number is automatically rounded during the calculation if it is too small (x < 1e-7)?
The dataset I'm working with (x and y) is in 'float64' and I have set chop_threshold to 0.
EDIT:
Here is a sample of the data I'm working with. I'm trying to compute the correlation between the two columns, but the result is inf.
1144 679.5999998
1144 679.600001
1143.75 679.6000003
1143.75 679.5999993
1143 679.6000009
I think this can more generally be considered a question about precision rather than correlation. Generally, you can expect things behind the scenes to be done in double precision, which means things can get wonky around the 13th or 14th decimal place (and certainly at 11 or 12 decimal places, or fewer, in many cases).
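To make "double precision" concrete, here is a quick generic check (not from the original question, just standard NumPy) of how small a difference float64 can even represent near 1.0:

import numpy as np
print(np.finfo(np.float64).eps)   # ~2.22e-16: smallest relative spacing near 1.0
print(1.0 + 1e-16 == 1.0)         # True: a difference this small is lost entirely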
So here's a little example program that starts with differences of 1 in the 'x' column (1e0) and then shrinks them in steps all the way down to 1e-19.
import numpy as np
import pandas as pd

pd.options.display.float_format = '{:12,.9f}'.format

df_corr = pd.DataFrame()
diffs = np.arange(4)   # [0, 1, 2, 3]
for i in range(20):
    df = pd.DataFrame({ 'x': [1,1,1,1], 'y': [5,7,8,6] })
    df['x'] = df['x'] + diffs * .1**i   # differences shrink from 1e0 down to 1e-19
    # keep the two valid window-3 correlations (index 2 and 3) as one row of results
    df_corr = df_corr.append( pd.rolling_corr(df.x, df.y, 3)[2:4], ignore_index=True )
A couple of comments before looking at the results. First, pandas' default display will sometimes hide very small differences in numbers, so you need to force more precise formatting in some way; here I used pd.options.
Second, I won't put a ton of comments in here, but the main thing you need to know is that the 'x' column varies in each loop. You can add a print statement if you like (see the sketch after the table below), but the first 4 iterations look like this:
iteration values of column 'x'
0 [ 1. 2. 3. 4. ]
1 [ 1. 1.1 1.2 1.3 ]
2 [ 1. 1.01 1.02 1.03 ]
3 [ 1. 1.001 1.002 1.003 ]
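If you want to print those values yourself, here is a small stand-alone sketch that reproduces the same arithmetic as the loop (not part of the original program):

import numpy as np
for i in range(4):
    print(i, 1 + np.arange(4) * .1**i)   # column 'x' for iterations 0-3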
So by the time we get to the later iterations, the differences will be so small that pandas just sees the column as [1, 1, 1, 1]. The practical question, then, is at what point the differences become too small for pandas to distinguish.
To make this a little more concrete, this is what it looks like for the 8th iteration (and the rolling_corr uses a window of 3 since this dataset has only 4 rows):
x y
0 1.000000000 5
1 1.000000100 7
2 1.000000200 8
3 1.000000300 6
So here are the results, where the index value also gives the iteration number (and hence the precision of the differences in column 'x')
df_corr
2 3
0 0.981980506 -0.500000000
1 0.981980506 -0.500000000
2 0.981980506 -0.500000000
3 0.981980506 -0.500000000
4 0.981980506 -0.500000000
5 0.981980506 -0.500000000
6 0.981980505 -0.500000003
7 0.981980501 -0.500000003
8 0.981980492 -0.499999896
9 0.981978898 -0.500001485
10 0.981971784 -0.500013323
11 0.981893289 -0.500133227
12 0.981244570 -0.502373626
13 0.976259968 -0.505124339
14 0.954868546 -0.132598709
15 0.785584405 1.060660172
16 0.000000000 0.000000000
17 nan nan
18 nan nan
19 nan nan
So you can see that you get reasonable results down to about the 13th decimal place (although accuracy is already declining before that), and then they fall apart, although you don't actually get NaNs until the 17th.
I'm not sure how to directly answer your question, but hopefully that sheds some light. I would say there is really no automatic rounding happening in these calculations, but it's hard to generalize. Some algorithms certainly handle precision issues like this better than others, and it's not possible to make broad generalizations, even among different functions or methods within pandas. (There is, for example, a current discussion about how numpy gets a different answer than the analogous function in pandas due to precision issues.)
I have a data frame where one of the columns represents how much corn was produced at each timestamp.
For example:
timestamp corns_produced another_column
1 5 4
2 0 1
3 0 3
4 3 4
The dataframe is big: 100,000+ rows.
I want to calculate the moving average and std over 1000 timestamps of corns_produced.
Luckily it is pretty easy using rolling:
my_df.rolling(1000).mean()
my_df.rolling(1000).std()
But the problem is I want to ignore the zeros, meaning if in the last 1000 timestamps there are only 5 instances in which corn was produced, I want to compute the mean and std over those 5 elements only.
How do I ignore the zeros?
Just to clarify, I don't want to do the following: x = my_df[my_df['corns_produced'] != 0] and then do rolling on x, because that ignores the timestamps and doesn't give me the result I need.
You can use Rolling.apply:
print (my_df.rolling(1000).apply(lambda x: x[x != 0].mean()))
print (my_df.rolling(1000).apply(lambda x: x[x != 0].std()))
A faster solution: first set all zeros to np.nan, then take a rolling mean and std. If you are dealing with large data, it will be much faster.
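A minimal sketch of that idea; min_periods=1 is my own assumption here, so that windows containing mostly zeros still return a value:

import numpy as np
# rolling mean/std skip NaN values, so zeros replaced by NaN are excluded
masked = my_df['corns_produced'].replace(0, np.nan)
rolling_mean = masked.rolling(1000, min_periods=1).mean()   # min_periods=1 is an assumption
rolling_std = masked.rolling(1000, min_periods=1).std()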
I wanted to calculate the mean and standard deviation of a sample. The sample has two columns: the first is a time, and the second, separated by a space, is a value. I don't know how to calculate the mean and standard deviation of the second column of values using Python, maybe scipy? I want to use that method for large sets of data.
I also want to check which values in the set are greater than seven times the standard deviation.
Thanks for the help.
time value
1 1.17e-5
2 1.27e-5
3 1.35e-5
4 1.53e-5
5 1.77e-5
The mean is 1.418e-5 and the standard deviation is 2.369e-6.
To answer your first question, assuming your sample's dataframe is df, the following should work:
import pandas as pd
df = pd.DataFrame({'time':[1,2,3,4,5], 'value':[1.17e-5,1.27e-5,1.35e-5,1.53e-5,1.77e-5]})
df will be something like this:
>>> df
time value
0 1 0.000012
1 2 0.000013
2 3 0.000013
3 4 0.000015
4 5 0.000018
Then to obtain the standard deviation and mean of the value column respectively, run the following and you will get the outputs:
>>> df['value'].std()
2.368966019173766e-06
>>> df['value'].mean()
1.418e-05
To answer your second question, try the following:
std = df['value'].std()
df = df[(df.value > 7*std)]
I am assuming you want to obtain the rows at which value is greater than 7 times the sample standard deviation. If you actually want greater than or equal to, just change > to >=. You should then be able to obtain the following:
>>> df
time value
4 5 0.000018
Also, following #Mad Physicist's suggestion of adding Delta Degrees of Freedom ddof=0 (if you are unfamiliar with this, check out the Delta Degrees of Freedom wiki), doing so results in the following:
std = df['value'].std(ddof=0)
df = df[(df.value > 7*std)]
with output:
>>> df
time value
3 4 0.000015
4 5 0.000018
P.S. If I am not wrong, it's a convention here to stick to one question per post, not two.
I am trying to create a new column in my pandas dataframe that is the result of a basic mathematical equation performed on other columns in the dataset. The problem is that the values in the new column appear heavily rounded and do not represent the true values.
2.5364 should not be rounded off to 2.5, and 3.775 should not be rounded off to 3.8.
I have tried declaring the denominators as floats in a bid to trick the system into supplying values that look like that, i.e. 12/3.00 should be 4.00, but this still returns 4.0 instead.
This is currently what I am doing:
normal_load = 3
df['FirstPart_GPA'] = ((df[first_part].sum(axis = 1, skipna = True))/(normal_load*5.00))
I set skipna to True because sometimes a column might not have any value, but I still want to be able to calculate the GPA without the system throwing any errors, since any number plus NaN would give NaN.
I am working with a dataframe that looks like this:
dict = {'course1': [15,12],
        'course2': [9,6],
        'course3': [12,15],
        'course4': [15,3],
        'course5': [15,9],
        'course6': [9,12]}
df = pd.DataFrame(dict)
Note that the dataframe I have contains some null values because some courses are electives. Please help me out. I am out of ideas.
You have not defined the first_part variable in your code, so I am going to assume it is some subset of the dataframe columns, e.g.:
first_part=['course1', 'course2', 'course3']
All of the numbers in your dataframe are integer multiples of 3, so when you sum any of them and divide by 15 you will always get a decimal number with no more than one digit after the decimal point (a sum of multiples of 3 is 3k, and 3k/15 = k/5). Your values are not rounded; they are exact.
To display numbers with two digits after the decimal point, add a line:
pd.options.display.float_format = '{:,.2f}'.format
Now
df['FirstPart_GPA'] = ((df[first_part].sum(axis = 1, skipna = True))/(normal_load*5.00))
df
course1 course2 course3 course4 course5 course6 FirstPart_GPA
0 15 9 12 15 15 9 2.40
1 12 6 15 3 9 12 2.20
You can add float formatting, something like this:
result= "%0.2f" % your_calc_result
Example using this code:
import pandas as pd

dict = {'course1': [15,12],
        'course2': [9,6],
        'course3': [12,15],
        'course4': [15,3],
        'course5': [15,9],
        'course6': [9,12]}
df = pd.DataFrame(dict)
normal_load = 3.0
result = []
for i in range(len(df.index)):
    # sum all six course scores for this row and format to two decimals
    result.append("%0.2f" % (float(df.loc[i].sum()) / (normal_load * 5.00)))
df['FirstPart_GPA'] = result
Output:
course1 course2 course3 course4 course5 course6 FirstPart_GPA
0 15 9 12 15 15 9 5.00
1 12 6 15 3 9 12 3.80
OMG! I now see what the problem is. I just threw my file into Excel and did the calculation, and it turns out the code is fine. I am sorry I took up your time, and I appreciate the quick responses.
I always assumed that GPAs would have lots of decimals, but the code uses a 5-point grading system, which means that if a student has an A in a course with a course load of 3, she would have scored 15 points.
A student has to take 5 courses per semester. All 5 courses have a load of 3, so the total load across all 5 courses is 15.
So because the possible values a student can score are multiples of 3 (0, 3, 6, 9, 12, 15), when we divide the sum of all points across the 5 courses by 15, the 3 always cancels, e.g. (3+12+12+3+9)/15 = 39/15 = 13/5.
Dividing by 5 rarely spills over into extra decimals, unlike something like 10/3, which keeps giving repeating 3s in the decimal part. Therefore 13/5 = 2.6.
In my dataframe I have an age column. The total number of rows is approx 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (maybe my process is very bad).
Is there any good way to improve this?
Dataframe example:
id age
1 18
2 32
3 54
4 63
5 42
6 23
What I have done so far:
#Summary stats
df.describe('age').show()
#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)
The first improvement to make would be to do all the quantile calculations at the same time:
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
Also, note that you are using the exact calculation of the quantiles. From the documentation we can see that (emphasis added by me):
relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
Since you have a very large dataframe, I expect that some error is acceptable in these calculations; it will be a trade-off between speed and precision (and anything greater than 0 could give a significant speed improvement).
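For example, a sketch of the same call with a nonzero tolerance (the 0.01 is an assumed value, not from the question; tune it to your accuracy needs):

# 0.01 relative error is an assumption; smaller means slower but more precise
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
q25, q50, q75 = quantiles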
I've been reading a huge (5 GB) gzip file in the form:
User1 User2 W
0 11 12 1
1 12 11 2
2 13 14 1
3 14 13 2
which is basically a directed-graph representation of connections among users with a certain weight W. Since the file is so big, I tried to read it with networkx, building a directed graph and then converting it to undirected, but it took too much time. So I was thinking of doing the same thing by analysing a pandas dataframe. I would like to return the previous dataframe in the form:
User1 User2 W
0 11 12 3
1 13 14 3
where the common links in the two directions have been merged into one, with W being the sum of the single weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is to normalize the data so that User1 is always the lower-numbered ID. Then you can use groupby, since (11, 12) and (12, 11) are now recognized as representing the same edge.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
Out[334]:
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
For more concise code that avoids creating new variables, you could replace the middle 3 steps with:
In [400]: df.ix[df.User1>df.User2,['User1','User2']] = df.ix[df.User1>df.User2,['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data. I don't think the code above will matter as much as other things you might do. For example, your problem should be amenable to a chunking approach where you iterate over sections of the data, gradually shrinking it on each pass; in that case, the main thing to think about is sorting the data before chunking, to minimize how many passes you need. Done that way, you should be able to do all the work in memory.
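As a rough sketch of that chunked idea (the file name, separator, and chunk size here are assumptions, not from the question; the column names come from the sample data):

import pandas as pd

parts = []
# 'edges.gz' and chunksize are placeholders; adjust for your file
for chunk in pd.read_csv('edges.gz', sep=r'\s+', compression='gzip', chunksize=1_000_000):
    # normalize each edge so the smaller ID comes first
    u1 = chunk[['User1', 'User2']].min(axis=1)
    u2 = chunk[['User1', 'User2']].max(axis=1)
    normalized = pd.DataFrame({'U1': u1, 'U2': u2, 'W': chunk['W']})
    # partially aggregate within the chunk to keep memory use small
    parts.append(normalized.groupby(['U1', 'U2'])['W'].sum())

# combine the per-chunk sums and aggregate once more across chunks
result = pd.concat(parts).groupby(level=[0, 1]).sum().reset_index()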