Pandas series: conditional rolling standard deviation - python

I have a Pandas series of random numbers from -1 to +1:
from pandas import Series
from random import random
x = Series([random() * 2 - 1. for i in range(1000)])
x
Output:
0 -0.499376
1 -0.386884
2 0.180656
3 0.014022
4 0.409052
...
995 -0.395711
996 -0.844389
997 -0.508483
998 -0.156028
999 0.002387
Length: 1000, dtype: float64
I can get the rolling standard deviation of the full Series easily:
x.rolling(30).std()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.575365
996 0.580220
997 0.580924
998 0.577202
999 0.576759
Length: 1000, dtype: float64
But what I would like is the standard deviation of only the positive numbers within each rolling window. In this example the window length is 30; if, say, only 15 of the numbers in a window are positive, I want the standard deviation of just those 15.
One could remove all negative numbers from the Series and calculate the rolling standard deviation:
x[x > 0].rolling(30).std()
Output:
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
...
988 0.286056
990 0.292455
991 0.283842
994 0.291798
999 0.291824
Length: 504, dtype: float64
...But this isn't the same thing: here every window contains 30 positive numbers, whereas in what I want, the number of positive numbers per window varies.
I want to avoid iterating over the Series; I was hoping there might be a more Pythonic way to solve my problem. Can anyone help?

Mask the non-positive values with NaN, then calculate the rolling std with min_periods=1, and optionally set the first 29 values to NaN.
import numpy as np

w = 30
s = x.mask(x <= 0).rolling(w, min_periods=1).std()
s.iloc[:w - 1] = np.nan
Note
Passing min_periods=1 is important here: after masking, most windows contain fewer than 30 non-null values, and with the default min_periods (equal to the window length) every such window would return NaN.
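As a quick sanity check, the masking approach can be verified on a tiny series by computing the same statistic by hand for one window (a minimal sketch with made-up numbers, not the OP's data):

```python
import numpy as np
import pandas as pd

# Small, hypothetical example series
x = pd.Series([0.5, -0.3, 0.2, -0.1, 0.4, 0.6])
w = 3

s = x.mask(x <= 0).rolling(w, min_periods=1).std()
s.iloc[:w - 1] = np.nan

# Manual check: sample std (ddof=1, the pandas default) of the
# positive values in the last full window
last_window = x.iloc[-w:]                      # [-0.1, 0.4, 0.6]
expected = last_window[last_window > 0].std()  # std of [0.4, 0.6]
print(np.isclose(s.iloc[-1], expected))        # True
```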

Another possible solution:
import numpy as np
import pandas as pd

pd.Series(np.where(x > 0, x, np.nan)).rolling(30, min_periods=1).std()
Output:
0 NaN
1 NaN
2 NaN
3 0.441567
4 0.312562
...
995 0.323768
996 0.312461
997 0.304077
998 0.308342
999 0.301742
Length: 1000, dtype: float64

You may first turn the non-positive values into np.nan, then apply np.nanstd to each window. So
import numpy as np

x[x <= 0] = np.nan  # note: this modifies x in place
rolling_list = [np.nanstd(window) for window in x.rolling(window=30)]
will return
[0.0,
0.0,
0.38190115685808856,
0.38190115685808856,
0.38190115685808856,
0.3704840425749437,
0.33234158296550925,
0.33234158296550925,
0.3045579286056045,
0.2962826377559198,
0.275920580105683,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554
...]
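One caveat with this answer: np.nanstd defaults to the population standard deviation (ddof=0), while pandas' std() uses the sample standard deviation (ddof=1), so its values will differ slightly from the other answers unless you pass ddof=1. A small sketch:

```python
import numpy as np
import pandas as pd

vals = pd.Series([0.2, 0.4, 0.9])

# pandas defaults to the sample std (ddof=1)...
pandas_std = vals.std()
# ...while np.nanstd defaults to the population std (ddof=0)
numpy_std = np.nanstd(vals)
numpy_std_matched = np.nanstd(vals, ddof=1)

print(np.isclose(pandas_std, numpy_std_matched))  # True
print(np.isclose(pandas_std, numpy_std))          # False
```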

IIUC, after rolling you want to calculate the std of only the positive values in each rolling window:
out = x.rolling(30).apply(lambda w: w[w>0].std())
print(out)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.324031
996 0.298276
997 0.294917
998 0.304506
999 0.308050
Length: 1000, dtype: float64
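This apply-based version and the mask-based answer above compute the same statistic, so they should agree wherever either produces a value; a quick check on a seeded random stand-in for the OP's data (a sketch, not part of the original answers):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded, reproducible stand-in series
x = pd.Series(rng.uniform(-1, 1, 200))
w = 30

via_apply = x.rolling(w).apply(lambda v: v[v > 0].std())

via_mask = x.mask(x <= 0).rolling(w, min_periods=1).std()
via_mask.iloc[:w - 1] = np.nan   # align the warm-up NaNs with the apply version

# The NaN positions coincide, so compare with NaNs filled
print(np.allclose(via_apply.fillna(0), via_mask.fillna(0)))  # True
```

The apply version is the most direct translation of the requirement, but it runs Python code per window, so the masked rolling version is typically much faster on large Series.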

Related

How to convert the data type from object to numeric and then find the mean for each row in pandas? E.g. convert '<17,500, >=15,000' to 16250 (the mean value)

data['family_income'].value_counts()
>=35,000 2517
<27,500, >=25,000 1227
<30,000, >=27,500 994
<25,000, >=22,500 833
<20,000, >=17,500 683
<12,500, >=10,000 677
<17,500, >=15,000 634
<15,000, >=12,500 629
<22,500, >=20,000 590
<10,000, >= 8,000 563
< 8,000, >= 4,000 402
< 4,000 278
Unknown 128
The data column to be shown as a MEAN value instead of values in range
data['family_income']
0 <17,500, >=15,000
1 <27,500, >=25,000
2 <30,000, >=27,500
3 <15,000, >=12,500
4 <30,000, >=27,500
...
10150 <30,000, >=27,500
10151 <25,000, >=22,500
10152 >=35,000
10153 <10,000, >= 8,000
10154 <27,500, >=25,000
Name: family_income, Length: 10155, dtype: object
Output: as mean imputed value
0 16250
1 26250
3 28750
...
10152 35000
10153 9000
10154 26500
data['family_income']=data['family_income'].str.replace(',', ' ').str.replace('<',' ')
data[['income1','income2']] = data['family_income'].apply(lambda x: pd.Series(str(x).split(">=")))
data['income1']=pd.to_numeric(data['income1'], errors='coerce')
data['income1']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10150 NaN
10151 NaN
10152 NaN
10153 NaN
10154 NaN
Name: income1, Length: 10155, dtype: float64
In this case, converting the dtype from object to numeric doesn't seem to work, since all the values come back as NaN. So how do I convert to a numeric data type and compute the mean-imputed values?
You can use the following snippet:
# Importing Dependencies
import pandas as pd
import string
# Replicating Your Data
data = ['<17,500, >=15,000', '<27,500, >=25,000', '< 4,000 ', '>=35,000']
df = pd.DataFrame(data, columns = ['family_income'])
# Removing punctuation from family_income column
df['family_income'] = df['family_income'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
# Splitting ranges to two columns A and B
df[['A', 'B']] = df['family_income'].str.strip().str.split(' ', n=1, expand=True)
# Converting cols A and B to float
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric)
# Creating mean column from A and B
df['mean'] = df[['A', 'B']].mean(axis=1)
# Input DataFrame
family_income
0 <17,500, >=15,000
1 <27,500, >=25,000
2 < 4,000
3 >=35,000
# Result DataFrame
mean
0 16250.0
1 26250.0
2 4000.0
3 35000.0
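An alternative sketch (assuming the same label format as in the question): extract every number from each label with a regex and average whatever bounds are present, which also handles the one-sided and 'Unknown' categories:

```python
import pandas as pd

# Hypothetical sample mirroring the question's categories
s = pd.Series(['<17,500, >=15,000', '>=35,000', '< 4,000', 'Unknown'])

# Pull every number out of each label, then average the bounds found
nums = s.str.replace(',', '', regex=False).str.findall(r'\d+')
mean_income = nums.apply(lambda xs: sum(map(int, xs)) / len(xs) if xs else float('nan'))
print(mean_income.tolist())  # [16250.0, 35000.0, 4000.0, nan]
```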

iteritems() in dataframe column

I have a dataset of U.S. Education Datasets: Unification Project. I want to find out
Number of rows where enrolment in grade 9 to 12 (column: GRADES_9_12_G) is less than 5000
Number of rows where enrolment is grade 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.
I am having a problem updating the count whenever the condition in the if statement is true.
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape
df = df.iloc[:, -6]
for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1
print(str(count) + str(count1))
df looks like this
0 196386.0
1 30847.0
2 175210.0
3 123113.0
4 1372011.0
5 160299.0
6 126917.0
7 28338.0
8 18173.0
9 511557.0
10 315539.0
11 43882.0
12 66541.0
13 495562.0
14 278161.0
15 138907.0
16 120960.0
17 181786.0
18 196891.0
19 59289.0
20 189795.0
21 230299.0
22 419351.0
23 224426.0
24 129554.0
25 235437.0
26 44449.0
27 79975.0
28 57605.0
29 47999.0
...
1462 NaN
1463 NaN
1464 NaN
1465 NaN
1466 NaN
1467 NaN
1468 NaN
1469 NaN
1470 NaN
1471 NaN
1472 NaN
1473 NaN
1474 NaN
1475 NaN
1476 NaN
1477 NaN
1478 NaN
1479 NaN
1480 NaN
1481 NaN
1482 NaN
1483 NaN
1484 NaN
1485 NaN
1486 NaN
1487 NaN
1488 NaN
1489 NaN
1490 NaN
1491 NaN
Name: GRADES_9_12_G, Length: 1492, dtype: float64
In the output I got
00
With Pandas, using loops is almost always the wrong way to go. You probably want something like this instead:
print(len(df.loc[df['GRADES_9_12_G'] < 5000]))
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
I downloaded your data set, and there are multiple ways to go about this. First of all, you do not need to subset your data if you do not want to. Your problem can be solved like this:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52
The line df.loc[df['GRADES_9_12_G'] < 5000] tells pandas to select all rows whose value in the 'GRADES_9_12_G' column is less than 5000. Python's built-in len then returns the number of rows in the result, which is 184. This is essentially boolean masking: it keeps every row of df for which the condition is True.
The second query, df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)],
uses the & operator, pandas's element-wise "and", which requires both conditions to hold for a row to be kept. Calling len on that result gives 52.
To go off your method:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
df = df.iloc[:, -6] # select all rows for your column -6
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
Why did I change the code in my answer instead of using yours?
Simply put, it is more idiomatic. Where we can, it is cleaner to use pandas built-ins than to iterate over dataframes with for loops, since that is what pandas was designed for.
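The counting pattern in this answer boils down to summing boolean masks; here is a minimal sketch on a made-up stand-in for the column (NaN rows are excluded automatically, since comparisons with NaN are False):

```python
import pandas as pd

# Tiny synthetic stand-in for the GRADES_9_12_G column
s = pd.Series([1000.0, 12000.0, 25000.0, 3000.0, 15000.0, None])

below_5k = int((s < 5000).sum())                  # True counts as 1
in_range = int(((s > 10000) & (s < 20000)).sum())
print(below_5k, in_range)  # 2 2
```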

Applying a map across several columns in Pandas

I am trying to apply a map across several columns in pandas to reflect when data is invalid. When data is invalid in my df['Count'] column, I want to set my df['Value'], df['Lower Confidence Interval'], df['Upper Confidence Interval'] and df['Denominator'] columns to -1.
This is a sample of the dataframe:
Count Value Lower Confidence Interval Upper Confidence Interval Denominator
121743 54.15758428 53.95153779 54.36348867 224794
280 91.80327869 88.18009411 94.38654088 305
430 56.95364238 53.39535553 60.44152684 755
970 70.54545455 68.0815009 72.89492873 1375
nan
70 28.57142857 23.27957213 34.52488678 245
125 62.5 55.6143037 68.91456314 200
Currently, I am trying:
set_minus_1s = {np.nan: -1, '*': -1, -1: -1}
then:
df[['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']] = df['Count'].map(set_minus_1s)
and getting the error:
ValueError: Must have equal len keys and value when setting with an iterable
Is there any way of chaining the column references to make one call to the map rather than have separate lines for each column to call the set_minus_1s dictionary as a map?
I think you can use where or mask: first map the Count column, then replace the rows where the mapped value is not null:
val = df['Count'].map(set_minus_1s)
print (val)
0 NaN
1 NaN
2 NaN
3 NaN
4 -1.0
5 NaN
6 NaN
Name: Count, dtype: float64
cols =['Value','Count','Lower Confidence Interval','Upper Confidence Interval','Denominator']
df[cols] = df[cols].where(val.isnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
cols = ['Value', 'Count', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Denominator']
df[cols] = df[cols].mask(val.notnull(), val, axis=0)
print (df)
Count Value Lower Confidence Interval Upper Confidence Interval \
0 121743.0 54.157584 53.951538 54.363489
1 280.0 91.803279 88.180094 94.386541
2 430.0 56.953642 53.395356 60.441527
3 970.0 70.545455 68.081501 72.894929
4 -1.0 -1.000000 -1.000000 -1.000000
5 70.0 28.571429 23.279572 34.524887
6 125.0 62.500000 55.614304 68.914563
Denominator
0 224794.0
1 305.0
2 755.0
3 1375.0
4 -1.0
5 245.0
6 200.0
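If the goal is only to overwrite a fixed set of columns wherever Count is invalid, a plain boolean mask with .loc is arguably simpler than map plus where/mask. A sketch on a small hypothetical frame (a subset of the question's columns):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring part of the question's layout
df = pd.DataFrame({
    'Count': [121743.0, np.nan, 125.0],
    'Value': [54.16, np.nan, 62.5],
    'Denominator': [224794.0, np.nan, 200.0],
})

cols = ['Count', 'Value', 'Denominator']
# Invalid = NaN, '*', or -1, as in the question's set_minus_1s dict
invalid = df['Count'].isna() | df['Count'].isin(['*', -1])
df.loc[invalid, cols] = -1
print(df)
```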

pandas, dataframe, groupby, std

New to pandas here. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host+operation, calculate std deviation for execution time per host, then by host+operation pair. Seems simple?
It works for grouping by a single column:
df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial 132564 non-null values
host 132564 non-null values
idnum 132564 non-null values
operation 132564 non-null values
time 132564 non-null values
...
dtypes: float32(1), int64(2), object(6)
byhost = df.groupby('host')
byhost.std()
Out[362]:
datespecial idnum time
host
ahost1.test 11946.961952 40367.033852 0.003699
host1.test 15484.975077 38206.578115 0.008800
host10.test NaN 37644.137631 0.018001
...
Nice. Now:
byhostandop = df.groupby(['host', 'operation'])
byhostandop.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
386 # todo, implement at cython level?
387 if ddof == 1:
--> 388 return self._cython_agg_general('std')
389 else:
390 f = lambda x: x.std(ddof=ddof)
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
1615
1616 def _cython_agg_general(self, how, numeric_only=True):
-> 1617 new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
1618 return self._wrap_agged_blocks(new_blocks)
1619
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
1653 values = com.ensure_float(values)
1654
-> 1655 result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
1656
1657 # see if we can cast the block back to the original dtype
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
838 if is_numeric:
839 result = lib.row_bool_subset(result,
--> 840 (counts > 0).view(np.uint8))
841 else:
842 result = lib.row_bool_subset_object(result,
/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'
Huh?? Why do I get this exception?
More questions:
how do I calculate std deviation on dataframe.groupby([several columns])?
how can I limit calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.
It's important to know your version of Pandas / Python. Looks like this exception could arise in Pandas version < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid this, you can cast your float columns to float64:
df.astype('float64')
To calculate std() on selected columns, just select columns :)
>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
a b c g
0 0 10 a 1
1 1 11 b 1
2 2 12 c 1
3 3 13 d 2
4 4 14 e 2
5 5 15 f 2
6 6 16 g 3
7 7 17 h 3
8 8 18 i 3
9 9 19 j 3
>>> df.groupby('g')[['a', 'b']].std()
a b
g
1 1.000000 1.000000
2 1.000000 1.000000
3 1.290994 1.290994
update
As far as I can tell, std() calls a Cython-level aggregation on the groupby result and hits a subtle bug (see Python Pandas: Using Aggregate vs Apply to define new columns). To work around it, you can use apply():
byhostandop['time'].apply(lambda x: x.std())
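Putting both parts together, selecting the column before aggregating answers the second question directly; a sketch on made-up host/operation data (hypothetical values, not the OP's dataset):

```python
import pandas as pd

# Hypothetical host/operation/time data
df = pd.DataFrame({
    'host': ['h1', 'h1', 'h1', 'h2', 'h2', 'h2'],
    'operation': ['read', 'read', 'write', 'read', 'read', 'read'],
    'time': [1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
})

# Sample std per (host, operation) pair, restricted to the 'time' column
out = df.groupby(['host', 'operation'])['time'].std()
print(out)
```

Groups with a single observation (h1/write here) return NaN, since the sample std (ddof=1) is undefined for one value.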

pandas DataFrame Dividing a column by itself

I have a pandas dataframe that I filled with this:
import pandas.io.data as web
test = web.get_data_yahoo('QQQ')
The dataframe looks like this in iPython:
In [13]: test
Out[13]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 729 entries, 2010-01-04 00:00:00 to 2012-11-23 00:00:00
Data columns:
Open 729 non-null values
High 729 non-null values
Low 729 non-null values
Close 729 non-null values
Volume 729 non-null values
Adj Close 729 non-null values
dtypes: float64(5), int64(1)
When I divide one column by another, I get a float64 result that has a satisfactory number of decimal places. I can even divide one column by another column offset by one, for instance test.Open[1:]/test.Close[:], and get a satisfactory number of decimal places. When I divide a column by itself offset, however, I get just 1:
In [83]: test.Open[1:] / test.Close[:]
Out[83]:
Date
2010-01-04 NaN
2010-01-05 0.999354
2010-01-06 1.005635
2010-01-07 1.000866
2010-01-08 0.989689
2010-01-11 1.005393
...
In [84]: test.Open[1:] / test.Open[:]
Out[84]:
Date
2010-01-04 NaN
2010-01-05 1
2010-01-06 1
2010-01-07 1
2010-01-08 1
2010-01-11 1
I'm probably missing something simple. What do I need to do in order to get a useful value out of that sort of calculation? Thanks in advance for the assistance.
If you're looking to do operations between the column and lagged values, you should be doing something like test.Open / test.Open.shift().
shift realigns the data and takes an optional number of periods.
You may not be getting what you think you are when you do test.Open[1:]/test.Close. Pandas matches up the rows based on their index, so you're still getting each element of one column divided by its corresponding element in the other column (not the element one row back). Here's an example:
>>> print d
A B C
0 1 3 7
1 -2 1 6
2 8 6 9
3 1 -5 11
4 -4 -2 0
>>> d.A / d.B
0 0.333333
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
>>> d.A[1:] / d.B
0 NaN
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
Notice that the values returned are the same for both operations. The second one just has nan for the first one, since there was no corresponding value in the first operand.
If you really want to operate on offset rows, you'll need to dig down to the numpy arrays that underpin the pandas DataFrame, to bypass pandas's index-aligning features. You can get at these innards with the values attribute of a column.
>>> d.A.values[1:] / d.B.values[:-1]
array([-0.66666667, 8. , 0.16666667, 0.8 ])
Now you really are getting each value divided by the one before it in the other column. Note that here you have to explicitly slice the second operand to leave off the last element, to make them equal in length.
So you can do the same to divide a column by an offset version of itself:
>>> d.A.values[1:] / d.A.values[:-1]
array([-2.   , -4.   ,  0.125, -4.   ])
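That said, the shift-based approach mentioned at the top of this answer usually reads better, because it stays inside pandas and keeps the index; a sketch using the same example frame:

```python
import pandas as pd

d = pd.DataFrame({'A': [1, -2, 8, 1, -4], 'B': [3, 1, 6, -5, -2]})

# shift() realigns the data by one period, so no manual slicing is needed;
# the first element becomes NaN instead of being dropped
ratio = d['A'] / d['A'].shift()
print(ratio.tolist())  # [nan, -2.0, -4.0, 0.125, -4.0]
```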
