pandas groupby report empty bins - python

I want to make a 2d histogram (or other statistics, but let's take a histogram for the example) of a given 2d data set. The problem is that empty bins seem to be discarded altogether. For instance,
import numpy
import pandas
numpy.random.seed(35)
values = numpy.random.random((2,10000))
xbins = numpy.linspace(0, 1.2, 7)
ybins = numpy.linspace(0, 1, 6)
I can easily get the desired output with
print numpy.histogram2d(values[0], values[1], (xbins, ybins))[0]
giving
[[ 408.  373.  405.  411.  400.]
 [ 390.  413.  400.  414.  368.]
 [ 354.  414.  421.  400.  413.]
 [ 426.  393.  407.  416.  412.]
 [ 412.  397.  396.  356.  401.]
 [   0.    0.    0.    0.    0.]]
However, with pandas,
df = pandas.DataFrame({'x': values[0], 'y': values[1]})
binned = df.groupby([pandas.cut(df['x'], xbins),
                     pandas.cut(df['y'], ybins)])
print binned.size().unstack()
prints
y           (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
x
(0, 0.2]         408         373         405         411       400
(0.2, 0.4]       390         413         400         414       368
(0.4, 0.6]       354         414         421         400       413
(0.6, 0.8]       426         393         407         416       412
(0.8, 1]         412         397         396         356       401
i.e., the last row, for 1 < x <= 1.2, is missing entirely because it contains no values, but I would like it to appear explicitly (as it does with numpy.histogram2d). In this example I can use numpy just fine, but in more complicated settings (n-dimensional binning, statistics other than counts, etc.), pandas can be more efficient to code and to compute with than numpy.
In principle I can come up with ways to check if an index is present, using something like
allkeys = [('({0}, {1}]'.format(xbins[i-1], xbins[i]),
            '({0}, {1}]'.format(ybins[j-1], ybins[j]))
           for j in xrange(1, len(ybins))
           for i in xrange(1, len(xbins))]
However, the problem is that the index formatting is not consistent: as you can see above, the first index of binned is ('(0, 0.2]', '(0, 0.2]'), but the first entry in allkeys is ('(0.0, 0.2]', '(0.0, 0.2]'), so I cannot match allkeys against binned.viewkeys().
Any help is much appreciated.

It appears that pd.cut keeps your binning information, which means we can use it in a reindex:
In [79]: xcut = pd.cut(df['x'], xbins)
In [80]: ycut = pd.cut(df['y'], ybins)
In [81]: binned = df.groupby([xcut, ycut])
In [82]: sizes = binned.size()
In [85]: (sizes.reindex(pd.MultiIndex.from_product([xcut.cat.categories,
    ...:                                            ycut.cat.categories]))
    ...:        .unstack()
    ...:        .fillna(0.0))
Out[85]:
            (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
(0.0, 0.2]       408.0       373.0       405.0       411.0       400.0
(0.2, 0.4]       390.0       413.0       400.0       414.0       368.0
(0.4, 0.6]       354.0       414.0       421.0       400.0       413.0
(0.6, 0.8]       426.0       393.0       407.0       416.0       412.0
(0.8, 1.0]       412.0       397.0       396.0       356.0       401.0
(1.0, 1.2]         0.0         0.0         0.0         0.0         0.0
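As an aside (my addition, not from the original answer): because pd.cut returns a categorical, newer pandas versions (0.23+) can keep the empty bins directly through the groupby observed keyword, making the reindex unnecessary. A minimal sketch, assuming the same df, xcut and ycut as above:
sizes = (df.groupby([xcut, ycut], observed=False)   # keep unobserved (empty) categories
           .size()
           .unstack(fill_value=0))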

Related

How to set precision on column names made by np.arange()?

I made a dataframe and set the column names using np.arange(). However, instead of exact numbers it (sometimes) sets them to numbers like 0.30000000000000004.
I tried both rounding the entire dataframe and using np.around() on the np.arange() output, but neither seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is the return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns=np.arange(0, 1 + stepT, stepT),
                    index=np.around(np.arange(0, 1 + stepS, stepS), decimals=3)).round(3)
Is there any function that will let me have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
0    0    1    2    3    4    5    6    7    8    9   10
In [701]: df.columns
Out[701]:
Float64Index([0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5,
              0.6000000000000001, 0.7000000000000001, 0.8, 0.9, 1.0],
             dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
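If you do need to keep the float labels, a tolerance-based lookup is more robust than exact equality. A hypothetical helper (my suggestion, not part of the original answer), using np.isclose:
close = np.isclose(df.columns, 0.3)   # boolean mask over the float column index
df[df.columns[close][0]]              # select the column via the matching label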
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
  0.000  0.100  0.200  0.300  0.400  0.500  0.600  0.700  0.800  0.900  1.000
0     0      1      2      3      4      5      6      7      8      9     10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
       '0.800', '0.900', '1.000'],
      dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
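Alternatively (my addition, assuming one decimal place is enough): rounding the arange itself snaps each value to the double nearest the intended label, which both displays cleanly and makes float lookups work:
import numpy as np
import pandas as pd
stepT = 0.1
cols = np.arange(0, 1 + stepT, stepT).round(1)           # 0.30000000000000004 -> 0.3
df = pd.DataFrame(np.arange(11)[None, :], columns=cols)
df[0.3]                                                  # matches: 0.3 is now the exact label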

Pandas Replace Values with Dictionary

I have a data frame with the below structure:
        Ranges  Relative_17-Aug  Relative_17-Sep  Relative_17-Oct
0   (0.0, 0.1]             1372             1583             1214
1   (0.1, 0.2]              440              337              648
2   (0.2, 0.3]              111               51              105
3   (0.3, 0.4]               33               10               19
4   (0.4, 0.5]               16                4                9
5   (0.5, 0.6]                7                7                1
6   (0.6, 0.7]                4                3                0
7   (0.7, 0.8]                5                1                0
8   (0.8, 0.9]                2                3                0
9   (0.9, 1.0]                2                0                1
10  (1.0, 2.0]                6                0                2
I am trying to replace the Ranges column with a dictionary using the code below, but it is not working. Any hints on what I am doing wrong?
mydict = {"(0.0, 0.1]": "<=10%",
          "(0.1, 0.2]": ">10% and <20%",
          "(0.2, 0.3]": ">20% and <30%",
          "(0.3, 0.4]": ">30% and <40%",
          "(0.4, 0.5]": ">40% and <50%",
          "(0.5, 0.6]": ">50% and <60%",
          "(0.6, 0.7]": ">60% and <70%",
          "(0.7, 0.8]": ">70% and <80%",
          "(0.8, 0.9]": ">80% and <90%",
          "(0.9, 1.0]": ">90% and <100%",
          "(1.0, 2.0]": ">100%"}
t_df["Ranges"].replace(mydict, inplace=True)
Thanks!
I think it is best to use the labels parameter when creating the Ranges column with cut:
labels = ['<=10%', '>10% and <20%', ...]
# change to your bins
bins = [0, 0.1, 0.2, ...]
t_df['Ranges'] = pd.cut(t_df['col'], bins=bins, labels=labels)
If that is not possible, casting to string should help, as @Dark suggested in the comments; for better performance, use map:
t_df["Ranges"] = t_df["Ranges"].astype(str).map(mydict)
Using the map function, this can be achieved easily and in a straightforward manner, as shown below.
mydict = {"(0.0, 0.1]": "<=10%",
          "(0.1, 0.2]": ">10% and <20%",
          "(0.2, 0.3]": ">20% and <30%",
          "(0.3, 0.4]": ">30% and <40%",
          "(0.4, 0.5]": ">40% and <50%",
          "(0.5, 0.6]": ">50% and <60%",
          "(0.6, 0.7]": ">60% and <70%",
          "(0.7, 0.8]": ">70% and <80%",
          "(0.8, 0.9]": ">80% and <90%",
          "(0.9, 1.0]": ">90% and <100%",
          "(1.0, 2.0]": ">100%"}
t_df["Ranges"] = t_df["Ranges"].map(lambda x: mydict[str(x)])
Hope this helps!

Python KS test - why is the P Value so large

I am trying to run the KS test on two curves.
One is the raw data plot (red), and the other is the power-law fit (blue).
from scipy import stats
stats.ks_2samp(Red.Y, Blue.Y)
where Red.Y is the y value at each point x, and Blue.Y is the power-law y value at each x.
Out[210]:
Ks_2sampResult(statistic=0.16666666666666669, pvalue=0.99133252540492101)
It seems like the p-value is extremely large, given that the graphs are not alike. May I ask the reason?
The values for Red.Y are:
(0.03, 0.09] 0.000018
(0.09, 0.16] 0.000019
(0.16, 0.29] 0.000016
(0.29, 0.5] 0.000018
(0.5, 0.77] 0.000018
(0.77, 1.0] 0.000022
(1.0, 1.05] 0.000021
(1.05, 1.5] 0.000022
(1.5, 2.0] 0.000025
(2.0, 3.0] 0.000025
(3.0, 4.0] 0.000024
(4.0, 6.42] 0.000026
The values of Blue.Y are:
(0.03, 0.09] 0.000017
(0.09, 0.16] 0.000017
(0.16, 0.29] 0.000018
(0.29, 0.5] 0.000019
(0.5, 0.77] 0.000020
(0.77, 1.0] 0.000021
(1.0, 1.05] 0.000021
(1.05, 1.5] 0.000022
(1.5, 2.0] 0.000023
(2.0, 3.0] 0.000024
(3.0, 4.0] 0.000025
(4.0, 6.42] 0.000026
Basically, in the KS test you compare the cumulative distribution functions (CDFs) of the two data samples (see Wikipedia). Suppose you have the blue-line data and the red-line data:
import numpy as np

red_line = [0.000018, 0.000019, 0.000016,
            0.000018, 0.000018, 0.000022,
            0.000021, 0.000022, 0.000025,
            0.000025, 0.000024, 0.000026]
blue_line = [0.000017, 0.000017, 0.000018,
             0.000019, 0.000020, 0.000021,
             0.000021, 0.000022, 0.000023,
             0.000024, 0.000025, 0.000026]
n1 = len(red_line)
n2 = len(blue_line)
data_all = red_line + blue_line   # all observed points (list concatenation)
# each sample must be sorted for searchsorted to give a valid CDF
# CDF of red line
cdf1 = np.searchsorted(np.sort(red_line), data_all, side='right') / (1.0 * n1)
# CDF of blue line
cdf2 = np.searchsorted(np.sort(blue_line), data_all, side='right') / (1.0 * n2)
# D statistic
d = np.max(np.absolute(cdf1 - cdf2))
The D statistic (the first value returned) is the maximum distance between the two CDFs.
The p-value is then computed from this distance using the Kolmogorov (Brownian bridge supremum) distribution; you can see how it is calculated in the scipy source code. Basically, if the difference between the CDFs is small relative to that distribution, you get a large p-value, e.g. p > 0.1, meaning you cannot reject the hypothesis that the two samples come from the same distribution.
from scipy.stats import distributions
en = np.sqrt(n1 * n2 / float(n1 + n2))
prob = distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d) # p-value
From the given data here, I got (D, p) = (0.1667, 0.9913).
So yes, even though the graphs look different, the CDFs of the two samples can be very similar, and that is why the p-value is still large.
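For reference, the same numbers come straight from scipy (a quick sanity check, assuming the red_line and blue_line lists above):
from scipy import stats
d, p = stats.ks_2samp(red_line, blue_line)
# gives d of about 0.1667 and p of about 0.9913, matching the manual computation above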

Generate Sample dataframes with constraints

I have a dataframe of records that looks like:
     Location  Rec ID  Duration  Rec-X
0     Houston     126      17    [0.2, 0.34, 0.45, ..., 0.28]
1     Chicago     126      19.3  [0.12, 0.3, 0.41, ..., 0.39]
2      Boston     348      17.3  [0.12, 0.3, 0.41, ..., 0.39]
3     Chicago     138      12.3  [0.12, 0.3, 0.41, ..., 0.39]
4    New York     238      11.3  [0.12, 0.3, 0.41, ..., 0.39]
...
500   Chicago     126      19.3  [0.12, 0.3, 0.41, ..., 0.39]
As part of a genetic algorithm, I want to initialize a population of 10 subsets of records. I want each subset to contain 10 records, but it must NOT contain the same 'Rec ID' twice.
Any idea on how to generate those 10 different dataframes?
Thanks,
You can drop duplicates based on that column and then take the first 10 rows:
df2 = df.drop_duplicates('Rec ID')
df2.head(10)
EDIT
If you want to select 10 unique records at random, something like this will work:
def selectRandomUnique(df, n=10):
    # sample n rows, then retry until all 'Rec ID' values are distinct
    d2 = df.sample(n=n).drop_duplicates('Rec ID')
    while len(d2) != n:
        d2 = df.sample(n=n).drop_duplicates('Rec ID')
    return d2
Here you first sample rows at random and then drop any duplicates that may exist, resampling until you have the required number of unique records.
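Building the full population is then one more line (a minimal sketch, assuming the helper above and the original frame df):
# 10 subsets, each with 10 records and no repeated 'Rec ID'
population = [selectRandomUnique(df, n=10) for _ in range(10)]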

reassigning expanded recarray field

I am loading file data into a numpy recarray and subsequently filling known gaps with NaNs. However, I cannot find a way to increase the size of a field in the recarray in order to reassign it with the gap-filled array. The example of my problem given below raises a ValueError about broadcasting from a larger to a smaller shape.
Using Python 2.7.6.1 and numpy 1.8.1-6.
Thanks, Rob
import numpy as np
import numpy.ma as ma
a1 = np.arange(0,20,1)
a2 = np.arange(100,120,1)
X = np.recarray((20,), dtype=[('g', float), ('h', int)])
X['g'][:] = a1
X['h'][:] = a2
for afield in X.dtype.names:
    Y = X[afield].copy(order='K')
    for icnt in range(0, 3):
        Y = np.insert(Y, 5, np.nan, axis=0)
    ma.resize(X[afield], (len(Y),))
    X[afield][:] = Y[:]
You did not "expand" your recarray X. Recarrays cannot be expanded per label (name/column), which is what you were hoping to do with ma.resize. Note that ma.resize returns a new (masked) array with the new shape without altering the arrays passed to it, but in your code you are not using the return value. So that line doesn't do anything. To clarify:
X[afield] = ma.resize(X[afield], (len(Y),) )
would also not work, because record arrays cannot be expanded per label ('column').
If you want to expand a recarray, you'll need to do it in one go (with functions from np.lib.recfunctions), so add an entirely new column or add several new records for all existing columns.
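For example, appending a whole new column in one go could look like this (a sketch; the field name 'k' and the zero data are just for illustration):
from numpy.lib import recfunctions as rfn
# add a new float field 'k' across all existing records at once
X2 = rfn.append_fields(X, 'k', np.zeros(len(X)), usemask=False)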
That being said, why not just try this:
>>> Y = np.arange(20, dtype=np.float)
>>> Ynan = np.insert(Y, (5,)*3, (np.nan,)*3)
>>> X = np.rec.fromarrays([Ynan, Ynan+100], names='g,h')
>>> X
rec.array([(0.0, 100.0), (1.0, 101.0), (2.0, 102.0), (3.0, 103.0),
           (4.0, 104.0), (nan, nan), (nan, nan), (nan, nan), (5.0, 105.0),
           (6.0, 106.0), (7.0, 107.0), (8.0, 108.0), (9.0, 109.0),
           (10.0, 110.0), (11.0, 111.0), (12.0, 112.0), (13.0, 113.0),
           (14.0, 114.0), (15.0, 115.0), (16.0, 116.0), (17.0, 117.0),
           (18.0, 118.0), (19.0, 119.0)],
          dtype=[('g', '<f8'), ('h', '<f8')])
Note that you cannot convert the 2nd column (label 'h') to an int, because np.nan is a floating point type. If you tried, you'd get garbage:
>>> X['h'].astype(np.int)
array([                 100,                  101,                  102,
                        103,                  104, -9223372036854775808,
       -9223372036854775808, -9223372036854775808,                  105,
                        106,                  107,                  108,
                        109,                  110,                  111,
                        112,                  113,                  114,
                        115,                  116,                  117,
                        118,                  119])
I think what you're after is actually masked record arrays:
>>> import numpy.ma.mrecords as mrecords
>>>
>>> X = np.rec.fromarrays([Ynan, (Ynan+100).astype(np.int)], names='g,h')
>>> Z = np.ma.array(X, mask=np.isnan(Ynan))
>>> Z2 = Z.view(mrecords.mrecarray)
>>>
>>> Z2
masked_records(
    g : [0.0 1.0 2.0 3.0 4.0 -- -- -- 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0
         15.0 16.0 17.0 18.0 19.0]
    h : [100 101 102 103 104 -- -- -- 105 106 107 108 109 110 111 112 113 114 115
         116 117 118 119]
    fill_value : (1e+20, 999999)
)
>>>
>>> Z2['h']
masked_array(data = [100 101 102 103 104 -- -- -- 105 106 107 108 109 110 111 112 113 114 115
                     116 117 118 119],
             mask = [False False False False False  True  True  True False False False False
                     False False False False False False False False False False False],
             fill_value = 999999)
As you can see, the "columns" of Z2 have the desired dtype (float and int), are accessible by their column names and have some of the data masked.
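A small usage note (my addition): each masked column can be turned back into a plain array with filled, choosing the fill value explicitly:
Z2['g'].filled(np.nan)   # float column: masked entries become nan
Z2['h'].filled(-1)       # int column: pick a sentinel, since nan is float-only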
