Multiplying Different Columns/Rows In Same Dataframe - python

I have a dataframe - see below. I would like to multiply 30 by .115 and create a new column (second image). I started writing the code below, but I'm not sure if I'm even on the right track.
DF.loc[(DF['PRO_CHARGE']=="(1.0, 99283.0)")), '%'] = DF.loc(DF['PRO_CHARGE']=="(0.0, 99283.0)")....?

With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "PRO_CHARGE": [(0.0, 99283.0), (1.0, 99283.0), (1.0, 99284.0)],
        "TOTAL": [15, 30, 23],
        "PERC_99282": [0.115, 0.49, 0.53],
    }
)
print(df)
# Output
       PRO_CHARGE  TOTAL  PERC_99282
0  (0.0, 99283.0)     15       0.115
1  (1.0, 99283.0)     30       0.490
2  (1.0, 99284.0)     23       0.530
Here is one way to do it with .loc and boolean masks. (.at is meant for a single scalar row and column label, so .loc is the appropriate accessor when selecting with a mask.)
df.loc[df["PRO_CHARGE"] == (1.0, 99283.0), "%"] = (
    df.loc[df["PRO_CHARGE"] == (1.0, 99283.0), "TOTAL"].values[0]
    * df.loc[df["PRO_CHARGE"] == (0.0, 99283.0), "PERC_99282"].values[0]
)
Then:
print(df)
# Output
       PRO_CHARGE  TOTAL  PERC_99282     %
0  (0.0, 99283.0)     15       0.115   NaN
1  (1.0, 99283.0)     30       0.490  3.45
2  (1.0, 99284.0)     23       0.530   NaN
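A slightly more defensive variant of the same idea, as a sketch: the masks are built once with an explicit element-wise tuple comparison and reused. The names target, source, total and perc are introduced here for illustration only, and the code assumes each key matches exactly one row, as in the toy dataframe.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "PRO_CHARGE": [(0.0, 99283.0), (1.0, 99283.0), (1.0, 99284.0)],
        "TOTAL": [15, 30, 23],
        "PERC_99282": [0.115, 0.49, 0.53],
    }
)

# Compare each tuple element-wise so the key is never treated as a list of values
target = df["PRO_CHARGE"].map(lambda t: t == (1.0, 99283.0))  # row that receives the result
source = df["PRO_CHARGE"].map(lambda t: t == (0.0, 99283.0))  # row that supplies the percentage

total = df.loc[target, "TOTAL"].iloc[0]       # 30
perc = df.loc[source, "PERC_99282"].iloc[0]   # 0.115

# Rows outside the mask are left as NaN in the new column
df.loc[target, "%"] = total * perc
print(df)
```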

Related

Python: pandas.cut labels are ignored

I want to cut one column in my pandas.DataFrame using pandas.cut(), but the labels I put into labels argument are not applied. Let me show you an example.
I have got the following data frame:
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [-0.009, 0.089, 0.095, 0.096, 0.198]})
>>> print(df)
x
0 -0.009
1 0.089
2 0.095
3 0.096
4 0.198
And I cut x column like this:
>>> bins = pd.IntervalIndex.from_tuples([(-0.2, -0.1), (-0.1, 0.0), (0.0, 0.1), (0.1, 0.2)])
>>> labels = [100, 200, 300, 400]
>>> df['x_cut'] = pd.cut(df['x'], bins, labels=labels)
>>> print(df)
x x_cut
0 -0.009 (-0.1, 0.0]
1 0.089 (0.0, 0.1]
2 0.095 (0.0, 0.1]
3 0.096 (0.0, 0.1]
4 0.198 (0.1, 0.2]
However, I expected the data frame to look like this:
x x_cut
0 -0.009 200
1 0.089 300
2 0.095 300
3 0.096 300
4 0.198 400
What am I missing? How can I get the data frame with correct labels?
This is a known pandas issue (21233): when bins is an IntervalIndex, the labels argument is ignored.
What works for me, as @anky_91 commented, is mapping through a dictionary created by zip:
df['x_cut'] = pd.cut(df['x'], bins).map(dict(zip(bins, labels)))
print(df)
x x_cut
0 -0.009 200
1 0.089 300
2 0.095 300
3 0.096 300
4 0.198 400
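If the intervals are contiguous, as they are here, another option is to pass plain bin edges instead of an IntervalIndex, in which case labels is applied. A minimal sketch, assuming the same data and labels as in the question (the edges list is simply the endpoints of the four intervals and is not part of the original code):

```python
import pandas as pd

df = pd.DataFrame({'x': [-0.009, 0.089, 0.095, 0.096, 0.198]})

# Endpoints of the four contiguous intervals from the question
edges = [-0.2, -0.1, 0.0, 0.1, 0.2]
labels = [100, 200, 300, 400]

# With a list of edges (not an IntervalIndex), pd.cut uses the labels
df['x_cut'] = pd.cut(df['x'], bins=edges, labels=labels)
print(df)
```

The resulting column is categorical; cast it with astype(int) if plain integers are needed.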

How to set precision on column names made by np.arange()?

I made a dataframe and set its column names using np.arange(). However, instead of exact numbers, it (sometimes) sets them to values like 0.300000004.
I tried both rounding the entire dataframe and using np.around() on the np.arange() output, but neither seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns = np.arange(0,1+stepT, stepT),
index = np.around(np.arange(0,1+stepS,stepS),decimals = 3)).round(3)
Is there any function that will allow me to have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0 1 2 3 4 5 6 7 8 9 10
In [701]: df.columns
Out[701]:
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0],
dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
0.000 0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000
0 0 1 2 3 4 5 6 7 8 9 10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
'0.800', '0.900', '1.000'],
dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
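If the labels have to stay numeric, one possible alternative is to build them by dividing an integer range instead of accumulating a float step. For this particular grid that yields floats that compare equal to the literals 0.1, 0.2, 0.3 and so on. This is a sketch under that assumption (the 11-point grid and the placeholder data mirror the example above); it is not a general cure for float keys, and string labels remain the more robust choice.

```python
import numpy as np
import pandas as pd

# Integer grid 0..10 divided by 10: each label is the float closest to k/10,
# which is the same float Python produces for the literal 0.k
cols = np.arange(11) / 10

# Same placeholder data as in the example above
df = pd.DataFrame(np.arange(11)[None, :], columns=cols)

print(cols.tolist())
print(df[0.3])  # lookup by the literal 0.3
```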

Pandas Replace Values with Dictionary

I have a data frame with the below structure:
Ranges Relative_17-Aug Relative_17-Sep Relative_17-Oct
0 (0.0, 0.1] 1372 1583 1214
1 (0.1, 0.2] 440 337 648
2 (0.2, 0.3] 111 51 105
3 (0.3, 0.4] 33 10 19
4 (0.4, 0.5] 16 4 9
5 (0.5, 0.6] 7 7 1
6 (0.6, 0.7] 4 3 0
7 (0.7, 0.8] 5 1 0
8 (0.8, 0.9] 2 3 0
9 (0.9, 1.0] 2 0 1
10 (1.0, 2.0] 6 0 2
I am trying to replace the values in the Ranges column using a dictionary with the code below, but it is not working. Any hints as to what I am doing wrong?
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"].replace(mydict,inplace=True)
Thanks!
I think the best approach here is to use the labels parameter of cut at the time the Ranges column is created:
labels = ['<=10%','>10% and <20%', ...]
# replace with your own bins
bins = [0,0.1,0.2...]
t_df['Ranges'] = pd.cut(t_df['col'], bins=bins, labels=labels)
If that is not possible, casting to string should help, as @Dark suggested in the comments; for better performance use map:
t_df["Ranges"] = t_df["Ranges"].astype(str).map(mydict)
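The likely reason replace with string keys has no effect is that the values produced by cut are Interval objects rather than strings. A small sketch of the cast-then-map approach, using made-up input values and only the first three bins (values, ranges and relabelled are hypothetical names, not from the question):

```python
import pandas as pd

# Hypothetical raw values; in the question this would be the column that was binned
values = pd.Series([0.05, 0.15, 0.25])
ranges = pd.cut(values, bins=[0, 0.1, 0.2, 0.3])

# The elements are Interval objects, not strings
print(type(ranges.iloc[0]))

mydict = {
    "(0.0, 0.1]": "<=10%",
    "(0.1, 0.2]": ">10% and <20%",
    "(0.2, 0.3]": ">20% and <30%",
}

# Cast each interval to its string form, then look it up in the dictionary
relabelled = ranges.astype(str).map(mydict)
print(relabelled.tolist())
```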
This can also be achieved in a straightforward way with map and a lambda, as shown below:
mydict= {"(0.0, 0.1]":"<=10%","(0.1, 0.2]":">10% and <20%","(0.2, 0.3]":">20% and <30%", "(0.3, 0.4]":">30% and <40%", "(0.4, 0.5]":">40% and <50%", "(0.5, 0.6]":">50% and <60%", "(0.6, 0.7]":">60% and <70%", "(0.7, 0.8]":">70% and <80%", "(0.8, 0.9]":">80% and <90%", "(0.9, 1.0]":">90% and <100%", "(1.0, 2.0]":">100%"}
t_df["Ranges"] = t_df["Ranges"].map(lambda x : mydict[str(x)])
Hope this helps..!!

Python KS test - why is the P Value so large

I am trying to run the KS test on two graphs.
One is the raw data plot (red), and the other is the power-law fit.
from scipy import stats
stats.ks_2samp(Red.Y, Blue.Y)
where Red.Y is the y value at each point of x, and Blue.Y is the power law y value at each of x.
Out[210]:
Ks_2sampResult(statistic=0.16666666666666669, pvalue=0.99133252540492101)
The p-value seems extremely large given that the graphs are not alike. What is the reason?
The values of Red.Y are:
(0.03, 0.09] 0.000018
(0.09, 0.16] 0.000019
(0.16, 0.29] 0.000016
(0.29, 0.5] 0.000018
(0.5, 0.77] 0.000018
(0.77, 1.0] 0.000022
(1.0, 1.05] 0.000021
(1.05, 1.5] 0.000022
(1.5, 2.0] 0.000025
(2.0, 3.0] 0.000025
(3.0, 4.0] 0.000024
(4.0, 6.42] 0.000026
The values of Blue.Y are:
(0.03, 0.09] 0.000017
(0.09, 0.16] 0.000017
(0.16, 0.29] 0.000018
(0.29, 0.5] 0.000019
(0.5, 0.77] 0.000020
(0.77, 1.0] 0.000021
(1.0, 1.05] 0.000021
(1.05, 1.5] 0.000022
(1.5, 2.0] 0.000023
(2.0, 3.0] 0.000024
(3.0, 4.0] 0.000025
(4.0, 6.42] 0.000026
Basically, the KS test compares the empirical cumulative distribution functions (CDFs) of two data samples (see Wikipedia). Note that ks_2samp treats Red.Y and Blue.Y as two unordered samples of y values; it does not compare the curves point by point along x. Suppose you have the red-line and blue-line data:
import numpy as np
red_line = [0.000018, 0.000019, 0.000016,
            0.000018, 0.000018, 0.000022,
            0.000021, 0.000022, 0.000025,
            0.000025, 0.000024, 0.000026]
blue_line = [0.000017, 0.000017, 0.000018,
             0.000019, 0.000020, 0.000021,
             0.000021, 0.000022, 0.000023,
             0.000024, 0.000025, 0.000026]
n1 = len(red_line)
n2 = len(blue_line)
# np.searchsorted needs sorted input, so sort each sample first
red_sorted = np.sort(red_line)
blue_sorted = np.sort(blue_line)
pooled = np.concatenate([red_sorted, blue_sorted])
# Empirical CDF of the red line, evaluated at every pooled value
cdf1 = np.searchsorted(red_sorted, pooled, side='right') / n1
# Empirical CDF of the blue line
cdf2 = np.searchsorted(blue_sorted, pooled, side='right') / n2
# D-statistic
d = np.max(np.absolute(cdf1 - cdf2))
The D statistic (the first value returned) is the maximum distance between the two CDFs.
The p-value is obtained by scaling D by the sample sizes and evaluating the survival function of the Kolmogorov distribution (the distribution of the supremum of a Brownian bridge) at that point. You can see how it is calculated in the source code. If the difference between the CDFs is small relative to what that distribution allows, you get a large p-value, p > 0.1 for example, which means you cannot reject the hypothesis that the two samples come from the same distribution.
from scipy.stats import distributions
en = np.sqrt(n1 * n2 / float(n1 + n2))
prob = distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d) # p-value
The question reports (D, p) = (0.1667, 0.9913), presumably computed on the unrounded data. With the rounded values listed here, the two CDFs never differ by more than one point out of 12, so D is about 0.0833 and the p-value is even closer to 1.
So yes, even if the graphs look different, the CDFs of the two samples can be very similar, and that is why the p-value is still large.
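One way to see that the statistic depends only on the two sets of values, and not on where they sit along x, is to shuffle one sample and recompute D. A minimal NumPy sketch (ks_statistic is a hypothetical helper written for this illustration, not a SciPy function; the data are the rounded values from the question):

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    pooled = np.concatenate([a, b])
    # Fraction of each sample that is <= every pooled value
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

red_line = [0.000018, 0.000019, 0.000016, 0.000018, 0.000018, 0.000022,
            0.000021, 0.000022, 0.000025, 0.000025, 0.000024, 0.000026]
blue_line = [0.000017, 0.000017, 0.000018, 0.000019, 0.000020, 0.000021,
             0.000021, 0.000022, 0.000023, 0.000024, 0.000025, 0.000026]

d_original = ks_statistic(red_line, blue_line)

# Reorder the red values: the curve along x changes, the set of values does not
rng = np.random.default_rng(0)
d_shuffled = ks_statistic(rng.permutation(red_line), blue_line)

print(d_original, d_shuffled)  # the two values are identical
```

If the aim is to judge how well the fit follows the data point by point, a test on the distribution of y values is therefore not measuring that.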

pandas groupby report empty bins

I want to make a 2d histogram (or other statistics, but let's take a histogram for the example) of a given 2d data set. The problem is that empty bins seem to be discarded altogether. For instance,
import numpy
import pandas
numpy.random.seed(35)
values = numpy.random.random((2,10000))
xbins = numpy.linspace(0, 1.2, 7)
ybins = numpy.linspace(0, 1, 6)
I can easily get the desired output with
print(numpy.histogram2d(values[0], values[1], (xbins, ybins))[0])
giving
[[ 408. 373. 405. 411. 400.]
[ 390. 413. 400. 414. 368.]
[ 354. 414. 421. 400. 413.]
[ 426. 393. 407. 416. 412.]
[ 412. 397. 396. 356. 401.]
[ 0. 0. 0. 0. 0.]]
However, with pandas,
df = pandas.DataFrame({'x': values[0], 'y': values[1]})
binned = df.groupby([pandas.cut(df['x'], xbins),
pandas.cut(df['y'], ybins)])
print(binned.size().unstack())
prints
y           (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
x
(0, 0.2]         408         373         405         411       400
(0.2, 0.4]       390         413         400         414       368
(0.4, 0.6]       354         414         421         400       413
(0.6, 0.8]       426         393         407         416       412
(0.8, 1]         412         397         396         356       401
i.e., the last row, with 1 < x <= 1.2, is missing entirely, because there are no values in it. However I would like to see that explicitly (as when using numpy.histogram2d). In this example I can use numpy just fine but on more complicated settings (n-dimensional binning, or calculating statistics other than counts, etc), pandas can be more efficient to code and to calculate than numpy.
In principle I can come up with ways to check if an index is present, using something like
allkeys = [('({0}, {1}]'.format(xbins[i-1], xbins[i]),
            '({0}, {1}]'.format(ybins[j-1], ybins[j]))
           for j in range(1, len(ybins))
           for i in range(1, len(xbins))]
However, the problem is that the index formatting is not consistent, in the sense that, as you see above, the first index of binned is ['(0, 0.2]', '(0, 0.2]'] but the first entry in allkeys is ['(0.0, 0.2]', '(0.0, 0.2]'], so I cannot match allkeys to binned.viewkeys().
Any help is much appreciated.
It appears that pd.cut keeps your binning information which means we can use it in a reindex:
In [79]: xcut = pd.cut(df['x'], xbins)
In [80]: ycut = pd.cut(df['y'], ybins)
In [81]: binned = df.groupby([xcut, ycut])
In [82]: sizes = binned.size()
In [85]: (sizes.reindex(pd.MultiIndex.from_product([xcut.cat.categories, ycut.cat.categories]))
...: .unstack()
...: .fillna(0.0))
...:
Out[85]:
            (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
(0.0, 0.2]       408.0       373.0       405.0       411.0       400.0
(0.2, 0.4]       390.0       413.0       400.0       414.0       368.0
(0.4, 0.6]       354.0       414.0       421.0       400.0       413.0
(0.6, 0.8]       426.0       393.0       407.0       416.0       412.0
(0.8, 1.0]       412.0       397.0       396.0       356.0       401.0
(1.0, 1.2]         0.0         0.0         0.0         0.0         0.0
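In more recent pandas versions, groupby accepts an observed parameter for categorical groupers. Assuming a version that supports it, passing observed=False should keep the empty bins without a manual reindex. A sketch reusing the setup from the question (counts is a name introduced here):

```python
import numpy as np
import pandas as pd

np.random.seed(35)
values = np.random.random((2, 10000))
xbins = np.linspace(0, 1.2, 7)
ybins = np.linspace(0, 1, 6)

df = pd.DataFrame({'x': values[0], 'y': values[1]})
xcut = pd.cut(df['x'], xbins)
ycut = pd.cut(df['y'], ybins)

# observed=False asks groupby to keep categories that contain no rows
counts = df.groupby([xcut, ycut], observed=False).size().unstack()
print(counts)
```

Because the empty combinations are reported as 0 by size(), the counts stay integers and no fillna step is needed.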
