pd.qcut is returning negative values - python

Here is a simple sample series of data:
sample
Out[2]:
0 0.047515
1 0.026392
2 0.024652
3 0.022854
4 0.020397
5 0.000087
6 0.000087
7 0.000078
8 0.000078
9 0.000078
The minimum value is 0.000078 and the maximum is 0.047515.
When I use the qcut function on it, the resulting categories have a negative lower bound.
pd.qcut(sample, 4)
Out[31]:
0 (0.0242, 0.0475]
1 (0.0242, 0.0475]
2 (0.0242, 0.0475]
3 (0.0102, 0.0242]
4 (0.0102, 0.0242]
5 (8.02e-05, 0.0102]
6 (8.02e-05, 0.0102]
7 (-0.000922, 8.02e-05]
8 (-0.000922, 8.02e-05]
9 (-0.000922, 8.02e-05]
Name: data, dtype: category
Categories (4, interval[float64]): [(-0.000922, 8.02e-05] < (8.02e-05, 0.0102] < (0.0102, 0.0242] < (0.0242, 0.0475]]
Is this expected behavior? I thought my min and max would appear as the lower and upper bounds of the categories.
(I use pandas 0.22.0 and python-2.7)

This happens because the binning procedure subtracts .001 from the lowest value in your range. If a bin edge were exactly equal to a value in your series, it would be unclear which bin that value should be placed into. Thus, it makes sense to slightly adjust the min and max before creating the quantiles.
See lines 210-213 in the source code for pd.cut. https://github.com/pandas-dev/pandas/blob/v0.23.4/pandas/core/reshape/tile.py#L210-L213
0.000078 - .001
Out[21]: -0.0009220000000000001
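A minimal sketch to verify this (assuming retbins=True returns the unadjusted quantile edges while the category labels show the adjusted left bound, which appears to be the case here):

import pandas as pd

sample = pd.Series([0.047515, 0.026392, 0.024652, 0.022854, 0.020397,
                    0.000087, 0.000087, 0.000078, 0.000078, 0.000078])

cats, edges = pd.qcut(sample, 4, retbins=True)
print(cats.cat.categories)   # leftmost interval opens just below the minimum
print(edges)                 # raw quantile edges, starting at the true minimum 0.000078
print(cats.notna().all())    # every value, including the min, still falls in a bin

So only the printed label of the first interval is widened; the underlying quantile edges still start at your minimum.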

Related

sum of specific elements in normalized value count in pandas (KNN Classification)

I am trying some kNN classification, and when testing the model with 30% of the original data
I want to calculate the percentage of correct classifications within a +/- 3 point range (left side in the output below).
In other words, the sum of the seven floats at the bottom of the output below:
18 0.000028
15 0.000028
14 0.000083
13 0.000193
-11 0.000276
12 0.000634
-10 0.000689
-9 0.001019
11 0.001323
-8 0.002976
10 0.003141
9 0.005097
-7 0.006833
8 0.009093
7 0.012702
-6 0.013005
6 0.020941
-5 0.021905
-4 0.036646
5 0.036674
4 0.055713
-3 0.058965
3 0.087896
-2 0.088777
-1 0.116166
2 0.119031
1 0.142893
0 0.157276
To be precise, the sum of these floats:
-3 0.058965
3 0.087896
-2 0.088777
-1 0.116166
2 0.119031
1 0.142893
0 0.157276
Which would be 0.771004.
So how do I make Python add only those values together?
Importantly, these values are not always the seven at the bottom; they may be spread around, depending on the chosen value for k.
I think my problem is that I have the logic for sorting and summing mixed up.
This output is generated by this command:
perc_pred_dev = check_test_series['deviation_from_tru'].value_counts(sort=True, ascending=True, normalize=True)
print(perc_pred_dev)
check_test_series is a DataFrame with the info necessary to check accuracy.
deviation_from_tru is the difference between the original value and the model's prediction.
You can drop the sort from value_counts because it is not needed; in any case you should select by index label and sum. Assuming perc_pred_dev is the Series returned by value_counts, sort it by index first so that the label slice -3:3 is well defined:
perc_pred_dev.sort_index().loc[-3:3].sum()
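For example, a small self-contained sketch (the values below are copied from the output above and trimmed to a few rows, purely for illustration):

import pandas as pd

# stand-in for check_test_series['deviation_from_tru'].value_counts(normalize=True)
perc_pred_dev = pd.Series({0: 0.157276, 1: 0.142893, 2: 0.119031, -1: 0.116166,
                           -2: 0.088777, 3: 0.087896, -3: 0.058965, 4: 0.055713})

# sorting by index label makes the slice -3..3 well defined regardless of k
within_three = perc_pred_dev.sort_index().loc[-3:3].sum()
print(round(within_three, 6))  # 0.771004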

Python Pandas Running Totals with Resets

I would like to perform the following task. Given two columns (good and bad), I would like to replace certain rows in the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (in this case 20 bins) using a continuous variable as the input. I know the pandas cut and qcut functions are available; however, the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value), and zeros in either the numerator or the denominator will not allow those calculations to work.
import pandas as pd

d={'AAA':range(0,20),
   'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
   'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, anytime I encounter a zero in either column, I need to keep a running total of the non-zero column until the next row where the column that contained the zero has a non-zero value.
Here is the desired output:
dd={'AAA':range(0,16),
    'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
    'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df=pd.DataFrame(data=dd)
print(desired_df)
The basic idea of my solution is to build a grouping key from a cumsum over the non-zero values, so that each run of zero rows ends up in the same group as the next non-zero row. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your edge case of trailing zeros differently from your desired output; it simply cuts them off. You'd have to add some extra code to catch that case with different logic.
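For instance, one possible way to fold the trailing zero rows into the last bin, sketched under the assumption that they should simply be summed into the preceding group (as in the desired output above):

import pandas as pd

d = {'AAA': range(0, 20),
     'good': [3, 3, 13, 20, 28, 32, 59, 72, 64, 52, 38, 24, 17, 19, 12, 5, 7, 6, 2, 0],
     'bad': [0, 0, 1, 1, 1, 0, 6, 8, 10, 6, 6, 10, 5, 8, 2, 2, 1, 3, 1, 1]}
df = pd.DataFrame(data=d)

# a row "closes" a bin when both columns are non-zero
closes = (df['good'] != 0) & (df['bad'] != 0)

# group key: the position of the next closing row, so zero rows roll forward;
# trailing zero rows have no next closing row and are folded backward instead
group = df.index.to_series().where(closes).bfill().ffill()

out = df.groupby(group)[['good', 'bad']].sum().reset_index(drop=True)
out.insert(0, 'AAA', list(range(len(out))))
print(out)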
@P.Tillmann, I appreciate your assistance with this. More advanced readers will probably find the following code appalling, as I do. I would be more than happy to take any recommendation that makes it more streamlined.
d={'AAA':range(0,20),
   'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
   'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)

row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
crappy_fix=pd.DataFrame()

for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or df.loc[index+1,'bad']==0):
        row_bad=row['bad']
        row_good=row['good']
        output_ind='2'
        row_out='NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='3'
    else:
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'

    if ((row['good']==0 or row['bad']==0)
            and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
            and row_good != 0 and row_bad != 0):
        row_out='YES'

    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        crappy_fix=crappy_fix.append([temp_dict],ignore_index=True)

    print(str(row['AAA']),'-',
          str(row['good']),'-',
          str(row['bad']),'-',
          str(row_good),'-',
          str(row_bad),'-',
          str(row_good_zero_count),'-',
          str(row_bad_zero_count),'-',
          row_out,'-',
          output_ind)

print(crappy_fix)

scale numerical values for different groups in python

I want to scale the numerical values (similar to R's scale function) based on different groups.
Note: by scaling I am referring to this metric:
(x - group_mean) / group_std
Example dataset (to demonstrate the idea):
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desirable results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
Referring to this link: implementing R scale function in pandas in Python? I used the scale function defined there and tried to apply it in this fashion:
dt.groupby("advertiser_id").apply(scale)
but get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
In my original dataset the number of rows is 15770, but I don't think the scale function should map a single value to multiple results in my case, so I don't understand the shape mismatch.
I would appreciate if you can give me some sample code or some suggestions into how to modify it, thanks!
First, np.std behaves differently from most other languages in that its delta degrees of freedom (ddof) defaults to 0, whereas R uses 1. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches R's result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its std would be 0 and you will get NaN. Check whether you are getting NaN for this reason; R would return NaN in this case as well.
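A self-contained sketch of the transform approach (the column name scaled_value is just for illustration); note that pandas' own Series.std already defaults to ddof=1:

import pandas as pd

dt = pd.DataFrame({'advertiser_id': [10, 10, 10, 11, 11],
                   'value': [11, 22, 2424, 34, 342342]})

# transform keeps the original shape, so the result aligns row for row with dt
dt['scaled_value'] = (dt.groupby('advertiser_id')['value']
                        .transform(lambda x: (x - x.mean()) / x.std(ddof=1)))
print(dt)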

Finding the percent change of values in a Series

I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the percent change of the 'questions' column and then to sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on a Series as well as a DataFrame and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes shown first...
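For instance, a short sketch of the periods argument on a throwaway Series (the values are made up purely for illustration):

import pandas as pd

counts = pd.Series([100, 110, 121, 133, 146])

print(counts.pct_change())            # change versus the previous row (periods=1, the default)
print(counts.pct_change(periods=3))   # change versus the value three rows earlier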

Using pandas for the probability of two nodes to be linked in a network

From a network, I want to plot the probability that two nodes are connected as a function of their distance from each other.
I have two pandas Series: one (distance) holds the distance between each pair of nodes, and the other (adjacency) is filled with zeros and ones and tells whether the nodes are connected.
My idea was to use cut and value_counts to first compute the number of pairs whose distance falls inside each bin, which works fine:
factor = pandas.cut(distance, 100)
num_bin = pandas.value_counts(factor)
Now if I had a vector of the same size as num_bin with the number of connected pairs inside each bin, I would have my probability. But how do I compute this vector?
My problem is: among, let's say, the 3 pairs of nodes inside the second bin, how do I know how many are connected?
Thanks.
You could use crosstab for this:
import numpy as np
import pandas as pd
factor = pd.cut(distance, 100)
# the crosstab dataframe with the value counts in each bucket
ct = pd.crosstab(factor, adjacency, margins=True,
                 rownames=['distance'], colnames=['adjacency'])
# from here computing the probability of nodes being adjacent is straightforward
ct['prob'] = np.true_divide(ct[1], ct['All'])
Which gives a dataframe of this form:
>>> ct
adjacency 0 1 All prob
distance
(0.00685, 0.107] 7 4 11 0.363636
(0.107, 0.205] 6 9 15 0.600000
(0.205, 0.304] 6 6 12 0.500000
(0.304, 0.403] 5 2 7 0.285714
(0.403, 0.502] 4 6 10 0.600000
(0.502, 0.6] 8 3 11 0.272727
(0.6, 0.699] 6 2 8 0.250000
(0.699, 0.798] 4 6 10 0.600000
(0.798, 0.896] 4 5 9 0.555556
(0.896, 0.995] 5 2 7 0.285714
All 55 45 100 0.450000
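A self-contained sketch with simulated data (the distances and adjacency flags below are randomly generated just so the snippet runs end to end):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
distance = pd.Series(rng.rand(100))             # simulated pairwise distances
adjacency = pd.Series(rng.randint(0, 2, 100))   # simulated 0/1 connection flags

factor = pd.cut(distance, 10)  # 10 bins keeps the printed table readable
ct = pd.crosstab(factor, adjacency, margins=True,
                 rownames=['distance'], colnames=['adjacency'])
ct['prob'] = np.true_divide(ct[1], ct['All'])
print(ct)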
