Selecting elements of a pandas dataframe that fall above a critical threshold - python

I have a pandas.df and I'm trying to remove all hypotheses that can be rejected.
Here is a snippet of the df in question:
best value p_value
0 11.9549 0.986927
1 11.9588 0.986896
2 12.1185 0.985588
3 12.1682 0.985161
4 12.3907 0.983131
5 12.4148 0.982899
6 12.6273 0.980750
7 12.9020 0.977680
8 13.4576 0.970384
9 13.5058 0.969679
10 13.5243 0.969405
11 13.5886 0.968439
12 13.8025 0.965067
13 13.9840 0.962011
14 14.1896 0.958326
15 14.3939 0.954424
16 14.6229 0.949758
17 14.6689 0.948783
18 14.9464 0.942626
19 15.1216 0.938494
20 15.5326 0.928039
21 17.7720 0.851915
22 17.8668 0.847993
23 17.9662 0.843822
24 19.2481 0.785072
25 19.5257 0.771242
I want to remove the elements with a p_value greater than a critical threshold alpha, by selecting the ones that fall below alpha. The p value is calculated using scipy.stats.chisqprob(chisq, df), where chisq is the chi-squared statistic and df is the degrees of freedom. This is all done using the custom method self.get_p_values, which is called in the method shown below.
def reject_null_hypothesis(self, alpha, df):
    assert alpha > 0
    assert alpha < 1
    p_value = self.get_p_values(df)  # calculates the data frame above
    return p_value.loc[p_value['best value']
I'm then calling this method using:
PE=Modelling_Tools.PE_Results(PE_file) #Modelling_Tools is the module and PE_Results is the class which is given the data 'PE_file'
print PE.reject_null_hypothesis(0.5,25)
From what I've read this should do what I want, but I'm new to pandas DataFrames and this code returns the DataFrame unchanged.

Are you getting any errors when you run this? I ask because:
print PE.reject_null_hypothesis(0.5, 25)
is passing 25, an int object, into reject_null_hypothesis() in the last argument position, where a pandas.DataFrame object is expected.
(Apologies. I would respond with this as a comment instead of an answer, but I only have 46 reputation at the moment, and 50 is needed to comment.)

Refer to indexing with a boolean array:
df[ df.p_value < threshold ]

Turns out there is a simple way to do what I want. Here is the code for those who want to know.
def reject_null_hypothesis(self, alpha, df):
    '''
    alpha = critical threshold for chisq statistic
    df = degrees of freedom
    values below this critical threshold are rejected.
    values above this threshold are not 'proven' but
    cannot be rejected and must therefore be subject to
    further statistics
    '''
    assert alpha > 0
    assert alpha < 1
    p_value = self.get_p_values(df)
    passed = p_value[p_value.loc[:, 'p_value'] > alpha].index
    return p_value[:max(passed)]
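For reference, the same selection can also be written directly with a boolean mask, without going through the index, as the other answer suggests. A minimal sketch of that idea, assuming get_p_values returns the DataFrame shown above with a 'p_value' column (use < alpha instead if you want the rejected rows):
def reject_null_hypothesis(self, alpha, df):
    assert 0 < alpha < 1
    p_value = self.get_p_values(df)              # DataFrame with a 'p_value' column
    return p_value[p_value['p_value'] > alpha]   # rows that cannot be rejected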

Related

tabula extract table from pdf remove line break

I have a table with wrapped text in a pdf file
I used tabula to extract the table from the pdf file:
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1,lattice=True)
table[0]
However, the end result looks like this:
Is there a way to interpret a line break or wrapped text in a table in the pdf as part of its own row, not as extra rows?
End result should be looking like this using tabula:
You need to add a parameter. Replace
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1)
table[0]
with
file1 = "path_to_pdf_file"
table = tabula.read_pdf(file1,pages=1, lattice = True)
table[0]
All this is according to the documentation here.
Here is an example:
See the article "https://effectivehealthcare.ahrq.gov/sites/default/files/pdf/methods-guidance-tests-bias_methods.pdf"
import tabula
import io
import pandas as pd
file1 = r"C:\Users\s-degossondevarennes\.......\Desktop\methods-guidance-tests-bias_methods.pdf"
table = tabula.read_pdf(file1,pages=3,lattice=True, )
df = table[0]
df = df.drop(['Unnamed: 1','Unnamed: 2','Description','Unnamed: 3'],axis=1)
df
returns:
Unnamed: 0 \
0 NaN
1 Spectrum effect
2 Context bias
3 Selection bias
4 NaN
5 Variation in test execution
6 Variation in test technology
7 Treatment paradox
8 Disease progression bias
9 NaN
10 Inappropriate reference\rstandard
11 Differential verification bias
12 Partial verification bias
13 NaN
14 Review bias
15 Clinical review bias
16 Incorporation bias
17 Observer variability
18 NaN
19 Handling of indeterminate\rresults
20 Arbitrary choice of threshold\rvalue
Source of Systematic Bias
0 Population
1 Tests may perform differently in various sampl...
2 Prevalence of the target condition varies acco...
3 The selection process determines the compositi...
4 Test Protocol: Materials and Methods
5 A sufficient description of the execution of i...
6 When the characteristics of a medical test cha...
7 Occurs when treatment is started on the basis ...
8 Occurs when the index test is performed an unu...
9 Reference Standard and Verification Procedure
10 Errors of imperfect reference standard bias th...
11 Part of the index test results is verified by ...
12 Only a selected sample of patients who underwe...
13 Interpretation
14 Interpretation of the index test or reference ...
15 Availability of clinical data such as age, sex...
16 The result of the index test is used to establ...
17 The reproducibility of test results is one det...
18 Analysis
19 A medical test can produce an uninterpretable ...
20 The selection of the threshold value for the i...
The three dots in the column Source of Systematic Bias show that everything that was in that cell, including the line breaks, is considered a single cell (item), not multiple cells. Another proof of that is that
df.iloc[2,1]
returns the cell content:
'Prevalence of the target condition varies according to setting and may affect\restimates of test performance. Interpreters may consider test results to be\rpositive more frequently in settings with higher disease prevalence, which may\ralso affect estimates of test performance.'
There must be something with your pdf. If it's available online, share the link and I'll take a look.
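If the wrapped text still shows up as literal carriage returns inside the cells, they can be stripped after extraction. A small follow-up sketch (replacing them with a space is just my assumption about what you want):
df = df.replace(r'\r', ' ', regex=True)  # replace carriage returns inside cells with spaces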

Python: numpy, pandas, and performing operations on the previous array value (smoothed averages): any way to not use FOR loop? EWMA?

Tbh, I'm not really sure how to ask this question. I've got an array of values, and I'm looking to take the smoothed average of these values moving forward. In Excel, the calculation process is:
average_val_1 = mean of the values through window_size
average_val_2 = (value at location window_size+1 * (window_size-1) + average_val_1) / window_size
average_val_3 = (value at location window_size+2 * (window_size-1) + average_val_2) / window_size
etc., etc.
In pandas and numpy, my code for this is the following
import numpy as np
import pandas as pd

df = pd.DataFrame({'av': np.nan, 'values': np.random.rand(10)})
df = df[['values', 'av']]
window = 5
df['av'].iloc[5] = np.mean(df['values'][:5])
for i in range(window+1, len(df.index)):
    df['av'].iloc[i] = (df['values'].iloc[i] * (window-1) + df['av'].iloc[i-1]) / window
Which returns:
values av
0 0.418498 NaN
1 0.570326 NaN
2 0.296878 NaN
3 0.308445 NaN
4 0.127376 NaN
5 0.381160 0.344305
6 0.239725 0.260641
7 0.928491 0.794921
8 0.711632 0.728290
9 0.319791 0.401491
These are the values I am looking for, but there has to be a better way than using for loops. I think the answer has something to do with using exponentially weighted moving averages, but I'll be damned if I can figure out the syntax to make any sense of that.
Any suggestions?
You can use ewm, such as:
window = 5
df['av'] = np.nan
df['av'].iloc[window] = np.mean(df['values'][:window])
df.loc[window:, 'av'] = (df.loc[window:, 'av'].fillna(df['values'])
                         .ewm(adjust=False, alpha=(window-1.)/window).mean())
and you get the same result as with your for loop. To be sure it works, column 'av' must be NaN (apart from the seeded value), otherwise the fillna with column 'values' will not happen and the values calculated in 'av' will be wrong. The parameter alpha in ewm is what weights the row you are calculating.
Note: while this code does the same as yours, I would recommend having a look at this line in your code:
df['av'].iloc[5] = np.mean(df['values'][:5])
Because the upper bound is excluded when slicing with [:5], df['values'][:5] is:
0 0.418498
1 0.570326
2 0.296878
3 0.308445
4 0.127376
Name: values, dtype: float64
so I think that what you should do is df['av'].iloc[4] = np.mean(df['values'][:5]). If you agree, then my code above must be changed slightly:
df['av'].iloc[window-1] = np.mean(df['values'][:window])
df.loc[window-1:, 'av'] = (df.loc[window-1:, 'av'].fillna(df['values'])
                           .ewm(adjust=False, alpha=(window-1.)/window).mean())
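To check the equivalence end to end, here is a small self-contained sketch; the fixed seed is only there so both versions see the same random data, it is not part of the question:
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducibility only
df = pd.DataFrame({'values': np.random.rand(10)})
window = 5

# loop version
loop_av = pd.Series(np.nan, index=df.index)
loop_av.iloc[window] = df['values'][:window].mean()
for i in range(window + 1, len(df)):
    loop_av.iloc[i] = (df['values'].iloc[i] * (window - 1) + loop_av.iloc[i - 1]) / window

# ewm version
ewm_av = pd.Series(np.nan, index=df.index)
ewm_av.iloc[window] = df['values'][:window].mean()
ewm_av.loc[window:] = (ewm_av.loc[window:].fillna(df['values'])
                       .ewm(adjust=False, alpha=(window - 1.) / window).mean())

print(np.allclose(loop_av.dropna(), ewm_av.dropna()))  # True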

Calculating Variable Cash-flow IRR in Python (pandas)

I have a DataFrame of unpredictable cashflows and unpredictable period lengths, and I need to generate a backward-looking IRR.
Doing it in Excel is pretty straightforward using the solver; I'm wondering if there's a good way to pull it off in Python. (I think I could leverage openpyxl to get the solver to work in Excel from Python, but that feels unnecessarily cumbersome.)
The problem is pretty straightforward:
NPV of Cash Flow = ((cash_flow)/(1+IRR)^years_ago)
GOAL: Find IRR where SUM(NPV) = 0
My dataframe looks something like this:
cash_flow |years_ago
-----------------------
-3.60837e+06 |4.09167
31462 |4.09167
1.05956e+06 |3.63333
-1.32718e+06 |3.28056
-4.46554e+06 |3.03889
It seems as though other IRR calculators (such as numpy.irr) assume strict period cutoffs (every 3 months, 1 year, etc), which won't work. The other option seems to be the iterative route, where I continually guess, check, and iterate, but that feels like the wrong way to tackle this. Ideally, I'm looking for something that would do this:
irr = calc_irr((cash_flow1,years_ago1),(cash_flow2,years_ago2),etc)
EDIT: Here is the code I'm running the problem from. I have a list of transactions, and I've chosen to create temporary tables by id.
for id in df_tran.id.unique():
    temp_df = df_tran[df_tran.id == id]
    cash_flow = temp_df.cash_flows.values
    years = temp_df.years.values
    print(id, cash_flow)
    print(years)
    #irr_calc = irr(cfs=cash_flow, yrs=years, x0=0.100000)
    #print(sid, irr_calc)
where df_tran (which temp_df is based on) looks like:
cash_flow |years |id
0 -3.60837e+06 4.09167 978237
1 31462 4.09167 978237
4 1.05956e+06 3.63333 978237
6 -1.32718e+06 3.28056 978237
8 -4.46554e+06 3.03889 978237
10 -3.16163e+06 2.81944 978237
12 -5.07288e+06 2.58889 978237
14 268833 2.46667 978237
17 -4.74703e+06 1.79167 978237
20 -964987 1.40556 978237
22 -142920 1.12222 978237
24 163894 0.947222 978237
26 -2.2064e+06 0.655556 978237
27 1.23804e+06 0.566667 978237
29 180655 0.430556 978237
30 -85297 0.336111 978237
34 -2.3529e+07 0.758333 1329483
36 21935 0.636111 1329483
38 -3.55067e+06 0.366667 1329483
41 -4e+06 4.14167 1365051
temp_df looks identical to df_tran, except it only holds transactions for a single id.
You can use scipy.optimize.fsolve:
Return the roots of the (non-linear) equations defined by func(x) = 0
given a starting estimate.
First define the function that will be the func parameter to fsolve. This is NPV as a result of your IRR, cash flows, and years. (Vectorize with NumPy.)
import numpy as np
def npv(irr, cfs, yrs):
    return np.sum(cfs / (1. + irr) ** yrs)
An example:
cash_flow = np.array([-2., .5, .75, 1.35])
years = np.arange(4)
# A guess
print(npv(irr=0.10, cfs=cash_flow, yrs=years))
0.0886551465064
Now to use fsolve:
from scipy.optimize import fsolve
def irr(cfs, yrs, x0):
    return np.asscalar(fsolve(npv, x0=x0, args=(cfs, yrs)))
Your IRR is:
print(irr(cfs=cash_flow, yrs=years, x0=0.10))
0.12129650313214262
And you can confirm that this gets you to a 0 NPV:
res = irr(cfs=cash_flow, yrs=years, x0=0.10)
print(np.allclose(npv(res, cash_flow, years), 0.))
True
All code together:
import numpy as np
from scipy.optimize import fsolve
def npv(irr, cfs, yrs):
    return np.sum(cfs / (1. + irr) ** yrs)

def irr(cfs, yrs, x0, **kwargs):
    return np.asscalar(fsolve(npv, x0=x0, args=(cfs, yrs), **kwargs))
To make this compatible with your pandas example, just use
cash_flow = df.cash_flow.values
years = df.years_ago.values
Update: the values in your question seem a bit nonsensical (your IRR is going to be some astronomical number if it even exists) but here is how you'd run:
cash_flow = np.array([-3.60837e+06, 31462, 1.05956e+06, -1.32718e+06, -4.46554e+06])
years_ago = np.array([4.09167, 4.09167, 3.63333, 3.28056, 3.03889])
print(irr(cash_flow, years_ago, x0=0.10, maxfev=10000))
1.3977721900669127e+82
Second update: there are a couple of minor typos in your code, and your actual flows of $ and timing work out to nonsensical IRRs, but here's what you're looking to do, below. For instance, notice you have one id with a single negative transaction, which gives a negatively infinite IRR.
for i, df in df_tran.groupby('id'):
    cash_flow = df.cash_flow.values
    years = df.years.values
    print('id:', i, 'irr:', irr(cash_flow, years, x0=0.))
id: 978237 irr: 347.8254979851405
id: 1329483 irr: 3.2921314448062817e+114
id: 1365051 irr: 1.0444951674872467e+25
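As a side note (not part of the original answer), if fsolve drifts for badly behaved cash flows like these, a bracketed root finder can be more predictable. A sketch, assuming the NPV actually changes sign somewhere on the chosen interval (otherwise brentq raises an error):
from scipy.optimize import brentq

# brentq needs a bracket [a, b] with npv(a, ...) and npv(b, ...) of opposite sign
irr_bracketed = brentq(npv, -0.99, 10.0, args=(cash_flow, years_ago))
print(irr_bracketed)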

Normalize columns in pandas data frame while once column is in a specific range

I have a data frame in pandas which contains my Experimental data. It looks like this:
KE BE EXP_DATA COL_1 COL_2 COL_3 .....
10 1 5 1 2 3
9 2 . . . .
8 3 . .
7 4
6 5
.
.
The column KE is not used. BE holds the values for the x-axis and all other columns are y-axis values.
For normalisation I use the idea which is also presented here (Normalise) in the post by Michael Aquilina.
Therefore I need to find the maximum and the minimum of my data. I do it like this:
minBE = self.data[EXP_DATA].min()
maxBE = self.data[EXP_DATA].max()
Now I want to find the maximum and minimum value of this column, but only for the range in the "column" EXP_DATA where the "column" BE is in a certain range. So in essence I want to normalize the data only within a certain x-range.
Solution
Thanks to the solution Milo gave me, I now use this function:
def normalize(self, BE="Exp", NRANGE=False):
    """
    Normalize data by dividing all components by the max value of the data.
    """
    if BE not in self.data.columns:
        raise NameError("'{}' is not an existing column. ".format(BE) +
                        "Try list_columns()")
    if NRANGE and len(NRANGE) == 2:
        upper_be = max(NRANGE)
        lower_be = min(NRANGE)
        minBE = self.data[BE][(self.data.index > lower_be) & (self.data.index < upper_be)].min()
        maxBE = self.data[BE][(self.data.index > lower_be) & (self.data.index < upper_be)].max()
        for col in self.data.columns:  # this is done so the data in NRANGE is really scaled between [0, 1]
            msk = (self.data[col].index < max(NRANGE)) & (self.data[col].index > min(NRANGE))
            self.data[col] = self.data[col][msk]
    else:
        minBE = self.data[BE].min()
        maxBE = self.data[BE].max()
    for col in self.data.columns:
        self.data[col] = (self.data[col] - minBE) / (maxBE - minBE)
If I call the function with the parameter NRANGE=[a, b], and a and b are also the x limits of my plot, it automatically scales the visible y-values between 0 and 1, as the rest of the data is masked. If the function is called without the NRANGE parameter, the whole range of the data passed to the function is scaled from 0 to 1.
Thank you for your help!
You can use boolean indexing. For example, to select the max and min values in column EXP_DATA where BE is larger than 2 and less than 5:
lower_be = 2
upper_be = 5
max_in_range = self.data['EXP_DATA'][(self.data['BE'] > lower_be) & (self.data['BE'] < upper_be)].max()
min_in_range = self.data['EXP_DATA'][(self.data['BE'] > lower_be) & (self.data['BE'] < upper_be)].min()
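A quick way to see this working, with a made-up DataFrame standing in for self.data (the column names match the question, the numbers are only an illustration):
import pandas as pd

data = pd.DataFrame({'BE': [1, 2, 3, 4, 5, 6],
                     'EXP_DATA': [5., 9., 7., 3., 8., 1.]})

lower_be, upper_be = 2, 5
in_range = (data['BE'] > lower_be) & (data['BE'] < upper_be)
max_in_range = data['EXP_DATA'][in_range].max()  # 7.0 (rows with BE = 3, 4)
min_in_range = data['EXP_DATA'][in_range].min()  # 3.0

# normalise EXP_DATA against the in-range min and max
data['EXP_DATA_norm'] = (data['EXP_DATA'] - min_in_range) / (max_in_range - min_in_range)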

Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time] = 0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti = starti + 2
    print starti
    endi = endi + 2
    for time in uniqueTimes:
        def helper(row):
            start = row[starti]
            end = row[endi]
            track = row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status = row[8]
            track = row[7]
            if track <= status:
                return status
            else:
                return track
        def Multiplier(row):
            x = row[8]
            if x == 0:
                return 0.0*row[0]
            if x == 1:
                return 5.0*row[0]
            if x == 2:
                return 10.0*row[0]
            if x == -1:  # numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status'] = allPeopleDf.apply(helper, axis=1)
        allPeopleDf['track'] = allPeopleDf.apply(trackHelp, axis=1)
        stateData[time] = allPeopleDf.apply(Multiplier, axis=1).sum()
    for k, v in stateData.iteritems():
        comboStates[k] = comboStates.get(k, 0) + v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO question like this that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
                   [10,'B',7],
                   [13,'C',10],
                   [15,'A',15],
                   [20,'A',7],
                   [23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
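If you want those as explicit Cartesian pairs, or a quick step plot, a small sketch building on the series above (matplotlib is my own addition here, not something the question requires):
import matplotlib.pyplot as plt

total = df.groupby("time")["increment"].sum().cumsum()
points = list(zip(total.index, total.values))      # (time, total_intensity) pairs
plt.step(total.index, total.values, where="post")  # piecewise-constant total intensity
plt.xlabel("time")
plt.ylabel("total intensity")
plt.show()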
EDIT: applying the specific data presented in the question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weight/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up with to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in xrange(len(states)):
        j = 2 + 2*i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're right back at the beginning of this answer (substituting 'key' with 'id' and the keys with the ids, of course).
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32
