Create a pd.DataFrame with a minimum number of rows using hypothesis - python

I'm using the hypothesis library and I would like to create a pd.DataFrame with three columns. Each column may contain the integer values +1, 0, or -1. The values don't need to be unique. Also, I would like to get at least ten rows.
With the following code, hypothesis seems to produce either empty dataframes or a dataframe with only one row.
When adding assume(len(signals) > 10) to the test, hypothesis is not able to find a suitable example.
import pandas as pd

from hypothesis import assume, given, strategies as st
from hypothesis.extra.pandas import columns, data_frames


@given(
    signals=data_frames(
        columns=columns(
            ["sec_1", "sec_2", "sec_3"],
            elements=st.integers(min_value=-1, max_value=1),
        )
    )
)
def test_length(signals: pd.DataFrame) -> None:
    assume(len(signals) > 10)
    assert len(signals) > 10
What am I doing wrong here?
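One way to get the minimum-size behaviour directly, sketched under the assumption that hypothesis's range_indexes strategy from hypothesis.extra.pandas behaves as documented, is to pass an explicit index strategy with a minimum size instead of relying on assume():

import pandas as pd

from hypothesis import given, strategies as st
from hypothesis.extra.pandas import columns, data_frames, range_indexes


@given(
    signals=data_frames(
        columns=columns(
            ["sec_1", "sec_2", "sec_3"],
            elements=st.integers(min_value=-1, max_value=1),
        ),
        # Ask for an index with at least ten entries, so every generated
        # frame has at least ten rows and assume() is no longer needed.
        index=range_indexes(min_size=10),
    )
)
def test_length(signals: pd.DataFrame) -> None:
    assert len(signals) >= 10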

Related

Checking multiple conditions with a numpy array

I have a DataFrame with several rows and columns and I have transformed it into a numpy array to speed up the calculations.
The first five columns of the Dataframe looked like this:
par1 par2 par3 par4 par5
1.502366 2.425301 0.990374 1.404174 1.929536
1.330468 1.460574 0.917349 1.172675 0.766603
1.212440 1.457865 0.947623 1.235930 0.890041
1.222362 1.348485 0.963692 1.241781 0.892205
...
These columns are now stored in a numpy array a = df.values
I need to check whether at least two of the five columns satisfy a condition (i.e., their value is larger than a certain threshold). Initially I wrote a function that performed the operation directly on the dataframe. However, because I have a very large amount of data and need to repeat the calculations over and over, I switched to numpy to take advantage of the vectorization.
To check the condition I was thinking to use
df['Result'] = np.where(condition_on_parameters > 2, True, False)
However, I cannot figure out how to write condition_on_parameters such that it returns True or False when at least 2 out of the 5 parameters are larger than the threshold. I thought of using the sum() function on condition_on_parameters, but I am not sure how to write such a condition.
EDIT
It is important to specify that the thresholds are different for each parameter. For example thr1=1.2, thr2=2.0, thr3=1.5, thr4=2.2, thr5=3.0. So I need to check that par1 > thr1, par2 > thr2, ..., par5 > thr5.
Assuming condition_on_parameters returns a boolean array the same size as a, you can use np.sum(condition_on_parameters, axis=1) to sum over the True values (True has a numerical value of 1) of each row. This provides a 1D array whose entries are the number of columns in each row that meet the condition. Comparing that count against 2 then gives the boolean column you are looking for:
df['result'] = np.sum(condition_on_parameters, axis=1) >= 2
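A minimal sketch of what condition_on_parameters could look like with the per-column thresholds from the edit (the sample data and threshold values are the ones given in the question; the variable names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
     [1.330468, 1.460574, 0.917349, 1.172675, 0.766603],
     [1.212440, 1.457865, 0.947623, 1.235930, 0.890041],
     [1.222362, 1.348485, 0.963692, 1.241781, 0.892205]],
    columns=[f"par{i}" for i in range(1, 6)],
)

a = df.values
thresholds = np.array([1.2, 2.0, 1.5, 2.2, 3.0])  # thr1 .. thr5 from the edit

# Broadcasting compares each column against its own threshold,
# giving a boolean array the same shape as a.
condition_on_parameters = a > thresholds

# True where at least two of the five parameters exceed their threshold.
df["Result"] = np.sum(condition_on_parameters, axis=1) >= 2
print(df)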
Can you exploit pandas functionalities? For example, you can efficiently check conditions on multiple rows/columns with .apply and then .sum(axis=1).
Here is some sample code:
import pandas as pd

df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98, 1.3, 0.56]],
                  columns=['par1', 'par2', 'par3'])

# custom condition, e.g. value less than or equal to a threshold
def leq(x, t):
    return x <= t

condition = df.apply(lambda x: leq(x, 1)).sum(axis=1)

# filter
df.loc[condition >= 2]
I think this should be equivalent to numpy in terms of efficiency, as pandas is ultimately built on top of it; however, I'm not entirely sure...
It seems you are looking for numpy.any
import numpy as np
import pandas as pd

a = np.array(
    [[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
     [1.330468, 1.460574, 0.917349, 1.172675, 0.766603],
     [1.212440, 1.457865, 0.947623, 1.235930, 0.890041],
     [1.222362, 1.348485, 0.963692, 1.241781, 0.892205]])

df = pd.DataFrame(a, columns=[f'par{i}' for i in range(1, 6)])
df['Result'] = np.any(df > 1.46, axis=1)  # append the result column
This appends a boolean Result column, which is True for the first two rows and False for the last two.

In Python, how do I ensure that the seed used by randint keeps changing when I am trying to pick a random number?

from random import randint

def claims(dataframe):
    dataframe.loc[(dataframe.severity == 1), 'claims_made'] = randint(200, 20000)
    return dataframe
Here 'severity' is an existing column and 'claims_made' is a new column. I want randint to keep picking different values to assign to the 'claims_made' column, because for now it just picks one random value out of the specified range and assigns that same value to all the rows that satisfy the condition.
Your code calls randint once and applies that one value to the column you create. It's the same as if you had done
val = randint(20, 20000)
dataframe.loc[(dataframe.severity == 1), 'claims_made'] = val
Instead, you could get the index of the rows you want to assign to. Use it to create a Series of random integers, and when you assign that back to the dataframe, non-indexed rows become NaN.
import pandas as pd
import numpy as np

def claims(dataframe):
    wanted_index = dataframe[dataframe.severity == 1].index
    dataframe["claims_made"] = pd.Series(
        np.random.randint(20, 20000, size=len(wanted_index)),
        index=wanted_index)
    return dataframe

df = pd.DataFrame({"severity": [1, 1, 0, 8, -1, 99, 1]})
print(claims(df))
If you want to stick with your existing approach, you could do something like this:
from random import randint

def claims2(df):
    n_rows = len(df.loc[df.severity == 1])
    vals = [randint(200, 20000) for _ in range(n_rows)]
    df.loc[(df.severity == 1), 'claims_made'] = vals
    return df
p.s. I'd recommend accessing columns via df['severity'] instead of df.severity -- you can get into trouble using the . syntax if you have a dataset with spaces etc. in the column names.
I'll give you a broad hint; coding is up to you.
Form a series (a temporary column object) of random numbers in the desired range. Assign that series to your data frame column. You can find examples of this technique in any tutorial on data frames.
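A minimal sketch of that hint, assuming numpy's default_rng for the random values (the column name, data, and range are taken from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({"severity": [1, 1, 0, 8, -1, 99, 1]})

# One random integer per row; keep them only where the condition holds,
# so the other rows end up as NaN in the new column.
rng = np.random.default_rng()
random_vals = pd.Series(rng.integers(200, 20000, size=len(df)), index=df.index)
mask = df["severity"] == 1
df.loc[mask, "claims_made"] = random_vals[mask]
print(df)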

Non-zero mean when calculating a Z-score in Python / Pandas

I am attempting to calculate z-scores at once for a series of columns, but inspecting the data reveals that the mean values of the columns are NOT 0, as you would expect for z-scores.
As you can see by running the code below, columns a and d do not have 0 means in the newly created *_zscore columns.
import pandas as pd

df = pd.DataFrame({'a': [500, 4000, 20], 'b': [10, 20, 30],
                   'c': [30, 40, 50], 'd': [50, 400, 20]})
cols = list(df.columns)
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
print(df.describe())
My actual data is obviously different, but the results are similar (i.e.: non-zero means). I have also used
from scipy import stats
stats.zscore(df)
which leads to a similar result. Doing the same transformation in R (i.e.: scaled.df <- scale(df)) works though.
Does anyone have an idea what is going on here? The columns with errors contain larger values, but it should still be possible to z-transform them.
EDIT: as Rob pointed out, the results are essentially 0.
Your mean values are of the order 10^-17, which for all practical purposes is equal to zero. The reason why you do not get exactly zero has to do with the way floating point numbers are represented (finite precision).
I'm surprised that you don't see it in R, but that may have to do with the example you use and the fact that scale is implemented a bit differently in R (ddof=1 e.g.). But in R, you see the same thing happening:
> mean(scale(c(5000,40000,2000)))
[1] 7.401487e-17
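A quick way to confirm the same point in Python, sketched from the question's own example, showing that the column means are zero up to floating point precision:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [500, 4000, 20], 'b': [10, 20, 30],
                   'c': [30, 40, 50], 'd': [50, 400, 20]})

# Same z-score transformation as in the question, applied column-wise.
zscores = (df - df.mean()) / df.std(ddof=0)

print(zscores.mean())                    # values at most ~1e-17, effectively zero
print(np.allclose(zscores.mean(), 0.0))  # True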

Pandas Panel fancy indexing: How to return (index of) all DataFrames in Panel based on Boolean of multiple columns in each df

I've got a Pandas Panel with many DataFrames with the same rows/column labels. I want to make a new panel with DataFrames that fulfill certain criteria based on a couple columns.
This is easy with dataframes and rows: Say I have a df, zHe_compare. I can get the suitable rows with:
zHe_compare[(zHe_compare['zHe_calc'] > 100) & (zHe_compare['zHe_med'] > 100) | ((zHe_obs_lo_2s <=zHe_compare['zHe_calc']) & (zHe_compare['zHe_calc'] <= zHe_obs_hi_2s))]
but how do I do (pseudocode, simplified boolean):
good_results_panel = results_panel[ all_dataframes[ sum ('zHe_calc' < 'zHe_obs') > min_num ] ]
I know how to do the inner boolean part, but how do I specify this for each dataframe in a panel? Because I need multiple columns from each df, I haven't had success using the panel.minor_xs slicing techniques.
thanks!
As mentioned in its documentation, Panel is currently a bit under-developed, so the sweet syntax you've come to rely on when working with DataFrame isn't there yet.
Meanwhile, I would suggest using the Panel.select method:
def is_good_result(item_label):
    # whatever condition over the selected item
    df = results_panel[item_label]
    return df['col1'].sum() > 5

good_results = results_panel.select(is_good_result)
The is_good_result function returns a boolean value. Note that its argument is not a DataFrame instance: Panel.select passes each item label to the criterion function, rather than the DataFrame content of that item.
Of course, you can stuff that whole criterion function into a lambda in one statement, if you're into the whole brevity thing:
good_results = results_panel.select(
    lambda item_label: results_panel[item_label]['col1'].sum() > 5
)

Selecting dataframe rows based on multiple columns, where new functions should be created to handle conditions in some columns

I have a dataframe that consists of multiple columns. I want to select rows based on conditions in multiple columns. Assuming that I have four columns in a dataframe:
import pandas as pd

di = {"A": [1, 2, 3, 4, 5],
      "B": ['Tokyo', 'Madrid', 'Professor', 'helsinki', 'Tokyo Oliveira'],
      "C": ['250', '200//250', '250//250//200', '12', '200//300'],
      "D": ['Left', 'Right', 'Left', 'Right', 'Right']}
data = pd.DataFrame(di)
I want to select Tokyo in column B, values larger than 200 in column C, and Left in column D; with that, only the first row will be selected. I have to create a function to handle column C, since I need to check only the first value when a row contains a //-separated list.
To handle this, I assume this can be done through the following:
def check_200(thecolumn):
    thelist = []
    for i in thecolumn:
        f = i
        if "//" in f:
            # split based on //
            z = f.split("//")
            f = z[0]
        f = float(f)
        if f > 200.00:
            thelist.append(True)
        else:
            thelist.append(False)
    return thelist
Then, I will create the multiple conditions:
selecteddata=data[(data.B.str.contains("Tokyo")) &
(data.D.str.contains("Left"))&(check_200(data.C))]
Is this the best way to do that, or is there an easier pandas function that can handle such requirements?
I don't think there is a most pythonic way to do this, but I think this is what you want:
bool_idx = ((data.B.str.contains("Tokyo")) &
            (data.D.str.contains("Left")) &
            (data.C.str.split("//").str[0].astype(float) > 200.00))

selecteddata = data[bool_idx]
Bruno's answer does the job, and I agree that boolean masking is the way to go. This answer keeps the code a little closer to the requested format.
import numpy as np

def col_condition(col):
    col = col.apply(lambda x: float(x.split('//')[0]) > 200)
    return col

data = data[(data.B.str.contains('Tokyo')) & (data.D.str.contains("Left")) &
            col_condition(data.C)]
The function reads in a Series, and converts each element to True or False, depending on the condition. It then returns this mask.
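For completeness, here is a fully vectorized sketch of the same filtering without a helper function, using the pandas string accessor (the column names, data, and the 200 threshold are taken from the question):

import pandas as pd

di = {"A": [1, 2, 3, 4, 5],
      "B": ['Tokyo', 'Madrid', 'Professor', 'helsinki', 'Tokyo Oliveira'],
      "C": ['250', '200//250', '250//250//200', '12', '200//300'],
      "D": ['Left', 'Right', 'Left', 'Right', 'Right']}
data = pd.DataFrame(di)

# str.split('//').str[0] takes the first value whether or not a cell
# contains a '//'-separated list, matching the check_200 logic above.
mask = (data['B'].str.contains('Tokyo')
        & data['D'].str.contains('Left')
        & (data['C'].str.split('//').str[0].astype(float) > 200))

selecteddata = data[mask]
print(selecteddata)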
