Comparing floats in a pandas column - python

I have the following dataframe:
   actual_credit  min_required_credit
0            0.3                  0.4
1            0.5                  0.2
2            0.4                  0.4
3            0.2                  0.3
I need to add a column indicating where actual_credit >= min_required_credit. The result would be:
   actual_credit  min_required_credit  result
0            0.3                  0.4   False
1            0.5                  0.2    True
2            0.4                  0.4    True
3            0.2                  0.3   False
I am doing the following:
df['result'] = abs(df['actual_credit']) >= abs(df['min_required_credit'])
However, the third row (0.4 and 0.4) consistently comes out False. After researching this issue in various places, including "What is the best way to compare floats for almost-equality in Python?", I still can't get this to work. Whenever the two columns hold an identical value, the result is False, which is not correct.
I am using Python 3.3.

Due to imprecise float comparison, you can OR your comparison with np.isclose. isclose takes relative and absolute tolerance parameters, so the following should work:
df['result'] = df['actual_credit'].ge(df['min_required_credit']) | np.isclose(df['actual_credit'], df['min_required_credit'])
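A minimal, self-contained sketch of that approach on the question's data (values assumed from the tables above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'actual_credit':       [0.3, 0.5, 0.4, 0.2],
                   'min_required_credit': [0.4, 0.2, 0.4, 0.3]})

# True where actual >= required, OR where the two values are equal within tolerance
df['result'] = df['actual_credit'].ge(df['min_required_credit']) \
               | np.isclose(df['actual_credit'], df['min_required_credit'])
print(df)  # row 2 (0.4 vs 0.4) is True even if the stored floats differ by a tiny representation error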

@EdChum's answer works great, but using the pandas.DataFrame.round function is another clean option that works well without numpy.
import pandas as pd

df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df['result'] = df['actual_credit'].round(1) >= df['min_required_credit'].round(1)
print(df)
   actual_credit  min_required_credit  result
0            0.3                0.400   False
1            0.5                0.200    True
2            0.4                0.401    True
3            0.2                0.300   False
You might also use round() to permanently edit your dataframe, depending on whether you need that extra precision. In this example, the OP suggests the extra digits are probably just noise that is causing confusion.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df = df.round(1)
df['result'] = df['actual_credit'] >= df['min_required_credit']
print(df)
   actual_credit  min_required_credit  result
0            0.3                  0.4   False
1            0.5                  0.2    True
2            0.4                  0.4    True
3            0.2                  0.3   False

In general, NumPy comparison functions work well with pd.Series and allow for element-wise comparisons:
isclose, allclose, greater, greater_equal, less, less_equal, etc.
In your case greater_equal would do:
df['result'] = np.greater_equal(df['actual_credit'], df['min_required_credit'])
or alternatively, as proposed, using pandas' .ge() (or .le(), .gt(), etc.):
df['result'] = df['actual_credit'].ge(df['min_required_credit'])
The risk of OR-ing ge with isclose (as in the accepted answer) is that, e.g., comparing 3.999999999999 and 4.0 might return True, which might not necessarily be what you want.
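For example, a quick check of the default tolerances (rtol=1e-05, atol=1e-08):

import numpy as np

# A difference of 1e-12 near 4.0 is well inside the default tolerance:
print(np.isclose(3.999999999999, 4.0))  # True
print(np.isclose(3.99, 4.0))            # False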

Use the pandas .abs() method instead of the built-in abs():
df['result'] = df['actual_credit'].abs() >= df['min_required_credit'].abs()


How to check each column for a value greater than a threshold and accept if 3 out of 5 columns have that value

I am trying to make an article similarity checker by comparing 6 articles with a list of articles that I obtained from an API. I have used cosine similarity to compare each article, one by one, with the 6 articles I am using as a baseline.
My dataframe now looks like this:
id   Article     cosinesin1  cosinesin2  cosinesin3  cosinesin4  cosinesin5  cosinesin6  Similar
id1  [Article1]  0.2         0.5         0.6         0.8         0.7         0.8         True
id2  [Article2]  0.1         0.2         0.03        0.8         0.2         0.45        False
So I want to add a Similar column to my dataframe that checks the values of each cosinesin column (1-6) and returns True if at least 3 out of 6 have a value greater than 0.5, and False otherwise.
Is there any way to do this in python?
Thanks
In Python, you can treat True and False as the integers 1 and 0, respectively.
So if you compare all the similarity metrics to 0.5, you can sum the resulting Boolean DataFrame along the columns to get the number of comparisons that came out True for each row. Comparing those counts to 3 yields the column you want:
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
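Put together as a runnable sketch on the two example rows (values assumed from the table in the question):

import pandas as pd

df = pd.DataFrame({'id': ['id1', 'id2'],
                   'Article': [['Article1'], ['Article2']],
                   'cosinesin1': [0.2, 0.1], 'cosinesin2': [0.5, 0.2],
                   'cosinesin3': [0.6, 0.03], 'cosinesin4': [0.8, 0.8],
                   'cosinesin5': [0.7, 0.2], 'cosinesin6': [0.8, 0.45]})

cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
# Count per row how many of the six scores exceed 0.5; at least 3 -> True
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
print(df[['id', 'Similar']])  # id1 -> True, id2 -> False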

How to compare and merge two lists according to their sequence in Python?

Let's say I have two lists that I intend to merge:
a = [0.1, 0.2, '-', '-', 0.3]
b = ['-', '-', 0.4, 0.5, '-']
How do I merge a and b by position to get this?
c = [0.1, 0.2, 0.4, 0.5, 0.3]
Thanks
Assuming that, at any given position, either a or b holds a float.
One possibility:
c = [x if isinstance(x, float) else y for x, y in zip(a, b)]
output: [0.1, 0.2, 0.4, 0.5, 0.3]
Using pandas:
pd.Series(a).replace('-', pd.NA).fillna(pd.Series(b))
output:
0 0.1
1 0.2
2 0.4
3 0.5
4 0.3
dtype: float64
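A self-contained version of the pandas approach; the explicit cast at the end is a small assumption on my part, since on some pandas versions fillna on an object series will not infer a float dtype on its own:

import pandas as pd

a = [0.1, 0.2, '-', '-', 0.3]
b = ['-', '-', 0.4, 0.5, '-']

# Replace the '-' placeholders, fill the gaps from b by position, then force floats
c = pd.Series(a).replace('-', pd.NA).fillna(pd.Series(b)).astype(float)
print(c.tolist())  # [0.1, 0.2, 0.4, 0.5, 0.3]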

Replace the same value from the second row onward with 0

Is this possible to do in DataFrame Pandas?
I want to keep only the first row's value in each group and replace the values from the second row onward with 0.
Input
Name  Date      Amount  Labor
A     1/1/1972  5       0.3
A     1/1/1972  5       0.1
A     1/1/1972  5       0.7
A     1/1/1972  1       0.3
B     7/2/1980  1       0.6
B     7/2/1980  1       0.3
B     7/2/1980  1       0.7
C     6/9/1965  4       0.2
C     6/9/1965  4       0.3
C     6/9/1965  4       0.4
Output
Name  Date      Amount  Labor
A     1/1/1972  5       0.3
A     1/1/1972  0       0.1
A     1/1/1972  0       0.7
A     1/1/1972  0       0.3
B     7/2/1980  1       0.6
B     7/2/1980  0       0.3
B     7/2/1980  0       0.7
C     6/9/1965  4       0.2
C     6/9/1965  0       0.3
C     6/9/1965  0       0.4
As simple as multiplying by a boolean mask.
df['Amount'] *= df['Amount'].ne(df['Amount'].shift())
Yes, that is possible. You can use .duplicated() to construct a series that marks all duplicates with True, and then assign values through that mask:
df.loc[df['Amount'].duplicated(), 'Amount'] = 0
Or, if you only want to zero out values that are duplicated in a consecutive run, you can work with .diff().eq(0):
df.loc[df['Amount'].diff().eq(0), 'Amount'] = 0
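Note that all three masks above look at Amount alone, so they differ from the sample output wherever a group's Amount changes mid-group (A's fourth row) or repeats across a group boundary (B's first row). A sketch that keeps exactly the first row of each Name/Date group, which is what the sample output shows, could mask on the grouping columns instead (column names taken from the question):

import pandas as pd

df = pd.DataFrame({'Name':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Date':   ['1/1/1972'] * 4 + ['7/2/1980'] * 3 + ['6/9/1965'] * 3,
                   'Amount': [5, 5, 5, 1, 1, 1, 1, 4, 4, 4]})

# Zero Amount on every row except the first of each Name/Date group
df.loc[df.duplicated(['Name', 'Date']), 'Amount'] = 0
print(df['Amount'].tolist())  # [5, 0, 0, 0, 1, 0, 0, 4, 0, 0]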

Scoring for each row based on matrix in python

I have a matrix as follows
     0    1    2    3  ...
A  0.1  0.2  0.3  0.1
C  0.5  0.4  0.2  0.1
G  0.6  0.4  0.8  0.3
T  0.1  0.1  0.4  0.2
The data is in a dataframe as shown
Genes  string
Gene1  ATGC
Gene2  GCTA
Gene3  ATCG
I need to write code that finds the score of each sequence. The score for the sequence ATGC is 0.1 + 0.1 + 0.8 + 0.1 = 1.1: A is in the first position and the matrix value for row A at that position is 0.1, T contributes the value for row T at the second position, and so on along the full length of the sequence (450 letters in the real data).
The output should be as follows:
Genes  Score
Gene1    1.1
Gene2    1.5
Gene3    0.7
I tried using Biopython but could not get it right. Can anyone please help?
Let df and genes be your DataFrames. First, let's convert df into a "tall" form:
tall = df.stack().reset_index()
tall.columns = 'letter', 'pos', 'score'
tall.pos = tall.pos.astype(int) # Need a number here, not a string!
Create a new tuple-based index for the tall DF:
tall.set_index(tall[['pos', 'letter']].apply(tuple, axis=1), inplace=True)
This function will extract the scores indexed by the tuples in the form (position,"letter") from the tall DF and sum them up:
def gene2score(gene):
    return tall.loc[list(enumerate(gene))]['score'].sum()

genes['string'].apply(gene2score)
#Genes
#Gene1 1.1
#Gene2 1.5
#Gene3 0.7
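For comparison, a sketch that skips the reshaping entirely, assuming the scoring matrix is a DataFrame indexed by letter with integer positions as columns (both reconstructed here from the question):

import pandas as pd

matrix = pd.DataFrame([[0.1, 0.2, 0.3, 0.1],
                       [0.5, 0.4, 0.2, 0.1],
                       [0.6, 0.4, 0.8, 0.3],
                       [0.1, 0.1, 0.4, 0.2]],
                      index=list('ACGT'))

genes = pd.DataFrame({'Genes': ['Gene1', 'Gene2', 'Gene3'],
                      'string': ['ATGC', 'GCTA', 'ATCG']})

# Sum the matrix value for each letter at its position along the sequence
def score(seq):
    return sum(matrix.at[letter, pos] for pos, letter in enumerate(seq))

genes['Score'] = genes['string'].apply(score)
print(genes)  # Gene1 1.1, Gene2 1.5, Gene3 0.7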

Floating point comparison not working on pandas groupby output

I am facing issues with pandas row filtering. I am trying to filter out teams whose weights do not sum to one.
dfteam
Team  Weight
A     0.2
A     0.5
A     0.2
A     0.1
B     0.5
B     0.25
B     0.25
dfteamtemp = dfteam.groupby(['Team'], as_index=False)['Weight'].sum()
dfweight = dfteamtemp[(dfteamtemp['Weight'].astype(float)!=1.0)]
dfweight
  Team  Weight
0    A     1.0
I am not sure about the reason for this output. I should get an empty dataframe, but it is giving me Team A even though the sum is 1.
You are a victim of floating point inaccuracies. The first group's weights do not add up to exactly 1.0:
df.groupby('Team').Weight.sum().iat[0]
0.99999999999999989
You can resolve this by using np.isclose instead -
np.isclose(df.groupby('Team').Weight.sum(), 1.0)
array([ True, True], dtype=bool)
And filter on this array. Or, as @ayhan suggested, use groupby + filter:
df.groupby('Team').filter(lambda x: not np.isclose(x['Weight'].sum(), 1))
Empty DataFrame
Columns: [Team, Weight]
Index: []
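For completeness, a runnable reproduction of the filter approach (weights assumed from the question):

import numpy as np
import pandas as pd

dfteam = pd.DataFrame({'Team':   ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                       'Weight': [0.2, 0.5, 0.2, 0.1, 0.5, 0.25, 0.25]})

# Keep only teams whose weights do NOT sum to 1 within floating-point tolerance
bad = dfteam.groupby('Team').filter(lambda x: not np.isclose(x['Weight'].sum(), 1))
print(bad)  # empty DataFrame: both teams sum to 1 within tolerance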
