Floating point comparison not working on pandas groupby output - python

I am facing issues with pandas row filtering. I am trying to filter out the teams whose weights do not sum to one.
dfteam
Team Weight
A 0.2
A 0.5
A 0.2
A 0.1
B 0.5
B 0.25
B 0.25
dfteamtemp = dfteam.groupby(['Team'], as_index=False)['Weight'].sum()
dfweight = dfteamtemp[(dfteamtemp['Weight'].astype(float)!=1.0)]
dfweight
Team Weight
0 A 1.0
I am not sure about the reason for this output. I should get an empty dataframe, but it is giving me Team A even though the sum is 1.

You are a victim of floating point inaccuracies. The weights of the first group do not add up to exactly 1.0 -
df.groupby('Team').Weight.sum().iat[0]
0.99999999999999989
You can resolve this by using np.isclose instead -
np.isclose(df.groupby('Team').Weight.sum(), 1.0)
array([ True, True], dtype=bool)
And filter on this array. Or, as @ayhan suggested, use groupby + filter -
df.groupby('Team').filter(lambda x: not np.isclose(x['Weight'].sum(), 1))
Empty DataFrame
Columns: [Team, Weight]
Index: []
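For completeness, here is a minimal sketch of the first suggestion (filtering on the np.isclose mask), assuming the aggregated frame is called dfteamtemp as in the question:
import numpy as np
# keep only the teams whose summed weight is NOT (approximately) 1.0
mask = ~np.isclose(dfteamtemp['Weight'], 1.0)
dfweight = dfteamtemp[mask]
# dfweight is now empty, as expected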

Related

How to check each column for value greater than and accept if 3 out of 5 columns has that value

I am trying to make an article similarity checker by comparing 6 articles with a list of articles that I obtained from an API. I have used cosine similarity to compare each article one by one with the 6 articles that I am using as a baseline.
My dataframe now looks like this:
id   Article     cosinesin1  cosinesin2  cosinesin3  cosinesin4  cosinesin5  cosinesin6  Similar
id1  [Article1]  0.2         0.5         0.6         0.8         0.7         0.8         True
id2  [Article2]  0.1         0.2         0.03        0.8         0.2         0.45        False
So I want to add a Similar column to my dataframe that checks the values of each cosinesin column (1-6) and returns True if at least 3 out of 6 have a value greater than 0.5, otherwise False.
Is there any way to do this in Python?
Thanks
In Python, you can treat True and False as integers, 1 and 0, respectively.
So if you compare all the similarity metrics to 0.5, you can sum over the resulting Boolean DataFrame along the columns, to get the number of comparisons that resulted in True for each row. Comparing those numbers to 3 yields the column you want:
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
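As a quick sanity check, here is a sketch of the same idea applied to the two rows from the question (the column names and values are assumed from the table above):
import pandas as pd
df = pd.DataFrame({
    'id': ['id1', 'id2'],
    'cosinesin1': [0.2, 0.1], 'cosinesin2': [0.5, 0.2],
    'cosinesin3': [0.6, 0.03], 'cosinesin4': [0.8, 0.8],
    'cosinesin5': [0.7, 0.2], 'cosinesin6': [0.8, 0.45],
})
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
# count how many similarities per row exceed 0.5, then require at least 3
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
# id1 -> True (four values above 0.5), id2 -> False (only one)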

Pandas Data Frame - Sum all the values in a previous column which match a specific condition and add it to a new column

I'm probably missing something, but I was not able to find a solution for this.
Is there a way in Python to fill a new column with the sum of only those values that satisfy a certain condition?
In Excel I would apply the following formula in the new column and paste it below
=SUMIF(A1:C1, ">0")
val1   val2   val3   output
 0.5    0.7   -0.9      1.2
 0.3   -0.7             0.3
-0.5   -0.7   -0.9      0
Also in my extracts, there are a few blank values. Can you please help me understand what code should be written for this?
df['total'] = df[['A','B']].sum(axis=1).where(df['A'] > 0, 0)
I came across the above code, but it checks only a single condition. What I need is the sum of all the values in those columns that match the given condition.
Thanks!
pandas can handle that out of the box, like this:
import pandas as pd
df = pd.DataFrame([[0.5,.7,-.9],[0.3,-.7,None],[-0.5,-.7,-.9]], columns=['val1','val2','val3'])
df['output'] = df[df>0].sum(axis=1)
Another way, somewhat similar to SUMIF:
# this is the "IF"
is_positive = df.loc[:, "val1": "val3"] > 0
# this is selecting the parts where condition holds & sums
df["output"] = df.loc[:, "val1": "val3"][is_positive].sum(axis=1)
where axis=1 in the last line sums along the rows, giving
>>> df
val1 val2 val3 output
0 0.5 0.7 -0.9 1.2
1 0.3 -0.7 NaN 0.3
2 -0.5 -0.7 -0.9 0.0
Use DataFrame.clip before sum:
df['total'] = df[['val1','val2','val3']].clip(lower=0).sum(axis=1)
#solution by Nk03 from comments
cols = ['val1','val2','val3']
df['total'] = df[cols].mask(df[cols]<0).sum(axis=1)
EDIT: To build the mask from different columns than the ones being summed, convert the mask columns to a numpy array:
df['total'] = df.loc[:, "D":"F"].mask(df.loc[:, "A":"C"].to_numpy() == 'Y', 0).sum(axis=1)
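For reference, a rough sketch of how the clip and mask variants behave on the sample frame from the first answer (assuming the same df construction); both produce the totals 1.2, 0.3 and 0.0:
import pandas as pd
df = pd.DataFrame([[0.5, .7, -.9], [0.3, -.7, None], [-0.5, -.7, -.9]],
                  columns=['val1', 'val2', 'val3'])
cols = ['val1', 'val2', 'val3']
# clip replaces negative values with 0 before the row-wise sum
df['total_clip'] = df[cols].clip(lower=0).sum(axis=1)
# mask replaces negative values with NaN, which sum then skips
df['total_mask'] = df[cols].mask(df[cols] < 0).sum(axis=1)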
You can do it in the following way:
df["total"] = df.apply(lambda x: sum(x), axis=1).where((df['A'] > 0) & (df['B'] > 0) & (another_condition) & (another_condition), 0)
Note that the code takes the sum across all columns at once.
For taking sum of specific columns you can do the following:
df['total'] = df[['A','B','C','D','E']].sum(axis=1).where((df['A'] > 0) & (df['B'] > 0) & (another_condition) & (another_condition), 0)

Pandas groupby and change/reassign one element

I want to group a given dataframe and then, for a given column p, overwrite the value of each group's last element with 1 - sum(p[:-1]) (where the sum is taken over all of the group's elements apart from the last one).
Note that after performing the operation, the sum of all values in p for each group is equal to 1.
For example, for this input dataframe (grouping by c1 and c2):
c1 c2 p
0 x a 0.4
1 y a 0.2
2 x a 0.3
3 y b 0.6
the expected output would be:
c1 c2 p
0 x a 0.4
1 y a 1.0
2 x a 0.6
3 y b 1.0
I managed to perform the operation using a for loop:
for _, g in df.groupby(['c1', 'c2']):
    df.loc[g.tail(1).index, 'p'] = 1 - g['p'][:-1].sum()
but I am looking for more elegant way of doing this, without explicitly looping through each group.
I tried this:
>>> df.loc[df.groupby(['c1', 'c2']).tail(1).index, 'p']
1 0.2
2 0.3
3 0.6
>>> 1 - df.groupby(['c1', 'c2']).apply(lambda x: x.iloc[:-1].sum())['p']
c1 c2
x a 0.6
y a 1.0
b 1.0
But I don't really know how to assemble those outputs given that their indices differ.
Here's a possible one-line solution:
df.groupby(['c1', 'c2']).apply(
    lambda x: x.assign(p=x['p'][:-1].tolist() + [1 - x.iloc[:-1].sum()['p']])
).reset_index(level=[0, 1], drop=True)
To make the above code more readable, here is a nearly equivalent version of my one-line solution:
def func(group):
    # each call receives one (c1, c2) group of rows
    group = group.copy()
    result = 1 - group.iloc[:-1].sum()['p']
    # overwrite the group's last p value (avoids chained-indexing pitfalls)
    group.iloc[-1, group.columns.get_loc('p')] = result
    return group
df.groupby(['c1', 'c2']).apply(func)
With those solutions in mind, I am not entirely sure why you don't want to use the .groupby call in an explicit python for-loop. My hunch is that an explicit python for-loop would be more than adequate, but I don't know your specific use case/data. I would highly recommend doing some speed comparisons using %timeit with your specific data, as I think that will help shed light on which approach you ultimately end up using.
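If it helps, here is a minimal sketch of such a %timeit comparison, using the four-row example frame from the question (the function names are just for illustration):
import pandas as pd
df = pd.DataFrame({'c1': ['x', 'y', 'x', 'y'],
                   'c2': ['a', 'a', 'a', 'b'],
                   'p':  [0.4, 0.2, 0.3, 0.6]})
def with_loop(frame):
    # explicit loop over groups, as in the question
    out = frame.copy()
    for _, g in out.groupby(['c1', 'c2']):
        out.loc[g.tail(1).index, 'p'] = 1 - g['p'][:-1].sum()
    return out
def with_apply(frame):
    # the one-line groupby/apply solution from above
    return frame.groupby(['c1', 'c2']).apply(
        lambda x: x.assign(p=x['p'][:-1].tolist() + [1 - x.iloc[:-1].sum()['p']])
    ).reset_index(level=[0, 1], drop=True)
# In IPython/Jupyter:
# %timeit with_loop(df)
# %timeit with_apply(df)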

Comparing floats in a pandas column

I have the following dataframe:
actual_credit min_required_credit
0 0.3 0.4
1 0.5 0.2
2 0.4 0.4
3 0.2 0.3
I need to add a column indicating where actual_credit >= min_required_credit. The result would be:
actual_credit min_required_credit result
0 0.3 0.4 False
1 0.5 0.2 True
2 0.4 0.4 True
3 0.2 0.3 False
I am doing the following:
df['result'] = abs(df['actual_credit']) >= abs(df['min_required_credit'])
However the 3rd row (0.4 and 0.4) is constantly resulting in False. After researching this issue at various places including: What is the best way to compare floats for almost-equality in Python? I still can't get this to work. Whenever the two columns have an identical value, the result is False which is not correct.
I am using python 3.3
Due to imprecise float comparison you can OR your comparison with np.isclose; isclose takes relative and absolute tolerance parameters, so the following should work:
df['result'] = df['actual_credit'].ge(df['min_required_credit']) | np.isclose(df['actual_credit'], df['min_required_credit'])
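A quick sketch of that on the question's frame (note that the literal values here compare cleanly; in the OP's real data the mismatch only shows up because the floats come from upstream computation):
import numpy as np
import pandas as pd
df = pd.DataFrame({'actual_credit':       [0.3, 0.5, 0.4, 0.2],
                   'min_required_credit': [0.4, 0.2, 0.4, 0.3]})
# ">=" alone can fail on equal-looking floats, so OR it with an isclose check
df['result'] = (df['actual_credit'].ge(df['min_required_credit'])
                | np.isclose(df['actual_credit'], df['min_required_credit']))
# result: False, True, True, False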
@EdChum's answer works great, but using the pandas.DataFrame.round function is another clean option that works well without the use of numpy.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df['result'] = df['actual_credit'].round(1) >= df['min_required_credit'].round(1)
print(df)
actual_credit min_required_credit result
0 0.3 0.400 False
1 0.5 0.200 True
2 0.4 0.401 True
3 0.2 0.300 False
You might consider using round() to edit your dataframe more permanently, depending on whether or not you need that extra precision. In this example, the OP suggests the extra precision is probably just noise and is only causing confusion.
df = pd.DataFrame(  # adding a small difference at the thousandths place to reproduce the issue
    data=[[0.3, 0.4], [0.5, 0.2], [0.400, 0.401], [0.2, 0.3]],
    columns=['actual_credit', 'min_required_credit'])
df = df.round(1)
df['result'] = df['actual_credit'] >= df['min_required_credit']
print(df)
actual_credit min_required_credit result
0 0.3 0.4 False
1 0.5 0.2 True
2 0.4 0.4 True
3 0.2 0.3 False
In general, numpy comparison functions work well with pd.Series and allow for element-wise comparisons:
isclose, allclose, greater, greater_equal, less, less_equal etc.
In your case greater_equal would do:
df['result'] = np.greater_equal(df['actual_credit'], df['min_required_credit'])
or alternatively, as proposed, using pandas.ge(alternatively le, gt etc.):
df['result'] = df['actual_credit'].ge(df['min_required_credit'])
The risk of ORing with ge (as mentioned above) is that, for example, comparing 3.999999999999 and 4.0 might return True, which is not necessarily what you want.
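If that risk matters, the tolerances can be tightened explicitly (the rtol/atol values below are only illustrative):
import numpy as np
# default tolerances treat 3.999999999999 and 4.0 as equal
np.isclose(3.999999999999, 4.0)                         # True
# a much tighter absolute tolerance keeps them distinct
np.isclose(3.999999999999, 4.0, rtol=0.0, atol=1e-15)   # False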
Use pandas.DataFrame.abs() instead of the built-in abs():
df['result'] = df['actual_credit'].abs() >= df['min_required_credit'].abs()

Pandas Dataframe Comparison and Floating Point Precision

I'm looking to compare two dataframes which should be identical. However, due to floating point precision, I am being told the values don't match. I have created an example to simulate it below. How can I get the correct result so that the final comparison dataframe returns True for both cells?
a = pd.DataFrame({'A':[100,97.35000000001]})
b = pd.DataFrame({'A':[100,97.34999999999]})
print a
A
0 100.00
1 97.35
print b
A
0 100.00
1 97.35
print (a == b)
A
0 True
1 False
OK you can use np.isclose for this:
In [250]:
np.isclose(a,b)
Out[250]:
array([[ True],
[ True]], dtype=bool)
np.isclose takes a relative tolerance and an absolute tolerance. These have default values of rtol=1e-05 and atol=1e-08, respectively.
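If a labelled result like the original a == b output is wanted, the boolean array can be wrapped back into a DataFrame (a sketch reusing the a and b frames from the question):
import numpy as np
import pandas as pd
a = pd.DataFrame({'A': [100, 97.35000000001]})
b = pd.DataFrame({'A': [100, 97.34999999999]})
result = pd.DataFrame(np.isclose(a, b), index=a.index, columns=a.columns)
#       A
# 0  True
# 1  True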
