Pandas DataFrame Comparison and Floating Point Precision - python

I'm looking to compare two dataframes which should be identical. However, due to floating-point precision, I am being told the values don't match. I have created an example to simulate it below. How can I get the correct result, so that the final comparison dataframe returns True for both cells?
import pandas as pd

a = pd.DataFrame({'A': [100, 97.35000000001]})
b = pd.DataFrame({'A': [100, 97.34999999999]})
print(a)
        A
0  100.00
1   97.35
print(b)
        A
0  100.00
1   97.35
print(a == b)
       A
0   True
1  False

OK, you can use np.isclose for this:
In [250]:
import numpy as np
np.isclose(a, b)
Out[250]:
array([[ True],
       [ True]], dtype=bool)
np.isclose takes a relative tolerance and an absolute tolerance, with default values rtol=1e-05 and atol=1e-08 respectively.
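If you also want the elementwise result back as a DataFrame, or a single pass/fail flag for the whole comparison, you can wrap the np.isclose output yourself. A minimal sketch using the frames from the question:

import numpy as np
import pandas as pd

a = pd.DataFrame({'A': [100, 97.35000000001]})
b = pd.DataFrame({'A': [100, 97.34999999999]})

# re-wrap the elementwise result so it lines up with the original frames
close = pd.DataFrame(np.isclose(a, b), index=a.index, columns=a.columns)
print(close)              # True in both cells
print(close.all().all())  # single boolean: do all cells match?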

Related

Compare 2 consecutive cells in a dataframe for equality

I have the following problem: I want to detect whether 2 or more consecutive values in a dataframe column are greater than 0.5. For this I have chosen the following approach: I check each cell for whether its value is less than 0.5 and record the result in the column "condition" (see the table below).
Now my question is: how can I detect in that column whether 2 consecutive cells have the same value (rows 4-5)? Or is it possible to detect the problem directly in the data column?
If 2 consecutive cells are False, the dataframe can be discarded.
I would be very grateful for any help!
   data  condition
0  0.1   True
1  0.1   True
2  0.25  True
3  0.3   True
4  0.6   False
5  0.7   False
6  0.3   True
7  0.1   True
6  0.9   False
7  0.1   True
You can compute a boolean series of values greater than 0.5 (i.e. True when invalid), then apply a boolean AND (&) between this series and its shift. Any two consecutive True values will yield True, and you can check whether any such pair is present to decide to discard the dataset:
s = df['data'].gt(0.5)
(s & s.shift()).any()
Output: True -> the dataset is invalid
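Putting it together as a self-contained sketch (the data column is taken from the question's table, on a plain 0-9 index):

import pandas as pd

df = pd.DataFrame({'data': [0.1, 0.1, 0.25, 0.3, 0.6, 0.7, 0.3, 0.1, 0.9, 0.1]})

s = df['data'].gt(0.5)        # True where the value is invalid (> 0.5)
print((s & s.shift()).any())  # True -> rows 4-5 hold two consecutive invalid values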
You can use the .diff method and check where it equals zero:
df['eq_to_prev'] = df.data.diff().eq(0)
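On the same df as above, this flags any row whose value repeats the previous one:

df['eq_to_prev'] = df['data'].diff().eq(0)
print(df.head(3))
#    data  eq_to_prev
# 0  0.10       False   (diff is NaN for the first row)
# 1  0.10        True
# 2  0.25       False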

Filter a data frame containing NaN values, results in empty data frame as result [duplicate]

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
NumPy's isnan function throws errors with data types like strings
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stackoverflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame"
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN,
so a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.nan, '', 1.0])
s
Out[52]:
0      1
1    NaN
2
3      1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s == s
Out[54]:
0     True
1    False
2     True
3     True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0    False
1     True
2    False
3    False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, but pd.isnull(None) also returns True. So depending on whether you want to treat None as NaN, you can still use == for the comparison, or pd.isnull if you want None treated as NaN.
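A quick sketch of that corner case:

import numpy as np
import pandas as pd

print(np.nan == np.nan)   # False -> the a == a trick flags NaN as missing
print(None == None)       # True  -> the trick does NOT flag None
print(pd.isnull(np.nan))  # True
print(pd.isnull(None))    # True  -> pd.isnull treats None as missing too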
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
import numpy as np
import pandas as pd

a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
       0      1
0  False   True
1   True  False
b.notna()
# same as
# b.notnull()
       0      1
0   True  False
1  False   True
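Since the question asks about a single value, note that the same functions also accept scalars. A small sketch:

import numpy as np
import pandas as pd

print(pd.isna(np.nan))      # True
print(pd.isna(None))        # True
print(pd.isna('a string'))  # False -- no TypeError, unlike np.isnan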

How to count longest uninterrupted sequence in pandas

Let's say I have a pd.Series like the one below:
s = pd.Series([False, True, False, True, True, True, False, False])
0    False
1     True
2    False
3     True
4     True
5     True
6    False
7    False
dtype: bool
I want to know how long the longest True sequence is; in this example, it is 3.
I tried it in a stupid way.
s_list = s.tolist()
count = 0
max_count = 0
for item in s_list:
    if item:
        count += 1
    else:
        if count > max_count:
            max_count = count
        count = 0
print(max_count)
It will print 3, but in a Series of all True, it will print 0
Option 1
Use the series itself to mask the cumulative sum of the negation, then use value_counts:
(~s).cumsum()[s].value_counts().max()
3
explanation
(~s).cumsum() is a pretty standard way to produce distinct True/False groups:
0    1
1    1
2    2
3    2
4    2
5    2
6    3
7    4
dtype: int64
But you can see that the group we care about is represented by the 2s and there are four of them. That's because the group is initiated by the first False (which becomes True with (~s)). Therefore, we mask this cumulative sum with the boolean mask we started with.
(~s).cumsum()[s]
1    1
3    2
4    2
5    2
dtype: int64
Now we see the three 2s pop out and we just have to use a method to extract them. I used value_counts and max.
Option 2
Use factorize and bincount
a = s.values
b = pd.factorize((~a).cumsum())[0]
np.bincount(b[a]).max()
3
explanation
The explanation is similar to option 1's. The main difference is in how I found the max. I use pd.factorize to tokenize the values into integers ranging from 0 to the number of unique values. Given the actual values we had in (~a).cumsum(), we didn't strictly need this step; I used it because it's a general-purpose tool that can be used on arbitrary group names.
After pd.factorize I use those integer values in np.bincount which accumulates the total number of times each integer is used. Then take the maximum.
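A sketch tracing the intermediates of option 2 on the question's series:

import numpy as np
import pandas as pd

s = pd.Series([False, True, False, True, True, True, False, False])
a = s.values

groups = (~a).cumsum()               # [1 1 2 2 2 2 3 4] -- one id per run
labels = pd.factorize(groups)[0]     # [0 0 1 1 1 1 2 3] -- re-tokenized from 0
print(np.bincount(labels[a]))        # [1 3] -- True counts per group
print(np.bincount(labels[a]).max())  # 3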
Option 3
As stated in the explanation of option 2, this also works:
a = s.values
np.bincount((~a).cumsum()[a]).max()
3
I think this could work
pd.Series(s.index[~s].values).diff().max()-1
Out[57]: 3.0
Also, outside pandas we can fall back to Python's groupby:
from itertools import groupby
max([len(list(group)) for key, group in groupby(s.tolist())])
Out[73]: 3
Update:
from itertools import compress
max(list(compress([len(list(group)) for key, group in groupby(s.tolist())],
                  [key for key, group in groupby(s.tolist())])))
Out[84]: 3
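The same filtering can be written more plainly, walking the groups only once (a sketch; like the one-liner, it assumes at least one True is present):

from itertools import groupby

runs = [(key, len(list(group))) for key, group in groupby(s.tolist())]
print(max(length for key, length in runs if key))  # 3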
You can use (inspired by @piRSquared's answer):
s.groupby((~s).cumsum()).sum().max()
Out[513]: 3.0
Another option is to use a lambda function:
s.to_frame().apply(lambda x: s.loc[x.name:].idxmin() - x.name, axis=1).max()
Out[429]: 3
Edit: As piRSquared mentioned, my previous solution needs two False values appended, one at the beginning and one at the end of the series. piRSquared kindly gave an answer based on that:
(np.diff(np.flatnonzero(np.append(True, np.append(~s.values, True)))) - 1).max()
My original attempt was
(np.diff(s.where(~s).dropna().index.values) - 1).max()
(This will not give the correct answer if the longest True run starts at the beginning or ends at the end of the series, as pointed out by piRSquared. Please use the solution above given by piRSquared; mine remains only for the explanation.)
Explanation:
This finds the indices of the False values; from the gaps between those indices we can derive the longest True run.
s.where(s == False).dropna().index.values finds all the indices of False
array([0, 2, 6, 7])
We know that Trues live between the Falses, so we can use np.diff to find the gaps between these indices:
array([2, 4, 1])
Subtract 1 at the end, since the Trues lie strictly between these indices.
Finally, take the maximum of the differences.
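To see why the padding matters, here is a sketch with a series whose longest True run touches the end:

import numpy as np
import pandas as pd

t = pd.Series([False, True, True, True])  # longest True run ends the series

padded = np.append(True, np.append(~t.values, True))
print((np.diff(np.flatnonzero(padded)) - 1).max())  # 3 -- correct

# the unpadded version sees only one False index here, so np.diff is
# empty and .max() raises -- exactly the edge case described above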
Your code was actually very close. It becomes perfect with a minor fix:
count = 0
maxCount = 0
for item in s:
    if item:
        count += 1
        if count > maxCount:
            maxCount = count
    else:
        count = 0
print(maxCount)
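A quick check of the fix on the original series and on an all-True one (sketch):

import pandas as pd

for series in (pd.Series([False, True, False, True, True, True, False, False]),
               pd.Series([True, True])):
    count = maxCount = 0
    for item in series:
        if item:
            count += 1
            if count > maxCount:
                maxCount = count
        else:
            count = 0
    print(maxCount)  # prints 3, then 2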
I'm not exactly sure how to do it with pandas but what about using itertools.groupby?
>>> import pandas as pd
>>> from itertools import groupby
>>> s = pd.Series([False, True, False, True, True, True, False, False])
>>> max(sum(1 for _ in g) for k, g in groupby(s) if k)
3

Floating point comparison not working on pandas groupby output

I am facing issues with pandas filtering of rows. I am trying to filter out teams whose weights do not sum to one.
dfteam
Team  Weight
A     0.2
A     0.5
A     0.2
A     0.1
B     0.5
B     0.25
B     0.25
dfteamtemp = dfteam.groupby(['Team'], as_index=False)['Weight'].sum()
dfweight = dfteamtemp[dfteamtemp['Weight'].astype(float) != 1.0]
dfweight
  Team  Weight
0    A     1.0
I am not sure about the reason for this output. I should get an empty dataframe, but it gives me team A even though the sum is 1.
You are a victim of floating-point inaccuracies. The first team's weights do not add up to exactly 1.0:
df.groupby('Team').Weight.sum().iat[0]
0.99999999999999989
You can resolve this by using np.isclose instead -
np.isclose(df.groupby('Team').Weight.sum(), 1.0)
array([ True, True], dtype=bool)
And filter on this array. Or, as @ayhan suggested, use groupby + filter:
df.groupby('Team').filter(lambda x: not np.isclose(x['Weight'].sum(), 1))
Empty DataFrame
Columns: [Team, Weight]
Index: []
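Filtering on that boolean array could look like this (a sketch, assuming df is the Team/Weight frame from the question, named dfteam there):

import numpy as np

sums = df.groupby('Team', as_index=False)['Weight'].sum()
print(sums[~np.isclose(sums['Weight'], 1.0)])
# Empty DataFrame -- every team's weights sum to (approximately) 1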

Why does pandas dataframe indexing change axis depending on index type?

When you index into a pandas dataframe using a list of ints, it returns columns;
e.g. df[[0, 1, 2]] returns the first three columns.
Why does indexing with a boolean vector return rows instead?
e.g. df[[True, False, True]] returns the first and third rows (and errors out if there aren't exactly 3 rows).
Why? Shouldn't it return the first and third columns?
Thanks!
Because if you use:
df[[True, False, True]]
it is called boolean indexing by mask:
[True, False, True]
Sample:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
print(df)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
print(df[[True, False, True]])
   A  B  C
0  1  4  7
2  3  6  9
The boolean mask is the same as:
print(df.B != 5)
0     True
1    False
2     True
Name: B, dtype: bool
print(df[df.B != 5])
   A  B  C
0  1  4  7
2  3  6  9
There are very specific slicing accessors to target rows and columns in specific ways.
Mixed Position and Label Based Selection
Select by position
Selection by Label
loc[], at[], and get_value() take row and column labels and return the appropriate slice.
iloc[] and iat[] take row and column positions and return the appropriate slice.
What you are seeing is the result of pandas trying to infer what you are trying to do. As you have noticed, this is inconsistent at times. In fact, it is more pronounced than just what you've highlighted... but I won't go into that now.
See also
pandas docs
However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.
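For example, to take columns (not rows) by position, or either axis by label, a sketch using the sample df from the first answer:

print(df.iloc[:, [0, 2]])     # positions -> columns A and C
print(df.loc[[0, 2]])         # row labels -> rows 0 and 2
print(df.loc[:, ['A', 'C']])  # column labels -> columns A and C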
