In Python, compare row diffs for multiple columns

I want to perform a row-by-row comparison over multiple columns. I want a single series indicating whether all entries in a row (over several columns) are the same as in the previous row.
Let's say I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [2, 2, 3, 3, 3],
                   'C': [1, 1, 1, 2, 2]})
I can compare all the rows of all the columns:
>>> df.diff().eq(0)
A B C
0 False False False
1 True True True
2 True False True
3 False True False
4 True True True
This gives a dataframe comparing each series individually. What I want is the comparison of all columns in one series.
I can achieve this by looping
compare_all = df.diff().eq(0)
compare_tot = compare_all[compare_all.columns[0]]
for c in compare_all.columns[1:]:
    compare_tot = compare_tot & compare_all[c]
This gives
>>> compare_tot
0 False
1 True
2 False
3 False
4 True
dtype: bool
as expected.
Is it possible to achieve this with a one-liner, that is, without the loop?

>>> (df == df.shift()).all(axis=1)
0 False
1 True
2 False
3 False
4 True
dtype: bool

You need all
In [1306]: df.diff().eq(0).all(1)
Out[1306]:
0 False
1 True
2 False
3 False
4 True
dtype: bool
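As a small usage note, the resulting boolean series can be fed straight back into boolean indexing, for example to pull out the rows that exactly repeat the previous row. A minimal sketch using the question's dataframe (the variable names are just illustrative):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [2, 2, 3, 3, 3],
                   'C': [1, 1, 1, 2, 2]})

same_as_prev = (df == df.shift()).all(axis=1)  # one boolean per row
print(df[same_as_prev])                        # rows 1 and 4, which repeat the row above them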

Related

pandas isin comparison to multiple columns, not including index

I'm trying to test whether a dictionary of key-value pairs is contained in a DataFrame with columns having the same names as the dictionary.
example:
df1 = pd.DataFrame({'A': [2,8,4,9,6], 'B': [7,1,8,3,5], 'C': [8,4,9,1,6], 'D': [7,8,9,1,2], 'E': [3,8,4,9,6]})
df1
A B C D E
0 2 7 8 7 3
1 8 1 4 8 8
2 4 8 9 9 4
3 9 3 1 1 9
4 6 5 6 2 6
d = {'A': 9, 'B': 3, 'C': 1, 'D': 1, 'E': 9}
df2 = pd.DataFrame([d])
df2
A B C D E
0 9 3 1 1 9
What I want is a statement that returns True if the entire row of values in df2 is matched anywhere in df1. I've tried passing d and df2 to the .isin values parameter:
df1.isin(d)
results in an error.
TypeError: only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'int'
while using df2 returns all False.
df1.isin(df2)
A B C D E
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
I played around with the last example in the pandas.DataFrame.isin doc, and realized my test with df2 fails because the index doesn't match (3 in df1 versus 0 in df2).
Is there an easy way to do this with isin that ignores the index, or some other method that doesn't involve stringing together five equality tests?
Is this what you expect?
>>> df1.eq(df2.values).all(axis=1).any()
True
You can also use d directly:
>>> df1.eq(d).all(axis=1).any()
True
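One caveat worth noting: df1.eq(df2.values) compares purely by position, so it assumes df1 and df2 list their columns in the same order. If that isn't guaranteed, a sketch that aligns on column names instead, using the same df1 and df2 as above:
row = df2.iloc[0]                        # the single row of df2 as a Series
print(df1.eq(row).all(axis=1).any())     # eq aligns the Series index with df1's columns -> True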

(Python) Selecting rows containing a string in ANY column?

I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
    rows = df[df[col].str.contains(searchTerm, case=False, na=False)]
However, it only returns up to 2 rows, even when I search for a string that I know exists in more rows than that.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm
Welcome!
Agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/Numpy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'", life is a lot easier than you think.
Below is some data:
import pandas as pd
df = pd.DataFrame({
    'a': range(10),
    'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
    'c': [False, False, True, True, True, False, False, True, True, True],
    'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
a b c d
0 0 x False 1
1 1 b False x
2 2 c True 3
3 3 d True 4
4 4 x True 5
5 5 f False 6
6 6 g False 7
7 7 h True 8
8 8 i True x
9 9 x True 10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
a b c d
0 False True False False
1 False False False True
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False True False False
But pandas also has the handy .any method to check for True along an axis. Since we want to check across all the columns in each row, we want axis=1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want the match to be case insensitive, like the case=False you're using with the .str method, you might need something more like:
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
a b c d
0 0 x False 1
1 1 b False x
4 4 x True 5
8 8 i True x
9 9 x True 10
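Regarding the edit in the question (getting the row and the column of each matching cell, not just the rows): the same element-wise mask can be stacked into (row, column) pairs. A sketch reusing the case-insensitive comparison above (mask and hits are just illustrative names):
mask = df.applymap(lambda v: str(v).lower() == 'x')   # element-wise True where a cell matches
hits = mask.stack()
hits = hits[hits]                                     # keep only the True entries
print(list(hits.index))   # [(0, 'b'), (1, 'd'), (4, 'b'), (8, 'd'), (9, 'b')]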

Applying a pandas GroupBy with mixed boolean and numerical values

How can I apply a pandas groupby to columns that are numerical and boolean? I want to sum over the numerical columns, and I want the aggregation of the boolean values to be any, that is, True if there are any Trues and False if all values are False.
Performing a sum aggregation will give the desired result as long as you cast the boolean columns back to boolean types. Example
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'bool': [True, False, True, True, False, False],
                   'c': [10, 10, 15, 15, 20, 20]})
id bool c
0 1 True 10
1 1 False 10
2 2 True 15
3 2 True 15
4 3 False 20
5 3 False 20
df.groupby('id').sum()
bool c
id
1 1.0 20
2 2.0 30
3 0.0 40
As you can see, when applying the sum, True is cast to 1 and False to 0, so a non-zero result means the group contained at least one True, which is exactly the desired any behaviour. Casting the aggregated column back to boolean:
out = df.groupby('id').sum()
out['bool'] = out['bool'].astype(bool)
bool c
id
1 True 20
2 True 30
3 False 40
You can also choose which function to aggregate each column with:
df.groupby("id").agg({
    "bool": lambda arr: any(arr),
    "c": sum,
})
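A slightly more compact variant, sketched here with pandas' built-in aggregation names rather than lambdas, also keeps the bool dtype without any recasting:
out = df.groupby('id').agg({'bool': 'any', 'c': 'sum'})
print(out)
#      bool   c
# id
# 1    True  20
# 2    True  30
# 3   False  40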

Filtering rows in a DataFrame when condition is rows in another DataFrame

I have a huge timeseries DataFrame (about 100,000,000 rows) and I need to filter its rows by conditions. The conditions are stored in the rows of another DataFrame, which has about 2000 rows; each row is one condition.
Toy example:
import pandas as pd

df = pd.DataFrame({'val': [1, 3, 2, 4, 3, 1, 2, 3],
                   'date': pd.to_datetime(['2015-03-12', '2015-04-12', '2015-05-13', '2016-03-12',
                                           '2016-04-07', '2016-05-12', '2017-01-11', '2017-03-20'])})
df_condition = pd.DataFrame({'val': [2, 3],
                             'date': pd.to_datetime(['2015-07-13', '2016-04-08'])})
The condition is: remove all rows in df where val occurs earlier than the corresponding date in df_condition:
df = df[(df['val']==2) & (df['date']>'2015-07-13')]
df = df[(df['val']==3) & (df['date']>'2016-04-08')]
and so on about 2000 conditions
I use the approach below, but it takes too long (about 5 hours). Is there a faster method?
vals = df_condition.val.values
dates = df_condition.date.values
for i in range(len(df_condition)):
    df = df[~((df.val == vals[i]) & (df.date < dates[i]))]
I believe you can create a list of all masks (one keep-mask per condition, mirroring the ~ filter in your loop) and then reduce them with a logical and:
import numpy as np

masks = [~((df.val == x.val) & (df.date < x.date)) for x in df_condition.itertuples()]
print(masks)
[0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool, 0 True
1 False
2 True
3 True
4 False
5 True
6 True
7 True
dtype: bool]
df1 = df[np.logical_and.reduce(masks)]
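With roughly 2000 conditions, another option worth trying is a single merge on val followed by one vectorized comparison, which avoids building 2000 masks. A sketch, assuming each val occurs at most once in df_condition (names like date_cutoff are illustrative):
# Attach each row's cutoff date via a left merge, then filter in one pass.
# Rows whose val has no entry in df_condition are kept (their cutoff is NaT).
merged = df.merge(df_condition, on='val', how='left', suffixes=('', '_cutoff'))
keep = merged['date_cutoff'].isna() | (merged['date'] >= merged['date_cutoff'])
df_filtered = merged.loc[keep, df.columns]   # note: the merge resets the row index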

Condensing pandas dataframe by dropping missing elements

Problem
I have a dataframe that looks like this:
Key Var ID_1 Var_1 ID_2 Var_2 ID_3 Var_3
1 True 1.0 True NaN NaN 5.0 True
2 True NaN NaN 4.0 False 7.0 True
3 False 2.0 False 5.0 True NaN NaN
Each row has exactly 2 non-null sets of data (ID/Var), and the remaining third is guaranteed to be null. What I want to do is "condense" the dataframe by removing the missing elements.
Desired Output
Key Var First_ID First_Var Second_ID Second_Var
1 True 1 True 5 True
2 True 4 False 7 True
3 False 2 False 5 True
The ordering is not important, so long as the Id/Var pairs are maintained.
Current Solution
Below is a working solution that I have:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False], 'ID_1': [1, np.NaN, 2],
                     'Var_1': [True, np.NaN, False], 'ID_2': [np.NaN, 4, 5], 'Var_2': [np.NaN, False, True],
                     'ID_3': [5, 7, np.NaN], 'Var_3': [True, True, np.NaN]})
sorted_columns = ['Key', 'Var', 'ID_1', 'Var_1', 'ID_2', 'Var_2', 'ID_3', 'Var_3']
data = data[sorted_columns]
output = np.empty(shape=[data.shape[0], 6], dtype=str)
for i, *row in data.itertuples():
    output[i] = [element for element in row if np.isfinite(element)]
print(output)
[['1' 'T' '1' 'T' '5' 'T']
['2' 'T' '4' 'F' '7' 'T']
['3' 'F' '2' 'F' '5' 'T']]
This is acceptable, but not ideal. I can live with not having the column names, but my big issue is having to cast the data inside the array into a string in order to avoid my booleans being converted to numeric.
Are there other solutions that do a better job at preserving the data? Bonus points if the result is a pandas dataframe.
There is one simple solution, i.e. push the NaNs to the right in each row and then drop the NaN columns on axis 1:
ndf = data.apply(lambda x: sorted(x, key=pd.isnull), axis=1).dropna(axis=1)
Output:
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
Hope it helps.
A NumPy solution from Divakar here, for roughly 10x the speed:
def mask_app(a):
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    mask = ~np.isnan(a.astype(float))
    out[np.sort(mask, axis=1)[:, ::-1]] = a[mask]
    return out

ndf = pd.DataFrame(mask_app(data.values), columns=data.columns).dropna(axis=1)
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
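If the column names from the desired output matter, a small follow-up sketch (the labels below are simply the ones from the desired output in the question) is to relabel the surviving columns after condensing:
ndf.columns = ['Key', 'Var', 'First_ID', 'First_Var', 'Second_ID', 'Second_Var']
print(ndf)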
