Condensing pandas dataframe by dropping missing elements - python

Problem
I have a dataframe that looks like this:
Key    Var  ID_1  Var_1  ID_2  Var_2  ID_3  Var_3
  1   True   1.0   True   NaN    NaN   5.0   True
  2   True   NaN    NaN   4.0  False   7.0   True
  3  False   2.0  False   5.0   True   NaN    NaN
Each row has exactly 2 non-null sets of data (ID/Var), and the remaining third is guaranteed to be null. What I want to do is "condense" the dataframe by removing the missing elements.
Desired Output
Key    Var  First_ID  First_Var  Second_ID  Second_Var
  1   True         1       True          5        True
  2   True         4      False          7        True
  3  False         2      False          5        True
The ordering is not important, so long as the Id/Var pairs are maintained.
Current Solution
Below is a working solution that I have:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False],
                     'ID_1': [1, np.NaN, 2], 'Var_1': [True, np.NaN, False],
                     'ID_2': [np.NaN, 4, 5], 'Var_2': [np.NaN, False, True],
                     'ID_3': [5, 7, np.NaN], 'Var_3': [True, True, np.NaN]})
sorted_columns = ['Key', 'Var', 'ID_1', 'Var_1', 'ID_2', 'Var_2', 'ID_3', 'Var_3']
data = data[sorted_columns]
output = np.empty(shape=[data.shape[0], 6], dtype=str)
for i, *row in data.itertuples():
    output[i] = [element for element in row if np.isfinite(element)]
print(output)
[['1' 'T' '1' 'T' '5' 'T']
['2' 'T' '4' 'F' '7' 'T']
['3' 'F' '2' 'F' '5' 'T']]
This is acceptable, but not ideal. I can live with not having the column names, but my big issue is having to cast the data inside the array into a string in order to avoid my booleans being converted to numeric.
Are there other solutions that do a better job at preserving the data? Bonus points if the result is a pandas dataframe.

There is one simple solution, i.e. push the NaNs to the right and then drop the all-NaN columns along axis 1:
ndf = data.apply(lambda x: sorted(x, key=pd.isnull), axis=1).dropna(axis=1)
Output:
   Key    Var  ID_1  Var_1  ID_2  Var_2
0    1   True     1   True     5   True
1    2   True     4  False     7   True
2    3  False     2  False     5   True
Hope it helps.
A NumPy solution from Divakar here, for roughly 10x speed:
def mask_app(a):
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    mask = ~np.isnan(a.astype(float))
    out[np.sort(mask, 1)[:, ::-1]] = a[mask]
    return out

ndf = pd.DataFrame(mask_app(data.values), columns=data.columns).dropna(axis=1)
   Key    Var  ID_1  Var_1  ID_2  Var_2
0    1   True     1   True     5   True
1    2   True     4  False     7   True
2    3  False     2  False     5   True
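If you also want the column names from the desired output, here is a small follow-up on the ndf from either answer (a sketch; the new names are simply the ones from the question, and the int cast assumes the ID columns should be whole numbers):
ndf.columns = ['Key', 'Var', 'First_ID', 'First_Var', 'Second_ID', 'Second_Var']
# the ID columns may come back as floats or objects after the reshuffle, so cast them if exact dtypes matter
ndf[['First_ID', 'Second_ID']] = ndf[['First_ID', 'Second_ID']].astype(int)
print(ndf)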

Related

TypeError: cannot do positional indexing on Int64Index with these indexers [Int64Index([5], dtype='int64')] of type Int64Index

I have a dataframe (small sample) like this:
import pandas as pd
data = [['A', False, 2], ['A', True, 8], ['A', False, 25], ['A', False, 30], ['B', False, 4], ['B', False, 8], ['B', True, 2], ['B', False, 3]]
df = pd.DataFrame(data = data, columns = ['group', 'indicator', 'val'])
  group  indicator  val
0     A      False    2
1     A       True    8
2     A      False   25
3     A      False   30
4     B      False    4
5     B      False    8
6     B       True    2
7     B      False    3
I would like to select n rows above and below the row with indicator == True for each group. For example I would like to get n = 1 rows which means that for group A it would return the rows with index: 0, 1, 2 and for group B rows with index: 5, 6, 7. I tried the following code:
# subset each group to list
dfs = [x for _, x in df.groupby('group')]

for i in dfs:
    # select dataframe
    df_sub = dfs[1]
    # get index of row with indicator True
    idx = df_sub.index[df_sub['indicator'] == True]
    # select n rows above and below row with True
    df_sub = df_sub.iloc[idx - 1: idx + 1]
    # combine each dataframe again
    df_merged = pd.concat(df_sub)

print(df_merged)
But I get the following error:
TypeError: cannot do positional indexing on Int64Index with these indexers [Int64Index([5], dtype='int64')] of type Int64Index
This is the desired output:
data = [['A', False, 2], ['A', True, 8], ['A', False, 25], ['B', False, 8], ['B', True, 2], ['B', False, 3]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'indicator', 'val'])
  group  indicator  val
0     A      False    2
1     A       True    8
2     A      False   25
3     B      False    8
4     B       True    2
5     B      False    3
I don't understand why this error happens and how to solve it. Does anyone know how to fix this issue?
You can use a groupby.rolling with a centered window of 2*n+1 to get the n rows before and after each True, then perform boolean indexing:
n = 1
mask = (df.groupby('group')['indicator']
          .rolling(n*2+1, center=True, min_periods=1)
          .max().droplevel(0)
          .astype(bool)
        )
out = df.loc[mask]
output:
  group  indicator  val
0     A      False    2
1     A       True    8
2     A      False   25
5     B      False    8
6     B       True    2
7     B      False    3
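As for why the original loop fails: idx is an Int64Index of row labels, but .iloc only accepts plain integer positions, so df_sub.iloc[idx - 1: idx + 1] raises the TypeError. A minimal fix for the loop itself (a sketch, assuming exactly one True row per group and working with positions instead of labels) could look like:
import numpy as np
import pandas as pd

n = 1
parts = []
for _, df_sub in df.groupby('group'):
    # position (not label) of the True row within this sub-frame
    pos = int(np.flatnonzero(df_sub['indicator'].to_numpy())[0])
    # take n rows above and below, clipping at the top of the sub-frame
    parts.append(df_sub.iloc[max(pos - n, 0): pos + n + 1])

df_merged = pd.concat(parts)
print(df_merged)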

Python pandas equivalent to R's group_by, mutate, and ifelse

Probably a duplicate, but I have spent too much time on this now googling without any luck. Assume I have a data frame:
import pandas as pd
data = {"letters": ["a", "a", "a", "b", "b", "b"],
"boolean": [True, True, True, True, True, False],
"numbers": [1, 2, 3, 1, 2, 3]}
df = pd.DataFrame(data)
df
I want to 1) group by letters, 2) take the mean of numbers if all values in boolean have the same value. In R I would write:
library(dplyr)
df %>%
  group_by(letters) %>%
  mutate(
    condition = n_distinct(boolean) == 1,
    numbers = ifelse(condition, mean(numbers), numbers)
  ) %>%
  select(-condition)
This would result in the following output:
# A tibble: 6 x 3
# Groups:   letters [2]
  letters boolean numbers
  <chr>   <lgl>     <dbl>
1 a       TRUE          2
2 a       TRUE          2
3 a       TRUE          2
4 b       TRUE          1
5 b       TRUE          2
6 b       FALSE         3
How would you do it using Python pandas?
We can use lazy groupby and transform:
g = df.groupby('letters')
df.loc[g['boolean'].transform('all'), 'numbers'] = g['numbers'].transform('mean')
Output:
  letters  boolean  numbers
0       a     True        2
1       a     True        2
2       a     True        2
3       b     True        1
4       b     True        2
5       b    False        3
Another way would be to use np.where: where a group has one unique boolean value, take the mean; where it doesn't, keep the numbers. Code below:
import numpy as np

df['numbers'] = np.where(df.groupby('letters')['boolean'].transform('nunique') == 1,
                         df.groupby('letters')['numbers'].transform('mean'),
                         df['numbers'])
  letters  boolean  numbers
0       a     True      2.0
1       a     True      2.0
2       a     True      2.0
3       b     True      1.0
4       b     True      2.0
5       b    False      3.0
Alternatively, mask where condition does not apply as you compute the mean.
m = df.groupby('letters')['boolean'].transform('nunique') == 1
df.loc[m, 'numbers'] = df[m].groupby('letters')['numbers'].transform('mean')
Since you are comparing directly to R, I would prefer to use siuba rather than pandas:
from siuba import mutate, if_else, _, select, group_by, ungroup
df1 = df >> \
    group_by(_.letters) >> \
    mutate(condition = _.boolean.unique().size == 1,
           numbers = if_else(_.condition, _.numbers.mean(), _.numbers)
    ) >> \
    ungroup() >> select(-_.condition)
print(df1)
  letters  boolean  numbers
0       a     True      2.0
1       a     True      2.0
2       a     True      2.0
3       b     True      1.0
4       b     True      2.0
5       b    False      3.0
Note that >> is the pipe. I added \ in order to continue on the next line. Also note that to refer to a variable you use _.variable.
EDIT
It seems your R code has an issue. In R, you should rather use condition = all(boolean) instead of the code you have. That will translate to condition = boolean.all() in Python.
datar is another solution for you:
>>> import pandas as pd
>>> data = {"letters": ["a", "a", "a", "b", "b", "b"],
... "boolean": [True, True, True, True, True, False],
... "numbers": [1, 2, 3, 1, 2, 3]}
>>> df = pd.DataFrame(data)
>>>
>>> from datar.all import f, group_by, mutate, n_distinct, if_else, mean, select
>>> df >> group_by(f.letters) \
... >> mutate(
... condition=n_distinct(f.boolean) == 1,
... numbers = if_else(f.condition, mean(f.numbers), f.numbers)
... ) \
... >> select(~f.condition)
   letters  boolean   numbers
  <object>   <bool> <float64>
0        a     True       2.0
1        a     True       2.0
2        a     True       2.0
3        b     True       1.0
4        b     True       2.0
5        b    False       3.0
[Groups: letters (n=2)]

(Python) Selecting rows containing a string in ANY column?

I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
    rows = df[df[col].str.contains(searchTerm, case = False, na = False)]
However, it only returns up to 2 rows, even when I search for a string that I know exists in more rows than that.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm
Welcome!
Agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/Numpy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'", life is a lot easier than you think.
Below is some data:
import pandas as pd
df = pd.DataFrame({
    'a': range(10),
    'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
    'c': [False, False, True, True, True, False, False, True, True, True],
    'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
   a  b      c   d
0  0  x  False   1
1  1  b  False   x
2  2  c   True   3
3  3  d   True   4
4  4  x   True   5
5  5  f  False   6
6  6  g  False   7
7  7  h   True   8
8  8  i   True   x
9  9  x   True  10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
       a      b      c      d
0  False   True  False  False
1  False  False  False   True
2  False  False  False  False
3  False  False  False  False
4  False   True  False  False
5  False  False  False  False
6  False  False  False  False
7  False  False  False  False
8  False  False  False   True
9  False   True  False  False
But pandas also has the handy .any method to check for True along an axis. Since we want to check across all columns of each row, we want axis=1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want your solution to be truly case-insensitive, like the .str.contains(case=False) call you're using, you might need something more like:
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
   a  b      c   d
0  0  x  False   1
1  1  b  False   x
4  4  x   True   5
8  8  i   True   x
9  9  x   True  10
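To also get the row and column of each matching cell (the end goal mentioned in the edit), here is a small sketch along the same lines, assuming searchTerm holds the string you are looking for:
# case-insensitive substring match in every column, converting each column to str first
mask = df.apply(lambda s: s.astype(str).str.contains(searchTerm, case=False, na=False))

# stack() turns the boolean frame into a Series indexed by (row label, column name)
hits = mask.stack()
locations = hits[hits].index.tolist()
print(locations)  # e.g. [(0, 'b'), (1, 'd'), ...] for searchTerm = 'x'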

pandas: How to count the unique categories?

I have a dataframe
df_input = pd.DataFrame(
    {
        "col_cate": ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        "target_bool": [True, False, True, False, True, False, True, False]
    }
)
And I want to count the number of unique categories. So I am expecting the output to be like this
col_cate, target_bool, cnt
'A' , True , 2
'A' , False , 2
'B' , True , 2
'B' , False , 2
But df_input.groupby(["col_cate", "target_bool"]).count() gives
Empty DataFrame
Columns: []
Index: [(A, False), (A, True), (B, False), (B, True)]
But adding a dummy to the df_input works, like df_input["dummy"] = 1.
How do I get the group by count table without adding a dummy?
df_input.groupby('col_cate')['target_bool'].value_counts()
col_cate  target_bool
A         False          2
          True           2
B         False          2
          True           2
then you can reset_index()
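For completeness, a small sketch of that combined step (the name='cnt' argument is only there to match the cnt column from the question):
out = (df_input.groupby('col_cate')['target_bool']
               .value_counts()
               .reset_index(name='cnt'))
print(out)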
Because GroupBy.count counts values while excluding missing ones, it is necessary to specify a column after the groupby if both columns are used in the by parameter:
df = (df_input.groupby(by=["col_cate", "target_bool"])['col_cate']
              .count()
              .reset_index(name='cnt'))
print (df)
  col_cate  target_bool  cnt
0        A        False    2
1        A         True    2
2        B        False    2
3        B         True    2
If you want to count all columns (here both, though that always gives the same output), specify both columns:
df1 = (df_input.groupby(["col_cate", "target_bool"])[['col_cate', 'target_bool']]
               .count()
               .add_suffix('_count')
               .reset_index())
print (df1)
  col_cate  target_bool  col_cate_count  target_bool_count
0        A        False               2                  2
1        A         True               2                  2
2        B        False               2                  2
3        B         True               2                  2
Or use GroupBy.size, which works a bit differently: it counts all values and does not exclude missing ones, so no column needs to be specified:
df = df_input.groupby(["col_cate", "target_bool"]).size().reset_index(name='cnt')
print (df)
  col_cate  target_bool  cnt
0        A        False    2
1        A         True    2
2        B        False    2
3        B         True    2
Like this also:
In [54]: df_input.groupby(df_input.columns.tolist()).size().reset_index().\
...: rename(columns={0:'cnt'})
Out[54]:
  col_cate  target_bool  cnt
0        A        False    2
1        A         True    2
2        B        False    2
3        B         True    2

In Python, compare row diffs for multiple columns

I want to perform a row by row comparison over multiple columns. I want a single series, indicating if all entries in a row (over several columns) are the same as the previous row.
Lets say I have the following dataframe
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [2, 2, 3, 3, 3],
                   'C': [1, 1, 1, 2, 2]})
I can compare all the rows, of all the columns
>>> df.diff().eq(0)
       A      B      C
0  False  False  False
1   True   True   True
2   True  False   True
3  False   True  False
4   True   True   True
This gives a dataframe comparing each series individually. What I want is the comparison of all columns in one series.
I can achieve this by looping
compare_all = df.diff().eq(0)
compare_tot = compare_all[compare_all.columns[0]]
for c in compare_all.columns[1:]:
    compare_tot = compare_tot & compare_all[c]
This gives
>>> compare_tot
0 False
1 True
2 False
3 False
4 True
dtype: bool
as expected.
Is it possible to achieve this in with a one-liner, that is without the loop?
>>> (df == df.shift()).all(axis=1)
0 False
1 True
2 False
3 False
4 True
dtype: bool
You need all
In [1306]: df.diff().eq(0).all(1)
Out[1306]:
0 False
1 True
2 False
3 False
4 True
dtype: bool
