Drop rows in pandas dataframe based on column values - Python

I have a dataframe like this:
import numpy as np
import pandas as pd

cols = ['a', 'b']
df = pd.DataFrame(data=[[np.nan, -1, np.nan, 34], [-32, 1, -4, np.nan], [4, 5, 41, 14], [3, np.nan, 1, np.nan]], columns=['a', 'b', 'c', 'd'])
I want to keep the rows where the columns 'a' and 'b' are non-negative, but if either or both of them are missing, I also want to keep the row.
The result should be
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
I've tried this but it doesn't give the expected result.
df[(df[cols]>0).all(axis=1) | df[cols].isnull().any(axis=1)]

IIUC, you actually want
>>> df[((df[cols] > 0) | df[cols].isnull()).all(axis=1)]
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
Right now you're getting "if they're all positive" or "any are null". You want "if they're all (positive or null)". (Replace > 0 with >=0 for nonnegativity.)
And since NaN <= 0 evaluates to False, we could simplify by flipping the condition and using something like
>>> df[~(df[cols] <= 0).any(axis=1)]
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
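As a quick sanity check (a minimal sketch using the df defined above), both filters select the same rows:
mask1 = ((df[cols] > 0) | df[cols].isnull()).all(axis=1)
mask2 = ~(df[cols] <= 0).any(axis=1)
print(df[mask1].equals(df[mask2]))   # True: both keep rows 2 and 3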


drop rows using pandas groupby and filter

I'm trying to drop rows from a df where certain conditions are met. Using the code below, I'm grouping values using column C. For each unique group, I want to drop ALL rows in that group where A is less than 1 AND B is greater than 100 on the same row. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
    'A' : [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'B' : [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
    'C' : ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition and then filter the original column C with Series.isin in boolean indexing, using an inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is to use GroupBy.transform with any to test whether at least one value in each group matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution is possible if you reduce the mask to a scalar with any and negate it with not (or use the equivalent all form), but for a large DataFrame it will be slow:
df1 = df.groupby(['C']).filter(lambda g:not ( g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print (df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
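As a quick check (a small sketch using the df above), all three approaches return identical frames:
m = df['A'].lt(1) & df['B'].gt(100)
out1 = df[~df['C'].isin(df.loc[m, 'C'])]
out2 = df[~m.groupby(df['C']).transform('any')]
out3 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print(out1.equals(out2) and out2.equals(out3))   # True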

Update NaN values with dictionary of values based on condition

I have data frame like this:
c1 c2
0 a 12
1 b NaN
2 a 45
3 c NaN
4 c 32
5 b NaN
and I have dictionary like this
di = {
'a': 10, 'b': 20, 'c':30
}
I want to update my data frame like this
c1 c2
0 a 12
1 b 20
2 a 45
3 c 30
4 c 32
5 b 20
Is there any way to do it without using a long lambda function with conditions?
Here's the code to create your data frame
a = pd.DataFrame({
    'c1': ['a', 'b', 'a', 'c', 'c', 'b'],
    'c2': [12, np.nan, 45, np.nan, 32, np.nan]
})
di = {
    'a': 10, 'b': 20, 'c': 30
}
You can use the apply() method to deal with this.
Create a function and then apply it to the required features.
def deal_na(cols):
    x = cols[0]
    y = cols[1]
    if pd.isnull(y):
        return di[x]
    else:
        return y

a['c2'] = a[['c1', 'c2']].apply(deal_na, axis=1)
Here, each row's 'c1' and 'c2' values are passed to the function as the cols Series. We assign them to the variables x and y, then check whether y is null. If it is null, we replace it with di[x]; otherwise we return it as is.
Use Series.map with Series.fillna to replace only the missing values:
a['c2'] = a['c2'].fillna(a['c1'].map(di))
print (a)
c1 c2
0 a 12.0
1 b 20.0
2 a 45.0
3 c 30.0
4 c 32.0
5 b 20.0
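For intuition, the Series that fillna draws replacements from is just the mapped keys (a quick check on the frame above):
print(a['c1'].map(di).tolist())   # [10, 20, 10, 30, 30, 20]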
Lastly, if all values of c1 are keys of the dictionary, all the missing values get replaced and it is possible to convert the column to integers:
a['c2'] = a['c2'].fillna(a['c1'].map(di)).astype(int)
print (a)
c1 c2
0 a 12
1 b 20
2 a 45
3 c 30
4 c 32
5 b 20

pd.Series.explode and ValueError: cannot reindex from a duplicate axis

I consulted a lot of the posts on ValueError: cannot reindex from a duplicate axis ("What does `ValueError: cannot reindex from a duplicate axis` mean?" and other related posts). I understand that the error can arise with duplicate row indices or column names, but I still can't quite figure out what exactly is throwing the error.
Below is my best at reproducing the spirit of the dataframe, which does throw the error.
d = {"id" : [1,2,3,4,5],
"cata" : [['aaa1','bbb2','ccc3'],['aaa4','bbb5','ccc6'],['aaa7','bbb8','ccc9'],['aaa10','bbb11','ccc12'],['aaa13','bbb14','ccc15']],
"catb" : [['ddd1','eee2','fff3','ggg4'],['ddd5','eee6','fff7','ggg8'],['ddd9','eee10','fff11','ggg12'],['ddd13','eee14','fff15','ggg16'],['ddd17','eee18','fff19','ggg20']],
"catc" : [['hhh1','iii2','jjj3', 'kkk4', 'lll5'],['hhh6','iii7','jjj8', 'kkk9', 'lll10'],['hhh11','iii12','jjj13', 'kkk14', 'lll15'],['hhh16','iii17','jjj18', 'kkk18', 'lll19'],['hhh20','iii21','jjj22', 'kkk23', 'lll24']]}
df = pd.DataFrame(d)
df.head()
id cata catb catc
0 1 [aaa1, bbb2, ccc3] [ddd1, eee2, fff3, ggg4] [hhh1, iii2, jjj3, kkk4, lll5]
1 2 [aaa4, bbb5, ccc6] [ddd5, eee6, fff7, ggg8] [hhh6, iii7, jjj8, kkk9, lll10]
2 3 [aaa7, bbb8, ccc9] [ddd9, eee10, fff11, ggg12] [hhh11, iii12, jjj13, kkk14, lll15]
3 4 [aaa10, bbb11, ccc12] [ddd13, eee14, fff15, ggg16] [hhh16, iii17, jjj18, kkk18, lll19]
4 5 [aaa13, bbb14, ccc15] [ddd17, eee18, fff19, ggg20] [hhh20, iii21, jjj22, kkk23, lll24]
df.set_index(['id']).apply(pd.Series.explode).reset_index()
Here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-17e7c29b180c> in <module>()
----> 1 df.set_index(['id']).apply(pd.Series.explode).reset_index()
14 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3097 # trying to reindex on an axis with duplicates
3098 if not self.is_unique and len(indexer):
-> 3099 raise ValueError("cannot reindex from a duplicate axis")
3100
3101 def reindex(self, target, method=None, level=None, limit=None, tolerance=None):
ValueError: cannot reindex from a duplicate axis
The dataset I'm using is a few hundred MBs and it's a pain - lots of lists inside lists, but the example above is a fair representation of where I'm stuck. Even when I try to generate a fake dataframe with unique values, I still don't understand why I'm getting the ValueError.
I have explored other ways to explode the lists, like using df.apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop('level_1', 1), which doesn't throw a ValueError; however, it's definitely not as fast and I'd probably reconsider how I'm processing the df. Still, I want to understand why I'm getting this ValueError when I don't have any obvious duplicate values.
Thanks!!!!
Adding the desired output below, which I generated by chaining apply/stack/dropping levels.
id cata catb catc
0 1 aaa1 ddd1 hhh1
1 1 bbb2 eee2 iii2
2 1 ccc3 fff3 jjj3
3 1 NaN ggg4 kkk4
4 1 NaN NaN lll5
5 2 aaa4 ddd5 hhh6
6 2 bbb5 eee6 iii7
7 2 ccc6 fff7 jjj8
8 2 NaN ggg8 kkk9
9 2 NaN NaN lll10
10 3 aaa7 ddd9 hhh11
11 3 bbb8 eee10 iii12
12 3 ccc9 fff11 jjj13
13 3 NaN ggg12 kkk14
14 3 NaN NaN lll15
15 4 aaa10 ddd13 hhh16
16 4 bbb11 eee14 iii17
17 4 ccc12 fff15 jjj18
18 4 NaN ggg16 kkk18
19 4 NaN NaN lll19
20 5 aaa13 ddd17 hhh20
21 5 bbb14 eee18 iii21
22 5 ccc15 fff19 jjj22
23 5 NaN ggg20 kkk23
24 5 NaN NaN lll24
The error from pd.Series.explode() cannot be avoided here, but a long-form frame with an 'id' column can still be created:
tmp = pd.concat([df['id'], df['cata'].apply(pd.Series), df['catb'].apply(pd.Series), df['catc'].apply(pd.Series)], axis=1)
tmp2 = tmp.unstack().to_frame().reset_index()
tmp2 = tmp2[tmp2['level_0'] != 'id']
tmp2.drop('level_1', axis=1, inplace=True)
tmp2 = tmp2.rename(columns={'level_0': 'id', 0: 'value'})
tmp2.reset_index(drop=True, inplace=True)
id value
0 0 aaa1
1 0 aaa4
2 0 aaa7
3 0 aaa10
4 0 aaa13
5 1 bbb2
6 1 bbb5
7 1 bbb8
8 1 bbb11
9 1 bbb14
10 2 ccc3
11 2 ccc6
12 2 ccc9
...
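As for the "why": the list columns hold lists of different lengths per row, so each exploded column produces a different number of rows on the same duplicated 'id' index, and pandas cannot realign them when apply reassembles the frame. A quick diagnostic sketch on the example df above:
print(df[['cata', 'catb', 'catc']].apply(lambda col: col.str.len()))
# every row has lengths 3, 4 and 5, so the exploded columns cannot line up
print(df.set_index('id')['cata'].explode().shape)   # (15,)
print(df.set_index('id')['catb'].explode().shape)   # (20,)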
I had to rethink how I was parsing the data. What I accidentally omitted from this post was that I ended up with unbalanced lists as a consequence of using .str.findall(regex_pattern).to_frame() on different columns. The unbalanced lists resulted because certain metadata fields were missing over the years (e.g., "name"). However, because I started with a column of lists of lists, I exploded that using df.explode and then used findall to extract patterns into new cols, which meant that null values could be created too.
For a 500MB dataset of several hundred thousand rows of fields with string type data, the whole process took probably less than 5 min.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"id" : [1, 2, 3],
     0: [['x', 'y', 'z'], ['a', 'b', 'c'], ['a', 'b', 'c']],
     1: [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']],
     2: [['a', 'b', 'c'], ['x', 'y', 'z'], ['a', 'b', 'c']]},
)
print(df)
"""
id 0 1 2
0 1 [x, y, z] [a, b, c] [a, b, c]
1 2 [a, b, c] [a, b, c] [x, y, z]
2 3 [a, b, c] [a, b, c] [a, b, c]
"""
bb = (
df.set_index('id').stack().explode()
.reset_index(name='val')
.drop(columns='level_1').reindex()
)
print (bb)
"""
id val
0 1 x
1 1 y
2 1 z
3 1 a
4 1 b
5 1 c
6 1 a
7 1 b
8 1 c
9 2 a
10 2 b
11 2 c
12 2 a
13 2 b
14 2 c
15 2 x
16 2 y
17 2 z
18 3 a
19 3 b
20 3 c
21 3 a
22 3 b
23 3 c
24 3 a
25 3 b
26 3 c
"""
aa = df.set_index('id').apply(pd.Series.explode).reset_index()
print(aa)
"""
id 0 1 2
0 1 x a a
1 1 y b b
2 1 z c c
3 2 a a x
4 2 b b y
5 2 c c z
6 3 a a a
7 3 b b b
8 3 c c c
"""

Pandas: set all values that are <= 0 to the maximum value in a column by group, but only after the last positive value in that group

I am trying to set all values that are <= 0, by group, to the maximum value in that group, but only after the last positive value. That is, all values <=0 in the group that come before the last positive value must be ignored. Example:
data = {'group':['A', 'A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'value':[3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
df
group value
0 A 3
1 A 0
2 A 8
3 A 7
4 A 0
5 B -1
6 B 0
7 B 9
8 B -2
9 B 0
10 B 0
11 C 2
12 C 0
13 C 5
14 C 0
15 C 1
and the result must be:
group value
0 A 3
1 A 0
2 A 8
3 A 7
4 A 8
5 B -1
6 B 0
7 B 9
8 B 9
9 B 9
10 B 9
11 C 2
12 C 0
13 C 5
14 C 0
15 C 1
Thanks for any advice.
Start by adding a column to identify the rows with negative value (more precisely <= 0):
df['neg'] = (df['value'] <= 0)
Then, for each group, find the sequence of last few entries that have 'neg' set to True and that are contiguous. In order to do that, reverse the order of the DataFrame (with .iloc[::-1]) and then use .cumprod() on the 'neg' column. cumprod() will treat True as 1 and False as 0, so the cumulative product will be 1 as long as you're seeing all True's and will become and stay 0 as soon as you see the first False. Since we reversed the order, we're going backwards from the end, so we're finding the sequence of True's at the end.
df['upd'] = df.iloc[::-1].groupby('group')['neg'].cumprod().astype(bool)
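For example, restricted to group A from the data above, the mask works out like this (a small illustrative check; the reversed order doesn't matter for the next step because the assignment aligns on the index):
neg_a = df.loc[df['group'] == 'A', 'value'] <= 0       # [False, True, False, False, True]
upd_a = neg_a.iloc[::-1].cumprod().astype(bool).sort_index()
print(upd_a.tolist())                                  # [False, False, False, False, True]
Only the trailing zero in group A is flagged for updating.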
Now that we know which entries to update, we just need to know what to update them to, which is the max of the group. We can use transform('max') on a groupby to get that value and then all that's left is to do the actual update of 'value' where 'upd' is set:
df.loc[df['upd'], 'value'] = df.groupby('group')['value'].transform('max')
We can finish by dropping the two auxiliary columns we used in the process:
df = df.drop(['neg', 'upd'], axis=1)
The result I got matches your expected result.
UPDATE: Or do the whole operation in a single (long!) line, without adding any auxiliary columns to the original DataFrame:
df.loc[
df.assign(
neg=(df['value'] <= 0)
).iloc[::-1].groupby(
'group'
)['neg'].cumprod().astype(bool),
'value'
] = df.groupby(
'group'
)['value'].transform('max')
You can do it this way.
(df.loc[(df.assign(m=df['value'].lt(0)).groupby(['group'], sort=False)['m'].transform('any')) &
(df.index>=df.groupby('group')['value'].transform('idxmin')),'value']) = np.nan
df['value']=df.groupby('group').ffill()
df
Output
group value
0 A 3.0
1 A 0.0
2 A 8.0
3 A 7.0
4 A 0.0
5 B -1.0
6 B 0.0
7 B 9.0
8 B 9.0
9 B 9.0
10 B 9.0
11 C 2.0
12 C 0.0
13 C 5.0
14 C 0.0
15 C 1.0

How to split a pandas dataframe of different column sizes into separate dataframes?

I have a large pandas dataframe, consisting of a different number of columns throughout the dataframe.
Here is an example: [image: current dataframe example]
I would like to split the dataframe into multiple dataframes, based on the number of columns it has.
Example output image here: [image: desired output]
Thanks.
If you have a dataframe of, say, 10 columns and you want to put the records with 3 NaN values in a different result dataframe than those with 1 NaN, you can do it as follows:
# evaluate the number of NaNs per row
num_counts = df.isna().sum('columns')
# group by this number and add each grouped
# dataframe to a dictionary
results = dict()
for key, sub_df in df.groupby(num_counts):
    results[key] = sub_df
After executing this code, results contains subsets of df where each subset contains the same number of NaNs (so the same number of non-NaNs).
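For instance, you can check which NaN counts occur and how many rows fall into each bucket (a small usage sketch on the results dictionary):
for key, sub_df in results.items():
    print(key, sub_df.shape)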
If you want to write your results to an Excel file, you just need to execute the following code:
with pd.ExcelWriter('sorted_output.xlsx') as writer:
    for key, sub_df in results.items():
        # if you want to avoid the detour of using dictionaries,
        # just replace the previous line with
        # for key, sub_df in df.groupby(num_counts):
        sub_df.to_excel(
            writer,
            sheet_name=f'missing {key}',
            na_rep='',
            inf_rep='inf',
            float_format=None,
            index=True,
            index_label=True,
            header=True)
Example:
# create an example dataframe
df=pd.DataFrame(dict(a=[1, 2, 3, 4, 5, 6], b=list('abbcac')))
df.loc[[2, 4, 5], 'c']= list('xyz')
df.loc[[2, 3, 4], 'd']= list('vxw')
df.loc[[1, 2], 'e']= list('qw')
It looks like this:
Out[58]:
a b c d e
0 1 a NaN NaN NaN
1 2 b NaN NaN q
2 3 b x v w
3 4 c NaN x NaN
4 5 a y w NaN
5 6 c z NaN NaN
If you execute the code above on this dataframe, you get a dictionary with the following content:
0: a b c d e
2 3 b x v w
1: a b c d e
4 5 a y w NaN
2: a b c d e
1 2 b NaN NaN q
3 4 c NaN x NaN
5 6 c z NaN NaN
3: a b c d e
0 1 a NaN NaN NaN
The keys of the dictionary are the number of NaNs in the row and the values are the dataframes which contain only rows with that number of NaNs in them.
If I'm getting you right, what you want to do is to split an existing dataframe with n columns into ceil(n/5) dataframes, each with 5 columns, and the last one holding the remainder of n/5 columns.
If that's the case this will do the trick:
import pandas as pd
import math

max_cols = 5
dt = {"a": [1, 2, 3], "b": [6, 5, 3], "c": [8, 4, 2], "d": [8, 4, 0], "e": [1, 9, 5], "f": [9, 7, 9]}
df = pd.DataFrame(data=dt)
dfs = [df[df.columns[max_cols*i:max_cols*i + max_cols]] for i in range(math.ceil(len(df.columns)/max_cols))]
for el in dfs:
    print(el)
And output:
a b c d e
0 1 6 8 8 1
1 2 5 4 4 9
2 3 3 2 0 5
f
0 9
1 7
2 9
