How to keep columns based on a given row's values - python

Here is how the data looks in the dataframe df:
       A  B  C  D
0.js   2  1  1 -1
1.js   3 -5  1 -4
total  5 -4  2 -5
And I would like to get a new dataframe df1:
       A  C
0.js   2  1
1.js   3  1
total  5  2
So basically it should look like this:
df1 = df[df["total"] > 0]
but it should filter on the "total" row instead of a column, and I can't figure it out.

You want to use .loc[:, column_mask], i.e.
In [11]: df.loc[:, df.sum() > 0]
Out[11]:
       A  C
0.js   2  1
1.js   3  1
total  5  2
# or, building the mask from the "total" row explicitly
In [12]: df.loc[:, df.loc['total'] > 0]
Out[12]:
       A  C
0.js   2  1
1.js   3  1
total  5  2
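A self-contained sketch of the above, reconstructing the frame from the question (df.sum() happens to select the right columns here, but building the mask from the 'total' row is the direct translation of "filter on the total row"):
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 5], 'B': [1, -5, -4],
                   'C': [1, 1, 2], 'D': [-1, -4, -5]},
                  index=['0.js', '1.js', 'total'])

# keep the columns whose value in the 'total' row is positive
df1 = df.loc[:, df.loc['total'] > 0]
print(df1)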

Use .where to set non-positive values to NaN, then dropna with axis=1:
df.where(df.gt(0)).dropna(axis=1)
       A  C
0.js   2  1
1.js   3  1
total  5  2
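Note that this variant drops a column if any of its values is non-positive, not only the value in the total row; the two criteria happen to coincide for the example data. A sketch of the intermediate step:
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 5], 'B': [1, -5, -4],
                   'C': [1, 1, 2], 'D': [-1, -4, -5]},
                  index=['0.js', '1.js', 'total'])

print(df.where(df.gt(0)))                  # non-positive cells become NaN
print(df.where(df.gt(0)).dropna(axis=1))   # drop every column containing a NaN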

You can use loc with boolean indexing, or reindex:
df.loc[:, df.columns[(df.loc['total'] > 0)]]
OR
df.reindex(df.columns[(df.loc['total'] > 0)], axis=1)
Output:
       A  C
0.js   2  1
1.js   3  1
total  5  2
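One practical difference between the two: reindex selects by label and silently fills labels it cannot find with NaN, while .loc on a list containing a missing label raises a KeyError in modern pandas. A minimal sketch (the 'Z' label is hypothetical):
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 5], 'C': [1, 1, 2]},
                  index=['0.js', '1.js', 'total'])

# 'Z' is not a real column; reindex returns it as a NaN column
print(df.reindex(['A', 'Z'], axis=1))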

Related

Swapping values in columns depending on value type in one of the columns

Suppose I have the following pandas dataframe:
df = pd.DataFrame([['A','B'],[8,'s'],[5,'w'],['e',1],['n',3]])
print(df)
   0  1
0  A  B
1  8  s
2  5  w
3  e  1
4  n  3
If there is an integer in column 1, I want to swap it with the value from column 0; in other words, I want to produce this dataframe:
   0  1
0  A  B
1  8  s
2  5  w
3  1  e
4  3  n
Build a mask of the numeric values in the second column using to_numeric with errors='coerce' and Series.notna:
m = pd.to_numeric(df[1], errors='coerce').notna()
Another solution converts to strings with Series.astype and checks Series.str.isnumeric - but this works only for non-negative integers:
m = df[1].astype(str).str.isnumeric()
Then swap with DataFrame.loc, using DataFrame.values to get the underlying NumPy array and avoid column alignment:
df.loc[m, [0, 1]] = df.loc[m, [1, 0]].values
print(df)
   0  1
0  A  B
1  8  s
2  5  w
3  1  e
4  3  n
Last, if possible, it is better to promote the first row to the column names:
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
print(df)
   A  B
1  8  s
2  5  w
3  1  e
4  3  n
or, if the data came from read_csv, simply remove the header=None argument.
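Putting the pieces above together, a self-contained sketch of the to_numeric approach:
import pandas as pd

df = pd.DataFrame([['A', 'B'], [8, 's'], [5, 'w'], ['e', 1], ['n', 3]])

# True where the second column holds a number
m = pd.to_numeric(df[1], errors='coerce').notna()

# swap through the underlying numpy array to bypass column alignment
df.loc[m, [0, 1]] = df.loc[m, [1, 0]].values
print(df)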
Or use sorted with a key that tests for int:
df.loc[:] = [
    sorted(t, key=lambda x: not isinstance(x, int))
    for t in zip(*map(df.get, df))
]
df
   0  1
0  A  B
1  8  s
2  5  w
3  1  e
4  3  n
You can be explicit with the columns if you'd like:
df[[0, 1]] = [
    sorted(t, key=lambda x: not isinstance(x, int))
    for t in zip(df[0], df[1])
]

Efficiently scale df columns to zero

I am trying to develop a process that automatically scales each Series in a pandas df to zero. For instance, if we use the df below:
import pandas as pd

d = {'A': [0, 1, 2, 3],
     'B': [6, 7, 8, 9],
     'C': [10, 11, 12, 13],
     'D': [-4, -5, -4, -3]}
df = pd.DataFrame(data=d)
I'm manually adjusting each column so it begins at zero. You'll notice the increments are either +1 or -1, but the starting integers vary.
df['B'] = df['B'] - 6
df['C'] = df['C'] - 10
df['D'] = df['D'] + 4
Output:
   A  B  C  D
0  0  0  0  0
1  1  1  1 -1
2  2  2  2  0
3  3  3  3  1
This isn't very efficient as I have to go through each series to determine the scaling factor. Is there a more efficient way to determine this?
You can subtract the first row, selected with iloc, using sub:
df = df.sub(df.iloc[0])
# same as
# df = df - df.iloc[0]
print(df)
   A  B  C  D
0  0  0  0  0
1  1  1  1 -1
2  2  2  2  0
3  3  3  3  1
Detail:
print(df.iloc[0])
A     0
B     6
C    10
D    -4
Name: 0, dtype: int64
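Since sub aligns the first row against the column labels, the explicit .sub call and the plain subtraction are interchangeable; a quick sanity sketch:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3],
                   'B': [6, 7, 8, 9],
                   'C': [10, 11, 12, 13],
                   'D': [-4, -5, -4, -3]})

# both forms align df.iloc[0] against the columns, so the results match
assert df.sub(df.iloc[0]).equals(df - df.iloc[0])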

How to select cells greater than a value in a multi-index Pandas dataframe?

Try 1:
df[df > 1.0]: this returned all cells as NaN.
Try 2:
df.loc[df > 1.0]: this returned KeyError: 0
df[df['A'] > 1.0]: this works - but I want to apply the filter condition to all columns.
If what you are trying to do is select only the rows where any one column meets the condition, you can use DataFrame.any() with axis=1 (to reduce row-wise). Example -
In [3]: df
Out[3]:
   A  B  C
0  1  2  3
1  3  4  5
2  3  1  4
In [6]: df[(df <= 2).any(axis=1)]
Out[6]:
   A  B  C
0  1  2  3
2  3  1  4
Alternatively, if you want to filter for rows where all columns meet the condition, use .all() in place of .any(). Example of all -
In [8]: df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [3, 1, 4], [1, 2, 1]], columns=['A', 'B', 'C'])
In [9]: df
Out[9]:
   A  B  C
0  1  2  3
1  3  4  5
2  3  1  4
3  1  2  1
In [11]: df[(df <= 2).all(axis=1)]
Out[11]:
   A  B  C
3  1  2  1
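To see what the selector is doing, it helps to print the intermediate boolean frame's row-wise reductions; a small sketch reusing the frame above:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, 5], [3, 1, 4], [1, 2, 1]], columns=['A', 'B', 'C'])

cond = df <= 2           # elementwise boolean frame
print(cond.any(axis=1))  # True where at least one cell in the row is <= 2
print(cond.all(axis=1))  # True only where every cell in the row is <= 2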

How to apply a condition to a large number of columns in a pandas dataframe

I want to eliminate all rows that contain a certain value (or values in a certain range) from a dataframe with a large number of columns. For example, if I had the following dataframe:
   a  b
0  1  0
1  2  1
2  3  2
3  0  3
and wanted to remove all rows containing 0, I could use:
a_df[(a_df['a'] != 0) & (a_df['b'] != 0)]
but this becomes a pain when you're dealing with a large number of columns. It could be done as:
for i in a_df.columns.values:
    a_df = a_df[a_df[i] != 0]
but this seems inefficient. Is there a better way to do this?
Just do it for the whole df and call dropna:
In [45]:
df[df != 0].dropna()
Out[45]:
     a    b
1  2.0  1.0
2  3.0  2.0
The condition df != 0 produces a boolean mask:
In [47]:
df != 0
Out[47]:
       a      b
0   True  False
1   True   True
2   True   True
3  False   True
When this is combined with the df it produces NaN values where the condition is not met:
In [48]:
df[df != 0]
Out[48]:
     a    b
0  1.0  NaN
1  2.0  1.0
2  3.0  2.0
3  NaN  3.0
Calling dropna then drops any row with a NaN value. (Note the integer columns come back as float here, because introducing NaN forces the upcast.)
Here's a variant of EdChum's approach. You could do df != 0 and then use all to make your selector:
>>> (df != 0).all(axis=1)
0    False
1     True
2     True
3    False
dtype: bool
and then use this to select:
>>> df.loc[(df != 0).all(axis=1)]
   a  b
1  2  1
2  3  2
The advantage of this is that it keeps NaNs if you want (and, since nothing is masked, the original dtypes), e.g.
>>> df
   a    b
0  1  0.0
1  2  NaN
2  3  2.0
3  0  3.0
>>> df.loc[(df != 0).all(axis=1)]
   a    b
1  2  NaN
2  3  2.0
>>> df[(df != 0)].dropna()
     a    b
2  3.0  2.0
As mentioned in the question, you may need to drop rows that have values in a certain range. You can do that as follows; suppose the disallowed values are 0, 10 and 20:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
mask = frame.applymap(lambda x: x not in [0, 10, 20])
frame[mask.all(axis=1)]
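An equivalent mask can be built with DataFrame.isin, which avoids the per-cell Python lambda (applymap is also deprecated in recent pandas in favour of DataFrame.map):
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# True where the cell is not one of the disallowed values
mask = ~frame.isin([0, 10, 20])
print(frame[mask.all(axis=1)])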

Pandas indexing by both boolean `loc` and subsequent `iloc`

I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
                   'b': [5, 5, 2, 2, 5, 5, 2, 2],
                   'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2    0
3    0
6    0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2    0
3    0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
   a  b  c
0  0  5  0
1  1  5  0
2  2  2  0
3  3  2  0
4  4  5  0
5  5  5  0
6  6  2  0
7  7  2  0
Making the assigned value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work, but it is a little ugly; basically we take the index generated from the mask and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
   a  b  c
0  0  5  0
1  1  5  0
2  2  2  1
3  3  2  1
4  4  5  0
5  5  5  0
6  6  2  0
7  7  2  0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2    0
3    0
Name: c, dtype: int64
In [61]:
# call loc using this index, we can now use this to select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
   a  b  c
2  2  2  0
3  3  2  0
How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum but more elegant as suggested in the comment.
EDIT: You have to be a little careful with this one, as it may give unwanted results with a non-unique index: there could be multiple rows indexed by either of the labels in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it is safer to use .iloc with integer indexing, something like
ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1  # iloc needs an integer column position
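For completeness, a runnable version of the positional variant (get_loc translates the column label into the integer position that iloc requires):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
                   'b': [5, 5, 2, 2, 5, 5, 2, 2],
                   'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)

ix = np.where(mask)[0][:2]                 # positions of the first two matches
df.iloc[ix, df.columns.get_loc('c')] = 1   # iloc takes integer positions only
print(df)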
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
   a  b  c
0  0  5  0
1  1  5  0
2  2  2  1
3  3  2  1
4  4  5  0
5  5  5  0
6  6  2  0
7  7  2  0
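The same trick generalizes to the first n matching rows, since mask.cumsum() counts the Trues seen so far; a sketch (n is a hypothetical parameter):
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
                   'b': [5, 5, 2, 2, 5, 5, 2, 2],
                   'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)

n = 2  # hypothetical: how many matching rows to update
df.loc[mask & (mask.cumsum() <= n), 'c'] = 1
print(df)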
