How to remove rows from DataFrame - python

I have a DataFrame with n rows and an ndarray with n values (-1 for outliers and 1 for inliers). Is there a pythonic way to remove the DataFrame rows whose positions match the elements of the ndarray marked as -1?

You can just do: new_df = old_df[arr == 1].
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,5))
arr = np.random.choice([1,-1], 5)
>>> df
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
1 -0.162404 -1.272317 0.342051 -0.787938 0.464699
2 -0.965481 0.727143 -0.887149 -0.430592 -2.074865
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
4 0.228538 0.799445 -0.217787 0.398572 -1.255639
>>> arr
array([ 1, -1, -1, 1, -1])
>>> df[arr == 1]
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
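The mask is applied by position, so it also works when the DataFrame has a non-default index. A minimal sketch with made-up data (the column name value and the letter index are only for illustration); the drop variant is an equivalent alternative if you prefer to name the rows being removed:

```python
import numpy as np
import pandas as pd

# Made-up example data; 'value' and the letter index are illustrative only
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=list("abcde"))
arr = np.array([1, -1, -1, 1, -1])

# Boolean mask: keep the rows where arr is 1 (rows 'a' and 'd')
kept = df[arr == 1]

# Equivalent: look up the labels of the outlier rows and drop them
dropped = df.drop(df.index[arr == -1])

print(kept)
print(kept.equals(dropped))
```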

Related

Looking for an iterative loop in python which can add up all column values if certain condition meets

I have a dataframe with columns m, n:
m=[0, 0, 1, 0, 0, 0, 4, 0, 0]
n=[6, 1, 2, 1, 4, 3, 1, 3, 5, 1]
I am looking for an iterative loop that adds up values of column n if the value in column m is non-zero. For example at 3rd place of column m the value is 1 (non-zero) so it should add in the column n from index 0 to 2 i.e. 6+1+2=9. Similarly, at m[6]=4 (non-zero) this implies 1+4+3+1=9 and so on.
Let's say you have a dataframe and you want to sum the elements in each column based on the position of non-zero values in column "m". The following code gives you the output as a dataframe. See the comment in the code if you are just looking for summing the values in column "n":
import pandas as pd
from random import randint
m = [0, 1, 0, 0, 1, 0, 0, 0, 2]
n = [1, 1, 3, 4, 1, 1, 2, 1, 3]
r = [randint(1, 3) for _ in m]
names = ['lev', 'yan', 'coke' , 'coke', 'yan', 'lev', 'lev', 'yan', 'lev']
df = pd.DataFrame({'m': m, 'n': n, 'r': r, 'names': names})
print(f"Input dataframe:\n{df}")
# if you want to iterate over all columns
iter_cols = df.columns.tolist()
iter_cols.remove('m')
# To iterate over a specific column (e.g. 'n'), use iter_cols = ['n']
starting_idx = 0
sum_df = pd.DataFrame()
for idx, val in enumerate(df.m):
    if val != 0:
        # DataFrame.append was removed in pandas 2.0; this line needs pandas < 2.0
        sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
        starting_idx = idx+1
print(f"Output dataframe:\n{sum_df}")
Output:
Input dataframe:
m n r names
0 0 1 2 lev
1 1 1 3 yan
2 0 3 1 coke
3 0 4 2 coke
4 1 1 2 yan
5 0 1 3 lev
6 0 2 3 lev
7 0 1 3 yan
8 2 3 2 lev
Output dataframe:
n names r
0 2.0 levyan 5.0
1 8.0 cokecokeyan 5.0
2 7.0 levlevyanlev 11.0
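For the case where only column 'n' needs summing, the loop can be replaced by a groupby. This is a sketch, not the answer's original method: it labels each row with the number of non-zero m values seen before it, so every block ends at a non-zero m. The names block_id, closed and sums are my own. Like the loop, it discards trailing rows that come after the last non-zero m.

```python
import pandas as pd

m = [0, 1, 0, 0, 1, 0, 0, 0, 2]
n = [1, 1, 3, 4, 1, 1, 2, 1, 3]
df = pd.DataFrame({"m": m, "n": n})

# Count the non-zero m values seen so far, then shift by one row so that
# the row holding the non-zero m stays in the block it closes
block_id = (df["m"] != 0).cumsum().shift(fill_value=0).rename("block")

sums = df.groupby(block_id)["n"].sum()

# Keep only blocks that end with a non-zero m (drops a trailing open block)
closed = df.groupby(block_id)["m"].last() != 0
sums = sums[closed]

print(sums.tolist())  # [2, 8, 7]
```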
And if you want to iterate over distinct values in names column and sum the values in 'n' column accordingly:
iter_cols = ['n']
distinct_names = set(df.names)
print(distinct_names)
out_dct = {}
for name in distinct_names:
    starting_idx = 0
    sum_df = pd.DataFrame()
    for idx, val in enumerate(df.names):
        if val == name:
            # DataFrame.append was removed in pandas 2.0; this line needs pandas < 2.0
            sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
            starting_idx = idx+1
    out_dct[name] = sum_df

Changing the values in a column using python

import pandas as pd
import numpy as np
data_A=pd.read_csv('D:/data_A.csv')
data_A has a column named power.
The power column only has 0 and 1, and its dtype is int64.
I want to make sure that there are only 0s and 1s in the power column.
So, if there are any numbers other than 0 and 1 in the power column, I want to set those values to 0. How can I do this?
You can use DataFrame.loc to conditionally access a group of rows and columns.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"power": [1, 0, 1, 2, 5, 6, 0, 1]})
>>> df
power
0 1
1 0
2 1
3 2
4 5
5 6
6 0
7 1
>>> df.loc[~(df["power"].isin([1, 0])), "power"] = 0
>>> df
power
0 1
1 0
2 1
3 0
4 0
5 0
6 0
7 1
The condition ~(df["power"].isin([1, 0])) returns a Boolean Series which can be used to select the rows where 'power' is neither 1 nor 0.
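The same replacement can also be written without .loc. A minimal sketch using Series.where, which keeps the values where the condition is True and substitutes the second argument everywhere else (same example data as above):

```python
import pandas as pd

df = pd.DataFrame({"power": [1, 0, 1, 2, 5, 6, 0, 1]})

# Keep values that are 0 or 1; replace everything else with 0
df["power"] = df["power"].where(df["power"].isin([0, 1]), 0)

print(df["power"].tolist())  # [1, 0, 1, 0, 0, 0, 0, 1]
```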
You could also use list comprehension if your dataframe is small.
data_A.power = [x if x == 1 else 0 for x in data_A.power]
Or numpy for a longer column (this solution assumes you don't have negative values)
import numpy as np
power_np = np.array(data_A.power)
power_np[power_np > 1] = 0
data_A.power = power_np
Try this:
import pandas as pd
# example df
p = [1, 0, 3, 4, 's']
data_A = pd.DataFrame(p, columns=['power'])
def convert_row(row):
    if row == 1 or row == 0:
        return row
    else:
        return 0
data_A['power'] = data_A['power'].apply(convert_row)
print(data_A)

Numpy / Pandas slicing based on intervals

Trying to figure out a way to slice non-contiguous and non-equal length rows of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y is a matrix where each row contains a start idx and end idx per column of x
"""
0 1
0 0 3
1 2 3
2 1 3
3 0 1
"""
What I'm looking for is a way to effectively select different length slices of x based on the rows of y
x[y] = 0  # pseudocode for the desired behaviour
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    # the end index is treated as inclusive here, matching y as constructed in the question's code
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
0 1 2 3
0 0 1 2 0
1 0 5 0 7
2 0 0 0 11
As a further improvement, consider defining y as a NumPy array if possible.
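I customized this answer to your problem. Note that it treats the end index in each row of y as exclusive, as in the printed y (rows [0, 3], [2, 3], [1, 3], [0, 1]):
y_t = y.values.transpose()
y_t[1,:] = y_t[1,:] - 1 # or remove this line and change '>= r' below to '> r'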
r = np.arange(x.shape[0])
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
# 0 1 2 3
# 0 0 1 2 0
# 1 0 5 0 7
# 2 0 0 0 11
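The same broadcasting idea can be sketched with y held as a NumPy array and without modifying y in place. This assumes one (start, end) pair per column of x with an inclusive end, as in the y constructed in the question's code. The names rows, mask and result are my own.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(12).reshape(3, 4))
# One (start, end) pair per column of x; end is inclusive in this sketch
y = np.array([[0, 3], [2, 2], [1, 2], [0, 0]])

# Row numbers as a column vector, shape (3, 1), so they broadcast
# against the per-column starts and ends, shape (4,)
rows = np.arange(x.shape[0])[:, None]
mask = (rows >= y[:, 0]) & (rows <= y[:, 1])  # shape (3, 4)

# DataFrame.mask replaces the entries where mask is True
result = x.mask(mask, 0)
print(result)
```

An end index past the last row (3 in the first pair) needs no special handling, because the comparison simply stays True for every existing row.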

Find integer row-index from pandas index

The following code finds the index labels where df['A'] == 1:
import pandas as pd
import numpy as np
import random
index = list(range(10))  # a list, so that random.shuffle can shuffle it in place
random.shuffle(index)
df = pd.DataFrame(np.zeros((10,1)).astype(int), columns = ['A'], index = index)
df.iloc[3:6, 0] = 1  # positional assignment on the frame avoids chained indexing
df.iloc[6:, 0] = 2
print(df)
print(df.loc[df['A'] == 1].index.tolist())
It returns the pandas index labels correctly. How do I get the integer positions ([3, 4, 5]) instead, using the pandas API?
A
8 0
4 0
6 0
3 1
7 1
1 1
5 2
0 2
2 2
9 2
[3, 7, 1]
What about:
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
or (depending on your goals):
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Demo:
In [11]: df
Out[11]:
A
8 0
4 0
6 0
3 1
7 1
1 1
5 2
0 2
2 2
9 2
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Here is one way:
df.reset_index().index[df.A == 1].tolist()
This re-indexes the data frame with [0, 1, 2, ...], then extracts the integer index values based on the boolean mask df.A == 1.
Edit: credits to @Max for the index[df.A == 1] idea.
No need for numpy, you're right. Just pure python with a listcomp:
Just find the indexes where the values are 1
print([i for i,x in enumerate(df['A'].values) if x == 1])
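If NumPy is acceptable after all, np.flatnonzero gives the same integer positions directly from the boolean mask. A small sketch, using the shuffled index from the output shown in the question so that the result is reproducible:

```python
import numpy as np
import pandas as pd

# Index labels copied from the example output in the question
labels = [8, 4, 6, 3, 7, 1, 5, 0, 2, 9]
df = pd.DataFrame({"A": [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]}, index=labels)

# Positions (not labels) of the rows where A == 1
positions = np.flatnonzero(df["A"].to_numpy() == 1).tolist()
print(positions)  # [3, 4, 5]

# The labels at those positions, for comparison
print(df.index[positions].tolist())  # [3, 7, 1]
```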

How to return all opposite pairs in a Pandas DataFrame?

For the dataframe below, how to return all opposite pairs?
import pandas as pd
df1 = pd.DataFrame([1,2,-2,2,-1,-1,1,1], columns=['a'])
a
0 1
1 2
2 -2
3 2
4 -1
5 -1
6 1
7 1
The output should be as below:
(1) The sum of all rows is 0.
(2) As there are 3 "1" and 2 "-1" in the original data, the output includes 2 "1" and 2 "-1".
a
0 1
1 2
2 -2
4 -1
5 -1
6 1
Thank you very much.
Well, I thought this would take fewer lines (and probably can) but this does work. First just create a couple of new columns to simplify the later syntax:
>>> import numpy as np
>>> df1['abs_a'] = np.abs( df1['a'] )
>>> df1['ones'] = 1
Then the main thing you need is to do some counting. For example, are there fewer 1s or fewer -1s?
>>> df2 = df1.groupby(['abs_a','a']).count()
         ones
abs_a a
1     -1    2
       1    3
2     -2    1
       2    2
>>> df3 = df2.groupby(level=0).min()
ones
abs_a
1 2
2 1
That's basically the answer right there, but I'll put it closer to the form you asked for:
>>> lst = [ [i]*j for i, j in zip( df3.index.tolist(), df3['ones'].tolist() ) ]
>>> arr = np.array( [item for sublist in lst for item in sublist] )
>>> np.hstack( [arr,-1*arr] )
array([ 1, 1, 2, -1, -1, -2], dtype=int64)
Or if you want to put it back into a dataframe:
>>> pd.DataFrame( np.hstack( [arr,-1*arr] ) )
0
0 1
1 1
2 2
3 -1
4 -1
5 -2
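If you want the matching rows of the original DataFrame with their original index, as in the desired output in the question, here is one possible sketch. It is a different technique from the counting above: it numbers each occurrence of a value and keeps the k-th occurrence only if the opposite value occurs more than k times. The names occurrence, counts, opposite_count and paired are my own.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, -2, 2, -1, -1, 1, 1], columns=["a"])

# Number each occurrence within its value: first 1 -> 0, second 1 -> 1, ...
occurrence = df1.groupby("a").cumcount()

# For each row, how many times the opposite value appears in the column
counts = df1["a"].value_counts()
opposite_count = (-df1["a"]).map(counts).fillna(0)

# Keep the k-th occurrence of v only if -v appears more than k times
paired = df1[occurrence < opposite_count]

print(paired.index.tolist())  # [0, 1, 2, 4, 5, 6]
print(paired["a"].sum())      # 0
```

Each value v ends up kept min(count(v), count(-v)) times, so the kept rows always sum to zero.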
