Find integer row-index from pandas index - python

The following code finds the index where df['A'] == 1:
import pandas as pd
import numpy as np
import random

index = list(range(10))  # range() must be materialized before shuffling
random.shuffle(index)
df = pd.DataFrame(np.zeros((10, 1)).astype(int), columns=['A'], index=index)
df.iloc[3:6, 0] = 1  # column 'A'; avoids chained assignment
df.iloc[6:, 0] = 2
print(df)
print(df.loc[df['A'] == 1].index.tolist())
It returns the pandas index labels correctly. How do I get the integer positions ([3, 4, 5]) instead, using the pandas API?
   A
8  0
4  0
6  0
3  1
7  1
1  1
5  2
0  2
2  2
9  2
[3, 7, 1]

What about:
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
or (depending on your goals):
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Demo:
In [11]: df
Out[11]:
   A
8  0
4  0
6  0
3  1
7  1
1  1
5  2
0  2
2  2
9  2
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')

Here is one way:
df.reset_index().index[df.A == 1].tolist()
This re-indexes the data frame with [0, 1, 2, ...], then extracts the integer index values based on the boolean mask df.A == 1.
Edit: credits to @Max for the index[df.A == 1] idea.

No need for numpy, you're right. Just pure Python with a list comprehension that finds the positions where the value is 1:
print([i for i,x in enumerate(df['A'].values) if x == 1])
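If you'd rather stay within the pandas API, Index.get_indexer can map the matching labels back to their positions; a minimal sketch using the question's df:
# labels where the condition holds, e.g. Int64Index([3, 7, 1])
labels = df.index[df.A == 1]
# positions of those labels within the index -> [3, 4, 5]
print(df.index.get_indexer(labels).tolist())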

Related

How to iterate over previous rows in a dataframe

I have three columns: id (a non-unique id), X (categories) and Y (categories). (I don't have a dataset to share yet; I'll try to replicate what I have using a smaller dataset and edit as soon as possible.)
I ran a for loop on a very small subset, and based on those results it might take over 4 hours to run this code. I'm looking for a faster way to do this task using pandas (maybe using iterrows, or something like iterating over previous rows within apply).
For each row I check (a direct translation is sketched after this list):
whether the current X matches any of the previous Xs (check_X = X[:row] == X[row])
whether the current Y matches any of the previous Ys (check_Y = Y[:row] == Y[row])
whether the current id does not match any of the previous ids (check_id = id[:row] != id[row])
if sum(check_X & check_Y & check_id) > 0: then append 1 to the array
else: append 0
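For concreteness, a direct (and deliberately slow, quadratic) translation of that per-row check might look like the sketch below, assuming the columns are named id, X and Y:
import pandas as pd

def flag_rows(df):
    # Compare each row against all previous rows, exactly as described above.
    out = []
    for row in range(len(df)):
        check_X = df['X'].iloc[:row] == df['X'].iloc[row]
        check_Y = df['Y'].iloc[:row] == df['Y'].iloc[row]
        check_id = df['id'].iloc[:row] != df['id'].iloc[row]
        out.append(1 if (check_X & check_Y & check_id).sum() > 0 else 0)
    return out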
You are probably looking for duplicated:
df = pd.DataFrame({'id': [0, 0, 0, 1, 0],
                   'X': [1, 1, 2, 1, 1],
                   'Y': [2, 2, 2, 2, 2]})
df['dup'] = ~df[df.duplicated(['X', 'Y'])].duplicated('id', keep=False).loc[lambda x: ~x]
df['dup'] = df['dup'].fillna(False).astype(int)
print(df)
# Output
   id  X  Y  dup
0   0  1  2    0
1   0  1  2    0
2   0  2  2    0
3   1  1  2    1
4   0  1  2    0
Update
X and Y should be checked separately:
import numpy as np

df = pd.DataFrame({'id': [0, 1, 1, 2, 2, 3, 4],
                   'X': [0, 1, 1, 1, 1, 1, 1],
                   'Y': [0, 2, 2, 2, 2, 2, 2]})
df['dup'] = np.where(df['X'].duplicated() & df['Y'].duplicated() & ~df['id'].duplicated(), 1, 0)
print(df)
# Output
   id  X  Y  dup
0   0  0  0    0
1   1  1  2    0
2   1  1  2    0
3   2  1  2    1
4   2  1  2    0
5   3  1  2    1
6   4  1  2    1
EDIT: the answer from @Corralien using duplicated() will likely be much faster and is the best answer for this specific problem. However, apply is more flexible if you have different things to check.
You could do it with iterrows() or apply(). As far as I know, apply() is faster:
check_id, check_x, check_y = set(), set(), set()

def apply_func(row):
    # the sets persist across calls, accumulating values from previous rows
    global check_id, check_x, check_y
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        row['duplicate'] = 1
    else:
        row['duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
    return row

df = df.apply(apply_func, axis=1)
With iterrows():
check_id, check_x, check_y = set(), set(), set()

for i, row in df.iterrows():
    if row['id'] not in check_id and row['X'] in check_x and row['Y'] in check_y:
        df.loc[i, 'duplicate'] = 1
    else:
        df.loc[i, 'duplicate'] = 0
    check_id.add(row['id'])
    check_x.add(row['X'])
    check_y.add(row['Y'])
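Both versions keep the previously seen values in sets, so each per-row membership test is O(1); the row-slicing comparisons described in the question re-scan all earlier rows and are quadratic overall.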

Changing the values in a column using python

import pandas as pd
import numpy as np
data_A = pd.read_csv('D:/data_A.csv')
data_A has a column named power.
The power column should contain only 0 and 1, and its dtype is int64.
I want to make sure that there are only 0 and 1 in the power column.
So, if there are numbers other than 0 and 1 in the power column, I want to set those values to 0. How can I do this?
You can use DataFrame.loc to conditionally access a group of rows and columns.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"power": [1, 0, 1, 2, 5, 6, 0, 1]})
>>> df
   power
0      1
1      0
2      1
3      2
4      5
5      6
6      0
7      1
>>> df.loc[~(df["power"].isin([1, 0])), "power"] = 0
>>> df
   power
0      1
1      0
2      1
3      0
4      0
5      0
6      0
7      1
The condition ~(df["power"].isin([1, 0])) returns a Boolean Series which can be used to select the rows where 'power' is neither 1 nor 0.
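For the demo frame above, that mask looks like this:
>>> ~(df["power"].isin([1, 0]))
0    False
1    False
2    False
3     True
4     True
5     True
6    False
7    False
Name: power, dtype: bool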
You could also use a list comprehension if your dataframe is small.
data_A.power = [x if x == 1 else 0 for x in data_A.power]
Or numpy for a longer column (this solution assumes you don't have negative values)
import numpy as np
power_np = np.array(data_A.power)
power_np[power_np > 1] = 0
data_A.power = power_np
Try this:
import pandas as pd

# example df with mixed values, including a string
p = [1, 0, 3, 4, 's']
data_A = pd.DataFrame(p, columns=['power'])

def convert_value(value):
    # keep 0 and 1, replace everything else with 0
    if value == 1 or value == 0:
        return value
    else:
        return 0

data_A['power'] = data_A['power'].apply(convert_value)
print(data_A)

Multiply all elements of a column in a pandas dataframe

I have a pandas dataframe:
Idx  A  B  C
  1  2  5  1
  2  1  2  2
  3  3  1  1
  4  2  3  0
I want to calculate the product of all elements in each column, e.g.
- P_A = 2*1*3*2 = 12
- P_B = 5*2*1*3 = 30
- P_C = 1*2*1*0 = 0
Ideally, the result would be in a list format [P_A, P_B, P_C].
What is the most efficient way to compute this?
Try:
>>> df[['A', 'B', 'C']].prod().tolist()
[12, 30, 0]
Or:
>>> df.set_index('Idx').prod().tolist()
[12, 30, 0]
Or also:
>>> df.filter(regex='[^Idx]').prod().tolist()
[12, 30, 0]
Or with iloc:
>>> df.iloc[:, 1:].prod().tolist()
[12, 30, 0]
Or with drop:
>>> df[df.columns.drop('Idx')].prod().tolist()
[12, 30, 0]
You can apply numpy.prod (np.product is a deprecated alias that was removed in NumPy 2.0):
import numpy as np
np.prod(df.set_index('Idx'))
output:
A    12
B    30
C     0
as list:
products = np.prod(df.set_index('Idx')).to_list()

Numpy / Pandas slicing based on intervals

I'm trying to figure out a way to slice non-contiguous and non-equal-length ranges of rows of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(12).reshape(3, 4))
# x is the matrix we want to index into
"""
x before:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
"""
y = pd.DataFrame([[0, 3], [2, 2], [1, 2], [0, 0]])
# y is a matrix where each row contains a start idx and an (inclusive) end idx
# for one column of x
"""
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different-length slices of x based on the rows of y, conceptually:
x[y] = 0  # pseudocode
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    # y's end indices are inclusive, hence the +1
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
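For instance, a minimal sketch of the same loop over a plain NumPy array (same inclusive-end convention as above):
y_arr = y.to_numpy()  # each row: (start, inclusive end) for one column of x
mask = np.zeros_like(x, dtype=bool)
for i, (start, end) in enumerate(y_arr):
    mask[start:end + 1, i] = True
x[mask] = 0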
I customized this answer to your problem:
y_t = y.values.transpose()
r = np.arange(x.shape[0])
# y's end indices are inclusive, so '>= r' is correct as-is; for exclusive
# end indices you would subtract 1 from y_t[1, :] (or use '> r' instead)
mask = ((y_t[0, :, None] <= r) & (y_t[1, :, None] >= r)).transpose()
res = x.where(~mask, 0)
res
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11
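The trick here is broadcasting: y_t[0, :, None] has shape (4, 1) and r has shape (3,), so each comparison yields a (4, 3) Boolean array (one row of the mask per column of x), and the whole mask is built without a Python-level loop.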

How to remove rows from DataFrame

I have a DataFrame with n rows and an ndarray with n values (-1 for outliers and 1 for inliers). Is there a pythonic way to remove the DataFrame rows that match the positions of the ndarray elements marked as -1?
You can just do: new_df = old_df[arr == 1].
Example:
df = pd.DataFrame(np.random.randn(5,5))
arr = np.random.choice([1,-1], 5)
>>> df
          0         1         2         3         4
0 -0.238418  0.291475  0.139162 -0.030003 -0.515817
1 -0.162404 -1.272317  0.342051 -0.787938  0.464699
2 -0.965481  0.727143 -0.887149 -0.430592 -2.074865
3  0.699129 -0.242738  1.754805 -0.120637 -1.536973
4  0.228538  0.799445 -0.217787  0.398572 -1.255639
>>> arr
array([ 1, -1, -1,  1, -1])
>>> df[arr == 1]
          0         1         2         3         4
0 -0.238418  0.291475  0.139162 -0.030003 -0.515817
3  0.699129 -0.242738  1.754805 -0.120637 -1.536973
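If you prefer to drop the outliers explicitly instead of selecting the inliers, an equivalent sketch (assuming unique index labels):
>>> df.drop(df.index[arr == -1])
          0         1         2         3         4
0 -0.238418  0.291475  0.139162 -0.030003 -0.515817
3  0.699129 -0.242738  1.754805 -0.120637 -1.536973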
