import pandas as pd
import numpy as np
data_A = pd.read_csv('D:/data_A.csv')
data_A has a column named power.
The power column should only contain 0 and 1, and its dtype is int64.
I want to make sure that there are only 0s and 1s in the power column.
So, if there are any values other than 0 and 1 in the power column, I want to set them to 0. How can I do this?
You can use DataFrame.loc to conditionally access a group of rows and columns.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"power": [1, 0, 1, 2, 5, 6, 0, 1]})
>>> df
power
0 1
1 0
2 1
3 2
4 5
5 6
6 0
7 1
>>> df.loc[~(df["power"].isin([1, 0])), "power"] = 0
>>> df
power
0 1
1 0
2 1
3 0
4 0
5 0
6 0
7 1
The condition ~(df["power"].isin([1, 0])) returns a Boolean Series that can be used to select the rows whose 'power' value is neither 1 nor 0.
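If you prefer a single expression without .loc, Series.where should give the same result; a minimal sketch with the same df (keep values that are 0 or 1, replace everything else with 0):
>>> df["power"] = df["power"].where(df["power"].isin([0, 1]), 0)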
You could also use a list comprehension if your dataframe is small.
data_A.power = [x if x == 1 else 0 for x in data_A.power]
Or use NumPy for a longer column (this solution assumes you don't have negative values):
import numpy as np
power_np = np.array(data_A.power)
power_np[power_np > 1] = 0
data_A.power = power_np
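If negative values could appear as well, a small tweak keeps only exact 0s and 1s; a sketch of the idea (same power_np as above):
power_np = np.array(data_A.power)
# keep 0 and 1, zero out everything else (including negatives)
data_A.power = np.where(np.isin(power_np, [0, 1]), power_np, 0)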
Try this:
import pandas as pd
# example df
p = [1, 0, 3, 4, 's']
data_A = pd.DataFrame(p, columns=['power'])
def convert_row(row):
    if row == 1 or row == 0:
        return row
    else:
        return 0
data_A['power'] = data_A['power'].apply(convert_row)
print(data_A)
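Whichever approach you use, a quick check afterwards confirms that only 0 and 1 remain:
# should print True once the cleanup has run
print(data_A['power'].isin([0, 1]).all())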
Related
I'm trying to figure out a way to slice non-contiguous and non-equal-length ranges of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,3],[1,3],[0,1]])
#y is a matrix where each row contains a start idx and an (exclusive) end idx for the corresponding column of x
"""
0 1
0 0 3
1 2 3
2 1 3
3 0 1
"""
What I'm looking for is a way to effectively select these different-length slices of x based on the rows of y and set them to a common value, something like:
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    # the end indices in y are exclusive, so they can be used directly as slice stops
    mask[y.iloc[i, 0]:y.iloc[i, 1], i] = True
x[mask] = 0
x
0 1 2 3
0 0 1 2 0
1 0 5 0 7
2 0 0 0 11
As a further improvement, consider defining y as a NumPy array if possible.
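For example, a minimal sketch of the same masking loop with y held as a plain NumPy array (assuming, as above, that the end indices are exclusive):
y_arr = y.values  # or build it directly as np.array([[0, 3], [2, 3], [1, 3], [0, 1]])
mask = np.zeros(x.shape, dtype=bool)
for col, (start, stop) in enumerate(y_arr):
    mask[start:stop, col] = True  # stop is exclusive, matching Python slicing
x[mask] = 0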
I customized this answer to your problem:
y_t = y.values.transpose()
y_t[1,:] = y_t[1,:] - 1 # or remove this line and change '>= r' below to '> r'
r = np.arange(x.shape[0])
mask = ((y_t[0,:,None] <= r) & (y_t[1,:,None] >= r)).transpose()
res = x.where(~mask, 0)
res
# 0 1 2 3
# 0 0 1 2 0
# 1 0 5 0 7
# 2 0 0 0 11
I already asked a variation of this question, but I still have a problem regarding the runtime of my code.
Given a NumPy array of 15,000 rows and 44 columns, my goal is to find out which rows are equal and collect their indices into lists, like this:
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 2 3 4 5
Result:
equal_rows1 = [1,2,3]
equal_rows2 = [0,4]
What I did up till now is using the following code:
import numpy as np
input_data = np.load('IN.npy')
equal_inputs1 = []
equal_inputs2 = []
for i in range(len(input_data)):
    for j in range(i+1, len(input_data)):
        if np.array_equal(input_data[i], input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)
The problem is that this takes a lot of time to return the desired lists, and it only allows for two different "similar row" lists even though there can be more. Is there a better solution, especially regarding runtime?
This is pretty simple with pandas groupby:
df
A B C D E
0 1 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 1 2 3 4 5
[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
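If you do want a NumPy route, np.unique with axis=0 can label identical rows in one pass; a rough sketch (arr stands in for your loaded input_data, here the values of the df above):
import numpy as np
arr = df.values
# label each row by which unique row it matches
_, inverse = np.unique(arr, axis=0, return_inverse=True)
# gather row positions per label and keep only the groups with more than one member
groups = [np.flatnonzero(inverse == g).tolist() for g in range(int(inverse.max()) + 1)]
print([g for g in groups if len(g) > 1])
# [[1, 2, 3], [0, 4]] for the sample data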
You can use collections.defaultdict, which retains the row values as keys:
from collections import defaultdict
dd = defaultdict(list)
for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)
print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]
print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
# (0, 0, 0, 0, 0): [1, 2, 3],
# (1, 2, 3, 4, 5): [5]})
You can, if you wish, filter out unique rows via a dictionary comprehension.
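For example, something along these lines (dd as built above):
# keep only the groups of duplicated rows
dupes = {row: idxs for row, idxs in dd.items() if len(idxs) > 1}
print(list(dupes.values()))
# [[0, 4], [1, 2, 3]]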
I have a DataFrame with n rows and an ndarray with n values (-1 for outliers and 1 for inliers). Is there a Pythonic way to remove the DataFrame rows that match the indices of the elements of the ndarray marked as -1?
You can just do: new_df = old_df[arr == 1].
Example:
df = pd.DataFrame(np.random.randn(5,5))
arr = np.random.choice([1,-1], 5)
>>> df
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
1 -0.162404 -1.272317 0.342051 -0.787938 0.464699
2 -0.965481 0.727143 -0.887149 -0.430592 -2.074865
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
4 0.228538 0.799445 -0.217787 0.398572 -1.255639
>>> arr
array([ 1, -1, -1, 1, -1])
>>> df[arr == 1]
0 1 2 3 4
0 -0.238418 0.291475 0.139162 -0.030003 -0.515817
3 0.699129 -0.242738 1.754805 -0.120637 -1.536973
I am iterating over a pandas DataFrame and finding it to be extremely slow. I understand that in pandas you try to vectorize everything, but in this case I specifically need to iterate (or, if it is possible to vectorize this, I'm unclear how to do it).
The logic is simple: you have two columns "A" and "B" and a result column "signal". If A equals 1, then signal is set to 1. If B equals 1, then signal is set to 0. Otherwise, signal keeps whatever value it had previously. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    data['signal'] = 0
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            data['signal'].iloc[i] = 1
        elif data['B'].iloc[i] == 1:
            data['signal'].iloc[i] = 0
        else:
            data['signal'].iloc[i] = data['signal'].iloc[i-1]
    return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('first')
df
A B signal
0 0 1 0
1 1 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 0 0
6 1 0 1
7 0 0 1
8 0 1 0
9 0 1 0
The logic here divides the series into groups based on where A and B differ (via (df.A != df.B).cumsum()), and every group is then filled with the first value of A in that group.
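To see what the grouping key looks like on the example data (a quick illustrative check):
key = (df.A != df.B).cumsum()
print(key.tolist())
# [1, 2, 2, 2, 3, 3, 4, 4, 5, 6]
# rows 1-3 form one group whose first A value is 1, rows 4-5 a group whose first A value is 0, and so on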
You don't need to iterate at all; you can do this with some Boolean indexing:
#set condition for A
indata.loc[indata.A == 1,'signal'] = 1
#set condition for B
indata.loc[indata.B == 1,'signal'] = 0
#forward fill NaN values
indata['signal'] = indata['signal'].ffill()
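One edge case to be aware of: if neither A nor B equals 1 in the very first rows, the forward fill leaves leading NaNs. Assuming the state should start as 0 ("off") in that case, one more fill takes care of it:
# assumption: the state defaults to 0 before the first on/off event
indata['signal'] = indata['signal'].fillna(0).astype(int)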
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created an array of zeros in numpy, then did my iterative logic in the array. Then I wrote the array to the column in my dataframe.
def signals3(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            out_signal[i] = 1
        elif data['B'].iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i-1]
    data['signal'] = out_signal
    return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
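A minimal sketch of that series-based variant (illustrative only, not the exact code that produced the 27 ms timing):
import numpy as np
import pandas as pd

def signals_series(indata):
    a = indata['A']  # plain Series lookups avoid repeated DataFrame indexing overhead
    b = indata['B']
    out_signal = np.zeros(len(indata))
    for i in range(1, len(indata)):
        if a.iloc[i] == 1:
            out_signal[i] = 1
        elif b.iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i - 1]
    out = pd.DataFrame({'A': a, 'B': b})
    out['signal'] = out_signal
    return out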
Both coldspeed's and djk's answers were faster than my solution (about 4.5 ms), but in practice I'll probably just iterate through series even though that is not optimal.
The following code finds the indices where df['A'] == 1:
import pandas as pd
import numpy as np
import random
index = list(range(10))
random.shuffle(index)
df = pd.DataFrame(np.zeros((10,1)).astype(int), columns=['A'], index=index)
df.iloc[3:6, 0] = 1
df.iloc[6:, 0] = 2
print(df)
print(df.loc[df['A'] == 1].index.tolist())
It returns the index labels correctly. How do I get the positional (integer) index [3, 4, 5] instead, using the pandas API?
A
8 0
4 0
6 0
3 1
7 1
1 1
5 2
0 2
2 2
9 2
[3, 7, 1]
What about:
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
or (depending on your goals):
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Demo:
In [11]: df
Out[11]:
A
8 0
4 0
6 0
3 1
7 1
1 1
5 2
0 2
2 2
9 2
In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')
In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Here is one way:
df.reset_index().index[df.A == 1].tolist()
This re-indexes the data frame with [0, 1, 2, ...], then extracts the integer index values based on the boolean mask df.A == 1.
Edit: credits to @Max for the index[df.A == 1] idea.
No need for NumPy, you're right. Just pure Python with a list comprehension:
Just find the indexes where the values are 1:
print([i for i,x in enumerate(df['A'].values) if x == 1])
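For reference, the same positional lookup can also be done with NumPy (a one-line sketch, useful if the column is large):
import numpy as np
print(np.flatnonzero(df['A'].values == 1).tolist())
# [3, 4, 5]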