I am trying to create a new label column based on multiple conditions in my existing data.
df
ind group people value value_50 val_minmax
1 1 5 100 1 10
1 2 2 90 1 na
2 1 10 80 1 80
2 2 20 40 0 na
3 1 7 10 0 10
3 2 23 30 0 na
import pandas as pd
import numpy as np
df = pd.read_clipboard()
Then I try to label the rows according to the conditions below:
df['label'] = np.where(np.logical_and(df.group == 2, df.value_50 == 1, df.value > 50), 1, 0)
but it gives me an error:
TypeError: return arrays must be of ArrayType
How can I do this in Python?
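np.logical_and accepts only two input arrays, so the third mask is interpreted as its out argument, which causes the error. Use & between the masks instead: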
df['label'] = np.where((df.group == 2) & (df.value_50 == 1) & (df.value > 50), 1, 0)
Alternative:
df['label'] = ((df.group == 2) & (df.value_50 == 1) & (df.value > 50)).astype(int)
Your solution should work if you use reduce with a list of boolean masks:
mask = np.logical_and.reduce([df.group == 2, df.value_50 == 1, df.value > 50])
df['label'] = np.where(mask, 1, 0)
#alternative
#df['label'] = mask.astype(int)
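As a quick check, here is a minimal sketch that rebuilds the sample data by hand (instead of read_clipboard, and leaving out the val_minmax column) and confirms that the three variants above give the same labels. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, typed in by hand
df = pd.DataFrame({
    "ind": [1, 1, 2, 2, 3, 3],
    "group": [1, 2, 1, 2, 1, 2],
    "people": [5, 2, 10, 20, 7, 23],
    "value": [100, 90, 80, 40, 10, 30],
    "value_50": [1, 1, 1, 0, 0, 0],
})

m1 = df["group"] == 2
m2 = df["value_50"] == 1
m3 = df["value"] > 50

# Three equivalent ways to turn the combined mask into a 0/1 label
with_ampersand = np.where(m1 & m2 & m3, 1, 0)
with_reduce = np.where(np.logical_and.reduce([m1, m2, m3]), 1, 0)
with_astype = (m1 & m2 & m3).astype(int).to_numpy()

print(with_ampersand)
```

Only the second row (group 2, value_50 of 1, value 90) satisfies all three conditions.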
Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending down or trending up compared to the value at [i, 'A'].
Here is a loop:
import pandas as pd
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
for i in range(0, 5):
    j = i
    while j in range(i, i+5) and df.at[i,'label'] == 0: # if classified, no need to continue
        if df.at[j,'B'] - df.at[i,'A'] >= 10:
            df.at[i,'label'] = 1 # Label 1 means trending up
        if df.at[j,'B'] - df.at[i,'A'] <= -10:
            df.at[i,'label'] = 2 # Label 2 means trending down
        j = j + 1
[out]
A B label
0 1 1
1 10 2
2 -10 2
3 2 0
5 3 0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using pandas methods.
The task can be accomplished using pandas vectorized methods:
rolling method, which does computations in a rolling window
min & max methods, which we compute in the rolling window
np.select (or the DataFrame where method), which allows us to set values based upon logic
Code
import numpy as np
import pandas as pd

def set_trend(df, threshold = 10, window_size = 2):
    '''
    Use rolling_window to find max/min values in a window from the current point
    rolling window normally looks at backward values
    We use technique from https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values
    '''
    # To have a rolling window on lookahead values in column B
    # We reverse values in column B
    df['B_rev'] = df["B"].values[::-1]

    # Max & Min in B_rev, then reverse order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods = 0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods = 0).min().values[::-1]

    nrows = df.shape[0] - 1 # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() give index for max in non-reverse row
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmax(), raw = True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods = 0).apply(lambda x: nrows - x.argmin(), raw = True).values[::-1]

    # Use np.select to implement label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']), # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']), # min below & comes first
        df['max_'] - df["A"] >= threshold, # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold, # min below threshold but didn't come first
    ]
    choices = [
        1, # max above & came first
        2, # min below & came first
        1, # max above threshold
        2, # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default = 0)

    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis = 1, inplace = True)
    return df
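The reverse, roll, reverse step is the core of the function, so it may help to see it in isolation. This is a small sketch on a made-up Series (the values and the window of 3 are assumptions for illustration): rolling normally looks backward, so reversing before and after turns it into a look-ahead window.

```python
import pandas as pd

s = pd.Series([1, 10, -10, 2, 3])
window = 3

# Reverse, take a backward-looking rolling max, then reverse again.
# The result at position i is the max of s[i : i + window].
fwd_max = s[::-1].rolling(window, min_periods=1).max()[::-1]

print(fwd_max.tolist())
```

Near the end of the Series the window simply has fewer values, which is why min_periods is set low.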
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
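If you want something that follows the original loop more literally (label by whichever threshold is crossed first inside the look-ahead window), a NumPy-only sketch is possible too. This is not part of the answer above; the function name label_first_crossing and the NaN padding are my own assumptions. It labels every row, whereas the loop in the question only handled the first five.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def label_first_crossing(df, threshold=10, window_size=5):
    """Label 1 (up) or 2 (down) by the first crossing in B[i : i + window_size]."""
    a = df["A"].to_numpy(dtype=float)
    b = df["B"].to_numpy(dtype=float)
    # Pad with NaN so every row has a full window; NaN never passes a comparison
    padded = np.concatenate([b, np.full(window_size - 1, np.nan)])
    windows = sliding_window_view(padded, window_size)  # shape (n, window_size)
    diff = windows - a[:, None]
    up = diff >= threshold
    down = diff <= -threshold
    hit = up | down
    first = hit.argmax(axis=1)  # position of the first crossing (0 if there is none)
    rows = np.arange(len(df))
    labels = np.where(up[rows, first], 1, 2)
    return np.where(hit.any(axis=1), labels, 0)

case1 = pd.DataFrame({'A': [0, 1, 2, 3, 5, 0, 0, 0, 0, 0],
                      'B': [1, 10, -10, 2, 3, 0, 0, 0, 0, 0]})
case2 = pd.DataFrame({'A': [0, 1, 2], 'B': [1, -10, 10]})
print(label_first_crossing(case1))
print(label_first_crossing(case2))
```

The diff array holds window_size values per row, so for 100M rows you would probably want to process the frame in chunks to keep memory in check.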
import pandas as pd
import numpy as np
data_A=pd.read_csv('D:/data_A.csv')
data_A has a column named power.
The power column should contain only 0 and 1, and its dtype is int64.
I want to make sure that there are only 0 and 1 in the power column.
So, if there are numbers other than 0 and 1 in the power column, I want to set those values to 0. How can I do this?
You can use DataFrame.loc to conditionally access a group of rows and columns.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({"power": [1, 0, 1, 2, 5, 6, 0, 1]})
>>> df
power
0 1
1 0
2 1
3 2
4 5
5 6
6 0
7 1
>>> df.loc[~(df["power"].isin([1, 0])), "power"] = 0
>>> df
power
0 1
1 0
2 1
3 0
4 0
5 0
6 0
7 1
The condition ~(df["power"].isin([1, 0])) returns a Boolean Series, which can be used to select the rows where 'power' is neither 1 nor 0.
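If you would rather not modify the column in place, the same isin mask can be passed to Series.where, which keeps values where the mask is True and substitutes the second argument elsewhere. A minimal sketch on the same example data (the names valid and cleaned are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"power": [1, 0, 1, 2, 5, 6, 0, 1]})

# True where the value is allowed
valid = df["power"].isin([0, 1])

# Keep allowed values, replace everything else with 0
cleaned = df["power"].where(valid, 0)

print(cleaned.tolist())
```

Assign the result back with df["power"] = cleaned once you are happy with it.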
You could also use a list comprehension if your dataframe is small.
data_A.power = [x if x == 1 else 0 for x in data_A.power]
Or numpy for a longer column (this solution assumes you don't have negative values)
import numpy as np
power_np = np.array(data_A.power)
power_np[power_np > 1] = 0
data_A.power = power_np
Try this:
import pandas as pd
# example df
p = [1, 0, 3, 4, 's']
data_A = pd.DataFrame(p, columns=['power'])
def convert_row(row):
    if row == 1 or row == 0:
        return row
    else:
        return 0
data_A['power'] = data_A['power'].apply(convert_row)
print(data_A)
I am attempting to change the values of two columns in my dataset from specific numeric values (2, 10, 25 etc.) to single values (1, 2, 3 or 4) based on the percentile of the specific value within the dataset.
Using the pandas quantile() function I have got the ranges I wish to replace between, but I haven't figured out a working method to do so.
age1 = datasetNB.Age.quantile(0.25)
age2 = datasetNB.Age.quantile(0.5)
age3 = datasetNB.Age.quantile(0.75)
fare1 = datasetNB.Fare.quantile(0.25)
fare2 = datasetNB.Fare.quantile(0.5)
fare3 = datasetNB.Fare.quantile(0.75)
My current solution attempt for this problem is as follows:
for elem in datasetNB['Age']:
    if elem <= age1:
        datasetNB[elem].replace(to_replace = elem, value = 1)
        print("set to 1")
    elif (elem > age1) & (elem <= age2):
        datasetNB[elem].replace(to_replace = elem, value = 2)
        print("set to 2")
    elif (elem > age2) & (elem <= age3):
        datasetNB[elem].replace(to_replace = elem, value = 3)
        print("set to 3")
    elif elem > age3:
        datasetNB[elem].replace(to_replace = elem, value = 4)
        print("set to 4")
    else:
        pass
for elem in datasetNB['Fare']:
    if elem <= fare1:
        datasetNB[elem] = 1
    elif (elem > fare1) & (elem <= fare2):
        datasetNB[elem] = 2
    elif (elem > fare2) & (elem <= fare3):
        datasetNB[elem] = 3
    elif elem > fare3:
        datasetNB[elem] = 4
    else:
        pass
What should I do to get this working?
pandas already has a function for this, pandas.qcut.
You can simply do:
q_list = [0, 0.25, 0.5, 0.75, 1]
labels = range(1, 5)
df['Age'] = pd.qcut(df['Age'], q_list, labels=labels)
df['Fare'] = pd.qcut(df['Fare'], q_list, labels=labels)
Input
import numpy as np
import pandas as pd
# Generate fake data for the sake of example
df = pd.DataFrame({
'Age': np.random.randint(10, size=6),
'Fare': np.random.randint(10, size=6)
})
>>> df
Age Fare
0 1 6
1 8 2
2 0 0
3 1 9
4 9 6
5 2 2
Output
DataFrame after running the above code
>>> df
Age Fare
0 1 3
1 4 1
2 1 1
3 1 4
4 4 3
5 3 1
Note that in your specific case, since you want quartiles, you can just assign q_list = 4.
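For instance, here is a small sketch of the q = 4 shortcut on the same example values, typed in by hand rather than generated randomly so the result is reproducible. The names age_bin, fare_bin and binned are only for illustration.

```python
import pandas as pd

# Same values as the example input above
df = pd.DataFrame({"Age": [1, 8, 0, 1, 9, 2], "Fare": [6, 2, 0, 9, 6, 2]})

labels = [1, 2, 3, 4]
# An integer q asks qcut for that many equal-sized quantile bins
age_bin = pd.qcut(df["Age"], q=4, labels=labels)
fare_bin = pd.qcut(df["Fare"], q=4, labels=labels)

# qcut returns a categorical; cast to int if plain numbers are needed
binned = pd.DataFrame({"Age": age_bin.astype(int), "Fare": fare_bin.astype(int)})
print(binned)
```

One thing to watch for: if many rows share the same value, two quantile edges can coincide and qcut will raise an error about non-unique bin edges.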
I am trying to create a new pandas column which is normalised data from another column.
I created three separate series and then merged them into one.
While this approach has given me the desired result, I was wondering whether there's a better way to do this.
x = df["Data Col"].copy()
#if the value is between 70 and 30 find the difference of the previous value.
#Positive difference = 1 & Negative difference = -1
btw = pd.Series(np.where(x.between(30, 70, inclusive="neither"), x.diff(), 0))  # inclusive=False on pandas < 1.3
btw[btw < 0] = -1
btw[btw > 0] = 1
#All values above 70 are -1
up = pd.Series(np.where(x.gt(70), -1, 0))
#All values below 30 are 1
dw = pd.Series(np.where(x.lt(30), 1, 0))
combined = up + dw + btw
df["Normalised Col"] = np.array(combined)
I tried to use functions and loops directly on the Pandas Data Column but I couldn't figure out how to get the .diff()
Use numpy.select, chaining the masks with & for bitwise AND and | for bitwise OR:
np.random.seed(2019)
df = pd.DataFrame({'Data Col':np.random.randint(10, 100, size=10)})
#print (df)
d = df["Data Col"].diff()
m1 = df["Data Col"].between(30, 70, inclusive="neither")  # inclusive=False on pandas < 1.3
m2 = d < 0
m3 = d > 0
m4 = df["Data Col"].gt(70)
m5 = df["Data Col"].lt(30)
df["Normalised Col1"] = np.select([(m1 & m2) | m4, (m1 & m3) | m5], [-1, 1], default=0)
print (df)
Data Col Normalised Col1
0 82 -1
1 41 -1
2 47 1
3 98 -1
4 72 -1
5 34 -1
6 39 1
7 25 1
8 22 1
9 26 1
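A slightly shorter variant, offered as a sketch rather than a replacement: since the middle band only needs the sign of the difference, np.sign can stand in for the two diff masks. The data below is the sample column from the output above, typed in by hand; the names step, inside and normalised are my own. Plain comparisons are used instead of between so the snippet does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

x = pd.Series([82, 41, 47, 98, 72, 34, 39, 25, 22, 26], name="Data Col")

# Sign of the change from the previous row; the first row has no previous value
step = np.sign(x.diff().fillna(0))

# Strictly between 30 and 70
inside = (x > 30) & (x < 70)

normalised = np.select([x > 70, x < 30, inside], [-1, 1, step], default=0).astype(int)
print(normalised.tolist())
```

Values of exactly 30 or 70 fall through to the default of 0, the same as in the mask version.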
I am trying to iterate over a pandas dataframe and update a value if a condition is met, but I am getting an error.
for line, row in enumerate(df.itertuples(), 1):
    if row.Qty:
        if row.Qty == 1 and row.Price == 10:
            row.Buy = 1
AttributeError: can't set attribute
First, iterating in pandas is possible but very slow, so vectorized solutions are usually preferred.
I think you can use iterrows if you need to iterate:
for idx, row in df.iterrows():
    if df.loc[idx,'Qty'] == 1 and df.loc[idx,'Price'] == 10:
        df.loc[idx,'Buy'] = 1
But it is better to use a vectorized solution: set the value with a boolean mask and loc:
mask = (df['Qty'] == 1) & (df['Price'] == 10)
df.loc[mask, 'Buy'] = 1
Or solution with mask:
df['Buy'] = df['Buy'].mask(mask, 1)
Or if you need if...else use numpy.where:
df['Buy'] = np.where(mask, 1, 0)
Samples.
Set values by conditions:
df = pd.DataFrame({'Buy': [100, 200, 50],
'Qty': [5, 1, 1],
'Name': ['apple', 'pear', 'banana'],
'Price': [1, 10, 10]})
print (df)
Buy Name Price Qty
0 100 apple 1 5
1 200 pear 10 1
2 50 banana 10 1
mask = (df['Qty'] == 1) & (df['Price'] == 10)
df['Buy'] = df['Buy'].mask(mask, 1)
print (df)
Buy Name Price Qty
0 100 apple 1 5
1 1 pear 10 1
2 1 banana 10 1
df['Buy'] = np.where(mask, 1, 0)
print (df)
Buy Name Price Qty
0 0 apple 1 5
1 1 pear 10 1
2 1 banana 10 1
Ok, if you intend to set values in df then you need to track the index values.
option 1
using itertuples
# keep in mind `row` is a named tuple and cannot be edited
for line, row in enumerate(df.itertuples(), 1): # you don't need enumerate here, but doesn't hurt.
    if row.Qty:
        if row.Qty == 1 and row.Price == 10:
            df.set_value(row.Index, 'Buy', 1)
option 2
using iterrows
# keep in mind that `row` is a `pd.Series` and can be edited...
# ... but it is just a copy and won't reflect in `df`
for idx, row in df.iterrows():
    if row.Qty:
        if row.Qty == 1 and row.Price == 10:
            df.set_value(idx, 'Buy', 1)
option 3
using straight up loop with get_value
for idx in df.index:
    q = df.get_value(idx, 'Qty')
    if q:
        p = df.get_value(idx, 'Price')
        if q == 1 and p == 10:
            df.set_value(idx, 'Buy', 1)
The pandas.DataFrame.set_value method is deprecated as of 0.21.0 and was removed in pandas 1.0 (get_value likewise).
Use pandas.DataFrame.at instead:
for index, row in df.iterrows():
    if row.Qty and row.Qty == 1 and row.Price == 10:
        df.at[index,'Buy'] = 1
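To tie the answers together, here is a small sketch on the sample frame from earlier in this thread. It checks that a loop writing through df.at gives the same result as the boolean mask with loc. The names looped and vectorised are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Buy': [100, 200, 50],
                   'Qty': [5, 1, 1],
                   'Name': ['apple', 'pear', 'banana'],
                   'Price': [1, 10, 10]})

# Loop version: read from the named tuple, write back through df.at
looped = df.copy()
for row in looped.itertuples():
    if row.Qty == 1 and row.Price == 10:
        looped.at[row.Index, 'Buy'] = 1

# Vectorized version: one boolean mask, one assignment
vectorised = df.copy()
mask = (vectorised['Qty'] == 1) & (vectorised['Price'] == 10)
vectorised.loc[mask, 'Buy'] = 1

print(looped.equals(vectorised))
```

Both versions leave the first row untouched and set Buy to 1 for the other two; on large frames the mask version should be considerably faster.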