How do I normalise a Pandas data column with multiple conditionals? - python

I am trying to create a new pandas column which is normalised data from another column.
I created three separate series and then merged them into one.
While this approach has provided me with the desired result, I was wondering whether there's a better way to do this.
x = df["Data Col"].copy()
#If the value is between 30 and 70, take the difference from the previous value.
#Positive difference = 1, negative difference = -1
btw = pd.Series(np.where(x.between(30, 70, inclusive=False), x.diff(), 0))
btw[btw < 0] = -1
btw[btw > 0] = 1
#All values above 70 are -1
up = pd.Series(np.where(x.gt(70), -1, 0))
#All values below 30 are 1
dw = pd.Series(np.where(x.lt(30), 1, 0))
combined = up + dw + btw
df["Normalised Col"] = np.array(combined)
I tried applying functions and loops directly to the pandas column, but I couldn't figure out how to incorporate .diff().

Use numpy.select, chaining the masks with & for bitwise AND and | for bitwise OR:
np.random.seed(2019)
df = pd.DataFrame({'Data Col':np.random.randint(10, 100, size=10)})
#print (df)
d = df["Data Col"].diff()
m1 = df["Data Col"].between(30, 70, inclusive=False)
m2 = d < 0
m3 = d > 0
m4 = df["Data Col"].gt(70)
m5 = df["Data Col"].lt(30)
df["Normalised Col1"] = np.select([(m1 & m2) | m4, (m1 & m3) | m5], [-1, 1], default=0)
print (df)
Data Col Normalised Col1
0 82 -1
1 41 -1
2 47 1
3 98 -1
4 72 -1
5 34 -1
6 39 1
7 25 1
8 22 1
9 26 1
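One version note (my addition, not part of the answer): since pandas 1.3 the boolean inclusive argument of Series.between is deprecated in favour of strings, so on a recent pandas the first mask reads:
m1 = df["Data Col"].between(30, 70, inclusive="neither")  # pandas >= 1.3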

Related

Creating a new column in dataframe with range of values

I need to divide the range of my passengers' ages into 5 parts and create a new column holding values from 0 to 4, one for each part (value 0 for the 1st range, value 1 for the 2nd, and so on).
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset['Age_bin'] = a1.apply(0 for a in range(a))
Here is what I tried, but it does not work. I also attached a picture of the dataset.
I expect a new column named 'Age_bin' with value 0 for ages 0 to 16 inclusive, value 1 for ages 17 to 33, and so on for the other three ranges.
Binning with pandas cut is appropriate here; note that cut is a top-level function rather than a Series method, right=False keeps age 17 out of the first bin, and labels=False returns the integer codes 0-4. Try:
titset['Age_bin'] = pd.cut(titset['Age'], bins=[0, 17, 34, 51, 68, 81], right=False, labels=False)
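A quick sanity check of the binning on a few hand-picked ages (my sketch, using the boundaries above):
ages = pd.Series([0, 16, 17, 33, 34, 80])
print(pd.cut(ages, bins=[0, 17, 34, 51, 68, 81], right=False, labels=False).tolist())
# [0, 0, 1, 1, 2, 4] -> ages 0-16 fall in bin 0, 17-33 in bin 1, and so on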
First of all, the variable a is already a range object, so calling range(a) is equivalent to range(range(0, 17)), hence the error.
Secondly, even if you fixed that, you would run into another error, since .apply takes a callable (i.e., a function, whether defined with def or as a lambda).
If your goal is to assign a new column that represents the age group that each row is in, you can just filter with your result and assign them:
titset = pd.DataFrame({'Age': range(1, 81)})
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset.loc[a1.index, 'Age_bin'] = 0
titset.loc[a2.index, 'Age_bin'] = 1
titset.loc[a3.index, 'Age_bin'] = 2
titset.loc[a4.index, 'Age_bin'] = 3
titset.loc[a5.index, 'Age_bin'] = 4
Or better yet, use a for loop:
age_groups = [0, 17, 34, 51, 68, 81]
for i in range(len(age_groups) - 1):
    subset = titset.query(f'Age >= {age_groups[i]} & Age < {age_groups[i+1]}')
    titset.loc[subset.index, 'Age_bin'] = i
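One detail worth knowing (my note): the first .loc assignment creates Age_bin as float64, because rows not yet matched are temporarily NaN. If you want integers, cast once every row is covered:
titset['Age_bin'] = titset['Age_bin'].astype(int)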

Optimise processing of for loop?

I have this basic dataframe:
dur type src dst
0 0 new 543 1
1 0 new 21 1
2 1 old 4828 2
3 0 new 321 1
...
(total 450000 rows)
My aim is to replace the values in src with 1, 2 or 3 depending on their range. I wrote the for loop / if-else below:
for i in df['src']:
    if i <= 1000:
        df['src'].replace(to_replace=[i], value=[1], inplace=True)
    elif i <= 2500:
        df['src'].replace(to_replace=[i], value=[2], inplace=True)
    elif i <= 5000:
        df['src'].replace(to_replace=[i], value=[3], inplace=True)
    else:
        print('End!')
The above works as intended, but it is awfully slow trying to replace the entire dataframe with 450000 rows (it is taking more than 30 minutes to do this!).
Is there a more Pythonic way to speed up this algorithm?
Try numpy.select for multiple conditions:
cond1 = df.src.le(1000)
cond2 = df.src.le(2500)
cond3 = df.src.le(5000)
condlist = [cond1, cond2, cond3]
choicelist = [1, 2, 3]
df.assign(src=np.select(condlist, choicelist))
dur type src dst
0 0 new 1 1
1 0 new 1 1
2 1 old 3 2
3 0 new 1 1
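Two notes on the snippet above (mine, not the answerer's): np.select takes the first condition that matches, so the least restrictive mask must come last (here cond1 correctly precedes cond2 and cond3), and df.assign returns a new DataFrame rather than modifying df in place, so capture the result:
df = df.assign(src=np.select(condlist, choicelist))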
I have not tested this, but I think this should work:
pd.cut(df.src, [0, 1000, 2500, 5000], labels=[1,2,3] )
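If you take this route, keep in mind (my addition) that pd.cut with explicit labels returns a Categorical, and any src value outside (0, 5000] becomes NaN. Assuming every value falls in that range, you can cast back to plain integers:
df['src'] = pd.cut(df['src'], [0, 1000, 2500, 5000], labels=[1, 2, 3]).astype(int)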

Pandas set value if most columns are equal in a dataframe

This follows up on a question I asked yesterday: Pandas set value if all columns are equal in a dataframe.
Starting from @anky_91's solution, I'm working on something similar.
Instead of putting 1 or -1 when all columns are equal, I want something more flexible:
1 if (for example) at least 70% of the columns are 1, -1 for the inverse condition, and 0 otherwise.
So this is what I've written:
# Instead of using .all, I use .sum to count the occurrences of 1 and 0 in each row
m1 = local_df.eq(1).sum(axis=1)
m2 = local_df.eq(0).sum(axis=1)
# Debug prints; these work
print(m1)
print(m2)
But I don't know how to change this part:
local_df['enseamble'] = np.select([m1, m2], [1, -1], 0)
m = local_df.drop(local_df.columns.difference(['enseamble']), axis=1)
Here, in pseudocode, is what I want:
tot = m1 + m2
if m1 > m2:
    if m1 / tot > 0.7:  # simple percentage check
        df['enseamble'] = 1
elif m2 > m1:
    if m2 / tot > 0.7:
        df['enseamble'] = -1
else:
    df['enseamble'] = 0
Thanks
Edit 1
This is an example of expected output:
NET_0 NET_1 NET_2 NET_3 NET_4 NET_5 NET_6
date
2009-08-02 0 1 1 1 0 1
2009-08-03 1 0 0 0 1 0
2009-08-04 1 1 1 0 0 0
date enseamble
2009-08-02 1 # because 1 is more than 70%
2009-08-03 -1 # because 0 is more than 70%
2009-08-04 0 # because 0 and 1 are 50-50
You could obtain the specified output from the following conditions:
thr = 0.7
c1 = (df.eq(1).sum(1)/df.shape[1]).gt(thr)
c2 = (df.eq(0).sum(1)/df.shape[1]).gt(thr)
c2.astype(int).mul(-1).add(c1)
Output
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
dtype: int64
Or using np.select:
pd.DataFrame(np.select([c1,c2], [1,-1], 0), index=df.index, columns=['result'])
result
2009-08-02 0
2009-08-03 0
2009-08-04 0
2009-08-05 0
2009-08-06 -1
2009-08-07 1
Try this (m1, m2 and tot are the same as in your code):
cond1 = (m1 > m2) & (m1 / tot).gt(0.7)
cond2 = (m2 > m1) & (m2 / tot).gt(0.7)
df['enseamble'] = np.select([cond1, cond2], [1, -1], 0)
m = df.drop(df.columns.difference(['enseamble']), axis=1)
print(m)
enseamble
date
2009-08-02 1
2009-08-03 -1
2009-08-04 0
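Both answers can be folded into one small helper; the following is my consolidation (the function name ensemble_vote is mine), using the fraction of matching columns per row:
def ensemble_vote(frame, thr=0.7):
    # share of columns equal to 1 (resp. 0) in each row
    frac1 = frame.eq(1).mean(axis=1)
    frac0 = frame.eq(0).mean(axis=1)
    return np.select([frac1.gt(thr), frac0.gt(thr)], [1, -1], default=0)

local_df['enseamble'] = ensemble_vote(local_df)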

Numpy logical conditions for labeling the data

I am trying to create another label column which is based on multiple conditions in my existing data
df
ind group people value value_50 val_minmax
1 1 5 100 1 10
1 2 2 90 1 na
2 1 10 80 1 80
2 2 20 40 0 na
3 1 7 10 0 10
3 2 23 30 0 na
import pandas as pd
import numpy as np
df = pd.read_clipboard()
Then I try to label the rows according to the conditions below:
df['label'] = np.where(np.logical_and(df.group == 2, df.value_50 == 1, df.value > 50), 1, 0)
but it is giving me an error
TypeError: return arrays must be of ArrayType
How can I do this in Python?
Use & between masks:
df['label'] = np.where((df.group == 2) & (df.value_50 == 1) & (df.value > 50), 1, 0)
Alternative:
df['label'] = ((df.group == 2) & (df.value_50 == 1) & (df.value > 50)).astype(int)
Your approach would work if you used reduce with a list of boolean masks:
mask = np.logical_and.reduce([df.group == 2, df.value_50 == 1, df.value > 50])
df['label'] = np.where(mask, 1, 0)
#alternative
#df['label'] = mask.astype(int)
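For completeness (my note, not part of the original answers): np.logical_and is a binary ufunc, so the third positional argument in the failing call is interpreted as its out parameter, and a pandas Series is not a valid output array, hence the TypeError. Folding the masks with functools.reduce is yet another option:
from functools import reduce
import operator

masks = [df.group == 2, df.value_50 == 1, df.value > 50]
df['label'] = reduce(operator.and_, masks).astype(int)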

Indexing on DataFrame with MultiIndex

I have a large pandas DataFrame that I need to fill.
Here is my code:
from scipy.stats import pearsonr  # used below for the correlation

trains = np.arange(1, 101)
# The above are example values; it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)

tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))

index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)

metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    # The above function returns a binary array of length 35
    # Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is that when this piece of code finally finishes computing, my DataFrame df still contains nothing but zeros; even the NaNs are not inserted. I think my indexing is correct, and I have tested my binary_metric_train function separately: it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As @EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd
trains = [ 1, 1, 1, 2, 2, 2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [ 1, 0, 1, 0, 1, 0]
df = pd.DataFrame({
'trains' : trains,
'thresholds' : thresholds,
'C1' : data,
'C2' : data
}).set_index(['trains', 'thresholds'])
print(df)

# .ix was removed in pandas 1.0; use .loc / .iat instead
df.loc[(2, 30), 'C1'] = 3                  # label-based, using the column name
df.iat[df.index.get_loc((2, 30)), 0] = 3   # position-based, using the column index
# but not...
df.loc[(2, 30), 1] = 3                     # creates a new column
print(df)
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
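Applied back to the question's loop, the chained df[k][i][j] = ... assignment (which writes to an intermediate copy and never reaches df) becomes a single .loc call; a sketch, assuming the question's binary_metric_train and metrics:
for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                df.loc[(i, j), k] = abs(pearsonr(trA, metrics[k])[0])
            else:
                df.loc[(i, j), k] = np.nan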
