Removing rows below first line that meets threshold in pandas dataframe - python

I have a df that looks like:
import pandas as pd
import numpy as np
d = {'Hours':np.arange(12, 97, 12),
'Average':np.random.random(8),
'Count':[500, 250, 125, 75, 60, 25, 5, 15]}
df = pd.DataFrame(d)
This df has a decreasing number of cases in each row. Once the count drops below a certain threshold, I'd like to drop off the remainder, for example once a threshold of fewer than 10 cases is reached.
Starting:
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
6 0.302894 5 84
7 0.418912 15 96
Finished (row 6 and everything after it removed):
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72

We can use the index generated from the boolean mask and slice the df using iloc:
In [58]:
df.iloc[:df[df.Count < 10].index[0]]
Out[58]:
Average Count Hours
0 0.183016 500 12
1 0.046221 250 24
2 0.687945 125 36
3 0.387634 75 48
4 0.167491 60 60
5 0.660325 25 72
Just to break down what is happening here
In [54]:
# use a boolean mask to index into the df
df[df.Count < 10]
Out[54]:
Average Count Hours
6 0.244839 5 84
In [56]:
# we want the index and can subscript the first element using [0]
df[df.Count < 10].index
Out[56]:
Int64Index([6], dtype='int64')
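Note that index[0] will raise an IndexError if no Count ever drops below the threshold. A slightly more defensive sketch of the same idea, built on the question's df, uses a cumulative boolean mask instead of the positional slice and keeps the whole frame in that case:
import numpy as np
import pandas as pd

d = {'Hours': np.arange(12, 97, 12),
     'Average': np.random.random(8),
     'Count': [500, 250, 125, 75, 60, 25, 5, 15]}
df = pd.DataFrame(d)

# (df.Count < 10) is True at row 6; cummax() keeps it True from there on,
# so ~mask keeps rows 0-5. If no Count is below 10, everything is kept.
df_trimmed = df[~(df.Count < 10).cummax()]
print(df_trimmed)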

Related

How do I do a forward roll on a specific subset of data while modifying the original dataset?

I'm trying to calculate a cumulative sum over a specific subset of this dataset, and I want the changes to be reflected in the original dataset. The table below illustrates how I want to calculate Offset.
from tqdm import tqdm

# OFFSET
min = data.exit_block.min()
max = data.exit_block.max()
temp = 0
data['Offset'] = 0  # initialise the Offset column
for i in tqdm(range(min, min + 10)):
    offset = data.loc[(data["exit_block"] >= i) & (data["entry_block"] < i)]['size'].sum()
    data.loc[data["entry_block"] == i, 'Offset'] = data[data['entry_block'] == i]['size'].cumsum() + offset
    print(len(data.loc[(data["exit_block"] >= i) & (data["entry_block"] < i)]['size']))
    print(offset)
    print(data[data['entry_block'] == i]['size'].cumsum().head())
    print(data[data['entry_block'] == i]['size'].head())
    break
In the code above I'm creating a dataset B from the original dataset and trying to perform the cumulative sum operation on the original dataset using values derived from dataset B.
Index Entry_block Exit_block Size Offset
1 10 20 10 10
2 11 20 150 160
3 18 20 100 260
4 19 21 40 300
5 20 21 120 120
6 20 21 180 300
7 20 21 210 510
8 20 21 90 600
9 20 21 450 1050
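For the per-block running total part of this, groupby(...).cumsum() avoids the explicit row loop and writes straight back to the original frame. A minimal sketch, assuming the entry_block and size column names from the snippet above (the carried-over offset from still-open rows would still have to be added on top):
import pandas as pd

# hypothetical frame reusing the column names from the snippet above
data = pd.DataFrame({'entry_block': [10, 11, 18, 20, 20],
                     'size': [10, 150, 100, 120, 180]})

# running total of size within each entry_block, assigned back to the original frame
data['Offset'] = data.groupby('entry_block')['size'].cumsum()
print(data)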

Pandas: randomly swap column values per row

I want to train a binary classification ML model with some data that I have; something like this:
df
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
0 20 89 62 23 3 74
1 51 64 19 2 83 0
0 14 58 2 71 31 48
1 32 28 2 30 92 91
1 51 36 51 66 15 14
...
My target (y) depends on three characteristics from two groups; however, I have an imbalance in my data: a count of my y target values reveals that I have more zeros than ones, in a ratio of about 2.68. I correct this by looping over each row and randomly swapping values from group 1 to group 2 and vice versa, like this:
for index, row in df.iterrows():
    choice = np.random.choice([0, 1])
    if row['y'] != choice:
        df.loc[index, 'y'] = choice
        for column in df.columns[1:]:
            key = column.replace('g1', 'g2') if 'g1' in column else column.replace('g2', 'g1')
            df.loc[index, column] = row[key]
Doing this reduces the ratio to no more than 1.3, so I was wondering if there is a more direct approach using pandas methods.
Does anyone have an idea how to accomplish this?
Leaving aside whether or not swapping columns solves the class imbalance, I would swap the whole data set and randomly choose between the original and the swapped rows:
# Step 1: swap the columns
df1 = pd.concat((df.filter(regex='[^(_g1)]$'),
                 df.filter(regex='_g1$')),
                axis=1)
# Step 2: rename the columns
df1.columns = df.columns
# random choice
np.random.seed(1)
is_original = np.random.choice([True, False], size=len(df))
# concat to make new dataset
pd.concat((df[is_original], df1[~is_original]))
Output:
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
2 0 14 58 2 71 31 48
3 1 32 28 2 30 92 91
0 0 23 3 74 20 89 62
1 1 2 83 0 51 64 19
4 1 66 15 14 51 36 51
Notice that the rows with indexes 0, 1 and 4 have g1 swapped with g2.

How to conditionally select a column based on other columns in a pandas DataFrame without using the where function?

I'm working under Python 2.5 (I'm restricted to that version due to an external API) and would like to get the same results as the code below, which I wrote under Python 2.7:
import pandas as pd
df = pd.DataFrame({"lineId":[1,2,3,4], "idCaseMin": [10, 23, 40, 8], "min": [-110, -205, -80, -150], "idCaseMax": [5, 27, 15, 11], "max": [120, 150, 110, 90]})
df = df.set_index("lineId")
df["idMax"] = df["idCaseMax"].where(df["max"]>abs(df["min"]),df["idCaseMin"])
The DataFrame results in:
>>> df
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
The idMax column is defined by the id corresponding to the greatest value, in absolute terms, between the max and min columns.
I can't use the where function as it's not available under pandas 0.9.0 (the latest version available for Python 2.5) with numpy 1.7.1.
So, what options do I have to get the same results for the idMax column without using the pandas where function?
IIUC you can use numpy.where():
In [120]: df['idMax'] = np.where(df["max"] <= abs(df["min"]),
                                 df["idCaseMin"],
                                 df["idCaseMax"])
In [121]: df
Out[121]:
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
I'll try and provide an optimised solution for 0.9. IIUC ix should work here.
m = df["max"] > df["min"].abs()
i = df.ix[m, 'idCaseMax']
j = df.ix[~m, 'idCaseMin']
df['idMax'] = i.append(j)
df
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
Your pandas version should support this...
df['idMax'] = (df["max"] > abs(df["min"])) * df["idCaseMax"] + (df["max"] <= abs(df["min"])) * df["idCaseMin"]
df
Out[1388]:
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
We can use the apply function as in the code below to achieve the same results:
df["idMax"] = df.apply(lambda row: row["idCaseMax"] if row["max"]>abs(row["min"]) else row["idCaseMin"], axis = 1)

Numpy Finding Matching number with Array

Any help is greatly appreciated!! I have been trying to solve this for the last few days....
I have two arrays:
import pandas as pd
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
The result that I am trying to get is:
Array 1 and Array 2 matched by closest difference, based on the leftover numbers from Array 2:
20 26.12 3000 25.03
30 43.12 4000 42.12
40 46.81 6000 46
50 56.23 7000 110.05
60 111.07 8000 165.41
70 166.38 0 0
Starting at Array 1, ID 20, find the nearest value, which in this case is the first number in Array 2, ID 3000 (26.12 - 25.03). So ID 20 gets matched to 3000.
Where it gets tricky is that if a value in Array 2 is not the closest, it is skipped. For example, ID 40 (value 46.81) is compared to 45.74 and 46, and the smallest difference is 0.81, from 46 (ID 6000). So ID 40 --> ID 6000. ID 5000 in Array 2 is now skipped for any future comparisons. So when comparing Array 1 ID 50, it is compared to the next available number in Array 2, 110.05, and Array 1 ID 50 is matched to Array 2 ID 7000.
UPDATE
So here's the code that I have tried, and it works. Yes, it is not the greatest, so if someone has another suggestion please let me know.
import pandas as pd
import operator
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
newPos = 1
CurrentNumber = 0
OldArrayLen = len(OldDataSetArray) - 1
NewArrayLen = len(NewDataSetArray) - 1
numberResults = []
for oldPos in range(len(OldDataSetArray)):
    PreviousNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[oldPos, 0])
    while newPos <= len(NewDataSetArray) - 1:
        CurrentNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[newPos, 0])
        # if it is the last row for the inner array, then match the next available
        # in Array 1 to that last record
        if newPos == NewArrayLen and oldPos < newPos and oldPos + 1 <= OldArrayLen:
            numberResults.append([OldDataSetArray[oldPos + 1, 1], NewDataSetArray[newPos, 1],
                                  OldDataSetArray[oldPos + 1, 0], NewDataSetArray[newPos, 0]])
        if PreviousNumber < CurrentNumber:
            numberResults.append([OldDataSetArray[oldPos, 1], NewDataSetArray[newPos - 1, 1],
                                  OldDataSetArray[oldPos, 0], NewDataSetArray[newPos - 1, 0]])
            newPos += 1
            break
        elif PreviousNumber > CurrentNumber:
            PreviousNumber = CurrentNumber
            newPos += 1
# sort by array one values
numberResults = sorted(numberResults, key=operator.itemgetter(0))
numberResultsDf = pd.DataFrame(numberResults)
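For comparison, the skip-forward rule described above (each row of Array 2 is used at most once, and the pointer only moves on while the next candidate is strictly closer) can also be written as a two-pointer loop. This is just a sketch against the sample data, but for these values it reproduces the matching table shown earlier:
import pandas as pd

old = pd.DataFrame({'id': [20, 30, 40, 50, 60, 70],
                    'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]})
new = pd.DataFrame({'id': [3000, 4000, 5000, 6000, 7000, 8000],
                    'OdoLength': [25.03, 42.12, 45.74, 46, 110.05, 165.41]})

new_odo = new['OdoLength'].tolist()
matches = []
j = 0
for old_id, odo in zip(old['id'], old['OdoLength']):
    # advance while the next candidate in `new` is strictly closer to this odometer value
    while j + 1 < len(new_odo) and abs(odo - new_odo[j + 1]) < abs(odo - new_odo[j]):
        j += 1
    if j < len(new_odo):
        matches.append([old_id, odo, new['id'].iloc[j], new_odo[j]])
        j += 1  # this row of `new` is now used up
    else:
        matches.append([old_id, odo, 0, 0])  # nothing left to match against
result = pd.DataFrame(matches, columns=['old_id', 'old_odo', 'new_id', 'new_odo'])
print(result)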
You can use NumPy broadcasting to build a distance matrix:
import numpy

a = numpy.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38,])
b = numpy.array([25.03, 42.12, 45.74, 46, 110.05, 165.41,])
numpy.abs(a[:, None] - b[None, :])
# array([[ 1.09, 16. , 19.62, 19.88, 83.93, 139.29],
# [ 18.09, 1. , 2.62, 2.88, 66.93, 122.29],
# [ 21.78, 4.69, 1.07, 0.81, 63.24, 118.6 ],
# [ 31.2 , 14.11, 10.49, 10.23, 53.82, 109.18],
# [ 86.04, 68.95, 65.33, 65.07, 1.02, 54.34],
# [ 141.35, 124.26, 120.64, 120.38, 56.33, 0.97]])
Of that matrix you can then find the closest elements using argmin, either row- or column-wise (depending on whether you want to search in a or b).
numpy.argmin(numpy.abs(a[:, None] - b[None, :]), axis=1)
# array([0, 1, 3, 3, 4, 5])
Compute all the differences, and use np.argmin to look up the closest.
import numpy as np

a, b = np.random.rand(2, 10)
all_differences = np.abs(np.subtract.outer(a, b))
ia = all_differences.argmin(axis=1)
for i in range(10):
    print(i, a[i], ia[i], b[ia[i]])
0 0.231603891949 8 0.21177584152
1 0.27810475456 7 0.302647382888
2 0.582133214953 2 0.548920922033
3 0.892858042793 1 0.872622982632
4 0.67293347218 6 0.677971552011
5 0.985227546492 1 0.872622982632
6 0.82431697833 5 0.83765895237
7 0.426992114791 4 0.451084369838
8 0.181147161752 8 0.21177584152
9 0.631139744522 3 0.653554586691
EDIT
with dataframes and indexes:
va,vb=np.random.rand(2,10)
na,nb=np.random.randint(0,100,(2,10))
dfa=pd.DataFrame({'id':na,'odo':va})
dfb=pd.DataFrame({'id':nb,'odo':vb})
all_differences=np.abs(np.subtract.outer(dfa.odo,dfb.odo))
ia=all_differences.argmin(axis=1)
dfc = dfa.merge(dfb.loc[ia].reset_index(drop=True),
                left_index=True, right_index=True)
Input :
In [337]: dfa
Out[337]:
id odo
0 72 0.426457
1 12 0.315997
2 96 0.623164
3 9 0.821498
4 72 0.071237
5 5 0.730634
6 45 0.963051
7 14 0.603289
8 5 0.401737
9 63 0.976644
In [338]: dfb
Out[338]:
id odo
0 95 0.333215
1 7 0.023957
2 61 0.021944
3 57 0.660894
4 22 0.666716
5 6 0.234920
6 83 0.642148
7 64 0.509589
8 98 0.660273
9 19 0.658639
Output :
In [339]: dfc
Out[339]:
id_x odo_x id_y odo_y
0 72 0.426457 64 0.509589
1 12 0.315997 95 0.333215
2 96 0.623164 83 0.642148
3 9 0.821498 22 0.666716
4 72 0.071237 7 0.023957
5 5 0.730634 22 0.666716
6 45 0.963051 22 0.666716
7 14 0.603289 83 0.642148
8 5 0.401737 95 0.333215
9 63 0.976644 22 0.666716

Filter pandas dataframe based on another column

This might be a basic question, but I have not been able to find a solution. I have two dataframes with identical rows and columns, called Volumes and Prices, which look like this:
Volumes
Index ProductA ProductB ProductC ProductD Limit
0 100 300 400 78 100
1 110 370 20 30 100
2 90 320 200 121 100
3 150 320 410 99 100
....
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 90 0
1 51 110 29 99 0
2 49 120 25 88 0
3 51 110 22 96 0
....
I want to assign 0 to the cells of the Prices dataframe which correspond to Volumes less than the value in the Limit column.
So the ideal output would be:
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 0 0
1 51 110 0 0 0
2 0 120 25 88 0
3 51 110 22 0 0
....
I tried
import pandas as pd
import numpy as np
d_price = {'ProductA': [50, 51, 49, 51], 'ProductB': [110, 110, 120, 110],
           'ProductC': [30, 29, 25, 22], 'ProductD': [90, 99, 88, 96], 'Limit': [0]*4}
d_volume = {'ProductA': [100, 110, 90, 150], 'ProductB': [300, 370, 320, 320],
            'ProductC': [400, 20, 200, 410], 'ProductD': [78, 30, 121, 99], 'Limit': [100]*4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)
Prices[Volumes > Volumes.Limit]=0
but I do not obtain any changes to the Prices dataframe... Obviously I'm having a hard time understanding boolean slicing; any help would be great.
The problem is in
Prices[Volumes > Volumes.Limit]=0
Comparing a DataFrame with a Series aligns the Series against the DataFrame's columns, not its rows, so this never performs the per-row comparison you want. You can do the comparison row by row with apply (note that the condition you want is a volume below the limit), for example:
Prices[Volumes.apply(lambda x: x < x.Limit, axis=1)] = 0
You can use a mask to solve this problem. I am not an expert either, but this solution does what you want to do:
test=(Volumes.ix[:,'ProductA':'ProductD'] >= Volumes.Limit.values)
final = Prices[test].fillna(0)
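Another option, closer in spirit to the original attempt, is to force the per-row comparison with DataFrame.lt and axis=0. A sketch reusing the Prices and Volumes frames built in the question:
# True wherever a volume is below that row's Limit (row-wise comparison);
# the Limit column compares against itself and stays False
below_limit = Volumes.lt(Volumes['Limit'], axis=0)
Prices[below_limit] = 0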
