Dropping value in a dataframe in a loop - python

I have a dataframe with sorted values:
import numpy as np
import pandas as pd
sub_run = pd.DataFrame({'Runoff':[45,10,5,26,30,23,35], 'ind':[3, 10, 25,43,53,60,93]})
I would like to start from the highest value in Runoff (45) and drop all rows whose difference from it in "ind" is less than 30 (the rows with Runoff 10 and 5), update the DataFrame, then move to the second highest remaining value (35) and drop the rows whose difference in "ind" is < 30, then to the third highest value (30), dropping 26 and 23, and so on.
I wrote the following code :
pre_ind = []
for (idx1, row1) in sub_run.iterrows():
    var = row1.ind
    pre_ind.append(np.array(var))
    for (idx2, row2) in sub_run.iterrows():
        if (row2.ind != var) and (row2.ind not in pre_ind):
            test = abs(row2.ind - var)
            print("test", test)
            if test <= 30:
                sub_run = sub_run.drop(sub_run[sub_run.ind == row2.ind].index)
I expect to find the values [45, 35, 30] as output; however, I only find the first one.
Many thanks

Try this:
list_pre_max = []
while True:
    try:
        max_val = sub_run.Runoff.sort_values(ascending=False).iloc[len(list_pre_max)]
    except IndexError:  # no rows left to visit
        break
    max_ind = sub_run.loc[sub_run['Runoff'] == max_val, 'ind'].item()
    list_pre_max.append(max_val)
    dropped_indices = sub_run.loc[(abs(sub_run['ind'] - max_ind) <= 30) &
                                  (sub_run['ind'] != max_ind) &
                                  (~sub_run.Runoff.isin(list_pre_max))].index
    sub_run.drop(index=dropped_indices, inplace=True)
Output:
>>> sub_run
   Runoff  ind
0      45    3
4      30   53
6      35   93

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
In your case, the modification of sub_run has no immediate effect on the outer iteration. Therefore, after the outer loop visits (45, 3), the next row it visits is (35, 93), followed by (30, 53), (26, 43), (23, 60), (10, 10), and (5, 25). For the inner loop, your modification does take effect, because a fresh iterator over the updated frame is created on every pass of the outer loop.
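A quick demonstration of the copy behaviour, as a minimal sketch on a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
for _, row in df.iterrows():
    row['a'] = 99  # writes to the copied Series, not to df

print(df)  # unchanged: column 'a' is still [1, 2, 3]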
Here is my suggested code, inspired by bubble sort.
import pandas as pd

sub_run = pd.DataFrame({'Runoff': [45, 10, 5, 26, 30, 23, 35],
                        'ind': [3, 10, 25, 43, 53, 60, 93]})
sub_run = sub_run.sort_values(by=['Runoff'], ascending=False)

highestRow = 0
while highestRow < len(sub_run) - 1:
    cur_run = sub_run
    highestRunoffInd = cur_run.iloc[highestRow].ind
    for i in range(highestRow + 1, len(cur_run)):
        ind = cur_run.iloc[i].ind
        if abs(ind - highestRunoffInd) <= 30:
            sub_run = sub_run.drop(sub_run[sub_run.ind == ind].index)
    highestRow += 1
print(sub_run)
Output:
   Runoff  ind
0      45    3
6      35   93
4      30   53
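As an aside, the same result can be obtained without dropping rows while iterating at all: walk the rows from highest to lowest Runoff and keep a row only if its ind is more than 30 away from every ind kept so far. A minimal sketch of that greedy approach (my sketch, not the code above):

import pandas as pd

sub_run = pd.DataFrame({'Runoff': [45, 10, 5, 26, 30, 23, 35],
                        'ind': [3, 10, 25, 43, 53, 60, 93]})

kept = []  # 'ind' values of the rows we decide to keep
for _, row in sub_run.sort_values('Runoff', ascending=False).iterrows():
    # keep the row only if it is far enough from every row already kept
    if all(abs(row['ind'] - k) > 30 for k in kept):
        kept.append(row['ind'])

print(sub_run[sub_run['ind'].isin(kept)])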

Related

replacing integers in a data frame (logic issues)

So I am trying to change some values in a df using pandas and, having already tried df.replace, df.mask, and df.where, I have come to the conclusion that it must be a logical mistake, since it keeps throwing the same error:
ValueError: The truth value of a Series is ambiguous.
I am trying to normalize a column in a dataset, hence the function rather than just a single line. I need to understand why my logic is wrong; it seems to be such a dumb mistake.
This is my function:
def overweight_normalizer():
    if df[df["overweight"] > 25]:
        df.where(df["overweight"] > 25, 1)
    elif df[df["overweight"] < 25]:
        df.where(df["overweight"] < 25, 0)
df[df["overweight"] > 25] is not a valid condition: it evaluates to a whole DataFrame, and using a DataFrame (or Series) where a single boolean is expected is exactly what raises the ambiguous-truth-value error.
Try this:
def overweight_normalizer():
    df = pd.DataFrame({'overweight': [2, 39, 15, 45, 9]})
    df["overweight"] = [1 if i > 25 else 0 for i in df["overweight"]]
    return df

overweight_normalizer()
Output:
   overweight
0           0
1           1
2           0
3           1
4           0
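As a side note, the same normalization can be done without an explicit loop by comparing the whole column at once; a minimal vectorized sketch on the same example frame:

import pandas as pd

df = pd.DataFrame({'overweight': [2, 39, 15, 45, 9]})

# The boolean comparison is computed for the whole column at once;
# casting to int turns True/False into 1/0.
df['overweight'] = (df['overweight'] > 25).astype(int)
print(df)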

Is there a way in Python to start a variable at 0 and then increment by 1 while in a for loop?

Below is a section of my code. I am attempting to have day start at 0, then go 1, 2, 3, 4, so there are 5 total days, just starting at 0. Is there an easy way to do this? At the moment I only get days 1, 2, 3, 4, 5, since day = day + 1 runs before day is used and never allows a day of 0. Sorry if this is a silly question; I am still relatively new to learning Python.
density = np.zeros((6, 91, 181))
day = 0
for i, e in df.iterrows():
    lat = int((e['Latitude'] + 90) / 2)
    long = int(e['Longitude'] / 2)
    if lat == 0.0 and long == 0.0:
        day = day + 1
        print(day)
    density[day, lat, long] = e['rho']
range() produces an iterable which you can use for this purpose, starting at 0 and ending just before the value you specify; wrap it in iter() so you can pull successive values out with next():

density = np.zeros((6, 91, 181))
days = iter(range(5))  # an iterator over 0, 1, 2, 3, 4
for i, e in df.iterrows():
    lat = int((e['Latitude'] + 90) / 2)
    long = int(e['Longitude'] / 2)
    if lat == 0.0 and long == 0.0:
        day = next(days)  # pop the next value from the iterator
        print(day)
    density[day, lat, long] = e['rho']
If you instead want an infinite sequence of integers ascending from zero, you can write your own generator:

def inf_ints():
    i = 0
    while True:
        yield i
        i += 1

...
days = inf_ints()
...
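For what it's worth, the standard library already provides this generator as itertools.count:

from itertools import count

days = count()     # yields 0, 1, 2, ... indefinitely
print(next(days))  # 0
print(next(days))  # 1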
Not really sure what you want to achieve, but if you initialised day to -1 instead of 0, wouldn't that solve your problem?

Add (and calculate) rows to dataframe until condition is met:

I'm attempting to build a dataframe by repeatedly adding a row whose value in a column is 1 more than the prior row's, until a condition is met. In this case, I want to continue adding rows until column 'AGE' reaches 100.
import pandas as pd
import numpy as np

RP = {'AGE': pd.Series([10]),
      'SI': pd.Series([60])}
RPdata = pd.DataFrame(RP)

i = RPdata.tail(1)['AGE']
RPdata2 = pd.DataFrame()
while [i < 100]:
    RPdata2['AGE'] = i + 1
    RPdata2['SI'] = RPdata.tail(1)['SI']
    RPdata = pd.concat([RPdata, RPdata2], axis=0)
    break

print RPdata
Results
   AGE  SI
0   10  60
0   11  60
I understand that the break statement prevents multiple iterations, but the loop appears to be infinite without it.
I'm attempting to achieve:
   AGE  SI
0   10  60
0   11  60
0   12  60
0   13  60
0   14  60
.    .  60
0  100  60
Is there a way to accomplish this with a while loop? Should I pursue a for loop solution instead?
There may be other problems, but you're going to get into an infinite loop with while [i < 100]:, since a non-empty list always evaluates to True. Change that to while i < 100: (parentheses optional) and remove your break statement, which is forcing just one iteration.
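For completeness, here is a minimal working sketch of such a loop (my sketch, not the asker's code; note that the stopping value also has to be re-read as a scalar on every pass, which the original code does not do):

import pandas as pd

RPdata = pd.DataFrame({'AGE': [10], 'SI': [60]})

# Keep appending a copy of the last row, with AGE incremented, until AGE hits 100.
while RPdata['AGE'].iloc[-1] < 100:
    new_row = RPdata.iloc[[-1]].copy()
    new_row['AGE'] += 1
    RPdata = pd.concat([RPdata, new_row], axis=0)

print(RPdata)

That said, the pattern is just consecutive integers next to a constant column, so pd.DataFrame({'AGE': range(10, 101), 'SI': 60}) builds the whole frame in one shot without any loop.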

What is wrong with my algorithm/code?

I am trying to figure out what is wrong with my code. Currently, I am trying to get the average conductivity for each distinct temperature (e.g. temp 18 = 225 conductivity average, temp 19 = 15 conductivity average, etc.).
Could someone tell me whether this is a simple coding mistake or an algorithmic mistake, and offer some help to fix the problem?
temp = [18, 18, 19, 19, 20]
conductivity = [200, 250, 20, 10, 15]

tempcheck = temp[0]
conductivitysum = 0
datapoint = 0

assert len(temp) == len(conductivity)

for i in range(len(temp)):
    if tempcheck == temp[i]:
        datapoint += 1
        conductivitysum += conductivity[i]
    else:
        print conductivitysum/datapoint
        datapoint = 0
        conductivitysum = 0
        tempcheck = temp[i]
For some reason, it is printing out
225
10
When it should be printing out
225
15
15
In the else clause, after the print, put:

conductivitysum = 0
datapoint = 0
tempcheck = temp[i]
conductivitysum += conductivity[i]
datapoint += 1

because when you go into the else clause, you otherwise miss that particular conductivity at i; it doesn't get saved. So before moving on to the next i, save that conductivity. (Note that the loop still ends without printing the average for the final temperature group, so you also need one more print conductivitysum/datapoint after the loop.)
Change the else to:

for i in range(len(temp)):
    if tempcheck == temp[i]:
        datapoint += 1
        conductivitysum += conductivity[i]
    else:
        print conductivitysum/datapoint
        datapoint = 1
        conductivitysum = conductivity[i]
        tempcheck = temp[i]

When you get to the pair (19, 20) you need to keep it and count one datapoint, not zero datapoints. At the moment you are skipping it and only keeping the next one, (19, 10).
Alternatively, rewrite it as
>>> temp = [18,18,19,19,20]
>>> conductivity = [200,250,20,10,15]
# build a dictionary to group the conductivities by temperature
>>> groups = {}
>>> for (t, c) in zip(temp, conductivity):
...     groups[t] = groups.get(t, []) + [c]
...
# view it
>>> groups
{18: [200, 250], 19: [20, 10], 20: [15]}
# average the conductivities for each temperature
>>> for t, cs in groups.items():
...     print t, float(sum(cs))/len(cs)
...
18 225.0
19 15.0
20 15.0
>>>
When I saw this code, the first thing that popped into my head was the zip function. I hope the following code is what you want.

temp = [18, 18, 19, 19, 20]
conductivity = [200, 250, 20, 10, 15]

assert len(temp) == len(conductivity)

# Match each temp value to its corresponding conductivity value with zip
relations = list(zip(temp, conductivity))

for possible_temp in set(temp):  # each possible temperature (18, 19, 20)
    total = 0
    divide_by = 0
    # Sum the conductivity values recorded for this temperature and count
    # how many there are, so the average can be computed.
    for relation in relations:
        if relation[0] == possible_temp:
            total += relation[1]
            divide_by += 1
    print(int(total / divide_by))
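Since the temp list in the example is already sorted, itertools.groupby is another compact option; a minimal Python 3 sketch, assuming the input stays sorted by temperature:

from itertools import groupby

temp = [18, 18, 19, 19, 20]
conductivity = [200, 250, 20, 10, 15]

# groupby collects runs of consecutive equal temperatures
for t, pairs in groupby(zip(temp, conductivity), key=lambda pair: pair[0]):
    values = [c for _, c in pairs]
    print(t, sum(values) / len(values))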

How to efficiently join a list of values to a list of intervals?

I have a data frame which can be constructed as follows:
import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame({'value': scipy.stats.norm.rvs(0, 1, size=1000),
                   'start': np.abs(scipy.stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(scipy.stats.norm.rvs(5, 5, size=1000))
df[:10]
       start     value        end
0   9.521781 -0.570097  17.708335
1   3.929711 -0.927318  15.065047
2   3.990466  0.756413   4.841934
3  20.676291 -1.418172  28.284301
4  13.084246  1.280723  14.121626
5  29.784740  0.236915  32.791751
6  21.626625  1.144663  28.739413
7  18.524309  0.101871  27.271344
8  21.288152 -0.727120  27.049582
9  13.556664  0.713141  22.136275
Each row represents a value assigned to an interval (start, end).
Now, I would like to get the list of best (highest) values occurring at times 10, 12, 14, ..., 70. (It is similar to a geometric index in SQL, if you are familiar with that.)
Below is my first attempt in Python with pandas; it takes 18.5 ms. Can anyone help to improve it? (This procedure would be called 1M or more times with different data frames in my program.)
def get_values(data):
    data.sort_values(by='value', ascending=False, inplace=True)  # this takes 0.2 ms
    # can we get rid of it? since we don't really need a full sort...
    # all we need is the max value for each interval.
    # But if we have to keep it for simplicity, that is ok.
    ret = []
    #data = data[(data['end'] >= 10) & (data['start'] <= 71)]
    for t in range(10, 71, 2):
        interval = data[(data['end'] >= t) & (data['start'] <= t)]
        if not interval.empty:
            ret.append(interval['value'].values[0])
        else:
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret

#%prun -l 10 print get_values(df)
%timeit get_values(df)
My second attempt pushes as much of the work as possible from pandas down to numpy, and it takes around 0.7 ms:
def get_values(data):
    data.sort_values(by='value', ascending=False, inplace=True)
    ret = []
    df_end = data['end'].values
    df_start = data['start'].values
    df_value = data['value'].values
    for t in range(10, 71, 2):
        values = df_value[(df_end >= t) & (df_start <= t)]
        if len(values) != 0:
            ret.append(values[0])
        else:
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret

#%prun -l 10 print get_values(df)
%timeit get_values(df)
Can we improve this further? I guess the next step is the algorithm level; both of the above are just naive implementations of the logic.
I don't understand the empty-interval handling in your code; here is a faster version that ignores that part:
import scipy.stats as stats
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': stats.norm.rvs(0, 1, size=1000),
                   'start': np.abs(stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(stats.norm.rvs(5, 5, size=1000))

def get_value(df, target):
    value = df["value"].values
    idx = np.argsort(value)[::-1]
    start = df["start"].values[idx]
    end = df["end"].values[idx]
    value = value[idx]
    mask = (target[:, None] >= start[None, :]) & (target[:, None] <= end[None, :])
    index = np.argmax(mask, axis=1)
    flags = mask[np.arange(len(target)), index]
    result = value[index]
    result[~flags] = np.nan
    return result

get_value(df, np.arange(10, 71, 2))
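The trick is that the rows are sorted by value in descending order, so for each target time np.argmax over the boolean mask returns the first covering interval, which is the one with the highest value; the flags lookup then marks targets covered by no interval at all, and those come back as NaN. One thing to keep in mind: mask is a len(target) x len(df) boolean array, so memory use grows with both the number of query times and the number of intervals.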
