selecting data using pandas

selecting data using pandas - python

I have a large catalog that I am selecting data from according to the following criteria:
columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep ='\s+', names=columns)
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))
When I perform the above cuts the code is working fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some of the targets have logg = -1 (which stands for the fact that the value is not available). Luckily, I can calculate logg from the other available parameters. So here is my updated cuts:
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)
However, I am receiving an error:
if catalog.logg[i] == -1:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can someone please explain what I am doing wrong and how I can fix it. Thank you
Edit 1
My dataframe looks like the following:
Data columns:
System 477 non-null values
rp 477 non-null values
mp 477 non-null values
logg 477 non-null values
dtypes: float64(37), int64(3), object(3)None
Edit 2
System rp mp logg FeH FeHu FeHl Mstar Mstaru Mstarl
0 target-01 5196 24 24 0.31 0.04 0.04 0.905 0.015 0.015
1 target-02 5950 150 150 -0.30 0.25 0.25 0.950 0.110 0.110
2 target-03 5598 50 50 0.04 0.05 0.05 0.997 0.049 0.049
3 target-04 6558 44 -1 0.14 0.04 0.04 1.403 0.061 0.061
4 target-05 6190 60 60 0.05 0.07 0.07 1.194 0.049 0.050
....
[5 rows x 43 columns]
Edit 3
My code in a format that I understand should be:
for row in range(len(catalog)):
parameter = catalog['logg'][row]
if parameter == -1:
parameter = catalog['mp'][row] / catalog['rp'][row]
if parameter > 4.0 and parameter < 5.0:
# select this row for further analysis
However, I am trying to write my code in a more simple and professional way. I don't want to use the for loop. How can I do it?
EDIT 4
Consider the following small example:
System rp mp logg
target-01 2 -1 2 # will NOT be selected since mp = -1
target-02 -1 3 4 # will NOT be selected since rp = -1
target-03 7 6 4.3 # will be selected since mp != -1, rp != -1, and 4 < logg <5
target-04 3.2 15 -1 # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)

you get the error because catalog.logg[i] is not a scalar,but a series,so you should turn to vectorized manipulation:
catalog.loc[i,'logg'] = catalog.loc[i,'mp']/catalog.loc[i,'rp']
which would modify the logg column inplace
As for edit 3:
rows=catalog.loc[(catalog.logg > 4) & (catalog.logg < 5)]
which will select rows that satisfy the condition

Instead of that code:
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
You could use following:
i &= df.logg == -1
df.loc[i, 'logg'] = df.loc[i, 'mp'] / df.loc[i, 'rp']
# or
df.ix[i, 'logg'] = df.ix[i, 'mp'] / df.ix[i, 'rp']
For your edit 3 you need to add that line:
your_rows = df[(df.logg > 4) & (df.logg < 5)]
Full code:
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= df.logg == -1
df.ix[i, 'logg'] = df.ix[i, 'mp'] / df.ix[i, 'rp']
your_rows = df[(df.logg > 4) & (df.logg < 5)]
EDIT
Probably I still don't understand what you want, but I get your desired output:
import pandas as pd
from io import StringIO
data = """
System rp mp logg
target-01 2 -1 2
target-02 -1 3 4
target-03 7 6 4.3
target-04 3.2 15 -1
"""
catalog = pd.read_csv(StringIO(data), sep='\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.ix[i, 'logg'] = catalog.ix[i, 'mp'] / catalog.ix[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
In [7]: your_rows
Out[7]:
System rp mp logg
2 target-03 7.0 6 4.3000
3 target-04 3.2 15 4.6875
Am I still wrong?

Related

looping in a data frame failing: Overriding existing column values

I am using a for loop to reuse existing data frames.
Sample Code:
for i in range(0, 5, 1):
RGU_TT_TempX = pd.DataFrame()
RGU_TT_TempX = RGU_TT_Temp
#Merging Regular Ambulance TT with MSUs TT
#Updating MSUs TT according to the Formula
RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
RGU_TT_TempX["MSU_Y_DURATION"] = 0.05 + df_temp_MSU2["MSU_Y_DURATION"].values + 0.25 + 0.25
RGU_TT_TempX["MSU_Z_DURATION"] = 0.05 + df_temp_MSU3["MSU_Z_DURATION"].values + 0.25 + 0.25
This gives me the error:
---> 44 RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
ValueError: Length of values (0) does not match length of index (16622)
In each data frame, I have 16622 values. Still, this gives me the length of the index error.
Full Error Track:
ValueError Traceback (most recent call last)
Input In [21], in <cell line: 16>()
41 RGU_TT_TempX = RGU_TT_Temp
42 #Merging Regular Ambulance TT with MSUs TT
43 #Updating MSUs TT according to the Formula
---> 44 RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
45 RGU_TT_TempX["MSU_Y_DURATION"] = 0.05 + df_temp_MSU2["MSU_Y_DURATION"].values + 0.25 + 0.25
46 RGU_TT_TempX["MSU_Z_DURATION"] = 0.05 + df_temp_MSU3["MSU_Z_DURATION"].values + 0.25 + 0.25
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/frame.py:3977, in DataFrame.__setitem__(self, key, value)
3974 self._setitem_array([key], value)
3975 else:
3976 # set column
-> 3977 self._set_item(key, value)
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/frame.py:4171, in DataFrame._set_item(self, key, value)
4161 def _set_item(self, key, value) -> None:
4162 """
4163 Add series to DataFrame in specified column.
4164
(...)
4169 ensure homogeneity.
4170 """
-> 4171 value = self._sanitize_column(value)
4173 if (
4174 key in self.columns
4175 and value.ndim == 1
4176 and not is_extension_array_dtype(value)
4177 ):
4178 # broadcast across multiple columns if necessary
4179 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/frame.py:4904, in DataFrame._sanitize_column(self, value)
4901 return _reindex_for_setitem(Series(value), self.index)
4903 if is_list_like(value):
-> 4904 com.require_length_match(value, self.index)
4905 return sanitize_array(value, self.index, copy=True, allow_2d=True)
File ~/opt/anaconda3/envs/geo_env/lib/python3.10/site-packages/pandas/core/common.py:561, in require_length_match(data, index)
557 """
558 Check the length of data matches the length of the index.
559 """
560 if len(data) != len(index):
--> 561 raise ValueError(
562 "Length of values "
563 f"({len(data)}) "
564 "does not match length of index "
565 f"({len(index)})"
566 )
ValueError: Length of values (0) does not match length of index (16622)
I am really stuck here. Any suggestions will be highly appreciated.
Data Frame (MSU_TT_Temp) Samples:
FROM_ID TO_ID DURATION_H DIST_KM
1 7 0.528556 38.43980
1 26 0.512511 37.38515
1 71 0.432453 32.57571
1 83 0.599486 39.26188
1 98 0.590517 35.53107
Data Frame (RGU_TT_Temp) Samples:
Ambulance_ID Centroid_ID Hospital_ID Regular_Ambu_TT
37 1 6 1.871885
39 2 13 1.599971
6 3 6 1.307165
42 4 12 1.411554
37 5 14 1.968138
The problem is, if I iterate my loop once, the code works absolutely fine.
Sample Code:
for i in range(0, 1, 1):
s = my_chrome_list[i]
MSU_X,MSU_Y,MSU_Z = s
#print (MSU_X,MSU_Y,MSU_Z)
#Three scenario
df_temp_MSU1 = pd.DataFrame()
df_temp_MSU2 = pd.DataFrame()
df_temp_MSU3 = pd.DataFrame()
df_temp_MSU1 = MSU_TT_Temp.loc[(MSU_TT_Temp['FROM_ID'] == MSU_X)]
df_temp_MSU1.rename(columns = {'DURATION_H':'MSU_X_DURATION'}, inplace = True)
#df_temp_MSU1
df_temp_MSU2 = MSU_TT_Temp.loc[(MSU_TT_Temp['FROM_ID'] == MSU_Y)]
df_temp_MSU2.rename(columns = {'DURATION_H':'MSU_Y_DURATION'}, inplace = True)
#df_temp_MSU2
df_temp_MSU3 = MSU_TT_Temp.loc[(MSU_TT_Temp['FROM_ID'] == MSU_Z)]
df_temp_MSU3.rename(columns = {'DURATION_H':'MSU_Z_DURATION'}, inplace = True)
#df_temp_MSU3
RGU_TT_TempX = pd.DataFrame()
RGU_TT_TempX = RGU_TT_Temp
#Merging Regular Ambulance TT with MSUs TT
#Updating MSUs TT according to the Formula
RGU_TT_TempX["MSU_X_DURATION"] = 0.05 + df_temp_MSU1["MSU_X_DURATION"].values + 0.25 + 0.25
RGU_TT_TempX["MSU_Y_DURATION"] = 0.05 + df_temp_MSU2["MSU_Y_DURATION"].values + 0.25 + 0.25
RGU_TT_TempX["MSU_Z_DURATION"] = 0.05 + df_temp_MSU3["MSU_Z_DURATION"].values + 0.25 + 0.25
#RGU_TT_TempX
#MSUs Average Time to Treatment
MSU1=RGU_TT_TempX["MSU_X_DURATION"].mean()
MSU2=RGU_TT_TempX["MSU_Y_DURATION"].mean()
MSU3=RGU_TT_TempX["MSU_Z_DURATION"].mean()
MSU_AVG_TT = (MSU1+MSU2+MSU3)/3
parents_chromosomes_list.append(MSU_AVG_TT)
Output:
[2.0241383927258387]
Note: The data length in the three data frames are equal: Indexes are the same length
Loop for multiple iteration:Erorr
for i in range(0, 5, 1):
What is the problem?

This error occurs when you attempt to assign a NumPy array of values to a new column in a pandas DataFrame, yet the array's length does not match the current length of the index.
The easiest way to fix this error is to simply create a new column using a pandas Series instead of a NumPy array.
You can take a closer look here: How to Fix: Length of values does not match the length of index

speed up time for loop iterations in python

I have "random" points and would like to check which points can be connected by straight lines. Therefore I iterate through a list of points and draw a line at different angles. After all lines at all angles for every single point is drawn, I iterate over each line checking whether they are connecting 3 or more points. If the line connects 3 or more points, it is saved by appending it to a new list (newLines), if not the next line gets tested.
The problem which the following code is that it is way to slow... My testing image took about 30 min and my actual image was not done after about 14 hours. I read about speeding up for loops by using numpy (like in this article). I found plenty of examples for replacing for loops with numpy but in these example it was just simple iterating over a list without declaring the values as variables for usage.
Any hint for speeding up the following code is appreciated, it does not necessarily need to be numpy.
# list for saving rotated lines
lines=[]
for point in points:
# length of line is the diagonal of the point image so it still covers the whole image after rotation
length = sqrt(image.shape[0]**2+image.shape[1]**2)
start = Point(point)
end = Point(start.x+length, start.y)
line = LineString([start,end])
# rotating the generated line for 5 degrees and appeding it to the list
for a in range(0, 360, 5):
angle = np.deg2rad(a)
line = rotate(line, angle, origin=start, use_radians=True)
lines.append(line)
multiLines = MultiLineString(lines)
# list for rotated lines which connect 3 or more points
newLines = []
start = ()
for multiLine in multiLines.geoms:
lst = list(multiLine.coords)
# a: starting point of line | b: ending point of line
a = np.asarray(lst[0])
b = np.asarray(lst[1])
count = 0
# again iterating over point array to check which point is on line
for point in points:
p = np.asarray(point)
# check if point (p) is on line (a - b)
if np.cross(p-a,b-a) == 0:
if count == 0:
start = point
count += 1
else:
end = point
count += 1
if count >= 3:
line = (start, end)
newLines.append(line)

I'm not sure what your current benchmarks are, but you want to try with numpy you can do something like this. I'm using pandas which is a numpy wrapper, but it's effectively doing the same thing
I think this is doing the same thing as you want. I'm looking at each pair of points, calculating the m and c coefficients in the equation y = mx + c through the two points, then checking for cases where these match. I expect you might want some accepted error depending on your input data.
Sorry if I'm way off piste.
import pandas as pd
import numpy as np
import random
import itertools
import time
def get_matches(points):
# get all combinations of two points
combinations_of_points = ([(a[0], a[1], b[0], b[1]) for a, b in itertools.combinations(points, 2) if a != b])
data = pd.DataFrame(combinations_of_points, columns=['x1', 'y1', 'x2', 'y2'])
data['m'] = (data.y1 - data.y2) / (data.x1 - data.x2)
# swap negative gradients so all lines are in same direction
data.loc[np.isfinite(data.m) & data.m < 0, 'm'] = -(1 / data.m)
data.loc[np.isneginf(data.m), 'm'] = -data.m
# y = mx + c
data['c'] = data.y1 - (data.m * data.x1)
data = data.sort_values(['m', 'c', 'x1']).reset_index(drop=True)
# filter to items which are duplicated
filtered = data[
# matching m and c values
(np.isfinite(data.m) & data.duplicated(['m', 'c'], keep=False)) |
# infinite m and x equal (straight line up)
(np.isposinf(data.m) & data.duplicated(['m', 'x1'], keep=False))
]
return filtered
points = [(0, 0), (1, 1), (2, 2)]
print(get_matches(points))
random.seed(1)
count = 500
random_points = [(round(random.random(), 3), round(random.random(), 3)) for i in range(count)]
results = get_matches(random_points)
print(results)
print('\nPerformance with increasing points')
for i in [i ** 2 for i in range(5, 101, 5)]:
random.seed(1)
random_points = [(round(random.random(), 3), round(random.random(), 3)) for i in range(i)]
start = time.perf_counter()
results = get_matches(random_points)
stop = time.perf_counter()
print(f'{i:<9}{stop - start:03f}')
returns:
x1 y1 x2 y2 m c
0 0 0 1 1 1.0 0.0
1 0 0 2 2 1.0 0.0
2 1 1 2 2 1.0 0.0
x1 y1 x2 y2 m c
12243 0.606 0.262 0.400 0.880 -3.0 2.080
12244 0.606 0.262 0.440 0.760 -3.0 2.080
12251 0.378 0.970 0.506 0.586 -3.0 2.104
12252 0.505 0.589 0.378 0.970 -3.0 2.104
12253 0.505 0.589 0.506 0.586 -3.0 2.104
... ... ... ... ... ... ...
124741 0.971 0.382 0.971 0.716 inf -inf
124742 0.971 0.543 0.971 0.716 inf -inf
124744 0.983 0.593 0.983 0.296 inf -inf
124745 0.983 0.593 0.983 0.448 inf -inf
124746 0.983 0.296 0.983 0.448 inf -inf
[237 rows x 6 columns]
Performance with increasing points
25 0.010577
100 0.016897
225 0.045443
400 0.136834
625 0.338148
900 0.765913
1225 1.525819
1600 2.645753
2025 4.834811
2500 8.112012
3025 12.960043
3600 18.262522
4225 27.221498
4900 37.329662
5625 53.064736
6400 67.325213
7225 84.843119
8100 116.864120
9025 140.131420
10000 171.630961
As one of you comments pointed out earlier, the order of growth of the problem is approximately N ^ 2 because it is look at all the combinations of points so the performance very quickly degrades with increasing numbers of points. Note you could use this relationship to estimate how long it would take for your program to run if you know the number of points.

Attribute change with variable number of time steps

I would like to simulate individual changes in growth and mortality for a variable number of days. My dataframe is formatted as follows...
import pandas as pd
data = {'unique_id': ['2', '4', '5', '13'],
'length': ['27.7', '30.2', '25.4', '29.1'],
'no_fish': ['3195', '1894', '8', '2774'],
'days_left': ['253', '253', '254', '256'],
'growth': ['0.3898', '0.3414', '0.4080', '0.3839']
}
df = pd.DataFrame(data)
print(df)
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Ideally, I would like the initial length (i.e., length) to increase by the daily growth rate (i.e., growth) for each of the days remaining in the year (i.e., days_left).
df['final'] = df['length'] + (df['days_left'] * df['growth']
However, I would also like to update the number of fish that each individual represents (i.e., no_fish) on a daily basis using a size-specific equation. I'm fairly new to python so I initially thought to use a for-loop (I'm not sure if there is another, more efficient way). My code is as follows:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_to_forecast[indx]):
# (1) update individual length
df.lgth[indx] = df.lgth[indx] + df.linearGR[indx]
# (2) estimate daily size-specific mortality
if df.lgth[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.lgth[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.lgth[indx])
elif df.lgth[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.lgth[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
The above code now works correctly, but it is still far to inefficient to run for 40,000 individuals each for 200+ days.
I would really appreciate any advice on how to modify the following code to make it pythonic.
Thanks

Another option that was suggested to me is to use the pd.dataframe.apply function. This dramatically reduced the overall the run time and could be useful to someone else in the future.
### === RUN SIMULATION === ###
start_time = time.perf_counter() # keep track of run time -- START
#-------------------------------------------------------------------------#
def function_to_apply( df ):
df['z_instantMort'] = ''
for indx in range(int(df['days_left'])):
# (1) update individual length
df['length'] = df['length'] + df['growth']
# (2) estimate daily size-specific mortality
if df['length'] > 50.0:
df['z_instantMort'] = 0.01
else:
if df['length'] <= 50.0:
df['z_instantMort'] = 0.052857-((0.03/35)*df['length'])
elif df['length'] < 15.0:
df['z_instantMort'] = 0.728*np.exp(-0.1892*df['length'])
whole_fish = round(df['no_fish'], 0)
if whole_fish < 1.0:
df['no_fish'] = 0.0
elif whole_fish >= 1.0:
df['no_fish'] = df['no_fish']*np.exp(-(df['z_instantMort']))
return df
#-------------------------------------------------------------------------#
sim_results = df.apply(function_to_apply, axis=1)
total_elapsed_time = round(time.perf_counter() - start_time, 2) # END
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(sim_results)
### ====================== ###
output being...
Forecast iteration completed in 0.05 seconds
unique_id length no_fish days_left growth z_instantMort
0 2.0 126.3194 148.729190 253.0 0.3898 0.01
1 4.0 116.5742 93.018465 253.0 0.3414 0.01
2 5.0 129.0320 0.000000 254.0 0.4080 0.01
3 13.0 127.3784 132.864757 256.0 0.3839 0.01

As I said in my comment, a preferable alternative to for loops in this setting is using vector operations. For instance, running your code:
import pandas as pd
import time
import math
import numpy as np
data = {'unique_id': [2, 4, 5, 13],
'length': [27.7, 30.2, 25.4, 29.1],
'no_fish': [3195, 1894, 8, 2774],
'days_left': [253, 253, 254, 256],
'growth': [0.3898, 0.3414, 0.4080, 0.3839]
}
df = pd.DataFrame(data)
print(df)
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_left[indx]):
# (1) update individual length
df.length[indx] = df.length[indx] + df.growth[indx]
# (2) estimate daily size-specific mortality
if df.length[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.length[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.length[indx])
elif df.length[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.length[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output:
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Forecast iteration completed in 31.75 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.01
1 4 116.5742 93.018465 253 0.3414 0.01
2 5 129.0320 0.000000 254 0.4080 0.01
3 13 127.3784 132.864757 256 0.3839 0.01
Now with vector operations, you could do something like:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for day in range(1, df.days_left.max() + 1):
update = day <= df['days_left']
# (1) update individual length
df[update]['length'] = df[update]['length'] + df[update]['growth']
# (2) estimate daily size-specific mortality
df[update]['z'] = np.where( df[update]['length'] > 50.0, 0.01, 0.052857-( ( 0.03 / 35)*df[update]['length'] ) )
df[update]['z'] = np.where( df[update]['length'] < 15.0, 0.728 * np.exp(-0.1892*df[update]['length'] ), df[update]['z'] )
df[update]['no_fish'].round(decimals = 0)
df[update]['no_fish'] = np.where(df[update]['no_fish'] < 1.0, 0.0, df[update]['no_fish'] * np.exp(-(df[update]['z'])))
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output
Forecast iteration completed in 1.32 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.0
1 4 116.5742 93.018465 253 0.3414 0.0
2 5 129.0320 0.000000 254 0.4080 0.0
3 13 127.3784 132.864757 256 0.3839 0.0

Python: set average values to outliers

I have dataframe
ID Value
A 70
A 80
A 1000
A 100
A 200
A 130
A 60
A 300
A 800
A 200
A 150
A 250
I need to replace outliers to median value.
I use
df = pd.read_excel("test.xlsx")
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25), \
'median': grouped['Value'].median(), 'q3' :
grouped['Value'].quantile(.75)})
def is_outlier(row):
iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
median = statBefore.loc[row.ID]['median']
q3 = statBefore.loc[row.ID]['q3']
q1 = statBefore.loc[row.ID]['q1']
if row.Value > (q3 + (3 * iq_range)) or row.Value < (q1 - (3 * iq_range)):
return True
else:
return False
#apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis = 1)
But it returns me median - 175 and q1 - 92, but I get - 90, and it returns me q3 - 262,5, but I count and get - 275.
What wrong there?

This is simple and performant, with no Python for-loops to slow it down:
s = pd.Series([30, 31, 32, 45, 50, 999]) # example data
s.where(s.between(*s.quantile([0.25, 0.75])), s.median())
It gives you:
0 38.5
1 38.5
2 32.0
3 45.0
4 38.5
5 38.5
Unpacking that code, we have s.quantile([0.25, 0.75]) to get this:
0.25 31.25
0.75 48.75
We then use the values (31.25 and 48.75) as arguments to between(), with the * operator to unpack them because between() expects two separate arguments, not an array of length 2. That gives us:
0 False
1 False
2 True
3 True
4 False
5 False
Now that we have the binary mask, we use s.where() to choose the original values at the True locations, and fall back to s.median() otherwise.

This is just how quantiles are defined
df = pd.DataFrame(np.array([60,70,80,100,130,150,200,200,250,300,800,1000]))
print df.quantile(.25)
print df.quantile(.50)
print df.quantile(.75)
(The q1 for your data set is 95 btw)
The median is in between 150 and 200 (175)
The first quantile is 3 quarters between 80 and 100 (95)
The thrid quantile is 1 quarter in between 250 and 300 (262.5)

Finding values based on specific categories

I was wondering how would find estimated values based on several different categories. Two of the columns are categorical, one of the other columns contains two strings of interest and the last contain numeric values
I have a csv file called sports.csv
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
I'm trying to find a suggested price for a Gym that have both baseball and basketball as well as enrollment from 240 to 260 given they are from region 4 and of type 1
Region Type enroll estimates price Gym
2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4 2 100 0.26 37 Baseball|Tennis
4 1 347 0.65 61 Basketball|Baseball|Ballet
4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1 1 286 0.74 78 Swimming|Basketball
0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0 1 263 0.91 31 Tennis
2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
0 1 109 0.12 17 Football|Hockey|Volleyball
I don't know how to piece everything together. I apologize if the syntax is incorrect I'm just beginning Python. So far I have:
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
#group 4th region and type 1 together where enrollment is in between 240 and 260
group = df[df['Region'] == 4] df[df['Type'] == 1] df[240>=df['Enrollment'] <=260 ]
#split by pipe chars to find gyms that contain both Baseball and Basketball
df['Gym'] = df['Gym'].str.split('|')
df['Gym'] = df['Gym'].str.contains('Baseball'& 'Basketball')
price = df.loc[df['Gym'], 'Price']
Should I do a groupby instead? If so, how would I include the columns Type==1 Region ==4 and enrollment from 240 to 260 ?

You can create a mask with all your conditions specified and then use the mask for subsetting:
mask = (df['Region'] == 4) & (df['Type'] == 1) & \
(df['enroll'] <= 260) & (df['enroll'] >= 240) & \
df['Gym'].str.contains('Baseball') & df['Gym'].str.contains('Basketball')
df['price'][mask]
# Series([], name: price, dtype: int64)
which returns empty, since there is no record satisfying all conditions as above.

I had to add an instance that would actually meet your criteria, or else you will get an empty result. You want to use df.loc with conditions as follows:
In [1]: import pandas as pd, numpy as np, io
In [2]: in_string = io.StringIO("""Region Type enroll estimates price Gym
...: 2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
...: 4 2 100 0.26 37 Baseball|Tennis
...: 4 1 247 0.65 61 Basketball|Baseball|Ballet
...: 4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
...: 1 1 286 0.74 78 Swimming|Basketball
...: 0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
...: 0 1 263 0.91 31 Tennis
...: 2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
...: 3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
...: 0 1 109 0.12 17 Football|Hockey|Volleyball""")
In [3]: df = pd.read_csv(in_string,delimiter=r"\s+")
In [4]: df.loc[df.Gym.str.contains(r"(?=.*Baseball)(?=.*Basketball)")
...: & (df.enroll <= 260) & (df.enroll >= 240)
...: & (df.Region == 4) & (df.Type == 1), 'price']
Out[4]:
2 61
Name: price, dtype: int64
Note I used a regex pattern for contains that essentially acts as an AND operator for regex. You could simply have done another conjunction of .contains conditions for Basketball and Baseball.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

selecting data using pandas - python

Related

looping in a data frame failing: Overriding existing column values

speed up time for loop iterations in python

Attribute change with variable number of time steps

Python: set average values to outliers

Finding values based on specific categories

Categories

Resources