Pandas dataframe - remove outliers [duplicate] - python

This question already has answers here: Detect and exclude outliers in a pandas DataFrame (19 answers). Closed 1 year ago.
Given a pandas dataframe, I want to exclude rows corresponding to outliers (|z-score| > 3) based on one of the columns.
The dataframe looks like this:
df.dtypes
_id object
_index object
_score object
_source.address object
_source.district object
_source.price float64
_source.roomCount float64
_source.size float64
_type object
sort object
priceSquareMeter float64
dtype: object
For the line:
dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
The following exception is raised:
-------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-68-02fb15620e33> in <module>()
----> 1 dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof)
2239 """
2240 a = np.asanyarray(a)
-> 2241 mns = a.mean(axis=axis)
2242 sstd = a.std(axis=axis, ddof=ddof)
2243 if axis and mns.ndim < a.ndim:
/opt/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
68 is_float16_result = True
69
---> 70 ret = umr_sum(arr, axis, dtype, out, keepdims)
71 if isinstance(ret, mu.ndarray):
72 ret = um.true_divide(
TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'
And the return value of
np.isreal(df['_source.price']).all()
is
True
Why do I get the above exception, and how can I exclude the outliers?
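Most likely the exception occurs because stats.zscore(df) is applied to the entire dataframe, object columns included: np.asanyarray turns the frame into an object-dtype array, and the subsequent sum inside mean() fails on values (here None) that don't support +. Note also that zscore(axis=...) and all(axis=...) expect an integer axis, not a column name. A minimal sketch that z-scores only the numeric column of interest:
import numpy as np
from scipy import stats

# compute z-scores for the price column alone and keep rows within 3 standard deviations
mask = np.abs(stats.zscore(df['_source.price'])) < 3
dff = df[mask]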

If one wants to use the interquartile range (IQR) of a given dataset, the standard rule flags anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as an outlier:
def Remove_Outlier_Indices(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    trueList = ~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
    return trueList
Based on the above eliminator function, the subset of non-outliers according to the dataset's statistical content can be obtained:
# Arbitrary Dataset for the Example
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Index List of Non-Outliers
nonOutlierList = Remove_Outlier_Indices(df)
# Non-Outlier Subset of the Given Dataset
dfSubset = df[nonOutlierList]
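One caveat: Remove_Outlier_Indices returns a boolean DataFrame, so df[nonOutlierList] masks outlier cells to NaN rather than dropping their rows. A small sketch to drop the rows instead:
# collapse the boolean frame to a single flag per row, then filter
rowMask = nonOutlierList.all(axis=1)
dfSubset = df[rowMask]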

Use this boolean mask whenever you have this sort of issue:
df = pd.DataFrame({'Data': np.random.normal(size=200)})  # example
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]  # keep only rows within 3 standard deviations of the mean of 'Data'
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]  # or the other way around

I believe you could create a boolean filter marking the outliers and then select the opposite of it:
outliers = np.abs(stats.zscore(df['_source.price'])) > 3
df_without_outliers = df[~outliers]

Related

Proximityhash TypeError: cannot convert the series to <class 'float'>

import proximityhash
# filtering the dataset with the required columns
df_new=df.filter(['latitude', 'longitude','cell_radius'])
# assign the column values to a variable
latitude = df_new['latitude']
longitude = df_new['longitude']
radius= df_new['cell_radius']
precision = 7
# passing the variable as the parameters to the proximityhash library
# getting the values and assigning those to a new column as proximityhash
df_new['proximityhash']=df_new.apply([proximityhash.create_geohash(latitude,longitude,radius,precision=7)])
print(df_new)
I used this code to import a dataset, filter it down to the necessary columns, assign them to three variables (latitude, longitude, radius), and create a new column "proximityhash" in the new dataframe, but it raises the error below:
TypeError Traceback (most recent call last)
Input In [29], in <cell line: 15>()
11 import pygeohash as gh
13 import proximityhash
---> 15 df_new['proximityhash']=df_new.apply([proximityhash.create_geohash(latitude,longitude,radius,precision=7)])
17 print(df_new)
File ~\Anaconda3\lib\site-packages\proximityhash.py:57, in create_geohash(latitude, longitude, radius, precision, georaptor_flag, minlevel, maxlevel)
54 height = (grid_height[precision - 1])/2
55 width = (grid_width[precision-1])/2
---> 57 lat_moves = int(math.ceil(radius / height)) #4
58 lon_moves = int(math.ceil(radius / width)) #2
60 for i in range(0, lat_moves):
File ~\Anaconda3\lib\site-packages\pandas\core\series.py:191, in _coerce_method.<locals>.wrapper(self)
189 if len(self) == 1:
190 return converter(self.iloc[0])
--> 191 raise TypeError(f"cannot convert the series to {converter}")
TypeError: cannot convert the series to <class 'float'>
I figured out a way to solve this; posting the answer since it might be helpful for others.
The fix is to define a function and apply it row by row, so that create_geohash receives plain scalars instead of whole Series:
# filtering the dataset with the required columns
df_new = df[['latitude', 'longitude', 'cell_radius']]
# taking only the first 100 rows (since running the whole process might kill the kernel)
df_new = df_new.iloc[:100, ]
# predefined precision value
precision = 7

def PH(row):
    latitude = row['latitude']
    longitude = row['longitude']
    cell_radius = row['cell_radius']
    row['proximityhash'] = [proximityhash.create_geohash(
        float(latitude), float(longitude), float(cell_radius), precision=7)]
    return row

df_new = df_new.apply(PH, axis=1)
df_new['proximityhash'] = pd.Series(df_new['proximityhash'], dtype="string")
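As a variant on the same idea (a sketch, assuming the same column names), a list comprehension also feeds plain floats to each call and avoids apply entirely:
df_new['proximityhash'] = [
    proximityhash.create_geohash(float(lat), float(lon), float(rad), precision=precision)
    for lat, lon, rad in zip(df_new['latitude'], df_new['longitude'], df_new['cell_radius'])
]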

TypeError: '<' not supported between instances of 'str' and 'int' after converting string to float

Using: Python in Google Colab
Thanks in advance.
I have run this code on other data I scraped from FBREF, so I am unsure why it's happening now. The only difference is the way I scraped it.
The first time I scraped it:
url_link = 'https://fbref.com/en/comps/Big5/gca/players/Big-5-European-Leagues-Stats'
The second time I scraped it:
url = 'https://fbref.com/en/comps/22/stats/Major-League-Soccer-Stats'
html_content = requests.get(url).text.replace('<!--', '').replace('-->', '')
df = pd.read_html(html_content)
I then convert the data from object to float so I can do a calculation, after I have pulled it into my dataframe:
dfstandard['90s'] = dfstandard['90s'].astype(float)
dfstandard['Gls'] = dfstandard['Gls'].astype(float)
I look and it shows they are both floats:
10 90s 743 non-null float64
11 Gls 743 non-null float64
But when I run the code that has worked previously:
dfstandard['Gls'] = dfstandard['Gls'] / dfstandard['90s']
I get the error message "TypeError: '<' not supported between instances of 'str' and 'int'"
I am fairly new to scraping, I'm stuck and don't know what to do next.
The full error message is below:
<ipython-input-152-e0ab76715b7d> in <module>()
1 #turn data into p 90
----> 2 dfstandard['Gls'] = dfstandard['Gls'] / dfstandard['90s']
3 dfstandard['Ast'] = dfstandard['Ast'] / dfstandard['90s']
4 dfstandard['G-PK'] = dfstandard['G-PK'] / dfstandard['90s']
5 dfstandard['PK'] = dfstandard['PK'] / dfstandard['90s']
8 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in _outer_indexer(self, left, right)
261
262 def _outer_indexer(self, left, right):
--> 263 return libjoin.outer_join_indexer(left, right)
264
265 _typ = "index"
pandas/_libs/join.pyx in pandas._libs.join.outer_join_indexer()
TypeError: '<' not supported between instances of 'str' and 'int'
There are two Gls columns in your dataframe. I think you converted only one "Gls" column to float, and when you do dfstandard['Gls'] = dfstandard['Gls'] / dfstandard['90s'], the other, still-object "Gls" column gets picked up as well; thus the error.
Try stripping whitespace from the column names too:
df = df.rename(columns=lambda x: x.strip())
df['90s'] = pd.to_numeric(df['90s'], errors='coerce')
df['Gls'] = pd.to_numeric(df['Gls'], errors='coerce')
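To confirm the duplicate-column theory (a sketch; dfstandard is assumed to be the frame built from read_html), you can inspect and de-duplicate the labels:
# list any repeated column labels
print(dfstandard.columns[dfstandard.columns.duplicated()])
# keep only the first occurrence of each label, assuming the repeats are true duplicates
dfstandard = dfstandard.loc[:, ~dfstandard.columns.duplicated()]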

Cannot convert the series to <class 'int'> pandas

I am working on a prediction model for sports and I encountered this problem.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-27-90981affcf54> in <module>
18 #print("Home",home_rank)
19 #print("Visitor",visitor_rank)
---> 20 row["HomeTeamRanksHigher"] = int(home_rank) > int(visitor_rank)
21 a = int(int(home_rank) < int(visitor_rank))
22 dataset.at[index,"HomeTeamRanksHigher"] = a
~\anaconda3\lib\site-packages\pandas\core\series.py in wrapper(self)
127 if len(self) == 1:
128 return converter(self.iloc[0])
--> 129 raise TypeError(f"cannot convert the series to {converter}")
130
131 wrapper.__name__ = f"__{converter.__name__}__"
TypeError: cannot convert the series to <class 'int'>
Here is my code.
dataset["HomeTeamRanksHigher"]= 0
for index, row in dataset.iterrows():
home_team = row["Home"]
visitor_team = row["Away"]
home_rank = standings[standings["Squad"] == home_team]["Rk"]
visitor_rank = standings[standings["Squad"] == visitor_team]["Rk"]
#print("Home",home_rank)
#print("Visitor",visitor_rank)
row["HomeTeamRanksHigher"] = int(home_rank) > int(visitor_rank)
a = int(int(home_rank) < int(visitor_rank))
dataset.at[index,"HomeTeamRanksHigher"] = a
I checked the dtype of "HomeTeamRanksHigher" after I cast it to int, and it is int64. I used pd.to_numeric and .astype(int) too.
standings["Rk"] = pd.to_numeric(standings["Rk"])
dataset["HomeTeamRanksHigher"] = pd.to_numeric(dataset["HomeTeamRanksHigher"])
Here are my columns data types
Country object
League object
Season object
Date datetime64[ns]
Time object
Home object
Away object
HG int64
AG int64
HomeWin bool
HomeLastWin int64
VisitorLastWin int64
HomeTeamRanksHigher int64
dtype: object
In Standings most of the columns are int64. I cannot post any more code.
Most likely these two selections are returning multiple results for certain teams:
home_rank = standings[standings["Squad"] == home_team]["Rk"]
visitor_rank = standings[standings["Squad"] == visitor_team]["Rk"]
int(home_rank) will work if home_rank is just 1 result, but if home_rank is a multi-value series, it will throw that TypeError.
You can verify by checking:
standings.groupby("Squad").Rk.value_counts().max()
If this output is not 1, then you have some duplicate Squad / Rk pairs.
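One defensive fix (a sketch, assuming the first matching row is the one you want) is to select a scalar explicitly, so a multi-row result can never reach int():
home_rank = standings.loc[standings["Squad"] == home_team, "Rk"].iloc[0]
visitor_rank = standings.loc[standings["Squad"] == visitor_team, "Rk"].iloc[0]
Alternatively, de-duplicate up front with standings.drop_duplicates(subset="Squad").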

Python-Top Ten Function

I'm trying to create a function where the user puts in a year and the output is the top ten countries by expenditure, using this Lynda class as a model.
Here's the data frame
df.dtypes
Country Name object
Country Code object
Year int32
CountryYear object
Population int32
GDP float64
MilExpend float64
Percent float64
dtype: object
Country Name Country Code Year CountryYear Pop GDP Expend Percent
0 Aruba ABW 1960 ABW-1960 54208 0.0 0.0 0.0
I've tried this code and got errors:
Code:
def topten(Year):
    simple = df_details_merged.loc[Year].sort('MilExpend', ascending=False).reset_index()
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1
    return simple

topten(1990)
Can I get some assistance? I can't even figure out what the error is. :-(
This is the rather big error I received:
C:\Users\mycomputer\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: sort is deprecated, use sort_values(inplace=True) for INPLACE sorting
from ipykernel import kernelapp as app
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1738 # if kind==mergesort, it can fail for object dtype
-> 1739 return arr.argsort(kind=kind)
1740 except TypeError:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-105-0c974c6a1b44> in <module>()
----> 1 topten(1990)
<ipython-input-104-b8c336014d5b> in topten(Year)
1 def topten(Year):
----> 2 simple = df_details_merged.loc[Year].sort('MilExpend',ascending=False).reset_index()
3 simple = simple.drop(['Country Code', 'CountryYear'],axis=1).head(10)
4 simple.index = simple.index + 1
5
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort(self, axis, ascending, kind, na_position, inplace)
1831
1832 return self.sort_values(ascending=ascending, kind=kind,
-> 1833 na_position=na_position, inplace=inplace)
1834
1835 def order(self, na_last=None, ascending=True, kind='quicksort',
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort_values(self, axis, ascending, inplace, kind, na_position)
1751 idx = _default_index(len(self))
1752
-> 1753 argsorted = _try_kind_sort(arr[good])
1754
1755 if not ascending:
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1741 # stable sort not available for object dtype
1742 # uses the argsort default quicksort
-> 1743 return arr.argsort(kind='quicksort')
1744
1745 arr = self._values
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
The first argument to .loc is the row label.
When you call df_details_merged.loc[1960], pandas will find the row with the label 1960 and return that row as a Series. So you get back a Series with the index Country Name, Country Code, ..., with the values being the values from that row. Then your code tries to sort this by MilExpend, and that's where it fails.
What you need isn't loc but a simple condition: df[df.Year == Year]. That is, "give me the whole dataframe, but only the rows where the Year column contains whatever I've specified in the Year variable" (1990 in your example).
sort will still work for the time being, but is being deprecated, so use sort_values instead. Putting that together:
simple = df_details_merged[df_details_merged.Year == Year].sort_values(by='MilExpend', ascending=False).reset_index()
Then you can go ahead and drop the columns, and fetch the top 10 rows as you're doing now.
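Putting that together as a full sketch (same column names as above):
def topten(year):
    # select the rows for the requested year, then rank by expenditure
    simple = (df_details_merged[df_details_merged.Year == year]
              .sort_values(by='MilExpend', ascending=False)
              .reset_index(drop=True))
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1  # rank from 1 to 10
    return simple

topten(1990)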

geopandas AttributeError: 'MultiPolygon' object has no attribute 'exterior'

I have two GeoDataFrames. One is of the state of Iowa, while the other is of forecasted rain over the next 72 hours for North America. I want to create a GeoDataFrame of the rain forecast where it overlies the state of Iowa, but I get an error.
state_rain = gpd.overlay(NA_rain,iowa,how='intersection')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-ba8264ed63c2> in <module>()
3 #ws_usa[['WTRSHD_ID','QPF']].groupby('WTRSHD_ID').max().reset_index()
4 #state_rain = sjoin(usa_r,usa,how='inner',op='intersects')
----> 5 state_rain = gpd.overlay(usa_r,joined_states,how='intersection')
6 ws_state = gpd.overlay(ws,joined_states,how='intersection')
7 #print ws_usa.loc[ws_usa.WTRSHD_ID == 'IA-04']['QPF']
C:\Anaconda2\lib\site-packages\geopandas\tools\overlay.pyc in overlay(df1, df2, how, use_sindex)
95
96 # Collect the interior and exterior rings
---> 97 rings1 = _extract_rings(df1)
98 rings2 = _extract_rings(df2)
99 mls1 = MultiLineString(rings1)
C:\Anaconda2\lib\site-packages\geopandas\tools\overlay.pyc in _extract_rings(df)
50 # geom from layer is not valid attempting fix by buffer 0"
51 geom = geom.buffer(0)
---> 52 rings.append(geom.exterior)
53 rings.extend(geom.interiors)
54
AttributeError: 'MultiPolygon' object has no attribute 'exterior'
I checked for type == 'MultiPolygon', but neither GeoDataFrame contains any.
print NA_rain[NA_rain.geometry.type == 'MulitPolygon']
print iowa[iowa.geometry.type == 'MultiPolygon']
Empty GeoDataFrame
Columns: [END_TIME, ID, ISSUE_TIME, PRODUCT, QPF, START_TIME, UNITS, VALID_TIME, geometry]
Index: []
Empty GeoDataFrame
Columns: [sid, AFFGEOID, ALAND, AWATER, GEOID, LSAD, NAME, STATEFP, STATENS, STUSPS, geometry]
Index: []
If I do the following, the intersection works.
NA_rain.geometry = NA_rain.geometry.map(lambda x: x.convex_hull)
My question is twofold: 1. Why don't any MultiPolygons show up in my NA_rain GeoDataFrame? 2. Besides turning every Polygon into a convex_hull, which ruins the detailed contours of the Polygon, how would you suggest dealing with the MultiPolygon issue?
I agree with @jdmcbr. I suspect that at least one of the features in NA_rain is a MultiPolygon which did not get detected, since the condition you showed is misspelled (MulitPolygon instead of MultiPolygon).
If your dataframe has MultiPolygons, you can convert all of them to Polygons. One dirty way is to turn each MultiPolygon into a list of its component Polygons and then explode it into multiple rows:
from shapely.geometry import MultiPolygon

geom = NA_rain.pop('geometry')
geom = geom.apply(lambda x: list(x) if isinstance(x, MultiPolygon) else x).explode()
NA_rain = NA_rain.join(geom, how='inner')
Note that the join in the last line duplicates the other attributes of the dataframe for every Polygon of a MultiPolygon, including feature identifiers, which you may want to change later, depending on your task.
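If your geopandas version provides it, GeoDataFrame.explode does the same thing in one step (a sketch; assumes a reasonably recent geopandas):
# split every multi-part geometry into single-part rows, duplicating the attributes
NA_rain = NA_rain.explode().reset_index(drop=True)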
