Removing outliers from a single column - Python

I am removing outliers from a dataset.
I decided to remove outliers from each column one by one. My columns have different numbers of missing values.
I first used this code, but it removed the whole row containing the outlier, and because of the many NaN values in my data, the number of rows dropped drastically.
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
Then I decided to handle each column separately and replace the outliers in each column with NaN instead, so I wrote this code:
def remove_outlier(df_in, col_name, thres=1.5):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - thres * iqr
    fence_high = q3 + thres * iqr
    mask = (df_in[col_name] > fence_high) & (df_in[col_name] < fence_low)
    df_in.loc[mask, col_name] = np.nan
    return df_in
But this code doesn't filter the outliers; it returns the data unchanged.
What is wrong with this code? How can I correct it?
Is there a more elegant way to filter outliers?

Check the condition again: no value can be above the high fence and below the low fence at the same time, so your mask is always all False. The & should be |.
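With that one-character change, the second function replaces the outliers with NaN as intended (same code as in the question, only the mask line differs):
import numpy as np

def remove_outlier(df_in, col_name, thres=1.5):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - thres * iqr
    fence_high = q3 + thres * iqr
    # a value is an outlier when it lies outside either fence, hence the OR
    mask = (df_in[col_name] > fence_high) | (df_in[col_name] < fence_low)
    df_in.loc[mask, col_name] = np.nan
    return df_in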

df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
In this snippet from your first function, you keep only the rows where df_in[col_name] > fence_low and df_in[col_name] < fence_high, so every time one of these conditions is not met, the whole row is removed.
As a general rule, if a column has 30% outliers, 30% of your dataset will disappear this way, so you have two options:
1. Fill the missing values instead (forward fill, mean, a constant value, ...), as sketched below.
2. Drop the feature entirely, if it is not mandatory; sometimes it is better to drop one feature than to shrink your dataset too much.
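A minimal sketch of option 1, assuming a hypothetical column name 'col' whose outliers have already been replaced with NaN:
# 'col' is a placeholder; use the column you just cleaned
df['col'] = df['col'].fillna(df['col'].mean())  # fill with the column mean (or any constant)
# or forward fill from the previous row
df['col'] = df['col'].ffill()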
Hope it helps

Related

Attempting to find location of values in a pandas Dataframe if certain conditions are met

I have this DataFrame.
High Close
Close Time
2022-10-23 21:41:59.999 19466.02 19461.29
2022-10-23 21:42:59.999 19462.48 19457.83
2022-10-23 21:43:59.999 19463.13 19460.09
2022-10-23 21:44:59.999 19465.15 19463.76
I'm attempting to check whether Close at a later date (up to 600 rows later, but no more) goes above the Close of an earlier date while High stays lower than the High of that same earlier date. If so, I want to get the location of both the earlier and the later date and write those values into new columns of the DataFrame.
Expected output:
High Close LC HC HH LH
Close Time
2022-10-23 21:41:59.999 19466.02 19461.29 19461.29 NaN 19466.02 NaN
2022-10-23 21:42:59.999 19462.48 19457.83 NaN NaN NaN NaN
2022-10-23 21:43:59.999 19463.13 19460.09 NaN NaN NaN NaN
2022-10-23 21:44:59.999 19465.15 19463.76 NaN 19463.76 NaN 19465.15
This is the code I have tried
# Checking if conditions are met
for i in range(len(df)):
    for a in range(i, 600):
        if (df.iloc[i:, 1] < df.iloc[a, 1]) & (df.iloc[i:, 0] > df.iloc[a, 0]):
            # Creating new DataFrame columns
            df['LC'] = df.iloc[i, 1]
            df['HC'] = df.get_loc[i, 1]
            df['HH'] = df.get_loc[a, 0]
            df['LH'] = df.get_loc[a, 0]
        else:
            continue
This line: if (df.iloc[i:, 1] < df.iloc[a, 1]) & (df.iloc[i:, 0] > df.iloc[a, 0]):
Is causing error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I believe I should be using any() instead of a plain if statement, but I am unsure how to apply it. I also think there may be an issue with the way I am using df.get_loc[], but I am not sure. I'm a pandas beginner, so if it is obvious I apologize.
Here is an image to help visualise what I am attempting to do using a candlestick chart
What I want to do is check whether HC is higher than LC and LH is lower than HH, and then add that data to new columns in the DataFrame.
Here is an additional way I tried to achieve the desired output
idx_close, idx_high = map(df.columns.get_loc, ["Close", "High"])
# Check conditions
for i in range(len(df)):
    bool_l = [((df.iloc[i, idx_close] < df.iloc[a, idx_close]) &
               (df.iloc[i, idx_high] > df.iloc[a, idx_high])
              ).any() for a in range(i, 600)]
    # Creating new DataFrame columns
    df.loc[i, 'LC'] = df.iloc[i, 1]
    df.loc[bool_l, 'HC'] = df.iloc[bool_l, 1]
    # Creating new DataFrame columns
    df.loc[i, 'HH'] = df.iloc[i, 0]
    df.loc[bool_l, 'LH'] = df.iloc[bool_l, 0]
And I get the error IndexError: Boolean index has wrong length: 600 instead of 2867
on the line df.loc[bool_l, 'HC'] = df.iloc[bool_l, 1].
I assume the error comes from range(i, 600), but I don't know how to get around it.
As mentioned by @Jimpsoni, the question is a little unclear about what you mean by LC, HC, HH and LH. I will use the definitions below to answer your question:
C_t is the close price on a given day, t
H_t is the high price on a given day, t
Then if I understand correctly, you want to check the following two conditions:
close condition: is there a future close price in the next 600 days which is higher than the current close price?
high condition: is there a future high price in the next 600 days which is lower than the high price from today?
And then for each day t, you want to find the first day in the future where both the conditions above are satisfied simultaneously.
With this understanding, I sketch a solution below.
Part 1: Setting up sample data (do this in your question next time)
import pandas as pd
import numpy as np
np.random.seed(2022)
# make example data
close = np.sin(range(610)) + 10
high = close + np.random.rand(*close.shape)
close[2] += 100 # to ensure the close condition cannot be met
dates = pd.date_range(end='2022-06-30', periods=len(close))
# insert into pd.dataframe
df = pd.DataFrame(index=dates, data=np.array([close, high]).T, columns=['close', 'high'])
The sample dataframe looks like below:
Part 2: Understand how to use rolling functions
Since at each point in time you want to subset your dataframe to a particular view in a rolling manner (always looking 600 rows forward), it makes sense to use the built-in .rolling method of pandas dataframes here. The .rolling() method is similar to a groupby in the sense that it lets you iterate over different subsets of the dataframe without explicitly writing the loop. rolling looks backwards by default, so we need a forward-looking indexer to get the forward window. Note that you can also achieve the same forward window with some shifting, but it is less intuitive. The chunk below demonstrates that both methods give the same result:
# create the rolling window object
# the simplest solution is to use the fixed size forward window as below
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=600)
method_1 = df.rolling(indexer, min_periods=600).sum()
# can also achieve the same thing by doing the rolling computation and then shifting backward
method_2 = df.rolling(600, min_periods=600).sum().shift(-599)
# confirm they are the same
(method_1.dropna() == method_2.dropna()).mean().mean() # returns 1.0
Part 3: Write the logic that will be applied to each rolling view
Once we have the rolling frame, we can simply apply a function to each rolling view of the dataframe. Here is the function; some comments follow below:
def check_conditions(ser, df_full):
    df_chunk = df_full.loc[ser.index, :]
    today_close, today_high = df_chunk.iloc[0, :2]
    future_close_series, future_high_series = df_chunk.iloc[1:, 0], df_chunk.iloc[1:, 1]
    close_condition = today_close < future_close_series
    high_condition = today_high > future_high_series
    both_condition = close_condition & high_condition
    if both_condition.sum() == 0:
        return 1
    first_date_satisfied = future_close_series.index[np.where(both_condition)][0]
    future_vals = df_chunk.loc[first_date_satisfied, ['close', 'high']].values
    df_full.loc[ser.index[0], ['Future_Close', 'Future_High', 'Future_Date']] = np.append(future_vals, first_date_satisfied)
    return 1
Some comments: First, notice that the function takes two arguments: the first is a Series, and the second is the full dataframe. Unfortunately, .rolling() only acts on a single column/row at a time, in contrast to .groupby(), which gives access to the full dataframe produced by each group. To get around this, I use a pattern proposed in this answer: the index of the series (the rolling view on a single column of the dataframe) is used to subset the full dataframe appropriately. The conditions are then checked on that appropriately subset dataframe, with all columns available, and that dataframe is modified when the conditions match. This may not be particularly memory efficient, as pointed out in the comments to the answer linked above.
Final part: run the function
Here we set up the df_full dataframe, and then use rolling and apply the function to generate the output:
df_out = df.copy()
df_out[['Future_Close','Future_High','Future_Date']] = np.nan
_ = df.rolling(indexer, min_periods=600).apply(lambda x: check_conditions(x, df_out))
df_out
In df_out, the extra columns will be NaN if no day in the next 600 days meets the criteria. If a day does meet the criteria, the close and high prices on that day, as well as the date, are filled in.
Output looks like:
Hope it helps.
Using a regular for loop to iterate over a dataframe is slow; instead you should use pandas' built-in methods, because they are much faster.
Series has a built-in method diff() that subtracts the previous value from each value, so you can build two new Series whose sign tells you whether today's value is higher or lower than the previous one:
# You need numpy for this, if already imported, ignore this line
import numpy as np
# Make series for close and high
close = df["Close"].diff()
high = df["High"].diff()
I think you want a lower High value together with a higher Close value, but with this logic alone it doesn't generate the desired output.
# Then create new columns
df['LC'] = np.where((close > 0) & (high < 0), np.nan, df["Close"])
df['HC'] = np.where((close > 0) & (high < 0), np.nan, df["High"])
df['HH'] = np.where((close > 0) & (high < 0), np.nan, df["Close"])
df['LH'] = np.where((close > 0) & (high < 0), np.nan, df["High"])
If you provide more information about what LC, HC, HH, LH are supposed to be, or provide more examples, I can help you get the correct results.

How to assess all values of a row in a pandas dataframe and write into a new column

I have a pandas dataframe of 62 rows x 10 columns. Each row contains numbers, and if any of the numbers in a row fall within a certain range, a string should be written to the last column.
I have unsuccessfully tried the .apply method with a function to make the assessment. I have also tried importing the data as a Series, but then the .apply method causes problems because it is a list.
df = pd.read_csv(results)
For example, in the image attached, if any value from Base 2019 to FY26 Load is between 0.95 and 1.05, return 'Acceptable' in the last column, otherwise return 'Not Acceptable'.
Any help, even a start would be much appreciated.
This should perform as expected:
results = "input.csv"
df = pd.read_csv(results)
low = 0.95
high = 1.05
# The columns to check
cols = df.columns[2:]
df['Acceptable?'] = ((df[cols] > low) & (df[cols] < high)).any(axis=1)
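If the literal strings from the question are wanted instead of booleans, the boolean column can be mapped afterwards (a small follow-up sketch using the same column name):
df['Acceptable?'] = df['Acceptable?'].map({True: 'Acceptable', False: 'Not Acceptable'})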

Why is there no order invariance between the two sets of operations?

I'm handling a CSV file/pandas dataframe, where the first column contains the date.
I want to do some conversion here to datetime, some filtering, sorting and reindexing.
What I experience is that if I change the order of the two sets of operations, I get different results (the result of the first configuration is bigger than the other one). Probably the first one is the "good" one.
Can anyone tell me, which sub-operations cause the difference between the results?
Which of those is the "bad" and which is the "good" solution?
Is it possible to secure order independence, so that the user can call these two methods in any order and still get the good results? (Is it possible to get the good results by implementing interchangeable sets of operations?)
jdf1 = x.copy(deep=True)
jdf2 = x.copy(deep=True)
interval = [DATE_START, DATE_END]
dateColName = "Date"
Configuration 1:
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf1.drop_duplicates(subset=dateColName, inplace=True)
jdf1.sort_values(dateColName, inplace=True)
jdf1.reset_index(drop=True, inplace=True)
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf1[dateColName] = pd.to_datetime(jdf1[jdf1.columns[0]], format="%Y-%m-%d")
maskL = jdf1[dateColName] < interval[0]
maskR = jdf1[dateColName] > interval[1]
mask = maskL | maskR
jdf1.drop(jdf1[mask].index, inplace=True)
vs.
Configuration 2:
# Operation set 2: converting column type and filtering the rows in case the CSV's contents cover a wider interval
jdf2[dateColName] = pd.to_datetime(jdf2[jdf2.columns[0]], format="%Y-%m-%d")
maskL = jdf2[dateColName] < interval[0]
maskR = jdf2[dateColName] > interval[1]
mask = maskL | maskR
jdf2.drop(jdf2[mask].index, inplace=True)
# Operation set 1: dropping duplicates, sorting and reindexing the table
jdf2.drop_duplicates(subset=dateColName, inplace=True)
jdf2.sort_values(dateColName, inplace=True)
jdf2.reset_index(drop=True, inplace=True)
Results:
val1 = set(jdf1["Date"].values)
val2 = set(jdf2["Date"].values)
# bigger:
val1 - val2
# empty:
val2 - val1
Thank you for your help!
At first glance the two configurations look the same, but they are NOT.
There are two different filtering steps here, and they affect each other:
Configuration 1:
drop_duplicates() -> removes M rows, leaving ALL rows - M
boolean indexing with mask -> removes N rows, leaving ALL - M - N
Configuration 2:
boolean indexing with mask -> removes K rows, leaving ALL rows - K
drop_duplicates() -> removes L rows, leaving ALL - K - L
In general K != M and L != N.
So if you swap these operations, the result can be different, because both steps remove rows and the order of calling them matters: some rows are removed only by drop_duplicates, some rows only by the boolean indexing.
In my opinion both methods are right; it depends on what you need.
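To make the counts concrete, here is a tiny, hypothetical illustration (made-up data, not the asker's CSV) of how the number of rows removed by each step depends on the order; in this toy case M = 1, N = 1 for the first order and K = 2, L = 0 for the second:
import pandas as pd

x = pd.DataFrame({"Date": pd.to_datetime(["2019-01-01", "2019-01-01", "2020-06-01"])})
interval = [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-12-31")]

def keep_interval(df):
    # drop the rows whose Date falls outside the interval, as in the question
    mask = (df["Date"] < interval[0]) | (df["Date"] > interval[1])
    return df.drop(df[mask].index)

# Configuration 1: drop_duplicates first (3 -> 2 rows), then the interval filter (2 -> 1 row)
jdf1 = keep_interval(x.drop_duplicates(subset="Date"))
# Configuration 2: the interval filter first (3 -> 1 row), then drop_duplicates (1 -> 1 row)
jdf2 = keep_interval(x).drop_duplicates(subset="Date")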

How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas, and I have a pandas dataframe of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value to be equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly, about 5 rows are processed per second, and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this or just make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should give my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
edit:
per your comment, change your initialization of predicted_series to be all NaN and then forward fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.ffill()
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward filled values and run faster than my original recommendation
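Putting these pieces together, a self-contained sketch of the vectorized replacement for the loop might look like this (the tiny DataFrame is made-up sample data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50, 60]})

# previous B wherever A repeats the previous A, NaN elsewhere, then forward fill
df['predicted_series'] = np.nan
df.loc[df['A'] == df['A'].shift(), 'predicted_series'] = df['B'].shift()
df['predicted_series'] = df['predicted_series'].ffill()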
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()

Complex Filtering of DataFrame

I've just started working with Pandas and I am trying to figure out whether it is the right tool for my problem.
I have a dataset:
date, sourceid, destid, h1..h12
I am basically interested in the sum of each of the H1..H12 columns, but I need to exclude multiple ranges from the dataset.
Examples would be to: exclude H4, H5, H6 data where sourceid = 4944, and exclude H8, H9-H12 where destination = 481981, and ... this can go on for many, many filters, as we are constantly removing data to get close to our final model.
I think I saw in a solution that I could build a list of the filters I would want and then create a function to test against, but I haven't found a good example to work from.
My initial thought was to create a copy of the df and just remove the data we didn't want, and if we needed it back we could just copy it back in from the original df, but that seems like the wrong road.
By using masks, you don't have to remove data from the dataframe. E.g.:
mask1 = df.sourceid == 4944
var1 = df.loc[mask1, ['H4', 'H5', 'H6']].sum()
Or directly do:
var1 = df.loc[df.sourceid == 4944, ['H4', 'H5', 'H6']].sum()
In case of multiple filters, you can combine the Boolean masks with Boolean operators:
totmask = mask1 & mask2
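For instance, a rough sketch of the two exclusions from the question (the column names H4..H12, sourceid and destid are taken from the description) that never drops anything from the dataframe:
# H4..H6 totals, ignoring the rows where sourceid == 4944
h4_h6 = df.loc[df.sourceid != 4944, ['H4', 'H5', 'H6']].sum()
# H8..H12 totals, ignoring the rows where destid == 481981
h8_h12 = df.loc[df.destid != 481981, ['H8', 'H9', 'H10', 'H11', 'H12']].sum()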
You can use DataFrame.loc[] to set the data to zeros.
Create a dummy DataFrame first:
N = 10000
df = pd.DataFrame(np.random.rand(N, 12), columns=["h%d" % i for i in range(1, 13)], index=["row%d" % i for i in range(1, N+1)])
df["sourceid"] = np.random.randint(0, 50, N)
df["destid"] = np.random.randint(0, 50, N)
Then for each of your filter you can call:
df.loc[df.sourceid == 10, "h4":"h6"] = 0
Since you have 600k rows, creating a mask array with df.sourceid == 10 may be slow. You can instead create Series objects that map each value to the index of the DataFrame:
sourceid = pd.Series(df.index.values, index=df["sourceid"].values).sort_index()
destid = pd.Series(df.index.values, index=df["destid"].values).sort_index()
and then exclude h4,h5,h6 where sourceid == 10 by:
df.loc[sourceid[10], "h4":"h6"] = 0
to find row ids where sourceid == 10 and destid == 20:
np.intersect1d(sourceid[10].values, destid[20].values, assume_unique=True)
to find row ids where 10 <= sourceid <= 12 and 3 <= destid <= 5:
np.intersect1d(sourceid.loc[10:12].values, destid.loc[3:5].values, assume_unique=True)
sourceid and destid are Series with duplicated index values. When the index values are sorted, pandas uses searchsorted to find the labels; that is O(log N), which is faster than creating mask arrays, which is O(N).
