I have a question regarding combining functions.
My goal is to apply two filters at the same time. Basically, I want to trim extreme values from my dataset by cutting everything below the 5% quantile at the low end and the top 5% at the other end.
df = df[df.temperature >= df.temperature.quantile(.05)]
gets me values that are above the 5% quantile
df = df[df.temperature <= df.temperature.quantile(.95)]
gets me all the values that are below the 95% quantile.
My current problem is that
df = df[df.temperature >= df.temperature.quantile(.05)]
df = df[df.temperature <= df.temperature.quantile(.95)]
works but it's not precise because the 2nd function builds on top of the previous cut. So how can I cut both at once?
df = df[df.temperature >= df.temperature.quantile(.05) & <= df.temperature.quantile(.95)]
does not work.
Thanks for support!
Solved:
df = df[(df.temperature >= df.temperature.quantile(.05)) & (df.temperature <= (df.temperature.quantile(.95)))]
You need parentheses around the conditions due to operator precedence:
df = df[(df.temperature >= df.temperature.quantile(.05)) & (df.temperature <= df.temperature.quantile(.95))]
The docs show that >= has lower precedence than &, so you need the parentheses. Besides, your code should have raised an error rather than run.
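To make the precedence point concrete, here is a minimal sketch. The integer Series `s` is made-up data, not the asker's; the plain-integer lines only show how Python parses the expression.

```python
import pandas as pd

# Plain integers show the parse: & binds tighter than >=.
unparenthesized = 2 >= 3 & 4   # parsed as 2 >= (3 & 4), i.e. 2 >= 0
parenthesized = (2 >= 3) & 4   # False & 4, i.e. 0

s = pd.Series([1, 2, 3, 4, 5])  # hypothetical integer data

try:
    # Parsed as s >= (2 & s) <= 4, a chained comparison that needs bool(Series).
    s >= 2 & s <= 4
    raised = False
except (ValueError, TypeError):
    # pandas is expected to complain that the truth value of a Series is
    # ambiguous (ValueError); TypeError is caught only as a precaution.
    raised = True

# With parentheses, each comparison is evaluated first, then combined elementwise.
both = (s >= 2) & (s <= 4)
print(both.tolist())
```

The parenthesized form is the only one that gives an elementwise boolean mask.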
Code-style-wise, it is more readable to store your conditions in variables, so I would rewrite it like this:
low_limit = df.temperature >= df.temperature.quantile(.05)
upper_limit = df.temperature <= df.temperature.quantile(.95)
then your filtering becomes:
df[(low_limit) & (upper_limit)]
You could optionally change
low_limit = df.temperature >= df.temperature.quantile(.05)
to
low_limit = (df.temperature >= df.temperature.quantile(.05))
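so each condition reads as a single unit. Either way, once the conditions are stored in variables you don't need the parentheses in the filtering: df[low_limit & upper_limit] works.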
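Another way to make sure both cuts use quantiles of the same, untrimmed data is to compute the two bounds first and filter once. This is only a sketch: the synthetic `temperature` column and the names `low`, `high`, `trimmed`, and `mask_version` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"temperature": rng.normal(20, 5, size=1000)})

# Compute both bounds on the full column before any filtering.
low = df["temperature"].quantile(0.05)
high = df["temperature"].quantile(0.95)

# Series.between is inclusive on both ends by default,
# so it matches (col >= low) & (col <= high).
trimmed = df[df["temperature"].between(low, high)]
mask_version = df[(df["temperature"] >= low) & (df["temperature"] <= high)]
```

Because `low` and `high` are fixed before filtering, the second cut no longer builds on the first.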
I am removing outliers from a dataset.
I decided to remove outliers from each column one by one. My columns have different numbers of missing values.
I used this code, but it removed the whole row containing the outlier, and because my data has many NaN values, the number of rows dropped drastically.
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
Then I decided to handle each column separately and replace the outliers in that column with NaN instead of dropping rows.
I wrote this code:
def remove_outlier(df_in, col_name, thres=1.5):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low = q1-thres*iqr
    fence_high = q3+thres*iqr
    mask = (df_in[col_name] > fence_high) & (df_in[col_name] < fence_low)
    df_in.loc[mask, col_name] = np.nan
    return df_in
But this code doesn't filter the outliers; it gives the same result.
What is wrong in this code? How can I correct it?
Is there any other elegant method to filter outlier?
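Check the condition in your mask. How can it be &? A value cannot be above fence_high and below fence_low at the same time, so the mask is always False. It should be |.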
df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
In this snippet, you select rows where both df_in[col_name] > fence_low and df_in[col_name] < fence_high hold, so every time one of these conditions is not met, the row is removed.
As a general rule, if you have a column with 30% outliers, 30% of your dataset will disappear, and you have two options:
1. Fill the missing values (ffill, mean, a constant value, ...).
2. Or drop the feature if it is not mandatory, because sometimes it is better to drop a feature than to shrink your dataset too much.
Hope it helps
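Putting the two answers together, a minimal sketch of the NaN-replacement version might look like this. The function name `mask_outliers`, the choice to work on a copy, and the sample frame are assumptions for illustration, not the asker's code.

```python
import numpy as np
import pandas as pd

def mask_outliers(df_in, col_name, thres=1.5):
    """Return a copy of df_in with IQR outliers in col_name replaced by NaN."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - thres * iqr
    fence_high = q3 + thres * iqr
    # An outlier lies below the low fence OR above the high fence, hence |.
    is_outlier = (df_in[col_name] < fence_low) | (df_in[col_name] > fence_high)
    df_out = df_in.copy()  # leave the caller's frame untouched
    df_out.loc[is_outlier, col_name] = np.nan
    return df_out

# Hypothetical data: 100.0 is the only outlier in column "x".
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [10.0, np.nan, 12.0, 11.0, 13.0],
})
cleaned = mask_outliers(data, "x")
```

Only the offending cell becomes NaN, so the row count stays the same and the other columns keep their values.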
I have a dataframe in which I'm trying to create a binary 1/0 column when certain conditions are met. The code I'm using is as follows:
sd_threshold = 5
df1["signal"] = np.where(np.logical_and(df1["high"] >= df1["break"],
                                        df1["low"] <= df1["break"],
                                        df1["sd_round"] > sd_threshold), 1, 0)
The code returns TypeError: return arrays must be of ArrayType when the last condition df1["sd_round"] > sd_threshold is included, otherwise it works fine. There isn't any issue with the data in the df1["sd_round"] column.
Any insight would be much appreciated, thank you!
Check the documentation: np.logical_and() compares the first two arguments you give it and writes the output to the third (the out parameter). You could use a nested call, but I would just go with & (pandas boolean indexing):
df1["signal"] = np.where((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold),
1, 0)
EDIT: you could actually just skip numpy and cast your boolean Series to int to yield 1s and 0s:
mask = ((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold))
df1["signal"] = mask.astype(int)
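If you would rather stay with NumPy for more than two conditions, np.logical_and.reduce accepts a sequence of conditions and ANDs them together. A small sketch, with made-up numbers in df1:

```python
import numpy as np
import pandas as pd

# Hypothetical data with the column names from the question.
df1 = pd.DataFrame({
    "high": [10, 12, 9],
    "low": [8, 11, 7],
    "break": [9, 10, 8],
    "sd_round": [6, 7, 3],
})
sd_threshold = 5

conditions = [
    df1["high"] >= df1["break"],
    df1["low"] <= df1["break"],
    df1["sd_round"] > sd_threshold,
]

# reduce applies logical_and across the list, row by row,
# and returns a boolean NumPy array with one entry per row.
df1["signal"] = np.logical_and.reduce(conditions).astype(int)
```

This scales to any number of conditions without nesting calls.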
Using pandas and numpy, how can I achieve the following:
df['thecol'] = np.where(
(df["a"] >= df["a"].shift(1)) &
(df["a"] >= df["a"].shift(2)) &
(df["a"] >= df["a"].shift(3)) &
(df["a"] >= df["a"].shift(4)) &
(df["a"] >= df["a"].shift(5)) &
(df["a"] >= df["a"].shift(6)) &
(df["a"] >= df["a"].shift(7)) &
(df["a"] >= df["a"].shift(8)) &
(df["a"] >= df["a"].shift(9)) &
(df["a"] >= df["a"].shift(10))
,'istrue','isnottrue')
How can I do this without such ugly repetition, given that only the shift number changes? I would like the same code to work for any number I provide, without typing it all out manually.
It is meant to compare the current value in column "a" to the value in the same column one row above, two rows above, etc., and result in "istrue" if all of these conditions are true.
I tried shifting the dataframe in a for loop, appending the values to a list, and calculating the maximum so that I only need (df["a"] >= maxvalue) once, but it wouldn't work for me either. I am a novice at Python and will likely ask more silly questions in the near future.
This works, but I would like it to also work without this much repetitive code so I can learn to code properly. I tried examples with a yield generator but could not manage to get that working either.
#Edit:
Answered by Wen. I needed rolling.
In the end I came up with this terrible terrible approach:
def whereconditions(n):
    s1 = 'df["thecol"] = np.where('
    L = []
    while n > 0:
        s2 = '(df["a"] >= df["a"].shift('+str(n)+')) &'
        L.append(s2)
        n = n -1
    s3 = ",'istrue','isnottrue')"
    r = s1+str([x for x in L]).replace("'","").replace(",","").replace("&]","")+s3
    return str(r.replace("([(","(("))
call = whereconditions(10)
exec(call)
Sounds like you need rolling. The window has to cover the current row plus the 10 rows above it, so it is 11:
np.where(df['a']==df['a'].rolling(11).max(),'istrue','isnottrue')
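As a sanity check on the rolling idea, the sketch below builds the explicit shift comparison for an arbitrary n and compares it with a rolling maximum over a window of n + 1 rows (the current row plus the n rows above it). The sample values and the names `shift_mask` and `rolling_mask` are made up for illustration.

```python
import functools
import operator

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 2, 5, 4, 6, 6, 2, 7, 1, 8, 9]})  # hypothetical data
n = 3  # number of rows above the current one to compare against

# Explicit version: AND together a >= a.shift(k) for k = 1..n.
shift_mask = functools.reduce(
    operator.and_,
    [df["a"] >= df["a"].shift(k) for k in range(1, n + 1)],
)

# Rolling version: the current value equals the maximum of a window
# that holds the current row and the n rows above it.
rolling_mask = df["a"] == df["a"].rolling(n + 1).max()

df["thecol"] = np.where(rolling_mask, "istrue", "isnottrue")
```

In both versions the first n rows come out as "isnottrue", because comparisons against the NaN values from shift or from an incomplete window are False.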
I have a pandas dataframe with a column that indicates which hour of the day a particular action was performed. So df['hour'] is many rows each with a value from 0 to 23.
I am trying to create dummy variables for things like 'is_morning', for example:
if df['hour'] >= 5 and < 12 then return 1, else return 0
A for loop doesn't work given the size of the data set, and I've tried some other stuff like
df['is_morning'] = df['hour'] >= 5 and < 12
Any suggestions??
You can just do:
df['is_morning'] = (df['hour'] >= 5) & (df['hour'] < 12)
i.e. wrap each condition in parentheses, and use &, which is an and operation that works across the whole vector/column.
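Since the question asks for 1/0 rather than True/False, the boolean column can be cast to int. A small sketch with made-up hours:

```python
import pandas as pd

df = pd.DataFrame({"hour": [0, 4, 5, 11, 12, 23]})  # hypothetical sample

# Elementwise test: 5 <= hour < 12.
is_morning = (df["hour"] >= 5) & (df["hour"] < 12)

# Cast True/False to 1/0 to get a dummy variable.
df["is_morning"] = is_morning.astype(int)
```

The comparison runs on the whole column at once, so no Python-level loop is needed.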
I have a large pandas DataFrame that contains about two months of information, with one line of info per second. That is way too much information to deal with at once, so I want to grab specific timeframes. The following code will grab everything before February 5th 2012:
sunflower[sunflower['time'] < '2012-02-05']
I want to do the equivalent of this:
sunflower['2012-02-01' < sunflower['time'] < '2012-02-05']
but that is not allowed. Now I could do this with these two lines:
step1 = sunflower[sunflower['time'] < '2012-02-05']
data = step1[step1['time'] > '2012-02-01']
but I have to do this with 20 different DataFrames and a multitude of times and being able to do this easily would be nice. I know pandas is capable of this because if my dates were the index rather than a column, it's easy to do, but they can't be the index because dates are repeated and therefore you receive this error:
Exception: Reindexing only valid with uniquely valued Index objects
So how would I go about doing this?
You could define a mask separately:
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
mask = (df.b > -.5) & (df.b < .5)
df_masked = df[mask]
Or in one line:
df_masked = df[(df.b > -.5) & (df.b < .5)]
You can use query for a more concise option:
df.query("'2012-02-01' < time < '2012-02-05'")
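Since the question mentions repeating this for 20 DataFrames, the mask can be wrapped in a small helper. This is a sketch: the function name `between_times`, its parameters, and the sample rows are hypothetical, and it assumes the column can be parsed by pd.to_datetime.

```python
import pandas as pd

def between_times(frame, start, end, column="time"):
    """Return the rows of frame whose column lies strictly between start and end."""
    times = pd.to_datetime(frame[column])
    mask = (times > pd.Timestamp(start)) & (times < pd.Timestamp(end))
    return frame[mask]

# Hypothetical sample with a repeated timestamp, as in the question.
sunflower = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-01-31 23:59:59",
        "2012-02-01 00:00:01",
        "2012-02-03 12:00:00",
        "2012-02-03 12:00:00",
        "2012-02-05 00:00:00",
    ]),
    "value": [1, 2, 3, 4, 5],
})

data = between_times(sunflower, "2012-02-01", "2012-02-05")
```

Repeated timestamps are not a problem here because the filter works on a column, not on the index.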