Pandas dataframe numpy where multiple conditions - python

Using pandas and numpy. How may I achieve the following:
df['thecol'] = np.where(
(df["a"] >= df["a"].shift(1)) &
(df["a"] >= df["a"].shift(2)) &
(df["a"] >= df["a"].shift(3)) &
(df["a"] >= df["a"].shift(4)) &
(df["a"] >= df["a"].shift(5)) &
(df["a"] >= df["a"].shift(6)) &
(df["a"] >= df["a"].shift(7)) &
(df["a"] >= df["a"].shift(8)) &
(df["a"] >= df["a"].shift(9)) &
(df["a"] >= df["a"].shift(10))
,'istrue','isnottrue')
without such ugly repetition, when only the shift number changes? I would like the same code to work for any number I provide, without typing it all out manually.
It is meant to compare the current value in column "a" to the value in the same column one row above, two rows above, and so on, and produce "istrue" only when all of these conditions hold.
I tried shifting the dataframe in a for loop, appending each shifted value to a list, and taking its maximum so I would only need (df["a"] >= maxvalue) once, but I couldn't get that to work either. I am a novice at Python and will likely ask more silly questions in the near future.
This works, but I would like it to also work without so much repetitive code so I can learn to code properly. I also tried examples with a yield generator but could not get that working.
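For reference, one way to avoid both the repetition and any string-building is to fold the shifted comparisons together with functools.reduce; a minimal sketch using the question's column names (the sample data is made up):

```python
import functools

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 2, 5, 6, 7, 8, 9, 10, 11, 3]})
n = 3  # how many rows above to compare against; use 10 for the original question

# Start from an all-True mask and AND in one shifted comparison per step.
# Rows whose shift falls off the top of the frame compare against NaN,
# which evaluates False, so the first n rows come out "isnottrue".
cond = functools.reduce(
    lambda acc, k: acc & (df["a"] >= df["a"].shift(k)),
    range(1, n + 1),
    pd.Series(True, index=df.index),
)
df["thecol"] = np.where(cond, "istrue", "isnottrue")
```

This builds exactly the same boolean condition as the hand-written chain of `&` terms, for any `n`.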
#Edit:
Answered by Wen. I needed rolling.
In the end I came up with this terrible terrible approach:
def whereconditions(n):
    s1 = 'df["thecol"] = np.where('
    L = []
    while n > 0:
        s2 = '(df["a"] >= df["a"].shift(' + str(n) + ')) &'
        L.append(s2)
        n = n - 1
    s3 = ",'istrue','isnottrue')"
    r = s1 + str([x for x in L]).replace("'", "").replace(",", "").replace("&]", "") + s3
    return str(r.replace("([(", "(("))

call = whereconditions(10)
exec(call)

Sounds like you need rolling:
np.where(df['a'] == df['a'].rolling(11).max(), 'istrue', 'isnottrue')
Note the window is 11, not 10: it must cover the current row plus the 10 rows above it, and the comparison becomes == because the current value is itself inside the window (being >= the window maximum is the same as being equal to it).

Related

Is it possible to filter a DataFrame by two conditions?

I am wondering if there is a way to filter a column in a dataframe with two conditions:
matches_df = matches_df[matches_df['similairity'] < 0.9999999 AND matches_df['similairity'] > 0.9 ]
Yes, we can do that. Your code will work with a small modification:
Replace AND with &.
Wrap each condition in parentheses.
matches_df = matches_df[(matches_df['similairity'] < 0.9999999) & (matches_df['similairity'] > 0.9)]
Hope it helps :)
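For completeness, recent pandas versions (1.3+) also offer Series.between, which expresses the same range filter in one call; the inclusive="neither" argument is needed here to keep the strict < / > from the question (sample data made up):

```python
import pandas as pd

matches_df = pd.DataFrame({"similairity": [0.5, 0.95, 0.9999999, 1.0]})

# between() with inclusive="neither" excludes both endpoints,
# matching the strict > 0.9 and < 0.9999999 conditions
filtered = matches_df[matches_df["similairity"].between(0.9, 0.9999999, inclusive="neither")]
```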

Issue with creating a column using np.where, ArrayType error

I have a dataframe in which I'm trying to create a binary 1/0 column when certain conditions are met. The code I'm using is as follows:
sd_threshold = 5
df1["signal"] = np.where(np.logical_and(df1["high"] >= df1["break"], df1["low"]
<= df1["break"], df1["sd_round"] > sd_threshold), 1, 0)
The code returns TypeError: return arrays must be of ArrayType when the last condition df1["sd_round"] > sd_threshold is included, otherwise it works fine. There isn't any issue with the data in the df1["sd_round"] column.
Any insight would be much appreciated, thank you!
Check the documentation: np.logical_and() compares only its first two arguments and treats the third as the out parameter, an array to write the result into. You could nest the calls, but I would just go with & (pandas boolean indexing):
df1["signal"] = np.where((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold),
1, 0)
EDIT: you could actually just skip numpy and cast your boolean Series to int to yield 1s and 0s:
mask = ((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold))
df1["signal"] = mask.astype(int)

Combining functions (AND)

I have a question regarding combining functions.
My purpose is to apply two conditions at the same time. Basically, I want to trim extreme values from my dataset by cutting below the 5% quantile at the low end and above the 95% quantile at the high end.
df = df[df.temperature >= df.temperature.quantile(.05)]
gets me values that are above the 5% quantile
df = df[df.temperature <= df.temperature.quantile(.95)]
gets me all the values that are below the 95% quantile.
My current problem is that
df = df[df.temperature >= df.temperature.quantile(.05)]
df = df[df.temperature <= df.temperature.quantile(.95)]
works, but it's not precise: the second quantile is computed on the already-cut data. So how can I apply both cuts at once?
df = df[df.temperature >= df.temperature.quantile(.05) & <= df.temperature.quantile(.95)]
does not work.
Thanks for support!
Solved:
df = df[(df.temperature >= df.temperature.quantile(.05)) & (df.temperature <= (df.temperature.quantile(.95)))]
You need parentheses around the conditions due to operator precedence:
df = df[(df.temperature >= df.temperature.quantile(.05)) & (df.temperature <= df.temperature.quantile(.95))]
The docs show that >= has lower precedence than &, so you need the parentheses; in fact, without them your code raises an error about the ambiguous truth value of a Series.
code style wise it is more readable to have your conditions as variables so I would rewrite it to this:
low_limit = df.temperature >= df.temperature.quantile(.05)
upper_limit = df.temperature <= df.temperature.quantile(.95)
then your filtering becomes:
df[low_limit & upper_limit]
Since the comparisons are already evaluated into boolean Series, no extra parentheses are needed around the variable names.
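Both cutoffs can also be computed up front in a single quantile call, so neither cut can build on already-trimmed data, and applied with Series.between (sample data made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"temperature": rng.normal(20, 5, 1000)})

# One quantile call returns both cutoffs from the original data;
# between() then applies both bounds in a single filter
lo, hi = df.temperature.quantile([.05, .95])
trimmed = df[df.temperature.between(lo, hi)]
```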

Pandas Python - Create dummy variables for multiple conditions

I have a pandas dataframe with a column that indicates which hour of the day a particular action was performed. So df['hour'] is many rows each with a value from 0 to 23.
I am trying to create dummy variables for things like 'is_morning', for example:
if df['hour'] >= 5 and < 12 then return 1, else return 0
A for loop doesn't work given the size of the data set, and I've tried some other stuff like
df['is_morning'] = df['hour'] >= 5 and < 12
Any suggestions??
You can just do:
df['is_morning'] = (df['hour'] >= 5) & (df['hour'] < 12)
i.e. wrap each condition in parentheses, and use &, which is an and operation that works across the whole vector/column.
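Since the question asks for 1/0 dummies rather than True/False, the boolean mask can additionally be cast to int; a small sketch with made-up hours:

```python
import pandas as pd

df = pd.DataFrame({"hour": [3, 5, 11, 12, 23]})

# The combined boolean mask casts directly to the 1/0 dummy column
df["is_morning"] = ((df["hour"] >= 5) & (df["hour"] < 12)).astype(int)
```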

Grabbing selection between specific dates in a DataFrame

So I have a large pandas DataFrame that contains about two months of information, with one line of info per second. That is way too much information to deal with at once, so I want to grab specific timeframes. The following code will grab everything before February 5th, 2012:
sunflower[sunflower['time'] < '2012-02-05']
I want to do the equivalent of this:
sunflower['2012-02-01' < sunflower['time'] < '2012-02-05']
but that is not allowed. Now I could do this with these two lines:
step1 = sunflower[sunflower['time'] < '2012-02-05']
data = step1[step1['time'] > '2012-02-01']
but I have to do this with 20 different DataFrames and a multitude of times and being able to do this easily would be nice. I know pandas is capable of this because if my dates were the index rather than a column, it's easy to do, but they can't be the index because dates are repeated and therefore you receive this error:
Exception: Reindexing only valid with uniquely valued Index objects
So how would I go about doing this?
You could define a mask separately:
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
mask = (df.b > -.5) & (df.b < .5)
df_masked = df[mask]
Or in one line:
df_masked = df[(df.b > -.5) & (df.b < .5)]
You can use query for a more concise option:
df.query("'2012-02-01' < time < '2012-02-05'")
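Series.between also works here and reads close to the chained comparison; note it includes both endpoints by default, unlike the strict < / > comparisons in the question (sample data made up):

```python
import pandas as pd

sunflower = pd.DataFrame({
    "time": pd.to_datetime(["2012-01-31", "2012-02-02", "2012-02-04", "2012-02-06"]),
    "value": [1, 2, 3, 4],
})

# between() keeps rows whose timestamp falls inside the range,
# including both endpoint dates
data = sunflower[sunflower["time"].between("2012-02-01", "2012-02-05")]
```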
