Calculating Quantiles based on a column value? - python

I am trying to figure out a way to calculate quantiles in pandas or Python based on a column value. Also, can I calculate multiple different quantiles in one output?
For example, I want to calculate the 0.25, 0.50 and 0.9 quantiles for column Minutes in df where it is <= 5, and where it is > 5 and <= 10:
df[df['Minutes'] <=5]
df[(df['Minutes'] >5) & (df['Minutes']<=10)]
where column Minutes just contains numerical minute values.
Thanks!

DataFrame.quantile accepts an array of quantile values.
Try
df['Minutes'].quantile([0.25, 0.50, 0.9])
Or filter the data first,
df.loc[df['Minutes'] <= 5, 'Minutes'].quantile([0.25, 0.50, 0.9])
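To get both ranges in one output, one option is to bin Minutes with pd.cut and compute all three quantiles per bin. A minimal sketch, assuming bin edges and labels based on the question (the sample data is made up):
import pandas as pd
df = pd.DataFrame({'Minutes': [1, 3, 4, 6, 7, 9, 10, 2, 8]})  # hypothetical sample
bins = pd.cut(df['Minutes'], bins=[0, 5, 10], labels=['<=5', '>5 and <=10'])  # (0, 5] and (5, 10]
df.groupby(bins, observed=True)['Minutes'].quantile([0.25, 0.50, 0.9])
This returns one Series indexed by bin and quantile, covering both conditions at once.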

Related

Weighted average on pandas

I want to add a new column which is a weighted average of the 2nd and 3rd columns: 2nd column * 0.6 + 3rd column * 0.4. The issue is that the columns change weekly, so I can't just use the column names; I want it to be the 2nd and 3rd columns every week.
I heard that df.iloc is the way to go, but I am not sure how to apply it on this problem.
Python counts from 0, so to select the second and third columns use positions 1 and 2 in DataFrame.iloc:
df.iloc[:, 1] * 0.6 + df.iloc[:, 2] * 0.4
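Assigned to a new column (the name weighted_avg here is just an assumption for illustration):
# positional selection, so the weekly column names don't matter
df['weighted_avg'] = df.iloc[:, 1] * 0.6 + df.iloc[:, 2] * 0.4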

Delete all rows below a certain condition in pandas

I have a dataframe with multiple columns. One of the columns (denoted as B in the example) works as a trigger, i.e.,
I have to drop all rows after the first value bigger than 0.5. However, I have to keep that first row.
In the example below, all rows after 0.59 (the first value that meets the condition of being bigger than 0.5) are deleted.
initial_df = pd.DataFrame([[1,0.4], [5,0.43], [4,0.59], [11,0.41], [9,0.61]], columns = ['A', 'B'])
The final goal is to obtain the following dataframe, where the trigger row (0.59) is kept and everything after it is dropped:
   A     B
0  1  0.40
1  5  0.43
2  4  0.59
Is it possible to do this in pandas in an efficient way (not using a for loop)?
You can use np.where on a Boolean mask to extract the positional index of the first value matching the condition, then feed this to iloc:
idx = np.where(df['B'].gt(0.5))[0][0]
res = df.iloc[:idx+1]
print(res)
A B
0 1 0.40
1 5 0.43
2 4 0.59
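Put together with the question's setup, a self-contained version of this answer:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]],
                  columns=['A', 'B'])
idx = np.where(df['B'].gt(0.5))[0][0]  # positional index of the first value > 0.5
res = df.iloc[:idx + 1]                # keep rows up to and including the trigger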
For very large dataframes where the condition is likely to be met early on, it can be more efficient to calculate idx with next and a generator expression, which short-circuits at the first match:
idx = next((idx for idx, val in enumerate(df['B']) if val > 0.5), len(df.index))
For better performance, see Efficiently return the index of the first value satisfying condition in array.
This works if your index matches the positional order (e.g., a default RangeIndex):
first_occurence = initial_df[initial_df.B>0.5].index[0]
initial_df.iloc[:first_occurence+1]
EDIT: this is a more general solution
first_occurence = initial_df.index.get_loc(initial_df[initial_df.B>0.5].iloc[0].name)
final_df = initial_df.iloc[:first_occurence+1]
I found a solution similar to the one shown by jpp:
indices = initial_df.index
trigger = initial_df[initial_df.B > 0.5].index[0]
initial_df[initial_df.index.isin(indices[indices<=trigger])]
Since the real dataframe has multiple indices, this is the only solution that I found.
I am assuming you want to delete all rows where the "B" column value is less than 0.5.
Try this:
initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]], columns=['A', 'B'])
final_df = initial_df[initial_df['B'] >= 0.5]
The resulting data frame, final_df, is:
A B
2 4 0.59
4 9 0.61

Pandas data frame excluding rows in a particular range

I have a DataFrame df as below. I want to exclude rows in a particular column, say Vader_Sentiment, that have values in the range -0.1 to 0.1, and keep the remaining rows.
I have tried df = df[(df['Vader_Sentiment'] < -0.1) & (df['Vader_Sentiment'] > 0.1)] but it doesn't seem to work.
Text Vader_Sentiment
A -0.010
B 0.206
C 0.003
D -0.089
E 0.025
You can use Series.between():
df.loc[~df.Vader_Sentiment.between(-0.1, 0.1)]
Text Vader_Sentiment
1 B 0.206
Three things:
The tilde (~) operator denotes an inverse/complement.
Make sure you have numeric data. df.dtypes should show float for Vader_Sentiment, not "object"
You can pass an inclusive parameter to control whether the interval bounds are closed or open.
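A runnable sketch of this answer on the sample data above (in recent pandas, inclusive takes a string such as 'both'; older versions use a Boolean):
import pandas as pd
df = pd.DataFrame({'Text': list('ABCDE'),
                   'Vader_Sentiment': [-0.010, 0.206, 0.003, -0.089, 0.025]})
df['Vader_Sentiment'] = pd.to_numeric(df['Vader_Sentiment'])  # guard against object dtype
df.loc[~df['Vader_Sentiment'].between(-0.1, 0.1, inclusive='both')]  # keeps only row B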

Calculating standard deviation on list ignoring zeros using numpy

I have a list pct_change. I need to calculate the standard deviation of the list, ignoring the zeros. I tried the code below, but it is not working as expected.
import numpy as np
m = np.ma.masked_equal(pct_change, 0)
value = m.mask.std()
Input values (pct_change):
0 0.00
1 0.00
2 0.00
3 18523.94
4 15501.94
5 14437.03
6 13402.43
7 18986.14
The code has to ignore the 3 zero values and then calculate the standard deviation.
Filter for values unequal to zero first:
>>> a
array([ 0. , 0. , 0. , 18523.94, 15501.94, 14437.03,
13402.43, 18986.14])
>>> a[a!=0].std()
2217.2329816471693
One approach would be to convert the zeros to NaNs and then use np.nanstd, which ignores NaNs in the standard deviation computation -
np.nanstd(np.where(np.isclose(a,0), np.nan, a))
Sample run -
In [296]: a
Out[296]: [0.0, 0.0, 0.0, 18523.94, 15501.94, 14437.03, 13402.43, 18986.14]
In [297]: np.nanstd(np.where(np.isclose(a,0), np.nan, a))
Out[297]: 2217.2329816471693
Note that we are using np.isclose(a, 0) because we are dealing with floating-point numbers here, and it's not a good idea to simply compare against zero to detect zeros in a float dtype array.
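For completeness, the masked-array attempt in the question almost works; the bug is that it calls .std() on the Boolean mask (m.mask) rather than on the masked array itself:
import numpy as np
pct_change = [0.0, 0.0, 0.0, 18523.94, 15501.94, 14437.03, 13402.43, 18986.14]
m = np.ma.masked_equal(pct_change, 0)
m.std()  # 2217.23..., the masked zeros are ignored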

Pandas: check for multiple minimum consecutive criteria

I have a scenario where I want to check for a minimum criterion (0.6) being met over consecutive pandas dataframe rows in a column (Col1), where a run only qualifies if its starting value is at least 0.7, i.e.:
Col1
0.3
0.5
0.55
0.8 = true
0.65 = true
0.9 = true
0.61 = true
0.3
0.6
0.67
0.74 = true
0.63 = true
0.61 = true
In other words, the check would be True if the value is at least 0.7, or if the value is at least 0.6 and the previous values are at least 0.6 with the first value in the consecutive series being at least 0.7.
It will be running on a very large data set, so it needs to be efficient. I am thinking something with shift() would work... but I can't get it quite right.
You can use Series.where() to construct the logical Series.
Steps:
initialize the Series with NaN values;
assign True to all values of at least 0.7;
assign False to all values of 0.6 or below;
forward-fill the values strictly between 0.6 and 0.7, since they depend on previous values;
fill possible missing values at the beginning of the Series;
convert the dtype to boolean (optional)
so:
import pandas as pd
import numpy as np
df['check'] = np.nan
df['check'] = (df['check'].where(df.Col1 < 0.7, True)
.where(df.Col1 > 0.6, False)
.ffill().fillna(False)
.astype(bool))
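Running this on the sample values from the question (reconstructed from the listing above) reproduces the expected flags:
df = pd.DataFrame({'Col1': [0.3, 0.5, 0.55, 0.8, 0.65, 0.9, 0.61,
                            0.3, 0.6, 0.67, 0.74, 0.63, 0.61]})
df['check'] = np.nan
df['check'] = (df['check'].where(df.Col1 < 0.7, True)
                          .where(df.Col1 > 0.6, False)
                          .ffill().fillna(False)
                          .astype(bool))
# rows 3-6 and 10-12 come out True, matching the "= true" markers above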
