Weighted average on pandas - python

I want to add a new column which is a weighted average of the 2nd and 3rd columns: 2nd column * 0.6 + 3rd column * 0.4. The issue is that the columns change weekly, so I can't just use the column names; I want it to always be the 2nd and 3rd columns each week.
I heard that df.iloc is the way to go, but I am not sure how to apply it to this problem.

Python counts from 0, so to select the second and third columns use positions 1 and 2 in DataFrame.iloc:
df.iloc[:, 1] * 0.6 + df.iloc[:, 2] * 0.4
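For example, a minimal sketch with made-up column names (the real names change weekly, which is exactly why positional selection helps):
import pandas as pd

# Example data; the actual column names differ each week, so select by position
df = pd.DataFrame({'id': [1, 2, 3],
                   'week_12_sales': [100, 200, 300],
                   'week_13_sales': [90, 210, 280]})

# 2nd and 3rd columns by position: iloc[:, 1] and iloc[:, 2]
df['weighted_avg'] = df.iloc[:, 1] * 0.6 + df.iloc[:, 2] * 0.4
print(df)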

Efficient method to find nearest datetimes for large dataframes

I have a pandas dataframe with two columns, both datetime instances. The first column contains measurement timings and the second column is the first column plus a constant offset. E.g. assuming a constant offset of 1 gives:
index  Measurement_time  offset_time
0      0.1               1.2
1      0.5               1.5
2      1.2               2.2
3      2.4               3.4
I would like to find the index of each measurement_time that closest matches the offset_time with the condition that the measurement_time must be smaller than or equal to the offset_time. The solution to the given example would therefore be:
index = [2, 2, 2, 3]
I have tried using get_loc and making a mask but because my dataframe is large, these solutions are too inefficient.
Any help would be greatly appreciated!
Let us use np.searchsorted to find the indices of the closest matches:
s = df['Measurement_time'].sort_values()
np.searchsorted(s, df['offset_time'], side='right') - 1
Result:
array([2, 2, 2, 3], dtype=int64)
Note: You may skip the .sort_values part if your dataframe is already sorted on the column Measurement_time
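For reference, a minimal runnable sketch of this answer, using the values from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Measurement_time': [0.1, 0.5, 1.2, 2.4],
                   'offset_time': [1.2, 1.5, 2.2, 3.4]})

s = df['Measurement_time'].sort_values()
# For each offset_time, index of the last Measurement_time that is <= offset_time
idx = np.searchsorted(s, df['offset_time'], side='right') - 1
print(idx)  # [2 2 2 3]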

Calculating Quantiles based on a column value?

I am trying to figure out a way to calculate quantiles in pandas or Python based on a column value. Also, can I calculate multiple different quantiles in one output?
For example, I want to calculate the 0.25, 0.50 and 0.9 quantiles for the column Minutes in df where it is <= 5, and where it is > 5 and <= 10:
df[df['Minutes'] <=5]
df[(df['Minutes'] >5) & (df['Minutes']<=10)]
where Minutes is just a column containing numerical minute values.
Thanks!
DataFrame.quantile (and Series.quantile) accepts an array of quantile values.
Try:
df['Minutes'].quantile([0.25, 0.50, 0.9])
Or filter the data first:
df.loc[df['Minutes'] <= 5, 'Minutes'].quantile([0.25, 0.50, 0.9])
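Putting the two together for the filters in the question, a minimal sketch (the sample Minutes values are assumed for illustration):
import pandas as pd

df = pd.DataFrame({'Minutes': [1, 2, 4, 5, 6, 7, 9, 12]})  # example data

# Quantiles for Minutes <= 5
print(df.loc[df['Minutes'] <= 5, 'Minutes'].quantile([0.25, 0.50, 0.9]))

# Quantiles for 5 < Minutes <= 10
mask = (df['Minutes'] > 5) & (df['Minutes'] <= 10)
print(df.loc[mask, 'Minutes'].quantile([0.25, 0.50, 0.9]))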

Delete all rows below a certain condition in pandas

I have a dataframe with multiple columns. One of the columns (denoted as B in the example) works as a trigger:
I have to drop all rows after the first value greater than 0.5, but I have to keep the row containing that first value.
An example is given below. All rows after 0.59 (which is the first value satisfying the condition of being greater than 0.5) are deleted.
initial_df = pd.DataFrame([[1,0.4], [5,0.43], [4,0.59], [11,0.41], [9,0.61]], columns = ['A', 'B'])
The final goal is to obtain a dataframe containing only the first three rows, up to and including the trigger value 0.59.
Is it possible to do this in pandas in an efficient way (not using a for loop)?
You can use np.where with Boolean indexing to extract the positional index of the first value matching a condition. Then feed this to iloc:
idx = np.where(df['B'].gt(0.5))[0][0]
res = df.iloc[:idx+1]
print(res)
A B
0 1 0.40
1 5 0.43
2 4 0.59
For very large dataframes where the condition is likely to be met early on, it may be more efficient to use next with a generator expression to calculate idx:
idx = next((idx for idx, val in enumerate(df['B']) if val > 0.5), len(df.index))
For better performance, see Efficiently return the index of the first value satisfying condition in array.
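For completeness, a runnable version of the np.where approach above, using initial_df from the question:
import numpy as np
import pandas as pd

initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]],
                          columns=['A', 'B'])

# Positional index of the first value in B greater than 0.5
idx = np.where(initial_df['B'].gt(0.5))[0][0]
final_df = initial_df.iloc[:idx + 1]
print(final_df)
#    A     B
# 0  1  0.40
# 1  5  0.43
# 2  4  0.59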
This works if your index labels are the same as their iloc positions:
first_occurence = initial_df[initial_df.B>0.5].index[0]
initial_df.iloc[:first_occurence+1]
EDIT: this is a more general solution
first_occurence = initial_df.index.get_loc(initial_df[initial_df.B>0.5].iloc[0].name)
final_df = initial_df.iloc[:first_occurence+1]
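To illustrate why the get_loc-based lookup is more general, here is a sketch on a dataframe with non-default index labels (the string index is assumed for the example):
import pandas as pd

initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]],
                          columns=['A', 'B'],
                          index=['a', 'b', 'c', 'd', 'e'])  # hypothetical labels

# Label of the first row where B > 0.5, converted to a positional index
first_occurence = initial_df.index.get_loc(initial_df[initial_df.B > 0.5].iloc[0].name)
final_df = initial_df.iloc[:first_occurence + 1]
print(final_df)
#    A     B
# a  1  0.40
# b  5  0.43
# c  4  0.59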
I found a solution similar to the one shown by jpp:
indices = initial_df.index
trigger = initial_df[initial_df.B > 0.5].index[0]
initial_df[initial_df.index.isin(indices[indices<=trigger])]
Since the real dataframe has multiple indices, this is the only solution that I found.
I am assuming you want to delete all rows where the "B" column value is less than 0.5.
Try this:
initial_df = pd.DataFrame([[1, 0.4], [5, 0.43], [4, 0.59], [11, 0.41], [9, 0.61]], columns=['A', 'B'])
final_df = initial_df[initial_df['B'] >= 0.5]
The resulting dataframe, final_df, is:
A B
2 4 0.59
4 9 0.61

Conditional rolling computation in pandas

I would like to compute a quantity called "downside beta".
Let's suppose I have a dataframe df:
df = pd.DataFrame({'A': [-0.1,0.3,-0.4, 0.8,-0.5],'B': [-0.2,0.5,0.3,-0.5,0.1]},index=[0, 1, 2, 3,4])
I would like to add a column 'C' that computes this downside beta, defined as the covariance between columns A and B considering only the negative values of column A (together with the corresponding values of B), divided by the variance of column A considering only those negative values.
In the above example, it is equivalent to computing the covariance between the two series [-0.1, -0.4, -0.5] and [-0.2, 0.3, 0.1], divided by the variance of the series [-0.1, -0.4, -0.5].
Next step would be to roll this metric over the index of an initial large dataframe df.
Is there an efficient way to do that in a vectorized manner? I guess it would combine pd.rolling_cov and np.where?
Thank you!
Is this what you're looking for? You can filter out positive values and then call pandas cov and var functions accordingly:
v = df[df.A.lt(0)]
v.cov() / v.A.var()
A B
A 1.000000 -0.961538
B -0.961538 1.461538
If you just want the single off-diagonal value (cov(A, B) / var(A)), use np.diag with k=-1:
np.diag(v.cov() / v.A.var(), k=-1)
array([-0.96153846])
For a rolling window, you may need to jump through a few hoops, but this should be doable:
v = df[df.A.lt(0)]
i = v.rolling(3).cov().A.groupby(level=0).last()
j = v.rolling(3).A.var()
i / j
0 NaN
2 NaN
4 -0.961538
Name: A, dtype: float64
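For reference, a self-contained sketch of the non-rolling computation, using the same data as the question:
import pandas as pd

df = pd.DataFrame({'A': [-0.1, 0.3, -0.4, 0.8, -0.5],
                   'B': [-0.2, 0.5, 0.3, -0.5, 0.1]})

# Keep only the rows where A is negative
v = df[df.A.lt(0)]

# Downside beta: cov(A, B) over the negative-A rows, divided by var(A) over the same rows
downside_beta = v['A'].cov(v['B']) / v['A'].var()
print(downside_beta)  # -0.961538...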

Applying Operation to Pandas column if other column meets criteria

I'm relatively new to Python and totally new to pandas, so my apologies if this is really simple. I have a dataframe, and I want to operate over all elements in a particular column, but only if a different column with the same index meets a certain criterion.
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
For example, if the value in float_col is greater than 5, I want to multiply the value in int_col (in the same row) by 2. I'm guessing I'm supposed to use one of the map, apply, or applymap functions, but I'm not sure which, or how.
There might be more elegant ways to do this, but once you understand how to use things like loc to get at a particular subset of your dataset, you can do it like this:
df.loc[df['float_col'] > 5, 'int_col'] = df.loc[df['float_col'] > 5, 'int_col'] * 2
You can also do it a bit more succinctly like this, since pandas is smart enough to match up the results based on the index of your dataframe and only use the relevant data from the df['int_col'] * 2 expression:
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
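A minimal runnable sketch using the example dataframe from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'float_col': [0.1, 0.2, 0.2, 10.1, np.nan],
                   'int_col': [1, 2, 6, 8, -1],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Double int_col only in the rows where float_col > 5
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
print(df)
#    float_col  int_col str_col
# 0        0.1        1       a
# 1        0.2        2       b
# 2        0.2        6    None
# 3       10.1       16       c
# 4        NaN       -1       a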
