Average a Pandas DataFrame with a condition from another DataFrame - python

I have two dataframes. One only contains binary values, the other floats between 0 and 1.
Eg.
df1:
   col 1  col 2  col 3  col 4  col 5  col 6  col 7
0    0.0    1.0    0.0    0.0    0.0    0.0    0.0
1    0.0    0.0    0.0    0.0    1.0    0.0    0.0
2    0.0    0.0    0.0    0.0    1.0    0.0    1.0
3    0.0    0.0    0.0    0.0    0.0    0.0    1.0
4    0.0    0.0    0.0    0.0    1.0    0.0    0.0
df2:
      col 1     col 2     col 3     col 4     col 5     col 6     col 7
0  0.068467  0.099870  0.090778  0.087500  0.612955  0.081495  0.570557
1  0.091651  0.084946  0.082704  0.103070  0.517317  0.092595  0.603526
2  0.070380  0.104353  0.103062  0.086780  0.598848  0.101543  0.570064
3  0.052239  0.123760  0.215329  0.087608  0.581883  0.080650  0.574241
4  0.087564  0.104460  0.125887  0.079945  0.646284  0.081015  0.609308
What I need is to compute the average of df1 where df2 >= 0.5 (or any other threshold).
All I could find on this topic works on single columns only, and I could not get it to work on the entire dataframe.
Any help is appreciated.

First, both DataFrames must have the same index and the same column names.
Then use DataFrame.where to set values to missing (NaN) wherever the mask is False, and take the mean of each column:
df = df1.where(df2 >= 0.5).mean()
If you need the mean of all values, use numpy.nanmean, which excludes missing values:
mean = np.nanmean(df1.where(df2 >= 0.5))
Another idea is to convert all values to a Series with DataFrame.stack and then take the mean:
mean = df1.where(df2 >= 0.5).stack().mean()
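For reference, a minimal self-contained sketch of the ideas above, using small illustrative frames (not the exact data from the question):
import numpy as np
import pandas as pd

# two frames with matching index and column names
df1 = pd.DataFrame({'col 1': [0.0, 0.0, 1.0], 'col 2': [1.0, 0.0, 0.0]})
df2 = pd.DataFrame({'col 1': [0.2, 0.7, 0.9], 'col 2': [0.6, 0.1, 0.3]})

masked = df1.where(df2 >= 0.5)            # keeps df1 values where df2 >= 0.5, NaN elsewhere

per_column_mean = masked.mean()           # Series with one mean per column, NaN ignored
overall_mean = np.nanmean(masked)         # single mean over all kept values
overall_mean_alt = masked.stack().mean()  # same single mean via stacking to a Series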

What about creating a dataframe that keeps the values of df1 where df2 >= 0.5 and has NaN everywhere else:
df = df1.where(df2 >= 0.5)
We then calculate the sum of the values and count the non-missing values to get the mean:
sum_values = df.sum().sum()
count_values = df.count().sum()
mean_value = sum_values / count_values
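Since DataFrame.count only counts non-NaN cells, this gives the same overall mean as the stack() approach above. A quick sanity check (a sketch, assuming the variables defined just above):
assert abs(mean_value - df.stack().mean()) < 1e-9  # both ignore the NaN cells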

Accessing multiple columns in 3d pandas dataframe [duplicate]

After constructing a 3-d pandas dataframe I have difficulty accessing only specific columns.
The scenario is as follows:
import numpy as np
import pandas as pd

head = ["h1", "h2"]
cols = ["col_1", "col_2", "col_3"]
heads = len(cols) * [head[0]] + len(cols) * [head[1]]  # -> ['h1','h1','h1','h2','h2','h2']
no_of_rows = 4
A = np.array(heads)
B = np.array(cols * len(head))  # -> ['col_1','col_2','col_3','col_1','col_2','col_3']
C = np.array([np.zeros(no_of_rows)] * len(head) * len(cols))  # -> shape=(6, 4)
df = pd.DataFrame(data=C.T,
                  columns=pd.MultiIndex.from_tuples(zip(A, B)))
yielding
     h1                   h2
  col_1  col_2  col_3  col_1  col_2  col_3
0   0.0    0.0    0.0    0.0    0.0    0.0
1   0.0    0.0    0.0    0.0    0.0    0.0
2   0.0    0.0    0.0    0.0    0.0    0.0
3   0.0    0.0    0.0    0.0    0.0    0.0
Now I would like to get e.g. all col_1 columns, meaning col_1 of h1 and col_1 of h2. The output should look like this:
      h1     h2
   col_1  col_1
0    0.0    0.0
1    0.0    0.0
2    0.0    0.0
3    0.0    0.0
Any suggestions on how I could access those two columns?
You can use df.loc with slice(None), as follows:
df.loc[:, (slice(None), 'col_1')]
or use pd.IndexSlice, as follows:
idx = pd.IndexSlice
df.loc[:, idx[:, 'col_1']]
or simply:
df.loc[:, pd.IndexSlice[:, 'col_1']]
(Defining the extra variable idx for pd.IndexSlice is a useful shorthand if you are going to use pd.IndexSlice multiple times.)
Result:
h1 h2
col_1 col_1
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
You can also do it with .xs() as follows:
df.xs('col_1', level=1, axis=1)
Result:
h1 h2
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
The output is slightly different: the repeated col_1 column labels are not shown.
The first 2 ways support selecting multiple columns too, e.g. ['col_1', 'col_3']:
df.loc[:, (slice(None), ['col_1', 'col_3'])]
and also:
df.loc[:, pd.IndexSlice[:, ['col_1', 'col_3']]]
Result:
     h1            h2
  col_1  col_3  col_1  col_3
0   0.0    0.0    0.0    0.0
1   0.0    0.0    0.0    0.0
2   0.0    0.0    0.0    0.0
3   0.0    0.0    0.0    0.0
You can use loc with get_level_values(1), since your columns col_1, col_2, col_3 are in the second level (level 1) of the column MultiIndex:
>>> df.loc[:,df.columns.get_level_values(1).isin(['col_1'])]
h1 h2
col_1 col_1
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
If you want to grab all columns under h1, you can use get_level_values(0) and select h1:
>>> df.loc[:,df.columns.get_level_values(0).isin(['h1'])]
h1
col_1 col_2 col_3
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
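For completeness, a compact self-contained recap of the selection options above (pd.MultiIndex.from_product is used here only as a shorthand for the column construction in the question):
import numpy as np
import pandas as pd

# rebuild the example frame with a two-level column index
columns = pd.MultiIndex.from_product([['h1', 'h2'], ['col_1', 'col_2', 'col_3']])
df = pd.DataFrame(np.zeros((4, 6)), columns=columns)

idx = pd.IndexSlice
print(df.loc[:, (slice(None), 'col_1')])      # slice(None) over level 0, col_1 on level 1
print(df.loc[:, idx[:, 'col_1']])             # same selection via IndexSlice
print(df.xs('col_1', level=1, axis=1))        # drops the col_1 level from the header
print(df.loc[:, idx[:, ['col_1', 'col_3']]])  # several level-1 columns at once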

Operations on multiple slices of columns in Pandas DF based on row values

I am working with some very large arrays of data that I have organized in a Pandas DataFrame. An example of what I have looks more or less like this:
>>> pd.DataFrame({'vp':aux_vp,'vs':aux_vs,'den':aux_den,'layer':facies_vol})
vp layer
0 5163.788741 0.0
1 5062.234019 0.0
2 4869.894684 0.0
3 9126.546268 1.0
4 5566.053159 1.0
... ...
1254523 6177.467626 0.0
1254524 4756.891403 0.0
1254525 6244.816685 2.0
Now, what I need is to calculate the average of the values in the column "vp", in slices defined by the values of "layer", so that the expected output would be
vp layer averages
0 5163 0.0 5031.3
1 5062 0.0 5031.3
2 4869 0.0 5031.3
3 9126 1.0 7346
4 5566 1.0 7346
... ... ...
1254523 6177 0.0 5466.5
1254524 4756 0.0 5466.5
1254525 6244 2.0 6244
The repetition of the average value in each slice is a bonus. What I really cannot do is perform this operation without looping through all the rows. I have already tried to do this with numpy, identifying the indexes where the "layer" array changes and then calculating the means with a for loop:
vp = np.array(...)     # same as vp in pandas column
layer = np.array(...)  # same as layer in pandas column
averages = np.zeros(len(vp))
# here I compare the adjacent values of layer and store the indexes where they differ
indexes = np.add(np.where(layer[:-1] != layer[1:])[0], 1)
for i in range(1, len(indexes)):
    mean = np.mean(vp[indexes[i-1]:indexes[i]])
    averages[indexes[i-1]:indexes[i]] = mean
but it takes an eternity, considering the volume of data I have.
Thanks a lot!
IIUC, you can use groupby and transform('mean') after you create a key column to group on using shift and cumsum:
df['averages'] = df.assign(key=(df['layer'] != df['layer'].shift()).cumsum()).groupby('key')['vp'].transform('mean')
vp layer averages
0 5163.788741 0.0 5031.972481
1 5062.234019 0.0 5031.972481
2 4869.894684 0.0 5031.972481
3 9126.546268 1.0 7346.299714
4 5566.053159 1.0 7346.299714
1254523 6177.467626 0.0 5467.179514
1254524 4756.891403 0.0 5467.179514
1254525 6244.816685 2.0 6244.816685
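To see what the key does, here is a small self-contained sketch with illustrative values (grouping directly on the key Series is equivalent to the assign(key=...) form above): each consecutive run of equal layer values gets its own group number, so the mean is taken per run rather than per layer value overall.
import pandas as pd

df = pd.DataFrame({'vp': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
                   'layer': [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]})

key = (df['layer'] != df['layer'].shift()).cumsum()  # -> 1, 1, 2, 2, 3, 3
df['averages'] = df.groupby(key)['vp'].transform('mean')
# the two runs of layer 0.0 get different means (15.0 and 55.0); the 1.0 run gets 35.0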
You can group to get the means and then merge the tables:
mean = df.groupby(by='layer').agg({'value': 'mean'}).rename(columns={'value': 'mean'}).reset_index()
df.merge(mean, on='layer')
value layer mean
0 4998.663295 0.0 5000.727034
1 4999.460336 0.0 5000.727034
2 5002.241608 0.0 5000.727034
3 5000.057680 0.0 5000.727034
4 5000.036647 0.0 5000.727034
5 4999.525570 0.0 5000.727034
6 5002.602431 0.0 5000.727034
7 5000.774510 0.0 5000.727034
8 5001.872130 0.0 5000.727034
9 5002.036133 0.0 5000.727034
10 9999.825662 1.0 9999.648100
11 9999.707490 1.0 9999.648100
12 9999.601844 1.0 9999.648100
13 9999.137278 1.0 9999.648100
14 9999.544681 1.0 9999.648100
15 9999.971940 1.0 9999.648100
16 10001.009895 1.0 9999.648100
17 9999.212977 1.0 9999.648100
18 10001.271304 1.0 9999.648100
19 9997.197929 1.0 9999.648100
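Note that grouping on layer directly gives one mean per layer value over the whole frame, unlike the run-based key in the previous answer. If that global mean is what you want, a transform-based sketch avoids the merge and keeps the original index (illustrative values below):
import pandas as pd

df = pd.DataFrame({'value': [1.0, 2.0, 10.0, 20.0], 'layer': [0.0, 0.0, 1.0, 1.0]})
df['mean'] = df.groupby('layer')['value'].transform('mean')  # -> 1.5, 1.5, 15.0, 15.0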

Python - Create a column equal to the value of another column but if two consecutive values occur in the first column then make new column equal to 0

I have a column ('bet_made1') of ones and zeros where 1 represents taking a long position in a stock.
I want to create a new column that will equal the value of 'bet_made1'. However, if two consecutive 1s occur in 'bet_made1', I would like the new column to equal 0.
I would like to repeat this so that a long position is followed by 4 days without another long being taken.
In other words, if I have a long position , I do not want to take another one till the 5th day after the initial buy order so that I only place a trade at a minimum of every 5 days.
Hope that makes sense. I've included a table below showing what I'm aiming for.
Cheers!
bet_made1 long_pos
date
02/01/2019 0.0 0.0
03/01/2019 0.0 0.0
04/01/2019 0.0 0.0
07/01/2019 0.0 0.0
08/01/2019 0.0 0.0
09/01/2019 0.0 0.0
10/01/2019 0.0 0.0
11/01/2019 0.0 0.0
14/01/2019 0.0 0.0
15/01/2019 0.0 0.0
16/01/2019 1.0 1.0
17/01/2019 1.0 0.0
18/01/2019 0.0 0.0
22/01/2019 1.0 0.0
23/01/2019 1.0 0.0
24/01/2019 1.0 1.0
25/01/2019 1.0 0.0
28/01/2019 1.0 0.0
29/01/2019 0.0 0.0
30/01/2019 0.0 0.0
There might be a more compact and clever way, but creating a list with the new values and then adding it as a new column may be a good idea:
import pandas as pd

df = pd.DataFrame(data=dict(bet_made1=[0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]))

long_pos = []
remains = 0
for b in df.bet_made1:
    if remains > 0:
        remains -= 1
        long_pos.append(0)
    elif b == 1:
        remains = 4
        long_pos.append(1)
    else:
        long_pos.append(0)

df['long_pos'] = long_pos
This gives me the result:
bet_made1 long_pos
0 0
1 1
0 0
1 0
1 0
1 0
0 0
0 0
1 1
1 0
1 0
1 0
1 0
1 1
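If you need this in more than one place, the same loop can be wrapped in a small helper with the waiting period as a parameter (a sketch based on the loop above; the name apply_cooldown is just illustrative, and the 4-day wait becomes cooldown=4):
import pandas as pd

def apply_cooldown(signals: pd.Series, cooldown: int = 4) -> pd.Series:
    """Keep a 1 only if no 1 has been kept in the previous `cooldown` rows."""
    out = []
    remains = 0
    for b in signals:
        if remains > 0:
            remains -= 1
            out.append(0)
        elif b == 1:
            remains = cooldown
            out.append(1)
        else:
            out.append(0)
    return pd.Series(out, index=signals.index)

# usage: df['long_pos'] = apply_cooldown(df['bet_made1'], cooldown=4)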

Is there a way to compare two columns of a dataframe containing float values and create a new column to add labels based on it?

I have a dataframe which looks like below:
df
column_A column_B
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 0.0
5 1.0 0.0
I want to create an if condition like:
if(df['column_A'] & df['column_b'] = 0.0:
df['label]='OK'
else:
df['label']='NO'
I tried this:
if((0.0 in df['column_A'] ) & (0.0 in df['column_B']))
for index, row in df.iterrows():
(df[((df['column_A'] == 0.0) & (df['column_B']== 0.0))])
Nothing really gave the expected outcome.
I expect my output to be:
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
You can use np.where in order to create an array with either OK or NO depending on the result of the condition:
import numpy as np
df['label'] = np.where(df.column_A.add(df.column_B).eq(0), 'OK', 'NO')
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
Use numpy.where with DataFrame.any:
#solution if only 1.0, 0.0 values
df['label'] = np.where(df[['column_A', 'column_B']].any(axis=1), 'NO','OK')
#general solution with compare 0
#df['label'] = np.where(df[['column_A', 'column_B']].eq(0).all(axis=1),'OK','NO')
print (df)
column_A column_B label
0 0.0 0.0 OK
1 0.0 0.0 OK
2 0.0 1.0 NO
3 0.0 0.0 OK
4 0.0 0.0 OK
5 1.0 0.0 NO
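If you prefer to stay entirely in pandas, one more sketch of the same idea (both columns equal to zero means OK), using the example frame from the question:
import pandas as pd

df = pd.DataFrame({'column_A': [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
                   'column_B': [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]})

df['label'] = df[['column_A', 'column_B']].eq(0).all(axis=1).map({True: 'OK', False: 'NO'})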

How to merge two data frames together?

I have two data frames:
pre_data_inputs with the size of (4760, 2)
Diff_Course_PreCourse with the size of (4760, 1).
I want to merge these two data frames together under the name data_inputs. This new data frame should be (4760, 3). I have this code so far:
data_inputs = pd.concat([pre_data_inputs, Diff_Course_PreCourse], axis=1)
But the size of data_inputs now is (4950,3).
I don't know what the problem is. I would appreciate it if anybody could help me. Thanks.
Well if your index matches in both cases you can go with:
pre_data_inputs.merge(Diff_Course_PreCourse, left_index=True, right_index=True)
Otherwise you might want to reset_index() on both dataframes.
As @Parfait commented, the index of your data frames has to match for concat to work as you describe it.
For example:
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.zeros(shape=(3, 1)))
0
0 0.0
1 0.0
2 0.0
d2 = pd.DataFrame(np.ones(shape = (3,2)), index = range(2,5))
0 1
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
Since the indexes don't match, the resulting data frame will have a number of rows equal to the union of the two index sets (0, 1, 2, 3, 4):
pd.concat([d1, d2], axis = 1)
0 0 1
0 0.0 NaN NaN
1 0.0 NaN NaN
2 0.0 1.0 1.0
3 NaN 1.0 1.0
4 NaN 1.0 1.0
You could use reset_index before the concat, or force one of the data frames to use the index of the other:
pd.concat([d1, d2.set_index(d1.index)], axis = 1)
0 0 1
0 0.0 1.0 1.0
1 0.0 1.0 1.0
2 0.0 1.0 1.0
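For completeness, the reset_index route on the same d1 and d2 (the same pattern would apply to pre_data_inputs and Diff_Course_PreCourse):
pd.concat([d1.reset_index(drop=True), d2.reset_index(drop=True)], axis=1)
     0    0    1
0  0.0  1.0  1.0
1  0.0  1.0  1.0
2  0.0  1.0  1.0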
