I have two DataFrames. One contains rates, and the other contains empty values that need to be calculated from the df1 rate table. The two tables look like this:

df1

          0     1     2
rate   0.54  0.45  0.25

df2

    Age    0    1    2
0     1  NaN  NaN  NaN
1     2  NaN  NaN  NaN
...
29   30  NaN  NaN  NaN

I want to calculate each of the columns 0 to 2 in df2 using the following equation:

(1 + rate)^(-Age)

so my results table should look like this:

results

    Age               0               1               2
0     1   (1+0.54)^(-1)   (1+0.45)^(-1)   (1+0.25)^(-1)
1     2   (1+0.54)^(-2)   (1+0.45)^(-2)   (1+0.25)^(-2)
...
29   30  (1+0.54)^(-30)  (1+0.45)^(-30)  (1+0.25)^(-30)
I tried this code:

y = np.power(1 + rate.to_numpy(), -(df.Age))

but I got this error message:

"operands could not be broadcast together with shapes (1,6) (30,)"

How can I fix the code?
Based on your sample, you want to broadcast Age:
np.power(1 + df1.loc['rate'].to_numpy(), -df2['Age'].to_numpy()[:,None])
Output (the first two rows and the last row are shown):
array([[6.49350649e-01, 6.89655172e-01, 8.00000000e-01],
[4.21656266e-01, 4.75624257e-01, 6.40000000e-01],
[2.36798188e-06, 1.44198231e-05, 1.23794004e-03]])
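If you also want to write the result back into df2's empty columns, a minimal sketch (assuming df2's three rate columns carry the same labels as df1's columns, i.e. 0, 1 and 2) could be:

import numpy as np

# (30, 3) array: one row per Age, one column per rate
values = np.power(1 + df1.loc['rate'].to_numpy(), -df2['Age'].to_numpy()[:, None])
df2[df1.columns] = values  # assumes df2 already has columns matching df1.columns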
Related
I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   10          5           ...  NaN         5
2     NaN   2     ...  NaN   NaN         20          ...  NaN         10
and I want to divide all columns ending in "value" by the column "Divider". How can I do so? One trick would be to use sorting so that the answer from above applies, but is there a direct way that does not require sorting the dataframe?
The outcome would be:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   2           1           ...  0           5
2     NaN   2     ...  NaN   0           2           ...  0           10
So a NaN will lead to a 0.
Use DataFrame.filter to select the columns containing value from the dataframe, then use DataFrame.div along axis=0 to divide them by the Divider column, and finally use DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
   Name  TypA  TypB  TypF  TypA_value  TypB_value  TypF_value  Divider
0     1   1.0     1   NaN         2.0         1.0         0.0        5
1     2   NaN     2   NaN         0.0         2.0         0.0       10
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
   Name  TypA  TypB  TypF  TypA_value  TypB_value  TypF_value  Divider
0     1   1.0     1   NaN         2.0         1.0         NaN        5
1     2   NaN     2   NaN         NaN         2.0         NaN       10
Taking two sample columns, TypA and TypB:
import pandas as pd
import numpy as np
a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]
     }
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which calculations are to be done, assuming they all contain 'value' preceded by an underscore:
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns, first divide by the Divider column, then replace NaN with 0:

for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
I'm currently counting the number of missing columns across my full dataset with:
missing_cols = X.apply(lambda x: x.shape[0] - x.dropna().shape[0], axis=1).value_counts().to_frame()
When I run this, my RAM usage dramatically increases. In Kaggle, it's enough to crash the machine. After the operation and a gc.collect(), I don't seem to get all of the memory back, hinting at some sort of leak.
I'm trying to get a feel for the number of rows missing 1 column of data, 2 columns of data, 3 columns of data, etc.
Is there a more efficient way to perform this calculation?
To get the output you would get with your code, you could use:
df.isnull().sum(axis=1).value_counts().to_frame()
This is an example:
df=pd.DataFrame()
df['col1']=[np.nan,1,3,5,np.nan]
df['col2']=[2,np.nan,np.nan,3,6]
df['col3']=[1,3,np.nan,4,np.nan]
print(df)
print(df.isnull().sum(axis=1))
print(df.isnull().sum(axis=0))
   col1  col2  col3
0   NaN   2.0   1.0
1   1.0   NaN   3.0
2   3.0   NaN   NaN
3   5.0   3.0   4.0
4   NaN   6.0   NaN
0 1
1 1
2 2
3 0
4 2
dtype: int64
col1 2
col2 2
col3 2
dtype: int64
As you can see, you can get the count of NaN values by row and by column.
Now doing:
df.isnull().sum(axis=1).value_counts().to_frame()
0
2 2
1 2
0 1
You can count NaN values by row using the following:

df.isna().sum(axis='columns')
If this is causing your machine to crash, I would suggest iterating chunk-wise.
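A minimal sketch of the chunk-wise idea, assuming the data is loaded from a CSV file (the path data.csv and the chunk size are placeholders), could look like this:

import pandas as pd

counts = pd.Series(dtype='int64')
# 'data.csv' is a placeholder path; pick a chunksize that fits comfortably in memory
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    # per-chunk histogram of "number of missing columns per row", accumulated across chunks
    counts = counts.add(chunk.isnull().sum(axis=1).value_counts(), fill_value=0)

missing_cols = counts.astype(int).sort_index().to_frame()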
I have data that I pivoted using the pivot_table method, and now it looks like this:
rule_id   a  b  c
50211     8  0  0
50249    16  0  3
50378     0  2  0
50402    12  9  6
I have set 'rule_id' as the index. I then compared each column to the next one and created a new column holding the result. The idea is: if the first column has a value other than 0 and the second column (the one it is compared to) has 0, then 100 should go into the newly created column; in the reverse situation, Null should go in. If both columns are 0, the result is also Null. For the last column, 0 gives Null and anything other than 0 gives 100. But if both compared columns have values other than 0 (as in the last row of my data), then the comparison for columns a and b should be:

value_of_b / value_of_a * 50 + 50

and for columns b and c:

value_of_c / value_of_b * 25 + 25

and similarly, if there are more columns, the multiplication and addition value should be 12.5, and so on.
I was able to achieve everything above apart from the last part, the division and multiplication. I used this code:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df ==0, m], [np.nan, df], 1*100)
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
df3 = df.join(df2)
df is the dataframe that stores the pivoted data I mentioned at the start. After using this code, my data looks like this:
rule_id   a  b  c  comp1  comp2  comp3
50211     8  0  0    100    NaN    NaN
50249    16  0  3    100    NaN    100
50378     0  2  0    NaN    100    NaN
50402    12  9  6    100    100    100
But I want the data to look like this:
rule_id   a  b  c  comp1  comp2  comp3
50211     8  0  0    100    NaN    NaN
50249    16  0  3    100    NaN    100
50378     0  2  0    NaN    100    NaN
50402    12  9  6   87.5  41.67    100
If you guys can help me get the desired data , I would greatly appreciate it.
The problem is that the coefficient used to build each new compX column does not depend only on the column's position. In each row it is reset to its maximum of 50 after each 0 value and is half of the previous one after a non-zero value. Such resettable series are hard to vectorize in pandas, especially along rows. Here I would build a companion dataframe holding only those coefficients, and use the underlying numpy arrays directly to compute them as efficiently as possible. The code could be:
# transpose the dataframe to process columns instead of rows
coeff = df.T
# compute the coefficients
for name, s in coeff.items():
    top = 100                 # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:            # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2     # else half the previous value
        r.append(top)
    coeff.loc[:, name] = r    # set the whole column in one operation
# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
                                      np.where(df[col2] == 0, 100,
                                               df[col2] / df[col1] * coeff[col1]
                                               + coeff[col1]))

old = df.columns[0]           # store name of first column
# enumerate all the columns (except first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col                 # keep current column name for next iteration
# special processing for last comp column
df['comp{}'.format(i+1)] = np.where(df[col] == 0, np.nan, 100)
With this initial dataframe:
date     2019-04-25 15:08:23  2019-04-25 16:14:14  2019-04-25 16:29:05  2019-04-25 16:36:32
rule_id
50402    0                    0                    9                    0
51121    0                    1                    0                    0
51147    0                    1                    0                    0
51183    2                    0                    0                    0
51283    0                    12                   9                    6
51684    0                    1                    0                    0
52035    0                    4                    3                    2
it gives as expected:
date     2019-04-25 15:08:23  2019-04-25 16:14:14  2019-04-25 16:29:05  2019-04-25 16:36:32  comp1  comp2  comp3       comp4
rule_id
50402    0                    0                    9                    0                    NaN    NaN    100.000000  NaN
51121    0                    1                    0                    0                    NaN    100.0  NaN         NaN
51147    0                    1                    0                    0                    NaN    100.0  NaN         NaN
51183    2                    0                    0                    0                    100.0  NaN    NaN         NaN
51283    0                    12                   9                    6                    NaN    87.5   41.666667   100.0
51684    0                    1                    0                    0                    NaN    100.0  NaN         NaN
52035    0                    4                    3                    2                    NaN    87.5   41.666667   100.0
I think you can iterate over your dataframe df and use some if-else logic to get the desired output.
for i in range(len(df.index)):
    if df.iloc[i, 1] != 0 and df.iloc[i, 2] == 0:   # columns start from index 0
        df.loc[i, 'colname'] = 'whatever you want'  # so rule_id is column 0
    elif ...:
        .
        .
        .
I have two Pandas DataFrames indexed by a timeline. We'll call the first df_A, in which the 'epoch' corresponds to the index.
df_A:
   timeline    epoch  price    z-value
0  1476336104  0      434.313  1
1  1476336120  1      434.312  false
2  1476336134  2      434.312  false
3  1476336149  3      435.900  false
4  1476336165  4      435.900  1
5  1476336178  5      435.500  1
The second, df_B, may have one, none, or multiple entries per index of df_A, as you can see from the 'epoch' column.
df_B:
   timeline    epoch  send-value   tx-in
0  1476336123  1      10000        False
1  1476336169  4      299950000    False
2  1476336187  5      22879033493  False
3  1476336194  5      130000000    False
4  1476336212  7      10000000000  False
How can I merge these on the index of df_A, and add extra values contained in df_B as columns? I'd like to also add a suffix to differentiate the additional columns. The two example datasets should create a new DataFrame, df_AB that looks like this:
   timeline    epoch  price    z-value  send-value   tx-in  send-value_1  tx-in_1
0  1476336104  0      434.313  1        NaN          NaN    NaN           NaN
1  1476336120  1      434.312  false    10000        False  NaN           NaN
2  1476336134  2      434.312  false    NaN          NaN    NaN           NaN
3  1476336149  3      435.900  false    NaN          NaN    NaN           NaN
4  1476336165  4      435.900  1        299950000    False  NaN           NaN
5  1476336178  5      435.500  1        22879033493  False  130000000     False
It looks like there are a few different methods where I might be able to reindex and then merge on 'timeline', or use something like merge_asof, but I can't seem to get any of them to produce the result I am looking for.
How can I do this?
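One possible approach (a sketch only, assuming 'epoch' is a column in both frames) is to number df_B's rows within each epoch with groupby().cumcount(), pivot so that each occurrence gets its own set of columns, and then join the result onto df_A through its epoch column. The helper column occ below is introduced only for this purpose:

import pandas as pd

# number each df_B row within its epoch: 0 for the first match, 1 for the second, ...
df_B['occ'] = df_B.groupby('epoch').cumcount()

# one set of (send-value, tx-in) columns per occurrence, indexed by epoch
wide = df_B.pivot(index='epoch', columns='occ', values=['send-value', 'tx-in'])
wide.columns = [name if occ == 0 else f'{name}_{occ}' for name, occ in wide.columns]

# left-join onto df_A via its epoch column; epochs missing from df_B become NaN
df_AB = df_A.join(wide, on='epoch')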
I've got a pandas dataframe and I want to calculate percentiles based on the value of the calc_value column, unless calc_value is null, in which case percentile should also be null.
I'm using scipy's rankdata to calculate the percentiles, because it handles repeated values better than pandas's qcut.
However, rankdata has one flaw, which is that it will happily include null values, and there doesn't seem to be an option to exclude them.
import pandas as pd
from scipy.stats import rankdata

df = pd.DataFrame({'calc_value': [0, 0.081928, 0.94444, None, None]})
df['rank_val'] = rankdata(df.calc_value.values, method='min')
df.rank_val = df.rank_val - 1
df['percentile'] = (df.rank_val / float(len(df) - 1)) * 100
This produces obviously wrong results:
   calc_value  rank_val  percentile
0    0.000000         0           0
1    0.081928         1          25
2    0.944440         2          50
3         NaN         3          75
4         NaN         4         100
I can calculate the percentiles for all non-null values by slicing the dataframe, and doing the same calculations on the slice:
df_without_nan = df[df.calc_value.notnull()]
But what I don't know is how to push these values back into the main dataframe as df['percentile'], setting percentile and rank_val to be null on any rows where calc_value is also null.
Can anyone advise? I'm looking for the following results:
   calc_value  rank_val  percentile
0    0.000000         0           0
1    0.081928         1          25
2    0.944440         2          50
3         NaN       NaN         NaN
4         NaN       NaN         NaN
Use pd.merge:
from scipy import stats

df_nonan = df[df['calc_value'].notnull()]
df_nonan['rank_val'] = stats.rankdata(df_nonan.calc_value.values, method='min')
df_nonan['rank_val'] = df_nonan['rank_val'] - 1
df_nonan['percentile'] = (df_nonan.rank_val / float(len(df) - 1)) * 100
df_merge = pd.merge(df, df_nonan, left_index=True, right_index=True, how='left')
(This will give a SettingWithCopyWarning; if that's a problem, you can call reset_index on both dataframes and merge on the generated index column instead: pd.merge(df, df_nonan, on='index', how='left'), then drop the index column after the merge.) The merged dataframe at this point is:
   calc_value_x  calc_value_y  rank_val  percentile
0      0.000000      0.000000         0           0
1      0.081928      0.081928         1          25
2      0.944440      0.944440         2          50
3           NaN           NaN       NaN         NaN
4           NaN           NaN       NaN         NaN
Then do a bit of cleanup on the redundant columns:
del df_merge['calc_value_x']
df_merge = df_merge.rename(columns = {'calc_value_y' : 'calc_value'})
to wind up with
   calc_value  rank_val  percentile
0    0.000000         0           0
1    0.081928         1          25
2    0.944440         2          50
3         NaN       NaN         NaN
4         NaN       NaN         NaN
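An alternative sketch that avoids the merge (and the SettingWithCopyWarning) is to assign through a boolean mask with .loc, starting from a dataframe that has only the calc_value column; rows where calc_value is null are never assigned and so stay NaN. The division by len(df) - 1 follows the original calculation:

from scipy import stats

mask = df['calc_value'].notnull()
# rank only the non-null rows; .loc creates the new columns with NaN elsewhere
df.loc[mask, 'rank_val'] = stats.rankdata(df.loc[mask, 'calc_value'], method='min') - 1
df.loc[mask, 'percentile'] = df.loc[mask, 'rank_val'] / float(len(df) - 1) * 100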