In Excel, a rolling geometric mean of size 2 on Col1 puts 6 in row 1 of Geo_2, since the geomean of 4 and 9 is 6. In pandas/NumPy the window appears to run the other way: with min_periods=1, the first row reflects a calculation over just one value, and each subsequent row uses the previous and current values of Col1.
I want the calculation window to cover the current and the next row of Col1, so that the first value of Geo_2 is 6 and the last value is 2.
import numpy as np
import pandas as pd
from scipy.stats.mstats import gmean

DASeries = [4, 9, 3, 3, 5, 7, 8, 4, 2]
DA_df = pd.DataFrame(DASeries)

geoMA2 = [2, 3]  # rolling window sizes
geo_df = pd.DataFrame([pd.Series(DASeries).rolling(window=elem, min_periods=1).apply(gmean, raw=True) for elem in geoMA2]).T

Final = pd.concat([DA_df, geo_df], axis=1)
labs = ['Col1', 'Geo_2', 'Geo_3']
Final.columns = labs
Final
Using .iloc[::-1]: reverse the Series, apply the (backward-looking) rolling window, then reverse the result back, so that each row's window effectively covers the current and next value:
pd.Series(DASeries).iloc[::-1].rolling(window=2, min_periods=1).apply(gmean, raw=True).iloc[::-1]
0 6.000000
1 5.196152
2 3.000000
3 3.872983
4 5.916080
5 7.483315
6 5.656854
7 2.828427
8 2.000000
dtype: float64
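An alternative sketch, assuming pandas 1.1+, is a genuinely forward-looking window via pd.api.indexers.FixedForwardWindowIndexer, which avoids the double reversal; a NumPy-based geometric mean stands in for scipy's gmean here so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

DASeries = [4, 9, 3, 3, 5, 7, 8, 4, 2]

# Forward-looking window of size 2: each row aggregates itself and the next row
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)

# NumPy-based geometric mean (matches scipy's gmean for positive values)
geo = pd.Series(DASeries).rolling(indexer, min_periods=1).apply(
    lambda x: np.exp(np.log(x).mean()), raw=True
)
print(geo)  # first value 6.0 (gmean of 4 and 9), last value 2.0
```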
Related
I'm trying to figure out whether the value in my dataframe is increasing into the next tens place. For example, I created a dataframe with a few values, duplicated the column, and shifted it so I can compare the two. But how do I determine whether the tens place is increasing, as opposed to the value increasing only slightly, for example by 0.02 points?
import pandas as pd
import numpy as np

data = {'value': ['9', '10', '19', '22', '31']}
df = pd.DataFrame(data)
df['value_copy'] = df['value'].shift(1)
df['Increase'] = np.where(df['value'] < df['value_copy'], 1, 0)  # note: 'value' holds strings, so this comparison is lexicographic
output should be in this case:
[nan,1,0,1,1]
IIUC, divide by 10, get the floor, then compare the successive values (diff(1)) to see if the difference is exactly 1:
np.floor(df['value'].astype(float).div(10)).diff(1).eq(1).astype(int)
If you want a jump to at least the next tens (or more) use ge (≥):
np.floor(df['value'].astype(float).div(10)).diff(1).ge(1).astype(int)
output:
0 0
1 1
2 0
3 1
4 1
Name: value, dtype: int64
NB. if you insist on the NaN:
s = np.floor(df['value'].astype(float).div(10)).diff(1)
s.eq(1).astype(int).mask(s.isna())
output:
0 NaN
1 1.0
2 0.0
3 1.0
4 1.0
Name: value, dtype: float64
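A quick sketch of the difference between the two variants: eq(1) flags only a jump of exactly one tens place, while ge(1) also flags larger jumps (the 9 → 31 step below crosses three tens):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': ['9', '31']})  # 0.9 -> floor 0, 3.1 -> floor 3
tens = np.floor(df['value'].astype(float).div(10)).diff(1)

print(tens.eq(1).astype(int).tolist())  # [0, 0]  exactly-one-tens test fails
print(tens.ge(1).astype(int).tolist())  # [0, 1]  at-least-one-tens test passes
```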
I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) with the dataframe of several columns.
How can this be fixed?
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
I think you can just use .corr, which returns all correlations between all columns, and then select just the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work.
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute only the subset of interesting correlations:
import pandas as pd

df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)

pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c != 'a'],
             columns=['var', 'corr'])
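If the frame mixes numeric and non-numeric columns (the question mentions "the rest of the numerical columns"), a sketch assuming pandas 1.5+, where corrwith accepts numeric_only, with a made-up df for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'imdb_score': [7.0, 8.1, 6.5, 9.0],
    'budget': [10, 50, 20, 80],
    'title': ['a', 'b', 'c', 'd'],  # non-numeric column, skipped by numeric_only
})

# Correlate every numeric column with imdb_score, then drop the trivial self-correlation
corr = df.corrwith(df['imdb_score'], numeric_only=True).drop('imdb_score')
print(corr)
```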
I have a pandas dataframe with more than 50 columns. All the data except the 1st column is float. I want to replace any value greater than 5.75 with 100. Can someone advise a function to do this?
The replace function does not work here, as to_replace matches values by equality only and cannot express a greater-than condition.
This can be done using
df['ColumnName'] = np.where(df['ColumnName'] > 5.75, 100, df['ColumnName'])
You can make a custom function and pass it to apply:
import pandas as pd
import random

df = pd.DataFrame({'col_name': [random.randint(0, 10) for x in range(100)]})

def f(x):
    if x > 5.75:
        return 100
    return x

df['modified'] = df['col_name'].apply(f)
print(df.head())
col_name modified
0 2 2
1 5 5
2 7 100
3 1 1
4 9 100
If you have a dataframe:
import pandas as pd
import random
df = pd.DataFrame({'first_column': [random.uniform(5,6) for x in range(10)]})
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 5.923234
8 5.489964
9 5.127154
Then check if the value is greater than 5.75:
df[df > 5.75] = 100
print(df)
Gives me:
first_column
0 5.620439
1 5.640604
2 5.286608
3 5.642898
4 5.742910
5 5.096862
6 5.360492
7 100.000000
8 5.489964
9 5.127154
import numpy as np
import pandas as pd

# Create df
np.random.seed(0)
df = pd.DataFrame(2 * np.random.randn(100, 50))

for col_name in df.columns[1:]:  # skip first column
    df.loc[df[col_name] > 5.75, col_name] = 100
For a single column there is also:
np.where(df.value > 5.75, 100, df.value)
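A sketch of a loop-free variant using DataFrame.mask, which handles the "everything except the first column" requirement in one step (the column names and data here are made up):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(10 * np.random.rand(5, 3), columns=['a', 'b', 'c'])
df.insert(0, 'name', list('vwxyz'))  # non-float first column, left untouched

# Replace values above the threshold in all columns except the first
num = df.columns[1:]
df[num] = df[num].mask(df[num] > 5.75, 100)
print(df)
```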
How can I randomly introduce NaN values into my dataset, column by column, taking into account the null values already present in the starting data? I want, for example, 20% NaN values per column.
For example, if my dataset has 3 columns "A", "B" and "C", each with an existing NaN rate, how do I randomly introduce NaN values per column to reach 20% in each:
A: 10% NaN
B: 15% NaN
C: 8% NaN
For the moment I tried this code, but it degrades my dataset too much and I don't think it is the right way:
df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20, .80]))
I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.
import numpy as np
import pandas as pd

A = pd.Series(np.arange(99), dtype=float)

# Original missing rate (for illustration)
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.nan

###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()

# Adjusting for the values that are not yet missing
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.nan

A.isna().mean()
Obviously, it will not always be exactly 20%...
Update
Applying it for the whole dataframe
for col in df:
    ori_rat = df[col].isna().mean()
    if ori_rat >= 0.2:
        continue
    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.nan
Update 2
I made a correction to also take into account the effect of dropping NaN values when calculating the ratio.
Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'A':list(range(100)),'B':list(range(100)),'C':list(range(100))})
#before adding nan
print(df.head(10))
nan_percent = {'A':0.10, 'B':0.15, 'C':0.08}
for col in df:
    for i, row_value in df[col].items():  # .iteritems() was removed in pandas 2.0
        if random.random() <= nan_percent[col]:
            df.loc[i, col] = np.nan
#after adding nan
print(df.head(10))
Here is a way to get as close to 20% nan in each column as possible:
def input_nan(x, pct):
    n = int(len(x) * (pct - x.isna().mean()))
    idxs = np.random.choice(len(x), max(n, 0), replace=False, p=x.notna() / x.notna().sum())
    x.iloc[idxs] = np.nan
    return x

df = df.apply(input_nan, pct=.2)
It first takes the difference between the NaN percent you want and the percent of NaN values already in your dataset. Then it multiplies that by the length of the column, which gives you how many NaN values to put in (n). Then it uses np.random.choice, which randomly chooses n indexes from positions that do not already hold NaN (their weights in p are zero).
Example:
df = pd.DataFrame({'y': np.random.randn(10), 'x1': np.random.randn(10), 'x2': np.random.randn(10)})
df.loc[1, 'y'] = np.nan
df.loc[8, 'y'] = np.nan
df.loc[5, 'x2'] = np.nan
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 0.289559
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 0.180651 NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 0.475805 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
df.apply(input_nan, pct=.2)
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 NaN
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 NaN NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 NaN 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
I have applied it to the whole dataset, but you can apply it to any columns you want. For example, if you wanted 15% NaNs in columns y and x1, you could call df[['y','x1']].apply(input_nan, pct=.15).
I guess I am a little late to the party but if someone needs a solution that's faster and takes the percentage value into account when introducing null values, here's the code:
nan_percent = {'A': 0.15, 'B': 0.05, 'C': 0.23}

for col, perc in nan_percent.items():
    df['null'] = np.random.choice([0, 1], size=df.shape[0], p=[1 - perc, perc])
    df.loc[df['null'] == 1, col] = np.nan

df.drop(columns=['null'], inplace=True)
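The helper column can be skipped entirely; a sketch drawing a Boolean mask per column (note this also yields only an approximate percentage and, like the answer above, ignores pre-existing NaNs):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': range(100), 'B': range(100), 'C': range(100)}, dtype=float)
nan_percent = {'A': 0.15, 'B': 0.05, 'C': 0.23}

for col, perc in nan_percent.items():
    mask = np.random.random(len(df)) < perc  # True with probability perc
    df.loc[mask, col] = np.nan

print(df.isna().mean())  # roughly the requested rates
```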
I have the dataframe below. I want to subtract each even-indexed row's value from the odd-indexed row that follows it, and put the result in a new dataframe. How can I do it?
import pandas as pd
import numpy as np
raw_data = {'Time': [281.54385, 436.55295, 441.74910, 528.36445,
974.48405, 980.67895, 986.65435, 1026.02485]}
data = pd.DataFrame(raw_data)
data
dataframe
Time
0 281.54385
1 436.55295
2 441.74910
3 528.36445
4 974.48405
5 980.67895
6 986.65435
7 1026.02485
Wanted result
ON_TIME
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use NumPy indexing:
res = pd.DataFrame(data.values[1::2] - data.values[::2], columns=['Time'])
print(res)
Time
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use shift for the subtraction, then pick every 2nd element, starting with the 2nd element (index 1):
(data.Time - data.Time.shift())[1::2].rename('On Time').reset_index(drop=True)
outputs:
0 155.00910
1 86.61535
2 6.19490
3 39.37050
Name: On Time, dtype: float64
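Another sketch: reshape the values into consecutive pairs and difference within each pair (this assumes the frame has an even number of rows):

```python
import numpy as np
import pandas as pd

raw_data = {'Time': [281.54385, 436.55295, 441.74910, 528.36445,
                     974.48405, 980.67895, 986.65435, 1026.02485]}
data = pd.DataFrame(raw_data)

# Pair up consecutive rows (shape (4, 2)), then subtract within each pair
pairs = data['Time'].to_numpy().reshape(-1, 2)
res = pd.DataFrame({'ON_TIME': np.diff(pairs, axis=1).ravel()})
print(res)
```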