Difference between numpy var() and pandas var() - python

I recently noticed that numpy.var() and pandas.DataFrame.var() (or pandas.Series.var()) give different values. Is there an actual difference between them?
Here is my dataset.
Country GDP Area Continent
0 India 2.79 3.287 Asia
1 USA 20.54 9.840 North America
2 China 13.61 9.590 Asia
Here is my code:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
catDf.iloc[:,1:-1] = ss.fit_transform(catDf.iloc[:,1:-1])
Now checking Pandas Variance
# Pandas Variance
print(catDf.var())
print(catDf.iloc[:,1:-1].var())
print(catDf.iloc[:,1].var())
print(catDf.iloc[:,2].var())
The output is
GDP 1.5
Area 1.5
dtype: float64
GDP 1.5
Area 1.5
dtype: float64
1.5000000000000002
1.5000000000000002
Whereas I expected it to be 1, since I have applied StandardScaler to these columns.
And for numpy Variance
print(catDf.iloc[:,1:-1].values.var())
print(catDf.iloc[:,1].values.var())
print(catDf.iloc[:,2].values.var())
The output is
1.0000000000000002
1.0000000000000002
1.0000000000000002
Which seems correct.

pandas var() uses ddof=1 by default, while numpy var() uses ddof=0.
To get the same variance in pandas as you get in numpy, use
catDf.iloc[:,1:-1].var(ddof=0)
This comes down to the difference between population variance and sample variance.
Note that the sklearn StandardScaler documentation explicitly mentions that it uses ddof=0, and that, since this is unlikely to affect model performance (it is only used for scaling), ddof is not exposed as a configurable parameter.
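Concretely, with n = 3 rows a column standardized to unit population variance has a sample variance of n/(n-1) = 1.5, which is exactly what pandas reports above. A minimal sketch of the two conventions, using the GDP values from the question:
import numpy as np
import pandas as pd

x = pd.Series([2.79, 20.54, 13.61])  # GDP column from the question
n = len(x)

pop_var = ((x - x.mean()) ** 2).sum() / n         # ddof=0: population variance (numpy default)
samp_var = ((x - x.mean()) ** 2).sum() / (n - 1)  # ddof=1: sample variance (pandas default)

print(pop_var, np.var(x.values), x.var(ddof=0))      # all three match
print(samp_var, np.var(x.values, ddof=1), x.var())   # all three match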

Related

How to calculate weighted mean and median in python?

I have data in a pandas DataFrame or NumPy array and want to calculate the weighted mean (average) or weighted median based on weights in another column or array. I am looking for a simple solution rather than writing functions from scratch or copy-pasting them everywhere I need them.
The data looks like this -
state.head()
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
And I want to calculate the weighted mean or median of murder rate which takes into account the different populations in the states.
How can I do that?
First, install the weightedstats library in Python:
pip install weightedstats
Then, import it and do the following:
import weightedstats as ws
Weighted Mean
ws.weighted_mean(state['Murder.Rate'], weights=state['Population'])
4.445833981123394
Weighted Median
ws.weighted_median(state['Murder.Rate'], weights=state['Population'])
4.4
It also has dedicated weighted mean and median methods for NumPy arrays. The methods above will still work, but these are available in case you need them.
my_data = [1, 2, 3, 4, 5]
my_weights = [10, 1, 1, 1, 9]
ws.numpy_weighted_mean(my_data, weights=my_weights)
ws.numpy_weighted_median(my_data, weights=my_weights)
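If you would rather not add a dependency, the weighted mean is also a one-liner in plain NumPy via np.average (NumPy has no built-in weighted median, so this sketch only covers the mean):
import numpy as np

my_data = [1, 2, 3, 4, 5]
my_weights = [10, 1, 1, 1, 9]

# np.average computes sum(w_i * x_i) / sum(w_i), i.e. the weighted mean
print(np.average(my_data, weights=my_weights))  # 2.909...

# The same call works directly on DataFrame columns:
# np.average(state['Murder.Rate'], weights=state['Population'])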

Python - Pandas: how can I interpolate between values that grow exponentially?

I have a Pandas Series that contains the price evolution of a product (my country has high inflation), or, say, the number of coronavirus-infected people in a certain country. The values in both of these datasets grow exponentially; that means that if you had something like [3, NaN, 27], you would want the missing value to be filled with 9. I checked the interpolation methods in the Pandas documentation but, unless I missed something, I didn't find anything about this type of interpolation.
I can do it manually: take the geometric mean, or, in the case of more values, get the average growth rate with (final value / initial value)^(1 / distance between them) and then multiply accordingly. But there are a lot of values to fill in in my Series, so how do I do this automatically? I guess I'm missing something, since this seems fairly basic.
Thank you.
You could take the logarithm of your series, interpolate linearly, and then transform it back to the exponential scale.
import pandas as pd
import numpy as np
arr = np.exp(np.arange(1,10))
arr = pd.Series(arr)
arr[3] = None
0 2.718282
1 7.389056
2 20.085537
3 NaN
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
arr = np.log(arr) # Transform according to assumed process.
arr = arr.interpolate('linear') # Interpolate.
np.exp(arr) # Invert previous transformation.
0 2.718282
1 7.389056
2 20.085537
3 54.598150
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
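As a quick sanity check on the values above, the geometric-mean approach described in the question reproduces the same fill value:
import numpy as np

# Geometric mean of the two neighbours of the gap
print(np.sqrt(20.085537 * 148.413159))  # ~54.598, matching row 3 above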

Linear regression and plots through each numerical independent variable and target variable

I would like to know whether there is a way to run a one-on-one (one independent variable vs. the target variable) linear regression analysis and get its p-value, R² value, and a plot showing how linear (or not) the relationship is. I want this to run on every independent variable separately. As far as I know, it is possible to get an OLS regression analysis from the Python statsmodels library, but it runs on the whole dataset at once and there are no plots to understand it visually.
To quickly visualize the regressions, you can try the following using seaborn:
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
data = load_iris()
df = pd.DataFrame(data.data, columns=['sepal.length','sepal.width','petal.length','petal.width'])
df = pd.melt(df,id_vars='sepal.length')
df[:5]
sepal.length variable value
0 5.1 sepal.width 3.5
1 4.9 sepal.width 3.0
2 4.7 sepal.width 3.2
3 4.6 sepal.width 3.1
4 5.0 sepal.width 3.6
sns.lmplot(x='sepal.length', y='value', data=df, col='variable',
           col_wrap=2, aspect=0.6, height=4, palette='coolwarm')
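For the p-value and R² of each one-variable regression, a minimal sketch (assuming the same iris frame built above, with sepal.length as the target) is to loop over the independent variables and fit a one-variable OLS with statsmodels:
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=['sepal.length', 'sepal.width', 'petal.length', 'petal.width'])

target = 'sepal.length'
for col in df.columns.drop(target):
    X = sm.add_constant(df[[col]])       # add an intercept term
    model = sm.OLS(df[target], X).fit()  # one independent variable vs. the target
    print(col, 'R2 =', round(model.rsquared, 3), 'p-value =', round(model.pvalues[col], 5))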

How can I calculate correlation between two sets of data within two columns?

I would like to first create two columns and then use log() to calculate the periodic daily returns for the Price column and the Adjusted Close column, and then use those periodic returns to find the correlation between them.
I tried
Combine_data['log_return'] = np.log(1 + Combine_data.pct_change)
Combine_data.head()
but it is not working.
Combine_data= pd.merge(XAU_USD,SP500, on='Date',suffixes=
('(GOLD)','(SP500)'))
Combine_data.set_index('Date',inplace=True)
Combine_data.head()
This is what my output looks like:
You can try that as follows:
import numpy as np

# df is the randomly generated gold/silver frame shown below
ser1 = (df['gold'] + 1).apply(np.log)
ser2 = (df['silver'] + 1).apply(np.log)
np.corrcoef(ser1, ser2)
The result looks like:
Out[431]:
array([[1. , 0.30121126],
[0.30121126, 1. ]])
A correlation of 0.301 is not bad, given that the data is randomly generated :-)
The randomly generated input data looks like:
Out[430]:
gold silver date
0 793.559641 19.112793 2019-08-23
1 1428.329390 17.758924 2019-08-24
2 1044.061092 17.962435 2019-08-25
3 1222.397539 17.638691 2019-08-26
4 890.945841 11.593497 2019-08-27
5 1224.616916 15.759736 2019-08-28
6 1059.684075 12.900665 2019-08-29
7 1147.011421 20.274250 2019-08-30
8 929.638993 12.244630 2019-08-31
9 515.545695 14.609073 2019-09-01
Here are two methods to do this:
Use scipy's Pearson correlation:
from scipy.stats import pearsonr
coeff = pearsonr(x, y)[0]  # index 0 is the coefficient, index 1 is the p-value
Use numpy correlation:
import numpy as np
coeff = np.corrcoef(x,y)[0,1]
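On the snippet in the question itself: pct_change is a method, so it needs parentheses; without them, np.log(1 + Combine_data.pct_change) raises a TypeError. A minimal sketch with made-up prices and hypothetical column names standing in for the merged gold/S&P frame:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged frame in the question
Combine_data = pd.DataFrame(
    {'Close(GOLD)': [1500.0, 1512.3, 1498.7, 1505.1],
     'Close(SP500)': [2900.0, 2915.4, 2890.2, 2921.8]},
    index=pd.date_range('2019-08-23', periods=4, name='Date'))

# Periodic daily log returns: log(P_t / P_{t-1}) == log(1 + pct_change)
log_returns = np.log(1 + Combine_data.pct_change()).dropna()

# Correlation between the two log-return series
print(log_returns['Close(GOLD)'].corr(log_returns['Close(SP500)']))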

What is the best way to oversample a dataframe preserving its statistical properties in Python 3?

I have the following toy df:
FilterSystemO2Concentration (Percentage) ProcessChamberHumidityAbsolute (g/m3) ProcessChamberPressure (mbar)
0 0.156 1 29.5 28.4 29.6 28.4
2 0.149 1.3 29.567 28.9
3 0.149 1 29.567 28.9
4 0.148 1.6 29.6 29.4
This is just a sample; the original has over 1200 rows. What's the best way to oversample it while preserving its statistical properties?
I have googled this for some time and I have only come across resampling algorithms for imbalanced classes, but that's not what I want. I'm not interested in balancing the data; I just would like to produce more samples in a way that more or less preserves the original data distributions and statistical properties.
Thanks in advance.
Using scipy.stats.rv_histogram(np.histogram(data)).isf(np.random.random(size=n)) will create n new samples randomly chosen from the distribution (histogram) of the data. You can do this for each column:
Example:
import numpy as np
import pandas as pd
import scipy.stats as stats

df = pd.DataFrame({'x': np.random.random(100) * 3, 'y': np.random.random(100) * 4 - 2})
n = 5
# Draw n values per column from that column's histogram distribution
new_values = pd.DataFrame({s: stats.rv_histogram(np.histogram(df[s])).isf(np.random.random(size=n)) for s in df.columns})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df.assign(data_type='original'), new_values.assign(data_type='oversampled')])
df.tail(7)
>> x y data_type
98 1.176073 -0.207858 original
99 0.734781 -0.223110 original
0 2.014739 -0.369475 oversampled
1 2.825933 -1.122614 oversampled
2 0.155204 1.421869 oversampled
3 1.072144 -1.834163 oversampled
4 1.251650 1.353681 oversampled
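Note that drawing each column from its own histogram ignores correlations between columns. If those matter, one alternative (a sketch, not part of the answer above) is to bootstrap whole rows, i.e. sample rows with replacement, which preserves cross-column relationships at the cost of only reusing observed value combinations:
# Bootstrap whole rows from the original data so cross-column relationships are kept
original = df[df['data_type'] == 'original']
bootstrapped = original.sample(n=5, replace=True, random_state=0).assign(data_type='bootstrapped')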
