I have data in pandas DataFrame or NumPy array and want to calculate the weighted mean(average) or weighted median based on some weights in another column or array. I am looking for a simple solution rather than writing functions from scratch or copy-paste them everywhere I need them.
The data looks like this -
state.head()
State Population Murder.Rate Abbreviation
0 Alabama 4779736 5.7 AL
1 Alaska 710231 5.6 AK
2 Arizona 6392017 4.7 AZ
3 Arkansas 2915918 5.6 AR
4 California 37253956 4.4 CA
And I want to calculate the weighted mean or median of murder rate which takes into account the different populations in the states.
How can I do that?
First, install the weightedstats library in python.
pip install weightedstats
Then, do the following -
Weighted Mean
ws.weighted_mean(state['Murder.Rate'], weights=state['Population'])
4.445833981123394
Weighted Median
ws.weighted_median(state['Murder.Rate'], weights=state['Population'])
4.4
It also has special weighted mean and median methods to use with numpy arrays. The above methods will work but in case if you need it.
my_data = [1, 2, 3, 4, 5]
my_weights = [10, 1, 1, 1, 9]
ws.numpy_weighted_mean(my_data, weights=my_weights)
ws.numpy_weighted_median(my_data, weights=my_weights)
Related
I'm sure I'm not doing this right. I have a dataframe with a series of data, basically year and a value. I want to smoothen the curve and was looking to use spline to test results.
Basically I was trying to take a column and return the new datapoints into another column:
df['smooth'] = df['value'].interpolate(method='spline', order=3, s=0.)
but the results between smooth and value are the same.
value periodDate smooth diffSmooth
6 422976.72 2019 422976.72 0.0
7 190865.94 2018 190865.94 0.0
8 188440.89 2017 188440.89 0.0
9 192481.64 2016 192481.64 0.0
10 191958.64 2015 191958.64 0.0
11 681376.60 2014 681376.60 0.0
Any suggestions of what I'm doing wrong?
According to the Pandas docs, the interpolate function fills missing values in a sequence, so for example linear interpolation would be [0, 1, NaN, 3] -> [0, 1, 2, 3]. In short, you're using the wrong function. If you want to fit a spline, sklearn or scipy or numpy may be better bets.
I have a Pandas Series that contains the price evolution of a product (my country has high inflation), or say, the amount of coronavirus infected people in a certain country. The values in both of these datasets grows exponentially; that means that if you had something like [3, NaN, 27] you'd want to interpolate so that the missing value is filled with 9 in this case. I checked the interpolation method in the Pandas documentation but unless I missed something, I didn't find anything about this type of interpolation.
I can do it manually, you just take the geometric mean, or in the case of more values, get the average growth rate by doing (final value/initial value)^(1/distance between them) and then multiply accordingly. But there's a lot of values to fill in in my Series, so how do I do this automatically? I guess I'm missing something since this seems to be something very basic.
Thank you.
You could take the logarithm of your series, interpolate lineraly and then transform it back to your exponential scale.
import pandas as pd
import numpy as np
arr = np.exp(np.arange(1,10))
arr = pd.Series(arr)
arr[3] = None
0 2.718282
1 7.389056
2 20.085537
3 NaN
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
arr = np.log(arr) # Transform according to assumed process.
arr = arr.interpolate('linear') # Interpolate.
np.exp(arr) # Invert previous transformation.
0 2.718282
1 7.389056
2 20.085537
3 54.598150
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
Suppose I have the following data of repeat observations for US states with some value of interest:
US_State Value
Alabama 1
Alabama 10
Alabama 9
Michigan 8
Michigan 9
Michigan 2
...
How can I generate pairwise correlations for Value between all the US_State combinations? I've tried a few different things (pivot, groupby, and more), but I can't seem to wrap my head around the proper approach.
The ideal output would look like:
Alabama Michigan ...
Alabama 1 0.5
Michigan 0.5 1
...
There is a way utilising Pandas to its extents, but this is only under the assumption that each state in the input dataset has the same number of observations, otherwise correlation coefficient does not really make sense and the results will become a bit funky.
import pandas as pd
df = pd.DataFrame()
df['US_State'] = ["Alabama", "Alabama", "Alabama", "Michigan", "Michigan", "Michigan", "Oregon", "Oregon", "Oregon"]
df['Value'] = [1, 10, 9, 8, 9, 2, 6, 1, 2]
pd.DataFrame(df.groupby("US_State")['Value'].apply(lambda x: list(x))).T.apply(lambda x: pd.Series(*x), axis=0).corr()
which results into
US_State Alabama Michigan Oregon
US_State
Alabama 1.000000 -0.285578 -0.996078
Michigan -0.285578 1.000000 0.199667
Oregon -0.996078 0.199667 1.000000
What the code basically does is it collects the data for each state into a single cell as a list, transposes the dataframe to make the states columns and then expands the collected cell of list data into dataframe rows for each state. Then you can just call the standard corr() method of pandas dataframe.
Pandas DataFrame has a built-in correlation matrix function. You will somehow need to get your data into a DataFrame (takes numpy objects, plain dict (shown), etc).
from pandas import DataFrame
data = {'AL': [1,10,9],
'MI': [8,9,2],
'CO': [11,5,17]
}
df = DataFrame(data)
corrMatrix = df.corr()
print(corrMatrix)
# optional heatmap
import seaborn as sn
sn.heatmap(corrMatrix, annot=True, cmap='coolwarm')
AL MI CO
AL 1.000000 -0.285578 -0.101361
MI -0.285578 1.000000 -0.924473
CO -0.101361 -0.924473 1.000000
I have a DataFrame for a fast Fourier transformed signal.
There is one column for the frequency in Hz and another column for the corresponding amplitude.
I have read a post made a couple of years ago, that you can use a simple boolean function to exclude or only include outliers in the final data frame that are above or below a few standard deviations.
df = pd.DataFrame({'Data':np.random.normal(size=200)}) # example dataset of normally distributed data.
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] # or if you prefer the other way around
The problem is that my signal drops several magnitudes (up to 10 000 times smaller) as frequency increases up to 50 000Hz. Therefore, I am unable to use a function that only exports values above 3 standard deviation because I will only pick up the "peaks" outliers from the first 50 Hz.
Is there a way I can export outliers in my dataframe that are above 3 rolling standard deviations of a rolling mean instead?
This is maybe best illustrated with a quick example. Basically you're comparing your existing data to a new column that is the rolling mean plus three standard deviations, also on a rolling basis.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.
r = df.rolling(window=20) # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std() # Combine a mean and stdev on that object
print(df[df.Data > mps.Data]) # Boolean filter
# Data
# 55 40.0
# 80 40.0
To add a new column filtering only to outliers, with NaN elsewhere:
df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)
print(df.iloc[50:60])
Data Peaks
50 -1.29409 NaN
51 -1.03879 NaN
52 1.74371 NaN
53 -0.79806 NaN
54 0.02968 NaN
55 40.00000 40.0
56 0.89071 NaN
57 1.75489 NaN
58 1.49564 NaN
59 1.06939 NaN
Here .where returns
An object of same shape as self and whose corresponding entries are
from self where cond is True and otherwise are from other.
I have a dataset that looks like this
1908 January 5.0 -1.4
1908 February 7.3 1.9
1908 March 6.2 0.3
1908 April NaN 2.1
1908 May NaN 7.7
1908 June 17.7 8.7
1908 July NaN 11.0
1908 August 17.5 9.7
1908 September 16.3 8.4
1908 October 14.6 8.0
1908 November 9.6 3.4
1908 December 5.8 NaN
1909 January 5.0 0.1
1909 February 5.5 -0.3
1909 March 5.6 -0.3
1909 April 12.2 3.3
1909 May 14.7 4.8
1909 June 15.0 7.5
1909 July 17.3 10.8
1909 August 18.8 10.7
I want to replace the NaNs using KNN as the method. I looked up sklearns Imputer class but it supports only mean, median and mode imputation. There is a feature request here but I don't think that's been implemented as of now. Any ideas on how to replace the NaNs from the last two columns using KNN?
Edit:
Since I need to run codes in another environment, I don't have the luxury of installing packages. Sklearn, pandas, numpy, and other standard packages are the only ones I can use.
fancyimpute package supports such kind of imputation, using the following API:
from fancyimpute import KNN
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replace with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).complete(X_incomplete)
Here are the imputations supported by this package:
•SimpleFill: Replaces missing entries with the mean or median of each
column.
•KNN: Nearest neighbor imputations which weights samples using the
mean squared difference on features for which two rows both have
observed data.
•SoftImpute: Matrix completion by iterative soft thresholding of SVD
decompositions. Inspired by the softImpute package for R, which is
based on Spectral Regularization Algorithms for Learning Large
Incomplete Matrices by Mazumder et. al.
•IterativeSVD: Matrix completion by iterative low-rank SVD
decomposition. Should be similar to SVDimpute from Missing value
estimation methods for DNA microarrays by Troyanskaya et. al.
•MICE: Reimplementation of Multiple Imputation by Chained Equations.
•MatrixFactorization: Direct factorization of the incomplete matrix
into low-rank U and V, with an L1 sparsity penalty on the elements of
U and an L2 penalty on the elements of V. Solved by gradient descent.
•NuclearNormMinimization: Simple implementation of Exact Matrix
Completion via Convex Optimization by Emmanuel Candes and Benjamin
Recht using cvxpy. Too slow for large matrices.
•BiScaler: Iterative estimation of row/column means and standard
deviations to get doubly normalized matrix. Not guaranteed to converge
but works well in practice. Taken from Matrix Completion and Low-Rank
SVD via Fast Alternating Least Squares.
fancyimpute's KNN imputation no more supports the complete function as suggested by other answer, we need to now use fit_transform
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replace with NaN
# Use 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).fit_transform(X_incomplete)
reference https://github.com/iskandr/fancyimpute
scikit-learn v0.22 supports native KNN Imputation
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
This pull request to sklearn adds KNN support. You can get the code from here.