Understanding the percentiles= calculation in pandas describe() - python

I am trying to understand the following:
1) How are the percentiles calculated?
2) Why did Python not return the values in a sorted order (which was my expectation) as an output?
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Thanks
Python 2:
import pandas as pd

new = pd.DataFrame({'a': range(10), 'b': [60510, 60053, 54968, 62269, 91107, 29812, 45503, 6460, 62521, 37128]})
print new.describe(percentiles=[0, 0.1, 0.2, 0.3, 0.4, 0.50, 0.6, 0.7, 0.8, 0.90, 1])

1) How are the percentiles calculated?
The 90% percentile/quantile means that 10% of the data is greater than that value and 90% of the data falls below it. By default, describe is based on linear interpolation. This is why, in your a column, the values increment by 0.9 instead of matching the original data values of [0, 1, 2, ...]. If you want the nearest actual values instead of interpolation, you can use the quantile method instead of describe and change the interpolation parameter.
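For instance, a rough sketch against the question's DataFrame (interpolation is a documented parameter of Series.quantile; the percentile levels here are just examples):
new['a'].quantile([0.1, 0.5, 0.9])                            # linear interpolation (default)
new['a'].quantile([0.1, 0.5, 0.9], interpolation='nearest')   # snap to actual data values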
2) Why did Python not return the values in a sorted order (which was my expectation) as an output?
Your question is unclear here. It does return values in a sorted order, indexed by the rows of the .describe output: count, mean, std, min, the quantiles from low to high, then max. If you only want the quantiles and not the other statistics, you can use the quantile method instead.
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Nothing is wrong with the output. Those quantiles are accurate, although they aren't very meaningful when your data has only 10 observations.
Edit: It wasn't originally clear to me that you were attempting to do stats on a frequency table. I don't know of a direct solution in pandas that doesn't involve moving your data over to a numpy array. You could use numpy.repeat to get a raw list of observations that you can put back into pandas and run descriptive stats on.
import numpy as np

vals = np.array(new.a)                   # the distinct values
freqs = np.array(new.b)                  # treated as frequencies, per the edit above
observations = np.repeat(vals, freqs)    # one entry per observation
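Then, as a rough sketch, the expanded observations can go back into pandas and you can ask directly for the cutoffs you need (the 10%, 50% and 90% levels here are only examples):
obs = pd.Series(observations)
cutoffs = obs.quantile([0.1, 0.5, 0.9])   # the values below which 10%, 50% and 90% of the population lie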

Related

Multiply by unique number based on which pandas interval a number falls within

I am trying to take a number and multiply it by a unique number depending on which interval it falls within.
I did a groupby on my pandas dataframe according to which bins a value fell into:
bins = pd.cut(df['A'], 50)
grouped = df['B'].groupby(bins)
interval_averages = grouped.mean()
A
(0.00548, 0.0209] 0.010970
(0.0209, 0.0357] 0.019546
(0.0357, 0.0504] 0.036205
(0.0504, 0.0651] 0.053656
(0.0651, 0.0798] 0.068580
(0.0798, 0.0946] 0.086754
(0.0946, 0.109] 0.094038
(0.109, 0.124] 0.114710
(0.124, 0.139] 0.136236
(0.139, 0.153] 0.142115
(0.153, 0.168] 0.161752
(0.168, 0.183] 0.185066
(0.183, 0.198] 0.205451
I need to be able to check which interval a number falls into, and then multiply it by the average value of the B column for that interval range.
From the docs I know I can use the in keyword to check if a number is in an interval, but I cannot find how to access the value for a given interval. In addition, I don't want to have to loop through the Series checking whether the number is in each interval, as that seems quite slow.
Does anybody know how to do this efficiently?
Thanks a lot.
You can store the numbers being tested in an array and use the cut() method with your bins to sort the values into their respective intervals. This returns an array with the bin that each number has fallen into. You can use this array to determine where the value you need in the dataframe (the mean) is located (you will know the correct row) and access the value via iloc.
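A rough sketch of that idea, using label-based lookup (the numbers in test_vals are made up, and this assumes the new values fall inside the range covered by the original bins):
import numpy as np
import pandas as pd

test_vals = np.array([0.03, 0.12, 0.18])                    # hypothetical numbers to scale
test_bins = pd.cut(test_vals, bins=bins.cat.categories)     # sort them into the same intervals
means = interval_averages.loc[list(test_bins)].to_numpy()   # mean of B for each interval
result = test_vals * means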
Hopefully this helps a bit

Python Pandas DataFrame Describe Gives Wrong Results?

While I was looking at the results of the describe() method, I realized something very strange. The data is the House Price data from Kaggle. Below, you can see the code and the result for the "Condition2" feature:
train.groupby(train["Condition2"].fillna('None'))["SalePrice"].describe()
On the other hand, when I look at the data in Excel, the quantiles do not match.
So, while 33% of the data points have a SalePrice of 85K, how can the 25% quantile be 95.5K? It is really weird, or maybe I'm missing something. Could anybody explain this?
Quartiles seek to divide the data into four equal groups, so the value of the 25% quantile isn't necessarily going to be an actual observed value, especially with a small sample size like this where n=6. There are different methods of calculating quantile values. describe uses the linear method, described in the docs as
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
If you switch to the lower method it produces different results.
>>> feedr = train_df[train_df.Condition2 == 'Feedr']
>>> feedr.SalePrice.quantile(.50)
127500.0
>>> feedr.SalePrice.quantile(.50, interpolation='lower')
127000
>>> feedr.SalePrice.quantile(.25, interpolation='lower')
85000
>>> feedr.SalePrice.quantile(.25)
95500.0
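As a rough check of the linear rule (the 85000 and 127000 figures are simply read off the interpolation='lower' results above): with n=6 observations and q=0.25, the fractional index is (6 - 1) * 0.25 = 1.25, so i is the second sorted value (85000), j is the third (127000), and the fraction is 0.25:
>>> 85000 + 0.25 * (127000 - 85000)   # i + (j - i) * fraction
95500.0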

How to reverse a seasonal log difference of timeseries in python

Could you please help me with this issue? I have searched a lot but cannot solve it. I have a multivariate dataframe for electricity consumption and I am doing forecasting using a VAR (Vector Auto-Regression) model for time series.
I made the predictions, but I need to reverse the transformation of the time series (energy_log_diff), since I applied a seasonal log difference to make the series stationary, in order to get the real energy values:
df['energy_log'] = np.log(df['energy'])
df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(1)
For that, I did first:
df['energy'] = np.exp(df['energy_log_diff'])
This is supposed to give the energy difference between two values lagged by 365 days, but I am not sure about this either.
How can I do this?
The reason we use log differences is that they are additive, so we can take a cumulative sum and then multiply by the last observed value.
last_energy = df['energy'].iloc[-1]
df['energy'] = np.exp(df['energy'].cumsum()) * last_energy
As for seasonality: if you de-seasoned the log diff, simply add (or multiply) before you do the above step; if you de-seasoned the original series, then add after.
Short answer: you have to run the inverse transformations in the reverse order, which in your case means:
Inverse transform of differencing
Inverse transform of log
How to convert differenced forecasts back is described e.g. here (the question has an R flag but there is no code and the idea is the same for Python). In your post you calculate the exponential, but you have to reverse the differencing first before doing that.
You could try this:
energy_log_diff_rev = []
v_prev = v_0
for v in df['energy_log_diff']:
    v_prev += v
    energy_log_diff_rev.append(v_prev)
Or, if you prefer pandas way, you can try this (only for the first order difference):
energy_log_diff_rev = df['energy_log_diff'].expanding(min_periods=0).sum() + v_0
Note the v_0 value, which is the original value (after the log transformation, before differencing); it is described in the link above.
Then, after this step, you can do the exponential (inverse of log):
energy_orig = np.exp(energy_log_diff_rev)
Notes/Questions:
You mention values lagged by 365 but you are shifting the data by 1. Does that mean you have yearly data? Or would you rather do df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(365) instead (in the case of daily granularity)? A rough sketch of inverting that seasonal case follows below.
You want to get the reversed time series from the predictions, is that right? Or am I missing something? In that case you would apply the inverse transformations to the predictions, not to the data, which I only used above for explanation.
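For completeness, a hedged sketch of inverting the seasonal (lag 365) variant mentioned above, assuming the first 365 observed log levels are available as a seed (for forecasts you would chain the predicted differences onto the last observed season instead):
import numpy as np

rebuilt_log = df['energy_log_diff'].copy()
rebuilt_log.iloc[:365] = df['energy_log'].iloc[:365].to_numpy()   # seed: first season of known log levels
for t in range(365, len(rebuilt_log)):
    rebuilt_log.iloc[t] += rebuilt_log.iloc[t - 365]              # log level = seasonal diff + level one season back
energy_rebuilt = np.exp(rebuilt_log)                              # undo the log last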

Generate missing values on the dataset based on ZIPF distribution

Currently, I want to observe the impact of missing values on my dataset. I replace a share of the data points (10, 20, 90%) with missing values and observe the impact. The function below replaces a certain percentage of the data points with missing values.
import random
import numpy as np

def dropout(df, percent):
    # create df copy
    mat = df.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace with NaN
    np.put(mat, mask, [np.NaN] * len(mask))
    return mat
My question is: I want to place the missing values based on a Zipf distribution / power law / long tail. For instance, I have a dataset that contains 10 columns (5 columns of categorical data and 5 columns of numerical data). I want to replace some data points in the 5 categorical columns based on Zipf's law, so that columns on the left side have more missing values than columns on the right side.
I used Python to do this task.
I saw the manual about the zipf distribution at this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html but it still doesn't help me much.
Zipf distributions are a family of distributions on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way:
Pick a parameter for your Zipf distribution, say a = 2 as in the example given on the SciPy documentation page.
Looking at the plot given on that same page, you could decide to truncate at 10, i.e. if any sampled value greater than 10 comes up, you simply discard it.
Then you could map the remaining domain of 1 to 10 linearly onto your five categorical columns: the values 1 and 2 correspond to the first column, and so on.
So you iteratively sample single values from your Zipf distribution using the NumPy function. For every sampled value, you delete one data point in the column the value corresponds to (see 3.), until you have reached the overall desired percentage of missing values.
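A rough sketch of that procedure, assuming df is your DataFrame; the column names in cat_cols, the a=2 parameter, the truncation at 10 and the 10% target are all illustrative choices, not requirements:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cat_cols = ['c1', 'c2', 'c3', 'c4', 'c5']              # hypothetical names of the 5 categorical columns
target = int(0.10 * len(df) * len(cat_cols))           # e.g. knock out 10% of the categorical cells

made_missing = 0
while made_missing < target:
    k = rng.zipf(2.0)                                  # step 1: sample from Zipf(a=2)
    if k > 10:                                         # step 2: truncate, discard values above 10
        continue
    col = cat_cols[(k - 1) // 2]                       # step 3: map 1-2 -> c1, 3-4 -> c2, ..., 9-10 -> c5
    row = int(rng.integers(len(df)))
    j = df.columns.get_loc(col)
    if pd.notna(df.iat[row, j]):                       # step 4: blank out one cell that is not yet missing
        df.iat[row, j] = np.nan
        made_missing += 1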

How to arrange a loop in order to loop over columns and then do something

I'm a complete newbie to Python, and I'm currently trying to work on a problem that requires me to take the average of each column, except that the number of columns is unknown.
I figured out how to do it if I knew how many columns there were, doing each calculation separately. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np

# average of all data not including NAN
def average(dataset):
    return np.mean(dataset[np.isfinite(dataset)])

# this is how I did it by each column separate
dataset = np.genfromtxt("some file")
print(average(dataset[:, 0]))
print(average(dataset[:, 1]))

# what I'm trying to do with a loop
def avg(dataset):
    for column in dataset:
        lst = []
        column =  # i'm not sure how to define how many columns I have
        Avg = average(column)
    return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis indicates whether you are taking the average along columns or rows (axis=0 means you take the average of each column, which is what you are trying to do). The output will be a vector whose length is the same as the number of columns (or rows) along which you took the average, and each element is the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this.
You CAN do this using a for loop, but it's not a good idea: looping over matrices in numpy is slow, whereas vectorized operations like np.mean() are very fast. So, in general, when using numpy try to use those built-in operations instead of looping over everything, at least where possible.
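For instance, a minimal sketch along those lines (the file name is just the placeholder from the question; np.nanmean ignores the NaN values that genfromtxt produces for missing entries, though unlike the isfinite check it does not drop infinities):
import numpy as np

dataset = np.genfromtxt("some file")        # placeholder path from the question
col_means = np.nanmean(dataset, axis=0)     # mean of each column, ignoring NaN
print(col_means)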
Also, if you want the dimensions of your matrix: my_matrix.shape[1] is the number of columns and my_matrix.shape[0] is the number of rows.
