How to handle pct_change with negative values - python

I am calculating percentage change for a panel dataset that has both positive and negative values. If the values at dates n and n+1 are both negative and the value at n is greater than the value at n+1, for instance n = -2 and n+1 = -4, the calculated percentage change is ((n+1) - n) / n = ((-4) - (-2)) / -2 = 1. As you can see, the change is a downtrend and should be negative, but the result has the opposite sign. In other software I normally use the absolute value as the denominator, ((n+1) - n) / abs(n), to preserve the direction of the trend. I am wondering whether I can do the same with pandas pct_change, i.e. make the denominator an absolute value. Many thanks. I have solved the question based on the answer from Leo.
Here is a data example if one wants to play around.
import pandas as pd

data = {'id': [1, 1, 2, 2, 3, 3], 'values': [-2, -4, -2, 2, 1, 5]}
df = pd.DataFrame(data=data)
# diff() / abs().shift() keeps the sign of the change within each id group
df['pecdiff'] = (df.groupby('id')['values']
                 .apply(lambda x: x.diff() / x.abs().shift())
                 ).fillna(method='bfill')

If I understood correctly, the line for expected change should solve your problem. For comparison, I put side by side pandas' method and what you need.
The following code:
import pandas as pd
df = pd.DataFrame([-2,-4,-2,2,1], columns = ['Values'])
df['pct_change'] = df['Values'].pct_change()
# This should help you out:
df['expected_change'] = df['Values'].diff() / df['Values'].abs().shift()
df
Gives this output. Note that the signs are different for rows 1 through 3:
Values pct_change expected_change
0 -2 NaN NaN
1 -4 1.0 -1.0
2 -2 -0.5 0.5
3 2 -2.0 2.0
4 1 -0.5 -0.5

Related

Applying the hampel filter to a df in python

I am currently trying to apply the Hampel filter to my dataframe in Python. I have looked around and there isn't a lot of documentation for its implementation in Python. I found one post, but it looks like it was created before there was an actual hampel package/function: someone wrote a function to do a rolling-mean calculation rather than using the filter from the package itself, and even the site for the hampel package is minimal. I am looking at the number of Covid cases per day by FIPS code. I have a data frame of 470 time series (in days); each column is a different FIPS code and each row has the number of Covid cases per day (with dates, not the day number from start). The hampel package is very straightforward: it has two output options, it will either return a list of the indices where it thinks there are outliers, or it will replace the outliers with the median within the data.
the two codes for using the hampel are:
[IN]:
ts = pd.Series([1, 2, 1 , 1 , 1, 2, 13, 2, 1, 2, 15, 1, 2])
[IN]: # to return indices:
outlier_indices = hampel(ts, window_size=5, n=3)
print("Outlier Indices: ", outlier_indices)
[OUT]:
Outlier Indices: [6, 10]
[IN]: # to return series with rolling medians replaced*** I'm using this format
ts_imputation = hampel(ts, window_size=5, n=3, imputation=True)
ts_imputation
[OUT]:
0 1.0
1 2.0
2 1.0
3 1.0
4 1.0
5 2.0
6 2.0
7 2.0
8 1.0
9 2.0
10 2.0
11 1.0
12 2.0
dtype: float64
So with my data frame I want it to replace the outliers in each column with the column median; I am using window = 21 and threshold = 6 (because of the data setup). I should mention that each column starts with a different number of 0's in its rows. For example, the first 80 rows of column one may be 0's, while for column 2 the first 95 rows may be 0's, because each FIPS code has a different number of days. Given this, I tried to use the .apply method with the following function:
[IN]:
def hamp(col):
    no_out = hampel(col, window_size=21, n=6, imputation=True)
    return no_out
[IN]:
df = df.apply(hamp, axis=1)
However, when I print it, my data frame is now just all 0's. Can someone tell me what I am doing wrong?
Thank you!
Recently sktime added a HampelFilter:
from sktime.transformations.series.outlier_detection import HampelFilter
y = your_data
transformer = HampelFilter(window_length=10)
y_hat = transformer.fit_transform(y)
You can also read the documentation here: HampelFilter_sktime
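Not from the answer above, but one thing worth double-checking in the question's own code: df.apply(func, axis=1) passes each row (one day across all FIPS codes) to the function, whereas each FIPS time series lives in a column. Below is a minimal sketch using the same hampel call as the question with the default axis=0, so each column is filtered; the DataFrame, the FIPS codes and the injected outlier are made up purely for illustration.
import numpy as np
import pandas as pd
from hampel import hampel

# made-up example: 30 days of counts for two FIPS codes
rng = np.random.default_rng(0)
df = pd.DataFrame({'06001': rng.poisson(5, 30), '06075': rng.poisson(7, 30)})
df.iloc[10, 0] = 80  # inject an obvious outlier

def hamp(col):
    # same parameters as in the question
    return hampel(col, window_size=21, n=6, imputation=True)

# default axis=0 passes each column (one FIPS time series) to hamp;
# axis=1 would pass each row (one day across all codes) instead
df_clean = df.apply(hamp)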

Python avoid dividing by zero in pandas dataframe

Apologies that this has been asked before, but I cannot get those solutions to work for me (I am a native MATLAB user coming to Python).
I have a dataframe where I am taking the row-wise mean of the first 7 columns of one df and dividing it by the same row-wise mean of another. However, there are many zeros in this dataset and I want to replace the zero-division errors with zeros (as that's meaningful to me) instead of the NaN that is naturally returned (as I'm implementing it).
My code so far:
col_ind = list(range(0,7))
df.iloc[:,col_ind].mean(axis=1)/other.iloc[:,col_ind].mean(axis=1)
Here, if other = 0, it returns nan, but if df = 0 it returns 0. I have tried a lot of proposed solutions but none seem to register. For instance:
def foo(x, y):
    try:
        return x / y
    except ZeroDivisionError:
        return 0

foo(df.iloc[:, col_ind].mean(axis=1), other.iloc[:, col_ind].mean(axis=1))
However this returns the same values without using the defined foo. I'm suspecting this is because I am operating on series rather than single values, but I'm not sure nor how to fix it. There are also actual nans in these dataframes as well. Any help appreciated.
You can use np.where to conditionally do this as a vectorised calc.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.concatenate([np.random.randint(1, 10, (10, 7)), np.random.randint(0, 3, (10, 1))], axis=1),
    columns=[f"col_{i}" for i in range(7)] + ["div"],
)
np.where(df["div"].gt(0), df.loc[:, [c for c in df.columns if "col" in c]].mean(axis=1) / df["div"], 0)
It's not clear which version you're using and I don't know if the behavior is version-dependent, but in Python 3.8.5 / Pandas 1.2.4, a 0 / 0 in a dataframe/series will evaluate to NaN, while a non-zero / 0 will evaluate to inf. Neither will raise an error, so a try/except wouldn't have anything to catch.
>>> import pandas as pd
>>> import numpy as np
>>> x = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 0, 2]})
>>> x
a b
0 0 0
1 1 0
2 2 2
>>> x.a / x.b
0 NaN
1 inf
2 1.0
dtype: float64
You can replace NaN values in a pandas DataFrame or Series with the fillna() method, and you can replace inf using a standard replace():
>>> (x.a / x.b).replace(np.inf, np.nan)
0 NaN
1 NaN
2 1.0
dtype: float64
>>> (x.a / x.b).replace(np.inf, np.nan).fillna(0)
0 0.0
1 0.0
2 1.0
dtype: float64
(Note: A negative value divided by zero will evaluate to -inf, which would need to be replaced separately.)
You could replace nan after the calculation using df.fillna(0)
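Putting the two replacement steps together with the question's own expression, a minimal sketch (it assumes df, other and col_ind as defined in the question; note that fillna(0) will also turn any genuine NaNs in the data into 0):
import numpy as np

ratio = df.iloc[:, col_ind].mean(axis=1) / other.iloc[:, col_ind].mean(axis=1)
# non-zero / 0 gives +/-inf and 0 / 0 gives NaN; map both to 0
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)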

Fill missing data with random values from categorical column - Python

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows of missing values, but considering the number of missing values is not that small, now I want to use the Random Sampling Imputation to replace them proportionally with the existing categorical variables.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
Results:
The first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE:
Thanks to ti7 who helped me solve the problem:
new_agent = hotel['agent'].dropna()  # get a series of just the available values
n_null = hotel['agent'].isnull().sum()  # length of the missing entries
new_agent.sample(n_null, replace=True).values  # sample it with repetition and get values
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values  # fill and replace
.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.
get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign
quick code
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
df["agent"].isna().sum(), # get the same number of values as are missing
replace=True # repeat values
).values # throw out the index
demo
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
agent b
0 1.0 3
1 2.0 4
2 NaN 5
3 NaN 6
4 10.0 7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
agent b
0 1.0 3
1 2.0 4
2 10.0 5
3 2.0 6
4 10.0 7

Why does pandas.DataFrame.sum(axis=0) return the sum of values in each column when axis=0 represents rows?

In pandas, axis=0 represents rows and axis=1 represents columns.
Therefore, to get the sum of values in each row, df.sum(axis=0) should be called.
But it returns the sum of values in each column, and vice versa. Why???
import pandas as pd
df=pd.DataFrame({"x":[1,2,3,4,5],"y":[2,4,6,8,10]})
df.sum(axis=0)
Dataframe:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
Output:
x 15
y 30
Expected Output:
0 3
1 6
2 9
3 12
4 15
I think the right way to interpret the axis parameter is what axis you sum 'over' (or 'across'), rather than the 'direction' the sum is computed in. Specifying axis = 0 computes the sum over the rows, giving you a total for each column; axis = 1 computes the sum across the columns, giving you a total for each row.
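To make that concrete with the question's own DataFrame, a short sketch comparing both axes (axis=1 gives the per-row totals the question expected):
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
df.sum(axis=0)  # sum over the rows -> one total per column: x 15, y 30
df.sum(axis=1)  # sum across the columns -> one total per row: 3, 6, 9, 12, 15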
I was reading the source code of the pandas project, and I think this comes from NumPy: in that library the axis argument is used in the same way (0 sums vertically and 1 horizontally), and pandas uses NumPy under the hood to perform the sum.
In this link you can check that pandas uses the numpy.cumsum function to perform the sum.
And this link points to the NumPy documentation.
If you are looking for a way to remember how to use the axis parameter, the 'anant' answer is a good approach: interpret the sum as being over the axis rather than across it. So when 0 is specified you are computing the sum over the rows (iterating over the index, to be more pandas-doc compliant). When axis is 1 you are iterating over the columns.

Python: take log difference for each column in a dataframe

I have a list of dataframes and would like to take the log of every element in these dataframes and find the first difference. In time series econometrics, this procedure gives an approximate growth rate. The following code
for i in [0, 1, 2, 5]:
    df1_list[i] = 100 * np.log(df_list[i]).diff()
gives an error
__main__:7: RuntimeWarning: divide by zero encountered in log
__main__:7: RuntimeWarning: invalid value encountered in log
When I look at the result, many of the elements of the resulting dataframes are NaN. How can I fix the code? Thanks!!
The problem is not with your code, but with your data. You do not get an error but two warnings. The most likely reasons are the following kinds of values within your DataFrames:
Zeros
Negative numbers
Non-numerical values
The logarithm of any of these is just not defined, so you get NaN.
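As a quick check (not from the original answer, and assuming the dataframes are numeric), you could count the entries that would break the log before taking it, reusing df_list and the indices from the question:
for i in [0, 1, 2, 5]:
    bad = (df_list[i] <= 0) | df_list[i].isna()
    print(i, bad.sum())  # per-column count of zeros, negatives and missing values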
Some test data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df = df.mask(np.random.random(df.shape) < .1)
0 1 2 3 4
0 0.579643 0.614592 0.333945 0.241791 0.426162
1 0.576076 0.841264 0.235148 0.577707 0.278260
2 0.735097 0.594789 0.640693 0.913639 0.620021
3 0.015446 NaN 0.062203 0.253076 0.042025
4 0.401775 0.522634 0.521139 0.032310 NaN
Applying your code
for c in df:
    print(100 * np.log(df[c]).diff())
yields output like this (for c = 1):
0 NaN
1 31.394708
2 -34.670002
3 NaN
4 NaN
You can remove nans with .dropna()
for c in df:
    print(100 * np.log(df[c].dropna()).diff())
which yields (for c = 1)
0 NaN
1 31.394708
2 -34.670002
4 -12.932474
As you can see, we have "lost" one row as a consequence of .dropna() and your 0th row will always be nan as there is no difference to take.
If you are interested in replacing nans with other values, there are different techniques such as fillna or imputation.
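For example, a minimal sketch that keeps the original shape and fills the undefined growth rates with 0 instead of dropping rows (the choice of 0 is just an illustration, not part of the answer):
growth = 100 * np.log(df[c]).diff()
growth = growth.fillna(0)  # first row and any zero/negative/missing inputs become 0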
