I have the following python code:
consumos=df.iloc[:,0]
df['media_movel'] = consumos.rolling(window=30, center=True).median().bfill().ffill()
desv_padrao=df.stack().std()
threshold = 1000
difference = np.abs(consumos - df['media_movel'])
corr=np.abs(df['media_movel']-desv_padrao)
df['corr']=pd.DataFrame(corr)
outlier = difference > threshold
df.mask(outlier, df['corr'], axis=1)
So, I have a dataframe containing a time series, and my aim is to correct the outliers. A point counts as an outlier when the difference between the reference data and the rolling median is greater than 1000, which is the threshold.
For that, I've created the boolean variable outlier (which is True wherever there is an outlier as defined above), and I am trying to use mask to replace those outliers with (rolling median column - standard deviation). However, the result is the time series with NaNs. I don't know why those NaNs appear, but I need to obtain the corrected data.
I think the replacement of the masked values may be failing due to a shape mismatch. Try replacing your last line with this:
df.mask(outlier, df['corr'].values.reshape(-1, 1), axis=1)
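If that fails, apply the mask to the column itself, so that both the condition and the replacement are aligned on the row index, and assign the result back:
df.iloc[:, 0] = df.iloc[:, 0].mask(outlier, df['corr'])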
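A minimal sketch of the same correction done at the Series level, on synthetic data. One likely cause of the NaNs is that, with axis=1, the replacement Series is aligned against the column labels instead of the row labels; masking the column directly sidesteps that. The column names consumo and corrigido and the injected spike are assumptions made for illustration, and the standard deviation is taken over the single column rather than over df.stack().

```python
import numpy as np
import pandas as pd

# Synthetic series with one injected spike (illustrative data only).
df = pd.DataFrame({"consumo": np.full(100, 500.0)})
df.iloc[50, 0] = 5000.0

consumos = df.iloc[:, 0]
media_movel = consumos.rolling(window=30, center=True).median().bfill().ffill()
desv_padrao = consumos.std()  # assumption: std of the one column, not of df.stack()
threshold = 1000

outlier = (consumos - media_movel).abs() > threshold
corr = (media_movel - desv_padrao).abs()

# Series.mask replaces values where the condition is True; the condition and
# the replacement are both aligned on the row index, so no NaNs are introduced.
df["corrigido"] = consumos.mask(outlier, corr)
```

Note that mask returns a new object, so the result has to be assigned back to the dataframe.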
I have a huge dataframe with a lot of zero values, and I want to calculate the average of the numbers between the zero values. To make it simple: the data shows, for example, 10 consecutive values, then zeros, then values again. I just want to tell Python to calculate the average of each patch of data.
The pic shows an example
First of all, I'm a little confused about why you are using a DataFrame. This data would more naturally be stored in a pd.Series, and I would suggest storing numeric data in a NumPy array. Assuming that you have a pd.Series in front of you and you are trying to calculate the moving average between two consecutive points, there are two approaches you can follow:
zero-padding after the last value
assuming circularity and taking the average between the last and the first value
Here is the code for both:
import numpy as np
import pandas as pd
data_series = pd.Series([0,0,0.76231, 0.77669,0,0,0,0,0,0,0,0,0.66772, 1.37964, 2.11833, 2.29178, 0,0,0,0,0])
np_array = np.array(data_series)
# assuming zero padding: append a 0 after the last value
np_array_zero_pad = np.hstack((np_array, 0))
mvavrg_zeropad = [np.mean([np_array_zero_pad[i], np_array_zero_pad[i+1]]) for i in range(len(np_array_zero_pad)-1)]
# assuming circularity: wrap around by appending the first value after the last
np_array_circ = np.hstack((np_array, np_array[0]))
mvavrg_circ = [np.mean([np_array_circ[i], np_array_circ[i+1]]) for i in range(len(np_array_circ)-1)]
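If the goal is instead what the question describes, one mean per patch of non-zero values, a possible sketch is to label each run of consecutive non-zero values and group by that label. This is an alternative to the moving average above, not a restatement of it; the names nonzero, run_id and patch_means are illustrative.

```python
import pandas as pd

s = pd.Series([0, 0, 0.76231, 0.77669, 0, 0, 0, 0, 0, 0, 0, 0,
               0.66772, 1.37964, 2.11833, 2.29178, 0, 0, 0, 0, 0])

nonzero = s != 0
# A new run starts wherever the zero/non-zero state changes.
run_id = nonzero.astype(int).diff().ne(0).cumsum()
# Keep only the non-zero runs and average each one.
patch_means = s[nonzero].groupby(run_id[nonzero]).mean()
print(patch_means)
```

With this data there are two patches, so patch_means has two entries, one per run of non-zero values.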
Using np.interp(query, x, y) sometimes produces the same results as I calculate in Excel, but not always. Here is a case where np.interp() and Excel agree:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'x': [-9.210,-6.908,-4.605,-2.303,0.000,2.303],
'y': [-1.867,-1.867,-2.027,-3.667,-7.850,-21.112]}
)
val = -7.313
test1 = np.interp(val, df['x'], df['y'])
And print(test1) yields -1.867. This is exactly as I calculate in Excel and it looks right (our query value is between the yellow values):
However, test2 = np.interp(val, df['y'], df['x']) yields 2.303. In Excel, I calculate -0.2956, which looks right because our query value is between the yellow values.
Is there some kind of weird behavior in numpy where it gets confused going from negative to zero to positive when trying to interpolate? I have tried this with a much more finely discretized dataframe (50 rows instead of these 6), and the values are always in increasing order, and I get the same issue.
The values in the predictor column (the second argument of np.interp) must be in increasing order, and np.interp does not check this, so unsorted input silently gives meaningless results. (Note: -21 is less than -1.8 on the number line, just as -1 is less than 1, so column y as given is decreasing.) Use sort_values to sort the data frame in ascending order by column y, and then the output matches your Excel output.
df1=df.sort_values(by="y")
test3= np.interp(val, df1["y"], df1["x"])
print(test3)
-0.29565168539325837
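The same fix can be sketched with plain NumPy arrays: check whether the predictor is increasing, and sort both arrays by it before interpolating. The helper name interp_sorted is hypothetical, not part of NumPy.

```python
import numpy as np

x = np.array([-9.210, -6.908, -4.605, -2.303, 0.000, 2.303])
y = np.array([-1.867, -1.867, -2.027, -3.667, -7.850, -21.112])
val = -7.313

def interp_sorted(query, xp, fp):
    """Sort (xp, fp) by xp, then interpolate; np.interp needs increasing xp."""
    order = np.argsort(xp, kind="stable")
    return np.interp(query, xp[order], fp[order])

# y runs from -1.867 down to -21.112, so it is not increasing as given.
y_is_increasing = bool(np.all(np.diff(y) > 0))
x_from_y = interp_sorted(val, y, x)
print(y_is_increasing, x_from_y)
```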
I have a dataframe with time series.
I'd like to compute the rolling correlation (periods=20) between columns.
store_corr=[] #empty list to store the rolling correlation of each pairs
names=[] #empty list to store the column name
df=df.pct_change(periods=1).dropna(axis=0) #Prepare dataframe of time series
for i in range(0,len(df.columns)):
    for j in range(i,len(df.columns)):
        corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
        names.append('col '+str(i)+' -col '+str(j))
        store_corr.append(corr)
df_corr=pd.DataFrame(np.transpose(np.array(store_corr)),columns=names)
This solution works and gives me the rolling correlation. It was written with the help of Austin Mackillop (see comments).
Is there another faster way? (I.e. I want to avoid this double for loop.)
This line:
corr=df.rolling(20).corr(df[df.columns[i]],df[df.columns[j]])
will produce an error because the second argument of corr (pairwise) expects a bool, but you passed a pandas object, which has an ambiguous truth value. You can view the docs here.
Does applying the rolling method to the first DataFrame in the second line of code that you provided achieve what you are trying to do?
corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
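As for avoiding the double loop: calling corr on the rolling DataFrame without a second object computes all pairwise rolling correlations in one call. A minimal sketch on random data (the frame prices and its column names are made up for illustration); the result has a MultiIndex of (row label, column name) on the rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.DataFrame(rng.uniform(1, 2, (60, 3)), columns=["a", "b", "c"])
returns = prices.pct_change(periods=1).dropna(axis=0)

# All column pairs at once; rows are indexed by (row label, column name).
pairwise = returns.rolling(20).corr(pairwise=True)

# Pull out one pair and compare it with the per-pair call used in the loop.
a_vs_b = pairwise.xs("a", level=1)["b"]
loop_version = returns["a"].rolling(20).corr(returns["b"])
print(np.allclose(a_vs_b.values, loop_version.values, equal_nan=True))
```

The pairwise result also contains the symmetric duplicates and the diagonal, so whether it is actually faster than the loop depends on the number of columns.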
I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print(df)
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869
[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
      a   b
0    99  99
..   ..  ..
499  58  84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
                                                df.T.unstack().expanding().quantile(integer/100.0)
                                                for integer in range(0,101,1)})
percentile_mask = pd.Series(index = df.unstack().unstack().unstack().index)
for integer in range(0,100,1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = df.expanding().apply(lambda x: stats.percentileofscore(x, x[-1], kind='weak'), raw=True)
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The complication is that our data set is also expanding. Thankfully, some sorting algorithms are extremely fast at dealing with mostly sorted data (that is, inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, at each row iteration, append the row's values to it, re-sort, and query the positions of those values in the newly expanded sorted set. The query is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """ Reconstruct a DataFrame of expanding quantiles by row """
    values = df.to_numpy(dtype=float)
    # Skeleton array that we'll fill with quantile values (NaN where the input is NaN)
    quantiles = np.full(values.shape, np.nan)
    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(values))
    sorted_array = np.empty(num_valid)
    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0
    # Iterates over ndarray rows
    for i, row_array in enumerate(values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]
        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")
        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(float) / length
        # Insert values into the result array (plain array indexing avoids chained assignment)
        quantiles[i, ~row_is_nan] = quantile_row
    return pd.DataFrame(quantiles, index=df.index, columns=df.columns)
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has a side option, 'left' or 'right', which determines whether the prospective insertion position is the first or the last suitable position. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(float) / (length * 2)
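To see the effect of side on duplicated values, here is a small worked example with a made-up sorted array; the variable names are illustrative only.

```python
import numpy as np

sorted_values = np.array([1.0, 2.0, 2.0, 2.0, 3.0])
queries = np.array([2.0, 3.0])

# First and last positions at which each query could be inserted.
left = np.searchsorted(sorted_values, queries, side="left")
right = np.searchsorted(sorted_values, queries, side="right")

# Averaging the two ranks places a duplicated value in the middle of its run.
mid_quantile = (left + right) / (2 * len(sorted_values))
print(left, right, mid_quantile)
```

For the value 2.0, 'left' alone would give a quantile of 0.2 and 'right' alone 0.8; the average, 0.5, sits in the middle of the run of duplicates.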
It's still not quite clear to me, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
Ditto for b.
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing the values into a 1D array seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)
for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) *2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )
I need to create a third set of columns based on two other sets of columns, subject to an additional IF-THEN condition. The division below works, but I am not sure how to restrict it to only those rows where both list_A and list_B are greater than zero. In other words, I want to perform the computation only if the numbers in both sets A and B are greater than zero, because that is the condition that makes the division meaningful. If either A or B is zero, I would like C to be zero. I tried the NumPy where approach, but I am getting an error. Thank you. Here is an example:
dates = pd.date_range('1/1/2000', periods=100, freq='M')
dft = pd.DataFrame(np.random.randn(100, 4),index=dates, columns=['var_A1',
'var_A2', 'var_B1', 'var_B2'])
list_A=['var_A1', 'var_A2']
list_B=['var_B1', 'var_B2']
list_C=['var_C1', 'var_C2']
C=pd.DataFrame(data=dft[list_A].values/dft[list_B].values,columns=list_C,
index=dft.index)
dft=pd.concat([dft,C],axis=1)
c = pd.DataFrame(data=np.where(((dft[list_A].values>0) & (dft[list_B].values>0)), dft[list_A].values/dft[list_B].values, 0),columns=list_C,index=dft.index)
dft=pd.concat([dft,c],axis=1)
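np.where evaluates the division for every element before selecting, so zeros in the denominator can still trigger divide-by-zero warnings. A possible variant, sketched here on a tiny made-up frame, is np.divide with its where and out arguments, which only divides where the condition holds and leaves 0 elsewhere. The frames a and b stand in for dft[list_A] and dft[list_B].

```python
import numpy as np
import pandas as pd

# Small illustrative stand-ins for dft[list_A] and dft[list_B].
a = pd.DataFrame({"var_A1": [1.0, 2.0, -1.0], "var_A2": [0.0, 4.0, 3.0]})
b = pd.DataFrame({"var_B1": [2.0, 0.0, 5.0], "var_B2": [1.0, 8.0, 6.0]})

num = a.to_numpy()
den = b.to_numpy()
both_positive = (num > 0) & (den > 0)

# Divide only where both operands are positive; other cells keep the 0 from `out`.
ratio = np.divide(num, den, out=np.zeros_like(num), where=both_positive)
c = pd.DataFrame(ratio, columns=["var_C1", "var_C2"], index=a.index)
print(c)
```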