Filtering outliers from DataFrame - python

I have a big problem filtering my data. I've read a lot here on Stack Overflow and on other pages and tutorials, but I could not solve my specific problem...
The first part of my code, where I load my data into Python, looks as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from arch import arch_model
spotmarket = pd.read_excel("./data/external/Spotmarket_dhp.xlsx", index=True)
r = spotmarket['Price'].pct_change().dropna()
returns = 100 * r
df = pd.DataFrame(returns)
The Excel table has about 43,000 values in one column, containing the hourly prices. I use this data to calculate the percentage change from hour to hour, and the problem is that there are sometimes huge changes of 1,000 to 40,000%. The dataframe looks as follows:
df
Out[12]:
Price
1 20.608229
2 -2.046870
3 6.147789
4 16.519258
...
43827 -16.079874
43828 -0.438322
43829 -40.314465
43830 -100.105374
43831 700.000000
43832 -62.500000
43833 -40400.000000
43834 1.240695
43835 52.124183
43836 12.996778
43837 -17.157795
43838 -30.349971
43839 6.177924
43840 45.073701
43841 76.470588
43842 2.363636
43843 -2.161042
43844 -6.444781
43845 -14.877102
43846 6.762918
43847 -38.790036
[43847 rows x 1 columns]
I want to exclude these outliers. I've tried different approaches, like calculating the mean and the std and excluding all values that are more than three times the std away from the mean. It works for a small part of the data, but for the complete data the mean and std are both NaN. Does anyone have an idea how I can filter my dataframe?

I think you need to filter by percentiles using quantile:
r = spotmarket['Price'].pct_change() * 100
# interquartile-range (IQR) fences: 1.5 * IQR below Q1 and above Q3
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1 - 1.5 * (Q3 - Q1)
q3 = Q3 + 1.5 * (Q3 - Q1)
# keep only the rows whose return lies within the fences
df = spotmarket[r.between(q1, q3)]

Maybe you should first discard all the values that are causing those fluctuations and then create the dataframe. One way is to use a filter on the values, for example as sketched below.
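One possible reading of that suggestion (a sketch, not the poster's code; the 0.01 cut-off is only illustrative): mask near-zero prices before computing pct_change, since dividing by prices close to zero is what produces the huge percentage changes.
import pandas as pd

spotmarket = pd.read_excel("./data/external/Spotmarket_dhp.xlsx")

# hypothetical cut-off: treat near-zero prices as bad readings and mask them out
clean_prices = spotmarket['Price'].where(spotmarket['Price'].abs() > 0.01)

# percentage change computed from the remaining prices, as in the question
returns = 100 * clean_prices.pct_change().dropna()
df = pd.DataFrame(returns)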

Related

Python Pandas: How to get the maximum value per peak in multiple cycles

I am importing data from a machine that has thousands of cycles on it. Each cycle lasts a few minutes and has two peaks in pressure that I need to record. One example can be seen in the graph below.
In this cycle you can see there are two peaks, one at 807 psi and one at 936 psi. I need to record these values. I have sorted the data so I can determine when a cycle is on or off already, but now I need to figure out how to record these two maxima. I previously tried this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index'])
to get the maxima, but realized this will only give me the two largest values, which in some cycles happen right before the peak.
In this example dataframe I have provided one cycle:
import pandas as pd
data = {'Pressure' : [100,112,114,120,123,420,123,1230,1320,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333]}
df = pd.DataFrame(data)
The peak values for this should be 1320 and 2303, while ignoring the slow increase towards these peaks.
Thanks for any help!
(This is also for a ton of cycles, so I need it to be able to go through and record the peaks for each cycle.)
Alright, I had a go, using the simple heuristic I suggested in my comment.
def filter_peaks(df):
    # compare each pressure reading with its immediate neighbours
    df["before"] = df["Pressure"].shift(1)
    df["after"] = df["Pressure"].shift(-1)
    df["max"] = df.max(axis=1)
    df = df.fillna(0)
    # keep only the local maxima: rows where Pressure is not below either neighbour
    return df[df["Pressure"] == df["max"]]["Pressure"].to_frame()
filter_peaks(df) # test one application
If you apply this once to your test dataframe, you get the following result:
You can see that it only just works: the value at line 21 only needed to be a little higher for it to exceed the true second peak at line 8.
You can get around this by iterating, i.e. with filter_peaks(filter_peaks(df)). You then end up with a clean dataframe that you can apply your .nlargest strategy to.
EDIT
Complete code example:
import pandas as pd
data = {'Pressure' : [100,112,114,120,123,420,123,1230,1320,1,23,13,13,13,123,13,123,3,222,2303,1233,1233,1,1,30,20,40,401,10,40,12,122,1,12,333]}
df = pd.DataFrame(data)
def filter_peaks(df):
    df["before"] = df["Pressure"].shift(1)
    df["after"] = df["Pressure"].shift(-1)
    df["max"] = df.max(axis=1)
    df = df.fillna(0)
    return df[df["Pressure"] == df["max"]]["Pressure"].to_frame()
df2 = filter_peaks(df) # or do it twice if you want to be sure: filter_peaks(filter_peaks(df))
df2["Pressure"].nlargest(2)
Output:
19 2303
8 1320
Name: Pressure, dtype: int64
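Since this has to run over many cycles, here is a sketch of how the same function could be applied per cycle, assuming the real data has the 'group' column from the question's own groupby attempt (the example dataframe above does not):
def top_two_peaks(cycle):
    # filter to local maxima twice, then take the two largest surviving values
    return filter_peaks(filter_peaks(cycle[["Pressure"]].copy()))["Pressure"].nlargest(2)

peaks_per_cycle = df.groupby("group").apply(top_two_peaks)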

Import and Print Json objects in Dataframe in Pandas

I have a json file which looks like the one shown in the picture.
How can I import and print all the Quantity and Rate values in Pandas?
How can I print the sum of all the Quantity values for Buy and Sell separately?
How can I print the sum of all the Quantity values that are greater than x, e.g. SUM(Qty > 5)?
In raw format, the data looks like this:
{"success":true,"message":"","result":{"buy":[{"Quantity":199538.30948659,"Rate":0.00000970},{"Quantity":62142.31715449,"Rate":0.00000968},{"Quantity":233476.03486058,"Rate":0.00000967},{"Quantity":75613.30879931,"Rate":0.00000966},{"Quantity":3109.14961399,"Rate":0.00000965},{"Quantity":66.22406639,"Rate":0.00000964},{"Quantity":401.06420081,"Rate":0.00000963},{"Quantity":186.93339628,"Rate":0.00000961},{"Quantity":122731.01165366,"Rate":0.00000960},{"Quantity":7718.27750144,"Rate":0.00000959},{"Quantity":802.00000000,"Rate":0.00000958},{"Quantity":2050.72163419,"Rate":0.00000956},{"Quantity":1000.00000000,"Rate":0.00000955}
import pandas as pd
#change 'buy' for other results
data = pd.DataFrame(pd.read_json('file.json')['result']['buy'])
#for filtering
print(data.query('Quantity > 5').query('Rate > 0.00000966').sum())
You can use the pandas.read_json() command to do this. Just pass it your JSON file and pandas will create a dataframe out of it for you.
Here's the link to the documentation, where you can find extra parameters like orient='records' that tell pandas what to use as the dataframe columns and what to use as row data.
Here's the link: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Once the data is in a dataframe, you can run various commands to calculate the sums of Quantity for buy and sell. Having the data in a dataframe makes life a bit easier when running math calculations, in my opinion.
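A minimal sketch of those sums, following the same read_json pattern as the answer above and assuming the JSON is saved as file.json with both a buy and a sell list under result:
import pandas as pd

# one dataframe per side of the book
buy = pd.DataFrame(pd.read_json('file.json')['result']['buy'])
sell = pd.DataFrame(pd.read_json('file.json')['result']['sell'])

# total Quantity for buy and sell separately
print(buy['Quantity'].sum())
print(sell['Quantity'].sum())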
Use json_normalize and pass the record path:
import json
import pandas as pd
with open('data.json') as f:
    data = json.load(f)
buy_df = pd.io.json.json_normalize(data['result'],'buy')
# similarly for sell data, if you have a separate entity named `sell`
sell_df = pd.io.json.json_normalize(data['result'],'sell')
Output:
Quantity Rate
0 199538.309487 0.00001
1 62142.317154 0.00001
2 233476.034861 0.00001
3 75613.308799 0.00001
4 3109.149614 0.00001
For the sum you can do
buy_df['Quantity'].sum()
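And for the conditional sum from the question (using 5 as the example threshold), one option is boolean indexing:
# sum only the quantities greater than 5, i.e. SUM(Qty > 5)
buy_df.loc[buy_df['Quantity'] > 5, 'Quantity'].sum()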
From here, for selecting and indexing the data, refer to Indexing and Selecting Data - Pandas.

Lag values and differences in pandas dataframe with missing quarterly data

Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year; MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
See this link for the full underlying csv data and this link for the gapped data. There are two firms, 1020180 and 1020201.
However, when I try Pandas' shift method, it fails when there are gaps; try it yourself using the csv files and the code below. All computed columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
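The same merge trick also gives the quarterly change in market value (a sketch following the answer above, not part of it; lagQ1 and MV_prevQ are just illustrative names):
import pandas as pd
from pandas.tseries.offsets import QuarterEnd

# previous quarter-end as the merge key
dfg['lagQ1'] = dfg['date'] + QuarterEnd(-1)

prev = dfg[['Firm', 'date', 'MV']].rename(columns={'date': 'lagQ1', 'MV': 'MV_prevQ'})
dfg = pd.merge(dfg, prev, on=['Firm', 'lagQ1'], how='left')

# quarterly change in MV; stays NaN wherever the previous quarter is missing
dfg['D_MV_Q'] = dfg['MV'] - dfg['MV_prevQ']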

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
TIMESTAMP TYPE
0 2014-07-25 11:50:30.640 2
1 2014-07-25 11:50:46.160 3
2 2014-07-25 11:50:57.370 2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, then to plot it as a bar chart with months on the x-axis (April 2014, May 2014, etc.). I managed to calculate these values using the code below
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max()+1):
    for j in range(1, 13):
        print dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D', how='count').TIMESTAMP.mean()
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is OK as it is, and with some more work I can map the results to the correct month names and then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby, first to group by day and count the instances, and next to group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import numpy as np
import pandas as pd
# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.TimeGrouper and plot the monthly average counts:
import seaborn as sns # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.TimeGrouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
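Note that pd.TimeGrouper has since been removed from pandas; a sketch of the same two-step idea with resample on current pandas (using the fake data above) would be:
# daily record counts, then the average daily count per month
daily = data.set_index('TIMESTAMP').resample('D')['TYPE'].count()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')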

Python Pandas Dataframe: portfolio backtesting investing cashflows on different dates

Let's suppose we have the following DataFrame of returns:
import numpy as np
import pandas as pd
import pandas.io.data as web
data = web.DataReader(['AAPL','GOOG'],data_source='google')
returns = data['Close'].pct_change()
Now let's say I want to backtest an investment in the two assets, and let's also suppose that the cashflows are not invested at the same time:
positions = {}
positions['AAPL'] = {returns.index[10]: 20000.0}
positions['GOOG'] = {returns.index[20]: 80000.0}
wealth = pd.DataFrame.from_dict(positions).reindex(returns.index).fillna(0.0)
My question is: is there a pythonic way to let the 20k dollars of positive cashflow on Apple and the 80k dollars on Google grow, based on their respective daily returns?
At the moment I'm doing this by iterating over each position (column) and then over each row:
wealth.ix[i] = wealth.ix[i-1] * (1 + returns[i])
but I know that with Python and Pandas this kind of iteration can be often avoided.
Thanks for the time you will spend on this.
link to iPython Notebook
Simone
First you need to change your positions to forward fill, since you keep the investment.
pos = pd.DataFrame.from_dict(positions).reindex(returns.index).fillna(method="ffill")
Then you need cumprod
wealth = pos.shift() * (1+returns).cumprod(axis=0)
The shift is necessary since you do not get the return on the first day.
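A variant sketch (not the answer above, and only if you want wealth to equal exactly the invested amount on each entry date): re-base the cumulative growth factor at each asset's first investment day.
# forward-fill the positions, as above, then re-base the growth per column
pos = pd.DataFrame.from_dict(positions).reindex(returns.index).fillna(method="ffill")
growth = (1 + returns.fillna(0)).cumprod()

# growth factor on each asset's entry date (first non-missing position)
base = growth.where(pos.notna()).bfill().iloc[0]

wealth = pos * growth / base   # equals the cashflow on the entry date, then compounds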
