Get every season's data from several years of data - python

I'm using Python to calculate seasonal trends of SIC, so I need to cut every season out of the monthly data from 1979 to 2009.
print sic.shape
(372, 180, 360)
sics=sic[90,:,:]
sicm=[]
for i in range(0, 12):
    sicj = sic[i::12,:,:]
    sicm.append(sicj)
    del sicj
sics[0::3,:,:]=sicm[11][:30,:,:]
sics[1::3,:,:]=sicm[0][1:,:,:]
sics[2::3,:,:]=sicm[1][1:,:,:]
Then the result showed this error:
IndexError                                Traceback (most recent call last)
in ()
----> 1 sics[0::3,:,:]=sicm[11][:30,:,:]

/home/charcoalp/anaconda2/envs/pyn_test/lib/python2.7/site-packages/numpy/ma/core.pyc in __setitem__(self, indx, value)
   3299             _mask = self._mask
   3300             # Set the data, then the mask
-> 3301             _data[indx] = dval
   3302             _mask[indx] = mval
   3303         elif hasattr(indx, 'dtype') and (indx.dtype == MaskType):

IndexError: too many indices for array
My approach is to cut out every Jan, Feb, Mar, ... and build a new array that combines 3 months into the same season's data. Can this problem be solved, or is my approach just wrong?
Thanks a lot if you can help me.
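The immediate error comes from `sics = sic[90,:,:]`: indexing with a plain `90` selects a single time slice, so `sics` is 2-D (shape `(180, 360)`) and the later three-axis assignments fail. `sic[:90,:,:].copy()` would keep all three axes. An alternative sketch of the same DJF idea, assuming (as the question's slicing implies) that the 372 months run January 1979 through December 2009 and that December of year y is grouped with January/February of year y+1 (the array below is stand-in data):

```python
import numpy as np

# Stand-in for the real SIC data: 372 months (Jan 1979 - Dec 2009)
sic = np.arange(372 * 180 * 360, dtype=np.float32).reshape(372, 180, 360)

# View the time axis as (year, month): 31 years x 12 months
by_month = sic.reshape(31, 12, 180, 360)

# DJF: Dec of year y goes with Jan/Feb of year y+1 -> 30 complete winters
dec = by_month[:-1, 11]   # Dec 1979 .. Dec 2008
jan = by_month[1:, 0]     # Jan 1980 .. Jan 2009
feb = by_month[1:, 1]     # Feb 1980 .. Feb 2009

# Interleave as Dec, Jan, Feb, Dec, Jan, Feb, ... -> 90 months total
winter = np.stack([dec, jan, feb], axis=1).reshape(90, 180, 360)
```

The same reshape trick gives any other season by picking different month indices, with no per-month loop needed.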

Related

Print the total population for a specific category within the Region column

I have a Population column with numbers and a Region column with locations. I'm only using pandas. How would I go about finding the total population of a specific location (Wellington) within the Region column?
Place = [data['Region'] == 'Wellington']
Place[data['Population']]
An error came up:
TypeError Traceback (most recent call last)
Input In [70], in <cell line: 4>()
1 #Q1.e
3 Place = [data['Region']=='Wellington']
----> 4 Place[data['Population']]
TypeError: list indices must be integers or slices, not Series
Try this:
data_groups = data.groupby("Region")['Population'].sum()
Output:
data_groups
Region
Northland 4750
Wellington 7580
WestCoast 1550
If you want to call some specific region, you can do:
data_groups.loc['WestCoast'] # 1550
Use DataFrame.loc with sum:
Place = data.loc[data['Region'] == 'Wellington', 'Population'].sum()
print (Place)
7190
Another idea is to convert Region to the index, select with Series.loc, and then sum:
Place = data.set_index('Region')['Population'].loc['Wellington'].sum()
print (Place)
7190

Decomposing Time Series using STL gives error

My code
stl_fcast = forecast(nottem_stl, steps=12, fc_func=seasonal_naive, seasonal = True)
Error Msg
ValueError Traceback (most recent call last)
<ipython-input-95-39c1ef0e911d> in <module>
1 stl_fcast = forecast(nottem_stl, steps=12, fc_func=seasonal_naive,
----> 2 seasonal = True)
3
4 stl_fcast.head()
~/opt/anaconda3/lib/python3.7/site-packages/stldecompose/stl.py in forecast(stl, fc_func, steps, seasonal, **fc_func_kwargs)
102
103 # forecast index starts one unit beyond observed series
--> 104 ix_start = stl.observed.index[-1] + pd.Timedelta(1, stl.observed.index.freqstr)
105 forecast_idx = pd.DatetimeIndex(freq=stl.observed.index.freqstr,
106 start=ix_start,
pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.Timedelta.__new__()
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
This code used to work in older versions of pandas (0.25).
Appreciate any help, thanks.
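The failure is inside stldecompose itself, which builds the forecast start as `index[-1] + pd.Timedelta(1, freqstr)`; newer pandas rejects calendar units like 'M' because a "month" is not a fixed duration. Until the library is updated (pinning pandas to 0.25 is the other option), one workaround is to compute that timestamp with a DateOffset, which does understand calendar frequencies. A minimal sketch of the idea, using a made-up monthly index standing in for `stl.observed.index`:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Made-up monthly index standing in for stl.observed.index
idx = pd.date_range('2015-01-01', periods=36, freq='MS')

# pd.Timedelta(1, 'M') now raises ValueError; a DateOffset steps one
# calendar period instead, so it works for monthly (and annual) data
ix_start = idx[-1] + to_offset(idx.freqstr)
```

Applying this means patching (or monkey-patching) the offending line in stldecompose's stl.py to use `to_offset(...)` in place of `pd.Timedelta(...)`.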

BitCoin Algo not iterating through historical data correctly

I'm creating a simple trading backtester for Bitcoin, but I'm having trouble with the for loops in my code. The current code is based on 2 simple moving averages, q and z (currently for learning purposes, no real strategy). info is a dataframe holding Bitcoin historical data from a CSV file. There seems to be an out-of-bounds error and I can't figure it out. Any help would be appreciated.
import pandas as pd
import numpy as np
cash = 10000
file = 'BTC-USD.csv'
data = pd.read_csv(file)
y = data['Adj Close'][1000:]
x = data['Date'][1000:]
v = data['Volume'][1000:]
h = data['High'][1000:]
l = data['Low'][1000:]
def movAvg(values, time):
    times = np.repeat(1.0, time) / time
    sma = np.convolve(values, times, 'valid')
    return sma
z = movAvg(y,12)
q = movAvg(y,9)
SP = len(x[50-1:])
def AlgoCal(account, info):
    #i = 1050
    bought = False
    test = []
    for x in info.index:
        if q[x] < z[x]:
            if bought == False:
                temp = info[x]
                account = account - info[x]
                test.append(account)
                bought = True
        elif q[x] > z[x]:
            if bought == True:
                temp = info[x]
                account = account + info[x]
                test.append(account)
                bought = False
        else:
            print("Error")
    return(test)
money = AlgoCal(cash,y)
print(money)
Sample data from the Yahoo Bitcoin CSV:
Date,Open,High,Low,Close,Adj Close,Volume
2014-09-17,465.864014,468.174011,452.421997,457.334015,457.334015,21056800
2014-09-18,456.859985,456.859985,413.104004,424.440002,424.440002,34483200
........
........
2020-05-21,9522.740234,9555.242188,8869.930664,9081.761719,9081.761719,39326160532
2020-05-22,9080.334961,9232.936523,9008.638672,9182.577148,9182.577148,29810773699
2020-05-23,9185.062500,9302.501953,9118.108398,9209.287109,9209.287109,27727866812
2020-05-24,9196.930664,9268.914063,9165.896484,9268.914063,9268.914063,27658280960
Error:
Traceback (most recent call last):
File "main.py", line 47, in <module>
money = AlgoCal(cash,y)
File "main.py", line 31, in AlgoCal
if q[x]<z[x]:
IndexError: index 1066 is out of bounds for axis 0 with size 1066
Your moving averages have two different lengths: one is 12 periods and the other is 9 periods, and `np.convolve(..., 'valid')` shortens each result accordingly. When you try to compare them in AlgoCal, the shorter one runs out and gives you the out-of-bounds error (on top of that, you index q and z with the original DataFrame labels starting at 1000, not with positions starting at 0).
If you are going to compare moving averages this way, you need to add a minimum period at the beginning and only start once both averages are available.
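One way to get that alignment is to compute both averages with pandas `rolling` means instead of `np.convolve`, so they stay on the frame's own index and the warm-up period is NaN rather than missing. A sketch with made-up prices (the 'Adj Close' column name follows the question's CSV):

```python
import pandas as pd

# Made-up price series standing in for data['Adj Close']
df = pd.DataFrame({'Adj Close': [float(p) for p in range(100, 200)]})

# rolling() keeps both averages aligned with the frame's index,
# padding the warm-up period with NaN instead of shortening the array
df['z'] = df['Adj Close'].rolling(12).mean()   # slow average
df['q'] = df['Adj Close'].rolling(9).mean()    # fast average

# Iterate only over rows where both averages exist
valid = df.dropna()
```

Looping over `valid.index` and comparing `df.loc[x, 'q']` with `df.loc[x, 'z']` then never indexes past the end of either average.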

Key error when using .pivot in python pandas

I have looked at lots of pivot table related questions and found none that addressed this specific problem. I have a data frame like this:
Drug Timepoint Tumor Volume (mm3)
Capomulin 0 45.000000
5 44.266086
10 43.084291
15 42.064317
20 40.716325
... ... ...
Zoniferol 25 55.432935
30 57.713531
35 60.089372
40 62.916692
45 65.960888
I am trying to pivot the data so that the name of the drug becomes the column headings, timepoint becomes the new index, and the tumor volume is the value. Everything I have looked up online tells me to use:
mean_tumor_volume_gp.pivot(index = "Timepoint",
columns = "Drug",
values = "Tumor Volume (mm3)")
However, when I run this cell, I get the error message:
KeyError Traceback (most recent call last)
<ipython-input-15-788b92ba981e> in <module>
2 mean_tumor_volume_gp.pivot(index = "Timepoint",
3 columns = "Drug",
----> 4 values = "Tumor Volume (mm3)")
5
KeyError: 'Timepoint'
How is this a key error? The key "Timepoint" is a column in the original DF.
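A likely cause, given the variable name: `mean_tumor_volume_gp` looks like the output of a groupby, in which case `Drug` and `Timepoint` live in the index rather than in columns, so `pivot` cannot find the `Timepoint` key. A sketch with made-up numbers reproducing that situation and two ways out:

```python
import pandas as pd

# Hypothetical groupby result: 'Drug' and 'Timepoint' end up in the
# index, not the columns, which is what makes pivot raise KeyError
df = pd.DataFrame({
    'Drug': ['Capomulin', 'Capomulin', 'Zoniferol', 'Zoniferol'],
    'Timepoint': [0, 5, 0, 5],
    'Tumor Volume (mm3)': [45.0, 44.27, 55.43, 57.71],
})
gp = df.groupby(['Drug', 'Timepoint'])['Tumor Volume (mm3)'].mean()

# Option 1: move the index levels back into columns, then pivot
wide = gp.reset_index().pivot(index='Timepoint', columns='Drug',
                              values='Tumor Volume (mm3)')

# Option 2: the levels are already in the index, so just unstack
wide2 = gp.unstack('Drug')
```

Both give a frame with Timepoint as the index and one column per drug.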

Pandas dataframe - remove outliers [duplicate]

This question already has answers here:
Detect and exclude outliers in a pandas DataFrame
(19 answers)
Closed 1 year ago.
Given a pandas dataframe, I want to exclude rows corresponding to outliers (Z-value = 3) based on one of the columns.
The dataframe looks like this:
df.dtypes
_id object
_index object
_score object
_source.address object
_source.district object
_source.price float64
_source.roomCount float64
_source.size float64
_type object
sort object
priceSquareMeter float64
dtype: object
For the line:
dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
The following exception is raised:
-------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-68-02fb15620e33> in <module>()
----> 1 dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]
/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof)
2239 """
2240 a = np.asanyarray(a)
-> 2241 mns = a.mean(axis=axis)
2242 sstd = a.std(axis=axis, ddof=ddof)
2243 if axis and mns.ndim < a.ndim:
/opt/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
68 is_float16_result = True
69
---> 70 ret = umr_sum(arr, axis, dtype, out, keepdims)
71 if isinstance(ret, mu.ndarray):
72 ret = um.true_divide(
TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'
And the return value of
np.isreal(df['_source.price']).all()
is
True
Why do I get the above exception, and how can I exclude the outliers?
If one wants to use the interquartile range (IQR) of a given dataset:
def Remove_Outlier_Indices(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    trueList = ~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))
    return trueList
Based on the above eliminator function, the subset of the dataset that falls within the statistical bounds (the non-outliers) can be obtained:
# Arbitrary Dataset for the Example
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Index List of Non-Outliers
nonOutlierList = Remove_Outlier_Indices(df)
# Non-Outlier Subset of the Given Dataset
dfSubset = df[nonOutlierList]
Use this boolean whenever you have this sort of issue:
df=pd.DataFrame({'Data':np.random.normal(size=200)}) #example
df[np.abs(df.Data-df.Data.mean())<=(3*df.Data.std())] #keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] #or the other way around
I believe you could create a boolean filter marking the outliers and then select the opposite of it. Note that stats.zscore returns a NumPy array here, so compare it directly rather than calling .apply, and use > 3 (an outlier's |z| exceeds the threshold; it is rarely exactly 3):
outliers = np.abs(stats.zscore(df['_source.price'])) > 3
df_without_outliers = df[~outliers]
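As for the original TypeError: `stats.zscore(df)` was applied to the whole frame, and taking a mean over the object-dtype columns (`_id`, `_index`, `sort`, ...) is what blows up inside NumPy. Restricting the z-score to the numeric column of interest avoids it. A sketch with made-up data (21 rows, one obvious price outlier):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up frame with mixed dtypes like the question's, plus one outlier
df = pd.DataFrame({'_id': [str(i) for i in range(21)],
                   '_source.price': [10.0] * 20 + [300.0]})

# z-score only the numeric column; zscore returns a NumPy array, so the
# comparison yields a boolean mask usable directly for row selection
z = np.abs(stats.zscore(df['_source.price']))
dff = df[z < 3]   # drops the 300.0 row
```

For several numeric columns, `df.select_dtypes(include=np.number)` gives a frame that zscore can handle column-wise.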
