How to grab data based on two parameters using a function - python

I am attempting to create a few plots of some wind data, but I am having trouble selecting specific data using two parameters: the hour of day and the month. I am attempting to use a function to grab the specific data, but instead I get this error:
Traceback (most recent call last):
File "/Users/Cpower18/Documents/Tryong_again.py", line 47, in <module>
plt.plot(hr, hdh(hr, mn2))
File "/Users/Cpower18/Documents/Tryong_again.py", line 37, in hdh
for n, k in hr, mn2:
ValueError: too many values to unpack (expected 2)
I am currently using dataframes to sort the data by date and a function to grab the specific data. I have managed to do this with one variable (the hour of the day), but not with two.
import csv
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
warnings.simplefilter(action='ignore', category=FutureWarning)
data = pd.read_csv('merged_1.csv')
df = pd.DataFrame(data)
df['Wind Spd (km/h)'] = pd.to_numeric(df['Wind Spd (km/h)'], errors ='coerce')
df['Date/Time'] = pd.to_datetime(df['Date/Time'], errors = 'coerce')
df = df.set_index(pd.DatetimeIndex(df['Date/Time']))
df['hour'] = df.index.hour
df['month'] = df.index.month
mn1 = np.linspace(1, 2, 2)
mn2 = np.linspace(3, 5, 3)
mn3 = np.linspace(6, 8, 3)
mn4 = np.linspace(9, 11, 3)
mn5 = np.linspace(12)
hr = np.linspace(0, 23, 24)
def hdh(hr, mn2):
    out = []
    for n, k in hr, mn2:
        t = (df['hour'] == n) & (df['month'] == k)
        s = t['Wind Spd (km/h)'].mean(axis = 0) / 3.6
        out.append(s)
    return out
plt.plot(hr, hdh(hr, mn2))
plt.xlabel('Hour')
plt.ylabel('Wind Speed (m/s)')
plt.xlim(0, 24)
plt.ylim(2.85, 4.75)
plt.title('ShearENV Anual Average Hourly Wind Speed')
plt.grid(which = 'both', axis='both')
plt.show()
The expected result should be a list of the data conforming to a specific hour (for example 01:00) and a specific season (for example months 3 to 5). As of now, I am only getting errors. Thank you for any help.
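The traceback comes from `for n, k in hr, mn2`: that loop iterates over the two-element tuple `(hr, mn2)` and tries to unpack its first element, the 24-value `hr` array, into `n` and `k`. One possible way around it is to filter by the set of months first and then group by hour. The following is only a sketch: the synthetic data stands in for merged_1.csv, and the function name `hourly_mean_speed` is a hypothetical choice, not something from the original script.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for merged_1.csv (assumption): one year of hourly wind speeds.
index = pd.date_range("2019-01-01", periods=365 * 24, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
df = pd.DataFrame({"Wind Spd (km/h)": rng.uniform(5, 30, len(index))}, index=index)
df["hour"] = df.index.hour
df["month"] = df.index.month


def hourly_mean_speed(frame, months):
    """Mean wind speed in m/s for each hour of day, using only the given months."""
    in_season = frame[frame["month"].isin(months)]          # keep rows in the season
    by_hour = in_season.groupby("hour")["Wind Spd (km/h)"]  # one group per hour 0..23
    return by_hour.mean() / 3.6                              # km/h -> m/s


spring = hourly_mean_speed(df, [3, 4, 5])  # Series indexed by hour of day
```

The result is a Series with one value per hour, so something like `plt.plot(spring.index, spring.values)` should give the seasonal curve.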

Related

Compute the rolling mean in pandas

I ran the following code:
import numpy as np
import pandas as pd
#make this example reproducible
np.random.seed(0)
#create dataset
period = np.arange(1, 101, 1)
leads = np.random.uniform(1, 20, 100)
sales = 60 + 2*period + np.random.normal(loc=0, scale=.5*period, size=100)
df = pd.DataFrame({'period': period, 'leads': leads, 'sales': sales})
#view first 10 rows
df.head(10)
df['rolling_sales_5'] = df['sales'].rolling(5,center=True, min_periods=1).mean()
df.head(10)
But I do not understand how the first two and last two observations of the rolling_sales_5 variable are generated. Any ideas?
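With `center=True` and a window of 5, each row's window covers two rows before and two rows after it. With `min_periods=1`, the window is simply cut off at the edges of the data instead of producing NaN. A small example with made-up numbers (not the sales data above) shows the pattern:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
rolled = s.rolling(5, center=True, min_periods=1).mean()

# Row 0 averages rows 0..2      -> (1 + 2 + 3) / 3         = 2.0
# Row 1 averages rows 0..3      -> (1 + 2 + 3 + 4) / 4     = 2.5
# Row 2 averages rows 0..4      -> (1 + 2 + 3 + 4 + 5) / 5 = 3.0
# The last two rows mirror this: 4.5 (rows 2..5) and 5.0 (rows 3..5).
print(rolled.tolist())
```

So the first and last observations are means of three values, and the second and second-to-last are means of four.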

What is invalid index to scalar variable error in python?

I am quite new to python so please bear with me.
Currently, this is my code
import statistics
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from datetime import datetime
df = pd.read_csv(r"/Users/aaronhuang/Documents/Desktop/ffp/exfileCLEAN2.csv", skiprows=[1]) # replace this with wherever the file is.
start_time = datetime.now()
magnitudes = df['Magnitude '].values
times = df['Time '].values
average = statistics.mean(magnitudes)
sd = statistics.stdev(magnitudes)
below = sd*3
i = 0
while(i < len(df['Magnitude '])):
    if(abs(df['Magnitude '][i]) <= (average - below)):
        print(df['Time '][i])
        outlier_indicies=(df['Time '][i])
    i += 1
window = 2
num = 1
x = times[outlier_indicies[num]-window:outlier_indicies[num]+window+1]
y = magnitudes[outlier_indicies[num]-window:outlier_indicies[num]+window+1]
plt.plot(x, y)
plt.xlabel('Time (units)')
plt.ylabel('Magnitude (units)')
plt.show()
fig = plt.figure()
It outputs this:
/Users/aaronhuang/.conda/envs/EXTTEst/bin/python "/Users/aaronhuang/PycharmProjects/EXTTEst/Code sandbox.py"
2456116.494
2456116.535
2456116.576
2456116.624
2456116.673
2456116.714
2456116.799
2456123.527
2456166.634
2456570.526
2456595.515
2457485.722
2457497.93
2457500.674
2457566.874
2457567.877
Traceback (most recent call last):
File "/Users/aaronhuang/PycharmProjects/EXTTEst/Code sandbox.py", line 38, in <module>
x = times[outlier_indicies[num]-window:outlier_indicies[num]+window+1]
IndexError: invalid index to scalar variable.
Process finished with exit code 1
How can I solve this error? I would like my code to take the "time" values printed, and graph them to their "magnitude" values. If there are any questions please leave a comment.
Thank you
Can't tell exactly what you are trying to do. But the indexing format you are using should evaluate to something like times[10:20], going from the 10th to the 20th index of times. The problem is that (I'm guessing) the numbers you have in there aren't ints, but possibly timestamps?
Maybe you want something like:
mask = (times > outlier_indicies[num-window]) & (times < outlier_indicies[num+window+1])
x = times[mask]
y = magnitudes[mask]
But I'm really just guessing, and obviously can't see your data.
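For what it's worth, the error itself arises because `outlier_indicies` is overwritten on every pass of the loop with a single time value, so `outlier_indicies[num]` tries to index a scalar. A sketch of one alternative is to collect the integer positions of all outliers and slice around one of them. The data below is synthetic and only stands in for the CSV columns, and `outlier_positions` is a name introduced here for illustration.

```python
import numpy as np

# Synthetic stand-in for the CSV columns (assumption): a flat signal with one dip.
times = np.arange(50, dtype=float)
magnitudes = np.full(50, 10.0)
magnitudes[20] = 0.0

average = magnitudes.mean()
sd = magnitudes.std(ddof=1)        # sample standard deviation, like statistics.stdev
threshold = average - 3 * sd

# Integer positions of every outlier, kept as an array instead of a single scalar.
outlier_positions = np.flatnonzero(np.abs(magnitudes) <= threshold)

window = 2
num = 0                            # which outlier to look at (0 = the first one)
centre = outlier_positions[num]
start = max(centre - window, 0)    # avoid a negative start at the left edge
x = times[start:centre + window + 1]
y = magnitudes[start:centre + window + 1]
```

`x` and `y` can then be passed to `plt.plot(x, y)` as in the question.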

Select smoothing parameter and implement non-parametric regression in Python

I'm working in R to estimate non-parametric regression. The complete project: https://systematicinvestor.wordpress.com/2012/05/22/classical-technical-patterns
My R code is the following, relying on the sm package's h.select and sm.regression:
library(sm)
y = as.vector( last( Cl(data), 190) )
t = 1:len(y)
h = h.select(t, y, method = 'cv')
temp = sm.regression(t, y, h=h, display = 'none')
I now would like to do the same in Python. I managed to set up the data (see below) but do not know how to select the smoothing parameter and estimate the non-parametric regression.
import pandas as pd
import datetime
import pandas_datareader.data as web
from pandas import Series, DataFrame
start = datetime.datetime(1970, 1, 1)
end = datetime.datetime(2020, 3, 24)
df = web.DataReader("^GSPC", 'yahoo', start, end)
y = df['Close'].tail(190).values
t = list(range(1, len(y) + 1))
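As a rough starting point, the idea of choosing a bandwidth by cross-validation and then smoothing can be sketched in plain NumPy. This is a Nadaraya-Watson (local constant) estimator with a Gaussian kernel and leave-one-out cross-validation over a grid. It is not a port of `h.select` or `sm.regression`, and its results should not be expected to match the R output. The function names, the candidate grid, and the synthetic series are all assumptions made for illustration.

```python
import numpy as np


def nw_smooth(t, y, h, leave_one_out=False):
    """Nadaraya-Watson estimate at each t with a Gaussian kernel of bandwidth h."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    if leave_one_out:
        np.fill_diagonal(weights, 0.0)   # each point is predicted without itself
    return (weights @ y) / weights.sum(axis=1)


def select_bandwidth(t, y, candidates):
    """Pick the candidate bandwidth with the lowest leave-one-out squared error."""
    y = np.asarray(y, dtype=float)
    scores = [np.mean((y - nw_smooth(t, y, h, leave_one_out=True)) ** 2)
              for h in candidates]
    return candidates[int(np.argmin(scores))]


# Synthetic stand-in for the last 190 closing prices (assumption).
rng = np.random.default_rng(0)
t = np.arange(1, 191)
y = np.sin(t / 15.0) + rng.normal(0.0, 0.3, size=t.size)

candidates = np.linspace(1.0, 20.0, 20)  # hypothetical search grid
h = select_bandwidth(t, y, candidates)
fitted = nw_smooth(t, y, h)
```

With real data, `t` and `y` from the snippet above could be passed in directly; the grid would likely need adjusting to the scale of `t`.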

Why do I have different array dimensions only when not using today's date?

I am trying to get stock data for a company and predict stock prices in the future. I know this isn't accurate, but I am using it as a learning tool. When using today's date as the end date and the predicted date as a date in the future my code appears to work. However, when using a past date and predicting the future this produces an error:
"ValueError: x and y must have same first dimension, but have shapes (220,) and (221,)"
I want to do this as then I would be able to compare predictions to actual prices.
import numpy as np
import datetime
import pandas_datareader as web
import statistics
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import style
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
stock_name = 'BP.L'
prices = web.DataReader(stock_name, 'yahoo', start = '2019-01-01', end = '2019-11-05').reset_index(drop = False)[['Date', 'Adj Close']]
#plt.plot(prices['Date'], prices['Adj Close'])
#plt.xlabel('Days')
#plt.ylabel('Stock Prices')
#plt.show()
# Parameter Definitions
# So : initial stock price
# dt : time increment -> a day in our case
# T : length of the prediction time horizon(how many time points to predict, same unit with dt(days))
# N : number of time points in the prediction time horizon -> T/dt
# t : array for time points in the prediction time horizon [1, 2, 3, .. , N]
# mu : mean of historical daily returns
# sigma : standard deviation of historical daily returns
# b : array for brownian increments
# W : array for brownian path
start_date = '2018-01-01'
end_date = '2019-01-01'
pred_end_date = '2019-11-05'
# We get daily closing stock prices
S_eon = web.DataReader(stock_name, 'yahoo', start_date, end_date).reset_index(drop = False)[['Date', 'Adj Close']]
So = S_eon.loc[S_eon.shape[0] -1, "Adj Close"]
dt = 1
n_of_wkdays = pd.date_range(start = pd.to_datetime(end_date,
format = "%Y-%m-%d") + pd.Timedelta('1 days'),
end = pd.to_datetime(pred_end_date,
format = "%Y-%m-%d")).to_series(
).map(lambda x:
1 if x.isoweekday() in range(1,6) else 0).sum()
T = n_of_wkdays
N = T / dt
t = np.arange(1, int(N) + 1)
returns = (S_eon.loc[1:, 'Adj Close'] - \
S_eon.shift(1).loc[1:, 'Adj Close']) / \
S_eon.shift(1).loc[1:, 'Adj Close']
mu = np.mean(returns)
sigma = np.std(returns)
scen_size = 10000
b = {str(scen): np.random.normal(0, 1, int(N)) for scen in range(1, scen_size + 1)}
W = {str(scen): b[str(scen)].cumsum() for scen in range(1, scen_size + 1)}
drift = (mu - 0.5 * sigma**2) * t
diffusion = {str(scen): sigma * W[str(scen)] for scen in range(1, scen_size + 1)}
S = np.array([So * np.exp(drift + diffusion[str(scen)]) for scen in range(1, scen_size + 1)])
S = np.hstack((np.array([[So] for scen in range(scen_size)]), S))
S_avg = np.mean(S)
print(S_avg)
#Plotting
plt.figure(figsize = (20,10))
for i in range(scen_size):
    plt.title("Daily Volatility: " + str(sigma))
    plt.plot(pd.date_range(start = S_eon["Date"].max(),
                end = pred_end_date, freq = 'D').map(lambda x:
                x if x.isoweekday() in range(1, 6) else np.nan).dropna(), S[i, :])
plt.ylabel('Stock Prices, €')
plt.xlabel('Prediction Days')
plt.show()
The error shows:
"File "C:\Users\User\Anaconda3\lib\site-packages\matplotlib\axes_base.py", line 270, in _xy_from_xy
"have shapes {} and {}".format(x.shape, y.shape))"
Could you try to add one day more to the prediction end date?
pred_end_date = '2019-11-06'
Your error is just a shape mismatch: your date series is missing only one value.
According to the documentation:
date.isoweekday()
Return the day of the week as an integer, where
Monday is 1 and Sunday is 7. For example, date(2002, 12,
4).isoweekday() == 3, a Wednesday. See also weekday(), isocalendar().
This returns a number between 1 and 7, and you're checking the range 1 to 6, converting other values to na. Then you dropna them, so you lose a value.
Change it to x if x.isoweekday() in range(1, 7) and it should work.
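Another way to sidestep the mismatch is to build the date axis from the width of `S` itself, so the two cannot disagree. In the question, `S` has `int(N) + 1` columns (the starting price plus N simulated steps), while the x axis is rebuilt separately from `S_eon["Date"].max()`. The sketch below uses made-up values for the path array and the last historical date; both are assumptions, not the downloaded data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 3 simulated paths with 220 steps plus the starting price.
n_steps = 220
paths = np.ones((3, n_steps + 1))
last_hist_date = pd.Timestamp("2018-12-31")   # assumed last date in the history

# One business day per column of `paths`, starting at the last historical date.
x_dates = pd.bdate_range(start=last_hist_date, periods=paths.shape[1])

assert len(x_dates) == paths.shape[1]          # lengths match by construction
```

Each row could then be drawn with `plt.plot(x_dates, paths[i, :])`. Note that `bdate_range` skips weekends only, not exchange holidays.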
I changed the following and it now works:
"x if x.isoweekday() in range(1, 6) else np.nan).dropna(), S[i, :])"
to:
"x if x.isoweekday() in range(1, 6) else np.nan).dropna(), S[i, :-1])"

Python ValueError. Don't understand error or how to fix

I am following the tutorial here; https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learningnd-deep-learning-techniques-python/#comment-155692
Instead of using the provided dataset I am using one needed for my assignment.
The code used is
#import packages
import pandas as pd
import numpy as np
#to plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline
#setting figure size
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 20,10
#for normalizing data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
#read the file
df = pd.read_csv('C:/Users/Usert/Downloads/stock-20050101-to-20171231/stock-20050101-to-20171231/IBM_2006-01-01_to_2018-01-01.csv')
#print the head
df.head()
#setting index as date
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
df.index = df['Date']
#plot
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close Price history')
#creating dataframe with date and the target variable
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date', 'Close'])
for i in range(0,len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Close'][i] = data['Close'][i]
#splitting into train and validation
train = new_data[:987]
valid = new_data[987:]
new_data.shape, train.shape, valid.shape
((1235, 2), (987, 2), (248, 2))
train['Date'].min(), train['Date'].max(), valid['Date'].min(), valid['Date'].max()
#make predictions
preds = []
for i in range(0,248):
    a = train['Close'][len(train)-248+i:].sum() + sum(preds)
    b = a/248
    preds.append(b)
#calculate rmse
rms=np.sqrt(np.mean(np.power((np.array(valid['Close'])-preds),2)))
rms
#plot
valid['Predictions'] = 0
valid['Predictions'] = preds
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])
This runs fine until "#Calculate RMSE" when it hits the error.
File "<ipython-input-92-1256d885493e>", line 65, in <module>
rms=np.sqrt(np.mean(np.power((np.array(valid['Close'])-preds),2)))
ValueError: operands could not be broadcast together with shapes (2033,) (248,)
Using "print(valid.shape)" and "print(len(preds))" as requested returns "(604, 3)" and "248".
Any idea how I should change the numbers to fit my dataset? Each time I change them, I create more errors.
Just FYI;
The dataset I am using has 7 columns named "Date, Open, High, Low, Close, Volume and Name" with 3021 rows of data including headers.
Whilst the one in the tutorial has 8 columns being "date, open, high, low, last, close, total_trade_quantity, and turnover" with 1236 rows including headers.
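The shapes in the traceback would be consistent with the split points being hard-coded for the tutorial's data: with 3020 data rows, `new_data[987:]` has 2033 rows, while the loop still produces only 248 predictions. One option is to derive both numbers from the length of the data. The sketch below uses a synthetic `Close` series and assumes an 80/20 split, roughly the ratio the tutorial uses (987 of 1235 rows); adjust as needed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'Close' column (assumption); any length works.
close = pd.Series(np.linspace(100.0, 150.0, 3020))

split = int(len(close) * 0.8)        # replaces the hard-coded 987
train, valid = close.iloc[:split], close.iloc[split:]
horizon = len(valid)                 # replaces the hard-coded 248

preds = []
for i in range(horizon):
    # Sum of the last (horizon - i) training values plus the predictions so far.
    a = train.iloc[len(train) - horizon + i:].sum() + sum(preds)
    preds.append(a / horizon)

# Both arrays now have the same length, so the subtraction broadcasts cleanly.
rms = np.sqrt(np.mean((valid.to_numpy() - np.array(preds)) ** 2))
```

Because `horizon` is taken from `len(valid)`, the RMSE line no longer depends on the size of the dataset.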
