Programming a prediction model, code runs but doesn't give output - python

My code runs properly but it will not provide output as it should. I am not sure where the issue is occurring. Could someone help me correct it? Do you need the CSV too?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("/content/drive/MyDrive/replicates/Replicate 3 Gilts just measures.csv")
df.info()
df.head()
# removing the irrelevant columns
cols_to_drop = ["animal"]
df = df.drop(columns=cols_to_drop,axis=1)
# first five rows of data frame after removing columns
df.head()
deep_df = df.copy(deep = True)
numerical_columns = [col for col in df.columns if (df[col].dtype=='int64' or
df[col].dtype=='float64')]
df[numerical_columns].describe().loc[['min','max', 'mean','50%'],:]
df[df['i1000.0'] == df['i1000.0'].min()]
This is where the issue occurs
i1000_bucket = df.groupby(pd.cut(df["i1000.0"],bins=[10,20,30,40,50,60,70,80,90,100]))
number_bucket = df.groupby(pd.cut(df["i1000.0"],bins=[10,20,30,40,50,60,70,80,90,100]))
i1000_bucket = ((i1000_bucket.sum()["i1000.0"] / i1000_bucket.size())*100 , 2)
number_bucket = round((number_bucket.sum()["i1000.0"] / number_bucket.size())*100 , 2)
The graph appears but nothing actually plots
x = [str(i)+"-"+str(i+10) for i in range(10,91,10)]
plt.plot(x,number_bucket.values)
plt.xlabel("i1000.0")
plt.ylabel("p1000.0")
plt.title("1000.0 comparisons")
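One thing worth checking (an assumption, since the CSV is not shown): pd.cut assigns NaN to every value that falls outside the bin edges. If the i1000.0 values do not lie between 10 and 100, every bucket is empty, the per-bucket ratio becomes 0/0 = NaN, and plt.plot has nothing to draw. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the "i1000.0" column; the real CSV is not available.
df = pd.DataFrame({"i1000.0": [0.5, 1.2, 3.4, 7.8, 9.9]})

bins = list(range(10, 101, 10))
binned = pd.cut(df["i1000.0"], bins=bins)

# Values outside (10, 100] get no bin and show up as NaN here.
n_unbinned = int(binned.isna().sum())
print(f"{n_unbinned} of {len(df)} values fall outside the bins")

# Same ratio as in the question; empty buckets give 0 / 0 = NaN.
grouped = df.groupby(binned, observed=False)["i1000.0"]
per_bucket = grouped.sum() / grouped.size() * 100
print(per_bucket)
```

If most values turn out to be unbinned, choosing bin edges from df["i1000.0"].min() and df["i1000.0"].max() would be the next thing to try.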

Related

Split data frame in python based on one parameter shape

I have a data frame which is like the following :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import csv
import seaborn as sns
import warnings
df_input = pd.read_csv('combine_input.csv', delimiter=',')
df_output = pd.read_csv('combine_output.csv', delimiter=',')
In this data frame there are many repeated rows; for example, the first row is repeated more than 1000 times, and the same holds for the other rows.
When I plot the time distribution, I get a figure showing the frequency of each value of the time parameter:
df_input.plot(y='time',kind = 'hist',figsize=(10,10))
plt.grid()
plt.show()
My question is: how can I keep only the data inside the red rectangle, for example at time = 0.006 and frequency = 0.75 × 1e6 (see the following picture)?
Note: the code below uses a column named target. Since your column is named time, either write time in place of target or rename the column to target.
def calRows(df, x, y):
    # keep only the rows whose target is at most x
    df1 = pd.DataFrame(df.target[df.target <= x])
    minCount = len(df1)
    targets = df1.target.unique()
    # find the smallest number of rows shared by any single target value
    for i in targets:
        count = int((df1.target == i).sum())
        if minCount > count:
            minCount = count
    # never take more than y rows per target value
    if minCount > y:
        minCount = int(y)
    return minCount
Pass your data frame, the x-intercept of the graph, and the y-intercept of the graph to calRows(df, x, y); it returns the number of rows to take for each target value.
rows = calRows(df, 6, 75)
print(rows)
takeFeatures(df, rows, x) takes the data frame, rows (the result of the first function), and the x-intercept of the graph, and returns the final data frame.
def takeFeatures(df, rows, x):
    finalDf = pd.DataFrame(columns=df.columns)
    df1 = df[df.target <= x]
    targets = df1.target.unique()
    # draw the same number of random rows for every target value
    for i in targets:
        targeti = df1[df1.target == i]
        sample = targeti.sample(rows)
        finalDf = pd.concat([finalDf, sample])
    return finalDf
Calling the takeFeatures() function:
final = takeFeatures(df, rows, 6)
print(final)
The final data frame will contain the values you marked in the graph.
Plotting this final data frame gives the kind of graph you were after.
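The same idea (filter by the x limit, then take a fixed number of rows per distinct value) can probably be written more compactly with groupby(...).sample(...), which is available in pandas 1.1 and later. This is only a sketch: the function name sample_per_value, its parameters, and the demo data are made up for illustration.

```python
import pandas as pd

def sample_per_value(df, column, x_max, n_rows, seed=0):
    """Keep rows with df[column] <= x_max, then draw n_rows rows per distinct value."""
    subset = df[df[column] <= x_max]
    return subset.groupby(column).sample(n=n_rows, random_state=seed)

# Made-up data: three distinct time values with different repeat counts.
df_demo = pd.DataFrame({
    "time": [0.002] * 5 + [0.004] * 8 + [0.009] * 4,
    "value": range(17),
})

final = sample_per_value(df_demo, "time", x_max=0.006, n_rows=3)
print(final["time"].value_counts())
```

Note that n_rows must not exceed the size of the smallest group, otherwise sample raises a ValueError; that is the role calRows plays in the answer above.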

Weird Time-Series Graph Using Pycaret and plotly

I am trying to visualize air quality data as time-series charts using the pycaret and plotly dash Python libraries, but I am getting very weird graphs. Below is my code:
import pandas as pd
import plotly.express as px
data = pd.read_csv('E:/Self Learning/Djang_Dash/2019-2020_5.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')
#data.set_index('Date', inplace=True)
# combine store and item column as time_series
data['OBJECTID'] = ['Location_' + str(i) for i in data['OBJECTID']]
#data['AQI_Bins_AI'] = ['Bin_' + str(i) for i in data['AQI_Bins_AI']]
data['time_series'] = data[['OBJECTID']].apply(lambda x: '_'.join(x), axis=1)
data.drop(['OBJECTID'], axis=1, inplace=True)
# extract features from date
data['month'] = [i.month for i in data['Date']]
data['year'] = [i.year for i in data['Date']]
data['day_of_week'] = [i.dayofweek for i in data['Date']]
data['day_of_year'] = [i.dayofyear for i in data['Date']]
data.head(4000)
data['time_series'].nunique()
for i in data['time_series'].unique():
    subset = data[data['time_series'] == i]
    subset['moving_average'] = subset['CO'].rolling(window = 30).mean()
    fig = px.line(subset, x="Date", y=["CO","moving_average"], title = i, template = 'plotly_dark')
    fig.show()
I would appreciate help with this.
Here is my sample data: Google Drive Link
The data has not been provided in a usable way, so I sought out similar publicly available data and found: https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=station_hour.csv
Using this data, with a couple of cleanups to your code, I see no issues with the plots. I suspect your data has one of these issues:
Date is not datetime64[ns] in your data frame.
Date is not sorted, leading to lines being drawn in the way you have noted.
By refactoring the way the moving average is calculated, you can also use animation instead of lots of separate figures.
Get some data:
import kaggle.cli
import sys, math
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import plotly.express as px
# download data set
# https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=station_hour.csv
sys.argv = [sys.argv[0]] + "datasets download rohanrao/air-quality-data-in-india".split(" ")
kaggle.cli.main()
zfile = ZipFile("air-quality-data-in-india.zip")
print([f.filename for f in zfile.infolist()])
Plot using the code from the question:
import pandas as pd
import plotly.express as px
from pathlib import Path
from distutils.version import StrictVersion
# data = pd.read_csv('E:/Self Learning/Djang_Dash/2019-2020_5.csv')
# use kaggle data
# dfs = {f.filename:pd.read_csv(zfile.open(f)) for f in zfile.infolist() if f.filename in ['station_day.csv',"stations.csv"]}
# data = pd.merge(dfs['station_day.csv'],dfs["stations.csv"], on="StationId")
# data['Date'] = pd.to_datetime(data['Date'])
# # kaggle data is different from question, make it compatible with questions data
# data = data.assign(OBJECTID=lambda d: d["StationId"])
# sample data from google drive link
data2 = pd.read_csv(Path.home().joinpath("Downloads").joinpath("AQI.csv"))
data2["Date"] = pd.to_datetime(data2["Date"])
data = data2
# as per the very first comment - it's important that the data is ordered!
data = data.sort_values(["Date","OBJECTID"])
data['time_series'] = "Location_" + data["OBJECTID"].astype(str)
# clean up data, remove rows where there is no CO value
data = data.dropna(subset=["CO"])
# can do moving average in one step (can also be used by animation)
if StrictVersion(pd.__version__) < StrictVersion("1.3.0"):
    data["moving_average"] = data.groupby("time_series",as_index=False)["CO"].rolling(window=30).mean().to_frame()["CO"].values
else:
    data["moving_average"] = data.groupby("time_series",as_index=False)["CO"].rolling(window=30).mean()["CO"]
# just the first three, for purpose of demonstration
for i in data['time_series'].unique()[0:3]:
    subset = data.loc[data['time_series'] == i]
    fig = px.line(subset, x="Date", y=["CO","moving_average"], title = i, template = 'plotly_dark')
    fig.show()
Alternatively, use animation:
px.line(
data,
x="Date",
y=["CO", "moving_average"],
animation_frame="time_series",
template="plotly_dark",
).update_layout(yaxis={"range":[data["CO"].min(), data["CO"].quantile(.97)]})
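As a side note, the per-location moving average can also be written with groupby(...).transform(...), which returns values aligned with the original row index regardless of pandas version. This is a sketch on made-up data, not the code the answer was tested with; the window is shortened to 2 (the answer uses 30) so the result is easy to check by hand.

```python
import pandas as pd

# Made-up data: two locations observed on three days.
data = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02",
                            "2020-01-02", "2020-01-03", "2020-01-03"]),
    "time_series": ["Location_1", "Location_2"] * 3,
    "CO": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0],
})
data = data.sort_values(["Date", "time_series"])

# transform keeps the original index, so each row gets the rolling mean
# of its own location only.
data["moving_average"] = (
    data.groupby("time_series")["CO"]
        .transform(lambda s: s.rolling(window=2).mean())
)
print(data)
```

The first row of each location is NaN because a window of 2 needs two observations.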

Python ValueError. Don't understand error or how to fix

I am following the tutorial here: https://www.analyticsvidhya.com/blog/2018/10/predicting-stock-price-machine-learningnd-deep-learning-techniques-python/#comment-155692
Instead of using the provided dataset, I am using one needed for my assignment.
The code used is:
#import packages
import pandas as pd
import numpy as np
#to plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline
#setting figure size
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 20,10
#for normalizing data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
#read the file
df = pd.read_csv('C:/Users/Usert/Downloads/stock-20050101-to-20171231/stock-20050101-to-20171231/IBM_2006-01-01_to_2018-01-01.csv')
#print the head
df.head()
#setting index as date
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
df.index = df['Date']
#plot
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close Price history')
#creating dataframe with date and the target variable
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date', 'Close'])
for i in range(0,len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Close'][i] = data['Close'][i]
#splitting into train and validation
train = new_data[:987]
valid = new_data[987:]
new_data.shape, train.shape, valid.shape
((1235, 2), (987, 2), (248, 2))
train['Date'].min(), train['Date'].max(), valid['Date'].min(), valid['Date'].max()
#make predictions
preds = []
for i in range(0,248):
    a = train['Close'][len(train)-248+i:].sum() + sum(preds)
    b = a/248
    preds.append(b)
#calculate rmse
rms=np.sqrt(np.mean(np.power((np.array(valid['Close'])-preds),2)))
rms
#plot
valid['Predictions'] = 0
valid['Predictions'] = preds
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])
This runs fine until the "#calculate rmse" step, where it hits this error:
File "<ipython-input-92-1256d885493e>", line 65, in <module>
rms=np.sqrt(np.mean(np.power((np.array(valid['Close'])-preds),2)))
ValueError: operands could not be broadcast together with shapes (2033,) (248,)
Using "print(valid.shape)" and "print(len(preds))" as requested returns "(604, 3)" and "248".
How should I change the numbers to fit my dataset? Each time I change them, I create more errors.
Just FYI:
The dataset I am using has 7 columns named "Date, Open, High, Low, Close, Volume and Name", with 3021 rows of data including headers.
The one in the tutorial has 8 columns, "date, open, high, low, last, close, total_trade_quantity, and turnover", with 1236 rows including headers.
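The broadcast error says the validation array and preds have different lengths, which is what happens when the tutorial's hard-coded 987 and 248 are kept for a dataset of a different size. One way around it is to derive both numbers from the data. The sketch below uses synthetic prices and assumes an 80/20 split (roughly the tutorial's 987 of 1235); neither is taken from the real dataset.

```python
import numpy as np

# Synthetic closing prices standing in for the 3020 data rows of the real file.
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(size=3020))

# Derive the split from the data length instead of hard-coding 987.
split = int(len(close) * 0.8)
train, valid = close[:split], close[split:]

# The number of predictions must equal the validation length (not 248).
window = len(valid)

preds = []
for i in range(window):
    a = train[len(train) - window + i:].sum() + sum(preds)
    preds.append(a / window)

# Both arrays now have the same shape, so the subtraction broadcasts cleanly.
rms = np.sqrt(np.mean((valid - np.array(preds)) ** 2))
print(valid.shape, len(preds), rms)
```

With 3020 rows this gives 604 validation rows, matching the valid.shape reported above, and preds has 604 entries as well.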

How to apply euclidean distance to dataframe. Calculate each row

Please help me with this problem. I have been stuck on it for about 2 weeks.
I want to use "apply" on a dataframe that I got from the Alphavantage API.
I want to apply the euclidean distance to each row of the dataframe.
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.neighbors import KNeighborsRegressor
from alpha_vantage.timeseries import TimeSeries
from services.KEY import getApiKey
ts = TimeSeries(key=getApiKey(), output_format='pandas')
This is the chart I currently get:
My chart (sorry, I can't post the image because of my reputation)
This is my code:
stock, meta_data = ts.get_daily_adjusted(symbol, outputsize='full')
stock = stock.sort_values('date')
open = stock['1. open'].values
low = stock['3. low'].values
high = stock['2. high'].values
close = stock['4. close'].values
sorted_date = stock.index.get_level_values(level='date')
stock_numpy_format = np.stack((sorted_date, open, low, high, close), axis=1)
df = pd.DataFrame(stock_numpy_format, columns=['date', 'open', 'low', 'high', 'close'])
df = df[df['open']>0]
df = df[(df['date'] >= "2016-01-01") & (df['date'] <= "2018-12-31")]
df = df.reset_index(drop=True)
df['close_next'] = df['close'].shift(-1)
df['daily_return'] = df['close'].pct_change(1)
df['daily_return'].fillna(0, inplace=True)
stock_numeric_close_dailyreturn = df['close', 'daily_return']
stock_normalized = (stock_numeric_close_dailyreturn - stock_numeric_close_dailyreturn.mean()) / stock_numeric_close_dailyreturn.std()
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized) , axis=1)
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx":euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = df.loc[int(second_smallest)]["date"]
I want my chart to look like this:
The chart that I want
This is the code that produced that picture:
distance_columns = ['Close', 'DailyReturn']
stock_numeric = stock[distance_columns]
stock_normalized = (stock_numeric - stock_numeric.mean()) / stock_numeric.std()
stock_normalized.fillna(0, inplace = True)
date_normalized = stock_normalized[stock["Date"] == "2016-06-29"]
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized), axis = 1)
distance_frame = pandas.DataFrame(data = {"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = stock.loc[int(second_smallest)]["Date"]
I tried to figure it out: the result of df.apply differs between the pandas output format from the API and a dataframe read from CSV.
Is there an alternative that gives the same output for both formats (pandas and CSV)?
Thank you!
NB: sorry if my English is bad.
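For what it's worth, the row-wise distance can be computed without apply, which sidesteps differences in how the dataframe was built. The sketch below uses made-up prices and makes two assumptions: the feature columns are cast to float (np.stack with a date column produces object columns), and the reference row is taken as a 1-D Series with .iloc[0] rather than a one-row dataframe.

```python
import numpy as np
import pandas as pd

# Made-up prices; the real data comes from the Alphavantage API.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-06-27", "2016-06-28", "2016-06-29",
                            "2016-06-30", "2016-07-01"]),
    "close": [10.0, 11.0, 10.5, 12.0, 11.0],
})
df["daily_return"] = df["close"].pct_change().fillna(0)

# Normalize the two feature columns (cast to float in case they are object dtype).
features = df[["close", "daily_return"]].astype(float)
normalized = (features - features.mean()) / features.std()

# Reference row as a 1-D Series, so it broadcasts against every row.
ref = normalized[df["date"] == "2016-06-29"].iloc[0]

# Euclidean distance from every row to the reference row.
dist = np.sqrt(((normalized - ref) ** 2).sum(axis=1))

# The smallest distance is the reference row itself; the second is the closest other day.
order = dist.sort_values()
most_similar_to_date = df.loc[order.index[1], "date"]
print(most_similar_to_date)
```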

Python Index column doesn't freeze while scrolling to the right

My problem is that I have a Dataframe of 200 rows and 200 columns. While I scroll to the right, the index column stays fixed (I can still see it), as it should.
However, when I select a column or a value in the Dataframe (for example, to sort the values in ascending or descending order), the index column changes and becomes the same as the column I selected.
I would like to still see the index column.
I am using Spyder 3.3.0 and Python 3.6.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import operator
# Importing the dataset
dataset = pd.read_csv('1992_2014.csv', index_col =0)
nations_all = dataset.iloc[:, 0].values
nations = [nations_all[0]]
for i in range(0, len(nations_all)):
    if nations_all[i] not in nations:
        nations.append(nations_all[i])
Year = dataset.iloc[:, 1].values
CO2 = dataset.iloc[:, 8].values
# Creating the Trend Matrix between two nations
trend_matrix = pd.DataFrame(index = nations, columns = nations)
for i in nations:
    n = dataset[dataset["Nation"] == i].index.values.astype(int)
    for k in nations:
        kn = dataset[dataset["Nation"] == k].index.values.astype(int)
        div_n = CO2[n[0]]
        div_kn = CO2[kn[0]]
        CO2_n = (CO2[n]/div_n)
        CO2_kn = (CO2[kn]/div_kn)
        trend_matrix.loc[i, k] = sum(list(map(abs,list(map(operator.sub, CO2_n, CO2_kn)))))
Thanks!
