Find the best R2 by looping through a dataframe - python

I am learning pandas and NumPy. I am trying to write a script that will loop through a dataframe and calculate the R2 of an increasingly larger number of rows. This is what I came up with for now:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
df=pd.DataFrame()
a=[1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10]
b=[2, 4, 6, 7, 8, 9, 10, 11, 12, 13]
df['a']=a
df['b']=b
print(df)
plt.scatter(x=df['a'], y=df['b'])
lr = LinearRegression()
for i in range(len(df)):
    n=0
    X=np.column_stack([np.ones(len(df), dtype=np.float32),(df['a'].loc[0+n]).values()])
    y=(df['b'].loc[0+n])
    n=n+1
    model = lr.fit(X,y)
    print(f'R Squared: {model.score(X,y)}')
But I only get the error:
'numpy.float64' object has no attribute 'values'
When I use .values without the for-loop, it converts the values without any problem.

I don't fully understand your goal, but df['a'].loc[0+n] returns a single number (a numpy.float64), not a Series, so it has no .values attribute for you to call.
I've added some hopefully helpful comments to parts of your code.
Can you please clarify what you expect X to be in each iteration of the loop?
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
df=pd.DataFrame()
a=[1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10]
b=[2, 4, 6, 7, 8, 9, 10, 11, 12, 13]
df['a']=a
df['b']=b
print(df)
plt.scatter(x=df['a'], y=df['b'])
lr = LinearRegression()
n=0 #I moved this outside the for-loop. before n would always be reset to 0 in each iteration of the for-loop
#you can also consider using i instead of n since i
for i in range(len(df)):
    print('n =',n,'i =',i) #keep track of what n and i are
    X=np.column_stack([
        np.ones(len(df), dtype=np.float32), #first column is all ones
        #(df['a'].loc[0+n]).values() #do you need to add 0?
        np.repeat(df['a'].loc[n], len(df)), #did you want the second column to all be the n-th value of column A?
    ])
    y=(df['b'].loc[0+n]) #do you need to add 0?
    n=n+1
    print(X)
    #skipping the below for now
    #model = lr.fit(X,y) #Error! complains it can't fit since "TypeError: Singleton array array(2) cannot be considered a valid collection."
    #print(f'R Squared: {model.score(X,y)}')
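If the goal is to fit the first n rows for a growing n and compare the R2 values, a minimal sketch could look like the one below. It assumes that reading of the question. It also swaps LinearRegression for np.polyfit to stay self-contained, and it starts at three rows because two points always fit a line perfectly. The names r_squared, scores and best_n are only illustrative.

```python
import numpy as np

# Same data as in the question.
a = np.array([1, 1.5, 2, 4, 5, 6, 7, 8, 9, 10], dtype=float)
b = np.array([2, 4, 6, 7, 8, 9, 10, 11, 12, 13], dtype=float)

def r_squared(x, y):
    """R2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

scores = {}
for n in range(3, len(a) + 1):
    # Slice the first n rows instead of picking a single element with .loc[n].
    scores[n] = r_squared(a[:n], b[:n])
    print(f"first {n} rows: R Squared = {scores[n]:.4f}")

best_n = max(scores, key=scores.get)
print("best number of rows:", best_n)
```

The key difference from the loop in the question is the slice a[:n], which yields an array of n values rather than a single scalar.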


How to make a graph with x and y of different length

I'm trying to make a Python app that shows a graph after the input of the data by the user, but the problem is that the y_array and the x_array do not have the same dimensions. When I run the program, this error is raised:
ValueError: x and y must have same first dimension, but have shapes () and ()
How can I draw a graph with the X and Y axis of different length?
Here is a minimal example that leads to the same error I got:
import matplotlib.pyplot as plt
y = [0, 8, 9, 3, 0]
x = [1, 2, 3, 4, 5, 6, 7]
plt.plot(x, y)
plt.show()
This is virtually a copy/paste of the answer found here, but I'll show what I did to get these to match.
First, we need to decide which array to resize: the x_array of length 7 or the y_array of length 5. I'll show both, starting with the former. Note that I am using numpy arrays, not lists.
Let's load the modules
import numpy as np
import matplotlib.pyplot as plt
import scipy.interpolate as interp
and the arrays
y = np.array([0, 8, 9, 3, 0])
x = np.array([1, 2, 3, 4, 5, 6, 7])
In both cases, we use interp.interp1d which is described in detail in the documentation.
For the x_array to be reduced to the length of the y_array:
x_inter = interp.interp1d(np.arange(x.size), x)
x_ = x_inter(np.linspace(0,x.size-1,y.size))
print(len(x_), len(y))
# Prints: 5 5
plt.plot(x_,y)
plt.show()
And for the y_array to be increased to the length of the x_array:
y_inter = interp.interp1d(np.arange(y.size), y)
y_ = y_inter(np.linspace(0,y.size-1,x.size))
print(len(x), len(y_))
# Prints: 7 7
plt.plot(x,y_)
plt.show()
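The same resampling idea can be wrapped in one small helper. The sketch below is only an illustration: it uses np.interp (linear interpolation) instead of interp.interp1d, and resample_to_length is a made-up name, not a library function.

```python
import numpy as np

def resample_to_length(values, new_length):
    """Linearly interpolate `values` onto `new_length` evenly spaced positions."""
    values = np.asarray(values, dtype=float)
    old_positions = np.arange(values.size)
    new_positions = np.linspace(0, values.size - 1, new_length)
    return np.interp(new_positions, old_positions, values)

y = [0, 8, 9, 3, 0]
x = [1, 2, 3, 4, 5, 6, 7]

x_shrunk = resample_to_length(x, len(y))      # 7 values -> 5 values
y_stretched = resample_to_length(y, len(x))   # 5 values -> 7 values

print(len(x_shrunk), len(y))
print(len(x), len(y_stretched))
```

Either result can then be passed to plt.plot together with the array of matching length. Keep in mind that interpolation invents intermediate values, so whether that is acceptable depends on what the data means.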

Problem when splitting data: KeyError: "None of [Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')] are in the [columns]"

I am attempting to run a train/test split on some data (wine.data), but it fails when I initialize X and y:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data")
print(wine.shape)
wine.head()
X = wine[np.arange(1,14)]
y = wine[0]
The rest of the code below this segment will not run as I get the error message:
KeyError: "None of [Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')] are in the [columns]"
I have attempted to resolve this by changing the range of the X value or changing the np.arange call, but neither fixes the problem.
Any help or advice would be greatly appreciated, thank you!
You forgot to pass header=None to pd.read_csv. The csv you are downloading doesn't have a header line, so if you don't specify header=None, the first line of data is used as the header and the column labels become strings taken from that row instead of the integers 0 to 13.
Try with
wine = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
header=None
)
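To see the effect without downloading anything, here is a small sketch on a made-up three-column CSV held in memory (csv_text and the variable names are assumptions for illustration, not the wine data):

```python
import io
import pandas as pd

# Tiny stand-in for a header-less CSV file; the values are made up.
csv_text = "1,14.23,1.71\n1,13.20,1.78\n2,12.37,0.94\n"

# Default behaviour: the first data row is consumed as the header.
guessed_header = pd.read_csv(io.StringIO(csv_text))
print(list(guessed_header.columns), guessed_header.shape)

# header=None: columns are labelled 0, 1, 2 and no row is lost.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns), no_header.shape)

# Integer labels now exist, so label-based selection works.
X = no_header[[1, 2]]
y = no_header[0]
```

With the default settings the integer labels simply do not exist, which is what the KeyError is reporting.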
Are you trying to select the columns by their positions?
If so, try:
X = wine.iloc[:,np.arange(1,14)]
y = wine.iloc[:, 0]

Plotting a histogram from a database using matplot and python

I'm trying to plot a histogram from a database using the matplotlib library in Python.
The query is shown here:
cnx = sqlite3.connect('practice.db')
sql = pd.read_sql_query('''
SELECT CAST((deliverydistance/1)as int)*1 as bin, count(*)
FROM orders
group by 1
order by 1;
''',cnx)
which outputs a table with the columns bin and count(*) (screenshot not included).
From the sql table, I try to extract the columns using a for loop and place them in arrays.
distance =[]
counts = []
for x,y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
print(distance)
print(counts)
OUTPUT:
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
When I plot a histogram
plt.hist(counts,bins=distance)
I get this output (image not included).
My question is, how do I make it so that the count is on the Y axis and the distance is on the X axis? It doesn't seem to allow me to put it there.
You could also skip the for loop and plot directly from your pandas DataFrame using
sql.bin.plot(kind='hist', weights=sql['count(*)'])
Or, with the for loop:
import matplotlib.pyplot as plt
import pandas as pd
distance =[]
counts = []
for x,y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
plt.hist(distance, bins=distance, weights=counts)
You can skip the middle section where you count the instances of each distance. Check out this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'distance':np.round(20 * np.random.random(100))})
df['distance'].hist(bins = np.arange(0,21,1))
Pandas has a built-in histogram plot which counts, then plots the occurrences of each distance. You can specify the bins (in this case 0-20 with a width of 1).
If you are not looking for a bar chart and are looking for a horizontal histogram, then you are looking to pass orientation='horizontal':
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
# plt.style.use('dark_background')
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
plt.hist(counts,bins=distance, orientation='horizontal')
Use:
plt.bar(distance,counts)
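To see why plt.hist(counts, bins=distance) looks wrong, it may help to run the binning step on its own. The sketch below uses np.histogram, which plt.hist relies on for the counting; the variable names are only illustrative.

```python
import numpy as np

distance = list(range(19))
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418,
          4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]

# This is the binning behind plt.hist(counts, bins=distance): it treats the
# counts themselves as data points, and only 9, 4, 1 and 3 fall inside 0..18.
binned_counts, _ = np.histogram(counts, bins=distance)
print(binned_counts)

# The data are already aggregated, so use each distance once and weight it
# by its count. One extra edge (19) gives distance 18 its own bin.
edges = np.arange(20)
weighted, _ = np.histogram(distance, bins=edges, weights=counts)
print(weighted)
```

Because the SQL query has already done the counting, plt.bar(distance, counts) or a weighted histogram is the natural fit, while an unweighted histogram expects the raw, unaggregated distances.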

"'int' object is not subscriptable"

I'm starting to learn GEKKO. As practice I am solving a knapsack problem, but I get the error "'int' object is not subscriptable". Can you look at this code? What is the source of the problem, and how should I define the 1x10 arrays?
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
x = m.Var((10),lb=0,ub=1,integer=True)
#x = m.Array(m.Var,(1,10),lb=0,ub=1,integer=True)
v=np.array([2, 2, 7, 8, 2, 1, 7, 9, 4, 10])
w=np.array([2, 2, 2, 2, 2, 1, 6, 7, 3, 3])
capacity=16
for j in range(10):
    m.Maximize(v[j]*x[j])
for i in range(10):
    m.Equation(m.sum(x[i]*w[i])<=capacity)
m.options.solver = 1
m.solve()
#print('Objective Function: ' + str(m.options.objfcnval))
print(x)
My second question: MATLAB has a function called "showproblem()". Does GEKKO have an equivalent?
Thanks for the help.
A follow-up question based on the answer:
Can I write the constraint in this style? It doesn't work as written, so if it is possible, please show a working version. I would like to use this style because I find it easier to understand.
for i in range(10):
    xw = x[i]*w[i]
m.Equation(m.sum(xw)<=capacity)
instead of this:
xw = [x[i]*w[i] for i in range(10)]
m.Equation(m.sum(xw)<=capacity)
Here is a modified version that solves the mixed integer problem in gekko.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
x = m.Array(m.Var,10,lb=0,ub=1,integer=True)
v=np.array([2, 2, 7, 8, 2, 1, 7, 9, 4, 10])
w=np.array([2, 2, 2, 2, 2, 1, 6, 7, 3, 3])
capacity=16
for j in range(10):
    m.Maximize(v[j]*x[j])
xw = [x[i]*w[i] for i in range(10)]
m.Equation(m.sum(xw)<=capacity)
m.options.solver = 1
m.solve()
print('Objective Function: ' + str(-m.options.objfcnval))
print(x)
Your problem formulation was close. You just needed to define a list xw that you use to form the capacity constraint.
If you want to use a loop instead of a list comprehension then I recommend the following instead of xw = [x[i]*w[i] for i in range(10)].
xw = []
for i in range(10):
    xw.append(x[i]*w[i])
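Since this instance only has 10 binary variables, the formulation can also be cross-checked without any solver by trying all 2^10 selections. The sketch below is plain Python, not GEKKO, and the names best_value and best_choice are only illustrative.

```python
from itertools import product

v = [2, 2, 7, 8, 2, 1, 7, 9, 4, 10]   # item values
w = [2, 2, 2, 2, 2, 1, 6, 7, 3, 3]    # item weights
capacity = 16

best_value = 0
best_choice = None
for choice in product((0, 1), repeat=len(v)):
    # Same constraint as in the model: sum(x[i]*w[i]) <= capacity
    weight = sum(c * wi for c, wi in zip(choice, w))
    if weight > capacity:
        continue
    # Same objective as in the model: sum(v[j]*x[j])
    value = sum(c * vj for c, vj in zip(choice, v))
    if value > best_value:
        best_value, best_choice = value, choice

print("best value:", best_value)
print("selection:", best_choice)
```

The printed best value can be compared with the objective reported by the solver. Several selections may reach the same value, so the selection itself does not have to match.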

PyTorch median - is it a bug or am I using it wrong

I am trying to get the median of each row of a 2D torch.tensor, but the result is not what I expect when compared with a standard Python list or NumPy.
import torch
import numpy as np
from statistics import median
print(torch.__version__)
>>> 0.4.1
y = [[1, 2, 3, 5, 9, 1],[1, 2, 3, 5, 9, 1]]
median(y[0])
>>> 2.5
np.median(y,axis=1)
>>> array([2.5, 2.5])
yt = torch.tensor(y,dtype=torch.float32)
yt.median(1)[0]
>>> tensor([2., 2.])
Looks like this is the intended behaviour of Torch, as mentioned in these issues:
https://github.com/pytorch/pytorch/issues/1837
https://github.com/torch/torch7/pull/182
The reasoning, as given in the link above:
Median returns 'middle' element in case of odd-many elements, otherwise one-before-middle element (could also do the other convention to take mean of the two around-the-middle elements, but that would be twice more expensive, so I decided for this one).
You can emulate numpy median with pytorch:
import torch
import numpy as np
y =[1, 2, 3, 5, 9, 1]
print("numpy=",np.median(y))
print(sorted([1, 2, 3, 5, 9, 1]))
yt = torch.tensor(y,dtype=torch.float32)
ymax = torch.tensor([yt.max()])
print("torch=",yt.median())
print("torch_fixed=",(torch.cat((yt,ymax)).median()+yt.median())/2.)
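The two conventions can also be compared with the standard library alone. In the sketch below, statistics.median_low picks the lower of the two middle values, which is the convention described in the quote above, while statistics.median averages them like NumPy. This only illustrates the conventions and is not PyTorch code.

```python
from statistics import median, median_high, median_low

row = [1, 2, 3, 5, 9, 1]          # sorted: [1, 1, 2, 3, 5, 9]

low = median_low(row)             # lower middle element
high = median_high(row)           # upper middle element
mean_of_middle = median(row)      # average of the two middle elements

print(low, high, mean_of_middle)

# Averaging the lower and upper middle values recovers the NumPy-style median,
# which is what the torch_fixed expression above does with tensors.
print((low + high) / 2)
```

For rows with an odd number of elements all three functions return the same value, so the difference only shows up for even lengths.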
