Batch gradient descent algorithm implementation in Python

I recently learned the batch gradient descent algorithm and tried implementing it in Python. I used a data set which is not random. When I run the code below, the process converges after 3 iterations but with a large error. Can someone point me in the right direction?
Sample data set (the original data set has 600 rows):
6203.75 1 173.8 43.6 0.0 183.0
6329.75 1 115.0 60.1 0.0 236.2
5830.75 1 159.5 94.1 21.0 275.8
4061.75 1 82.5 45.0 11.0 75.7
3311 1 185.5 46.1 4.0 0.0
4349.75 1 169.5 40.3 5.0 73.5
5695.25 1 138.5 68.9 6.0 204.2
5633.5 1 50.0 117.3 4.0 263.9
The first column is the output, the second column is the constant term, and the rest are features.
Thank you
import time

data = open('Data_trial.txt', 'r')
lines = data.readlines()
dataSet = []
for line in lines:
    dataSet.append(line.split())

original_output = []
features = []
for i in range(0, len(dataSet)):
    features.append([])
predict = []
grad = []
weights = [0, 0, 0, 0, 0]
learning_factor = 0.01

# split each row into the target value (first column) and the feature vector
for i in range(0, len(dataSet)):
    for j in range(0, len(dataSet[i])):
        if j == 0:
            original_output.append(float(dataSet[i][j]))
        else:
            features[i].append(float(dataSet[i][j]))

def prediction(predict, weights, original_output, features):
    # predicted value = dot product of weights and feature vector
    for count in range(0, len(original_output)):
        predict.append(sum(weights[i] * features[count][i] for i in range(0, len(features[count]))))
    print("predicted values", predict)

def gradient(predict, grad, original_output, features):
    # gradient of the squared-error cost with respect to each weight
    for count in range(0, len(weights)):
        grad.append(sum((predict[i] - original_output[i]) * features[i][count]
                        for i in range(0, len(original_output))))
    print("Gradient values", grad)

def weights_update(grad, learning_factor, weights):
    for i in range(0, len(weights)):
        weights[i] -= learning_factor * grad[i]
    print("Updated weights", weights)

if __name__ == "__main__":
    while True:
        prediction(predict, weights, original_output, features)
        gradient(predict, grad, original_output, features)
        weights_update(grad, learning_factor, weights)
        time.sleep(1)
        predict = []
        grad = []
        print()
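For reference, a frequent cause of this kind of blow-up is the scale of the data: with outputs in the thousands and learning_factor = 0.01, the summed gradients become huge and the weight updates overshoot. A minimal sketch (an addition, not part of the original code) of standardising the non-constant feature columns before the training loop:

# hypothetical pre-processing step: scale every column except the constant term
n_features = len(features[0])
for j in range(1, n_features):          # column 0 is the constant 1, leave it alone
    col = [row[j] for row in features]
    mean_j = sum(col) / len(col)
    std_j = (sum((v - mean_j) ** 2 for v in col) / len(col)) ** 0.5
    for row in features:
        row[j] = (row[j] - mean_j) / std_j if std_j else 0.0

Dividing each gradient by len(original_output) (i.e. using the mean gradient) and/or lowering learning_factor is commonly combined with this, and a convergence check on the change in weights would let the while True loop terminate.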

Related

ValueError: z array must not contain non-finite values within the triangulation

Hi, I am trying to plot an irregular grid in Basemap using tricontour. However, I now get the following error:
tri, z = self._contour_args(args, kwargs)
File "C:\Users\OName\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\matplotlib\tri\tricontour.py", line 72, in _contour_args
raise ValueError('z array must not contain non-finite values '
ValueError: z array must not contain non-finite values within the triangulation
My data looks like this:
Latitude Longitude Altitude Value O18d
0 30.0 -30.0 0.0 0.522199 0.522199
1 30.0 -29.9 0.0 0.531214 0.531214
2 30.0 -29.8 0.0 0.540248 0.540248
3 30.0 -29.7 0.0 0.549301 0.549301
4 30.0 -29.6 0.0 0.558374 0.558374
... ... ... ... ... ...
359995 69.9 59.5 68.0 -1.489262 -1.625262
359996 69.9 59.6 74.0 -1.487915 -1.635915
359997 69.9 59.7 52.0 -1.486524 -1.590524
359998 69.9 59.8 71.0 -1.485089 -1.627089
359999 69.9 59.9 68.0 -1.483611 -1.619611
Where could my error be?
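The message means that some of the z values handed to the triangulation are NaN or infinite. A minimal sketch of filtering them out first, assuming the printed table is a pandas DataFrame named df and O18d is the column being contoured (adjust the names to your actual code):

import numpy as np

# keep only rows where the coordinates and the contoured value are finite
mask = (np.isfinite(df['Longitude']) &
        np.isfinite(df['Latitude']) &
        np.isfinite(df['O18d']))
df_clean = df[mask]

# then build the triangulation / contour from the cleaned columns, e.g.
# plt.tricontourf(df_clean['Longitude'].values, df_clean['Latitude'].values, df_clean['O18d'].values)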

Wrong value in text comparison

I am having some difficulty finding text matches in the dataset below (note that Sim is my current output; it is generated by running the code below and shows the wrong match).
ID Text Sim
13 fsad amazing ... fsd
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️... fdsfdgte3e
18 gsd wonderful fast
21 dfsfs i love this its incredible ... reds
23 gwe wonderful end ever seen you ... add
... ... ... ...
261 add wonderful gwe
261 add wonderful gsd
261 add wonderful fdsdf
267 fdsfdgte3e best match ever its a masterpiece fdsdf
277 hgdfgre terrible destroys everything ... tm28
As shown above, Sim does not give the ID of the user who wrote the matching text.
For example, add should match with gsd and vice versa, but my output says that add matches with gwe, which is not true.
The code I am using is the following:
from fuzzywuzzy import fuzz
def sim (nm, df): # this function finds matches between texts based on a threshold, which is 100. The logic is fuzzywuzzy, specifically partial ratio. The output should be the IDs whose texts match, based on the threshold.
    matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]
df['L_Text'] = df['Text'].str.lower()
df['Sim'] = df.apply(lambda row: sim(row['L_Text'], df), axis=1)
df = df.assign(
    Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
)
def tr (row): # this function assigns a similarity score to each text using partial_ratio
    return (df.loc[:row.name-1, 'L_Text']
              .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))
t = (df.loc[1:].apply(tr, axis=1)
       .reindex(index=df.index,
                columns=df.index)
       .fillna(0)
       .add_prefix('txt')
    )
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
Could you please help me understand the error in my code? Unfortunately I cannot see it.
My expected output would be as follows:
ID Text Sim
13 fsad amazing ...
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️...
18 gsd wonderful add
21 dfsfs i love this its incredible ...
23 gwe wonderful end ever seen you ...
... ... ... ...
261 add wonderful gsd
261 add wonderful gsd
261 add wonderful gsd
267 fdsfdgte3e best match ever its a masterpiece
277 hgdfgre terrible destroys everything ...
since the sim function only keeps perfect matches (a score of 100).
Initial assumption
First off, as your question was not a hundred percent clear to me, I assume that you would like a pairwise comparison of all rows, and if the score of a match is 100 you would like to add the key of the matching row. If this is not the case, please correct me.
Syntactic problems
So there are multiple problems with your code above. First, if one just copies and pastes it, it is syntactically not possible to run. The sim() function should read as follows:
def sim (nm, df):
    matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]
notice the df instead of dataset as well as the == instead of the =. I also removed the redundant parentheses for better readability.
Semantic problems
If I then run your code and print t (which does not seem to be the end result), I get the following:
txt0 txt1 txt2 txt3 txt4 txt5 txt6 txt7 txt8 txt9
0 1.0 27.0 12.0 45.0 45.0 12.0 12.0 12.0 27.0 64.0
1 27.0 1.0 33.0 33.0 42.0 33.0 33.0 33.0 52.0 44.0
2 12.0 33.0 1.0 22.0 100.0 100.0 100.0 100.0 22.0 33.0
3 45.0 33.0 22.0 1.0 41.0 22.0 22.0 22.0 40.0 30.0
4 45.0 42.0 100.0 41.0 1.0 100.0 100.0 100.0 35.0 47.0
5 12.0 33.0 100.0 22.0 100.0 1.0 100.0 100.0 22.0 33.0
6 12.0 33.0 100.0 22.0 100.0 100.0 1.0 100.0 22.0 33.0
7 12.0 33.0 100.0 22.0 100.0 100.0 100.0 1.0 22.0 33.0
8 27.0 52.0 22.0 40.0 35.0 22.0 22.0 22.0 1.0 34.0
9 64.0 44.0 33.0 30.0 47.0 33.0 33.0 33.0 34.0 1.0
which seems correct to me, since fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (a partial match already counts as a score of 100).
For consistency reasons you could change
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
to
t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100
as all elements should perfectly match themselves. So when you said
But my output says that add matches with gwe and this is not true.
this would be true in the sense of fuzz.partial_ratio(); if you want whole-string similarity rather than substring matching, you might want to consider using fuzz.ratio() instead. Also, there might be an error when converting t to the new Sim column, but that code is not in the provided example.
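A quick illustration of the difference, using strings from the table above (the exact ratio value may vary slightly between fuzzywuzzy versions):

from fuzzywuzzy import fuzz

# "wonderful" is fully contained in the longer string, so the partial score is 100
print(fuzz.partial_ratio("wonderful end ever seen you", "wonderful"))  # 100

# whole-string similarity penalises the length difference
print(fuzz.ratio("wonderful end ever seen you", "wonderful"))          # around 50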
Alternative implementation
Also, as some comments suggested, sometimes it is helpful to restructure your code so that it is easier for people to help you. Here is an example of how that could look:
import re
import pandas as pd
from fuzzywuzzy import fuzz
data = """
13 fsad amazing ... fsd
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️... fdsfdgte3e
18 gsd wonderful fast
21 dfsfs i love this its incredible ... reds
23 gwe wonderful end ever seen you ... add
261 add wonderful gwe
261 add wonderful gsd
261 add wonderful fdsdf
267 fdsfdgte3e best match ever its a masterpiece fdsdf
277 hgdfgre terrible destroys everything ... tm28
"""
rows = data.strip().split('\n')
records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]
df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
df = df.drop('IncorrectSim', axis=1)
df = df.drop_duplicates(subset=["ID", "Text"]) # Assuming that there is no point in keeping duplicate rows
df = df.set_index('ID') # Assuming that the "ID" column holds a unique ID
comparison_df = df.copy()
comparison_df['Text'] = comparison_df["Text"].str.lower()
comparison_df['Tmp'] = 1
# This gives us all possible row combinations
comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']] # We only want rows that do not match itself
comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
comparison_df = comparison_df[comparison_df['matchScore'] == 100] # only keep perfect matches
comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID') # Cleanup
result = df.join(comparison_df, how='left').fillna('')
print(result.to_string())
gives:
Text Sim
ID
add wonderful gsd
add wonderful gwe
dfsfs i love this its incredible ...
fdsdf best sport everand the gane of the year❤️❤️❤️❤...
fdsfdgte3e best match ever its a masterpiece
fsad amazing ...
gsd wonderful gwe
gsd wonderful add
gwe wonderful end ever seen you ... gsd
gwe wonderful end ever seen you ... add
hgdfgre terrible destroys everything ...

Printing results of a function using a range of numbers

I am just learning Python and I'm simply trying to print the results of a function over a range of numbers, but I am getting the error "The truth value of an array with more than one element is ambiguous."
print(t1) works and shows the range I want to use in the calculations.
print(some_function(55,t1)) produces the error
What am I missing?
Please note, I am doing this to help someone with an assignment, and they can only use commands or functions they have been shown, which is not a lot: basically just what's in the current code and arrays.
Thanks for any help
from pylab import *

def some_function(ff, dd):
    if dd >= 0 and dd <= 300:
        tt = (22/-90)*ff+24
    elif dd >= 300 and dd <= 1000:
        st = (22/-90)*(ff)+24
        gg = (st-2)/-800
        tt = gg*dd+(gg*-1000+2)
    else:
        tt = 2.0
    return tt

t1 = arange(0,12000,1000)
print(t1)
print(some_function(55,t1))
You are only making a minor error.
t1=arange(0,12000,1000)
print(t1)
[ 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000]
You have to loop through t1 and call the function for each value in the loop.
for x in t1:
    print(some_function(55,x))
10.555555555555555
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
We miss part of the function's behaviour because of the coarse step in t1. Let's adjust the range a bit.
t1=arange(0,2000,100)
print(t1)
[ 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300
1400 1500 1600 1700 1800 1900]
And the resultant function:
for x in t1:
    print(some_function(55,x))
10.555555555555555
10.555555555555555
10.555555555555555
10.555555555555555
8.416666666666668
7.347222222222222
6.277777777777779
5.208333333333334
4.138888888888889
3.0694444444444446
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
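For completeness, and going beyond the commands allowed in the assignment: the error itself comes from evaluating dd >= 0 and dd <= 300 on a whole array, which has no single truth value. A sketch of a vectorized version using NumPy boolean masks (not required for the loop-based answer above):

import numpy as np

def some_function_vec(ff, dd):
    dd = np.asarray(dd, dtype=float)
    tt = np.full(dd.shape, 2.0)                # the "else" branch as the default
    st = (22/-90)*ff + 24
    gg = (st - 2)/-800
    low = (dd >= 0) & (dd <= 300)
    mid = (dd > 300) & (dd <= 1000)
    tt[low] = st
    tt[mid] = gg*dd[mid] + (gg*-1000 + 2)
    return tt

t1 = np.arange(0, 2000, 100)
print(some_function_vec(55, t1))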

Python PuLP performance issue - taking too much time to solve

I am using PuLP to create an allocator function which packs items into trucks based on weight and volume. It works fine (takes 10-15 seconds) for 10-15 items, but when I double the number of items it takes more than half an hour to solve.
# imports added so the snippet is self-contained
import pulp
from pulp import LpProblem, LpMinimize, LpInteger, lpSum

def allocator(item_mass, item_vol, truck_mass, truck_vol, truck_cost, id_series):
    n_items = len(item_vol)
    set_items = range(n_items)
    n_trucks = len(truck_cost)
    set_trucks = range(n_trucks)
    print("working1")
    y = pulp.LpVariable.dicts('truckUsed', set_trucks,
                              lowBound=0, upBound=1, cat=LpInteger)
    x = pulp.LpVariable.dicts('itemInTruck', (set_items, set_trucks),
                              lowBound=0, upBound=1, cat=LpInteger)
    print("working2")
    # Model formulation
    prob = LpProblem("Truck allocation problem", LpMinimize)
    # Objective
    prob += lpSum([truck_cost[i] * y[i] for i in set_trucks])
    print("working3")
    # Constraints
    for j in set_items:
        # Every item must be taken in one truck
        prob += lpSum([x[j][i] for i in set_trucks]) == 1
    for i in set_trucks:
        # Respect the mass constraint of trucks
        prob += lpSum([item_mass[j] * x[j][i] for j in set_items]) <= truck_mass[i]*y[i]
        # Respect the volume constraint of trucks
        prob += lpSum([item_vol[j] * x[j][i] for j in set_items]) <= truck_vol[i]*y[i]
    print("working4")
    # Ensure y variables have to be set to make use of x variables:
    for j in set_items:
        for i in set_trucks:
            x[j][i] <= y[i]
    print("working5")
    s = id_series  # id_series
    prob.solve()
    print("working6")
This is the data I am running it on:
items:
Name Pid Quantity Length Width Height Volume Weight t_type
0 A 1 1 4.60 4.30 4.3 85.05 1500 Open
1 B 2 1 4.60 4.30 4.3 85.05 1500 Open
2 C 3 1 6.00 5.60 9.0 302.40 10000 Container
3 D 4 1 8.75 5.60 6.6 441.00 1000 Open
4 E 5 1 6.00 5.16 6.6 204.33 3800 Open
5 C 6 1 6.00 5.60 9.0 302.40 10000 All
6 C 7 1 6.00 5.60 9.0 302.40 10000 Container
7 D 8 1 8.75 5.60 6.6 441.00 6000 Open
8 E 9 1 6.00 5.16 6.6 204.33 3800 Open
9 C 10 1 6.00 5.60 9.0 302.40 10000 All
.... times 5
trucks (these are just the top 5 rows; I have 54 types of trucks in total):
Category Name TruckID Length(ft) Breadth(ft) Height(ft) Volume \
0 LCV Tempo 407 0 9.5 5.5 5.5 287.375
1 LCV Tempo 407 1 9.5 5.5 5.5 287.375
2 LCV Tempo 407 2 9.5 5.5 5.5 287.375
3 LCV 13 Feet 3 13.0 5.5 7.0 500.500
4 LCV 14 Feet 4 14.0 6.0 6.0 504.000
Weight Price
0 1500 1
1 2000 1
2 2500 2
3 3500 3
4 4000 3
where ItemId is this:
data["ItemId"] = data.index + 1
id_series = data["ItemId"].tolist()
PuLP can handle multiple solvers. See which ones you have installed with:
pulp.pulpTestAll()
This will give a list like:
Solver pulp.solvers.PULP_CBC_CMD unavailable.
Solver pulp.solvers.CPLEX_DLL unavailable.
Solver pulp.solvers.CPLEX_CMD unavailable.
Solver pulp.solvers.CPLEX_PY unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.COIN_CMD passed.
Solver pulp.solvers.COINMP_DLL unavailable.
Testing zero subtraction
Testing continuous LP solution
Testing maximize continuous LP solution
...
* Solver pulp.solvers.GLPK_CMD passed.
Solver pulp.solvers.XPRESS unavailable.
Solver pulp.solvers.GUROBI unavailable.
Solver pulp.solvers.GUROBI_CMD unavailable.
Solver pulp.solvers.PYGLPK unavailable.
Solver pulp.solvers.YAPOSIB unavailable.
You can then solve using, e.g.:
lp_prob.solve(pulp.COIN_CMD())
Gurobi and CPLEX are commercial solvers that tend to work quite well. Perhaps you could access them? Gurobi has a good academic license.
Alternatively, you may wish to look into an approximate solution, depending on your quality constraints.
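If you stay with the free CBC solver, another practical option is to accept a near-optimal solution by giving CBC a time limit or a relative optimality gap. The keyword names below (maxSeconds, fracGap) are those of older PuLP releases like the one shown above; PuLP 2.x calls them timeLimit and gapRel, so adjust for your version:

# stop after 5 minutes, or as soon as the incumbent is within 1% of optimal
prob.solve(pulp.PULP_CBC_CMD(maxSeconds=300, fracGap=0.01))
print(pulp.LpStatus[prob.status])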

How to groupby, cut, transpose then merge result of one pandas Dataframe using vectorisation

Here is an example of the data we want to process:
# imports added so the snippet is self-contained
import time
import numpy as np
import pandas as pd

df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this :
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat_id, cuts with enumerate, transposes, then merges the results into one pandas DataFrame:
start_time = time.time()
N = 10
col_names = map(lambda x: 'X'+str(x), range(N))
compil = pd.DataFrame(columns=col_names)
i = 0
# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 5 lines
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number-N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1
print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My questions are:
How can I do something every "x number of lines" and then merge the results?
Is there a way to vectorize that kind of operation?
Here is a solution that improves calculation time by 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide the groups into small chunks.
A final apply then creates the new line. We can probably still improve it.
df_size = 1000000
df_random = pd.DataFrame({'boat_id' : np.random.choice(range(300), df_size),
                          'X' : np.random.random_integers(0, 1000, df_size),
                          'target_Y' : np.random.random_integers(0, 10, df_size)})

start_time = time.time()
len_of_chunks = 10
col_names = map(lambda x: 'X'+str(x), range(N)) + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])

# we group by ID (boat)
# we divide in chunks of length "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))

# we reset the index
# we take the '0' column containing the valuable info
# we put the info in a new 'compil' dataframe
# we drop incomplete lines (generated by chunks shorter than len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()

print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
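If you want to squeeze out more, most of the remaining time is spent in the row-level apply. A sketch of a mostly NumPy-based variant of the same non-overlapping chunk logic, which only loops over the ~300 boat_ids in Python (written against the older pandas used here, hence .values rather than .to_numpy()):

import numpy as np
import pandas as pd

N = 10
parts = []
for boat_id, g in df_random.groupby('boat_id', sort=False):
    x = g['X'].values
    y = g['target_Y'].values
    n_full = len(x) // N                       # drop the incomplete trailing chunk
    if n_full == 0:
        continue
    block = x[:n_full * N].reshape(n_full, N)  # one row per chunk of N consecutive X values
    part = pd.DataFrame(block, columns=['X%d' % i for i in range(N)])
    part['target_Y'] = y[N - 1:n_full * N:N]   # last target_Y of each chunk
    part['boat_id'] = boat_id
    parts.append(part)

compil = pd.concat(parts, ignore_index=True)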
