I have the following code that I'm trying to run line by line in a Jupyter notebook, but the kernel keeps dying as soon as it reaches the line where the pandas DataFrame is converted to NumPy.
#importing libraries
import sqlalchemy
import spacy
import numpy as np
import pandas as pd
#connecting to database and reading into dataframe with sqlalchemy
user_inputs = "SELECT * FROM t1"
rasa_questions = "SELECT * FROM o2"
server = 'DEM'
db = 's'
engine = sqlalchemy.create_engine('mssql+pyodbc://' + server + '/' + db + '?driver=SQL+Server')
user_inputs_df = pd.read_sql_query(user_inputs, engine)
rasa_questions_df = pd.read_sql_query(rasa_questions, engine)
#loading spacy
nlp = spacy.load("de_core_news_lg")
rasa_questions_list = rasa_questions_df["F"]
user_input_list = user_inputs_df["U"]
rasa_vector = [nlp(s).vector for s in rasa_questions_list]
user_vector = [nlp(s).vector for s in user_input_list]
similarity_scores = np.inner(rasa_vector, user_vector) / (np.linalg.norm(rasa_vector, axis=1) * np.linalg.norm(user_vector, axis=1))
data = []
for i in range(len(rasa_questions_list)):
    for j in range(len(user_input_list)):
        data.append([rasa_questions_list[i], user_input_list[j], similarity_scores[i][j]])
O2_Similarity_Scores = pd.DataFrame(data, columns=['RASA Frage', 'User Input', 'Similarity Score'])
print(O2_Similarity_Scores)
So, this is the line of code that makes the kernel die -
similarity_scores = np.inner(rasa_vector, user_vector) / (np.linalg.norm(rasa_vector, axis=1) * np.linalg.norm(user_vector, axis=1))
I am on Windows 10 and Python 3.9.12.
What am I doing wrong?
Expanding your comment to be readable:
dot_similarity_scores = np.matmul(rasa_vector, user_vector)
rasa_vector_norm = np.linalg.norm(rasa_vector, axis=0)
user_vector_norm = np.linalg.norm(user_vector, axis=1)
run without error. Next,
rasa_vector_norm_2d = np.reshape(rasa_vector_norm, (300, 1))
user_vector_norm_2d = np.reshape(user_vector_norm, (1, 300))
are OK too. But trying to compute
norm_product = rasa_vector_norm_2d * user_vector_norm_2d
I get an error:
ValueError: operands could not be broadcast together with shapes (300,211537) (12234,300)
Those last shapes would be OK for matmul with the operands switched, but they are wrong for element-wise multiplication. With the previous reshape, though, those arrays should be (300,1) and (1,300); with those shapes, '*' would broadcast to a (300,300) result.
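For reference, here is a minimal sketch of the whole cosine-similarity computation with consistent shapes. It assumes both vector lists have already been stacked into 2-D float arrays (the variable names follow the question):

import numpy as np

rasa_matrix = np.array(rasa_vector)   # shape (n_questions, 300)
user_matrix = np.array(user_vector)   # shape (n_inputs, 300)

# dot product between every pair of rows: (n_questions, n_inputs)
dots = rasa_matrix @ user_matrix.T

# L2 norms along the feature axis, reshaped so they broadcast against dots
rasa_norms = np.linalg.norm(rasa_matrix, axis=1)[:, np.newaxis]  # (n_questions, 1)
user_norms = np.linalg.norm(user_matrix, axis=1)[np.newaxis, :]  # (1, n_inputs)

# (n_questions, 1) * (1, n_inputs) broadcasts to (n_questions, n_inputs)
similarity_scores = dots / (rasa_norms * user_norms)

Note that with the sizes in the question (roughly 12234 x 211537), the result alone is on the order of 10 GB in float32, which is enough to kill a kernel by itself.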
Related
I'm trying to compute similarity scores with the German spaCy model. But my inputs are too big and I'm running out of memory at this point -
data = []
for i in range(len(rasa_questions_list)):
    for j in range(len(user_input_list)):
        data.append([rasa_questions_list[i], user_input_list[j], similarity_scores[i][j]])
The full code is below -
#importing libraries
import sqlalchemy
import spacy
import numpy as np
import pandas as pd
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
#connecting to database and reading into dataframe with sqlalchemy
user_inputs = "SELECT * FROM o2"
rasa_questions = "SELECT * FROM o1"
server = 'DEM'
db = 's'
engine = sqlalchemy.create_engine('mssql+pyodbc://' + server + '/' + db + '?driver=SQL+Server')
user_inputs_df = pd.read_sql_query(user_inputs, engine)
rasa_questions_df = pd.read_sql_query(rasa_questions, engine)
#loading spacy
nlp = spacy.load("de_core_news_lg")
#declare and vectorise input arrays
rasa_questions_list = rasa_questions_df["Frage"]
user_input_list = user_inputs_df["User_Input"]
rasa_vector = [nlp(s).vector for s in rasa_questions_list]
user_vector = [nlp(s).vector for s in user_input_list]
#Convert to NumPy Arrays
rasa_vector = np.array(rasa_vector)
user_vector = np.array(user_vector)
#Transpose in order to use the matmul function
user_vector = np.array(user_vector).T
dot_similarity_scores = np.matmul(rasa_vector, user_vector)
#normalise matrices
rasa_vector_norm = np.linalg.norm(rasa_vector, axis=0)
user_vector_norm = np.linalg.norm(user_vector, axis=1)
norm_product = np.multiply(rasa_vector_norm, user_vector_norm)
#broadcasting to match dot_similarity_scores shape
repetitions = (dot_similarity_scores.shape[0] + norm_product.shape[0] - 1) // norm_product.shape[0]
norm_product = np.repeat(norm_product, repetitions, axis=0)
truncate = dot_similarity_scores.shape[0]
norm_product = norm_product[:truncate]
norm_product = np.reshape(norm_product, (12234, 1))
similarity_scores = dot_similarity_scores / norm_product
data = []
for i in range(len(rasa_questions_list)):
    for j in range(len(user_input_list)):
        data.append([rasa_questions_list[i], user_input_list[j], similarity_scores[i][j]])
O2_Similarity_Scores = pd.DataFrame(data, columns=['RASA Frage', 'User Input', 'Similarity Score'])
print(O2_Similarity_Scores)
How can I optimise this so that I can actually see the results? The first dataframe has 12234 rows and the second has 215603. Thanks.
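One common way to keep memory bounded (a sketch of the general approach, not the exact code above) is to process the user inputs in chunks and reduce each chunk as you go, instead of materialising the full 12234 x 215603 score matrix plus a nested Python list of every pair. Assuming rasa_matrix and user_matrix are the stacked 2-D arrays and the two lists are the pandas Series from the question, something like this keeps only the best-matching question per user input:

import numpy as np
import pandas as pd

chunk_size = 10000  # tune to the available memory
rasa_norms = np.linalg.norm(rasa_matrix, axis=1)[:, np.newaxis]

frames = []
for start in range(0, user_matrix.shape[0], chunk_size):
    chunk = user_matrix[start:start + chunk_size]
    chunk_norms = np.linalg.norm(chunk, axis=1)[np.newaxis, :]
    scores = (rasa_matrix @ chunk.T) / (rasa_norms * chunk_norms)
    best = scores.argmax(axis=0)  # best question index per user input
    frames.append(pd.DataFrame({
        'RASA Frage': rasa_questions_list.iloc[best].values,
        'User Input': user_input_list.iloc[start:start + chunk_size].values,
        'Similarity Score': scores[best, np.arange(len(best))],
    }))

O2_Similarity_Scores = pd.concat(frames, ignore_index=True)

Storing all 12234 x 215603 pairs (about 2.6 billion rows) in a DataFrame is not feasible on a desktop machine regardless of how the scores are computed, so some reduction like this is needed.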
My code runs without errors, but it does not produce the output it should. I am not sure where the issue occurs. Could someone help me correct it? Do you need the CSV too?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("/content/drive/MyDrive/replicates/Replicate 3 Gilts just measures.csv")
df.info()
df.head()
# removing the irrelevant columns
cols_to_drop = ["animal"]
df = df.drop(columns=cols_to_drop,axis=1)
# first five rows of data frame after removing columns
df.head()
deep_df = df.copy(deep = True)
numerical_columns = [col for col in df.columns if (df[col].dtype=='int64' or
df[col].dtype=='float64')]
df[numerical_columns].describe().loc[['min','max', 'mean','50%'],:]
df[df['i1000.0'] == df['i1000.0'].min()]
This is where the issue occurs -
i1000_bucket = df.groupby(pd.cut(df["i1000.0"],bins=[10,20,30,40,50,60,70,80,90,100]))
number_bucket = df.groupby(pd.cut(df["i1000.0"],bins=[10,20,30,40,50,60,70,80,90,100]))
i1000_bucket = ((i1000_bucket.sum()["i1000.0"] / i1000_bucket.size())*100 , 2)
number_bucket = round((number_bucket.sum()["i1000.0"] / number_bucket.size())*100 , 2)
The graph window appears, but nothing actually plots -
x = [str(i)+"-"+str(i+10) for i in range(10,91,10)]
plt.plot(x,number_bucket.values)
plt.xlabel("i1000.0")
plt.ylabel("p1000.0")
plt.title("1000.0 comparisons")
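For comparison, here is a self-contained sketch of the same bucket-percentage computation on made-up data (the real CSV isn't posted). One thing to watch for: any bin with no rows in it produces NaN after the division, and NaN values leave gaps in the line plot.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# made-up stand-in for the i1000.0 column
df = pd.DataFrame({"i1000.0": np.random.uniform(10, 100, 500)})

bucket = df.groupby(pd.cut(df["i1000.0"], bins=[10,20,30,40,50,60,70,80,90,100]))
number_bucket = round((bucket.sum()["i1000.0"] / bucket.size()) * 100, 2)

x = [str(i) + "-" + str(i + 10) for i in range(10, 91, 10)]
plt.plot(x, number_bucket.values)
plt.show()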
I was just looking at https://en.wikipedia.org/wiki/Chi-squared_test and wanted to recreate the example "Example chi-squared test for categorical data".
I feel the approach I've taken has room for improvement, so I was wondering how that might be done.
Here's the code:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
import io
import itertools

import numpy as np
import pandas as pd

observed_workers = pd.read_csv(io.StringIO(csv), index_col=0)
col_sums = observed_workers.apply(sum)
row_sums = observed_workers.apply(sum, axis=1)
l = list(x[1] * (x[0] / col_sums.sum()) for x in itertools.product(row_sums, col_sums))
expected_workers = pd.DataFrame(
np.array(l).reshape((3, 4)),
columns=observed_workers.columns,
index=observed_workers.index,
)
chi_squared_stat = (
((observed_workers - expected_workers) ** 2).div(expected_workers).sum().sum()
)
This returns the correct value, but it probably misses a nicer approach using particular numpy / pandas methods.
With numpy/scipy:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
import io
from numpy import genfromtxt, outer
from scipy.stats.contingency import margins
observed = genfromtxt(io.StringIO(csv), delimiter=',', skip_header=True, usecols=range(1, 5))
row_sums, col_sums = margins(observed)
expected = outer(row_sums, col_sums) / observed.sum()
chi_squared_stat = ((observed - expected)**2 / expected).sum()
print(chi_squared_stat)
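For completeness: scipy can also run the whole test in one call with scipy.stats.chi2_contingency, which returns the statistic, the p-value, the degrees of freedom, and the expected table:

from scipy.stats import chi2_contingency

stat, p, dof, expected = chi2_contingency(observed)
print(stat)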
With pandas:
import io
import pandas as pd
csv = """\
work_group,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
df = pd.read_csv(io.StringIO(csv))
df_melt = df.melt(id_vars ='work_group', var_name='group', value_name='observed')
df_melt['col_sum'] = df_melt.groupby('group')['observed'].transform('sum')
df_melt['row_sum'] = df_melt.groupby('work_group')['observed'].transform('sum')
total = df_melt['observed'].sum()
df_melt['expected'] = df_melt.apply(lambda row: row['col_sum']*row['row_sum']/total, axis=1)
chi_squared_stat = df_melt.apply(lambda row: ((row['observed'] - row['expected'])**2) / row['expected'], axis=1).sum()
print(chi_squared_stat)
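The two apply(..., axis=1) calls can also be written as plain column arithmetic, which avoids the per-row Python overhead on larger tables:

df_melt['expected'] = df_melt['col_sum'] * df_melt['row_sum'] / total
chi_squared_stat = (((df_melt['observed'] - df_melt['expected']) ** 2) / df_melt['expected']).sum()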
I am using an Anaconda environment on Windows, with pycaret installed, and PyCharm.
I want to run a basic toy example with pycaret (not using the freely available datasets),
as simple as y = mx + c, where x is 1-d.
Here is my working code with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
x= np.arange(0,1000,dtype = 'float64')
Y = (x*2) + 1
X = x.reshape(-1,1)
reg = LinearRegression().fit(X, Y)
# predicting on the same data gives a perfect score
score = reg.score(X,Y)
print('1- RSS/TSS: 1 for perfect regression=' + str(score))
print('coef =' + str(reg.coef_[0])) # slope
print('intercept =' + str(reg.intercept_)) # intercept
This gives the expected results (score 1.0, coef 2.0, intercept 1.0).
Now, I create a DataFrame that I can pass to the pycaret package.
data1 = np.vstack((x,Y)).transpose()
# build the array that will become the pandas DataFrame
N= data1.shape[0]
# start with a header row
dat2 = np.array(['','Col1','Col2'])
for i in range(N):
dat_row = list(data1[i,:].flatten())
nm = ['row'+ str(i)]
dat_row = nm + dat_row
dat2 = np.vstack ((dat2, dat_row) )
df= pd.DataFrame(data=dat2[1:,1:],
index=dat2[1:,0],
columns=dat2[0,1:])
print(df)
print('***************************')
columns = df.applymap(np.isreal).all()
print(columns)
print('***************************')
# now, using Pycaret
from pycaret.regression import *
exp_reg = setup(df, html=False, target='Col2')
print('********************************')
compare_models()
When I do so, the numeric columns I created (x, y) are shown as categorical, and pycaret also recognises them as categorical. See the figure below.
Why are they categorical? Can I change them to be treated as numeric?
Once I press enter, pycaret finally gives me the error below:
Any ideas about this error?
You can force the data types in PyCaret by using the numeric_features and categorical_features params within the setup function.
For example:
clf1 = setup(data, target = 'target', numeric_features = ['X1', 'X2'])
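Applied to the DataFrame from the question, that would look like the line below. (Note that the df built above actually holds strings, because the header row was stacked into the same NumPy array as the data; casting the columns first, e.g. df = df.astype({'Col1': float, 'Col2': float}), is an alternative fix.)

from pycaret.regression import setup

exp_reg = setup(df, html=False, target='Col2', numeric_features=['Col1'])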
I'm running Python 2.7.9. I have two numpy arrays (100000 x 142 and 100000 x 20) that I want to concatenate into one 100000 x 162 array.
The following is the code I'm running:
import numpy as np
import pandas as pd
def ratingtrueup():
    actones = np.ones((100000, 20), dtype='f8', order='C')
    actualhhdata = np.array(pd.read_csv('C:/Users/Desktop/2015actualhhrating.csv', index_col=None, header=None, sep=','))
    projectedhhdata = np.array(pd.read_csv('C:/Users/Desktop/2015projectedhhrating.csv', index_col=None, header=None, sep=','))
    adjfctr = round(1 + ((actualhhdata.mean() - projectedhhdata.mean()) / projectedhhdata.mean()), 5)
    projectedhhdata = (adjfctr * projectedhhdata)
    actualhhdata = (actones * actualhhdata)
    end = np.concatenate((actualhhdata.T, projectedhhdata[:, 20:]), axis=1)

ratingtrueup()
I get the following value error:
File "C:/Users/PycharmProjects/TestProjects/M.py", line 16, in ratingtrueup
    end = np.concatenate([actualhhdata.T, projectedhhdata[:, 20:]], axis=1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
I've confirmed that both arrays are numpy.ndarray.
Is there a way I can check the dimensions of the input arrays to see where I'm going wrong?
Thank you in advance.
I would add a (temporary) print line right before the concatenate:
actualhhdata = (actones * actualhhdata)
print(actualhhdata.T.shape, projectedhhdata[:,20:].shape)
end = np.concatenate((actualhhdata.T, projectedhhdata[:, 20:]), axis=1)
For more of a production context, you might want to add some sort of test, e.g.
x, y = np.ones((100, 20)), np.zeros((100, 10))
assert x.shape[0] == y.shape[0], (x.shape, y.shape)
np.concatenate([x, y], axis=1).shape