I'm using the linearmodels package to estimate a PanelOLS model. As an example:
import numpy as np
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data.year = data.year.astype(np.int64)
# MultiIndex, entity - time
data = data.set_index(['firm','year'])
from linearmodels import PanelOLS
mod = PanelOLS(data.invest, data[['value','capital']], entity_effects=True)
res = mod.fit(cov_type='clustered', cluster_entity=True)
I want to export the regression output to a .tex file. Is there a convenient way of formatting the output with significance stars and without the other information, like the CIs? The question has been asked here in the context of a standard OLS, but that approach does not work for a 'PanelEffectsResults' object, since I get the following error:
'PanelEffectsResults' object has no attribute 'bse'
Thanks in advance.
A bit late, but here is what I use. In this example I have calculated two fixed-effects regressions, with their results stored in fe_res_VS and fe_res_CVS:
import numpy as np
import pandas as pd

pd.set_option('display.precision', 4)
pd.options.display.float_format = '{:,.4f}'.format
Reg_Output_FAmount= pd.DataFrame()
#1)
Table1 = pd.DataFrame(fe_res_VS.params)
Table1['id'] = np.arange(len(Table1))#create numerical index for pd.DataFrame
Table1 = Table1.reset_index().set_index(keys = 'id')#set numercial index as new index
Table1 = Table1.rename(columns={"index":"parameter", "parameter":"coefficient 1"})
P1 = pd.DataFrame(fe_res_VS.pvalues)
P1['id'] = np.arange(len(P1))#create numerical index for pd.DataFrame
P1 = P1.reset_index().set_index(keys = 'id')#set numercial index as new index
P1 = P1.rename(columns={"index":"parameter"})
Table1 = pd.merge(Table1, P1, on='parameter')
Table1['significance 1'] = np.where(Table1['pvalue'] <= 0.01, '***',\
np.where(Table1['pvalue'] <= 0.05, '**',\
np.where(Table1['pvalue'] <= 0.1, '*', '')))
Table1.rename(columns={"pvalue": "pvalue 1"}, inplace=True)
SE1 = pd.DataFrame(fe_res_VS.std_errors)
SE1['id'] = np.arange(len(SE1))#create numerical index for pd.DataFrame
SE1 = SE1.reset_index().set_index(keys = 'id')#set numercial index as new index
SE1 = SE1.rename(columns={"index":"parameter", "std_error":"coefficient 1"})
SE1['parameter'] = SE1['parameter'].astype(str) + '_SE'
SE1['significance 1'] = ''
SE1 = SE1.round(4)
SE1['coefficient 1'] = '(' + SE1['coefficient 1'].astype(str) + ')'
Table1 = pd.concat([Table1, SE1])  # DataFrame.append() was removed in pandas 2.0
Table1 = Table1.sort_values('parameter')
Table1.replace(np.nan,'', inplace=True)
del P1
del SE1
#2)
Table2 = pd.DataFrame(fe_res_CVS.params)
Table2['id'] = np.arange(len(Table2))#create numerical index for pd.DataFrame
Table2 = Table2.reset_index().set_index(keys = 'id')#set numercial index as new index
Table2 = Table2.rename(columns={"index":"parameter", "parameter":"coefficient 2"})
P2 = pd.DataFrame(fe_res_CVS.pvalues)
P2['id'] = np.arange(len(P2))#create numerical index for pd.DataFrame
P2 = P2.reset_index().set_index(keys = 'id')#set numercial index as new index
P2 = P2.rename(columns={"index":"parameter"})
Table2 = pd.merge(Table2, P2, on='parameter')
Table2['significance 2'] = np.where(Table2['pvalue'] <= 0.01, '***',\
np.where(Table2['pvalue'] <= 0.05, '**',\
np.where(Table2['pvalue'] <= 0.1, '*', '')))
Table2.rename(columns={"pvalue": "pvalue 2"}, inplace=True)
SE2 = pd.DataFrame(fe_res_CVS.std_errors)
SE2['id'] = np.arange(len(SE2))#create numerical index for pd.DataFrame
SE2 = SE2.reset_index().set_index(keys = 'id')#set numercial index as new index
SE2 = SE2.rename(columns={"index":"parameter", "std_error":"coefficient 2"})
SE2['parameter'] = SE2['parameter'].astype(str) + '_SE'
SE2['significance 2'] = ''
SE2 = SE2.round(4)
SE2['coefficient 2'] = '(' + SE2['coefficient 2'].astype(str) + ')'
Table2 = pd.concat([Table2, SE2])
Table2 = Table2.sort_values('parameter')
Table2.replace(np.nan,'', inplace=True)
del P2
del SE2
#Merging Tables and adding Stats
Reg_Output_FAmount= pd.merge(Table1, Table2, on='parameter', how='outer')
Reg_Output_FAmount = pd.concat([Reg_Output_FAmount, pd.DataFrame(np.array([["observ.", fe_res_VS.nobs, '', fe_res_CVS.nobs, '']]), columns=['parameter', 'pvalue 1', 'significance 1', 'pvalue 2', 'significance 2'])], ignore_index=True)
Reg_Output_FAmount = pd.concat([Reg_Output_FAmount, pd.DataFrame(np.array([["Rsquared", "{:.4f}".format(fe_res_VS.rsquared), '', "{:.4f}".format(fe_res_CVS.rsquared), '']]), columns=['parameter', 'pvalue 1', 'significance 1', 'pvalue 2', 'significance 2'])], ignore_index=True)
Reg_Output_FAmount = pd.concat([Reg_Output_FAmount, pd.DataFrame(np.array([["Model type", fe_res_VS.name, '', fe_res_CVS.name, '']]), columns=['parameter', 'pvalue 1', 'significance 1', 'pvalue 2', 'significance 2'])], ignore_index=True)
Reg_Output_FAmount = pd.concat([Reg_Output_FAmount, pd.DataFrame(np.array([["DV", fe_res_VS.model.dependent.vars[0], '', fe_res_CVS.model.dependent.vars[0], '']]), columns=['parameter', 'pvalue 1', 'significance 1', 'pvalue 2', 'significance 2'])], ignore_index=True)
Reg_Output_FAmount.fillna('', inplace=True)
resulting in a nice regression output that looks like this:
parameter coefficient 1 pvalue 1 significance 1 coefficient 2 pvalue 2 significance 2
0 IV 0.0676 0.2269 0.0732 0.1835
1 IV_SE (0.0559) (0.055)
2 Control 0.3406 0.0125 ** 0.3482 0.0118 **
3 Control_SE (0.1363) (0.1383)
4 const 0.2772 0.0000 *** 0.2769 0.0000 ***
5 const_SE (0.012) (0.012)
6 observ. 99003 99003
7 Rsquared 0.12 0.14
8 Model type PanelOLS PanelOLS
9 DV FAmount FAmount
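Since the original question asked for a .tex file, the finished frame can then be written out with pandas directly; a minimal follow-up sketch (the file name is arbitrary):
# to_latex() returns a LaTeX tabular as a string; escape=True makes sure the
# underscores in parameter names like 'const_SE' come out as \_
with open('Reg_Output_FAmount.tex', 'w') as f:
    f.write(Reg_Output_FAmount.to_latex(index=False, escape=True))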
I had been struggling with the same problem for a few days, and I'm very excited to share an easy way to do it: include the significance stars and drop the CIs.
Here it is:
Step 1: install the linearmodels package.
Step 2: import the compare function from linearmodels.panel:
from linearmodels.panel import compare
Step 3: use the compare function and specify the arguments as you want. For instance, specifying stars=True will give you significance stars. Very convenient!
compare({'model_A_name': results_of_model_A, 'model_B_name': results_of_model_B}, stars=True)
This small function saved my life! Enjoy it.
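To also get the .tex file the original question asked for, the comparison's summary can usually be written out directly. A hedged sketch, borrowing the two fixed-effects results from the answer above as stand-ins (the as_latex() method comes from the statsmodels-style Summary object that linearmodels builds, so check that your installed version provides it):
from linearmodels.panel import compare

comparison = compare({'FE (VS)': fe_res_VS, 'FE (CVS)': fe_res_CVS}, stars=True)

# write the LaTeX rendering of the comparison table to disk
with open('panel_comparison.tex', 'w') as f:
    f.write(comparison.summary.as_latex())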
One more thing: please note that the stars are based on the coefficient's p-value, where 1, 2 and 3 stars correspond to p-values of 10%, 5% and 1%, respectively. I am not sure whether there is a way to customise the thresholds, e.g. so that 1, 2 and 3 stars correspond to p-values of 5%, 1% and 0.1%.
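If you do want different cutoffs, a small sketch that builds the stars yourself from the p-values (attribute names as used in the answer above; the 5%/1%/0.1% thresholds are just an illustration):
import numpy as np
import pandas as pd

def custom_stars(res, cuts=(0.05, 0.01, 0.001)):
    # one star per threshold crossed, e.g. p < 0.001 -> '***'
    counts = sum(np.where(res.pvalues < c, 1, 0) for c in cuts)
    return pd.Series(['*' * int(n) for n in counts], index=res.pvalues.index)

table = pd.DataFrame({'coefficient': fe_res_VS.params.round(4),
                      'stars': custom_stars(fe_res_VS)})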
The credit goes to the fantastic package developer and maintainer. Thank you all! Please see the file and get more information at:
~/opt/anaconda3/lib/python3.7/site-packages/linearmodels/panel/results.py
I have a dataframe of patients and their gene expressions. It has this format:
Patient_ID | gene1 | gene2 | ... | gene10000
p1 0.142 0.233 ... bla
p2 0.243 0.243 ... -0.364
...
p4000 1.423 bla ... -1.222
As you can see, the dataframe contains noise: some cells hold values other than floats.
I want to remove every row that has any column with a non-numeric value.
I've managed to do this using apply and pd.to_numeric like this:
cols = df.columns[1:]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
The problem is that it's taking forever to run, and I need a better and more efficient way of achieving this.
EDIT: To reproduce something like my data:
arr = np.random.random_sample((3000,10000))
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(10000)])
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
df['gene0'][2] = 'bla'
df['gene9998'][4] = 'bla'
It turns out it is worth trying numpy :)
I got a 30-60x faster version (the bigger the array, the larger the improvement).
Convert to numpy array (.values)
Iterate through all rows
Try to convert each row to a row of floats
If it fails (a non-numeric value is present), note this in a boolean array
Filter the dataframe based on that boolean array
Code:
import pandas as pd
import numpy as np
from line_profiler_pycharm import profile
def op_version(df):
    cols = df.columns[1:]
    df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
    return df.dropna()

def np_version(df):
    keep = np.full(len(df), True)
    for idx, row in enumerate(df.values[:, 1:]):
        try:
            row.astype(float)  # np.float was removed from recent numpy versions
        except ValueError:
            keep[idx] = False
            # maybe it's better to store a to_remove list instead, depends on the data
    return df[keep]

@profile
def main():
    arr = np.random.random_sample((3000, 5000))
    df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(5000)])
    df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)],
                                 columns=['Patient_ID']), df], axis=1)
    df['gene0'][2] = 'bla'
    df['gene998'][4] = 'bla'
    df2 = df.copy()
    df = op_version(df)
    df2 = np_version(df2)
Note that I decreased the number of columns so it is more feasible for testing.
Also, I fixed a small bug in your example; instead of:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
I think it should be:
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)], columns=['Patient_ID']),df],axis = 1)
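For completeness, a fully vectorised variant that avoids the Python-level loop altogether; this is only a sketch and I have not benchmarked it against the two versions above:
import numpy as np
import pandas as pd

def vectorized_version(df):
    # coerce everything except Patient_ID to float in one shot; strings such as
    # 'bla' become NaN, and any row containing a NaN is then dropped
    values = pd.to_numeric(df.iloc[:, 1:].to_numpy().ravel(), errors='coerce')
    keep = ~np.isnan(values.reshape(len(df), -1)).any(axis=1)
    return df[keep]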
I'm having an issue where the calculated results in the "Price Tier Index" column are showing up as "object" and not as a number (see code and screenshot below).
import pandas as pd
import numpy as np
import csv
import io
from IPython.display import display
from google.colab import files
uploaded = files.upload()
#this is a variable to be used in the new Price Tier Index column
men_bw_avg_eq_price = 0.28
df = pd.read_csv(io.BytesIO(uploaded["Men BW - competitive price tier analysis - REPULLED.csv"]))
df= df.fillna(0)
price_index = []
total_rows = len(df['Products'])
for i in range(total_rows):
    pi_value = (df.loc[i, ['Avg EQ Price']]/men_bw_avg_eq_price) * 100
    price_index.append(pi_value)
df['Price Tier Index'] = price_index
df[['Brand', 'Gender', 'Subcategory']] = df.Products.str.split("|", expand=True)
df_new = df[['Brand', 'Price Tier Index']]
display(df_new)
[screenshot of the current issue]
[screenshot of what I'm looking to do]
Thank you in advance for your help!
The whole problem is that you create a list of one-element Series instead of a list of plain values.
You need .values[0] or .to_list()[0] to pull the first value out of each of them:
first_value = df.loc[i, ['Avg EQ Price']].values[0]
pi_value = (first_value/men_bw_avg_eq_price) * 100
first_value = df.loc[i, ['Avg EQ Price']].to_list()[0]
pi_value = (first_value/men_bw_avg_eq_price) * 100
But you can calculate it without a for-loop:
df['Price Tier Index'] = (df['Avg EQ Price']/men_bw_avg_eq_price) * 100
Minimal working example with some data
import pandas as pd
#this is a variable to be used in the new Price Tier Index column
men_bw_avg_eq_price = 0.28
data = {
'Products': ['A|B|C','D|E|F','G|H|I'],
'Avg EQ Price': [4,5,6],
}
df = pd.DataFrame(data)
#df = pd.read_csv(io.BytesIO(uploaded["Men BW - competitive price tier analysis - REPULLED.csv"]))
df = df.fillna(0)
#price_index = []
#total_rows = len(df['Products'])
#for i in range(total_rows):
# #first_value = df.loc[i, ['Avg EQ Price']].values[0]
# first_value = df.loc[i, ['Avg EQ Price']].to_list()[0]
# pi_value = (first_value/men_bw_avg_eq_price) * 100
# price_index.append(pi_value)
price_index = (df['Avg EQ Price']/men_bw_avg_eq_price) * 100
df['Price Tier Index'] = price_index
df[['Brand', 'Gender', 'Subcategory']] = df.Products.str.split("|", expand=True)
df_new = df[['Brand', 'Price Tier Index']]
print(df_new)
Result:
Brand Price Tier Index
0 A 1428.571429
1 D 1785.714286
2 G 2142.857143
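As a quick check back on the original dtype issue, the vectorized division gives a proper numeric column; a small optional follow-up (the cast rounds the values):
# should print float64 rather than object
print(df['Price Tier Index'].dtype)

# only if a true integer column is really wanted
df['Price Tier Index'] = df['Price Tier Index'].round().astype(int)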
I have a dataframe with four columns:
Client ID
Date
Assets
Flows
Not all clients have data for the full set of dates; in that case, rows are missing. Put differently, I don't have the same number of rows for each client.
I would like to compute the following and add in additional columns:
Absolute and relative change in Assets over 12m and 3m
Sum of Flows over 12m and 3m
When the statistics can't be computed (e.g. for the first 11 months), the column should be filled with NaN.
I have tried with groupby, but I can't find a way around the fact that the length of the data differs for each client.
Here is an example of my data (first 4 columns) and the wished result (last 4 columns), done in Excel:
[screenshot of the example data and the desired result]
If you are interested in monthly changes, you can add a column "months_since" which equals the number of months since a certain reference date, and then use:
df.pivot("months_since", "Client ID", "Assets")
to get a matrix representation of your data. If a certain client is missing an observation, he will simply have a NaN value there.
Then it is easy to compute the sums/deltas using df.rolling(12).sum() or df.diff(12).
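A minimal sketch of that idea, using the column names from the question and df for the frame (the months_since construction is an assumption since the exact date format is not shown, and it assumes every month appears for at least one client so the pivoted rows are consecutive):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# integer month counter relative to the earliest date in the data
df['months_since'] = (df['Date'].dt.year - df['Date'].dt.year.min()) * 12 + df['Date'].dt.month

assets = df.pivot(index='months_since', columns='Client ID', values='Assets')
flows = df.pivot(index='months_since', columns='Client ID', values='Flows')

abs_chg_12m = assets.diff(12)                        # absolute change in Assets over 12m
rel_chg_12m = assets.pct_change(12)                  # relative change in Assets over 12m
flows_12m = flows.rolling(12, min_periods=12).sum()  # sum of Flows over the last 12m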
import pandas as pd
df_data = pd.read_excel('sample2.xlsx')
df_result_change_assets_3m = df_data.pivot_table(values = 'Assets', index = ['Client ID','Date']).unstack(['Client ID']).pct_change(3).unstack(['Date'])['Assets'].reset_index().rename(columns = {0:'Change Assets 3m (%)'})
df_result_change_assets_12m = df_data.pivot_table(values = 'Assets', index = ['Client ID','Date']).unstack(['Client ID']).pct_change(12).unstack(['Date'])['Assets'].reset_index().rename(columns = {0:'Change Assets 12m (%)'})
df_result_change_usd_assets_3m = df_data.pivot_table(values = 'Assets', index = ['Client ID','Date']).unstack(['Client ID'])
df_result_change_usd_assets_3m = df_result_change_usd_assets_3m - df_result_change_usd_assets_3m.shift(3)
df_result_change_usd_assets_3m = df_result_change_usd_assets_3m.unstack(['Date'])['Assets'].reset_index().rename(columns = {0:'Change Assets 3m (USD)'})
df_result_change_usd_assets_12m = df_data.pivot_table(values = 'Assets', index = ['Client ID','Date']).unstack(['Client ID'])
df_result_change_usd_assets_12m = df_result_change_usd_assets_12m - df_result_change_usd_assets_12m.shift(12)
df_result_change_usd_assets_12m = df_result_change_usd_assets_12m.unstack(['Date'])['Assets'].reset_index().rename(columns = {0:'Change Assets 12m (USD)'})
df_data = df_data.merge(df_result_change_assets_3m, how = 'left', on = ['Client ID','Date'])
df_data = df_data.merge(df_result_change_assets_12m, how = 'left', on = ['Client ID','Date'])
df_data = df_data.merge(df_result_change_usd_assets_3m, how = 'left', on = ['Client ID','Date'])
df_data = df_data.merge(df_result_change_usd_assets_12m, how = 'left', on = ['Client ID','Date'])
df_data
I suggest you split the problem in two: first calculate the YoY change on assets, then focus on the flows. For the first task you can create two new columns to merge the data using a self-join. You can create the new columns with the pd.tseries.offsets.MonthEnd method as follows:
data_df['date_12MonthsAfter'] = data_df['date'] + pd.tseries.offsets.MonthEnd(12)
data_df['date_3MonthsAfter'] = data_df['date'] + pd.tseries.offsets.MonthEnd(3)
Then you can merge the data with two self-joins that compute the quantities you need. There are a couple of ways to do that; I did it like this:
data_merged_12Months = (data_df.merge(data_df[['clientId', 'date_12MonthsAfter', 'assets']],
left_on = ['clientId', 'date'],
right_on = ['clientId', 'date_12MonthsAfter'],
how = 'left',
suffixes=['', '_prevYear'])).drop(['date_12MonthsAfter',
'date_3MonthsAfter',
'date_12MonthsAfter_prevYear'], axis = 1)
data_merged_3Months = (data_df.merge(data_df[['clientId', 'date_3MonthsAfter', 'assets']],
left_on = ['clientId', 'date'],
right_on = ['clientId', 'date_3MonthsAfter'],
how = 'left',
suffixes=['', '_prevQuarter'])).drop(['assets',
'flow',
'date_12MonthsAfter',
'date_3MonthsAfter',
'date_3MonthsAfter_prevQuarter'], axis = 1)
data_merged_assets = data_merged_12Months.merge(data_merged_3Months, on = ['clientId', 'date'])
data_merged_assets['perc yoy'] = (data_merged_assets['assets'] - data_merged_assets['assets_prevYear'])/data_merged_assets['assets_prevYear']
data_merged_assets['perc qoq'] = (data_merged_assets['assets'] - data_merged_assets['assets_prevQuarter'])/data_merged_assets['assets_prevQuarter']
For the flow calculation you first have to replace the blank cells with zeros, which you can do with .replace('', 0), and then cast the column to int. To calculate the sums over the last 12 and 3 months I found this solution:
data_merged_assets['flow'] = data_merged_assets['flow'].replace('', 0).astype(int)
data_flow_12Months = data_merged_assets.groupby(['clientId']).rolling(on = 'date', window=12, min_periods=12)["flow"].sum().reset_index()
data_flow_3Months = data_merged_assets.groupby(['clientId']).rolling(on = 'date', window=3, min_periods=3)["flow"].sum().reset_index()
data_flow = data_flow_12Months.merge(data_flow_3Months, on = ['clientId', 'date'], suffixes = ['_sumLast12Months', '_sumLast3Months'])
Finally, I merged the two datasets.
data_merged = data_merged_assets.merge(data_flow, on = ['clientId', 'date'])
What I need
I need to create new columns in pandas dataframes, based on a set of nested if statements. E.g.
if city == 'London' and income > 10000:
    return 'group 1'
elif city == 'Manchester' or city == 'Leeds':
    return 'group 2'
elif borrower_age > 50:
    return 'group 3'
else:
    return 'group 4'
This is actually a simplification - in most cases I'd need to create something like 10 or more possible outputs, not 4 as above, but hopefully you get the gist.
The issue
My problem is that I have not found a way to make the code acceptably fast.
I understand that, if the choice were binary, I could use something like numpy.where(), but I have not found a way to vectorise the code, or any other way to make it fast enough.
I suppose I could probably nest a number of np.where statements and that would be faster, but then the code would be harder to read and much more prone to errors.
What I have tried
I have tried the following:
+────────────────────────────────────────────────+──────────────+
| Method | Time (secs) |
+────────────────────────────────────────────────+──────────────+
| dataframe.apply | 29 |
| dataframe.apply on a numba-optimised function | 31 |
| sqlite | 16 |
+────────────────────────────────────────────────+──────────────+
"sqlite" means: loading the dataframe into a sqlite in-memory database, creating the new field there, and then exporting back to a dataframe
Sqlite is faster but still unacceptably slow: the same thing on a SQL Server running on the same machine takes less than a second. I'd rather not rely on an external SQL server, though, because the code should be able to run even on machines with no access to a SQL server.
I also tried to create a numba function which loops through the rows one by one, but I understand that numba doesn't support strings (or at least I couldn't get that to work).
Toy example
import numpy as np
import pandas as pd
import sqlite3
import time
import numba
start = time.time()
running_time = pd.Series(dtype=float)
n = int(1e6)
df1 = pd.DataFrame()
df1['income']=np.random.lognormal(0.4,0.4, n) *20e3
df1['loan balance'] = np.maximum(0, np.minimum(30e3, 5e3 * np.random.randn(n) + 20e3 ) )
df1['city'] = np.random.choice(['London','Leeds','Truro','Manchester','Liverpool'] , n )
df1['city'] = df1['city'].astype('|S80')
df1['borrower age'] = np.maximum(22, np.minimum(70, 30 * np.random.randn(n) + 30 ) )
df1['# children']=np.random.choice( [0,1,2,3], n, p= [0.4,0.3,0.25,0.05] )
df1['rate'] = np.maximum(0.5e-2, np.minimum(10e-2, 1e-2 * np.random.randn(n) + 4e-2 ) )
running_time['data frame creation'] = time.time() - start
conn = sqlite3.connect(":memory:", detect_types = sqlite3.PARSE_DECLTYPES)
cur = conn.cursor()
df1.to_sql("df1", conn, if_exists ='replace')
cur.execute("ALTER TABLE df1 ADD new_field nvarchar(80)")
cur.execute('''UPDATE df1
SET new_field = case when city = 'London' AND income > 10000 then 'group 1'
when city = 'Manchester' or city = 'Leeds' then 'group 2'
when "borrower age" > 50 then 'group 3'
else 'group 4'
end
''')
df_from_sql = pd.read_sql('select * from df1', conn)
running_time['sql lite'] = time.time() - start
def my_func(city, income, borrower_age):
    if city == 'London' and income > 10000:
        return 'group 1'
    elif city == 'Manchester' or city == 'Leeds':
        return 'group 2'
    elif borrower_age > 50:
        return 'group 3'
    else:
        return 'group 4'
df1['new_field'] = df1.apply( lambda x: my_func( x['city'], x['income'], x['borrower age'] ) , axis =1)
running_time['data frame apply'] = time.time() - start
#numba.jit(nopython = True)
def my_func_numba_apply(city, income, borrower_age):
    if city == 'London' and income > 10000:
        return 'group 1'
    elif city == 'Manchester' or city == 'Leeds':
        return 'group 2'
    elif borrower_age > 50:
        return 'group 3'
    else:
        return 'group 4'
df1['new_field numba_apply'] = df1.apply( lambda x: my_func_numba_apply( x['city'], x['income'], x['borrower age'] ) , axis =1)
running_time['data frame numba'] = time.time() - start
x = np.concatenate(([0], running_time))
execution_time = pd.Series(np.diff(x) , running_time.index)
print(execution_time)
Other questions I have found
I have found a number of other questions, but none which directly addresses my point.
Most other questions were either easy to vectorise (e.g. just two choices, so np.where works well) or they recommended a numba-based solution, which in my case actually happens to be slower.
E.g.
Speed up applying function to a list of pandas dataframes
This one is about dates, so not really applicable: How to speed up apply method with lambda in pandas with datetime
This one is about joins, so not really applicable: speed up pandas apply or using map
Try with numpy.select:
conditions = [df["city"].eq("London") & df["income"].gt(10000),
df["city"].isin(["Manchester", "Leeds"]),
df["borrower_age"].gt(50)]
choices = ["Group 1", "Group 2", "Group 3"]
df["New Field"] = np.select(conditions, choices, "Group 4")
Or have the conditions as a dictionary and use that in the np.select:
conditions = {"Group 1": df1["city"].eq("London") & df1["income"].gt(10000),
"Group 2": df1["city"].isin(["Manchester", "Leeds"]),
"Group 3": df1["borrower age"].gt(50)}
df["New Field"] = np.select(conditions.values(), conditions.keys(), "Group 4")
I have a dataframe that contains a bunch of people's text descriptions. Other than that, I also have 4 descriptions a,b,c,d. For each person's text description, I wish to compare them to each of the 4 descriptions by using cosine similarity and store these scores in the same dataframe in 4 new columns: a, b, c, d.
How can I do this in a pandas way, without using for loops? I was thinking of using the apply function, but I don't know how to reference the 'text' column as well as the 4 descriptions a, b, c, d inside the apply function.
Thank you very much for any help!!
What I have tried:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]
description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]
mystuff = [('person 1',person_one),
('person 2',person_two),
('person 3',person_three),
('person 4',person_four)
]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])
def trying(cell, jd):
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
    jd_vector = vectorizer.transform(jd)
    person_vector = vectorizer.transform(cell['text'])
    score = cosine_similarity(jd_vector, person_vector)
    return score
df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))
This gives me an error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
The output should look something like this:
person text a b c d
0 person 1 [table, car, mouse] 0.3 0.2 0.5 0.7
1 person 2 [computer, card, can, mouse] 0.2 0.1 0.9 0.7
2 person 3 [chair, table, whiteboard, window, button] 0.3 0.5 0.1 0.4
3 person 4 [queen, king, joker, phone] 0.2 0.4 0.3 0.5
I can't post comments yet, but to solve the error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
You need to pass the extra parameter like this:
df['a'] = df['a'].apply(trying, args=(description_a,))
The first argument will be the column value in your case, and the other arguments will then be taken in order from the args tuple (note the trailing comma, which makes it a one-element tuple).
Hope this helps.
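One more hypothetical adjustment on top of that: trying() also reads cell['text'] internally, so it has to be applied row-wise rather than to a single column, for example:
# apply over rows (axis=1) so trying() can access cell['text']; the trailing
# comma makes args a real one-element tuple
df['a'] = df.apply(trying, axis=1, args=(description_a,))
# each cell then holds a 1x1 similarity matrix; returning score.item() from
# trying() would give plain floats instead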
How about this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = ['table','car','mouse']
person_two = ['computer','card','can','mouse']
person_three = ['chair','table','whiteboard','window','button']
person_four = ['queen','king','joker','phone']
description_a = ['table','yellow','car','king']
description_b = ['bottle','whiteboard','queen']
description_c = ['chair','car','car','phone']
description_d = ['joker','blue','earphone','king']
descriptors = {
'a' : description_a,
'b' : description_b,
'c' : description_c,
'd' : description_d
}
mystuff = [('person 1',person_one),
('person 2',person_two),
('person 3',person_three),
('person 4',person_four)
]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
vocabulary_data =[
person_one,
person_two,
person_three,
person_four,
description_a,
description_b,
description_c,
description_d,
]
data = [set(sentence) for sentence in vocabulary_data]
vocabulary = set.union(*data)
cv = CountVectorizer(vocabulary=vocabulary)
def similarity(row, desc):
    a = cosine_similarity(cv.fit_transform(row['text']).sum(axis=0), cv.fit_transform(desc).sum(axis=0))
    return a.item()
for key, description in descriptors.items():
df[key] = df.apply(lambda x: similarity(x, description), axis=1)
I used one for loop, but only for filling different descriptions. The main "computation" is done by apply.