How to calculate the similarity measure of text documents? - python

I have a CSV file that looks like:
idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike
I would like code that reads the file and computes the similarity between the messages.
I have looked into many posts regarding this, such as 1 2 3 4, but they are either hard for me to understand or not exactly what I want.
Based on some posts and webpages, a simple and effective option seems to be cosine similarity, the Universal Sentence Encoder, or Levenshtein distance.
It would be great if you could provide code that I can run on my side as well. Thanks.

I don't know that calculations like this can be vectorized particularly well, so simple looping will do. At least use the fact that your calculation is symmetric and the diagonal is always 100 to cut down on the number of comparisons you perform.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

# df is assumed to hold your data already, e.g. df = pd.read_csv('your_file.csv')
K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:
            continue
        if i == j:
            sim = 100
        else:
            sim = fuzz.ratio(ac, bc)  # Use whatever metric you want here
                                      # for comparison of 2 strings.
        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)
Output: df_sim
idx    112    114    115    117
idx
112  100.0   78.0   51.0   50.0
114   78.0  100.0   47.0   54.0
115   51.0   47.0  100.0   83.0
117   50.0   54.0   83.0  100.0
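Since you also mention cosine similarity: if you prefer a token-based measure over fuzzywuzzy's edit-distance ratio, a minimal sketch using scikit-learn (assuming the same df as above) could look like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Vectorize the messages with TF-IDF and compute all pairwise cosine similarities at once.
tfidf = TfidfVectorizer().fit_transform(df['messages'])
cos_sim = cosine_similarity(tfidf)  # K x K matrix with 1.0 on the diagonal
df_cos = pd.DataFrame(cos_sim, index=df.idx, columns=df.idx)
print(df_cos)
Values here are between 0 and 1 rather than 0 and 100, and the scores will differ from fuzz.ratio because the comparison is word-based rather than character-based.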

Related

Create new column with multiple values in Python

I have a dataframe which has the name of each station and a link to the measured values of that station for the last 2 days:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
Getting the data from the JSON isn't a big problem.
But then I realized that the JSON link of each station contains multiple values from different times, so I don't know how to attach these values, each with its timestamp, to the corresponding station.
I tried to get all the values from the JSON, but I can't tell which values belong to which station, because there are just too many.
Does anyone have a solution for me?
The dataframe I would like to have should look like this:
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608
Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import requests
import json
import pandas as pd
Define the function to be used by apply
def get_link(x):
    global accum_df
    r = requests.get(x['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(json.loads(r.text))
        ldf['station'] = x['Station']
        accum_df = pd.concat([accum_df, ldf])
    else:
        print(r.status_code)  # handle the error
    return None
Apply it
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM
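If you would rather avoid the global accumulator, a small variation (a sketch, assuming the same df with Station and Link columns) is to collect one frame per station in a list and concatenate once at the end:
frames = []
for _, row in df.iterrows():
    r = requests.get(row['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(r.json())
        ldf['station'] = row['Station']
        frames.append(ldf)
    else:
        print(row['Station'], r.status_code)  # handle the error

accum_df = pd.concat(frames, ignore_index=True)
print(accum_df)
Concatenating once at the end is generally cheaper than growing the dataframe inside the loop.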

Wrong value in text comparison

I am having some difficulties finding text matches in the dataset below (note that Sim is my current output; it is generated by running the code below and it shows the wrong match).
ID Text Sim
13 fsad amazing ... fsd
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️... fdsfdgte3e
18 gsd wonderful fast
21 dfsfs i love this its incredible ... reds
23 gwe wonderful end ever seen you ... add
... ... ... ...
261 add wonderful gwe
261 add wonderful gsd
261 add wonderful fdsdf
267 fdsfdgte3e best match ever its a masterpiece fdsdf
277 hgdfgre terrible destroys everything ... tm28
As shown above, Sim does not give the ID of the author whose text matches.
For example, add should match with gsd and vice versa, but my output says that add matches with gwe, which is not true.
The code I am using is the following:
from fuzzywuzzy import fuzz

def sim(nm, df):  # this function finds matches between texts based on a threshold, which is 100.
                  # The logic is fuzzywuzzy, specifically partial ratio. The output should be the
                  # IDs whose texts match, based on the threshold.
    matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]

df['L_Text'] = df['Text'].str.lower()
df['Sim'] = df.apply(lambda row: sim(row['L_Text'], df), axis=1)
df = df.assign(
    Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
)

def tr(row):  # this function assigns a similarity score to each text using partial_ratio similarity
    return (df.loc[:row.name-1, 'L_Text']
              .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

t = (df.loc[1:].apply(tr, axis=1)
       .reindex(index=df.index,
                columns=df.index)
       .fillna(0)
       .add_prefix('txt')
     )
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
Could you please help me understand the error in my code? Unfortunately I cannot see it.
My expected output would be as follows:
ID Text Sim
13 fsad amazing ...
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️...
18 gsd wonderful add
21 dfsfs i love this its incredible ...
23 gwe wonderful end ever seen you ...
... ... ... ...
261 add wonderful gsd
261 add wonderful gsd
261 add wonderful gsd
267 fdsfdgte3e best match ever its a masterpiece
277 hgdfgre terrible destroys everything ...
since the sim function is set to require a perfect match.
Initial assumption
First off, as your question was not a hundred percent clear to me, I assume that you would like a pairwise comparison of all rows, and if the score of a match is 100 you would like to add the key of the matching row. If this is not the case, please correct me.
Syntactic problems
There are multiple problems with your code above. First, if one were to just copy and paste it, it would not even run because of syntax errors. The sim() function should read as follows:
def sim(nm, df):
    matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]
Notice the df instead of dataset, as well as the == instead of the =. I also removed the redundant parentheses for better readability.
Semantic problems
If I then run your code and print t (which does not seem to be the end result), I get the following:
txt0 txt1 txt2 txt3 txt4 txt5 txt6 txt7 txt8 txt9
0 1.0 27.0 12.0 45.0 45.0 12.0 12.0 12.0 27.0 64.0
1 27.0 1.0 33.0 33.0 42.0 33.0 33.0 33.0 52.0 44.0
2 12.0 33.0 1.0 22.0 100.0 100.0 100.0 100.0 22.0 33.0
3 45.0 33.0 22.0 1.0 41.0 22.0 22.0 22.0 40.0 30.0
4 45.0 42.0 100.0 41.0 1.0 100.0 100.0 100.0 35.0 47.0
5 12.0 33.0 100.0 22.0 100.0 1.0 100.0 100.0 22.0 33.0
6 12.0 33.0 100.0 22.0 100.0 100.0 1.0 100.0 22.0 33.0
7 12.0 33.0 100.0 22.0 100.0 100.0 100.0 1.0 22.0 33.0
8 27.0 52.0 22.0 40.0 35.0 22.0 22.0 22.0 1.0 34.0
9 64.0 44.0 33.0 30.0 47.0 33.0 33.0 33.0 34.0 1.0
which seems correct to me, as fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (as a partial match is already considered a score of 100).
For consistency reasons you could change
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
to
t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100
as all elements should perfectly match themselves. So when you said
But my output says that add matches with gwe and this is not true.
this is in fact what fuzz.partial_ratio() reports, because add's text ("wonderful") is fully contained in gwe's text; if you do not want partial containment to count as a match, you might want to consider using fuzz.ratio() instead. Also, there might be an error when converting t to the new Sim column, but that code is missing from the provided example.
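To illustrate the difference between the two scorers (a quick sketch; exact values can vary slightly between fuzzywuzzy versions):
from fuzzywuzzy import fuzz

# partial_ratio scores the best matching substring, so a fully contained phrase scores 100,
# while ratio compares the complete strings and penalizes the extra words.
print(fuzz.partial_ratio("wonderful end ever seen you", "wonderful"))  # 100
print(fuzz.ratio("wonderful end ever seen you", "wonderful"))          # noticeably lower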
Alternative implementation
Also, as some comments suggested, it is sometimes helpful to restructure your code so that it is easier for people to help you. Here is an example of what this could look like:
import re
import pandas as pd
from fuzzywuzzy import fuzz
data = """
13   fsad        amazing ...                                        fsd
14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...    fdsfdgte3e
18   gsd         wonderful                                          fast
21   dfsfs       i love this its incredible ...                     reds
23   gwe         wonderful end ever seen you ...                    add
261  add         wonderful                                          gwe
261  add         wonderful                                          gsd
261  add         wonderful                                          fdsdf
267  fdsfdgte3e  best match ever its a masterpiece                  fdsdf
277  hgdfgre     terrible destroys everything ...                   tm28
"""
rows = data.strip().split('\n')
records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]
df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
df = df.drop('IncorrectSim', axis=1)
df = df.drop_duplicates(subset=["ID", "Text"]) # Assuming that there is no point in keeping duplicate rows
df = df.set_index('ID') # Assuming that the "ID" column holds a unique ID
comparison_df = df.copy()
comparison_df['Text'] = comparison_df["Text"].str.lower()
comparison_df['Tmp'] = 1
# This gives us all possible row combinations
comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']] # We only want rows that do not match itself
comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
comparison_df = comparison_df[comparison_df['matchScore'] == 100] # only keep perfect matches
comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID') # Cleanup
result = df.join(comparison_df, how='left').fillna('')
print(result.to_string())
gives:
Text Sim
ID
add wonderful gsd
add wonderful gwe
dfsfs i love this its incredible ...
fdsdf best sport everand the gane of the year❤️❤️❤️❤...
fdsfdgte3e best match ever its a masterpiece
fsad amazing ...
gsd wonderful gwe
gsd wonderful add
gwe wonderful end ever seen you ... gsd
gwe wonderful end ever seen you ... add
hgdfgre terrible destroys everything ...
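If you would prefer a single row per ID with all matching IDs collected in a list (closer to the expected output in the question), a possible follow-up on the comparison_df built above (just before the final join) could be:
# Collect every matching ID into one list per ID instead of one row per match.
sim_lists = comparison_df.groupby(level='ID')['Sim'].agg(list)
result = df.join(sim_lists, how='left')
print(result)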

Using pandas to identify nearest objects

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience with them and thought it would be a good learning experience. I was able to complete the assignment using the traditional loops I know from ordinary programming, and it ran okay over thousands of rows, but it brought my laptop to a screeching halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
id start end
0 C1 200 215
1 C2 110 125
2 C3 240 255
...
trucks
id start end
0 T1 115 175
1 T2 200 260
2 T3 280 340
3 T4 25 85
...
The two dataframes above correspond to this:
start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)
I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. In the case of the illustration above, we can observe the following:
C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output needs to be appended to the cars dataframe along with the following new columns:
data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)
The final cars dataframe should look like this:
id start end back_id back_start back_end back_distance across_id across_start across_end front_id front_start front_end front_distance
0 C1 200 215 T1 115 175 25 T2 200 260
1 C2 110 125 T4 25 85 25 T1 115 175 -10
2 C3 240 255 T2 200 260 T3 280 340 25
Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.
With pandas you can use merge_asof. Here is one way, though it may not be efficient with millions of rows:
# first sort values
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

# create back condition
df_back = pd.merge_asof(trucks.rename(columns={col: f'back_{col}'
                                               for col in trucks.columns}),
                        cars.assign(back_end=lambda x: x['end']),
                        on='back_end', direction='forward')\
            .query('end>back_end')\
            .assign(back_distance=lambda x: x['start']-x['back_end'])

# create across condition: here note that cars is the first of the 2 dataframes
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col: f'across_{col}'
                                                 for col in trucks.columns}),
                          on='across_start', direction='backward')\
              .query('end<=across_end')

# create front condition
df_front = pd.merge_asof(trucks.rename(columns={col: f'front_{col}'
                                                for col in trucks.columns}),
                         cars.assign(front_start=lambda x: x['start']),
                         on='front_start', direction='backward')\
             .query('start<front_start')\
             .assign(front_distance=lambda x: x['front_start']-x['end'])

# merge all back to cars
df_f = cars.merge(df_back, how='left')\
           .merge(df_across, how='left')\
           .merge(df_front, how='left')
and you get
print (df_f)
id start end back_id back_start back_end back_distance across_start \
0 C2 110 125 T4 25.0 85.0 25.0 NaN
1 C1 200 215 T1 115.0 175.0 25.0 200.0
2 C3 240 255 NaN NaN NaN NaN 240.0
across_id across_end front_id front_start front_end front_distance
0 NaN NaN T1 115.0 175.0 -10.0
1 T2 260.0 NaN NaN NaN NaN
2 T2 260.0 T3 280.0 340.0 25.0

Python: Finding multiple linear trend lines in a scatter plot

I have the following pandas dataframe -
Atomic Number R C
0 2.0 49.0 0.040306
1 3.0 205.0 0.209556
2 4.0 140.0 0.107296
3 5.0 117.0 0.124688
4 6.0 92.0 0.100020
5 7.0 75.0 0.068493
6 8.0 66.0 0.082244
7 9.0 57.0 0.071332
8 10.0 51.0 0.045725
9 11.0 223.0 0.217770
10 12.0 172.0 0.130719
11 13.0 182.0 0.179953
12 14.0 148.0 0.147929
13 15.0 123.0 0.102669
14 16.0 110.0 0.120729
15 17.0 98.0 0.106872
16 18.0 88.0 0.061996
17 19.0 277.0 0.260485
18 20.0 223.0 0.164312
19 33.0 133.0 0.111359
20 36.0 103.0 0.069348
21 37.0 298.0 0.270709
22 38.0 245.0 0.177368
23 54.0 124.0 0.079491
The trend between R and C is generally a linear one. What I would like to do, if possible, is find an exhaustive list of all the possible combinations of 3 or more points and their trends with scipy.stats.linregress, so that I can find the groups of points that fit a line best.
Ideally it would look something like this for the data (Source), but I am looking for all the other possible trends too.
So the question: how do I feed all the 16776915 possible combinations (sum_(i=3)^24 binomial(24, i)) of 3 or more points into linregress, and is it even doable without a ton of code?
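For reference, a brute-force sketch of exactly that idea (enumerate every subset and fit each one with linregress) could look like the code below; it runs, but with roughly 16.8 million subsets it will be very slow, which is why the answer that follows takes a different route:
from itertools import combinations
from scipy.stats import linregress

results = []
# Enumerate every subset of 3 or more points and fit a straight line to each one.
for size in range(3, len(df) + 1):
    for subset in combinations(df.index, size):
        sub = df.loc[list(subset)]
        fit = linregress(sub['R'], sub['C'])
        results.append((subset, fit.slope, fit.intercept, fit.rvalue ** 2))

# Subsets that fit a line best (by R squared) come first.
results.sort(key=lambda r: r[3], reverse=True)
print(results[:5])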
My solution proposal below is based on the RANSAC algorithm. It is a method to fit a mathematical model (e.g. a line) to data with a heavy share of outliers.
RANSAC is one specific method from the field of robust regression.
My solution below first fits a line with RANSAC. Then you remove the data points close to this line from your data set (which is the same as keeping only the outliers), fit RANSAC again, remove data, and so on until only very few points are left.
Such approaches always have parameters which are data dependent (e.g. noise level or proximity of the lines). In the following solution, MIN_SAMPLES and residual_threshold are parameters which might require some adaptation to the structure of your data:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

MIN_SAMPLES = 3

x = np.linspace(0, 2, 100)

xs, ys = [], []

# generate points for three lines described by a and b,
# we also add some noise:
for a, b in [(1.0, 2), (0.5, 1), (1.2, -1)]:
    xs.extend(x)
    ys.extend(a * x + b + .1 * np.random.randn(len(x)))

xs = np.array(xs)
ys = np.array(ys)
plt.plot(xs, ys, "r.")

colors = "rgbky"
idx = 0

while len(xs) > MIN_SAMPLES:
    # build design matrix for linear regressor
    X = np.ones((len(xs), 2))
    X[:, 1] = xs

    ransac = linear_model.RANSACRegressor(
        residual_threshold=.3, min_samples=MIN_SAMPLES
    )
    res = ransac.fit(X, ys)

    # vector of boolean values, describes which points belong
    # to the fitted line:
    inlier_mask = ransac.inlier_mask_

    # plot point cloud:
    xinlier = xs[inlier_mask]
    yinlier = ys[inlier_mask]

    # circle through colors:
    color = colors[idx % len(colors)]
    idx += 1

    plt.plot(xinlier, yinlier, color + "*")

    # only keep the outliers:
    xs = xs[~inlier_mask]
    ys = ys[~inlier_mask]

plt.show()
In the resulting plot, points shown as stars belong to the clusters detected by my code. You also see a few points depicted as circles, which are the points remaining after the iterations. The few black stars form a cluster which you could get rid of by increasing MIN_SAMPLES and/or residual_threshold.

Portfolio Selection in Python with constraints from a fixed set

I am working on a project where I am trying to select the optimal subset of players from a set of 125 players (example below)
The constraints are:
a) Number of players = 3
b) Sum of prices <= 30
The optimization function is Max(Sum of Votes)
Player Vote Price
William Smith 0.67 8.6
Robert Thompson 0.31 6.7
Joseph Robinson 0.61 6.2
Richard Johnson 0.88 4.3
Richard Hall 0.28 9.7
I looked at the scipy.optimize package but I can't find a way to constrain the universe to this subset. Can anyone point me to a library that would do that?
Thanks
The problem is well suited to be formulated as a mathematical program and can be solved with different optimization libraries.
It is known as the exact k-item knapsack problem.
You can use the package PuLP, for example. It has interfaces to different optimization software packages, but comes bundled with a free solver.
pip install pulp
Free solvers are often much slower than commercial ones, but I think PuLP should be able to solve reasonably large versions of your problem with its standard solver.
Your problem can be solved with PuLP as follows:
from pulp import *
# Data input
players = ["William Smith", "Robert Thompson", "Joseph Robinson", "Richard Johnson", "Richard Hall"]
vote = [0.67, 0.31, 0.61, 0.88, 0.28]
price = [8.6, 6.7, 6.2, 4.3, 9.7]
P = range(len(players))
# Declare problem instance, maximization problem
prob = LpProblem("Portfolio", LpMaximize)
# Declare decision variable x, which is 1 if a
# player is part of the portfolio and 0 else
x = LpVariable.matrix("x", list(P), 0, 1, LpInteger)
# Objective function -> Maximize votes
prob += sum(vote[p] * x[p] for p in P)
# Constraint definition
prob += sum(x[p] for p in P) == 3
prob += sum(price[p] * x[p] for p in P) <= 30
# Start solving the problem instance
prob.solve()
# Extract solution
portfolio = [players[p] for p in P if x[p].varValue]
print(portfolio)
The runtime to draw 3 players from 125 with the same random data as used by Brad Solomon is 0.5 seconds on my machine.
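If you also want to inspect the objective value and the total price of the chosen portfolio, a small addition (using PuLP's value helper) could be:
print(value(prob.objective))                       # total votes of the selected players
print(sum(price[p] for p in P if x[p].varValue))   # total price of the selected players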
Your problem is a discrete optimization task because of constraint a). You should introduce discrete variables to represent taken/not-taken players. Consider the following MiniZinc pseudocode:
array[players_num] of var bool: taken_players;
array[players_num] of float: votes;
array[players_num] of float: prices;
constraint sum (taken_players * prices) <= 30;
constraint sum (taken_players) = 3;
solve maximize sum (taken_players * votes);
As far as I know, you can't use scipy to solve such problems (e.g. this).
You can solve your problem in these ways:
You can generate the MiniZinc problem in Python and solve it by calling an external solver. This seems more scalable and robust.
You can use simulated annealing.
A mixed integer approach.
The second option seems to be simpler for you. But, personally, I prefer the first one: it allows you to introduce a wide range of constraints, and the problem formulation feels more natural and clear.
@CaptainTrunky is correct, scipy.minimize will not work here.
Here is an awfully crappy workaround using itertools; please ignore it if one of the other methods has worked. Consider that drawing 3 players from 125 creates 317,750 combinations, n! / ((n - k)! * k!). Runtime of the main loop is roughly 6 minutes.
from itertools import combinations
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'Player': np.arange(0, 125),
                'Vote': 10 * np.random.random(125),
                'Price': np.random.randint(1, 10, 125)
                })
df
Out[109]:
Player Price Vote
0 0 4 7.52425
1 1 6 3.62207
2 2 9 4.69236
3 3 4 5.24461
4 4 4 5.41303
.. ... ... ...
120 120 9 8.48551
121 121 8 9.95126
122 122 8 6.29137
123 123 8 1.07988
124 124 4 2.02374
players = df.Player.values
idx = pd.MultiIndex.from_tuples([i for i in combinations(players, 3)])
votes = []
prices = []

for i in combinations(players, 3):
    vote = df[df.Player.isin(i)].sum()['Vote']
    price = df[df.Player.isin(i)].sum()['Price']
    votes.append(vote); prices.append(price)

result = DataFrame({'Price': prices, 'Vote': votes}, index=idx)

# The index below is (first player, second player, third player)
result[result.Price <= 30].sort_values('Vote', ascending=False)
Out[128]:
Price Vote
63 87 121 25.0 29.75051
64 121 20.0 29.62626
64 87 121 19.0 29.61032
63 64 87 20.0 29.56665
65 121 24.0 29.54248
... ...
18 22 78 12.0 1.06352
23 103 20.0 1.02450
22 23 103 20.0 1.00835
18 22 103 15.0 0.98461
23 14.0 0.98372
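A sketch of a slightly faster version of the same brute force (my own variation, not part of the original answer): look up each player's row position once and sum plain NumPy arrays instead of calling df.isin for every combination.
pos = {p: i for i, p in enumerate(players)}   # player -> row position
vote_arr = df['Vote'].to_numpy()
price_arr = df['Price'].to_numpy()

votes, prices = [], []
for combo in combinations(players, 3):
    rows = [pos[p] for p in combo]
    votes.append(vote_arr[rows].sum())
    prices.append(price_arr[rows].sum())

result = DataFrame({'Price': prices, 'Vote': votes}, index=idx)
The logic and output are the same; only the per-combination lookups are cheaper.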
