Wrong value in text comparison - python

I am having some difficulties finding matching texts in the dataset below (note that Sim is my current output, generated by running the code below; it shows the wrong match).
ID Text Sim
13 fsad amazing ... fsd
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️... fdsfdgte3e
18 gsd wonderful fast
21 dfsfs i love this its incredible ... reds
23 gwe wonderful end ever seen you ... add
... ... ... ...
261 add wonderful gwe
261 add wonderful gsd
261 add wonderful fdsdf
267 fdsfdgte3e best match ever its a masterpiece fdsdf
277 hgdfgre terrible destroys everything ... tm28
As shown above, Sim does not give the ID of the user who wrote the matching text.
For example, add should match with gsd and vice versa, but my output says that add matches with gwe, which is not true.
The code I am using is the following:
from fuzzywuzzy import fuzz
def sim (nm, df):
    # This function finds matches between texts based on a threshold, which is 100.
    # The logic is fuzzywuzzy, specifically partial ratio. The output should be the
    # IDs whose texts match, based on the threshold.
    matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]
df['L_Text'] = df['Text'].str.lower()
df['Sim'] = df.apply(lambda row: sim(row['L_Text'], df), axis=1)
df = df.assign(
    Sim=df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
)
def tr (row):
    # This function assigns a similarity score to each text, applying partial_ratio similarity.
    return (df.loc[:row.name-1, 'L_Text']
              .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

import numpy as np  # needed for the symmetrization below

t = (df.loc[1:].apply(tr, axis=1)
       .reindex(index=df.index,
                columns=df.index)
       .fillna(0)
       .add_prefix('txt')
)
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
Could you please help me understand the error in my code? Unfortunately I cannot see it.
My expected output would be as follows:
ID Text Sim
13 fsad amazing ...
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️...
18 gsd wonderful add
21 dfsfs i love this its incredible ...
23 gwe wonderful end ever seen you ...
... ... ... ...
261 add wonderful gsd
261 add wonderful gsd
261 add wonderful gsd
267 fdsfdgte3e best match ever its a masterpiece
277 hgdfgre terrible destroys everything ...
since the sim function is set to a perfect match (score == 100).

Initial assumption
First off, as your question was not a hundred percent clear to me, I assume that you would like to have a pairwise comparison of all rows, and if the score of a match is 100 you would like to add the key of the matching row. If this is not the case, please correct me.
Syntactic problems
So there are multiple problems with your code above. First, if one just copies and pastes it, it is syntactically impossible to run. The sim() function should read as follows:
def sim (nm, df):
    matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]
Notice the df instead of dataset, as well as the == instead of the =. I also removed the redundant parentheses for better readability.
Semantic problems
If I then run your code and print t (which does not seem to be the end result), it gives me the following:
txt0 txt1 txt2 txt3 txt4 txt5 txt6 txt7 txt8 txt9
0 1.0 27.0 12.0 45.0 45.0 12.0 12.0 12.0 27.0 64.0
1 27.0 1.0 33.0 33.0 42.0 33.0 33.0 33.0 52.0 44.0
2 12.0 33.0 1.0 22.0 100.0 100.0 100.0 100.0 22.0 33.0
3 45.0 33.0 22.0 1.0 41.0 22.0 22.0 22.0 40.0 30.0
4 45.0 42.0 100.0 41.0 1.0 100.0 100.0 100.0 35.0 47.0
5 12.0 33.0 100.0 22.0 100.0 1.0 100.0 100.0 22.0 33.0
6 12.0 33.0 100.0 22.0 100.0 100.0 1.0 100.0 22.0 33.0
7 12.0 33.0 100.0 22.0 100.0 100.0 100.0 1.0 22.0 33.0
8 27.0 52.0 22.0 40.0 35.0 22.0 22.0 22.0 1.0 34.0
9 64.0 44.0 33.0 30.0 47.0 33.0 33.0 33.0 34.0 1.0
which seems correct to me, as fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (as a partial match is already considered a score of 100).
For consistency reasons you could change
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
to
t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100
as all elements should perfectly match themselves. So when you said
But my output says that add matches with gwe and this is not true.
this would be true in the sense that fuzz.partial_ratio() treats a contained substring as a perfect match; you might want to consider using fuzz.ratio() instead. Also, there might be an error when converting t to the new Sim column, but that code is missing from the provided example.
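To make the difference concrete, here is a minimal sketch of the two scorers (assuming fuzzywuzzy is installed; the strings are taken from the data above):
from fuzzywuzzy import fuzz

# partial_ratio scores the best-matching substring, so full containment
# already counts as a perfect match:
print(fuzz.partial_ratio("wonderful end ever seen you", "wonderful"))  # 100

# ratio compares the whole strings, so the extra words lower the score:
print(fuzz.ratio("wonderful end ever seen you", "wonderful"))  # 50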
Alternative implementation
Also, as some comments suggested, it is sometimes helpful to restructure your code so that it is easier for people to help you. Here is an example of what this could look like:
import re
import pandas as pd
from fuzzywuzzy import fuzz
data = """
13 fsad amazing ... fsd
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️... fdsfdgte3e
18 gsd wonderful fast
21 dfsfs i love this its incredible ... reds
23 gwe wonderful end ever seen you ... add
261 add wonderful gwe
261 add wonderful gsd
261 add wonderful fdsdf
267 fdsfdgte3e best match ever its a masterpiece fdsdf
277 hgdfgre terrible destroys everything ... tm28
"""
rows = data.strip().split('\n')
records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]
df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
df = df.drop('IncorrectSim', axis=1)
df = df.drop_duplicates(subset=["ID", "Text"]) # Assuming that there is no point in keeping duplicate rows
df = df.set_index('ID') # Assuming that the "ID" column holds a unique ID
comparison_df = df.copy()
comparison_df['Text'] = comparison_df["Text"].str.lower()
comparison_df['Tmp'] = 1
# This gives us all possible row combinations
comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']] # We only want rows that do not match itself
comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
comparison_df = comparison_df[comparison_df['matchScore'] == 100] # only keep perfect matches
comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID') # Cleanup
result = df.join(comparison_df, how='left').fillna('')
print(result.to_string())
gives:
Text Sim
ID
add wonderful gsd
add wonderful gwe
dfsfs i love this its incredible ...
fdsdf best sport everand the gane of the year❤️❤️❤️❤...
fdsfdgte3e best match ever its a masterpiece
fsad amazing ...
gsd wonderful gwe
gsd wonderful add
gwe wonderful end ever seen you ... gsd
gwe wonderful end ever seen you ... add
hgdfgre terrible destroys everything ...
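As a side note on the cross join above: the Tmp-column trick works on any pandas version; on pandas >= 1.2 the same combination of rows can be produced more directly (a sketch, equivalent under the assumptions of the code above):
comparison_df = comparison_df.reset_index()
# how='cross' pairs every row with every other row, adding the same _x/_y suffixes
comparison_df = comparison_df.merge(comparison_df, how='cross')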

Related

Create new column with multiple values in Python

I have a dataframe which has the names of stations and links to the measured values of each station for 2 days:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
Getting data from the JSON isn't a big problem.
But then I realized that the JSON link of each station has multiple values from different times, so I don't know how to add these values from each time to a specific station.
I tried to get all the values from the JSON, but I can't tell which values belong to which station, because there are just too many.
Does anyone have a solution for me?
The DataFrame I would like to have should look like this:
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608
Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import requests
import json
import pandas as pd  # needed for pd.DataFrame / pd.concat below
Define the function to be used by apply
def get_link(x):
    global accum_df  # accumulate results across apply calls
    r = requests.get(x['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(json.loads(r.text))
        ldf['station'] = x['Station']
        accum_df = pd.concat([accum_df, ldf])
    else:
        print(r.status_code)  # handle the error
    return None
Apply it
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM
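Growing a global DataFrame with pd.concat on every row works, but concatenating once at the end is usually cheaper. A hedged variant of the same idea (same df and URLs assumed):
frames = []  # collect one frame per station, concatenate once at the end

def get_link(x):
    r = requests.get(x['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(r.json())
        ldf['station'] = x['Station']
        frames.append(ldf)
    else:
        print(r.status_code)  # handle the error

df.apply(get_link, axis=1)
accum_df = pd.concat(frames, ignore_index=True)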

How to apply a function that splits multiple numbers to the fields of a column in a dataframe in Python?

I need to apply a function that splits multiple numbers from the fields of a dataframe.
In this dataframe there are all the kids' measurements that are needed for a school: name, height, weight, unique code, and their dream career.
The name is formed only of alphabetic characters, but some kids might have both a first name and a middle name (e.g. Vivien Ester).
The height is known to be >= 100 cm for every child.
The weight is known to be < 70 kg for every child.
The unique code is known to be any number, but it is associated with the letters 'AX' for every child. The 'AX' may not always be attached to the number (e.g. 7771AX); there might be a space before it (e.g. 100 AX).
Every kid has their dream career.
They could appear in any order, but they always follow the rules above. However, for some kids some measurements might not appear (e.g. height or unique code or both are missing, or all are missing).
So the dataframe is this:
data = {'Dream Career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor', 'Fashion Designer', 'Teacher', 'Architect'],
        'Measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1', 'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20', 'Oliver 40.5', 'Julien 35.1 678 AX 111.1', 'Bee 20.0 100.80 88AX']}
df = pd.DataFrame(data, columns=['Dream Career', 'Measurements'])
And it looks like this:
Dream Career Measurements
0 Scientist Rachel 24.3 100.25 100 AX
1 Astronaut 100.5 Tuuli 30.1
2 Software Engineer Michael 28.0 7771AX 113.75
3 Doctor Vivien Ester 40AX 115.20
4 Fashion Designer Oliver 40.5
5 Teacher Julien 35.1 678 AX 111.1
6 Architect Bee 20.0 100.80 88AX
I am trying to split all of these measurements into different columns, based on the specified rules.
So the final dataframe should look like this:
Dream Career Names Weight Height Unique Code
0 Scientist Rachael 24.3 100.25 100AX
1 Astronaut Tuuli 30.1 100.50 NaN
2 Software Engineer Michael 28.0 113.75 7771AX
3 Doctor Vivien Ester NaN 115.20 40AX
4 Fashion Designer Oliver 40.5 NaN NaN
5 Teacher Julien 35.1 111.10 678AX
6 Architect Bee 10.0 100.80 88AX
I tried the code below and it works very well, but only on single strings. I need to do this within the dataframe, still keeping every kid's associated dream career (so that the order is not lost).
num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'

def get_weight_height(s):
    nums = re.findall(num_rx, s)
    height = np.nan
    weight = np.nan
    if len(nums) == 0:
        height = np.nan
        weight = np.nan
    elif len(nums) == 1:
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = np.nan
        else:
            weight = nums[0]
            height = np.nan
    elif len(nums) == 2:
        if float(nums[0]) >= 100:
            height = nums[0]
            weight = nums[1]
        else:
            height = nums[1]
            weight = nums[0]
    return height, weight
class_code = {'Small': 'AX', 'Mid': 'BX', 'High': 'CX'}

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def extract_measurements(string, substring_name):
    height = np.nan
    weight = np.nan
    unique_code = np.nan
    name = ''
    if hasNumbers(string):
        num_rx = r'[-+]?\.?\d+(?:,\d{3})*\.?\d*(?:[eE][-+]?\d+)?'
        nums = re.findall(num_rx, string)
        if substring_name in string:
            special_match = re.search(rf'{num_rx}(?=\s*{substring_name}\b)', string)
            if special_match:
                unique_code = special_match.group()
                string = string.replace(unique_code, '')
                unique_code = unique_code + substring_name
            if 2 <= len(nums) <= 3:  # note: `len(nums) >= 2 & len(nums) <= 3` would bind as `>= (2 & len(nums))`
                height, weight = get_weight_height(string)
        else:
            height, weight = get_weight_height(string)
    name = " ".join(re.findall("[a-zA-Z]+", string))
    name = name.replace(substring_name, '')
    return format(float(height), '.2f'), float(weight), unique_code, name
And I apply it like this:
string = 'Anya 101.30 23 4546AX'
height, weight, unique_code, name = extract_measurements(string, class_code['Small'])
print( 'name is: ', name, '\nh is: ', height, '\nw is: ', weight, '\nunique code is: ', unique_code)
The results are very good.
I tried to apply the function to the dataframe, but I don't know how. I tried this, inspired by this and this and this... but they are all different from my problem:
df['height'], df['weight'], df['unique_code'], df['name'] = extract_measurements(df['Measurements'], class_code['Small'])
I cannot figure out how to apply it to my dataframe. Please help me.
I am at the very beginning, and I highly appreciate any help!
Use apply on the rows (axis=1) and choose the 'expand' option. Then rename the columns and concat to the original df:
pd.concat([df, (df.apply(lambda row: extract_measurements(row['Measurements'], class_code['Small']),
                         axis=1, result_type='expand')
                  .rename(columns={0: 'height', 1: 'weight', 2: 'unique_code', 3: 'name'})
                )], axis=1)
output:
Dream Career Measurements height weight unique_code name
-- ----------------- -------------------------- -------- -------- ------------- ------------
0 Scientist Rachel 24.3 100.25 100 AX 100 100 100AX Rachel
1 Astronaut 100.5 Tuuli 30.1 100 100 nan Tuuli
2 Software Engineer Michael 28.0 7771AX 113.75 100 100 7771AX Michael
3 Doctor Vivien Ester 40AX 115.20 100 100 40AX Vivien Ester
4 Fashion Designer Oliver 40.5 100 100 nan Oliver
5 Teacher Julien 35.1 678 AX 111.1 100 100 678AX Julien
6 Architect Bee 20.0 100.80 88AX 100 100 88AX Bee
(Note I stubbed the get_weight_height(string) function to always return 100, 100 because your code did not include it.)
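For completeness, a sketch of such a stub (hypothetical, standing in for the missing function):
def get_weight_height(string):
    # placeholder: always report 100, 100 since the real
    # implementation was not part of the code I ran
    return 100, 100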
@piterbarg's answer seems efficient given the original functions, but the functions seem verbose to me. I'm sure there's a simpler solution than what I'm doing, but what I have below replaces the functions in the OP with, I think, the same results.
First changing the column names to snake case for ease:
df = pd.DataFrame({
'dream_career': ['Scientist', 'Astronaut', 'Software Engineer', 'Doctor',
'Fashion Designer', 'Teacher', 'Architect'],
'measurements': ['Rachel 24.3 100.25 100 AX', '100.5 Tuuli 30.1',
'Michael 28.0 7771AX 113.75', 'Vivien Ester 40AX 115.20',
'Oliver 40.5', 'Julien 35.1 678 AX 111.1',
'Bee 20.0 100.80 88AX']
})
First, the strings in .measurements are turned into lists. From here on, list comprehensions will be applied to each list to filter values.
df.measurements = df.measurements.str.split()
0 [Rachel, 24.3, 100.25, 100, AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678, AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
The second step is filtering the stray 'AX' tokens out of .measurements and appending 'AX' to all integers. This assumes the example is fully reproducible and all the height/weight measurements are floats; a different differentiator could be used if this isn't the case.
df.measurements = df.measurements.apply(
    lambda val_list: [val for val in val_list if val != 'AX']
).apply(
    lambda val_list: [str(val) + 'AX' if val.isnumeric() else val
                      for val in val_list]
)
0 [Rachel, 24.3, 100.25, 100AX]
1 [100.5, Tuuli, 30.1]
2 [Michael, 28.0, 7771AX, 113.75]
3 [Vivien, Ester, 40AX, 115.20]
4 [Oliver, 40.5]
5 [Julien, 35.1, 678AX, 111.1]
6 [Bee, 20.0, 100.80, 88AX]
Name: measurements, dtype: object
.name and .unique_code are pretty easy to grab. With .unique_code I had to apply a second lambda function to insert NaNs. If there are missing values for .name in the original df, the same thing will need to be done there. For cases of multiple names, these are joined together, separated by a space.
df['name'] = df.measurements.apply(
    lambda val_list: ' '.join([val for val in val_list if val.isalpha()])
)
df['unique_code'] = df.measurements.apply(
    lambda val_list: [val for val in val_list if 'AX' in val]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
For height and weight I needed to create a column of numerics first and work off that. In cases where there are missing values, I come back around afterwards to deal with those.
import re
df['numerics'] = df.measurements.apply(
    lambda val_list: [float(val) for val in val_list
                      if not re.search('[a-zA-Z]', val)]
)
df['weight'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val < 70]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
df['height'] = df.numerics.apply(
    lambda val_list: [val for val in val_list if val >= 100]
).apply(
    lambda x: np.nan if len(x) == 0 else x[0]
)
Finally, .measurements and .numerics are dropped, and the df should be ready to go.
df = df.drop(columns=['measurements', 'numerics'])
dream_career name unique_code weight height
0 Scientist Rachel 100AX 24.3 100.25
1 Astronaut Tuuli NaN 30.1 100.50
2 Software Engineer Michael 7771AX 28.0 113.75
3 Doctor Vivien Ester 40AX NaN 115.20
4 Fashion Designer Oliver NaN 40.5 NaN
5 Teacher Julien 678AX 35.1 111.10
6 Architect Bee 88AX 20.0 100.80

Data Frame - changing of the nested variable

We are discussing data that is imported from Excel:
ene2 = pd.read_excel('Energy Indicators.xls', index=False)
Recently I asked in a post where the answers were clear, straightforward, and brought success:
Changing Values of elements in Pandas Datastructure
However, I went a few steps further, and I have a similar (sic!) problem, where assigning a variable does not change anything.
Lets consider Data Structure
print(ene2.head())
Country Energy Supply Energy Supply per Capita % Renewable's
15 NaN Gigajoules Gigajoules %
16 Afghanistan 321000000 10 78.6693
17 Albania 102000000 35 100
18 Algeria1 1959000000 51 0.55101
19 American Samoa ... ... 0.641026
238 Viet Nam 2554000000 28 45.3215
239 Wallis and Futuna Islands 0 26 0
240 Yemen 344000000 13 0
241 Zambia 400000000 26 99.7147
242 Zimbabwe 480000000 32 52.5361
243 NaN NaN NaN NaN
244 NaN NaN NaN NaN
where some countries have a number appended (like Algeria1 or Australia12).
I want to change those names to become just Algeria, Australia, and so on.
There are in total 20 entries that are supposed to be changed.
I developed a method to do it, which fails at the last step...
for value in ene2['Country']:
    if type(value) == float:  # to cover NaN values
        continue
    x = re.findall("\D+\d", value)  # to find those countries/elements which end with a number
    while len(x) > 0:  # this catches elements with a number; otherwise the answer is [], whose length is 0
        for letters in x:  # to touch the letters
            right = letters[:-1]  # and get rid of the last number
            ene2.loc[ene2['Country'] == value, 'Country'] = right  # THIS IS THE ELEMENT WHICH FAILS <= it does not change the value
        x = re.findall("\D+\d", value)  # to bring the new value to the while loop
The code above should do the task and finally remove all the number suffixes from the names; however, the ene2.loc[...] line, which used to work previously, does nothing here where it is nested.
What could be the reason that this replacement does not work, and how can I overcome the problem a) in an old-style way, b) in the pandas way?
The code suggests you already use pandas, so why not use the built-in replace method with a regex?
df = pd.DataFrame(data=["Afghanistan","Albania", "Algeria1", "Algeria9999"], columns=["Country"])
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '')
output:
print(df["Country_clean"])
0 Afghanistan
1 Albania
2 Algeria
3 Algeria
Name: Country_clean, dtype: object
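One caveat: in recent pandas versions, Series.str.replace no longer treats the pattern as a regex by default, so it is safer to pass regex=True explicitly (a small, version-dependent adjustment):
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '', regex=True)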

How do I create new columns by combining data in existing columns?

I have a dataset that includes 5 columns (excuse the formatting):
id Price Service Rater Name Cleanliness
401013357 5 3 A 1
401014972 2 1 A 5
401022510 3 4 B 2
401022510 5 1 C 9
401022510 3 1 D 4
401022510 2 2 E 2
I would like there to be only one row for each ID. Therefore, I need to create columns for each of the raters' names and rating categories (e.g. Rater Name Price, Rater Name Service, Rater Name Cleanliness), each in its own column.
I've explored groupby but cannot figure out how to manipulate these into new columns. Thank you!
Here's the code and data I'm actually using:
import requests
from pandas import DataFrame
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'])
From this, I get
formattedSpread overUnder spread
id provider
401013357 consensus UMass -21 68.0 -21.0
401014972 consensus Rice -22.5 58.5 -22.5
401022510 Caesars Colorado State -17.5 57.5 -17.5
consensus Colorado State -17 57.5 -17.0
numberfire Colorado State -17 58.5 -17.0
teamrankings Colorado State -17 58.0 -17.0
401013437 numberfire Wyoming -5 47.0 5.0
teamrankings Wyoming -5 47.0 5.0
401020671 consensus Ball State -19.5 61.5 -19.5
401019470 Caesars UCF -22.5 NaN 22.5
consensus UCF -22.5 NaN 22.5
numberfire UCF -24 70.0 24.0
teamrankings UCF -24 70.0 24.0
401013328 numberfire Minnesota -21.5 47.0 -21.5
teamrankings Minnesota -21.5 49.0 -21.5
The outcome I am looking for is for each of the 4 different providers to have three columns each, so that it's caesars_formattedSpread, caesars_overUnder, caesars_spread, numberfire_formattedSpread, numberfire_overUnder, numberfire_spread, etc.
When I run unstack as suggested, I don't get what I expect. Instead I get:
formattedSpread 0 UMass -21
1 Rice -22.5
2 Colorado State -17.5
3 Colorado State -17
4 Colorado State -17
5 Colorado State -17
6 Wyoming -5
7 Wyoming -5
8 Ball State -19.5
9 UCF -22.5
10 UCF -22.5
11 UCF -24
12 UCF -24
* Edited, based on the edited question *
Given that your dataframe is df:
df = df.set_index(['id', 'Rater Name']) # Make it a Multi Index
df_unstacked = df.unstack()
The problem with your edited code is that you don't assign dflinestable.set_index(['id', 'provider']) to anything. So when you then use dflinestable.unstack(), you are unstacking the original dflinestable.
So with your entire code, it should be:
import requests
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = pd.DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'], inplace=True) # Add inplace=True
dflinestable_unstacked = dflinestable.unstack() # unstack (you could also reassign to the same df)
# Flatten columns to single level, in the order as described
dflinestable_unstacked.columns = [f'{j}_{i}' for i, j in dflinestable_unstacked.columns]
This will give you a DataFrame like (abbreviated):
Caesars_formattedSpread ... teamrankings_spread
id ...
401012246 Alabama -24 ... -23.5
401012247 Arkansas -34 ... NaN
401012248 Auburn -1 ... -1.5
401012249 NaN ... NaN
401012250 Georgia -44 ... NaN
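As an alternative to set_index + unstack, the same wide shape can be produced with pivot_table; here is a sketch against the small rating table at the top of the question (column names assumed to be exactly as shown there):
wide = df.pivot_table(index='id',
                      columns='Rater Name',
                      values=['Price', 'Service', 'Cleanliness'])
# flatten the (measure, rater) MultiIndex columns into e.g. 'A_Price'
wide.columns = [f'{rater}_{measure}' for measure, rater in wide.columns]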

How to calculate the similarity measure of text document?

I have a CSV file that looks like:
idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike
I would like to have code that reads the file and computes the similarity between the messages.
I have looked into many posts regarding this, such as 1 2 3 4, but they are either hard for me to understand or not exactly what I want.
Some posts and webpages say that "a simple and effective one is cosine similarity", or suggest the "Universal Sentence Encoder" or "Levenshtein distance".
It would be great if you could provide code that I can run on my side as well. Thanks.
I don't know that calculations like this can be vectorized particularly well, so looping is simple. At least use the fact that your calculation is symmetric and the diagonal is always 100 to cut down on the number of calculations you perform.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

K = len(df)
similarity = np.empty((K, K), dtype=float)
for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:
            continue
        if i == j:
            sim = 100
        else:
            sim = fuzz.ratio(ac, bc)  # use whatever metric you want here
                                      # for comparison of 2 strings
        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)
Output: df_sim
idx    112    114    115    117
idx
112 100.0 78.0 51.0 50.0
114 78.0 100.0 47.0 54.0
115 51.0 47.0 100.0 83.0
117 50.0 54.0 83.0 100.0
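If the loop becomes a bottleneck on larger files, the rapidfuzz package (the maintained successor to fuzzywuzzy) ships a vectorized pairwise scorer. A sketch of building the same matrix, assuming rapidfuzz is installed:
from rapidfuzz import fuzz
from rapidfuzz.process import cdist
import pandas as pd

msgs = df['messages'].tolist()
similarity = cdist(msgs, msgs, scorer=fuzz.ratio)  # full K x K score matrix
df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)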
