Python NLTK: How to find similarity between user input and Excel data

So I'm trying to create a Python chatbot. I have an Excel file with hundreds of rows which looks like the one below:
QuestionID  Question                                  Answer     Document
Q1          Where is London?                          In the UK  Google
Q2          How many football players on the pitch?  22          Google
Now when the user inputs a question, such as "Where is London?" or "Where is London" I want it to return all the text in that row.
I can successfully print what is in the Excel file, but I'm not sure how to go through all the rows and find the row which is similar to or matches the user's question.
import csv

text = []
with open("dataset.csv") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        text.append((row['Question'], row['Answer'], row['Document']))
print(text)

You could use pandas
import pandas as pd

# load the data into a pandas DataFrame from the file
df = pd.read_csv('dataset.csv')
# the given question
question = "How many football players on the pitch?"
# df.loc[(condition)] gives you all the rows that satisfy your condition;
# in this case the questions have to match entirely
rows_with_similar_question = df.loc[df["Question"] == question]
# print the first answer
print(rows_with_similar_question['Answer'].values[0])
More on pandas and DataFrames in the pandas documentation.
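The exact-match lookup above can also be made a bit more forgiving without going all the way to fuzzy matching. A minimal sketch building on the df and question above (my variant, not part of this answer), normalizing both sides before comparing:

normalized = df["Question"].str.strip().str.lower()
# compare case-insensitively, ignoring surrounding whitespace
rows = df.loc[normalized == question.strip().lower()]
if not rows.empty:
    print(rows['Answer'].values[0])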

You probably wouldn't want to do an exact match, since that would be case sensitive and require exact punctuation, no misspellings, and so on.
I would look at using fuzzywuzzy to find a match score. Then you can return the solution that best matches the question:
Example:
from fuzzywuzzy import fuzz
import pandas as pd

lookup_table = pd.DataFrame({
    'QuestionID': ['Q1', 'Q2', 'Q3'],
    'Question': ['Where is London?', 'Where is Rome?', 'How many football players on the pitch?'],
    'Answer': ['In the UK', 'In Italy', 22],
    'Document': ['Google', 'Google', 'Google']})

question = 'how many players on a football pitch?'

# score each stored question against the user's question
lookup_table['score'] = lookup_table.apply(lambda x: fuzz.ratio(x.Question, question), axis=1)
lookup_table = lookup_table.sort_values('score', ascending=False)
The result table:
print(lookup_table.to_string())

  QuestionID                                  Question     Answer Document  score
2         Q3  How many football players on the pitch?         22   Google     71
0         Q1                          Where is London?  In the UK   Google     34
1         Q2                            Where is Rome?   In Italy   Google     27
Give the answer to the top choice:
print(lookup_table.iloc[0]['Answer'])
22
Or, since you want the whole row:
print(lookup_table.head(1))

  QuestionID                                  Question Answer Document  score
2         Q3  How many football players on the pitch?     22   Google     71
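As a side note, fuzzywuzzy also ships a process helper that returns the single best match directly, which can replace the apply/sort step. A minimal sketch using the lookup_table and question from above (scorer=fuzz.ratio keeps the same scoring as this answer):

from fuzzywuzzy import fuzz, process

# extractOne scans every candidate question and returns the best (match, score) pair
best_match, score = process.extractOne(question, lookup_table['Question'].tolist(),
                                       scorer=fuzz.ratio)
print(lookup_table.loc[lookup_table['Question'] == best_match])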

Related

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the names of movie titles and TV series.
Based on specific keywords, I want to classify each row as Movie or Series. However, because the brackets leave no space before the keywords, the keywords are not being picked up by the str.contains() function and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np

watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
                           ['James Bond'],
                           ['How I met your Mother (Avnsitt 3)'],
                           ['random name'],
                           ['Random movie 3 Episode 8383893']],
                          columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', 1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that does not check whether a string is exactly the same, but just whether it contains the given word, similar to the LIKE functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
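If the titles mix cases (the question passed case=False), the same pattern can be made case-insensitive. A small variant of the line above:

# same approach, but case-insensitive, mirroring the case=False in the question's workaround
watched_df['Film_Type'] = watched_df['Title'].str.contains(
    r'(?:Episode|Avnsitt)', case=False).map({True: 'Series', False: 'Movie'})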

How to split two first names that are joined together into two different words in Python

I am trying to split misspelled first names. Most of them are joined together. I was wondering if there is any way to separate two first names that are together into two different words.
For example, if the misspelled name is trujillohernandez, it should be separated into trujillo hernandez.
I am trying to create a function that can do this for a whole column with thousands of misspelled names like the example above. However, I haven't been successful. Spell-checker libraries do not work, given that these are Hispanic first names.
I would be really grateful if you can help to develop some sort of function to make it happen.
As noted in the comments above, not having a list of possible names will cause a problem. However, while perhaps not perfect, here is something to try...
Given a dataframe example like...
                Name
0         sofíagomez
1    isabelladelgado
2        luisvazquez
3      juanhernandez
4  valentinatrujillo
5    camilagutierrez
6          joséramos
7      carlossantana
Code (Python):
import pandas as pd
import requests
# longest list of hispanic surnames I could find in a table
url = r'https://namecensus.com/data/hispanic.html'
# download the table into a frame and clean up the header
page = requests.get(url)
table = pd.read_html(page.text.replace('<br />',' '))
df = table[0]
df.columns = df.iloc[0]
df = df[1:]
# move the frame of surnames to a list
last_names = df['Last name / Surname'].tolist()
last_names = [each_string.lower() for each_string in last_names]
# create a test dataframe of joined firstnames and lastnames
data = {'Name' : ['sofíagomez', 'isabelladelgado', 'luisvazquez', 'juanhernandez', 'valentinatrujillo', 'camilagutierrez', 'joséramos', 'carlossantana']}
df = pd.DataFrame(data, columns=['Name'])
# create new columns for the matched names
lastname = '({})'.format('|'.join(last_names))
df['Firstname'] = df.Name.str.replace(str(lastname)+'$', '', regex=True).fillna('--not found--')
df['Lastname'] = df.Name.str.extract(str(lastname)+'$', expand=False).fillna('--not found--')
# output the dataframe
print('\n\n')
print(df)
Outputs:
                Name  Firstname   Lastname
0         sofíagomez      sofía      gomez
1    isabelladelgado   isabella    delgado
2        luisvazquez       luis    vazquez
3      juanhernandez       juan  hernandez
4  valentinatrujillo  valentina   trujillo
5    camilagutierrez     camila  gutierrez
6          joséramos       josé      ramos
7      carlossantana     carlos    santana
Further cleanup may be required but perhaps it gets the majority of names split.
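If you want to review the rows the surname list missed, the '--not found--' sentinel set by the fillna calls above makes them easy to pull out:

# rows where no surname from the downloaded list matched the end of the name
unsplit = df[df['Lastname'] == '--not found--']
print(unsplit)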

Python: categorize data in Excel based on keywords from another Excel sheet

I have two Excel sheets: one has four different categories with keywords listed. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and DataFrames to compare, but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way, but I am new to pandas.
Here is an example:
Category sheet:

Service  Experience
fast     bad
slow     easy
Data Sheet:

Review #  Location  Review
1         New York  "The service was fast!"
2         Texas     "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code; note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd

# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# DataFrame for review column
reviews = pd.DataFrame(data["reviews"])
# DataFrames for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then to get matches, use str.extractall, aggregate into a summary, and join the result back onto the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
   Review #  Location                                   Review Service Experience
0         1  New York           The service was fast and easy!  [fast]     [easy]
1         2     Texas  Overall it was a bad experience for me       []      [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
   Review #  Location                                   Review Service Experience
0         1  New York           The service was fast and easy!    fast       easy
1         2     Texas  Overall it was a bad experience for me                bad
Alternatively for an existence test use any on level=0:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).any(level=0)
)
   Review #  Location                                   Review  Service  Experience
0         1  New York           The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me     False        True
Or iterate over the columns with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
   Review #  Location                                   Review  Service  Experience
0         1  New York           The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me     False        True
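One caveat worth adding (my note, not from the answer above): plain substring patterns also match inside longer words, e.g. 'fast' inside 'breakfast'. Wrapping the keyword alternation in word boundaries avoids that:

# \b word boundaries force whole-word matches, so 'fast' no longer matches inside 'breakfast'
for col in cat.columns:
    pattern = r'\b(?:{})\b'.format('|'.join(cat[col].dropna()))
    reviews[col] = reviews['Review'].str.contains(pattern)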

Why can't I webscrape the table that I want?

I am new to BeautifulSoup and I wanted to try out some web scraping. For my little project, I wanted to get the Golden State Warriors' win rate from Wikipedia. I was planning to get the table that had that information and make it into a pandas DataFrame so I could graph it over the years. However, my code selects the Table Key table instead of the Seasons table. I know this is because they are the same type of table (wikitable), but I don't know how to solve this problem. I am sure that there is an easy explanation that I am missing. Can someone please explain how to fix my code and how I could choose which tables to web scrape in the future? Thanks!
import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"  # wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable')  # this is the problem: find() returns the first wikitable
c_year = []
c_rate = []
for row in c_table.findAll('tr'):  # setup for dataframe
    cells = row.findAll('td')
    if len(cells) == 13:
        c_year.append(cells[0].find(text=True))
        c_rate.append(cells[9].find(text=True))
print(c_year, c_rate)
Use pd.read_html to get all the tables. This function returns a list of dataframes: tables[0] through tables[17], in this case.
import pandas as pd
# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')
print(len(tables))
>>> 18
tables[0]
          0                                              1
0       AHC                   NBA All-Star Game Head Coach
1      AMVP             All-Star Game Most Valuable Player
2       COY                              Coach of the Year
3      DPOY                   Defensive Player of the Year
4    Finish          Final position in division standings
5        GB  Games behind first-place team in division[b]
6   Italics                             Season in progress
7    Losses                Number of regular season losses
8       EOY                          Executive of the Year
9      FMVP                    Finals Most Valuable Player
10      MVP                           Most Valuable Player
11      ROY                             Rookie of the Year
12      SIX                          Sixth Man of the Year
13     SPOR                            Sportsmanship Award
14     Wins                  Number of regular season wins
# display all dataframes in tables (display() is available in Jupyter/IPython)
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
Select specific table
df_i_want = tables[x] # x is the specified table, 0 indexed
# delete tables
del(tables)
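If you would rather stay with BeautifulSoup, note that find() only ever returns the first match. A minimal sketch using find_all() and indexing instead (that the Seasons table is the second wikitable is an assumption you would confirm by inspecting the page):

import urllib.request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

# find_all returns every matching table in document order, so you can index into it
tables = soup.find_all('table', class_='wikitable')
seasons_table = tables[1]  # assumption: the Seasons table is the second wikitable on the page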

Search through a dataframe for a partial string match and put the rows into a new dataframe with only their IDs

I have a dataframe of publications with the following rows:
publication_ID , title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704
I ask the user for a string and use that string to search through the titles.
The goal: Search through the dataframe to see if the title contains the word the user provides and return the rows in a new dataframe with just the title and publication_ID.
This is my code so far:
import re
import pandas as pd

publications = pd.read_csv(filepath, sep="|")
search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, title, regex):
    newdf = pd.DataFrame()
    for idx, search_term in publications['title'].iteritems():
        if re.search(regex, search_term):
            newdf = pd.concat([publications[publications['title'] == search_term], newdf], ignore_index=True)
    return newdf

print(stringDataFrame(publications, 'title', search_term))
Use a combination of .str.contains and .loc
publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]
Just be careful, because if your title is 'nightlife' and someone searches for 'night' this will return a match. If that's not your desired behavior then you may need .str.split instead.
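If you only want whole-word matches, one option (my sketch, not part of this answer) is a regex with word boundaries instead of .str.split:

import re

# \b anchors restrict the match to whole words, so searching for 'night'
# will not match 'nightlife'; re.escape guards against special characters in user input
pattern = r'\b{}\b'.format(re.escape(search_term))
publications.loc[publications.title.str.contains(pattern), ['title', 'publication_ID']]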
As jpp points out, str.contains is case sensitive. One simple fix is to just ensure everything is lowercase.
title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]
Now 'Lord', 'LoRD', 'lord', and all other permutations will return a valid match, and your original DataFrame keeps its capitalization unchanged.
A full example, but you should accept the answer above by @ALollz:
import pandas as pd
# you publications dataframe
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})
search_term = input('Enter the term you are looking for: ')
publications[['title','publication_ID']][publications['title'].str.contains(search_term)]
Enter the term you are looking for: Lord
                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
Per your error, you can filter out all np.nan values as part of the logic using the new code below:
import pandas as pd
import numpy as np

publications = pd.DataFrame({'title': ['The Odyssey', 'The Canterbury Tales', 'Inferno', 'The Lord of The Rings', 'Lord of The Flies', np.nan],
                             'publication_ID': [1, 2, 3, 4, 5, 6]})
search_term = input('Enter the term you are looking for: ')
# na=False treats the NaN title as a non-match instead of propagating NaN into the boolean mask
publications[['title', 'publication_ID']][publications['title'].str.contains(search_term, na=False)]
Enter the term you are looking for: Lord
                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
