Why can't I web scrape the table that I want? - Python

I am new to BeautifulSoup and I wanted to try out some web scraping. For my little project, I wanted to get the Golden State Warriors' win rate from Wikipedia. I was planning to get the table that has that information and turn it into a pandas DataFrame so I could graph it over the years. However, my code selects the Table Key table instead of the Seasons table. I know this is because they are the same type of table (wikitable), but I don't know how to solve this problem. I am sure that there is an easy explanation that I am missing. Can someone please explain how to fix my code and how I could choose which tables to web scrape in the future? Thanks!
import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"  # Wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable')  # this is the problem: it finds the first wikitable (the Table Key), not the Seasons table
c_year = []
c_rate = []
for row in c_table.findAll('tr'):  # setup for dataframe
    cells = row.findAll('td')
    if len(cells) == 13:
        c_year.append(cells[0].find(text=True))  # append mutates the list in place; don't reassign its (None) return value
        c_rate.append(cells[9].find(text=True))
print(c_year, c_rate)

Use pd.read_html to get all the tables. It returns a list of DataFrames: tables[0] through tables[17] in this case.
import pandas as pd
# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')
print(len(tables))
>>> 18
tables[0]
0 1
0 AHC NBA All-Star Game Head Coach
1 AMVP All-Star Game Most Valuable Player
2 COY Coach of the Year
3 DPOY Defensive Player of the Year
4 Finish Final position in division standings
5 GB Games behind first-place team in division[b]
6 Italics Season in progress
7 Losses Number of regular season losses
8 EOY Executive of the Year
9 FMVP Finals Most Valuable Player
10 MVP Most Valuable Player
11 ROY Rookie of the Year
12 SIX Sixth Man of the Year
13 SPOR Sportsmanship Award
14 Wins Number of regular season wins
# display all dataframes in tables
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
Select a specific table
df_i_want = tables[x]  # x is the index of the table you want, 0-indexed
# delete the list of tables when no longer needed
del tables
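If you want the Seasons table specifically rather than relying on the list position, here is a rough sketch of two options; the index 1 and the 'Win%' match string are assumptions about the current layout of the Wikipedia page, so adjust them after inspecting the page:
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons"

# Option 1: take the second wikitable, assuming the Table Key comes first on the page
soup = BeautifulSoup(urllib.request.urlopen(url), "lxml")
seasons_table = soup.find_all('table', class_='wikitable')[1]

# Option 2: let read_html keep only tables whose text matches a string
# ('Win%' is assumed to appear in the Seasons table header)
seasons_df = pd.read_html(url, match='Win%')[0]
print(seasons_df.head())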


Data/Table Scraping from Website using Python

I'm trying to scrape a data from a table on a website.
However, I am continuously running into "ValueError: cannot set a row with mismatched columns".
The set-up is:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('div', id='content')
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
my_data = pd.DataFrame(columns=headers)
my_data = my_data.iloc[:, :-4]
Here, I was able to make an empty dataframe with the same headers as the table (I did the iloc slice because there were some repeating columns at the end).
Now, I wanted to fill in the empty dataframe through:
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(my_data)
    my_data.loc[length] = row
However, as mentioned, I get "ValueError: cannot set a row with mismatched columns" in this line: length = len(my_data).
I would really appreciate any help to solve this problem and to fill in the empty dataframe.
Thanks in advance.
Rather than trying to fill an empty DataFrame, it would be simpler to utilize .read_html, which returns a list of DataFrames after parsing every table tag within the HTML.
Even though this page has only two tables ("Top Youtube channels" and "Top Youtube channels - detail stats"), 3 DataFrames are returned because the second table is split into two table tags between rows 12 and 13 for some reason; but they can all be combined into one DataFrame.
dfList = pd.read_html(url)  # OR
# dfList = pd.read_html(page.text)  # OR
# dfList = pd.read_html(soup.prettify())

allTime = dfList[0].set_index(['rank', 'Youtuber'])

# (header row in 1st half so 2nd half reads as headerless to pandas)
dfList[2].columns = dfList[1].columns
perYear = pd.concat(dfList[1:]).set_index(['rank', 'Youtuber'])

columns_ordered = [
    'started', 'category', 'subscribers', 'subscribers/year',
    'video views', 'Video views/Year', 'video count', 'Video count/Year'
]  # re-order columns as preferred
combinedDf = pd.concat([allTime, perYear], axis='columns')[columns_ordered]
If the [columns_ordered] part is omitted from the last line, then the expected column order would be 'subscribers', 'video views', 'video count', 'category', 'started', 'subscribers/year', 'Video views/Year', 'Video count/Year'.
combinedDf should then hold the all-time and per-year stats side by side, indexed by rank and Youtuber.
You can try to use pd.read_html to read the table into a dataframe:
import pandas as pd
url = "https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en"
df = pd.read_html(url)[0]
print(df)
Prints:
rank Youtuber subscribers video views video count category started
0 1 ✿ Kids Diana Show 106000000 86400421379 1052 People & Blogs 2015
1 2 Movieclips 58500000 59672883333 39903 Film & Animation 2006
2 3 Ryan's World 34100000 53568277882 2290 Entertainment 2015
3 4 Toys and Colors 38300000 44050683425 901 Entertainment 2016
4 5 LooLoo Kids - Nursery Rhymes and Children's Songs 52200000 30758617681 605 Music 2014
5 6 LankyBox 22500000 30147589773 6913 Comedy 2016
6 7 D Billions 24200000 27485780190 582 NaN 2019
7 8 BabyBus - Kids Songs and Cartoons 31200000 25202247059 1946 Education 2016
8 9 FGTeeV 21500000 23255537029 1659 Gaming 2013
...and so on.
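If you would rather keep the BeautifulSoup loop from the question instead of read_html, the usual way around the mismatched-columns ValueError is to build a list of rows and skip or trim rows whose cell count does not line up with the headers. A rough sketch reusing table1 and my_data from the question (it assumes the extra cells sit at the end of each row, as the repeated trailing columns suggest):
rows = []
n_cols = len(my_data.columns)
for tr in table1.find_all('tr')[1:]:
    row = [td.text for td in tr.find_all('td')]
    if len(row) >= n_cols:            # skip short rows that can't fill every column
        rows.append(row[:n_cols])     # trim the repeated trailing cells

my_data = pd.DataFrame(rows, columns=my_data.columns)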

How can I scrape tables that seem to be hidden by jquery?

I'm trying to scrape these words with their meanings on this website. I scraped the first table, but even after revealing word list 2 by clicking on it, bs4 can't find that table (or any of the other hidden tables). Is there anything different I'm meant to do for toggled/hidden elements like this?
Here's what I used to access the first table:
import requests
from bs4 import BeautifulSoup

root = "https://www.graduateshotline.com/gre-word-list.html#x2"
content = requests.get(root).text
soup = BeautifulSoup(content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'tablex border1'})[0]
print(table)
The hidden word lists are not in the initial HTML; they appear to be loaded on demand from separate URLs such as https://www.graduateshotline.com/gre/load.php?file=list2.html, which you can read directly:
import pandas as pd

df = pd.read_html('https://www.graduateshotline.com/gre/load.php?file=list2.html',
                  attrs={'class': 'tablex border1'})[0]
print(df)
Output:
0 1
0 multifarious varied; motley; greatly diversified
1 substantiation giving facts to support (statement)
2 feud bitter quarrel over a long period of time
3 indefatigability not easily exhaustible; tirelessness
4 convoluted complicated;coiled; twisted
.. ... ...
257 insensible unconscious; unresponsive; unaffected
258 gourmand a person who is devoted to eating and drinking...
259 plead address a court of law as an advocate
260 morbid diseased; unhealthy (e.g.. about ideas)
261 enmity hatred being an enemy
[262 rows x 2 columns]
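If you want all of the word lists rather than just list 2, the same load.php pattern should apply to the other files; a minimal sketch (the file names list1.html through list6.html are an assumption about how the site numbers the remaining lists, so check the page's network requests first):
import pandas as pd

frames = []
for n in range(1, 7):  # assumed file names: list1.html .. list6.html
    url = f'https://www.graduateshotline.com/gre/load.php?file=list{n}.html'
    df = pd.read_html(url, attrs={'class': 'tablex border1'})[0]
    df.columns = ['word', 'meaning']
    frames.append(df)

all_words = pd.concat(frames, ignore_index=True)
print(all_words.shape)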

Python categorize data in excel based on key words from another excel sheet

I have two Excel sheets; one has four different categories with keywords listed under each. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and DataFrames to compare, but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way, but I am new to pandas.
Here is an example:
Category sheet:

Service    Experience
fast       bad
slow       easy

Data Sheet:

Review #   Location   Review
1          New York   "The service was fast!"
2          Texas      "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code, note I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then, to get matches, use str.extractall, aggregate the matches per row, and join the result back onto the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
Alternatively for an existence test use any on level=0:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).any(level=0)
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iterate over the columns with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
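For reference, here is a minimal self-contained version of the regex approach with the two sheets replaced by small inline DataFrames; it aggregates with .notna().groupby(level=0).any(), which gives the same True/False summary and also works on newer pandas versions where .any(level=0) is no longer supported:
import pandas as pd

# stand-ins for the two Excel sheets
cat = pd.DataFrame({'Service': ['fast', 'slow'], 'Experience': ['bad', 'easy']})
reviews = pd.DataFrame({
    'Review #': [1, 2],
    'Location': ['New York', 'Texas'],
    'Review': ['The service was fast and easy!', 'Overall it was a bad experience for me'],
})

# one named group per category column, e.g. (?P<Service>fast|slow)|(?P<Experience>bad|easy)
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])

# True/False per category for each review
matches = reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
print(reviews.join(matches))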

Python NLTK: How to find similarity between user input and excel data

So I'm trying to create a Python chatbot. I have an Excel file with hundreds of rows which looks like the below:
QuestionID   Question                                   Answer      Document
Q1           Where is London?                           In the UK   Google
Q2           How many football players on the pitch?    22          Google
Now when the user inputs a question, such as "Where is London?" or "Where is London" I want it to return all the text in that row.
I can successfully print what is in the Excel file, but I'm not sure how to go through all the rows and find the row which is similar to or matches the user's question.
import csv

text = []
with open("dataset.csv") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        text.append((row['Question'], row['Answer'], row['Document']))
print(text)
You could use pandas:
import pandas as pd

# load data into a pandas DataFrame from the file
df = pd.read_csv('dataset.csv')

# given question
question = "How many football players on the pitch?"

# df.loc[(condition)] gives you all the rows that satisfy your condition;
# in this case the questions have to match entirely
rows_with_similiar_question = df.loc[df["Question"] == question]

# print the first answer
print(rows_with_similiar_question['Answer'].values[0])
You probably wouldn't want to do an exact match, since that means it will be case sensitive, and need exact punctuation, no misspellings, etc.
I would look at using fuzzywuzzy to find a match score. Then you can return the solution that best matches the question:
Example:
from fuzzywuzzy import fuzz
import pandas as pd

lookup_table = pd.DataFrame({
    'QuestionID': ['Q1', 'Q2', 'Q3'],
    'Question': ['Where is London?', 'Where is Rome?', 'How many football players on the pitch?'],
    'Answer': ['In the UK', 'In Italy', 22],
    'Document': ['Google', 'Google', 'Google']})

question = 'how many players on a football pitch?'

lookup_table['score'] = lookup_table.apply(lambda x: fuzz.ratio(x.Question, question), axis=1)
lookup_table = lookup_table.sort_values('score', ascending=False)
The result table:
print (lookup_table.to_string())
QuestionID Question Answer Document score
2 Q3 How many football players on the pitch? 22 Google 71
0 Q1 Where is London? In the UK Google 34
1 Q2 Where is Rome? In Italy Google 27
Give the answer to the top choice:
print (lookup_table.iloc[0]['Answer'])
22
or since you want the row
print (lookup_table.head(1))
QuestionID Question Answer Document score
2 Q3 How many football players on the pitch? 22 Google 71
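If you also want to avoid answering when nothing matches well, you could wrap the lookup in a small helper with a minimum score; the function name and the threshold of 60 below are just illustrative choices, not part of the original answer:
def best_answer(question, lookup_table, min_score=60):
    # score every stored question against the user's question
    scores = lookup_table.apply(lambda x: fuzz.ratio(x.Question, question), axis=1)
    best = scores.idxmax()
    if scores[best] < min_score:
        return None  # nothing close enough
    return lookup_table.loc[best]

match = best_answer('how many players on a football pitch?', lookup_table)
if match is not None:
    print(match['Answer'])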

Nested string in a list - need to split nested string to help turn it into a dataframe

I'm working on a web scraping project with NBA stats. When I am scraping, I can get all of the information. However, all of the stats come back as one string, which, when turned into a dataframe, puts all the stats in one column. I'm attempting to split this string and replace it in its own nested area.
Hopefully the image explains this better.
I am webscraping from https://stats.nba.com/players/traditional/?sort=PTS&dir=-1 using selenium because I am planning on clicking though all of the pages
code I've done so far
here is the function I'm working on:
In the last line I would like to replace z[2] with the split version I've created. When I try z[2] = z[2].split(' ') I get the error AttributeError: 'list' object has no attribute 'split'
new_split = []
for i in player:
    player_stats.append(i.text.split('\n'))
for z in player_stats:
    new_split.append(z[2].split(' '))
You didn't mention where you're getting your data from. (I've updated the url in my code. It's still the same API, which returns information for all 457 players, so there is no need to use selenium to navigate to the other pages). The official nba website seems to be offering their data in JSON format, which is always desirable when web scraping:
import requests
import json
# url = "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2019-20&SeasonType=Regular+Season&StatCategory=PTS"
url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=&Weight="
response = requests.get(url)
response.raise_for_status()
data = json.loads(response.text)
players = []
for player_data in data["resultSet"]["rowSet"]:
    player = dict(zip(data["resultSet"]["headers"], player_data))
    players.append(player)

for player in players[:10]:
    print(f"{player['PLAYER']} ({player['TEAM_ABBREVIATION']}) is rank {player['RANK']} with a GP of {player['GP']}")
Output:
James Harden (HOU) is rank 1 with a GP of 18
Giannis Antetokounmpo (MIL) is rank 2 with a GP of 19
Luka Doncic (DAL) is rank 3 with a GP of 18
Bradley Beal (WAS) is rank 4 with a GP of 17
Trae Young (ATL) is rank 5 with a GP of 18
Damian Lillard (POR) is rank 6 with a GP of 18
Karl-Anthony Towns (MIN) is rank 7 with a GP of 16
Anthony Davis (LAL) is rank 8 with a GP of 18
Brandon Ingram (NOP) is rank 9 with a GP of 15
LeBron James (LAL) is rank 10 with a GP of 19
Note: I have no idea what a "GP" is - I just picked that for demonstration. Here's a screenshot of Chrome's network logger, showing a small part of the expanded JSON resource (EDIT: the JSON response from the new URL looks exactly the same, except some of the headers are different, like "TEAM" -> "TEAM_ABBREVIATION"):
You can see the values - which you're struggling to extract out of one giant string - nicely separated into separate elements. The code I posted above creates key-value pairs using the headers ("PLAYER_ID", "RANK", etc. found in data["resultSet"]["headers"]) and these values.
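Since the original goal was a dataframe, the headers/rowSet pair in the JSON can also be handed straight to pandas; a small sketch building on the data dictionary from the code above:
import pandas as pd

# one column per header, one row per player
df = pd.DataFrame(data["resultSet"]["rowSet"], columns=data["resultSet"]["headers"])
print(df[['PLAYER', 'TEAM_ABBREVIATION', 'GP']].head())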
If the second column is a string, you could try to split this string into a list, turn each element of this list into a series, and then concat this new data frame with the first two columns of the original data frame.
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
Example:
df = pd.DataFrame({"0": [1, 2],
                   "1": ["Name1", "Name2"],
                   "2": [["HOU 30 80"], ["LA 30 50"]]})
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
0 1 0 1 2
0 1 Name1 HOU 30 80
1 2 Name2 LA 30 50
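An equivalent shortcut uses the pandas string accessor, assuming (as in the example above) that each entry in column "2" is a one-element list holding the stats string:
df_stats = df["2"].str[0].str.split(" ", expand=True)  # unwrap the list, then split on spaces
df_end = pd.concat([df[["0", "1"]], df_stats], axis=1)
print(df_end)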
