How to scrape a table in a specific subsection of a page? - python

I'm trying to scrape a specific table from a page containing multiple tables. The URL I'm using includes the fragment of the subsection where the table is located.
So far I've tried scraping all the tables and selecting the one I need manually:
import requests
import pandas as pd
from bs4 import BeautifulSoup

wikiurl = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
table_class = "toccolours"
tables = soup.find_all('table', table_class)  # find all tables
# and pick the right one by position
df = pd.read_html(str(tables[15]))
Is it possible to use the fragment in the URL (#Strikeforce_Challengers:_Britt_vs._Sayers) to scrape only the table in that section?

You are on the right track. Simply split() the URL once on #, split the last element of the result on _, and join() the pieces with spaces, so you can use them in a CSS selector with :-soup-contains():
table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Turn the URL fragment back into the visible heading text, find the <h2>
# containing it, then take the following .toccolours table.
table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')
pd.read_html(str(table))[0]
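If the heading markup ever changes (Wikipedia has reshuffled it over the years), you can instead anchor on the fragment itself, since it doubles as the id of the heading element the browser jumps to. A minimal sketch, assuming the fragment matches an element id in the served HTML:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
fragment = url.split('#')[-1]  # 'Strikeforce_Challengers:_Britt_vs._Sayers'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Anchor on the id the fragment points at, then walk forward
# to the first .toccolours table after that heading.
anchor = soup.find(id=fragment)
table = anchor.find_next('table', class_='toccolours')
df = pd.read_html(str(table))[0]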

Related

BeautifulSoup only identifying 2 of 5 tables

I'm working on my first Python project and hit a snag. I'm trying to use BeautifulSoup to scrape data from some tables on this site: https://www.basketball-reference.com/awards/awards_2020.html
When I use the following code, I can get data from the first two tables, but the other three aren't recognized (i.e. len(tables) is 2 when it should be 5):
import requests
from bs4 import BeautifulSoup

awardyear = 2020
url = 'https://www.basketball-reference.com/awards/awards_{}.html'.format(awardyear)
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tables = soup.find_all('table')
len(tables)
When I print soup, all the tables are in the HTML, so I'm not sure why the last three aren't recognized. I've spent some time trying to spot a difference between the tables that are and aren't being recognized, but have come up empty so far.
This is happening because the other 3 tables are inside HTML comments (<!-- ... -->), which the parser skips.
You can extract them by collecting the nodes of type Comment and re-parsing their contents:
import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.basketball-reference.com/awards/awards_2020.html"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Find all comment nodes and re-parse their contents as HTML
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
comment_soup = BeautifulSoup("".join(comments), "html.parser")

print("The length of tables:", len(soup.find_all("table")))
print("The length of tables within comments:", len(comment_soup.find_all("table")))
Output:
The length of tables: 2
The length of tables within comments: 3
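If you want the data and not just the count, you can feed the re-parsed tables straight to pandas; a short sketch reusing comment_soup from above:
import pandas as pd

# Parse each table hiding inside the comments into its own DataFrame
hidden_tables = [pd.read_html(str(t))[0] for t in comment_soup.find_all("table")]
print(len(hidden_tables))  # 3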

Scraping table returns only “table” and not the contents of the table

Scraping the table returns only the “table” element and not its contents.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://data.eastmoney.com/gdhs/detail/600798.html"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table')
print(table)
Your code found the table just fine. Because a table is composed of nested elements (tr/td), you have to loop through those to get the inner text of the cells:
# This grabs the first table on the page. For the second occurrence,
# use soup.find_all('table')[1], and so on.
table = soup.find_all('table')[0]

# Use a slice to skip the header row. To include the headers, use table('tr')[0:]
for row in table('tr')[1:]:
    print(row('td')[0].getText().strip())
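If you want every cell rather than just the first column, the same loop generalizes:
# Collect the text of every cell in every data row
rows = []
for row in table('tr')[1:]:
    rows.append([cell.getText().strip() for cell in row('td')])
print(rows)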

How to access specific table shown in inspect using Python and BeautifulSoup for web scraping

I am working on web scraping using Python and BeautifulSoup. My goal is to pull member data from https://thehia.org/directory?&tab=1. There are around 1685 records.
When I view the page source in Chrome, I cannot find the table; it seems the data is pulled in dynamically. But when I use Chrome's inspect option, I can find the "membersTable" table in the div that I need.
How can I use BeautifulSoup to access the membersTable that I can see in the inspector?
You can mimic the POST request the page makes for its content, then use hjson to handle the unquoted keys in the string pulled out of the response:
import requests, hjson
import pandas as pd

# Mimic the POST request the page itself makes for the member data
data = {'formId': '3721260'}
r = requests.post('https://thehia.org/Sys/MemberDirectory/LoadMembers', data=data)

# Strip the anti-JSON-hijacking prefix, then parse the loosely formatted JSON
data = hjson.loads(r.text.replace('while(1); ', ''))
total = data['TotalCount']
structure = data['JsonStructure']
members = hjson.loads(structure)

df = pd.DataFrame(
    [[member[k][0]['v'] for k in member.keys()] for member in members['members'][0]],
    columns=['Organisation', 'City', 'State', 'Country'])
print(df)
Try this one (it will only work if the table is already present in the HTML that requests receives):
import requests
from bs4 import BeautifulSoup

url = "https://thehia.org/directory?&tab=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

table = soup.find('table', attrs={'class': 'membersTable'})
row_list = []
for row in table.findAll('tr', {'class': ['normal']}):
    data = []
    for cell in row.findAll('td'):
        data.append(cell.text)
    row_list.append(data)
print(row_list)
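Either way, once you have row_list you can hand it to pandas; the column names below are an assumption borrowed from the first answer, so adjust them to the table's real headers:
import pandas as pd

# Column names assumed from the first answer; change to match the real table
df = pd.DataFrame(row_list, columns=['Organisation', 'City', 'State', 'Country'])
print(df.head())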

Download table from wunderground with beautiful soup

I would like to download the Weather History & Observations table from the following link:
https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=
This is the code I have so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests

link = 'https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo='
resp = requests.get(link)
c = resp.text
soup = BeautifulSoup(c, 'html.parser')
I would like to know the next step to access the table info at the bottom of the page (assuming this website's format allows it).
Thank you
You can use find to locate the table and find_all to collect its rows:
table = soup.find('table', class_="responsive obs-table daily")
rows = table.find_all('tr')
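From there, a minimal sketch for turning the rows into a DataFrame, assuming the first row holds the headers and every data row has the same number of cells:
import pandas as pd

# First row as the header, the rest as data
header = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
data = [[cell.get_text(strip=True) for cell in r.find_all('td')] for r in rows[1:]]
df = pd.DataFrame(data, columns=header)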

Wikipedia table scraping using python

I am trying to scrape tables from Wikipedia. I wrote a table scraper that downloads a table and saves it as a pandas DataFrame.
This is the code:
from bs4 import BeautifulSoup
import pandas as pd
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, 'lxml')  # Parse the HTML as a string
print soup
# Grab the first matching table
table = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print table
rank = []
country = []
pop = []
date = []
per = []
source = []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    col1 = col[0].string.strip()
    rank.append(col1)
    col2 = col[1].string.strip()
    country.append(col2)
    col3 = col[2].string.strip()
    pop.append(col3)
    col4 = col[3].string.strip()
    date.append(col4)
    col5 = col[4].string.strip()
    per.append(col5)
    col6 = col[5].string.strip()
    source.append(col6)
columns = {'Rank': rank, 'Country': country, 'Population': pop, 'Date': date, 'Percentage': per, 'Source': source}
# Create a dataframe from the columns variable
df = pd.DataFrame(columns)
df
But it is not downloading the table. The problem is in this section:
table = soup.find("table", {"class": "wikitable sortable jquery-tablesorter"})
print table
where the output is None.
As far as I can see, there is no such element on that page. The main table has class="wikitable sortable" but not jquery-tablesorter; that class is added by JavaScript in the browser after the page loads, so it never appears in the HTML your script receives.
Make sure you know which element you are trying to select, and check that your program sees the same elements you see in the browser before writing your selector.
Per the docs, you can match on the exact class string, but drop the jquery-tablesorter part since it is not in the served HTML:
soup.find("table", class_="wikitable sortable")
Also, consider using requests instead of urllib2.
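For reference, a modern equivalent with requests and pandas does the whole job in a few lines; a sketch, assuming the population table is the first one pd.read_html finds on the page:
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
# read_html parses every <table> it finds; the population list is
# assumed here to be the first one on the page.
df = pd.read_html(html)[0]
print(df.head())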
