I'm working on my first python project and hit a snag. I'm trying to use BeautifulSoup to scrape data from some tables on this site: https://www.basketball-reference.com/awards/awards_2020.html
When I use the following code, I am able to get data from the first two tables, but the other three aren't recognized (i.e. len(tables) is 2 when it should be 5).
import requests
from bs4 import BeautifulSoup

awardyear = 2020  # year of the awards page to fetch
url = 'https://www.basketball-reference.com/awards/awards_{}.html'.format(awardyear)
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tables = soup.find_all('table')
len(tables)
When I print soup, all the tables are in the html so I'm not sure why the last three aren't recognized. I've spent some time trying to spot a difference between the tables that are/aren't being recognized, but have come up empty so far.
This is happening because the other three tables are inside HTML comments (<!-- ... -->), so the parser sees them as comment text rather than as tags.
You can extract the tables by checking which text nodes are of type Comment:
import requests
from bs4 import BeautifulSoup, Comment

URL = "https://www.basketball-reference.com/awards/awards_2020.html"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Find all comment nodes, then re-parse their text so the hidden tables become tags
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
comment_soup = BeautifulSoup("".join(comments), "html.parser")

print("The length of tables:", len(soup.find_all("table")))
print("The length of tables within comments:", len(comment_soup.find_all("table")))
Output:
The length of tables: 2
The length of tables within comments: 3
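If you also want those hidden tables as DataFrames, here's a minimal sketch (assuming pandas is installed; pd.read_html parses each <table> string):

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

URL = "https://www.basketball-reference.com/awards/awards_2020.html"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# re-parse the comment text so the hidden tables become real tags
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
comment_soup = BeautifulSoup("".join(comments), "html.parser")

# pd.read_html turns each <table> into a DataFrame
dfs = [pd.read_html(str(t))[0] for t in comment_soup.find_all("table")]
print(len(dfs))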
I'm trying to scrape a specific table from a page containing multiple tables. The url I'm using includes the subsection where the table is located.
So far I have tried scraping all the tables and selecting the one I need manually:
import requests
import pandas as pd
from bs4 import BeautifulSoup

wikiurl = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
table_class = "toccolours"
table = soup.find_all('table', table_class)  # find all tables with that class
# and pick the right one manually
df = pd.read_html(str(table[15]))
Is it possible to use the information in the url #Strikeforce_Challengers:_Britt_vs._Sayers to only scrape the table in this section?
You are on the right track. Simply split() the url once by #, split the last element of the result by _, and join() the pieces with spaces so you can use them in a CSS selector with :-soup-contains():
table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')
pd.read_html(str(table))[0]
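For reference on the design: :-soup-contains() is a soupsieve extension that matches elements whose text contains the given string, so the selector first finds the h2 heading rebuilt from the URL fragment; the general sibling combinator ~ .toccolours then matches the toccolours tables that follow it, and select_one returns the first such match.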
I'm trying to extract the "10-K" url and append it into a list from the following site:
https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm
So basically I'm trying to extract the first document link in the filing table, the one that points at the 10-K itself rather than at a graphic or other sub-document.
I'm also trying to loop this code over multiple similar links, but I figured I should resolve this issue first.
Any ideas?
Hope this meets your requirement:
import requests
from bs4 import BeautifulSoup

URL = "https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

cells = soup.find_all("td")
href_list = []
for ele in cells:
    a_tag = ele.find("a")  # first link inside the cell, if any
    if a_tag:
        href_list.append(a_tag["href"])
print(href_list)
I'm not sure I understand your question, but if I got it right, this can help you:
from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm")
s = BeautifulSoup(page.content, "html.parser")

# the first link in the filing table is the 10-K document
print(s.find("table").find("a")["href"])
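If you want to loop this over several index pages and collect absolute URLs, here is a minimal sketch (the index_urls list is a placeholder for your other links, and note that sec.gov may insist on a User-Agent header):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# placeholder list: add your other filing-index URLs here
index_urls = [
    "https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm",
]

href_list = []
for index_url in index_urls:
    page = requests.get(index_url, headers={"User-Agent": "your-name your@email.example"})
    s = BeautifulSoup(page.content, "html.parser")
    link = s.find("table").find("a")["href"]
    # hrefs on these pages are relative, so resolve them against the page URL
    href_list.append(urljoin(index_url, link))

print(href_list)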
I just started programming.
I have the task of extracting data from an HTML page into Excel, using Python 3.7.
My problem is that the website contains more URLs inside, and behind those URLs there are again more URLs. I need the data behind the third URL.
My first problem is: how can I tell the program to choose only specific links from a ul rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways; one is to use find_all and be specific about the tags, like "a", just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract, so we can see the differences between the URLs.
PS: Sorry I can't make comments because of <50 reputation or I would have.
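For illustration, here is a minimal sketch of that regular-expression filtering (the HTML snippet is made up):

import re
from bs4 import BeautifulSoup

# a tiny made-up snippet to illustrate filtering hrefs with a regular expression
html = """
<ul>
  <li><a href="/katalog_a.html">A</a></li>
  <li><a href="/impressum.html">B</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# href=re.compile(...) keeps only links whose URL matches the pattern
for a in soup.find_all("a", href=re.compile("katalog_")):
    print(a["href"])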
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")

for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        # use fresh soup objects so the outer loop's soup isn't clobbered
        baustein_soup = BeautifulSoup(response, 'html.parser')
        secondlink = "https://www.bsi.bund.de/" + str((baustein_soup.find("a", {"class": "RichTextIntLink Basepage"})["href"]).split(';')[0])
        res = urllib.request.urlopen(secondlink).read()
        second_soup = BeautifulSoup(res, 'html.parser')
        listoftext = second_soup.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
I am trying to scrape 2 tables from a webpage simultaneously.
BeautifulSoup finds the first table no problem, but no matter what I try it cannot find the second table. Here is the webpage: Hockey Reference: Justin Abdelkader.
It is the table underneath the Playoffs header.
Here is my code.
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014', timeout=None).read()
soup = bs.BeautifulSoup(sauce, 'html5lib')
table = soup.find_all('table')
print(len(table))
Which always prints 1.
If I print(soup) and use the search function in my terminal, I can locate 2 separate table tags. I don't see any JavaScript that would be hindering BS4 from finding the tag. I have also tried finding the table by id and class; even the parent div of the table seems to be unfindable. Does anyone have any idea what I could be doing wrong?
This happens because JavaScript loads the additional information. These days requests_html can render a page's JavaScript content along with its HTML:
pip install requests-html
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014')
r.html.render()  # renders the JavaScript (downloads Chromium on first use)
res = r.html.find('table')
print(len(res))
Output:
4
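If you then want those tables as DataFrames, here is a small sketch along the same lines (r.html.html is the rendered markup, which pandas can parse):

import pandas as pd
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014')
r.html.render()  # downloads Chromium on first use

# r.html.html holds the rendered markup; pandas parses every <table> in it
dfs = pd.read_html(r.html.html)
print(len(dfs))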
The second table is inside an HTML comment (<!-- ... <table class=... -->). I guess that's why BeautifulSoup doesn't find it.
Looks like that table is a widget: click "Share & more" -> "Embed this Table" and you'll get a script with a link:
https://widgets.sports-reference.com/wg.fcgi?css=1&site=hr&url=%2Fplayers%2Fa%2Fabdelju01%2Fgamelog%2F2014&div=div_gamelog_playoffs
How can we parse it?
import requests
import bs4
url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=hr&url=%2Fplayers%2Fa%2Fabdelju01%2Fgamelog%2F2014&div=div_gamelog_playoffs'
widget = requests.get(url).text
# strip the document.write('...') wrapper from each line
# (str.removeprefix/removesuffix, Python 3.9+, avoid the lstrip/rstrip character-set pitfall)
fixed = '\n'.join(s.removeprefix("document.write('").removesuffix("');") for s in widget.splitlines())
soup = bs4.BeautifulSoup(fixed, "html.parser")
soup.find('td', {'data-stat': "date_game"}).text # => '2014-04-18'
Voila!
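As a follow-up, the cleaned widget markup can go straight into pandas (a sketch; str.removeprefix/removesuffix need Python 3.9+):

import requests
import pandas as pd

url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=hr&url=%2Fplayers%2Fa%2Fabdelju01%2Fgamelog%2F2014&div=div_gamelog_playoffs'
widget = requests.get(url).text
fixed = '\n'.join(
    s.removeprefix("document.write('").removesuffix("');")
    for s in widget.splitlines()
)

# the cleaned markup contains the playoffs <table>; pandas parses it directly
df = pd.read_html(fixed)[0]
print(df.head())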
You can reach the comment lines with bs4's Comment, like this:
from bs4 import BeautifulSoup, Comment
from urllib.request import urlopen

search_url = 'https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014'
page = urlopen(search_url)
soup = BeautifulSoup(page, "html.parser")

table = soup.find_all('table')  # the html part with no comments
table_with_comment = soup.find_all(string=lambda text: isinstance(text, Comment))
[comment.extract() for comment in table_with_comment]
# printing table_with_comment shows every comment line

start = '<table class'
for comment in table_with_comment:
    if start in comment:
        print(comment)  # print each comment that contains <table class
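If you then want to work with a hidden table as real tags rather than comment text, here is a short sketch of re-parsing the matching comment:

from bs4 import BeautifulSoup, Comment
from urllib.request import urlopen

url = 'https://www.hockey-reference.com/players/a/abdelju01/gamelog/2014'
soup = BeautifulSoup(urlopen(url), "html.parser")

for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    if '<table class' in comment:
        # re-parse the comment text so the hidden table becomes a real tag tree
        hidden_soup = BeautifulSoup(comment, "html.parser")
        print(hidden_soup.find('table').get('id'))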
I am trying to get the blog content from this blog post, and by content I just mean the first six paragraphs. This is what I've come up with so far:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(url).text, 'lxml')  # fetch the page first; url holds the post's address
body = soup.find('div', class_='post-body')
Printing body will also include other stuff under the main div tag.
Try this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
print(item.text.strip())
Use this:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
print(item.text)
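If you only want the first six paragraphs, here is a hedged sketch on top of that selector; it assumes the post body separates paragraphs with line breaks, which you may need to adjust:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')

body = soup.select_one("div[id^='post-body-']")
if body is not None:
    # get_text("\n") inserts newlines between blocks; keep the first six
    # non-empty blocks as a rough stand-in for "the first six paragraphs"
    blocks = [b.strip() for b in body.get_text("\n").split("\n") if b.strip()]
    for block in blocks[:6]:
        print(block)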
I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python
However, I haven't found any query string parameters to work with; maybe you can start something out of this approach. What seems most obvious to me right now is something like this (a rough sketch follows the list):
1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
2. Build the URLs using the titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
3. Scrape the text as described by Shahin in a previous answer
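Here is the rough sketch of steps 1 and 2; the h3.post-title selector is an assumption about the Blogger archive markup, not something verified against the site:

import requests
from bs4 import BeautifulSoup

# the archive URL pattern from step 1; the year/month values are examples
base = "http://www.fashionpulis.com/{year}/{month:02d}/"

for year in (2017,):
    for month in (3,):
        archive = requests.get(base.format(year=year, month=month)).text
        soup = BeautifulSoup(archive, "html.parser")
        # "h3.post-title a" assumes Blogger's usual archive markup;
        # adjust the selector to whatever wraps the titles on the real page
        for a in soup.select("h3.post-title a"):
            print(a["href"])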