FTR, I have successfully written quite a few scrapers in both frameworks, but I'm stumped. Here is a screenshot of the data I'm trying to scrape (you can also visit the actual link in the GET request):
I attempt to target the div.section_content:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html, "html.parser")
soup.find_all("div", {"class": "section_content"})
Printing the last line shows some other divs, but not the one with the pitching data.
However, I can see the data is in the text, so it's not a JavaScript-triggered loading problem (the phrase "Pitching" only comes up in that table):
>>> "Pitching" in soup.text
True
Here is an abbreviated version of one of the golang attempts:
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("www.baseball-reference.com"),
	)
	c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
		fmt.Println(e.ChildText("div.section_content"))
	})
	c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
}
It looks to me like the HTML is actually commented out, so that's why BeautifulSoup can't find it. Either remove the comment markers from the HTML string before you parse it or use BeautifulSoup to extract the comments and parse the return value.
For example:
from bs4 import BeautifulSoup, Comment

for element in soup(text=lambda text: isinstance(text, Comment)):
    comment = element.extract()
    comment_soup = BeautifulSoup(comment, "html.parser")
    # work with comment_soup
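Putting that together with the code from the question, a minimal sketch that digs the hidden tables out of the box-score page might look like this (the section_content selector comes from the question; everything else is plain bs4, and the [:80] slice is just to keep the demo output short):

import requests
from bs4 import BeautifulSoup, Comment

html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html, "html.parser")

# The stats tables are shipped inside HTML comments, so parse each comment
# as its own document and search that instead of the top-level soup.
for element in soup(text=lambda text: isinstance(text, Comment)):
    comment_soup = BeautifulSoup(element, "html.parser")
    for div in comment_soup.find_all("div", {"class": "section_content"}):
        print(div.get_text(strip=True)[:80])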
I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the Overwatch League website for stats etc., save them in a database, and then pull from that database with a Discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the HTML for the standings page.
As you can see, it's quite convoluted and hard to navigate, with repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title at the top of the table, move up to its parent, and then iterate through the siblings to pull out data such as the team name, position, etc. into a dictionary for now. I haven't been able to find anything online that helps; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import re
import pprint

url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')
# for stat in page.find(string=re.compile("rank")):
#     statObject = {
#         'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
#     }
# print(page.find_all('span', re.compile("rank")))
# for tag in page.find_all(re.compile("rank")):
#     print(tag.name)

print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The commented-out lines are all failed attempts that return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium, but I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even some advice on how else to approach this would be greatly appreciated.
Thank you
As you have noticed, your data is in JSON format, embedded in a script tag directly in the page, so it's easy to get at with BeautifulSoup. You then need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json

url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

script = soup.find("script", {"id": "__NEXT_DATA__"})
data = json.loads(script.text)

tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]

result = [
    {i["title"]: i["tables"][0]["teams"]}
    for i in tabs[0]
]

print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries: each key is the title of one of the 3 tabs and its value is that tab's table data.
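As a quick usage sketch, you could summarize what came back like this (a minimal example; the fields inside each team record depend on the JSON, so print one record to discover them):

# Each entry in `result` maps one tab title to its list of team records.
for tab in result:
    for title, teams in tab.items():
        print(title, "-", len(teams), "teams")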
I want to scrape the text from pages like this into a string: https://www.ncbi.nlm.nih.gov/protein/p22217, in particular the block of text under DBSOURCE.
I've seen multiple suggestions for using soup.find_all(text=True) and the like, but they come up with nothing. Anything from before about 2018 also seems to be outdated (I'm using Python 3.7). I think the problem is that the content I want isn't in r.text or r.content at all; when I search with Ctrl+F, the part I'm looking for just isn't there.
from bs4 import BeautifulSoup
import requests
url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
data = r.content
soup = BeautifulSoup(data, "html.parser")
PageInfo = soup.find("pre", attrs={"class":"genbank"})
print(PageInfo)
The result of this and other attempts is None. There's no error message; it just doesn't return anything.
You can use the approach below instead, as the page loads this content via XMLHttpRequest.
Code :
from bs4 import BeautifulSoup
import requests, re

url = "https://www.ncbi.nlm.nih.gov/protein/P22217"
r = requests.get(url)
soup = BeautifulSoup(r.content, features='html.parser')

# The numeric record id is stored in a <meta> tag in the page head.
pageId = soup.find('meta', attrs={'name': 'ncbi_uidlist'})['content']

# Fetch the actual record text through the sviewer API.
api = requests.get('https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id={}'.format(pageId))

# Capture everything between DBSOURCE and KEYWORD.
data = re.search(r'DBSOURCE([\w\s.:,;()_-]*)KEYWORD', api.text)
print(data.group(1).strip())
Explanation :
The first request to url retrieves the id of the record you're asking for, which lives in the meta tags of the page.
Given that id, the second request uses the website's API to fetch the description you want. A regex pattern is then used to separate the wanted part from the unwanted part.
Regex :
DBSOURCE([\w\s.:,;()_-]*)KEYWORD
The page makes an XHR call in order to get the information you are looking for.
The call is to https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=135747&db=protein&report=genpept&conwithfeat=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000
and it returns
<div class="sequence">
<a name="locus_P22217.3"></a><div class="localnav"><ul class="locals"><li>Comment</li><li>Features</li><li>Sequence</li></ul></div>
<pre class="genbank">LOCUS TRX1_YEAST 103 aa linear PLN 18-SEP-2019
DEFINITION RecName: Full=Thioredoxin-1; AltName: Full=Thioredoxin I;
Short=TR-I; AltName: Full=Thioredoxin-2.
ACCESSION P22217
VERSION P22217.3
**DBSOURCE** UniProtKB: locus TRX1_YEAST, accession P22217;
class: standard.
extra accessions:D6VY45
created: Aug 1, 1991.
...
So make that HTTP call from your code in order to get the data.
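A minimal sketch of that (I've kept only a subset of the query parameters from the URL above; if the endpoint complains, use the full URL verbatim; the pre.genbank selector comes from the returned fragment shown):

import requests
from bs4 import BeautifulSoup

# Same XHR endpoint the page calls; id 135747 corresponds to P22217.
url = ("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
       "?id=135747&db=protein&report=genpept&retmode=html&withmarkup=on")
r = requests.get(url)

# The response is an HTML fragment containing <pre class="genbank">.
soup = BeautifulSoup(r.text, "html.parser")
pre = soup.find("pre", attrs={"class": "genbank"})
if pre is not None:
    print(pre.get_text())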
I found code similar to this in a course I was taking. It gets all of the links of a certain format that are mentioned in the source code of the webpage. I understand everything except for the last line, which says the following:
print link.attrs.get('href', '')
This works, but I'm unsure how the instructor figured out how to do it. I've looked through the documentation and I can't figure out what .get does. Could someone please let me know how I can find this information?
Documentation for Pattern Library: http://www.clips.ua.ac.be/pages/pattern-web
import requests
from pattern import web  # Pattern is a Python 2 library, hence the print statement below

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
pattern = 'http://www.realclearpolitics.com/epolls/????/governor/??/*-*.html'
dom = web.Element(xml)
all_links = dom.by_tag('a')

for link in all_links:
    print link.attrs.get('href', '')
It gets the href attribute (the hyperlink target) of every <a> tag on that page. You can also use the BeautifulSoup package, which is more convenient:
import requests
from bs4 import BeautifulSoup

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
soup = BeautifulSoup(xml, "lxml")  # lxml is just the parser for reading the html

# Find every <a> tag that actually has an href attribute.
for a in soup.find_all('a', href=True):
    print(a['href'])
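As for the actual question about .get: link.attrs is a plain Python dictionary, so .get('href', '') is just dict.get, which returns the value stored under 'href', or the default '' if that key is missing. A tiny sketch:

attrs = {'href': 'http://example.com', 'class': 'link'}

print(attrs.get('href', ''))   # 'http://example.com' -- key exists
print(attrs.get('title', ''))  # ''                   -- key missing, default returned
# attrs['title'] would raise a KeyError; .get avoids that.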
I am trying to extract the ranking text number from this link, for example: the kaggle user ranked no. 1. It's clearer in an image:
I am using the following code:
import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText, "html.parser")
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
The result is None. The problem is that soup.findAll('h4',{'data-bind':"text: rankingText"}) outputs:
[<h4 data-bind="text: rankingText"></h4>]
but in the html of the link when inspecting this is like:
<h4 data-bind="text: rankingText">1st</h4>. It can be seen in the image:
It's clear that the text is missing. How can I get past that?
Edit:
Printing the soup variable in the terminal, I can see that this value exists:
So there should be a way to access it through soup.
Edit 2: I tried, unsuccessfully, to use the most-voted answer from this Stack Overflow question. A solution could be somewhere around there.
If you aren't going to try browser automation through Selenium, as @Ali suggested, you would have to parse the JavaScript that contains the desired information. You can do this in different ways. Here is working code that locates the script by a regular-expression pattern, extracts the profile object, loads it with json into a Python dictionary, and prints out the desired ranking:
import re
import json

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

# Locate the <script> whose text matches "profile: {...}," and capture the object.
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

profile_text = pattern.search(script.text).group(1)
profile = json.loads(profile_text)

print(profile["ranking"], profile["rankingText"])
Prints:
1 1st
The data is bound to the element by JavaScript, as the "data-bind" attribute suggests.
However, if you download the page with e.g. wget, you'll see that the rankingText value is actually there inside this script element on the initial load:
<script type="text/javascript">
profile: {
...
"ranking": 96,
"rankingText": "96th",
"highestRanking": 3,
"highestRankingText": "3rd",
...
So you could use that instead.
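For instance, a minimal sketch that pulls rankingText straight out of the downloaded page with a regex (the key name comes from the excerpt above; no JSON parsing is needed if this one value is all you want):

import re
import requests

html = requests.get("https://www.kaggle.com/titericz").text

# Match e.g.  "rankingText": "96th"  and capture the quoted value.
match = re.search(r'"rankingText":\s*"([^"]+)"', html)
if match:
    print(match.group(1))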
I have solved your problem using a regex on the plain text:

import re
import requests

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # Capture the numeric ranking straight out of the embedded javascript.
    pattern = re.compile(r"ranking\": [0-9]+")
    name = pattern.search(plainText)
    ranking = name.group().split()[1]
    print(ranking)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
This returns only the rank number, but I think it will help you, since from what I can see the rankingText just adds 'st', 'th', etc. to the right of the number.
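If you do want the text form back, a small sketch of reattaching the English ordinal suffix in plain Python (note the 11th/12th/13th special cases):

def ordinal(n: int) -> str:
    # 11, 12 and 13 take "th" despite ending in 1, 2 and 3.
    if 10 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

print(ordinal(1), ordinal(22), ordinal(113))  # 1st 22nd 113th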
This could be because of dynamic data filling.
Some JavaScript code fills this tag after the page loads, so if you fetch the HTML using requests, it is not filled yet:
<h4 data-bind="text: rankingText"></h4>
Please take a look at Selenium WebDriver. Using that driver you can fetch the complete page, with the JS run as normal.
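A minimal Selenium sketch of that idea (it assumes a Chrome driver is installed and on your PATH; the CSS selector targets the h4 from the question, and you may need an explicit wait if the JS is slow to run):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.kaggle.com/titericz")
    # After the page's JS has run, the h4 is populated with the ranking text.
    h4 = driver.find_element(By.CSS_SELECTOR, 'h4[data-bind="text: rankingText"]')
    print(h4.text)
finally:
    driver.quit()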
This is an easy one, I am sure. I am parsing a website and trying to get the specific text in between tags. The text will be one of [revoked, Active, Default]. I am using Python. I have been able to print out all of the inner-text results, but I have not been able to find a good solution on the web for matching specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)

for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'Active', 'Default'])
This will select only those <a ...> tags whose text exactly matches one of the given strings; note the match is case-sensitive, hence the capitalization above.
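A self-contained sketch of that (the sample markup is made up for illustration; in recent bs4 versions the text= argument is also available under the name string=):

from bs4 import BeautifulSoup

html = """
<a href="/1">Active</a>
<a href="/2">Default</a>
<a href="/3">something else</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <a> tags whose full text is one of the three statuses match.
for tag in soup.find_all('a', text=['revoked', 'Active', 'Default']):
    print(tag.text)  # prints: Active, then Default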
I've used the snippet below on a similar occasion. See if it works for your goal:
table = soup.find(id="Table3")

for i in table.stripped_strings:
    print(i)