BS4 get info from class with weird name - python

Got this weird html from the Steam Community market search:
<span class=\"normal_price\">$2.69 USD<\/span>
How to extract data with bs4? This is not working:
soup.find("span", attrs={"class": "\"normal_price\""})

You have HTML embedded in a JSON string, which must escape the quotes. Rather than manually extract that data, parse the JSON first:
import json
data = json.loads(json_data)
html = data['results_html']
If you are using the requests library, the response can be decoded for you:
response = requests.get('http://steamcommunity.com/market/search/render/?query=appid:730&start=0&count=3&currency=3&l=english&cc=pt')
html = response.json()['results_html']
after which you can parse this with BeautifulSoup just fine:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> html = requests.get('http://steamcommunity.com/market/search/render/?query=appid:730&start=0&count=3&currency=3&l=english&cc=pt').json()['results_html']
>>> BeautifulSoup(html, 'lxml').find('span', class_='normal_price').span
<span class="normal_price">$2.69 USD</span>
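The effect of decoding the JSON first can be checked offline. This is a minimal sketch using only the standard library; the `raw` payload is a made-up stand-in shaped like the Steam response (only the `results_html` key comes from the answer above).

```python
import json

# Hypothetical payload: HTML embedded in a JSON string, so the quotes
# and slashes inside it are backslash-escaped.
raw = r'{"success": true, "results_html": "<span class=\"normal_price\">$2.69 USD<\/span>"}'

data = json.loads(raw)        # JSON decoding removes the escaping
html = data["results_html"]   # plain HTML, ready for BeautifulSoup
print(html)
```

After decoding, the class is simply `normal_price`, so the original `soup.find("span", attrs={"class": ...})` call no longer needs escaped quotes in the class name.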

Related

Why am I unable to web-scrape URL from a hyperlink in this website?

I tried to extract URL from a hyperlink in this web: https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/
I used the following Python code:
import requests
from bs4 import BeautifulSoup
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url, headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
The problem is this code does not return any URL.
I want to get all of the URLs listed on that page.
You are unable to parse it because the data is loaded dynamically. The HTML that ends up on the page does not exist in the source code you download. JavaScript later parses the window.__SITE variable and builds the content from it.
However, we can replicate this in Python. After downloading the page:
import requests
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
You can use re (regex) to extract the encoded page source:
import re
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)
Afterwards, you can use urllib to URL-decode the text, and json to parse the JSON string data:
from urllib.parse import unquote
from json import loads
json_data = loads(unquote(encoded_data))
You can then parse the JSON tree to get to the HTML source data:
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
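The long chain of keys and indices above breaks as soon as the site layout changes. One possible alternative, sketched here as an assumption rather than as part of the original answer, is to walk the decoded tree and collect every value stored under a given key. The helper name `find_values` and the sample `tree` are hypothetical.

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the key may also appear deeper in the tree.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Hypothetical tree imitating the nesting used in the answer.
tree = {
    "layout": {
        "sections": [
            {"rows": []},
            {"rows": [{"cards": [{"component": {"settings": {
                "markdown": "<a href='https://example.com/file.xlsx'>file</a>"
            }}}]}]},
        ]
    }
}

found = list(find_values(tree, "markdown"))
print(found)
```

With the real page, `list(find_values(json_data, "markdown"))` would return every markdown block, and each one can then be passed to BeautifulSoup.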
At that point, you can use your own code to parse the HTML:
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
If you put it all together, here's the final script:
import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup
# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
# Get encoded JSON from HTML source
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)
# Decode and load as dictionary
json_data = loads(unquote(encoded_data))
# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
# Get links
links = soup.find_all('a')
# For each link...
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href'])+"\n")
The links are generated dynamically by JavaScript code, and the data can be found in the structure below.
<script id="site-injection">
window.__SITE="your data is here"
</script>
So you need to grab this script element and parse the value of window.__SITE.
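The extract-and-decode step can be tried without hitting the site. This is a minimal sketch assuming the value of window.__SITE is URL-encoded JSON, as described above; the helper name `extract_site_data` and the sample page are hypothetical.

```python
import json
import re
from urllib.parse import quote, unquote


def extract_site_data(page_source):
    """Return the decoded window.__SITE object, or None if it is absent."""
    match = re.search(r'window\.__SITE="(.*?)"', page_source)
    if match is None:
        return None
    # The captured value is URL-encoded JSON: decode, then parse.
    return json.loads(unquote(match.group(1)))


# Hypothetical page built the same way: JSON, URL-encoded, inside the script tag.
payload = {"site": {"markdown": '<a href="https://example.com/a.xlsx">A</a>'}}
sample_page = (
    '<script id="site-injection">window.__SITE="%s"</script>'
    % quote(json.dumps(payload))
)

site = extract_site_data(sample_page)
print(site["site"]["markdown"])
```

Returning None when the variable is missing avoids the AttributeError that `re.search(...).group(1)` raises if the page layout changes.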

BeautifulSoup shows strange text

I am trying to scrape data from a Bengali (language) website.
When I inspect an element on that website, everything looks as it should.
Code:
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content, "lxml")
print(soup.prettify())
Part of the output:
<strong>
সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾
</strong>
সà¦à¦°à¦¾à¦à¦° à¦à¦¿à¦à§à¦à¦¾à¦¸à¦¾ >> should be >>"সচরাচর জিজ্ঞাসা"
I am not sure whether it is ASCII or not. I used https://onlineasciitools.com/convert-ascii-to-unicode to convert that text into Unicode. According to that website, it may be ASCII, but I checked an ASCII table online and none of those characters were in it. So now I need to convert that text into something readable. Any help?
You should just decode the content, like this:
request.content.decode('utf-8')
Yes, that works. You need to call decode('utf-8') on the response content:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.content.decode('utf-8'), "lxml")
my_data = soup.find('div', {'class':'col-md-6 col-sm-6 col-xs-12 slider-button-center xs-mb-15'})
print(my_data.get_text(strip=True, separator='|'))
Output:
স্বাস্থ্য বিষয়ক সেবা|(ডাক্তার, হাসপাতাল, ঔষধ, টেস্ট)|খাদ্য ও জরুরি সেবা|(খাদ্য, অ্যাম্বুলেন্স, ফায়ার সার্ভিস)|সচরাচর জিজ্ঞাসা|FAQ
The response object returned by requests.get() gives you both the raw bytes (request.content) and the text decoded with the encoding that requests inferred for the response, normally from the HTTP headers (request.text).
request.encoding is the encoding that was used (which may not be UTF-8), and request.text is the already-decoded content.
Example using request.text instead:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://corona.gov.bd/")
soup = BeautifulSoup(request.text, "lxml")
print(soup.find('title'))
<title>করোনা ভাইরাস ইনফো ২০১৯ | Coronavirus Disease 2019 (COVID-19) Information Bangladesh | corona.gov.bd</title>
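The garbled output is consistent with UTF-8 bytes being decoded with a single-byte encoding. The sketch below reproduces that effect offline, assuming Latin-1 as the wrong encoding (the question does not say which one was actually applied).

```python
# The Bengali phrase from the question, used as sample data.
expected = "সচরাচর জিজ্ঞাসা"

raw_bytes = expected.encode("utf-8")      # what the server sends
garbled = raw_bytes.decode("latin-1")     # wrong codec: mojibake
decoded = raw_bytes.decode("utf-8")       # right codec: readable text

# Mojibake produced this way is reversible as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")

print(decoded == expected, garbled == expected, repaired == expected)
```

With requests, the same idea applies to `response.content.decode('utf-8')`. Setting `response.encoding = response.apparent_encoding` before reading `response.text` is another option when the declared encoding is wrong.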

Parsing HTML contained in API call response

I'm having some trouble figuring out how to parse HTML that's contained within the response of an API call in Python 3.7 (requests + BS4).
Say I want to parse out the article URLs from a response like this one.
I'm able to get the "rendering" entry of the response which seemingly contains the HTML I'd like to parse, however, when I pass the text along to Beautiful Soup's HTML parser, it does not seem to work as expected (unable to locate HTML tags of any kind):
import requests
from bs4 import BeautifulSoup
url = """https://www.washingtonpost.com/pb/api/v2/render/feature/?service=prism-query&contentConfig={%22url%22:%22prism://prism.query/ap-articles-by-site-id,/world%22,%22offset%22:0,%22limit%22:5}&customFields={%22isLoadMore%22:false,%22offset%22:0,%22maxToShow%22:50,%22dedup%22:true}&id=f00boImX29Vv3s&rid=&uri=/world/"""
r = requests.get(url).json()
soup = BeautifulSoup(r['rendering'], 'html.parser')
links_html = soup.find_all("div", attrs={"class":"headline x-small normal-style text-align-inherit "})
links = []
for div in links_html:
links.append(div.find('a', href = True)['href'])
Am I wrong in my assumption that the "rendering" entry in the response is raw HTML?
You want to use the json library (or, in hindsight, Response.json()), because the URL you are requesting is not a web page but an API endpoint that returns the HTML wrapped in JSON, along with the encoding, content type, and some other fields you don't need.
Here's how I did it.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("https://www.washingtonpost.com/pb/api/v2/render/feature/?service=prism-query&contentConfig=%7B%22url%22:%22prism://prism.query/ap-articles-by-site-id,/world%22,%22offset%22:0,%22limit%22:5%7D&customFields=%7B%22isLoadMore%22:false,%22offset%22:0,%22maxToShow%22:50,%22dedup%22:true%7D&id=f00boImX29Vv3s&rid=&uri=/world/")
>>> bs = BeautifulSoup(r.content, 'html.parser')
>>> first_div = bs.find("div", class_="moat-trackable")
>>> first_div
>>> import json
>>> html_dict = json.loads(r.content)
>>> html_dict
{'rendering': '<div class="moat-trackable ...'}
>>> html_dict.keys()
dict_keys(['rendering', 'encoding', 'contentType', 'pageResources', 'externalResources', 'httpHeaders'])
>>> bs = BeautifulSoup(html_dict["rendering"], 'html.parser')
>>> first_div = bs.find("div", class_="moat-trackable")
>>> first_div
<div class="moat-trackable
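The difference between the raw body and the decoded "rendering" entry can be seen offline. This is a sketch under assumptions: the helper `html_from_api_body` and the sample body are hypothetical, and only the key names come from the output above.

```python
import json


def html_from_api_body(body, key="rendering"):
    """Return the HTML stored under `key` if `body` is a JSON object,
    otherwise return the body text unchanged."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return text  # not JSON: assume the body already is HTML
    if isinstance(payload, dict) and key in payload:
        return payload[key]
    return text


# Hypothetical response body imitating the structure shown above.
sample_body = json.dumps({
    "rendering": '<div class="moat-trackable"><a href="/world/story-1/">Story</a></div>',
    "encoding": "UTF-8",
    "contentType": "text/html",
}).encode("utf-8")

html = html_from_api_body(sample_body)
print(html)
```

In the raw body the attribute quotes are escaped as `\"`, so a parser fed the raw body never sees a class named `moat-trackable`. That is why the first `bs.find(...)` call above returns nothing.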

Can't get item with python beautifulsoup

I'm trying to learn how to webscrape with beautifulsoup + python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/ but I can't figure out how to isolate the text. The html for what I want is written as below, what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
Within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS attribute substring selector with select_one() (find() does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
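Another option, closer to what the question attempted, is to locate the heading by its text and then move to the block that follows it. This is a minimal sketch run against the HTML fragment quoted in the question, not the live page, so the real markup may need a different selector.

```python
from bs4 import BeautifulSoup

# The fragment quoted in the question.
html = """
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <span> whose text is exactly "Cinematography" ...
heading = soup.find("span", string="Cinematography")
# ... then take the next text-sluglist block after it in the document.
block = heading.find_next("div", class_="text-sluglist")
name = block.get_text(strip=True)
print(name)
```

Anchoring on the heading text means the lookup does not depend on how the link URLs happen to be formed.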

python beautifulsoup get html tag content

How can I get the content of an html tag with beautifulsoup? for example the content of <title> tag?
I tried:
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
soup = BeautifulSoup(url)
result = soup.findAll('title')
for each in result:
    print(each.get_text())
But nothing happened. I'm using python3.
You need to fetch the website data first. You can do this with the urllib.request module. Note that HTML documents only have one title so there is no need to use find_all() and a loop.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url ='http://www.websiteaddress.com'
data = urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
result = soup.find('title')
print(result.get_text())
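The reason the original attempt printed nothing can be shown without a network call. In this sketch the `page_source` string is a made-up stand-in for the downloaded document.

```python
from bs4 import BeautifulSoup

url = "http://www.websiteaddress.com"

# Passing the URL parses the URL text itself, not the page behind it,
# so the resulting "document" has no <title> element.
# (Beautiful Soup may also warn that the markup looks like a URL.)
soup_from_url = BeautifulSoup(url, "html.parser")
print(soup_from_url.title)

# Hypothetical page source standing in for the fetched HTML.
page_source = "<html><head><title>Example page</title></head><body></body></html>"
soup = BeautifulSoup(page_source, "html.parser")
print(soup.title.get_text())
```

`soup.title` is a shortcut for `soup.find('title')`, which is enough here because a document has a single title.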
